FR: Configurable MTU for TCP MSS clamping when using k8s-operator #11002
Any thoughts on why clamp-mss-to-pmtu might be clamping to a value that's too high? Any additional details you could share about the environment where this happens that would make it easier to reproduce? |
The setup here is a bare-metal Kubernetes cluster running Cilium, with 10Gbit NICs where the MTU is normally 9000 for jumbo frames, so within containers there's a network interface with an MTU of 9000 which clamp-mss-to-pmtu is presumably clamping to. |
I don't really write Go, but I've forked Tailscale and jankily added what I needed for iptables at clrxbl@2441f17. Clamping TCP MSS to 1280 solves my issues and makes the Tailscale operator usable again. Without this change, connections either hang while connecting (e.g. HTTPS connections) or are unusably slow. It'd be great not to rely on me maintaining a fork, though. I'd be interested in cleaning it up and implementing nftables support if Tailscale was interested in a PR for this. |
Thanks for confirming that setting the MSS value manually addressed your issue. I think we'd want to avoid exposing this as a user-facing configuration parameter, instead improving the existing behaviour or determining the MSS value automatically. Do you think you could please share the output of the following commands from the pod running Tailscale that required this change?
Thank you! |
Sure thing.
|
I should clarify that clamping the tailscale0 interface MSS to a smaller MTU value doesn't resolve the issue. Instead I have to apply this rule to the container's main network interface, e.g. iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o eth0 -j TCPMSS --set-mss 1240 Without this rule, traffic such as HTTPS connections never completes.
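For reference, a rough sketch of the two clamping variants as iptables rules - the fixed 1240 value and the interface names simply mirror this setup, and the first rule is the generic clamp-to-PMTU form rather than necessarily the exact rule Tailscale installs:
# clamp to the discovered path MTU (the behaviour clamp-mss-to-pmtu refers to)
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o tailscale0 -j TCPMSS --clamp-mss-to-pmtu
# clamp to a fixed value on the container's main interface (the workaround above)
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o eth0 -j TCPMSS --set-mss 1240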
|
I don't have access to a bare metal cluster with jumbo frames, but I could not reproduce this on a GKE cluster with a higher MTU. E.g.:
A TCP connection from the cluster into the tailnet gets its MSS correctly adjusted (from 8856 to 1240).
Could you please share a bit more detail about which feature of the operator you are using (egress, ingress via a Service, or ingress via an Ingress resource) and about the HTTPS client and server involved? Capturing MSS values of a connection (like I shared above) would be helpful too.
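For example, one way to capture the MSS advertised in SYN packets (assuming the interface of interest is eth0):
# -vv prints TCP options, including mss, for each matched SYN
tcpdump -ni eth0 -vv 'tcp[tcpflags] & (tcp-syn) != 0'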
This is a bit odd - why is it necessary to clamp MSS to 1240 while egressing to eth0, which has an MTU of 9000? |
Currently I only use ingress via a Service.
The server in this case is ingress-nginx with a Let's Encrypt TLS certificate handled by cert-manager, and the client is just curl. If I curl something on the host, it seems to connect just fine. If I repeat the same within e.g. a Debian Docker container, the connection breaks. Docker by default creates a network with a 1500 MTU. Manually changing this to 1240 makes it work; so either I change it on every host to 1240 or I change it in the Tailscale operator to change the MTU for egress. The same thing happens when trying to curl a Tailscale service within a Debian container on Kubernetes running Cilium as CNI, with Tailscale running on the node itself so all pods have access to Tailscale networking. |
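For reference, a sketch of how the Docker network MTU can be lowered, either globally for the default bridge or per user-defined network - the 1240 value and the network name are just examples mirroring this setup:
# /etc/docker/daemon.json - applies to the default bridge network
{
  "mtu": 1240
}
# or per user-defined network:
docker network create --opt com.docker.network.driver.mtu=1240 low-mtu-net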
I could not reproduce this on a GKE cluster using Dataplane V2, which I believe is based on Cilium. Are you able to share additional details about the cluster you are running this in that might help us reproduce the issue? At a minimum I think it would help to see your Cilium configuration and the version of Cilium you are using. |
Sorry for not converting it to yaml but you'll get the gist. In my case the config looked like this: {
helm.releases.cilium = {
namespace = "kube-system";
chart = lib.helm.downloadHelmChart {
repo = "https://helm.cilium.io/";
chart = "cilium";
version = "1.12.13";
chartHash = "sha256-VhgURNprXIZuuR0inWeUlRJPBGwbnXC8zzSxKNL1zzE=";
};
values = {
ipam = { mode = "cluster-pool"; };
k8s = { requireIPv4PodCIDR = true; };
hubble = {
tls.enabled = false;
relay = { enabled = true; };
ui = { enabled = true; };
};
operator = { replicas = 1; };
egressGateway = { enabled = true; };
bpf = {
masquerade = true;
lbExternalClusterIP = true;
};
kubeProxyReplacement = "strict";
# Works around https://github.com/tailscale/tailscale/issues/11002
extraConfig = { mtu = "1200"; };
socketLB.hostNamespaceOnly = true;
rollOutCiliumPods = true;
};
};
}
The Tailscale operator is deployed without any extra flags. The services leverage the annotation:
|
#11002 (comment) describes my setup
devices: ["bond0.+", "tailscale0"]
rollOutCiliumPods: true
bandwidthManager:
enabled: true
hostFirewall:
enabled: true
nodePort:
enabled: true
nodeinit:
enabled: true
restartPods: true
socketLB:
enabled: true
hostNamespaceOnly: true
kubeProxyReplacement: true
k8sServiceHost: cluster-endpoint
k8sServicePort: 6443
hubble:
tls:
auto:
method: cronJob
enabled: true
relay:
enabled: true
ui:
enabled: true
metrics:
serviceMonitor:
enabled: true
enabled:
- dns:query;ignoreAAAA
- drop
- tcp
- icmp
- port-distribution
hostPort:
enabled: true
prometheus:
enabled: true
serviceMonitor:
enabled: true
operator:
prometheus:
enabled: true
serviceMonitor:
enabled: true
ipv6:
enabled: false
bpf:
hostRoutingLegacy: false
masquerade: true
l7Proxy: false |
MSS clamping for nftables was mostly not run due to an earlier rule in the FORWARD chain issuing an accept verdict. This commit places the clamping rule into a chain of its own to ensure that it gets run for all SYN packets. Updates #11002 Signed-off-by: Irbe Krumina <irbe@tailscale.com>
I was trying to reproduce the issue and was able to partially reproduce it if at least one of the two connecting Tailscale nodes configures its firewall using nftables and is also the one sending larger packets. It appears that our MSS clamping rule in nftables mode is not actually getting executed, because it gets placed in a chain where an earlier rule jumps to another chain which returns with an accept verdict. Some debugging details from a cross-cluster setup:
- `tcpdump` on the workload Pod exposed via the tailscale ingress proxy shows that the caller sends MSS 8856, which is wrong:
// 10.233.65.114 - ingress proxy's Pod IP
// 10.233.64.192 - workload Pod's IP
# tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
IP (tos 0x0, ttl 60, id 4347, offset 0, flags [DF], proto TCP (6), length 60)
10.233.65.114.35290 > 10.233.64.192.8080: Flags [S], cksum 0x61d7 (correct), seq 2383755368, win 35424, options [mss 8856,sackOK,TS val 243476597 ecr 0,nop,wscale 7], length 0
- `nft monitor` on the ingress proxy Pod shows that the clamping rule is never executed (I assume because we jump to ts-forward chain in an earlier rule and the packet gets accepted there)
trace id 2a043d97 ip filter FORWARD packet: iif "eth0" oif "tailscale0" ether saddr 0a:6a:c2:45:5c:c9 ether daddr 96:5b:c2:7b:e4:dd ip saddr 10.233.31.12 ip daddr 100.75.96.30 ip dscp cs0 ip ecn not-ect ip ttl 62 ip id 0 ip length 60 tcp sport 8080 tcp dport 58346 tcp flags == 0x12 tcp window 61558
trace id 2a043d97 ip filter FORWARD rule counter packets 100 bytes 6178 meta nftrace set 1 jump ts-forward (verdict jump ts-forward)
trace id 2a043d97 ip filter ts-forward rule oifname "tailscale0*" counter packets 44 bytes 2426 meta nftrace set 1 accept (verdict accept)
- the workload is not able to access the exposed service via egress
/ # curl -vvv kuard-egress:8080
* Host kuard-egress:8080 was resolved.
* IPv6: (none)
* IPv4: 10.96.0.76
* Trying 10.96.0.76:8080...
* Connected to kuard-egress (10.96.0.76) port 8080
> GET / HTTP/1.1
> Host: kuard-egress:8080
> User-Agent: curl/8.5.0
> Accept: */*
- running `tcpdump` on the workload shows that the tailnet service sends back a SYN/ACK packet suggesting that its MSS is 8806 - this is already wrong.
/ # tcpdump -t -n -vvv src host 10.96.0.76 // 10.96.0.76 is the IP of the tailscale egress proxy Pod
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
IP (tos 0x0, ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 60)
10.96.0.76.8080 > 10.96.0.18.51404: Flags [S.], cksum 0x7910 (correct), seq 4141237154, ack 3535746567, win 61558, options [mss 8806,sackOK,TS val 1121584187 ecr 242035578,nop,wscale 7], length 0
- running `nft monitor` on the egress proxy shows that we do not execute the clamping rule on the egress proxy side because the packet gets accepted by another rule in the `ts-forward` chain, which we jump to from `FORWARD` (see the sketch after this listing):
/ # nft list table ip filter
table ip filter {
chain FORWARD {
type filter hook forward priority filter; policy accept;
counter packets 51 bytes 3165 jump ts-forward
oifname "tailscale0*" tcp flags syn tcp option maxseg size set rt mtu
}
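For anyone retracing the debugging above: the traces come from nftrace, and the fix described in the linked commit moves the clamping rule into a chain of its own so that the accept in ts-forward can no longer short-circuit it. One way to enable the same tracing, plus a rough sketch of the fixed layout, follows - the ts-clamp chain name and exact placement are illustrative assumptions, not necessarily what the commit uses:
# enable tracing for forwarded packets and watch the verdicts
nft insert rule ip filter FORWARD meta nftrace set 1
nft monitor trace
# sketch of the fixed layout: clamping happens before the jump to ts-forward,
# so an accept verdict there cannot skip it
table ip filter {
    chain FORWARD {
        type filter hook forward priority filter; policy accept;
        counter jump ts-clamp      # hypothetical chain name
        counter jump ts-forward
    }
    chain ts-clamp {
        oifname "tailscale0*" tcp flags syn tcp option maxseg size set rt mtu
    }
}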
@wokalski, @clrxbl is it possible that either of you is running the Tailscale client with the firewall configured in nftables mode on either end of the failing connection? |
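A quick way to check which backend holds the Tailscale rules on a node (assuming the ts-forward chain name seen in the listings above is used by both backends):
# rules installed via nftables
nft list ruleset | grep -B2 -A4 ts-forward
# rules installed via legacy iptables
iptables-save | grep ts-forward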
I believe I encountered this issue client-side on both my macOS laptop & Debian 12 server with nftables. I could try again with your branch if there's an operator image available for it. |
MSS clamping for nftables was mostly not run due to an earlier rule in the FORWARD chain issuing an accept verdict. This commit places the clamping rule into a chain of its own to ensure that it gets run for all SYN packets. Updates #11002 Signed-off-by: Irbe Krumina <irbe@tailscale.com>
MSS clamping for nftables was mostly not run due to an earlier rule in the FORWARD chain issuing an accept verdict. This commit places the clamping rule into a chain of its own to ensure that it gets run. Updates #11002 Signed-off-by: Irbe Krumina <irbe@tailscale.com>
Replaced all instances of my Tailscale Operator fork with your image & Helm chart. It does not seem to solve my issue where I cannot connect to Tailscale services on another host from within a Kubernetes (+ Cilium) pod. It gets stuck after trying to send the TLS handshake when curl'ing something. Reverting back to my fork with customizable TCP MSS clamping fixes everything again. |
Is the hypothesis that the host firewall interferes with Tailscale? If Tailscale runs inside of a pod, why would the host firewall have any impact here? |
Appreciate you testing out the patch and thanks for confirming. If you get a chance to run some
No, this is purely about nftables rules set by Tailscale/the proxies (so, if running in a Pod, then in the Pod's network namespace). |
MSS seems to be the same unless I'm reading it wrong.
|
Thank you very much for that. |
Yeah, this is a curl GET request to a Tailscale pod exposing an ingress-nginx server; curl gets stuck after the TLS handshake attempt, see #11002 (comment) Using an ephemeral container I was able to tcpdump on the ingress-nginx pod itself. Interestingly enough, there is no traffic from/to the Tailscale proxy pod as soon as I switch to the unstable operator version and run the curl command. I have also run tcpdump on the Tailscale proxy itself.
|
This is unexpected. If I understand correctly, So, it looks like what might be happening is:
We should figure out how the MSS on the packet received from the client ended up being 8910. Would you be able to run something like Also, I noticed this in your setup:
Why do you need to add |
We certainly want to figure out what is happening here! The issue you originally had is not necessarily the same as the one in the original issue description. Have you tried the nftables clamping fix #11002 (comment) ? |
I add tailscale0 to Cilium devices on our bare-metal development/staging clusters that run Tailscale on nodes. It makes it easier to access other Tailscale services within containers without having to create Tailscale egress proxies. #11002 (comment) I've run the same curl request from within a debian:bookworm container on a bare-metal server that runs Tailscale & Docker instead of Kubernetes + Cilium this time. The SSL handshake gets stuck again as expected. Here is said tcpdump taken from the Docker host.
|
Thank you for clarifying 👍🏼
So |
Here's a tcpdump while running the same curl request (that does function) on the host instead of within a Docker container.
Also providing the host's ifconfig results since those could be useful here.
ifconfig from within a container:
|
To clarify - does that work only with your fork, or also with the latest Tailscale? I do see that the MSS got set correctly there. |
On the host it'll function with the latest Tailscale (your unstable image). |
Thanks for clarifying that! So for now my understanding is that the issue in your case might be that when your client in a container calls the exposed cluster workload, the packets somehow first get sent via an interface with an MTU that is higher than that of I would then assume that if, instead of setting MSS on the proxy Pod's I also noticed that there is a |
I've tried that flag on Cilium before but it didn't seem to change anything.
Haven't thought of this before actually but you do seem to be right. |
Thanks for confirming. To be honest I don't really know what that flag does. Maybe improving path MTU discovery is something to be raised with the Cilium folks.
Ok, so then this is an issue that could ultimately be solved by us doing clamping in the core Linux client. I would like to understand the Docker setup a bit better and try to reproduce this as a use case for clamping in core. If you get a chance to add some notes about why you need that client-side setup, that would also help us. @wokalski do you also have a similar setup client-side? I guess your issue might be different. |
My fleet consists mostly of bare-metal servers that already run Tailscale on the host itself. While I could probably work around this by running a secondary Tailscale instance within a container with e.g. userspace networking and exposing it to other containers using HTTP_PROXY variables, this is less ideal than being able to piggyback on the host's Tailscale instance. We have Docker workloads connecting to Tailscale services hosted within Kubernetes + Cilium clusters, exposed using the Tailscale k8s-operator. |
@irbekrm client side I have two types of clients:
|
Thanks @wokalski! Sounds like it might be either #11002 (comment) or another issue altogether. Keen to hear if the latest images with the nftables fix work for you; if not, we'll have to debug this separately, I think. |
I will test the new image this weekend. |
It seems that this issue no longer presents on |
Thank you for confirming this! |
What are you trying to do?
I'm trying to migrate from my own DIY Tailscale operator to the official one, but running into an issue with network connectivity.
I need to clamp MSS to a specific MTU size (e.g. 1280); otherwise Tailscale currently seems to clamp to MTU 9000, which breaks HTTPS connections.
How should we solve this?
The ability to define a configurable MTU size in the Helm chart's proxyConfig, disabling clamping to PMTU.
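As a purely hypothetical illustration of the requested knob (this key does not exist in the chart today):
proxyConfig:
  # hypothetical value - clamp TCP MSS to a fixed size instead of PMTU
  clampMSS: 1280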
What is the impact of not solving this?
I can work around this by forcing our Kubernetes CNI to use an MTU size that fits within Tailscale's 1280, disabling jumbo frames temporarily, but this is not ideal.
Anything else?
No response