FR: Configurable MTU for TCP MSS clamping when using k8s-operator #11002

Open

clrxbl opened this issue Feb 1, 2024 · 37 comments

Labels: fr (Feature request), kubernetes, L2 (Few likelihood), P1 (Nuisance priority), T0 (New feature issue type)

clrxbl commented Feb 1, 2024

What are you trying to do?

I'm trying to migrate from my own DIY Tailscale operator to the official one, but running into an issue with network connectivity.

I need to clamp the TCP MSS to a specific MTU size (e.g. 1280); otherwise Tailscale currently seems to clamp to the interface MTU of 9000, which appears to break HTTPS connections.

How should we solve this?

The ability to define a configurable MTU size in the Helm chart's proxyConfig, disabling clamping to the path MTU (PMTU).
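
For illustration only, the requested knob might look something like this in the chart's values file (this field does not exist today; the name and shape are hypothetical, it is simply what this issue is asking for):

    # values.yaml (hypothetical)
    proxyConfig:
      # Clamp TCP MSS of forwarded connections to fit this MTU,
      # instead of relying on clamp-mss-to-pmtu.
      mtu: 1280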

What is the impact of not solving this?

I can work around this by forcing our Kubernetes CNI to use an MTU that fits within Tailscale's 1280, temporarily disabling jumbo frames, but this is not ideal.

Anything else?

No response

clrxbl added the fr (Feature request) and needs-triage labels Feb 1, 2024
clrxbl changed the title from "FR: Clamp TCP MSS to specific MTU when using k8s-operator" to "FR: Configurable MTU for TCP MSS clamping when using k8s-operator" Feb 1, 2024
knyar added the kubernetes, L2 (Few likelihood), P1 (Nuisance priority), and T0 (New feature issue type) labels and removed the needs-triage label Feb 1, 2024
knyar (Contributor) commented Feb 1, 2024

Any thoughts why clamp-mss-to-pmtu might be clamping to a value that's too high? Any additional details you could share about your environment where this happens that would make it easier to reproduce?

clrxbl (Author) commented Feb 1, 2024

Any thoughts why clamp-mss-to-pmtu might be clamping to a value that's too high? Any additional details you could share about your environment where this happens that would make it easier to reproduce?

The setup here is a bare-metal Kubernetes cluster running Cilium, with 10 Gbit NICs where the MTU is normally 9000 for jumbo frames. Within containers there is therefore a network interface with an MTU of 9000, which clamp-mss-to-pmtu is presumably clamping to.
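
For reference, the clamp-mss-to-pmtu behaviour referred to here is roughly the iptables equivalent of the following (a sketch, not the verbatim rule tailscaled installs; the nftables form appears later in this thread):

    iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o tailscale0 -j TCPMSS --clamp-mss-to-pmtu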

clrxbl (Author) commented Feb 1, 2024

I don't really write Go but I've forked Tailscale and jankily added what I needed for iptables at clrxbl@2441f17

Clamping the TCP MSS to 1280 solves my issues and makes the Tailscale operator usable again. Without this change, connections either hang while connecting (e.g. HTTPS connections) or are just unusably slow.

It'd be great to not rely on me maintaining a fork though. I'd be interested to clean it up and implement NFTables support if Tailscale was interested in a PR for this.
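
Concretely, the fork boils down to adding a fixed-MSS rule of this shape (the same rule I share again further down in this thread; the interface name and MSS value are specific to my environment):

    iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o eth0 -j TCPMSS --set-mss 1240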

knyar (Contributor) commented Feb 2, 2024

Thanks for confirming that setting the MSS value manually addressed your issue. I think we'd want to avoid exposing this as a user-facing configuration parameter, instead improving the existing behaviour or determining the MSS value automatically.

Do you think you could please share the output of the following commands from the pod running Tailscale that required this change?

  • ip addr
  • ip route show cache
  • sysctl -a | grep mtu

Thank you!

clrxbl (Author) commented Feb 2, 2024

Thanks for confirming that setting the MSS value manually addressed your issue. I think we'd want to avoid exposing this as a user-facing configuration parameter, instead improving the existing behaviour or determining the MSS value automatically.

Do you think you could please share the output of the following commands from the pod running Tailscale that required this change?

  • ip addr
  • ip route show cache
  • sysctl -a | grep mtu

Thank you!

Sure thing.

/ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: tailscale0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1280 qdisc fq state UNKNOWN group default qlen 500
    link/none
    inet 100.69.77.47/32 scope global tailscale0
       valid_lft forever preferred_lft forever
    inet6 fd7a:115c:a1e0::8745:4d2f/128 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::90ee:1993:783a:d025/64 scope link stable-privacy
       valid_lft forever preferred_lft forever
53584: eth0@if53585: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b6:cd:db:a6:da:13 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.16.0.6/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::b4cd:dbff:fea6:da13/64 scope link
       valid_lft forever preferred_lft forever
/ # ip route show cache
/ # sysctl -a | grep mtu
sysctl: error reading key 'kernel.apparmor_display_secid_mode': Operation not permitted
sysctl: error reading key 'kernel.unprivileged_userns_apparmor_policy': Operation not permitted
net.ipv4.ip_forward_use_pmtu = 0
net.ipv4.ip_no_pmtu_disc = 0
net.ipv4.route.min_pmtu = 552
net.ipv4.route.mtu_expires = 600
net.ipv4.tcp_mtu_probe_floor = 48
net.ipv4.tcp_mtu_probing = 0
net.ipv6.conf.all.accept_ra_mtu = 1
net.ipv6.conf.all.mtu = 1280
sysctl: error reading key 'net.ipv6.conf.all.stable_secret': I/O error
net.ipv6.conf.default.accept_ra_mtu = 1
net.ipv6.conf.default.mtu = 1280
sysctl: error reading key 'net.ipv6.conf.default.stable_secret': I/O error
net.ipv6.conf.eth0.accept_ra_mtu = 1
net.ipv6.conf.eth0.mtu = 9000
sysctl: error reading key 'net.ipv6.conf.eth0.stable_secret': I/O error
net.ipv6.conf.lo.accept_ra_mtu = 1
net.ipv6.conf.lo.mtu = 65536
sysctl: error reading key 'net.ipv6.conf.lo.stable_secret': I/O error
net.ipv6.conf.tailscale0.accept_ra_mtu = 1
net.ipv6.conf.tailscale0.mtu = 1280
net.ipv6.route.mtu_expires = 600

clrxbl (Author) commented Feb 4, 2024

I should clarify that clamping the tailscale0 interface MSS to a smaller MTU value doesn't resolve the issue. Instead I have to apply this rule onto the container's main network interface e.g.

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o eth0 -j TCPMSS --set-mss 1240

Without this rule, traffic like HTTPS just never completes.

curl https://100.94.66.26 -vvvvv
*   Trying 100.94.66.26:443...
* Connected to 100.94.66.26 (100.94.66.26) port 443 (#0)
* ALPN: offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs

knyar (Contributor) commented Feb 7, 2024

I don't have access to a bare metal cluster with jumbo frames, but I could not reproduce this on a GKE cluster with a higher MTU. E.g.:

2: eth0@if13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8896 qdisc noqueue state UP group default
    link/ether 32:ef:80:2d:64:80 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.44.0.11/24 brd 10.44.0.255 scope global eth0
       valid_lft forever preferred_lft forever
3: tailscale0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1280 qdisc pfifo_fast state UNKNOWN group default qlen 500
    link/none
    inet 100.80.228.1/32 scope global tailscale0
       valid_lft forever preferred_lft forever

A TCP connection from the cluster into the tailnet gets its MSS correctly adjusted (from 8856 to 1240).

 # tcpdump -ni any port 8092
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
15:12:58.836981 eth0  In  IP 10.44.0.7.59534 > 10.44.0.11.8092: Flags [S], seq 619223444, win 17712, options [mss 8856,sackOK,TS val 4008564504 ecr 0,nop,wscale 7], length 0
15:12:58.837012 tailscale0 Out IP 100.80.228.1.59534 > 100.99.113.119.8092: Flags [S], seq 619223444, win 17712, options [mss 1240,sackOK,TS val 4008564504 ecr 0,nop,wscale 7], length 0

Could you please share a bit more details about what feature of the operator you are using (egress, ingress via a Service or ingress via an Ingress resource) and about HTTPS client and server involved? Capturing MSS values of a connection (like I shared above) would be helpful too.

I should clarify that clamping the tailscale0 interface MSS to a smaller MTU value doesn't resolve the issue. Instead I have to apply this rule onto the container's main network interface e.g.
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o eth0 -j TCPMSS --set-mss 1240

This is a bit odd - why is it necessary to clamp MSS to 1240 while egressing to eth0, which has mtu of 9000?

@knyar knyar closed this as completed Feb 7, 2024
@knyar knyar reopened this Feb 7, 2024
clrxbl (Author) commented Feb 8, 2024

Could you please share a bit more details about what feature of the operator you are using (egress, ingress via a Service or ingress via an Ingress resource) and about HTTPS client and server involved? Capturing MSS values of a connection (like I shared above) would be helpful too.

Currently I only use ingress via a Service.

This is a bit odd - why is it necessary to clamp MSS to 1240 while egressing to eth0, which has mtu of 9000?

Honestly, not sure. But for some reason this has been my only solution to Tailscale networking issues. Without this, it seems to specifically break within any other Docker container.

Server is in this case ingress-nginx, Let's Encrypt TLS certificate handled by cert-manager & client is just curl.

If I curl something on the host, it seems to connect just fine. If I repeat the same within e.g. a Debian docker container, the connection breaks. Docker by default creates a network with 1500 MTU. Manually changing this to 1240 makes it work; so either I change it on every host to 1240 or I change it in the Tailscale operator to change the MTU for egress.
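
For completeness, the Docker-side change referred to above is just lowering the bridge MTU, roughly as follows (values illustrative, and not something I want to maintain on every host):

    # /etc/docker/daemon.json - affects the default bridge network
    {
      "mtu": 1240
    }

    # or per user-defined network
    docker network create --opt com.docker.network.driver.mtu=1240 example-net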

The same thing happens when trying to curl a Tailscale service within a Debian container on Kubernetes running Cilium as the CNI, with Tailscale running on the node itself so that all pods have access to Tailscale networking.
It seems like something is going wrong when Docker (and Cilium) networking tries to access a service over Tailscale's network.

knyar (Contributor) commented Mar 5, 2024

I could not reproduce this on a GKE cluster using Dataplane V2, which I believe is based on Cilium. Are you able to share additional details about the cluster you are running this in that might help us reproduce the issue? At the minimum I think it would help to see your Cilium configuration and the version of Cilium you are using.

wokalski commented Mar 5, 2024

Sorry for not converting it to yaml but you'll get the gist. In my case the config looked like this:

{
    helm.releases.cilium = {
      namespace = "kube-system";
      chart = lib.helm.downloadHelmChart {
        repo = "https://helm.cilium.io/";
        chart = "cilium";
        version = "1.12.13";
        chartHash = "sha256-VhgURNprXIZuuR0inWeUlRJPBGwbnXC8zzSxKNL1zzE=";
      };
      values = {
        ipam = { mode = "cluster-pool"; };
        k8s = { requireIPv4PodCIDR = true; };
        hubble = {
          tls.enabled = false;
          relay = { enabled = true; };
          ui = { enabled = true; };
        };
        operator = { replicas = 1; };
        egressGateway = { enabled = true; };
        bpf = {
          masquerade = true;
          lbExternalClusterIP = true;
        };
        kubeProxyReplacement = "strict";
        # Works around https://github.com/tailscale/tailscale/issues/11002
        extraConfig = { mtu = "1200"; };
        socketLB.hostNamespaceOnly = true;
        rollOutCiliumPods = true;
      };
    };
  }

Tailscale operator is deployed without any extra flags. The services leverage the annotation:

"tailscale.com/tailnet-ip": ... ip

clrxbl (Author) commented Mar 5, 2024

I could not reproduce this on a GKE cluster using Dataplane V2, which I believe is based on Cilium. Are you able to share additional details about the cluster you are running this in that might help us reproduce the issue? At the minimum I think it would help to see your Cilium configuration and the version of Cilium you are using.

#11002 (comment) describes my setup
My Cilium config/Helm values are for Cilium 1.15.0:

    devices: ["bond0.+", "tailscale0"]
    rollOutCiliumPods: true
    bandwidthManager:
      enabled: true
    hostFirewall:
      enabled: true
    nodePort:
      enabled: true
    nodeinit:
      enabled: true
      restartPods: true
    socketLB:
      enabled: true
      hostNamespaceOnly: true
    kubeProxyReplacement: true
    k8sServiceHost: cluster-endpoint
    k8sServicePort: 6443
    hubble:
      tls:
        auto:
          method: cronJob
      enabled: true
      relay:
        enabled: true
      ui:
        enabled: true
      metrics:
        serviceMonitor:
          enabled: true
        enabled:
          - dns:query;ignoreAAAA
          - drop
          - tcp
          - icmp
          - port-distribution
    hostPort:
      enabled: true
    prometheus:
      enabled: true
      serviceMonitor:
        enabled: true
    operator:
      prometheus:
        enabled: true
        serviceMonitor:
          enabled: true
    ipv6:
      enabled: false
    bpf:
      hostRoutingLegacy: false
      masquerade: true
    l7Proxy: false

irbekrm added a commit that referenced this issue Apr 2, 2024
MSS clamping for nftables was mostly not run due to an earlier rule in the FORWARD chain issuing an accept verdict.
This commit places the clamping rule into a chain of its own to ensure that it runs for all SYN packets.

Updates #11002

Signed-off-by: Irbe Krumina <irbe@tailscale.com>
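
In nftables terms, the fix amounts to moving the clamping rule out of the shared FORWARD chain and into a dedicated base chain on the forward hook, so that an accept verdict in ts-forward can no longer prevent it from running. A minimal sketch of the resulting layout (the chain name is illustrative, not necessarily what the commit uses):

    table ip filter {
            chain ts-clamp {
                    type filter hook forward priority filter; policy accept;
                    oifname "tailscale0*" tcp flags syn tcp option maxseg size set rt mtu
            }
    }
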
irbekrm (Contributor) commented Apr 2, 2024

I was trying to reproduce the issue and was able to partially reproduce this if at least one of the two Tailscale nodes connecting configures its firewall using nftables and is also the one sending larger packets. It appears that our MSS clamping rule in nftables mode is not actually getting executed, because it gets placed in a chain where an earlier rule jumps to another chain which returns with an accept verdict.
I was testing this with a cross-cluster setup where both clusters have a high default MTU and one runs Cilium in kube proxy replacement mode. I think it's possible that we did not see the issue before because path MTU discovery kicked in; perhaps it does not work with Cilium in kube proxy replacement mode.

Some debugging details from a cross-cluster setup

  • Cluster A: a self hosted cluster with Cilium in kube proxy replacement mode with MTU 65536, with a workload exposed via Tailscale Ingress Service, proxies firewall mode = nftables:

    - `tcpdump` on the workload Pod exposed via the tailscale ingress proxy shows that the caller sends MSS 8856, which is wrong:

      // 10.233.65.114 - ingress proxy's Pod IP
      // 10.233.64.192 - workload Pod's IP
      # tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
      IP (tos 0x0, ttl 60, id 4347, offset 0, flags [DF], proto TCP (6), length 60)
          10.233.65.114.35290 > 10.233.64.192.8080: Flags [S], cksum 0x61d7 (correct), seq 2383755368, win 35424, options [mss 8856,sackOK,TS val 243476597 ecr 0,nop,wscale 7], length 0

    - `nft monitor` on the ingress proxy Pod shows that the clamping rule is never executed (I assume because we jump to the ts-forward chain in an earlier rule and the packet gets accepted there):

      trace id 2a043d97 ip filter FORWARD packet: iif "eth0" oif "tailscale0" ether saddr 0a:6a:c2:45:5c:c9 ether daddr 96:5b:c2:7b:e4:dd ip saddr 10.233.31.12 ip daddr 100.75.96.30 ip dscp cs0 ip ecn not-ect ip ttl 62 ip id 0 ip length 60 tcp sport 8080 tcp dport 58346 tcp flags == 0x12 tcp window 61558
      trace id 2a043d97 ip filter FORWARD rule counter packets 100 bytes 6178 meta nftrace set 1 jump ts-forward (verdict jump ts-forward)
      trace id 2a043d97 ip filter ts-forward rule oifname "tailscale0*" counter packets 44 bytes 2426 meta nftrace set 1 accept (verdict accept)

  • Cluster B: a GKE cluster with Calico, MTU 65536, a workload that accesses a tailnet service from cluster B via an egress proxy, proxies firewall mode = nftables:

    - the workload is not able to access the exposed service via egress:

      / # curl -vvv kuard-egress:8080
      * Host kuard-egress:8080 was resolved.
      * IPv6: (none)
      * IPv4: 10.96.0.76
      *   Trying 10.96.0.76:8080...
      * Connected to kuard-egress (10.96.0.76) port 8080
      > GET / HTTP/1.1
      > Host: kuard-egress:8080
      > User-Agent: curl/8.5.0
      > Accept: */*

    - running `tcpdump` on the workload shows that the tailnet service sends back a SYN/ACK packet suggesting that its MSS is 8806 - this is already wrong:

      / # tcpdump -t -n -vvv src host 10.96.0.76
      // 10.96.0.76 is the IP of the tailscale egress proxy Pod
      tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
      IP (tos 0x0, ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 60)
          10.96.0.76.8080 > 10.96.0.18.51404: Flags [S.], cksum 0x7910 (correct), seq 4141237154, ack 3535746567, win 61558, options [mss 8806,sackOK,TS val 1121584187 ecr 242035578,nop,wscale 7], length 0

    - running `nft monitor` on the egress proxy shows that we do not execute the clamping rule on the egress proxy side because the packet gets accepted by another rule in the `ts-forward` chain, which we jump to from `FORWARD`:

/ # nft list table ip filter
table ip filter {
    chain FORWARD {
            type filter hook forward priority filter; policy accept;
            counter packets 51 bytes 3165 jump ts-forward
            oifname "tailscale0*" tcp flags syn tcp option maxseg size set rt mtu
    }

    chain INPUT {
            type filter hook input priority filter; policy accept;
            counter packets 1664 bytes 3041979 jump ts-input
    }

    chain ts-forward {
            iifname "tailscale0*" counter packets 23 bytes 1265 meta mark set meta mark & 0xffff04ff | 0x00000400
            meta mark & 0x0000ff00 == 0x00000400 counter packets 23 bytes 1265 accept
            oifname "tailscale0*" ip saddr 100.64.0.0/10 counter packets 0 bytes 0 drop
            oifname "tailscale0*" counter packets 28 bytes 1900 accept
    }

    chain ts-input {
            iifname "lo*" ip saddr 100.75.96.30 counter packets 0 bytes 0 accept
            iifname != "tailscale0*" ip saddr 100.115.92.0/23 counter packets 0 bytes 0 return
            iifname != "tailscale0*" ip saddr 100.64.0.0/10 counter packets 0 bytes 0 drop
            iifname "tailscale0*" counter packets 0 bytes 0 accept
            udp dport 48148 counter packets 817 bytes 99132 accept
    }

}

nft monitor

...
trace id afb8fb18 ip filter FORWARD packet: iif "eth0" oif "tailscale0" ether saddr 82:22:b1:09:01:1c ether daddr 06:64:09:e4:5c:1e ip saddr 10.96.0.18 ip daddr 100.110.16.30 ip dscp cs0 ip ecn not-ect ip ttl 62 ip id 43585 ip length 52 tcp sport 34354 tcp dport 8080 tcp flags == ack tcp window 277
trace id afb8fb18 ip filter FORWARD rule counter packets 60 bytes 3669 meta nftrace set 1 jump ts-forward (verdict jump ts-forward)
trace id afb8fb18 ip filter ts-forward rule oifname "tailscale0*" counter packets 31 bytes 2068 meta nftrace set 1 accept (verdict accept)
...

@wokalski , @clrxbl is it possible that either of you is running the Tailscale client with its firewall configured in nftables mode on either end of the failing connection?
If that is not the case, then there might be another bug as well. In that case, if you still have the setup around and get a chance to run tcpdump on either end (either the exposed workload or the caller) like in the examples I posted, checking whether the other end sends an MSS larger than expected would be awesome 🙏🏼

clrxbl (Author) commented Apr 2, 2024

I was trying to reproduce the issue and was able to partially reproduce this if at least one of the two Tailscale nodes connecting configures its firewall using nftables and also is the one sending larger packets. It appears that our MSS clamping rule in nftables mode is not actually getting executed because it gets placed in a chain where an earlier rule jumps to another chain which returns with accept verdict. I was testing this with a cross cluster setup where both clusters have high default MTU and one runs Cilium in kube proxy replacement mode. I think it's possible that we did not see the issue before because path MTU discovery kicked in- perhaps it does not work with Cilium in kube proxy replacement mode.

Some debugging details from a cross-cluster setup

  • Cluster A: a self hosted cluster with Cilium in kube proxy replacement mode with MTU 65536, with a workload exposed via Tailscale Ingress Service, proxies firewall mode = nftables:

  • Cluster B: a GKE cluster with Calico, MTU 65536, a workload that accesses a tailnet service from cluster B via egress proxy, proxies firewall mode = nftables:

@wokalski , @clrxbl is it possible that either of you are running Tailscale client with firewall configured in nftables mode on either end of the failing connection? If that is not the case, then it might be that there is another bug as well- in that case, if you still have the setup around get a chance to try to run tcpdump on either end (either the exposed workload or the caller) like in the examples I posted and check whether the other end sends MSS larger than expected that would be awesome 🙏🏼

I believe I encountered this issue client-side on both my macOS laptop & Debian 12 server with nftables. I could try again with your branch if there's an operator image available for it.

irbekrm (Contributor) commented Apr 2, 2024

Thank you very much @clrxbl! I believe #11588 will fix the nftables clamping issue - once that merges, I will cut a new image and ping you.

raggi added a commit that referenced this issue Apr 2, 2024
We now allow some more ICMP errors to flow, specifically:

- ICMP parameter problem in both IPv4 and IPv6 (corrupt headers)
- ICMP Packet Too Big (for IPv6 PMTU)

Updates #311
Updates #8102
Updates #11002

Signed-off-by: James Tucker <james@tailscale.com>

irbekrm (Contributor) commented Apr 2, 2024

@clrxbl the PR that fixes the nftables clamping has merged and I've cut a new image/Helm chart version with that fix as well as #11591.
The unstable-v1.63.72 image tag / 1.63.72 Helm chart from our dev repo has both of these changes, if you get a chance to try them out.

clrxbl (Author) commented Apr 2, 2024

@clrxbl the PR that fixes the nftables clamping merged and I've cut a new image/Helm chart version with that fix as well as #11591 unstable-v1.63.72 image tag/1.63.72 Helm chart from our dev repo has both of these changes if you get a chance to try them out.

Replaced all instances of my Tailscale Operator fork with your image & Helm chart. It does not seem to solve my issue where I cannot connect to Tailscale services on another host within a Kubernetes (+ Cilium) pod. It gets stuck after trying to send the TLS handshake when curl'ing something.

Reverting back to my fork with customizable TCP MSS clamping fixes everything again.

wokalski commented Apr 2, 2024

if at least one of the two Tailscale nodes connecting configures its firewall using nftables and also is the one sending larger packets.

Is the hypothesis that the host firewall interferes with Tailscale? If Tailscale runs inside of a Pod, why would the host firewall have any impact here?

irbekrm (Contributor) commented Apr 2, 2024

Replaced all instances of my Tailscale Operator fork with your image & Helm chart. It does not seem to solve my issue where I cannot connect to Tailscale services on another host within a Kubernetes (+ Cilium) pod. It gets stuck after trying to send the TLS handshake when curl'ing something.

Appreciate you testing out the patch and thanks for confirming. If you get a chance to run some tcpdumps client-side to check the MSS for packets coming in that would be awesome (similar to #11002 (comment) )

is the hypothesis that the host firewall interferes with tailscale

No, this is purely about the nftables rules set by Tailscale/the proxies (so, if running in a Pod, then in the Pod's network namespace).

clrxbl (Author) commented Apr 2, 2024

Replaced all instances of my Tailscale Operator fork with your image & Helm chart. It does not seem to solve my issue where I cannot connect to Tailscale services on another host within a Kubernetes (+ Cilium) pod. It gets stuck after trying to send the TLS handshake when curl'ing something.

Appreciate you testing out the patch and thanks for confirming. If you get a chance to run some tcpdumps client-side to check the MSS for packets coming in that would be awesome (similar to #11002 (comment) )

is the hypothesis that the host firewall interferes with tailscale

no, this is purely about nftables rules set by tailscale/proxies (so, if running in a Pod then in Pod's network namespace)

# fork of tailscale-operator where everything functions
debug:/# tcpdump -t -n -vvv src host 100.94.66.26
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
IP (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    100.94.66.26.443 > 172.16.1.48.41602: Flags [S.], cksum 0xcec6 (correct), seq 3496847111, ack 374324766, win 62286, options [mss 1240,sackOK,TS val 1061588656 ecr 573788851,nop,wscale 7], length 0
IP (tos 0x0, ttl 61, id 39364, offset 0, flags [DF], proto TCP (6), length 52)
    100.94.66.26.443 > 172.16.1.48.41602: Flags [.], cksum 0xec18 (correct), seq 1, ack 518, win 483, options [nop,nop,TS val 1061588658 ecr 573788854], length 0
IP (tos 0x0, ttl 61, id 39365, offset 0, flags [DF], proto TCP (6), length 3204)
    100.94.66.26.443 > 172.16.1.48.41602: Flags [P.], cksum 0x602f (incorrect -> 0x77d4), seq 1:3153, ack 518, win 483, options [nop,nop,TS val 1061588660 ecr 573788854], length 3152
IP (tos 0x0, ttl 61, id 39368, offset 0, flags [DF], proto TCP (6), length 131)
    100.94.66.26.443 > 172.16.1.48.41602: Flags [P.], cksum 0xf43a (correct), seq 3153:3232, ack 758, win 482, options [nop,nop,TS val 1061588670 ecr 573788866], length 79
IP (tos 0x0, ttl 61, id 39369, offset 0, flags [DF], proto TCP (6), length 131)
    100.94.66.26.443 > 172.16.1.48.41602: Flags [P.], cksum 0x3a61 (correct), seq 3232:3311, ack 758, win 482, options [nop,nop,TS val 1061588670 ecr 573788866], length 79
IP (tos 0x0, ttl 61, id 39370, offset 0, flags [DF], proto TCP (6), length 123)
    100.94.66.26.443 > 172.16.1.48.41602: Flags [P.], cksum 0x1fff (correct), seq 3311:3382, ack 758, win 482, options [nop,nop,TS val 1061588671 ecr 573788866], length 71
IP (tos 0x0, ttl 61, id 39371, offset 0, flags [DF], proto TCP (6), length 285)
    100.94.66.26.443 > 172.16.1.48.41602: Flags [P.], cksum 0x4244 (correct), seq 3382:3615, ack 758, win 482, options [nop,nop,TS val 1061588671 ecr 573788866], length 233
IP (tos 0x0, ttl 61, id 39372, offset 0, flags [DF], proto TCP (6), length 83)
    100.94.66.26.443 > 172.16.1.48.41602: Flags [P.], cksum 0x7de3 (correct), seq 3615:3646, ack 758, win 482, options [nop,nop,TS val 1061588671 ecr 573788866], length 31
IP (tos 0x0, ttl 61, id 39373, offset 0, flags [DF], proto TCP (6), length 52)
    100.94.66.26.443 > 172.16.1.48.41602: Flags [.], cksum 0xdc9c (correct), seq 3646, ack 813, win 482, options [nop,nop,TS val 1061588671 ecr 573788866], length 0
IP (tos 0x0, ttl 61, id 39374, offset 0, flags [DF], proto TCP (6), length 52)
    100.94.66.26.443 > 172.16.1.48.41602: Flags [F.], cksum 0xdc9b (correct), seq 3646, ack 813, win 482, options [nop,nop,TS val 1061588671 ecr 573788866], length 0
IP (tos 0x0, ttl 61, id 39375, offset 0, flags [DF], proto TCP (6), length 52)
    100.94.66.26.443 > 172.16.1.48.41602: Flags [.], cksum 0xdc98 (correct), seq 3647, ack 814, win 482, options [nop,nop,TS val 1061588672 ecr 573788867], length 0
^C
11 packets captured
11 packets received by filter
0 packets dropped by kernel

# tailscale-operator unstable v1.63.72
debug:/# tcpdump -t -n -vvv src host 100.94.66.26
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
IP (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    100.94.66.26.443 > 172.16.1.48.42414: Flags [S.], cksum 0xcc19 (correct), seq 1170718300, ack 3918485864, win 62286, options [mss 1240,sackOK,TS val 1165424866 ecr 573892247,nop,wscale 7], length 0
IP (tos 0x0, ttl 61, id 62299, offset 0, flags [DF], proto TCP (6), length 52)
    100.94.66.26.443 > 172.16.1.48.42414: Flags [.], cksum 0xe96d (correct), seq 1, ack 518, win 483, options [nop,nop,TS val 1165424867 ecr 573892249], length 0
^C
2 packets captured
2 packets received by filter
0 packets dropped by kernel

MSS seems to be the same unless I'm reading it wrong.
For context, my fork runs the following:

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o eth0 -j TCPMSS --set-mss 1240

i.e. clamping on the container network interface itself.

irbekrm (Contributor) commented Apr 2, 2024

tailscale-operator unstable v1.63.72
debug:/# tcpdump -t -n -vvv src host 100.94.66.26
100.94.66.26.443 > 172.16.1.48.42414: Flags [S.], cksum 0xcc19 (correct), seq 1170718300, ack 3918485864, win 62286,
options [mss 1240,sackOK,TS val 1165424866 ecr 573892247,nop,wscale 7], length 0

Thank you very much for that.
To clarify - this is for a connection attempt that nevertheless did not succeed?
Would you also be able to do it on the other end (if you are able to run tcpdump somewhere in between the proxy Pod and your exposed workload, ideally on the workload Pod, but not sure if you can install tcpdump there?)

clrxbl (Author) commented Apr 2, 2024

tailscale-operator unstable v1.63.72
debug:/# tcpdump -t -n -vvv src host 100.94.66.26
100.94.66.26.443 > 172.16.1.48.42414: Flags [S.], cksum 0xcc19 (correct), seq 1170718300, ack 3918485864, win 62286,
options [mss 1240,sackOK,TS val 1165424866 ecr 573892247,nop,wscale 7], length 0

Thank you very much for that. To clarify - this is for a connection attempt that nevertheless did not succeed? Would you also be able to do it on the other end (if you are able to run tcpdump somewhere in between the proxy Pod and your exposed workload, ideally on the workload Pod, but not sure if you can install tcpdump there?)

Yeah, this is a curl GET request to a Tailscale pod exposing an ingress-nginx server; curl gets stuck after the TLS handshake attempt, see #11002 (comment)

Using an ephemeral container I was able to tcpdump on the ingress-nginx pod itself. Interestingly enough, there is no traffic from/to the Tailscale proxy pod as soon as I switch to the unstable operator version and run the curl command.

I have also run tcpdump on the Tailscale proxy itself.

# Ingress-nginx pod while running curl command
/ # tcpdump -t -n -vvv dst host 172.18.0.203 or src 172.18.0.203
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

# Tailscale proxy pod while running curl command
/ # tcpdump -t -n -vvv
tcpdump: listening on tailscale0, link-type RAW (Raw IP), snapshot length 262144 bytes
IP (tos 0x0, ttl 62, id 24397, offset 0, flags [DF], proto TCP (6), length 60)
    100.72.5.55.52020 > 100.94.66.26.443: Flags [S], cksum 0xed9a (correct), seq 547449065, win 62370, options [mss 8910,sackOK,TS val 576438491 ecr 0,nop,wscale 7], length 0
IP (tos 0x0, ttl 62, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    100.94.66.26.443 > 100.72.5.55.52020: Flags [S.], cksum 0x1026 (incorrect -> 0xd160), seq 1828502281, ack 547449066, win 62286, options [mss 1240,sackOK,TS val 2770955587 ecr 576438491,nop,wscale 7], length 0
IP (tos 0x0, ttl 62, id 24398, offset 0, flags [DF], proto TCP (6), length 52)
    100.72.5.55.52020 > 100.94.66.26.443: Flags [.], cksum 0xf0b6 (correct), seq 1, ack 1, win 488, options [nop,nop,TS val 576438492 ecr 2770955587], length 0
IP (tos 0x0, ttl 62, id 24399, offset 0, flags [DF], proto TCP (6), length 569)
    100.72.5.55.52020 > 100.94.66.26.443: Flags [P.], cksum 0x73bd (correct), seq 1:518, ack 1, win 488, options [nop,nop,TS val 576438493 ecr 2770955587], length 517
IP (tos 0x0, ttl 62, id 51990, offset 0, flags [DF], proto TCP (6), length 52)
    100.94.66.26.443 > 100.72.5.55.52020: Flags [.], cksum 0x101e (incorrect -> 0xeeb3), seq 1, ack 518, win 483, options [nop,nop,TS val 2770955589 ecr 576438493], length 0
^C
5 packets captured
5 packets received by filter
0 packets dropped by kernel

irbekrm (Contributor) commented Apr 3, 2024

IP (tos 0x0, ttl 62, id 24397, offset 0, flags [DF], proto TCP (6), length 60)
100.72.5.55.52020 > 100.94.66.26.443: Flags [S], cksum 0xed9a (correct), seq 547449065, win 62370, options [mss 8910,sackOK,TS val 576438491 ecr 0,nop,wscale 7], length 0

This is unexpected. If I understand correctly, 100.72.5.55 is the tailnet IP of the client you are calling from? Packets received from the tailnet client should have the MSS of 1240. I ran the same tcpdump in my proxy Pod and wasn't able to reproduce this.

So, it looks like what might be happening is:

  1. Your client sends a tcp syn packet, that ends up at the cluster workload with MSS 8910 (this is wrong)
  2. Cluster workload sends back a packet that presumably starts off with a MSS 8960, but we clamp it to 1240 in the proxy Pod as it's sent out via tailscale0 (from your logs, this part appears to be working)
  3. Cluster workload thinks that the client can accept packets of size 8910 + the path MTU discovery is not working, so it tries to send packets that are too big and cannot recover.

We should figure out how the MSS on the packet received from client ended up being 8910. Would you be able to run something like tcpdump -t -n -vvv -i tailscale0 dst host 100.94.66.26 on your client and check that the initial packet MSS is 1240?

Also, I noticed this in your setup:

devices: ["bond0.+", "tailscale0"]

Why do you need to add tailscale0 to Cilium devices? Does that mean that you are running tailscale on nodes? Have you tried whether it works without this?

irbekrm (Contributor) commented Apr 3, 2024

Hey, just a quick question; are you guys gonna keep working on this now? It's a longer story but I'm asking because setting the MTU setting in cilium had some consequences on IPSec tunnels not working.

We certainly want to figure out what is happening here! The issue you originally had is not necessarily the same as the one in the original issue description. Have you tried the nftables clamping fix #11002 (comment) ?

clrxbl (Author) commented Apr 3, 2024

IP (tos 0x0, ttl 62, id 24397, offset 0, flags [DF], proto TCP (6), length 60)
100.72.5.55.52020 > 100.94.66.26.443: Flags [S], cksum 0xed9a (correct), seq 547449065, win 62370, options [mss 8910,sackOK,TS val 576438491 ecr 0,nop,wscale 7], length 0

This is unexpected. If I understand correctly, 100.72.5.55 is the tailnet IP of the client you are calling from? Packets received from the tailnet client should have the MSS of 1240. I ran the same tcpdump in my proxy Pod and wasn't able to reproduce this.

So, it looks like what might be happening is:

  1. Your client sends a tcp syn packet, that ends up at the cluster workload with MSS 8910 (this is wrong)
  2. Cluster workload sends back a packet that presumably starts off with a MSS 8960, but we clamp it to 1240 in the proxy Pod as it's sent out via tailscale0 (from your logs, this part appears to be working)
  3. Cluster workload thinks that the client can accept packets of size 8910 + the path MTU discovery is not working, so it tries to send packets that are too big and cannot recover.

We should figure out how the MSS on the packet received from client ended up being 8910. Would you be able to run something like tcpdump -t -n -vvv -i tailscale0 dst host 100.94.66.26 on your client and check that the initial packet MSS is 1240?

Also, I noticed this in your setup:

devices: ["bond0.+", "tailscale0"]

Why do you need to add tailscale0 to Cilium devices? Does that mean that you are running tailscale on nodes? Have you tried whether it works without this?

I add tailscale0 to Cilium devices on our baremetal development/staging clusters that run Tailscale on nodes. It makes it easier to access other Tailscale services within containers without having to create Tailscale egress proxies.

#11002 (comment)
I should clarify that this is incorrect. This issue only actually occurs when attempting to connect to a Tailscale service from within a container. This issue also happens outside of Kubernetes and Cilium in e.g. a server that runs Docker & Tailscale on the host.

I've run the same curl request from within a debian:bookworm container on a baremetal server that runs Tailscale & Docker instead of Kubernetes + Cilium this time. The SSL handshake gets stuck again, as expected. Here is said tcpdump taken from the Docker host.

# tcpdump -t -n -vvv -i tailscale0 dst host 100.94.66.26
tcpdump: listening on tailscale0, link-type RAW (Raw IP), snapshot length 262144 bytes
IP (tos 0x0, ttl 63, id 12055, offset 0, flags [DF], proto TCP (6), length 60)
    100.86.95.53.39282 > 100.94.66.26.443: Flags [S], cksum 0x6a32 (incorrect -> 0x8ed6), seq 4128420992, win 64240, options [mss 1460,sackOK,TS val 3013761745 ecr 0,nop,wscale 7], length 0
IP (tos 0x0, ttl 63, id 12056, offset 0, flags [DF], proto TCP (6), length 52)
    100.86.95.53.39282 > 100.94.66.26.443: Flags [.], cksum 0x6a2a (incorrect -> 0x1ff5), seq 4128420993, ack 1760119431, win 502, options [nop,nop,TS val 3013761746 ecr 1404014455], length 0
IP (tos 0x0, ttl 63, id 12057, offset 0, flags [DF], proto TCP (6), length 569)
    100.86.95.53.39282 > 100.94.66.26.443: Flags [P.], cksum 0x6c2f (incorrect -> 0xa810), seq 0:517, ack 1, win 502, options [nop,nop,TS val 3013761747 ecr 1404014455], length 517
IP (tos 0x0, ttl 63, id 12058, offset 0, flags [DF], proto TCP (6), length 64)
    100.86.95.53.39282 > 100.94.66.26.443: Flags [.], cksum 0x6a36 (incorrect -> 0x6947), seq 517, ack 1, win 502, options [nop,nop,TS val 3013761760 ecr 1404014457,nop,nop,sack 1 {2897:3153}], length 0
IP (tos 0x0, ttl 63, id 12059, offset 0, flags [DF], proto TCP (6), length 64)
    100.86.95.53.39282 > 100.94.66.26.443: Flags [F.], cksum 0x6a36 (incorrect -> 0x63d9), seq 517, ack 1, win 502, options [nop,nop,TS val 3013763149 ecr 1404014457,nop,nop,sack 1 {2897:3153}], length 0
IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    100.86.95.53.39282 > 100.94.66.26.443: Flags [R], cksum 0xfe14 (correct), seq 4128421511, win 0, length 0
^C
6 packets captured
6 packets received by filter
0 packets dropped by kernel

irbekrm (Contributor) commented Apr 3, 2024

#11002 (comment)
I should clarify that this is incorrect. This issue only actually occurs when attempting to connect to a Tailscale service from within a container. This issue also happens outside of Kubernetes and Cilium in e.g. a server that runs Docker & Tailscale on the host.

Thank you for clarifying 👍🏼

Here is said tcpdump taken from the Docker host
IP (tos 0x0, ttl 63, id 12055, offset 0, flags [DF], proto TCP (6), length 60)
100.86.95.53.39282 > 100.94.66.26.443: Flags [S], cksum 0x6a32 (incorrect -> 0x8ed6), seq 4128420992, win 64240, options [mss 1460,sackOK,TS val 3013761745 ecr 0,nop,wscale 7], length 0

So 100.86.95.53 is a tailscale client running in a container? The MSS here is also unexpected. What version of tailscale are these clients running? Would you be able to rerun the same tcpdump from a non-containerized client (it does not matter if the proxy is working or not, I just want to see the MSS on the initial SYN packet)?

clrxbl (Author) commented Apr 3, 2024

#11002 (comment)
I should clarify that this is incorrect. This issue only actually occurs when attempting to connect to a Tailscale service from within a container. This issue also happens outside of Kubernetes and Cilium in e.g. a server that runs Docker & Tailscale on the host.

Thank you for clarifying 👍🏼

Here is said tcpdump taken from the Docker host
IP (tos 0x0, ttl 63, id 12055, offset 0, flags [DF], proto TCP (6), length 60)
100.86.95.53.39282 > 100.94.66.26.443: Flags [S], cksum 0x6a32 (incorrect -> 0x8ed6), seq 4128420992, win 64240, options [mss 1460,sackOK,TS val 3013761745 ecr 0,nop,wscale 7], length 0

So 100.86.95.53 is a tailscale client running in container? The mss here is also unexpected. What version of tailscale are these clients running? Would you be able to rerun the same tcpdump from a non-containerized client (it does not matter if the proxy is working or not, I just want to see the mss on the initial syn packet

100.86.95.53 is the Docker host's Tailscale client which then gets used within every other Docker container running on that host.

Here's a tcpdump while running the same curl request (that does function) on the host instead of within a Docker container.

# tcpdump -t -n -vvv -i tailscale0 dst host 100.94.66.26
tcpdump: listening on tailscale0, link-type RAW (Raw IP), snapshot length 262144 bytes
IP (tos 0x0, ttl 64, id 62725, offset 0, flags [DF], proto TCP (6), length 60)
    100.86.95.53.33470 > 100.94.66.26.443: Flags [S], cksum 0x6a32 (incorrect -> 0x9cf6), seq 2843078703, win 64480, options [mss 1240,sackOK,TS val 642580885 ecr 0,nop,wscale 7], length 0
IP (tos 0x0, ttl 64, id 62726, offset 0, flags [DF], proto TCP (6), length 52)
    100.86.95.53.33470 > 100.94.66.26.443: Flags [.], cksum 0x6a2a (incorrect -> 0x29cb), seq 2843078704, ack 4249475621, win 504, options [nop,nop,TS val 642580887 ecr 883199710], length 0
IP (tos 0x0, ttl 64, id 62727, offset 0, flags [DF], proto TCP (6), length 569)
    100.86.95.53.33470 > 100.94.66.26.443: Flags [P.], cksum 0x6c2f (incorrect -> 0xb944), seq 0:517, ack 1, win 504, options [nop,nop,TS val 642580890 ecr 883199710], length 517
IP (tos 0x0, ttl 64, id 62728, offset 0, flags [DF], proto TCP (6), length 52)
    100.86.95.53.33470 > 100.94.66.26.443: Flags [.], cksum 0x6a2a (incorrect -> 0x1b74), seq 517, ack 3153, win 495, options [nop,nop,TS val 642580892 ecr 883199716], length 0
IP (tos 0x0, ttl 64, id 62729, offset 0, flags [DF], proto TCP (6), length 132)
    100.86.95.53.33470 > 100.94.66.26.443: Flags [P.], cksum 0x6a7a (incorrect -> 0xcc36), seq 517:597, ack 3153, win 503, options [nop,nop,TS val 642580915 ecr 883199716], length 80
IP (tos 0x0, ttl 64, id 62730, offset 0, flags [DF], proto TCP (6), length 138)
    100.86.95.53.33470 > 100.94.66.26.443: Flags [P.], cksum 0x6a80 (incorrect -> 0xd81b), seq 597:683, ack 3153, win 503, options [nop,nop,TS val 642580915 ecr 883199716], length 86
IP (tos 0x0, ttl 64, id 62731, offset 0, flags [DF], proto TCP (6), length 126)
    100.86.95.53.33470 > 100.94.66.26.443: Flags [P.], cksum 0x6a74 (incorrect -> 0x53d1), seq 683:757, ack 3232, win 503, options [nop,nop,TS val 642580915 ecr 883199738], length 74
IP (tos 0x0, ttl 64, id 62732, offset 0, flags [DF], proto TCP (6), length 83)
    100.86.95.53.33470 > 100.94.66.26.443: Flags [P.], cksum 0x6a49 (incorrect -> 0x9f76), seq 757:788, ack 3404, win 503, options [nop,nop,TS val 642580915 ecr 883199739], length 31
IP (tos 0x0, ttl 64, id 62733, offset 0, flags [DF], proto TCP (6), length 52)
    100.86.95.53.33470 > 100.94.66.26.443: Flags [R.], cksum 0x6a2a (incorrect -> 0x1827), seq 788, ack 3668, win 503, options [nop,nop,TS val 642580916 ecr 883199739], length 0
^C
9 packets captured
9 packets received by filter
0 packets dropped by kernel

Also providing the host's ifconfig results since those could be useful here.

bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000
        ether ac:1f:6b:f5:91:4e  txqueuelen 1000  (Ethernet)
        RX packets 16857203909  bytes 9122375161534 (8.2 TiB)
        RX errors 0  dropped 1014050  overruns 0  frame 0
        TX packets 32860629208  bytes 45679002292600 (41.5 TiB)
        TX errors 0  dropped 1 overruns 0  carrier 0  collisions 0

bond0.3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet <censored>  netmask 255.255.255.254  broadcast 255.255.255.255
        ether ac:1f:6b:f5:91:4e  txqueuelen 1000  (Ethernet)
        RX packets 9013534921  bytes 4600341044068 (4.1 TiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 9020338353  bytes 41423640238400 (37.6 TiB)
        TX errors 0  dropped 1708 overruns 0  carrier 0  collisions 0

bond0.14: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.0.0.1  netmask 255.255.0.0  broadcast 10.0.255.255
        ether ac:1f:6b:f5:91:4e  txqueuelen 1000  (Ethernet)
        RX packets 277472225  bytes 4028799206994 (3.6 TiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 287890350  bytes 3020894690204 (2.7 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

br-aed9a2d94e0f: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.21.0.1  netmask 255.255.0.0  broadcast 172.21.255.255
        ether 02:42:88:32:37:42  txqueuelen 0  (Ethernet)
        RX packets 638709  bytes 60708812 (57.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 773127  bytes 2217870477 (2.0 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:71:d1:e1:57  txqueuelen 0  (Ethernet)
        RX packets 215063964  bytes 628624678608 (585.4 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 239213383  bytes 1159284214720 (1.0 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp1s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 9000
        ether ac:1f:6b:f5:91:4e  txqueuelen 1000  (Ethernet)
        RX packets 3805385187  bytes 4002504935871 (3.6 TiB)
        RX errors 0  dropped 427848  overruns 0  frame 0
        TX packets 9169539979  bytes 13421386949912 (12.2 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp1s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 9000
        ether ac:1f:6b:f5:91:4e  txqueuelen 1000  (Ethernet)
        RX packets 13051818722  bytes 5119870225663 (4.6 TiB)
        RX errors 0  dropped 586202  overruns 0  frame 0
        TX packets 23691089229  bytes 32257615342688 (29.3 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 803695924  bytes 6717291468008 (6.1 TiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 803695924  bytes 6717291468008 (6.1 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

tailscale0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1280
        inet 100.86.95.53  netmask 255.255.255.255  destination 100.86.95.53
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 500  (UNSPEC)
        RX packets 6436164  bytes 757069171 (721.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2378774  bytes 11558542068 (10.7 GiB)
        TX errors 0  dropped 8 overruns 0  carrier 0  collisions 0

veth1047922: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether e6:e7:e7:76:23:fe  txqueuelen 0  (Ethernet)
        RX packets 638709  bytes 69650738 (66.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 773127  bytes 2217870477 (2.0 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ifconfig from within a container:

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.2  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::42:acff:fe11:2  prefixlen 64  scopeid 0x20<link>
        ether 02:42:ac:11:00:02  txqueuelen 0  (Ethernet)
        RX packets 491  bytes 9483835 (9.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 278  bytes 19886 (19.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

irbekrm (Contributor) commented Apr 3, 2024

Here's a tcpdump while running the same curl request (that does function) on the host instead of within a Docker container.

To clarify - does that function only with your fork, or also with the latest Tailscale? I do see that the MSS got set correctly there.

clrxbl (Author) commented Apr 3, 2024

Here's a tcpdump while running the same curl request (that does function) on the host instead of within a Docker container.

To clarify- that functions only with your fork or also with the latest tailscale? I do see that mss got set correctly there.

On the host it'll function with the latest Tailscale (your unstable image).
Within the container it only functions with my fork.

irbekrm (Contributor) commented Apr 3, 2024

Thanks for clarifying that!

So for now my understanding is that the issue in your case might be that when your client in a container calls the exposed cluster workload, the packets somehow first get sent via an interface with an MTU that is higher than that of tailscale0, so we end up sending a SYN packet with too high an MSS to the cluster workload - and that results in the cluster workload sending back packets that are too big.

I would then assume that, instead of setting the MSS on the proxy Pod's eth0, you could actually clamp on the tailscale0 interface that the Docker containers on your host use, and that would work too? (cc @raggi as clamping by default in the Linux client was discussed earlier)

I also noticed that there is a --enable-path-mtu-discovery flag for the Cilium agent. I wonder if setting that would help the cluster workload to discover the right packet size.
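
If anyone wants to experiment with that flag, one way it might be wired through is the agent ConfigMap, e.g. via the chart's extraConfig as shown earlier in this thread (the key name is assumed to mirror the flag name; unverified):

    extraConfig:
      enable-path-mtu-discovery: "true"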

clrxbl (Author) commented Apr 3, 2024

I also noticed that there is a --enable-path-mtu-discovery flag for the Cilium agent. I wonder if setting that would help the cluster workload to discover the right packet size.

I've tried that flag on Cilium before but it didn't seem to change anything.

I would then assume that if, instead of setting mss on the proxy Pod's eth0, you could actually clamp tailscale0 interface that the Docker containers on your host use and that would work too? (cc @raggi as clamping by default in the linux client was discussed earlier)

Haven't thought of this before actually but you do seem to be right.
Running the following on the Docker host with Tailscale installed seems to fix the issue and allow me to curl within a Docker container
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o tailscale0 -j TCPMSS --set-mss 1240
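
For hosts that manage their firewall with nftables rather than iptables, the equivalent would look roughly like this (table/chain names illustrative; loadable with nft -f):

    table ip mangle {
            chain forward {
                    type filter hook forward priority mangle; policy accept;
                    oifname "tailscale0" tcp flags syn tcp option maxseg size set 1240
            }
    }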

irbekrm (Contributor) commented Apr 3, 2024

I've tried that flag on Cilium before but it didn't seem to change anything.

Thanks for confirming. To be honest, I don't really know what that flag does. Maybe improving path MTU discovery is something to be raised with the Cilium folks.

Haven't thought of this before actually but you do seem to be right.
Running the following on the Docker host with Tailscale installed seems to fix the issue and allow me to curl within a Docker container
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o tailscale0 -j TCPMSS --set-mss 1240

Ok, so then this is an issue that could ultimately be solved by us doing clamping in the core Linux client. I would like to understand the Docker setup a bit better and try to reproduce this as a use case for clamping in core. If you get a chance to add some notes about why you need that client-side setup, that would also help us.

@wokalski do you also have a similar setup client-side? I guess your issue might be different.

clrxbl (Author) commented Apr 3, 2024

I would like to understand the Docker setup a bit better and try to reproduce this as a use case for clamping in core.

My fleet of servers consists mostly of baremetal servers that already run Tailscale on the host itself. While I could probably work around this by running a secondary Tailscale instance within a container with e.g. userspace networking and exposing it to other containers using HTTP_PROXY variables, this is just less ideal than being able to leech off of the host's Tailscale instance.

We have Docker workloads connecting to Tailscale services hosted within Kubernetes + Cilium clusters, exposed using the Tailscale k8s-operator.
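
For reference, the containerized alternative I'm describing would look something like this (a sketch based on the tailscale/tailscale image's documented env vars; untested here, and clearly more moving parts than reusing the host's client):

    docker run -d --name ts-proxy \
      -e TS_AUTHKEY=tskey-... \
      -e TS_USERSPACE=true \
      -e TS_OUTBOUND_HTTP_PROXY_LISTEN=:1055 \
      tailscale/tailscale
    # other containers on the same Docker network would then set
    # HTTP_PROXY/HTTPS_PROXY to http://ts-proxy:1055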

wokalski commented Apr 3, 2024

@irbekrm client side I have two types of clients:

  1. macOS/iOS clients that access services exposed via k8s operator
  2. egress traffic from a k8s operator service to another k8s operator service

irbekrm (Contributor) commented Apr 3, 2024

Thanks @wokalski! Sounds like it might be either #11002 (comment) or another issue altogether. Keen to hear if the latest images with the nftables fix work for you; if not, we'll have to debug this separately, I think.

wokalski commented Apr 5, 2024

I will test the new image this weekend.

wokalski commented Apr 7, 2024

It seems that this issue no longer presents on 1.63.78 in my setup!

irbekrm (Contributor) commented Apr 8, 2024

It seems that this issue no longer presents on 1.63.78 in my setup!

Thank you for confirming this!
If you had a setup where a cluster workload was being proxied to via a Tailscale proxy whose rules were configured using nftables, and it is possible that too-big packets were sent to the cluster workload (plus you had Cilium in kube proxy replacement mode, which seems to not support path MTU discovery), then it is likely that #11002 (comment) would have fixed your issue.
There is now 1.63.104, which has another small fix (#11639) related to the same nftables clamping issue, but we are also planning on releasing 1.64 later this week.
