Tailscale and Calico netfilter packet marks conflict with each other #591

Open
danderson opened this issue Jul 23, 2020 · 10 comments

@danderson
Member

Calico is a network add-on for Kubernetes that implements both connectivity within the cluster and network policy enforcement.

Calico claims the upper 16 bits of the netfilter packet mark for itself (mask 0xffff0000). This conflicts with Tailscale's use of 0x40000 and 0x80000, both of which fall inside that range, so running Calico and Tailscale on the same Kubernetes machines will probably break Tailscale, or Calico, or both.
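To make the overlap concrete, here is a minimal sketch that tests the mark values quoted above against Calico's mask with a bitwise AND. The constant and function names are made up for illustration; only the numeric values come from this issue.

```python
# Values quoted in this issue; the names are hypothetical.
CALICO_MARK_MASK = 0xFFFF0000         # upper 16 bits of the 32-bit packet mark
TAILSCALE_MARKS = (0x40000, 0x80000)  # mark bits Tailscale uses


def overlapping_marks(mask, marks):
    """Return the marks that share at least one bit with mask."""
    return [m for m in marks if m & mask]


conflicts = overlapping_marks(CALICO_MARK_MASK, TAILSCALE_MARKS)
for mark in conflicts:
    print(f"{mark:#x} falls inside mask {CALICO_MARK_MASK:#x}")
```

Both Tailscale values come back as conflicts, since 0x40000 and 0x80000 are bits 18 and 19 of the mark.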

Likely this is just a question of documenting that the two cannot run together, unless you run tailscaled in a k8s pod without host network access (which is still useful as a "VPN hosted in k8s" subnet router), or reconfigure Calico to use different mark bits (which Calico supports).

Alternatively, we could try to help ourselves to lower bits that don't conflict with Calico or other known users. There are no currently known uses of the lower 8 bits of the packet mark, although people seem to steer clear of those on the grounds that they belong to the local sysadmin.
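As a rough sketch of how one might look for unclaimed bits, the snippet below ORs together the claims mentioned in this issue and inverts the result. The `known_claims` table is an assumption for illustration and only lists the two users discussed here; any other software on the host would have to be added before trusting the answer.

```python
ALL_BITS = 0xFFFFFFFF  # the netfilter packet mark is a 32-bit value

# Claims mentioned in this issue only (hypothetical table, not exhaustive).
known_claims = {
    "calico": 0xFFFF0000,
    "tailscale": 0x40000 | 0x80000,
}


def unclaimed_bits(claims):
    """Return a mask of the mark bits that no listed user claims."""
    used = 0
    for mask in claims.values():
        used |= mask
    return ALL_BITS & ~used


free = unclaimed_bits(known_claims)
lower_8_free = (free & 0xFF) == 0xFF  # are all of the lowest 8 bits unclaimed?
print(f"unclaimed bits: {free:#010x}, lower 8 bits free: {lower_8_free}")
```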

cc @bradfitz @apenwarr

@apenwarr
Member

apenwarr commented Jul 23, 2020 via email

@danderson
Member Author

I really don't want to put us in the camp of "uninstall all other networking software, or else". At a minimum, that will make us straight up incompatible with all Kubernetes-related software (all of which uses the bitmasking style on ~all packets flowing through the machine), and we'll have to deal with a myriad of support requests because people will keep trying to make it work on k8s no matter what we do. On the other hand, with bitmasking I believe we can make it work, because aside from the blanket marking k8s and friends don't actually care about our packets (but will drop/mishandle them if their mark isn't present), so the behaviors aren't in conflict.
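The difference between the two marking styles can be sketched in a few lines. This is an illustration of the general idea, not of any actual Tailscale or Calico code: `OTHER_BIT` is a hypothetical bit owned by some other software, and the function names are made up.

```python
TAILSCALE_BIT = 0x40000  # one of the Tailscale mark bits from this issue
OTHER_BIT = 0x01000000   # hypothetical bit set earlier by other software


def set_overwrite(mark, bit):
    """Replace the whole mark, discarding whatever was there before."""
    return bit


def set_masked(mark, bit):
    """OR our bit into the mark, leaving everyone else's bits alone."""
    return mark | bit


def has_bit(mark, bit):
    return (mark & bit) == bit


incoming = OTHER_BIT  # packet arrives already marked by the other software
overwritten = set_overwrite(incoming, TAILSCALE_BIT)
masked = set_masked(incoming, TAILSCALE_BIT)

print("overwrite keeps other bit:", has_bit(overwritten, OTHER_BIT))
print("masked keeps other bit:   ", has_bit(masked, OTHER_BIT))
```

With the overwrite style the other software's bit is gone by the time its own rules look for it; with the masked style both bits survive, which is why the two can coexist as long as the bits themselves don't collide.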

You could argue that this is an indictment of how packet processing works on linux, and I'd agree with you. But breaking all k8s-related uses isn't something we should do lightly.

@apenwarr
Member

apenwarr commented Jul 23, 2020 via email

@danderson
Member Author

Yup, sounds good. I'll put it on the agenda for an eng meeting.

@danderson added the labels "L2 Few" (Likelihood), "P5 Halts deployment" (Priority level), "T8 Crash" (Issue type), and "kubernetes" on Aug 14, 2020
@DentonGentry
Contributor

DentonGentry commented Jul 31, 2021

Since this bug was filed I believe we've moved ACL rules to a new table id unlikely to conflict. We use FWMARK for:

  • packets destined to Tailscale's own control server and related infrastructure, in net/netns/netns_linux.go
  • wgengine/router/router_linux.go:addNetfilterBase4()

@danopia

danopia commented Nov 16, 2021

Hi, I just encountered this conflict as a user, when trying to configure a single host as both a Calico Kubernetes node and a Tailscale 1.16.2 exit node. I haven't had issues with other aspects of the nodes, just Tailscale exit routing.

A subset of nft rules from the conflicted node, showing 0x40000 reuse:
# nft list ruleset | grep 0x40000
                iifname "tailscale0" counter packets 893 bytes 53580 meta mark set 0x40000 
                mark 0x40000 counter packets 893 bytes 53580 accept
                iifname "tailscale0" counter packets 39 bytes 3120 meta mark set 0x40000 
                mark 0x40000 counter packets 39 bytes 3120 accept
                mark 0x40000 counter packets 0 bytes 0 masquerade 
                mark 0x40000 counter packets 0 bytes 0 masquerade  
                iifname "cali*"  counter packets 0 bytes 0 meta mark set mark or 0x40000 
                 mark and 0x40000 == 0x40000 fib saddr . mark . iif oif 0 counter packets 0 bytes 0 drop
                 mark and 0x40000 == 0x0 counter packets 2586 bytes 536822 jump cali-from-host-endpoint
                iifname "cali*"  counter packets 0 bytes 0 meta mark set mark or 0x40000 
                 mark and 0x40000 == 0x40000 fib saddr . mark . iif oif 0 counter packets 0 bytes 0 drop
                 mark and 0x40000 == 0x0 counter packets 2094 bytes 309770 jump cali-from-host-endpoint

As already mentioned in this thread, Calico supports remapping its marks. I added this environment variable to my Calico daemonset, which restored proper exit-routing behavior.

            - name: FELIX_IPTABLESMARKMASK
              value: "0xff00ff00"
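For anyone wanting to sanity-check a mask before rolling it out, here is a small sketch that decodes a mask into bit positions and compares them with the Tailscale bits quoted earlier in this issue. The helper name is made up, and this only checks against the two Tailscale values mentioned here, not against anything else running on the host.

```python
def bit_positions(mask):
    """List the positions (0 = least significant) of the set bits in mask."""
    return [i for i in range(32) if (mask >> i) & 1]


REMAPPED_CALICO_MASK = 0xFF00FF00    # value set via FELIX_IPTABLESMARKMASK above
TAILSCALE_MARKS = 0x40000 | 0x80000  # Tailscale bits quoted in this issue

calico_bits = bit_positions(REMAPPED_CALICO_MASK)
tailscale_bits = bit_positions(TAILSCALE_MARKS)
shared = sorted(set(calico_bits) & set(tailscale_bits))

print("calico bits:   ", calico_bits)
print("tailscale bits:", tailscale_bits)
print("shared bits:   ", shared)
```

The remapped mask covers bits 8-15 and 24-31, while the Tailscale marks sit on bits 18 and 19, so the two no longer share any bit.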

@mayakacz
Contributor

Is this still an issue?
It sounds like this shouldn't occur if Tailscale is running in a container in Kubernetes, but may occur if Tailscale is running on a node(?) in a Kubernetes cluster.
There were also changes to exit nodes in 1.20.

@DentonGentry
Contributor

I suspect it will still be an issue: the two uses of fwmark can stomp on each other. #3310 (comment) proposes a general fix that might solve this issue as well.

@ncfavier

This seems orthogonal to #3310 (comment). To avoid the conflict, Tailscale would have to use a different bit range.

@SnoFox

SnoFox commented Feb 5, 2024

I have spent days debugging this and trying to work around it by changing Tailscale's behavior, and only found a proper workaround by way of a complete outage. Since Tailscale offers --netfilter-mode=off, I had changed the fwmarks in my manually created iptables rules.

FR: Docs change to list known incompatibilities, such as Calico, and potential workarounds, like #591 (comment) (which was a simple change in K8s that cleanly rolled out and fixed everything).


8 participants