Routing of Flannel traffic over wireguard #139

jowenn opened this issue Jun 1, 2018 · 21 comments · May be fixed by #178

jowenn commented Jun 1, 2018

It is nice that there is wireguard, but it is not used for inter-pod communication.

Flannel by default uses the device with the default route (when run on the host); I'm not sure how it is handled when run as a pod.

At least I could verify that my communication between two pods on two nodes is not encrypted, because it does not use the wg0 interface.

What I did:
I created a cluster
hetzner-kube cluster create ..... --ha-enabled -w 3

I launched 3 nginx containers (one container scheduled per node)
kubectl run nginx1 --image=nginx --replicas=3 --port=80

On each node (directly on the host) I started
tcpdump -nnvvSSs 1514 -i eth0 | grep GET

I ran a single interactive container with curl:
kubectl run curl1 --image=radial/busyboxplus:curl -i -t --rm
In this container I started
curl http://10.244...../MYTEST1 >/dev/null

For every connection between the busybox and a container on another node I got a nice
GET /MYTEST1 HTTP/1.1 output from tcpdump

Therefore the traffic is not going flannel -> wg0 -> encryption -> eth0 and so on, but is going directly flannel -> eth0, which means the traffic is not encrypted.

I also got some other GET requests, for instance when kubectl connected a terminal to a pod, so it appears that not all control-plane traffic is encrypted either. The exception is etcd, which really seems to use wg0.

@pierreozoux (Contributor)

Yes, noticed that too.

I think what is happening is that wireguard is created, then kubeadm is configured to use the wireguard IPs, and then when flannel starts, it discovers the interface on its own and chooses eth0.

I had to deploy my cluster with kubeadm (an additional worker was failing), and I used this config for flannel on top of wireguard:
https://git.indie.host/indiehost/standard/blob/master/kube-flannel.yml#L74-85

Hope it helps!


jowenn commented Jun 10, 2018

I've tried removing everything flannel-related and used your file with kubectl apply and with kubectl create; in both cases the result is that the pods are in a crash loop. From the logs I can see that the post-startup command errors out with an "Address already in use" message.

@pierreozoux (Contributor)

Yeah, maybe you have to clean some folders too, and interfaces.

I've used these commands when I used kubeadm, to reset a node:

systemctl stop kubelet
systemctl stop docker
rm -rf /var/lib/cni/
rm -rf /var/lib/kubelet/*
rm -rf /etc/cni/
ifconfig cni0 down
ifconfig flannel.1 down
ifconfig docker0 down

Then reboot the node; it might help, but be careful, I'm not sure the node will join the cluster again. Try this at your own risk, on a staging cluster.


xetys commented Jun 17, 2018

Your approach generates new keys every time the network restarts. That would leave the other nodes without the new public key. Isn't there a way to just specify which network interface flannel should use?


simonkern commented Jun 26, 2018

@pierreozoux's solution looks quite similar to flannel's official wireguard extension:
https://github.com/coreos/flannel/blob/master/dist/extension-wireguard

The docs say this about PreStartupCommand, which includes the key-generation command:

Command to run before allocating a network to this host
The stdout of the process is captured and passed to the stdin of the SubnetAdd/Remove commands.

See: https://github.com/coreos/flannel/blob/master/Documentation/extension.md
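For context, a rough sketch of what such an extension backend configuration looks like in flannel's net-conf.json (paraphrased, not copied verbatim from the official file; the interface name, listen port and network below are illustrative):

{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "extension",
    "PreStartupCommand": "wg genkey | tee privatekey | wg pubkey",
    "PostStartupCommand": "ip link del flannel-wg 2>/dev/null; ip link add flannel-wg type wireguard && wg set flannel-wg listen-port 51820 private-key privatekey && ip addr add ${SUBNET%/*}/32 dev flannel-wg && ip link set flannel-wg up && ip route add $NETWORK dev flannel-wg",
    "ShutdownCommand": "ip link del flannel-wg",
    "SubnetAddCommand": "read PUBLICKEY; wg set flannel-wg peer $PUBLICKEY endpoint $PUBLIC_IP:51820 allowed-ips $SUBNET",
    "SubnetRemoveCommand": "read PUBLICKEY; wg set flannel-wg peer $PUBLICKEY remove"
  }
}

This is also where the quoted doc sentence matters: the public key printed by PreStartupCommand is captured and fed to SubnetAddCommand/SubnetRemoveCommand on stdin, which is why those commands start with "read PUBLICKEY".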

xetys added this to To do in 1.0 via automation Jun 29, 2018
@pierreozoux (Contributor)

@simonkern yes sorry, I should have linked to official doc instead of my folder :)


monofone commented Aug 8, 2018

There is an --iface option to flanneld which takes an interface name. By patching this into the flannel manifest and setting it to wg0, this should already work.
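A minimal sketch of what that could look like in the kube-flannel DaemonSet manifest (abridged; the image tag is just an example from that era):

spec:
  template:
    spec:
      containers:
      - name: kube-flannel
        image: quay.io/coreos/flannel:v0.10.0-amd64
        command:
        - /opt/bin/flanneld
        args:
        - --ip-masq
        - --kube-subnet-mgr
        - --iface=wg0

Both --iface=wg0 and --iface wg0 (as two separate args) should be accepted by flanneld.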


voron commented Aug 17, 2018

@monofone

There is an --iface option to flanneld which takes an interface name.

I tried --iface wg0 without success; no internal traffic worked. Unfortunately I didn't dig into it.

@monofone

Yes, that's right, I experienced this behavior too.


voron commented Aug 20, 2018

Looks like I found a work-around:

  • Add --iface wg0 (2 lines) to the flannel command line in the kube-system flannel DaemonSet, then kill all flannel pods to apply:
echo 'spec:
  template:
    spec:
      containers:
      - args:
        - --ip-masq
        - --kube-subnet-mgr
        - --iface
        - wg0
        name: kube-flannel'|kubectl -n kube-system patch ds kube-flannel-ds --patch "$(cat)"
kubectl -n kube-system delete pod -l 'app=flannel'
  • Disable TX checksum offload for the flannel.1 interface on all servers via
ethtool -K flannel.1 tx off

This is just a PoC to test; we'll need to make the checksum offload setting permanent if this solution gets implemented.


xetys commented Aug 23, 2018

I like your approach. Could you explain what's happening here?

About the permanent solution, is systemd a good way to do that? That's what I would do here if this solution is sustainable.


voron commented Aug 23, 2018

I like your approach. Could you explain what's happening here?

As @monofone mentioned, it is enough to pass the --iface wg0 arg to flanneld to get flannel to work via wg0 instead of the default eth0. flanneld reports wg0's IP to k8s, and then all nodes peer using node metadata annotations from k8s. ICMP already works at that point, but TCP is really, really buggy. I didn't investigate UDP though.

I started to research the TCP problem using this article as a base and got tcpdump -v checksum mismatch errors when the service didn't respond, and normal, rare TCP connections without checksum mismatches when the service responded as expected. The same high TcpInCsumErrors values showed up in nstat inside the pod. All this pointed me to a checksum offload problem. I started to disable TCP checksum offload on all adapters, but it looks like flannel.1 alone is enough. Here is a similar issue with Azure. It looks to me like a (kernel/driver) checksum offload bug with flannel vxlan encapsulation (everything inside UDP) inside wireguard encapsulation (again everything inside UDP).
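For anyone who wants to reproduce the diagnosis, roughly the commands involved (a sketch; interface and counter names assume the defaults mentioned above):

# look for checksum mismatches on the vxlan interface
tcpdump -v -i flannel.1 tcp 2>&1 | grep -i 'incorrect'
# TCP checksum error counter (run inside an affected pod)
nstat -az TcpInCsumErrors
# inspect, then disable, TX checksum offload on the host
ethtool -k flannel.1 | grep tx-checksumming
ethtool -K flannel.1 tx off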

I got near 1500 Mbit with iperf3 without wireguard and near 900 Mbit with wireguard in my performance tests on cx11 instances. IMHO the flannel TCP checksum offload doesn't affect speed noticeably, especially compared to wireguard encryption.

About the permanent solution, is systemd a good way to do that?

I was thinking about something like the PostStartupCommand from the flannel extensions, but it looks like flannel doesn't support such an option with the built-in backends, while extension backends are not recommended for production use. flanneld also re-creates its interface on some configuration changes and so on. Thus a udev hook is possible, either with a RUN script or a systemd event, as you asked. Here is the systemd option:

  • Create the file /etc/udev/rules.d/71-flannel.rules with a line like
SUBSYSTEM=="net", ACTION=="add", KERNEL=="flannel.*", TAG+="systemd", ENV{SYSTEMD_WANTS}="flannel-created@%k.service"
  • Create the systemd unit /etc/systemd/system/flannel-created@.service:
[Unit]
Description=Disable TX checksum offload on flannel interface
[Service]
Type=oneshot
ExecStart=/sbin/ethtool -K %I tx off
  • Reload via systemctl daemon-reload and systemctl restart systemd-udevd.service

It did the job for me
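A quick way to check it took effect after a reboot (a sketch, assuming the default interface name):

ethtool -k flannel.1 | grep tx-checksumming
# expected output: tx-checksumming: off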


xetys commented Aug 24, 2018

Does this persist across reboots, given the "oneshot" type?


voron commented Aug 24, 2018

It runs on every flannel.* interface creation. Thus it persists across reboots too, as flannel starts on every k8s node boot and creates its interface.


xetys commented Aug 24, 2018

Then I think that is great material for a PR, WDYT?

voron linked a pull request Aug 25, 2018 that will close this issue

quorak commented Sep 23, 2018

Hey guys,

I just read through the issue here. Do I understand correctly that the traffic between the nodes is not encrypted, as opposed to #128?

Maybe we should state this explicitly in the repo's README for as long as this isn't covered. It might be a deal breaker for some evaluators.

best


segator commented Feb 24, 2019

Did you try hostgw instead of vxlan over wireguard?
The performance should be better, I think.

xetys moved this from To do to In progress in 1.0 Mar 31, 2019

ghost commented May 21, 2020

I have the same question: currently the inter-node traffic is not encrypted? So, for example, if my webapp queries my mysql backend on a second node, the data can be sniffed?


renanqts commented May 15, 2021

hostgw

It didn't work because it creates a route pointing the pod CIDR to the Wireguard IPs as a gateway, but Wireguard needs the CIDR in AllowedIPs.


segator commented May 16, 2021

hostgw

It didn't work because it creates a route pointing the pod CIDR to the Wireguard IPs as a gateway, but Wireguard needs the CIDR in AllowedIPs.

Yes, and you can configure the CIDR in the allowed IPs on each WG node; I've had it working since a year ago :)
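For illustration, roughly what that looks like on one node (the public key, endpoint and subnets below are placeholders, not from a real setup):

# allow the peer's wireguard address plus the pod subnet hosted behind it
wg set wg0 peer <PEER_PUBLIC_KEY> endpoint <PEER_PUBLIC_IP>:51820 allowed-ips 10.0.1.2/32,10.244.1.0/24

# or the equivalent in /etc/wireguard/wg0.conf:
# [Peer]
# PublicKey = <PEER_PUBLIC_KEY>
# Endpoint = <PEER_PUBLIC_IP>:51820
# AllowedIPs = 10.0.1.2/32, 10.244.1.0/24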


renanqts commented May 16, 2021

With hostgw?
I saw this in the logs when I tried:

Replacing existing route to 10.42.0.0/24 via 10.42.0.0 dev index 17 with 10.42.0.0/24 via 10.253.3.1 dev index 4

Forget it, it works even with this log :D
