
Very slow network performance with FastDP+Encryption on linux kernel 4.12 #3075

Closed
nesc58 opened this issue Jul 24, 2017 · 10 comments


nesc58 commented Jul 24, 2017

Hi,
I have a huge issue with the following setup:

  • weave-net 2.0.1
  • kubernetes 1.7
  • docker version 17.05-ce
  • CoreOS with kernel 4.12.2
  • fastdp enabled
  • encryption enabled

What you expected to happen?

I expect network throughput of at least 60% of the raw link bandwidth. Our servers are connected at 1 Gbit/s, so around 600 Mbit/s with encryption would be fine.

I tested the whole setup with an older CoreOS version with kernel 4.11 and the results are great:
Hardware machines without virtualization: ~900 Mbit/s
Virtualized with Xen: ~850 Mbit/s
Virtualized with Xen, AES-NI disabled: ~250 Mbit/s (really good without AES-NI, I think)

What changed compared to the problematic setup? The Linux kernel (4.11 instead of 4.12) and Docker (1.12.6 instead of 17.05-ce).

All of these results are faster than the setup with Linux kernel 4.12 (see the results under "What happened?").

Using kernel 4.12 with WEAVE_NO_FASTDP=true and encryption is okay: the iperf results are also about 800 to 900 Mbit/s, BUT the CPU load of the weave process is about 100 to 200 percent.
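For reference, a minimal sketch of how WEAVE_NO_FASTDP could be set when weave-net is deployed via the Kubernetes manifest; the DaemonSet and container names below assume the stock weave-net manifest and should be verified against your deployment:

# Add the variable to the 'weave' container of the weave-net DaemonSet
kubectl -n kube-system edit daemonset weave-net
#   env:
#   - name: WEAVE_NO_FASTDP
#     value: "true"
# Afterwards delete the weave-net pods so they restart with the new environment.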

What happened?

I get a bandwidth (tested with iperf) of 3 to 55 Mbit/s using CoreOS with Linux kernel 4.12.2.
Results of the different setups:
55 Mbit/s: non-virtualized machines (notebooks)
25 Mbit/s: virtualized machines (VirtualBox) on a notebook
6 Mbit/s: virtualized machines (XenServer 7.2/7.0)

Disabling TSO (ethtool -K ...) increases throughput from 6 Mbit/s to 100 Mbit/s, but the CPU load of the ksoftirqd process increases as well (from 5 to 100 percent).

How to reproduce it?

Use CoreOS with Linux kernel 4.12.2 (currently the beta and alpha releases; the alpha ships Docker 17.05-ce)
Install Kubernetes
Install weave-net with ./kubectl create -f ....
Install kube-dns (the Kubernetes DNS add-on)
Run Ubuntu pods (I used a DaemonSet to run one pod on each machine)
Exec into each pod's container (Ubuntu), then install and run iperf
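A condensed sketch of these steps, assuming a working kubectl context; the weave-net manifest path is elided in the report above, and the DaemonSet file name is a placeholder:

# Install weave-net (manifest path elided in the original report)
kubectl create -f <weave-net-manifest.yml>
# Deploy the Ubuntu DaemonSet (the manifest is shown in a later comment)
kubectl create -f ubuntu-daemonset.yml
# Note the node and pod IP of each ubuntu pod
kubectl get pods -o wide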

Anything else we need to know?

I used the same configuration files for the different setups.
I found a commit in kernel 4.12 that added/changed xfrm hardware offloading: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d77e38e612a017480157fe6d2c1422f42cb5b7e3

It would be great if anybody could reproduce this issue.

Is some information missing? Please let me know. Note that I reinstalled the cluster to test with the older CoreOS version, so the log files have been deleted; that is also why it would help if somebody else could reproduce it.


brb commented Aug 1, 2017

@nesc58 Thanks for the report.

I'm quite surprised to see that you achieved such high performance with WEAVE_NO_FASTDP=true. For comparison, please see the benchmarks we conducted: https://www.weave.works/blog/weave-net-performance-fast-datapath/

How do you ensure that the iperf client and server pods are not scheduled on the same machine?
Also, could you run the same vanilla Weave benchmarks as we did, in both your old and new setups?


nesc58 commented Aug 1, 2017

Hi @brb,
I deployed a DaemonSet, so the Ubuntu pod I used was scheduled on every node.
kubectl get pods -o wide shows the IP addresses of the pods together with the host machines / configured node names.

The DaemonSet includes the following:
One container with the latest Ubuntu image
Configured container ports: the default iperf port (depends on the iperf version; I used iperf instead of iperf3)

That's it.

Here is the .yml file

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: ubuntu
spec:  
  selector:
    matchLabels:
      k8s-app: ubuntu
  template:
    metadata:
      labels:
        k8s-app: ubuntu
    spec:
      containers:
      - image: ubuntu
        command:
        - sleep
        - "36000"
        imagePullPolicy: IfNotPresent
        name: ubuntu
        ports:
        - containerPort: 5001
          protocol: TCP

Exec into each pod, install iperf, run one as a server, and connect the others to it via the pod IPs displayed by the kubectl command.
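For reference, a minimal sketch of that iperf run; the pod names and the server pod IP are placeholders taken from kubectl get pods -o wide, and the two pods must sit on different nodes:

# Server side, inside an ubuntu pod on node A (pod name is a placeholder)
kubectl exec -it ubuntu-node-a -- sh -c 'apt-get update && apt-get install -y iperf && iperf -s'
# Client side, inside an ubuntu pod on node B, pointing at the server pod's IP
kubectl exec -it ubuntu-node-b -- sh -c 'apt-get update && apt-get install -y iperf && iperf -c <server-pod-ip> -t 30'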

I cannot test this during the next few days because I am busy testing the etcd snapshot and restore functionality, so I can only test with the current setup, which is working.

For the time being I don't have access to other infrastructure to test it again.

I hope the error is reproducible.


nesc58 commented Oct 4, 2017

The problem still exists with kernel 4.13.3-coreos-r1.
The problem is that CoreOS updated the stable channel to kernel 4.12, so I am now unable to use Weave Net with encryption + fastdp on CoreOS at all: the stable kernel 4.12 shows the poor performance, and kernel 4.13 does too.


zenvdeluca commented Oct 29, 2017

I can reproduce this, but in a different context -- not using Weave or Docker, but IPsec mGRE tunnels.

With Ubuntu, any kernel beyond 4.12 performs very, very badly. After downgrading to 4.10 the regression is gone and performance is good.

Tested using AWS and GCP instances; didn't test on bare metal (ixgbevf NIC driver).

You may want to add more data to this bug I filed a couple of days ago:
https://bugzilla.kernel.org/show_bug.cgi?id=197513

@zenvdeluca

Hi,

I am glad to report that there is a patch available for kernels >= 4.12 that could fix this problem.
More info: https://bugzilla.kernel.org/show_bug.cgi?id=197513


rade commented Oct 31, 2017

@zenvdeluca Thanks for investigating this. Well done for identifying what looks like the root cause! Let's hope that patch lands soon and gets ported to all the affected kernel versions.

@mikebryant

We've just been hit by this, and have gone through the whole investigation loop. This appears to work around the problem, at the cost of increased CPU usage: ethtool -K weave tso off
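For anyone applying this workaround, a short sketch; the interface name weave is taken from the command above, and it has to be repeated on every node (and after reboots), since ethtool settings are not persistent:

# Check the current TSO setting on the weave bridge
ethtool -k weave | grep tcp-segmentation-offload
# Disable TSO (the workaround; expect higher CPU usage)
ethtool -K weave tso off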


rade commented Dec 13, 2017

@zenvdeluca any news on getting that kernel patch merged?

@stuart-warren

This should be in kernel v4.14-rc8 (Container Linux >= 1590.0.0): torvalds/linux@73b9fc4
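To check whether a given node already carries the fix (version numbers taken from the comment above):

# Kernel should be v4.14-rc8 or later
uname -r
# Container Linux release should be >= 1590.0.0
grep VERSION= /etc/os-release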


rade commented Mar 20, 2018

We've not had further reports of this, so it's a fair guess that the kernel fix has propagated sufficiently far and is doing its job. -> closing.

rade closed this as completed on Mar 20, 2018
brb added this to the n/a milestone on Mar 20, 2018