
Very slow network performance with FastDP+Encryption on linux kernel 4.12 #3075

Closed
nesc58 opened this issue Jul 24, 2017 · 10 comments


nesc58 commented Jul 24, 2017

Hi,
I have a huge issue with the following setup:

  • weave-net 2.0.1
  • kubernetes 1.7
  • docker version 17.05-ce
  • CoreOS with kernel 4.12.2
  • fastdp enabled
  • encryption enabled

What you expected to happen?

I expect network throughput of at least 60% of the raw link bandwidth. Our servers are connected at 1 Gbit/s, so around 600 Mbit/s with encryption would be fine.

I tested the whole setup with an older CoreOS version with kernel 4.11 and the results are great:
Hardware machines without virtualization: ~900 Mbit/s
Virtualized with Xen: ~850 Mbit/s
Virtualized with Xen, AES-NI disabled: ~250 Mbit/s (really good without AES-NI, I think)

What changed compared to the problematic setup? The Linux kernel (4.11 instead of 4.12) and Docker (1.12.6 instead of 17.05-ce).

All of these results are faster than the setup with Linux kernel 4.12 (see the results under "What happened?").

Using kernel 4.12 with WEAVE_NO_FASTDP=true and encryption is okay: the iperf results are also about 800 to 900 Mbit/s, BUT the CPU load of the weave process is about 100 to 200 percent.
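For reference, a minimal sketch of how WEAVE_NO_FASTDP could be set when weave-net is deployed via the Kubernetes manifest; the DaemonSet and container names below assume the stock weave-net manifest and should be verified against your deployment:

# Add the variable to the 'weave' container of the weave-net DaemonSet
kubectl -n kube-system edit daemonset weave-net
#   env:
#   - name: WEAVE_NO_FASTDP
#     value: "true"
# Afterwards delete the weave-net pods so they restart with the new environment.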

What happened?

I get a bandwidth (tested with iperf) of 3 to 55 Mbit/s using CoreOS with Linux kernel 4.12.2.
Results of the different setups:
55 Mbit/s: non-virtualized machines (notebooks)
25 Mbit/s: virtualized machines (VirtualBox) on a notebook
6 Mbit/s: virtualized machines (XenServer 7.2/7.0)

Disabling TSO (ethtool -K ...) increases throughput from 6 Mbit/s to 100 Mbit/s, but the CPU load of the ksoftirqd process increases as well (from 5 to 100 percent).

How to reproduce it?

Use CoreOS with Linux kernel 4.12.2 (currently the beta and alpha releases; the alpha ships Docker 17.05-ce)
Install Kubernetes
Install weave-net with ./kubectl create -f ....
Install kube-dns (the Kubernetes DNS add-on)
Run Ubuntu pods (I used a DaemonSet to run one pod on each machine)
Exec into each pod's container (Ubuntu), then install and run iperf
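A condensed sketch of these steps, assuming a working kubectl context; the weave-net manifest path is elided in the report above, and the DaemonSet file name is a placeholder:

# Install weave-net (manifest path elided in the original report)
kubectl create -f <weave-net-manifest.yml>
# Deploy the Ubuntu DaemonSet (the manifest is shown in a later comment)
kubectl create -f ubuntu-daemonset.yml
# Note the node and pod IP of each ubuntu pod
kubectl get pods -o wide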

Anything else we need to know?

I used the same configuration files for the different setups.
I found a commit in kernel 4.12 that added/changed xfrm hardware offloading: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d77e38e612a017480157fe6d2c1422f42cb5b7e3

It would be great if anybody could reproduce this issue.

Is some information missing? Please let me know. Note that I reinstalled the cluster to test with the older CoreOS version, so the log files have been deleted; that is also why it would help if somebody else could reproduce it.


brb commented Aug 1, 2017

@nesc58 Thanks for the report.

I'm quite surprised to see that you achieved such high performance with WEAVE_NO_FASTDP=true. For comparison, please see the benchmarks we conducted: https://www.weave.works/blog/weave-net-performance-fast-datapath/

How do you ensure that the iperf client and server pods are not scheduled on the same machine?
Also, could you run the same vanilla Weave benchmarks as we did, in both your old and new setups?


nesc58 commented Aug 1, 2017

Hi @brb,
I deployed a DaemonSet, so the Ubuntu pod I used was scheduled on every node.
kubectl get pods -o wide shows the IP addresses of the pods together with the host machines / configured node names.

The DaemonSet includes the following:
One container with the latest Ubuntu image
Configured container ports: the default iperf port (depends on the iperf version; I used iperf instead of iperf3)

That's it.

Here is the .yml file

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: ubuntu
spec:  
  selector:
    matchLabels:
      k8s-app: ubuntu
  template:
    metadata:
      labels:
        k8s-app: ubuntu
    spec:
      containers:
      - image: ubuntu
        command:
        - sleep
        - "36000"
        imagePullPolicy: IfNotPresent
        name: ubuntu
        ports:
        - containerPort: 5001
          protocol: TCP

Exec into each pod, install iperf, run one as a server, and connect the others to it via the pod IPs displayed by the kubectl command.
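For reference, a minimal sketch of that iperf run; the pod names and the server pod IP are placeholders taken from kubectl get pods -o wide, and the two pods must sit on different nodes:

# Server side, inside an ubuntu pod on node A (pod name is a placeholder)
kubectl exec -it ubuntu-node-a -- sh -c 'apt-get update && apt-get install -y iperf && iperf -s'
# Client side, inside an ubuntu pod on node B, pointing at the server pod's IP
kubectl exec -it ubuntu-node-b -- sh -c 'apt-get update && apt-get install -y iperf && iperf -c <server-pod-ip> -t 30'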

I cannot test this during the next few days because I am busy testing the etcd snapshot and restore functionality, so I can only test with the current setup, which is working.

For the time being I don't have access to other infrastructure to test it again.

I hope the error is reproducible.


nesc58 commented Oct 4, 2017

The problem still exists with kernel 4.13.3-coreos-r1.
The problem is that CoreOS updated the stable channel to kernel 4.12, so I am now unable to use Weave Net with encryption + fastdp on CoreOS at all: the stable kernel 4.12 shows the poor performance, and kernel 4.13 does too.


zenvdeluca commented Oct 29, 2017

I can reproduce this, but in a different context -- not using Weave or Docker, but IPsec mGRE tunnels.

With Ubuntu, any kernel beyond 4.12 performs very, very badly. After downgrading to 4.10 the regression is gone and performance is good.

Tested using AWS and GCP instances; didn't test on bare metal (ixgbevf NIC driver).

You may want to add more data to this bug I filed a couple of days ago:
https://bugzilla.kernel.org/show_bug.cgi?id=197513

@zenvdeluca

Hi,

I am glad to report that there is a patch available for kernels >= 4.12 that could fix this problem.
More info: https://bugzilla.kernel.org/show_bug.cgi?id=197513


rade commented Oct 31, 2017

@zenvdeluca Thanks for investigating this. Well done for identifying what looks like the root cause! Let's hope that patch lands soon and gets ported to all the affected kernel versions.

@mikebryant

We've just been hit by this, and have gone through the whole investigation loop. This appears to work around the problem, at the cost of increased CPU usage: ethtool -K weave tso off
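For anyone applying this workaround, a short sketch; the interface name weave is taken from the command above, and it has to be repeated on every node (and after reboots), since ethtool settings are not persistent:

# Check the current TSO setting on the weave bridge
ethtool -k weave | grep tcp-segmentation-offload
# Disable TSO (the workaround; expect higher CPU usage)
ethtool -K weave tso off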


rade commented Dec 13, 2017

@zenvdeluca any news on getting that kernel patch merged?

@stuart-warren

This should be in kernel v4.14-rc8 (Container Linux >= 1590.0.0): torvalds/linux@73b9fc4
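To check whether a given node already carries the fix (version numbers taken from the comment above):

# Kernel should be v4.14-rc8 or later
uname -r
# Container Linux release should be >= 1590.0.0
grep VERSION= /etc/os-release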


rade commented Mar 20, 2018

We've not had further reports of this, so it's a fair guess that the kernel fix has propagated sufficiently far and is doing its job. -> closing.

rade closed this as completed on Mar 20, 2018
brb added this to the n/a milestone on Mar 20, 2018