Using RKE to deploy k8s error on RancherOS #2458

Closed
rootwuj opened this Issue Aug 27, 2018 · 5 comments

@rootwuj

rootwuj commented Aug 27, 2018

RancherOS Version: (ros os version)
v1.4.1-rc2
Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.)
AWS
Docker Version:
17.03.2-ce

Steps to Reproduce:
Built a k8s cluster with RKE and got the following error from ./rke up:

rancher@ip-172-31-45-0:~$ ./rke_linux-amd64 up
INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [172.31.45.0]
INFO[0000] [dialer] Setup tunnel for host [172.31.43.225]
INFO[0000] [dialer] Setup tunnel for host [172.31.34.146]
INFO[0000] [network] Deploying port listener containers
INFO[0000] [network] Pulling image [rancher/rke-tools:v0.1.13] on host [172.31.34.146]
INFO[0000] [network] Pulling image [rancher/rke-tools:v0.1.13] on host [172.31.45.0]
INFO[0000] [network] Pulling image [rancher/rke-tools:v0.1.13] on host [172.31.43.225]
INFO[0005] [network] Successfully pulled image [rancher/rke-tools:v0.1.13] on host [172.31.45.0]
INFO[0005] [network] Successfully pulled image [rancher/rke-tools:v0.1.13] on host [172.31.34.146]
INFO[0005] [network] Successfully pulled image [rancher/rke-tools:v0.1.13] on host [172.31.43.225]
INFO[0007] [network] Successfully started [rke-etcd-port-listener] container on host [172.31.34.146]
INFO[0007] [network] Successfully started [rke-etcd-port-listener] container on host [172.31.43.225]
INFO[0007] [network] Successfully started [rke-etcd-port-listener] container on host [172.31.45.0]
INFO[0007] [network] Successfully started [rke-cp-port-listener] container on host [172.31.45.0]
INFO[0008] [network] Successfully started [rke-worker-port-listener] container on host [172.31.45.0]
FATA[0057] Error checking if image [rancher/rke-tools:v0.1.13] exists on host [172.31.34.146]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
rancher@ip-172-31-45-0:~$

With RancherOS v1.4.0, the same k8s cluster deploys successfully.
Also, when the cluster has only one node, it deploys successfully.

@rootwuj rootwuj added this to the v1.5.0 milestone Aug 27, 2018

@Jason-ZW

Member

Jason-ZW commented Aug 28, 2018

Using docker-machine with 3 ros-1.4.1-rc2 nodes, rke works.

@niusmallnan

Member

niusmallnan commented Sep 9, 2018

It hangs at futex, FUTEX_WAIT:

$ strace rke up
read(3, 0xc4200b4000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
futex(0x21652d8, FUTEX_WAIT, 0, NULLclock_gettime(CLOCK_MONOTONIC, {96, 518712676}) = 0
futex(0x21648d8, FUTEX_WAKE, 1)         = 1
futex(0x2164810, FUTEX_WAKE, 1)         = 1
read(3, "\0\0\16p\205`\347\334\253\247n\367\344\6\340t\3225\330CDqc\376b\7\22\312\315\r\36\36"..., 4096) = 3732
futex(0xc42002a938, FUTEX_WAKE, 1)      = 1
read(3, 0xc4200b4000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
futex(0x21652d8, FUTEX_WAIT, 0, NULL

dmesg output:

[ 1347.458883] docker0: port 3(veth531863e) entered blocking state
[ 1347.458884] docker0: port 3(veth531863e) entered forwarding state
[ 1347.676114] eth0: renamed from veth843926c
[ 1347.691905] IPv6: ADDRCONF(NETDEV_CHANGE): veth531863e: link becomes ready
[ 1376.973428] NOHZ: local_softirq_pending 44
[ 1377.239509] NOHZ: local_softirq_pending 44
[ 1377.239533] NOHZ: local_softirq_pending 44

Testing shows that this happens on kernels >= 4.14.37.

I am considering rolling back the kernel, but that would leave some security issues, such as L1TF, unfixed.
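
To see whether a host falls in the kernel range mentioned above, a rough check along these lines may help. It is only a sketch: it assumes `sort` supports `-V` (GNU coreutils or a recent BusyBox), `version_ge` is a hypothetical helper name, and a kernel at or above 4.14.37 is not necessarily affected if the later xen-netfront fixes have been backported to it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if version $1 is >= version $2.
# Relies on version-aware sorting (`sort -V`), which is an assumption about the host.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip any local suffix from `uname -r`, e.g. "4.14.37-rancher" becomes "4.14.37".
running="$(uname -r | sed 's/[^0-9.].*$//')"

if version_ge "$running" "4.14.37"; then
    echo "kernel $running is >= 4.14.37: possibly affected unless the fixes are backported"
else
    echo "kernel $running is older than 4.14.37"
fi
```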

@niusmallnan niusmallnan removed this from the v1.5.0 milestone Sep 9, 2018

@niusmallnan

Member

niusmallnan commented Sep 13, 2018

Something has broken MTU functionality in Xen: specifically, setting MTUs larger than 1500 fails. This prevents jumbo frames and other features that require MTUs larger than 1500 bytes from being used.

This can be worked around by manually enabling scatter/gather with ethtool before raising the MTU:

$ sudo ethtool -K eth0 sg on
$ sudo ip link set dev eth0 mtu 9001
$ rke up
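
As a sketch of how the workaround could be applied only where it is needed, the helpers below read the `scatter-gather:` line from `ethtool -k` and turn the feature on when it is reported as off. The function names `sg_is_off` and `enable_sg_if_off` are hypothetical, and `eth0` as the default interface is an assumption carried over from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: reads `ethtool -k <iface>` output on stdin and
# succeeds if the top-level scatter-gather feature is reported as off.
sg_is_off() {
    grep -Eq '^scatter-gather:[[:space:]]*off'
}

# Hypothetical helper: enable scatter/gather on the interface only if it is off.
# Defaults to eth0, which is an assumption about the host's interface name.
enable_sg_if_off() {
    iface="${1:-eth0}"
    if ethtool -k "$iface" | sg_is_off; then
        echo "scatter-gather is off on $iface; enabling it"
        sudo ethtool -K "$iface" sg on
    fi
}
```

Calling `enable_sg_if_off eth0` before `ip link set dev eth0 mtu 9001` would mirror the manual steps; nothing runs until the function is invoked.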

The issue is caused by the following commit to the xen-netfront driver:
"xen-netfront: Fix race between device setup and open"
commit f599c64fdf7d9c108e8717fb04bc41c680120da4
(introduced in kernel 4.14.37)

Reverting the above commit confirmed that the problem goes away.

The following commits fix this issue in the mainline kernel:

"xen-netfront: Fix mismatched rtnl_unlock"
commit cb257783c2927b73614b20f915a91ff78aa6f3e8
"xen-netfront: Update features after registering netdev"
commit 45c8184c1bed1ca8a7f02918552063a00b909bf5

@niusmallnan niusmallnan added this to the v1.5.0 milestone Sep 13, 2018

@niusmallnan niusmallnan self-assigned this Sep 16, 2018

@rootwuj

rootwuj commented Sep 18, 2018

RancherOS Version 1.4.1-rc3 9/17
Verified fixed

@rootwuj rootwuj closed this Sep 18, 2018

@niusmallnan niusmallnan added BACKPORTED and removed BACKPORT labels Sep 19, 2018

@niusmallnan

Member

niusmallnan commented Sep 19, 2018

Backported to v1.4.1.
