Error: failed to start containers: kubelet #13314

Closed

sixcorners opened this issue May 6, 2018 · 91 comments

@sixcorners

sixcorners commented May 6, 2018

Rancher versions:
rancher/server or rancher/rancher: rancher/rancher:latest@sha256:38839bb19bdcac084a413a4edce7efb97ab99b6d896bda2f433dfacfd27f8770
rancher/agent or rancher/rancher-agent:
rancher/rancher-agent:v2.0.0
Infrastructure Stack versions:
whatever the defaults are

Docker version: (docker version,docker info preferred)
Client:
Version: 17.03.2-ce
API version: 1.27
Go version: go1.8.3
Git commit: f5ec1e2-snap-345b814
Built: Thu Jun 29 23:40:29 2017
OS/Arch: linux/amd64

Server:
Version: 17.03.2-ce
API version: 1.27 (minimum version 1.12)
Go version: go1.8.3
Git commit: f5ec1e2-snap-345b814
Built: Thu Jun 29 23:40:29 2017
OS/Arch: linux/amd64
Experimental: false
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
4.15.0-20-generic
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
ssdnodes/custom cluster
Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)
single node
Environment Template: (Cattle/Kubernetes/Swarm/Mesos)
can't set up node
Steps to Reproduce:
snap install docker --channel=17.03/stable
mkdir /etc/kubernetes
sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.0.0 --server https://rancher.sixcorners.info --token abc --ca-checksum xyz --worker
docker logs -f share-mnt
Results:
Found state.json: 931882e24ff0ef67b0e8744dbf1f7e04fd68afe714a29a2522293312824f3c51
time="2018-05-06T06:09:15Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/21787/ns/mnt -F -- /var/snap/docker/common/var-lib-docker/aufs/mnt/5d00bd40adec6662aaec8ea2a5f5ce6a332e9dbfad087a008c5c89b7cac4c22f/usr/bin/share-mnt --stage2 /var/lib/kubelet /var/lib/rancher -- norun]"
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet

@jonaskello

I got this when adding a new node. I started the agent container as per the command in the UI and it seems to have started a second agent container named share-mnt. At the time there were only two containers running, the original agent container and this one. After a while I did docker rm -f on the share-mnt container and that seems to have cleared it up.

@superseb
Contributor

superseb commented May 6, 2018

What is the output of docker ps -a when this happens? And docker logs --tail=all kubelet?

@jonaskello

jonaskello commented May 6, 2018

Here are some logs from my terminal:

jonkel@rancher2:~$ docker ps
CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS              PORTS               NAMES
7378e65f3edc        rancher/rancher-agent:v2.0.0   "run.sh -- share-r..."   5 minutes ago       Up 2 minutes                            share-mnt
117a4debc918        rancher/rancher-agent:v2.0.0   "run.sh --server h..."   5 minutes ago       Up 5 minutes                            tender_fermi
jonkel@rancher2:~$ docker logs 73
Found container ID: 1365e16b71514e04bcc4d4553f757f5684b55deeaa38523d71fda23d41ee678e
Checking root: /host/run/runc
Checking file: 117a4debc9180552ac578ab8e7145c4ceb9b384981c70e23ebb03a8251ff2909
Checking file: 1365e16b71514e04bcc4d4553f757f5684b55deeaa38523d71fda23d41ee678e
Found state.json: 1365e16b71514e04bcc4d4553f757f5684b55deeaa38523d71fda23d41ee678e
time="2018-05-06T20:48:25Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/10275/ns/mnt -F -- /var/lib/docker/aufs/mnt/25409162e1b2f1c3c0cbe62719f52bf6710bd4e97fe49ca612db6711d11a4a31/usr/bin/share-mnt --stage2 /var/lib/kubelet /var/lib/rancher -- norun]"
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}

There was no container named kubelet started at this time. Unfortunately I did not do a ps -a so I don't know if it had been started and stopped.

@superseb
Contributor

superseb commented May 6, 2018

Ok and docker logs 117a4debc918 ?

@jonaskello

I'm afraid it is too late to do that since that container is gone. Here are some more logs from what I did next, when I removed the stuck container:

jonkel@rancher2:~$ docker ps
CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS              PORTS               NAMES
7378e65f3edc        rancher/rancher-agent:v2.0.0   "run.sh -- share-r..."   8 minutes ago       Up 5 minutes                            share-mnt
117a4debc918        rancher/rancher-agent:v2.0.0   "run.sh --server h..."   8 minutes ago       Up 8 minutes                            tender_fermi
jonkel@rancher2:~$ docker rm -f 7378e65f3edc
7378e65f3edc
jonkel@rancher2:~$ docker ps
CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS              PORTS               NAMES
50db7bace022        rancher/hyperkube:v1.10.1-rancher2   "/opt/rke/entrypoi..."   10 seconds ago      Up 1 second                             kubelet
c6e4a2980730        rancher/hyperkube:v1.10.1-rancher2   "/opt/rke/entrypoi..."   21 seconds ago      Up 11 seconds                           kube-proxy
117a4debc918        rancher/rancher-agent:v2.0.0         "run.sh --server h..."   11 minutes ago      Up 11 minutes                           tender_fermi
jonkel@rancher2:~$ docker ps
CONTAINER ID        IMAGE                                COMMAND                  CREATED              STATUS              PORTS               NAMES
ebec6195fec5        d1a7302844b3                         "sh -c 'sysctl -w ..."   12 seconds ago       Up 1 second                             k8s_sysctl_nginx-ingress-controller-q5cgn_ingress-nginx_173d5bb8-5170-11e8-836e-000c2923739a_0
5e79d4008621        rancher/pause-amd64:3.1              "/pause"                 34 seconds ago       Up 2 seconds                            k8s_POD_canal-hdllj_kube-system_173d714f-5170-11e8-836e-000c2923739a_0
06d5b729ecf8        rancher/pause-amd64:3.1              "/pause"                 34 seconds ago       Up 4 seconds                            k8s_POD_cattle-node-agent-ttjfl_cattle-system_18853f76-5170-11e8-836e-000c2923739a_0
e78308215400        rancher/pause-amd64:3.1              "/pause"                 34 seconds ago       Up 13 seconds                           k8s_POD_nginx-ingress-controller-q5cgn_ingress-nginx_173d5bb8-5170-11e8-836e-000c2923739a_0
85e376a3857d        rancher/rke-tools:v0.1.4             "nginx-proxy CP_HO..."   58 seconds ago       Up 53 seconds                           nginx-proxy
50db7bace022        rancher/hyperkube:v1.10.1-rancher2   "/opt/rke/entrypoi..."   About a minute ago   Up 59 seconds                           kubelet
c6e4a2980730        rancher/hyperkube:v1.10.1-rancher2   "/opt/rke/entrypoi..."   About a minute ago   Up About a minute                       kube-proxy
117a4debc918        rancher/rancher-agent:v2.0.0         "run.sh --server h..."   12 minutes ago       Up 12 minutes                           tender_fermi
jonkel@rancher2:~$ docker ps
CONTAINER ID        IMAGE                                                                                                      COMMAND                  CREATED             STATUS              PORTS               NAMES
748f86982bd9        rancher/nginx-ingress-controller@sha256:58944b175505087dfa7afc444577d1933fd3bf1f1f668ef57b9eaaa8b36f59ce   "/usr/bin/dumb-ini..."   21 minutes ago      Up 21 minutes                           k8s_nginx-ingress-controller_nginx-ingress-controller-q5cgn_ingress-nginx_173d5bb8-5170-11e8-836e-000c2923739a_0
c63b7163f96b        rancher/coreos-flannel@sha256:93952a105b4576e8f09ab8c4e00483131b862c24180b0b7d342fb360bbe44f3d             "/opt/bin/flanneld..."   22 minutes ago      Up 22 minutes                           k8s_kube-flannel_canal-hdllj_kube-system_173d714f-5170-11e8-836e-000c2923739a_0
1f2189f72b05        rancher/calico-cni@sha256:cafcb06d6bd5ed1651e6cc7fe3f9a1848606be3950d7218cf4c9439634ca5342                 "/install-cni.sh"        24 minutes ago      Up 24 minutes                           k8s_install-cni_canal-hdllj_kube-system_173d714f-5170-11e8-836e-000c2923739a_0
cb8e27ed7d32        rancher/calico-node@sha256:21d581d7356f2dba648f2905502a38fd4ae325fd079d377bcf94028bcfa577a3                "start_runit"            27 minutes ago      Up 27 minutes                           k8s_calico-node_canal-hdllj_kube-system_173d714f-5170-11e8-836e-000c2923739a_0
e460bb3c4835        8cfec7659f1d                                                                                               "run.sh"                 28 minutes ago      Up 28 minutes                           k8s_agent_cattle-node-agent-ttjfl_cattle-system_18853f76-5170-11e8-836e-000c2923739a_0
5e79d4008621        rancher/pause-amd64:3.1                                                                                    "/pause"                 28 minutes ago      Up 28 minutes                           k8s_POD_canal-hdllj_kube-system_173d714f-5170-11e8-836e-000c2923739a_0
06d5b729ecf8        rancher/pause-amd64:3.1                                                                                    "/pause"                 28 minutes ago      Up 28 minutes                           k8s_POD_cattle-node-agent-ttjfl_cattle-system_18853f76-5170-11e8-836e-000c2923739a_0
e78308215400        rancher/pause-amd64:3.1                                                                                    "/pause"                 28 minutes ago      Up 28 minutes                           k8s_POD_nginx-ingress-controller-q5cgn_ingress-nginx_173d5bb8-5170-11e8-836e-000c2923739a_0
85e376a3857d        rancher/rke-tools:v0.1.4                                                                                   "nginx-proxy CP_HO..."   29 minutes ago      Up 29 minutes                           nginx-proxy
50db7bace022        rancher/hyperkube:v1.10.1-rancher2                                                                         "/opt/rke/entrypoi..."   29 minutes ago      Up 29 minutes                           kubelet
c6e4a2980730        rancher/hyperkube:v1.10.1-rancher2                                                                         "/opt/rke/entrypoi..."   29 minutes ago      Up 29 minutes                           kube-proxy

@Juju-62q

Juju-62q commented May 7, 2018

I got the same error when adding a second node to the k8s cluster.

Here are some logs.

xxx@localhost ~ $ docker ps
CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS              PORTS               NAMES
7e3995f9612e        rancher/rancher-agent:v2.0.0   "run.sh -- share-roo…"   7 minutes ago       Up 7 minutes                            share-mnt
3db29cd16328        rancher/rancher-agent:v2.0.0   "run.sh --server htt…"   7 minutes ago       Up 7 minutes                            laughing_goldwasser
xxx@localhost ~ $ docker logs 7e
Found container ID: 86e121fbbbbcff73e36fa17ca51b69a823bc3a0fc6a1312eab0bf7f694eeebe8
Checking root: /host/run/runc
Checking root: /host/var/run/runc
Checking root: /host/run/docker/execdriver/native
Checking root: /host/var/run/docker/execdriver/native
Checking root: /host/run/docker/runtime-runc/moby
Checking file: 3db29cd163283ea4f5d145d71c741e91b039648a45ce2480f95b25338e505009
Checking file: 7e3995f9612e592c21922d91ec72c29ef7a883fcff73ce029cfd3453277b32a2
Checking file: 86e121fbbbbcff73e36fa17ca51b69a823bc3a0fc6a1312eab0bf7f694eeebe8
Found state.json: 86e121fbbbbcff73e36fa17ca51b69a823bc3a0fc6a1312eab0bf7f694eeebe8
time="2018-05-07T05:33:56Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/12825/ns/mnt -F -- /var/lib/docker/overlay2/2193cefa0869f186d06ef729c7765e059d01ec8a32a848f3ab34e15435fe5312/merged/usr/bin/share-mnt --stage2 /var/lib/kubelet /var/lib/rancher -- norun]" 
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
xxx@localhost ~ $ docker logs 3d
-----BEGIN CERTIFICATE-----

-----END CERTIFICATE-----
time="2018-05-07T05:33:55Z" level=info msg="Option customConfig=map[address:<node IP Address> internalAddress:<node IP Address> roles:[worker]]"
time="2018-05-07T05:33:55Z" level=info msg="Option etcd=false"
time="2018-05-07T05:33:55Z" level=info msg="Option controlPlane=false"
time="2018-05-07T05:33:55Z" level=info msg="Option worker=true"
time="2018-05-07T05:33:55Z" level=info msg="Option requestedHostname=localhost"
time="2018-05-07T05:33:55Z" level=info msg="Connecting to wss://<server IP Address>/v3/connect/register with token <access token>"
time="2018-05-07T05:33:55Z" level=info msg="Connecting to proxy" url="wss://<server IP Address>/v3/connect/register"
time="2018-05-07T05:33:56Z" level=info msg="Starting plan monitor"

@sixcorners
Author

I've never seen anything interesting in the initial rancher agent's logs.
time="2018-05-07T05:59:14Z" level=info msg="Option requestedHostname=sn"
time="2018-05-07T05:59:14Z" level=info msg="Option customConfig=map[address:abc internalAddress: roles:[etcd worker controlplane]]"
time="2018-05-07T05:59:14Z" level=info msg="Option etcd=true"
time="2018-05-07T05:59:14Z" level=info msg="Option controlPlane=true"
time="2018-05-07T05:59:14Z" level=info msg="Option worker=true"
time="2018-05-07T05:59:14Z" level=info msg="Connecting to wss://rancher.sixcorners.info/v3/connect/register with token xyz"
time="2018-05-07T05:59:14Z" level=info msg="Connecting to proxy" url="wss://rancher.sixcorners.info/v3/connect/register"
time="2018-05-07T05:59:15Z" level=info msg="Starting plan monitor"

@jonaskello

Interestingly I also got this when adding my second node, but the third was added fine. Basically what I did was follow the quickstart to set everything up on a single VM, then added a second VM as a worker node and got the error. Then I added a third node as a worker and it went fine.

@Juju-62q

Juju-62q commented May 7, 2018

Hmm... I will try setting up the machine again (now using bare metal).
Or should I use a VM to set up the k8s cluster?

@sixcorners
Author

I'm not sure if bare metal vs. VM matters. I think I got similar errors when trying to deploy it on bare metal (my laptop).

@djanjic

djanjic commented May 8, 2018

Hi, I also had this error because my VMs had the same name. After changing the name and cleaning the entire VM (volume prune, system prune, etc.), everything went back to normal.

@sara4dev

sara4dev commented May 9, 2018

I also had the same issue. But it turned out my localhost wasn't set up right in /etc/hosts. Once I fixed that, it came up!

@Juju-62q

In my case, the problem was solved by changing the name. Thanks!

@nicklasfrahm

So it is currently not possible to run rancher/server and rancher/agent on the same host without additional VM overhead? Is there a fix planned, or is this an uncommon / unsupported use case? I also ran into this problem when I wanted to deploy the agent to the same bare-metal host as the server.

@mitchellmaler

I am getting the same error. I just tried to deploy a cluster. All the masters that I started have the 'share-mnt' containers saying:

Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet

All I ran was the docker run script generated by rancher to start the nodes.

@ChromoX

ChromoX commented May 17, 2018

I have seen this bug many times in many different places/circumstances. The only way I've found to fix it is by destroying the cluster and recreating it.

@mitchellmaler

mitchellmaler commented May 18, 2018

I created this cluster in OpenStack right before running the docker command on each host. If the error is actually caused by something else, like the network or the hostname as mentioned above, it would be nice if it said so. Right now the "cannot find kubelet" error isn't very helpful, since the node was able to pull hyperkube and start other containers.

@yotoobo

yotoobo commented May 24, 2018

I got the same error, then I destroyed the cluster and recreated it from the latest version. Then I added 3 nodes, and they are OK.

the system version:

CentOS Linux release 7.4.1708 (Core)

the docker version:

Server Version: 17.03.2-ce
Storage Driver: overlay
Backing Filesystem: xfs
Supports d_type: true

the rancher images:

REPOSITORY TAG IMAGE ID CREATED SIZE
rancher/rancher-agent v2.0.2 df087865517d 15 hours ago 228 MB
rancher/rancher latest 88526c7bea4e 15 hours ago 521 MB
rancher/rke-tools v0.1.8 5df9ccc4e588 7 days ago 136 MB


@ram-devsecops

I'm getting the same error. I tried destroying and recreating the cluster, but no luck. I'm using Rancher version 2.0.2.

Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}

@ram-devsecops

I used a registry which had authentication enabled; after making the rancher repo public, my issue was resolved.

@gazben

gazben commented Jun 8, 2018

Same here on Ubuntu 18.04 with Docker 17.03.2-ce (tried 18.X too, but no luck).
Can I get more logs out of the startup?

@MSandro

MSandro commented Jun 10, 2018

Hi, same problem here: DinD (Alpine Linux), Docker 18.03.1-ce, Rancher 2.0.2.
What can we do?

@ibrokethecloud
Contributor

I have run into the same issue as well.

[root@prv-verint-1.lab.devlabs kubernetes]# docker logs -f 3a43a93af75b
Found container ID: c066d5a54f9a3e2ff78c093be68ff9442cc065656f3f654f1487487f36acd337
Checking root: /host/run/runc
Checking file: 3a43a93af75be42ceb79aa1f0d1e9a6bb19c51983cd72537e6ac154c4074362c
Checking file: c066d5a54f9a3e2ff78c093be68ff9442cc065656f3f654f1487487f36acd337
Found state.json: c066d5a54f9a3e2ff78c093be68ff9442cc065656f3f654f1487487f36acd337
time="2018-06-12T23:18:36Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/10039/ns/mnt -F -- /var/lib/docker/overlay/a57306998356a5882524dddddc3268f986d2e18c394bee6e8db5d4d72d7cbbe3/merged/usr/bin/share-mnt --stage2 /var/lib/kubelet /var/lib/rancher -- norun]"
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"

I don't understand where it is finding the container ID from; I have checked the local mounts on the node.

I was testing some scenarios, so I tried to re-register a node after cleaning the local mounts.

I have also tried:

  • docker system prune -a -f
  • docker rm -f share-mnt

@ibrokethecloud
Contributor

The actual error comes from /usr/bin/share-root.sh:

#!/bin/bash

# Resolve the ID of the container this script is running in from its cgroup path
ID=$(grep :devices: /proc/self/cgroup | head -n1 | awk -F/ '{print $NF}' | sed -e 's/docker-\(.*\)\.scope/\1/')
# Look up the image that container was started from
IMAGE=$(docker inspect -f '{{.Config.Image}}' $ID)

# Run share-mnt against the host to set up the shared mount points
docker run --privileged --net host --pid host -v /:/host --rm --entrypoint /usr/bin/share-mnt $IMAGE "$@" -- norun
# Then retry until a container named "kubelet" exists and can be started
while ! docker start kubelet; do
    sleep 2
done

I am not really sure if this is an issue with the logic, but no container named kubelet was ever created by the agent, yet the agent is stuck trying to start it.
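
A quick way to confirm what that loop is waiting on is to check whether any container named kubelet exists at all, running or exited; a minimal check using standard docker CLI filters:

docker ps -a --filter name=kubelet --format '{{.ID}} {{.Names}} {{.Status}}'
# no output means the agent/RKE never created a kubelet container,
# which is exactly the state the while-loop above keeps retrying against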

@superseb
Contributor

The root cause can be found either in the other rancher/rancher-agent container, where it connects to the rancher/rancher container and registers itself, or in the rancher/rancher container, where it logs the cluster provisioning status for the node (and why it's not spawning the needed containers).
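
As a rough sketch of where to look (the container IDs are placeholders you would take from docker ps on the respective hosts):

# on the node: find the registering agent container and follow its log
docker ps --filter ancestor=rancher/rancher-agent:v2.0.0 --format '{{.ID}} {{.Names}}'
docker logs -f <agent-container-id>

# on the host running rancher/rancher: follow the provisioning messages for the cluster
docker logs -f <rancher-server-container-id> 2>&1 | grep -i provision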

@jeremyweber-np

I am seeing the same issue when attempting to re-register a node. The steps I use are:

  1. Create a custom cluster
  2. Register a node
  3. Delete the node from the rancher console
  4. Attempt to clean up the node with the following: https://pastebin.com/AuQUJiM4
  5. Reregister the node

In this case the rancher server reports:

2018/06/27 14:45:35 [ERROR] ClusterController c-v2998 [cluster-agent-controller] failed with : could not contact server: Get https://x.x.x.x:6443/version: dial tcp 127.0.0.1:6443: getsockopt: connection refused

The node has only 2 containers:

share-mnt: Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
quirky_hopper: ... time="2018-06-27T14:57:46Z" level=info msg="waiting for node to register"
time="2018-06-27T14:57:48Z" level=info msg="Starting plan monitor"

The only way to proceed at this point is to stop all activity on the node, delete the node from the Rancher UI, recreate the cluster, and re-register.

Running server 2.0.4 with only the one node, Docker version 17.03.2-ce on Ubuntu 16.04.

@wrsuarez

+1 for seeing the same thing on a brand new Rancher 2.0 deployment with a new set of CentOS7 nodes

@sannysoft

Same for me.
Ubuntu 16.04, Docker 17.03.2-ce
I have 2 hosts: Rancher, plus I'm creating a cluster with 2 machines, and I get the same error.
The Docker logs show:
dockerd[2710]: time="2018-07-05T13:59:37.185566233Z" level=error msg="Handler for POST /v1.24/containers/kubelet/start returned error: No such container: kubelet"

@CaptEmulation

I got this error on my first HA RKE install when adding my first cluster. Wiping nodes and re-creating the cluster had no effect. Removing the cluster and restarting rancher server nodes fixed it for me

@gknepper

I solved it by removing the hostname from the 127.0.0.1 line in /etc/hosts and adding the FQDN as well. The machine now resolves the hostname and FQDN to eth0 instead of loopback.

In summary:

192.168.1.X hostname FQDN
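
A minimal sketch of the resulting /etc/hosts, following the layout above (address and names are illustrative):

127.0.0.1      localhost
192.168.1.10   myhost myhost.example.com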

@dankeder

I hit this issue as well. In my case it started when the last node of my experimental cluster stopped working after I restarted into a new version of Docker and it couldn't get up and running again. I managed to "fix" it without deleting and creating a new cluster (as I had to several times before when I encountered the same issue).

Here's what I did:

  1. Cleaned the node by running
docker system prune
docker volume prune

Note that this will delete all the Docker volumes, take care if you have
important data in your volumes.

  2. Cleaned Rancher/Kubernetes runtime data on the node.
rm -rf /etc/cni/ /etc/kubernetes/ /opt/cni/ /var/lib/calico/ /var/lib/cni/ /var/lib/rancher/ /var/run/calico/

Note that official docs on node cleanup recommend also removal of /opt/rke and
/var/lib/etcd. I did NOT remove them because they contain cluster etcd
snapshots and data. This is especially important in case there's only one node
in the cluster.

  3. I exec-ed into the rancher container and hacked the cluster status (thx @ibrokethecloud for the hint):
docker exec -it rancher bash

Inside the container:

apt-get update && apt-get -y install vim
kubectl edit cluster c-XXXX  # replace the cluster-id with an actual cluster ID

Now in the editor I found the key apiEndpoint (it should be directly under the status key) and removed it; see the sketch after these steps. Exit the editor and container. Make sure kubectl says that it updated the cluster.

  4. From the Rancher UI I got the command for registering a new node. I set a different name for the node than before by adding --node-name to the docker run command (actually there's an edit box for this under advanced settings). It looked like this:
docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.2.6 \
  --server https://rancher.example.com --token XXXXXXXXXXXXXXX --node-name mynode2 \
  --etcd --controlplane --worker
  5. I ran the above command on the cleaned node, and finally it registered successfully and RKE started up all the kube-* and kubelet containers.
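
For step 3, a minimal sketch of how to locate the field before deleting it (the cluster ID is a placeholder; only the fact that apiEndpoint sits under status is taken from the description above):

# inside the rancher container
kubectl get cluster c-XXXX -o yaml | grep -n 'apiEndpoint'   # confirm the field is under .status
kubectl edit cluster c-XXXX                                   # delete the status.apiEndpoint line, save and quit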

I also tried to just clean up the node and register it again, but it always ended up with the "Error: failed to start containers: kubelet".

Given the above, I think Rancher doesn't handle well the case when all the cluster nodes become unresponsive. In this case nodes can't even be removed from the cluster. When I tried to remove the faulty node it got stuck in the "Removing" state indefinitely, probably because the cluster's etcd couldn't be reached.

@Flos

Flos commented Jul 31, 2019

I just created a new local VM with Ubuntu 18.04 to test if rancher 2.2.6 is 'stable' now. But I can't add the first worker with all 3 roles on my single machine setup because of this issue.

@zhming0

zhming0 commented Aug 2, 2019

"Failed to start kubelet" means something has stopped it from starting up.
In my case, the problem was that my etcd cluster lost quorum; removing the dead node from the etcd state manually using etcdctl fixed this for me, and the kubelet issue is gone.
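
For reference, a minimal sketch of that cleanup, run through the etcd container RKE creates (the member ID is a placeholder taken from the list output; depending on the etcd image you may also need ETCDCTL_API=3, --endpoints and the TLS cert flags):

docker exec etcd etcdctl member list            # note the hex ID of the dead member
docker exec etcd etcdctl member remove <member-id>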

@sixcorners
Author

Man.. there sure are a lot of posts here.
One of the first few responses was asking for more information. I also want more information because it seems like a lot of different kinds of problems can cause this. Could this error message be enhanced?

@sww0825521xy

I also hit this issue on Rancher stable 2.2.8:

+ sleep 2
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet

docker ps -a

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f428817311ac rancher/hyperkube:v1.14.6-rancher1 "/opt/rke-tools/entr…" 2 hours ago Up 2 hours kube-proxy
08f59cee9f37 rancher/hyperkube:v1.14.6-rancher1 "/opt/rke-tools/entr…" 2 hours ago Up 2 hours kubelet
e1ad2c121a66 rancher/hyperkube:v1.14.6-rancher1 "/opt/rke-tools/entr…" 2 hours ago Up 2 hours kube-scheduler
1088e8e2ef66 rancher/hyperkube:v1.14.6-rancher1 "/opt/rke-tools/entr…" 2 hours ago Up 2 hours kube-controller-manager
5c30c26352c1 rancher/hyperkube:v1.14.6-rancher1 "/opt/rke-tools/entr…" 2 hours ago Up 2 hours kube-apiserver
0465bfcce0d2 rancher/rke-tools:v0.1.42 "/bin/bash" 2 hours ago Created service-sidekick
809c08e9f9bf rancher/coreos-etcd:v3.3.10-rancher1 "/usr/local/bin/etcd…" 2 hours ago Up 2 hours etcd
12a2ff2ae478 rancher/rancher-agent:v2.2.8 "run.sh --server htt…" 2 hours ago Up 2 hours flamboyant_engelbart
ab18ff7b118c rancher/rancher-agent:v2.2.8 "run.sh -- share-roo…" 2 hours ago Exited (0) 2 hours ago share-mnt
b998e63a1a7a rancher/rancher-agent:v2.2.8 "run.sh --server htt…" 2 hours ago Up 2 hours affectionate_edison
6a8d364109ab rancher/rancher:stable "entrypoint.sh" 2 hours ago Up 2 hours 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp silly_mclean

@seedsam

seedsam commented Nov 4, 2019

I have run into the same issue as well, but solved it.
Look at the rancher server log, not the agent log.
Agent log:

+ sleep 2
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
+ sleep 2
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
+ sleep 2
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet

rancher server log:
2019/11/04 09:25:07 [INFO] kontainerdriver rancherkubernetesengine stopped
2019/11/04 09:25:07 [INFO] cluster [c-w6rtq] provisioning: [network] Deploying port listener containers
2019/11/04 09:25:07 [INFO] cluster [c-w6rtq] provisioning: [network] Pulling image [http://10.1.69.87/rancher/rke-tools:v0.1.34] on host [10.1.69.82]
2019/11/04 09:25:07 [ERROR] cluster [c-w6rtq] provisioning: [invalid reference format]
2019/11/04 09:25:07 [ERROR] ClusterController c-w6rtq [cluster-provisioner-controller] failed with : [invalid reference format]
2019-11-04 09:28:59.185541 I | mvcc: store.index: compact 5146

It turned out the image address has an extra http:// prefix, hence the "invalid reference format".

@jhughes2112

Spent days tracking this backwards, looking for answers. Anyone still getting this error: note first that the share-mnt container does this in a tight loop while your node is pulling the Kubernetes images and being set up by the remote Rancher server, so this isn't abnormal to see on first boot. In the case of a RAM-only machine, you'll always see this on boot until provisioning reaches the stage where Kubernetes is available. It's not necessarily stuck, it might just be waiting.

I have noticed that if you fail to provision a node, it will always fail well before it pulls the kubernetes container images, due to the cluster recognizing the node but assigning it a different certificate (or something), leaving it hung. I solved it by wiping my Rancher server and starting over. This has absolutely got to be fixed properly, somehow.

HOWEVER...

If you actually do get a container named kubelet created, and the error switches over to this:

Error response from daemon: {"message":"linux mounts: open /proc/self/mountinfo: no such file or directory"}
Error: failed to start containers: kubelet

from what I can tell, it means you're running a system where the OS is on a tmpfs volume (either diskless, or not installed to disk yet). There are other errors similar to this in the Docker and Kubernetes forums. If I get to the bottom of where it happens, I'll open a ticket.

@jhughes2112

Got to the bottom of it. I was running diskless iPXE booted RancherOS. Although rancher-agent started up, by the time it started installing Kubernetes onto the machine, it exhausted something in the tmpfs file system, causing a failure so bad that nothing shows up when you "ls -al /proc", and "df" or "du" would just fail. Three different perfectly good servers with almost exactly the same results. There may be a kernel setting that would expand the OS file system somehow, but by installing to hard drive and booting there instead, all the issues went away.

Still worth noting that if you are watching the logs of share-mnt while spinning up a node, quite a number of errors will pop up while k8s is installing, even on a completely healthy node.

@maiku1008

I bumped into the very same issue today.

I have basically been trying to add a worker node to rancher all morning.
In the end it seemed that using a certain hostname for the node didn't work, but using a different one registered the node straight away.

It seems in my tests I had been impatiently adding/removing nodes from the cluster by just deleting them from the UI. My suspicion is that the etcd cluster got dirty as a result.
A more graceful procedure should be followed, such as:

Cordon the node from Rancher
Drain the node from Rancher
Delete the node from Rancher (wait for the cluster to update)
SSH to the node and stop all containers: docker stop $(docker ps -q)
SSH to the node and remove all containers, including stopped ones: docker rm $(docker ps -aq)
Reboot the node

I did not rebuild the cluster from scratch, as in my case I knew exactly when the tests started. I merely restored an earlier etcd snapshot, tried again and it all worked as expected.

@Just-Insane

Just-Insane commented May 14, 2020

@jhughes2112

Got to the bottom of it. I was running diskless iPXE booted RancherOS. Although rancher-agent started up, by the time it started installing Kubernetes onto the machine, it exhausted something in the tmpfs file system, causing a failure so bad that nothing shows up when you "ls -al /proc", and "df" or "du" would just fail. Three different perfectly good servers with almost exactly the same results. There may be a kernel setting that would expand the OS file system somehow, but by installing to hard drive and booting there instead, all the issues went away.

Were you able to get worker nodes stood up on an iPXE booted RancherOS? I am currently trying to boot 10 nodes this way (without local storage), and it does not seem to be working (seeing the same errors you were).

Seeing the following in docker.log:

time="2020-05-14T22:12:30.797236245Z" level=warning msg="Error while cleaning up container resource mounts." container=776124b2138bdb63ea26b0b82cabbbaf518e1dc760cb255a9baf73494d710861 error="open /proc/self/mountinfo: no such file or directory"
time="2020-05-14T22:12:30.809112374Z" level=error msg="776124b2138bdb63ea26b0b82cabbbaf518e1dc760cb255a9baf73494d710861 cleanup: failed to delete container from containerd: no such container"
time="2020-05-14T22:12:30.809216451Z" level=error msg="Handler for POST /v1.24/containers/776124b2138bdb63ea26b0b82cabbbaf518e1dc760cb255a9baf73494d710861/start returned error: open /proc/self/mountinfo: no such file or directory"
time="2020-05-14T22:12:30.816105471Z" level=warning msg="Failed to parse cgroup information: open /proc/self/mountinfo: no such file or directory"
time="2020-05-14T22:12:30.817304838Z" level=warning msg="Failed to parse cgroup information: open /proc/self/mountinfo: no such file or directory"
time="2020-05-14T22:12:30.887537100Z" level=warning msg="Failed to parse cgroup information: open /proc/self/mountinfo: no such file or directory"
time="2020-05-14T22:12:30.904236485Z" level=warning msg="Falling back to default propagation for bind source in daemon root" container=f6a16097291ddd5d3c8bfcfb3dc4081349fe233204eb3d864a45bf3243b18e42 source=/var/lib
time="2020-05-14T22:12:30.965317948Z" level=warning msg="Error while cleaning up container resource mounts." container=f6a16097291ddd5d3c8bfcfb3dc4081349fe233204eb3d864a45bf3243b18e42 error="open /proc/self/mountinfo: no such file or directory"
time="2020-05-14T22:12:30.967717432Z" level=error msg="f6a16097291ddd5d3c8bfcfb3dc4081349fe233204eb3d864a45bf3243b18e42 cleanup: failed to delete container from containerd: no such container"
time="2020-05-14T22:12:30.967995651Z" level=error msg="Handler for POST /v1.24/containers/f6a16097291ddd5d3c8bfcfb3dc4081349fe233204eb3d864a45bf3243b18e42/start returned error: failed to open stdout fifo: couldn't stat /proc/self/fd/30: stat /proc/self/fd/30: no such file or directory"

@jhughes2112

jhughes2112 commented May 14, 2020

@Just-Insane Never got it working. From what I could tell, the tmpfs is exhausted of inodes or something, to the extent the OS stopped working. Someone with more time on their hands can maybe sort it out for the rest of us, but I just threw a small HDD in each box and installed ROS there instead. Worked fine. Ultimately gave up on iPXE because of it, unfortunately, and ROS as well, because 99% of the value of it, to me at least, was to treat nodes as extremely dumb cattle.

@Just-Insane

Ah, that's unfortunate.

My only thought is maybe switching to K3sOS, but that is rather poorly documented, especially for iPXE booting.

Yea, the only reason I'm using iPXE right now is because of these node's lack of local storage, otherwise, I'd just install an OS and it'd be fine.

Further to your note on dumb cattle, I am not sure rancher would handle non-persistent nodes trying to join with existing node names anyways, so that might not even work.

@jhughes2112

I had set up my DHCP to assign the same IP address to the machine's MAC addresses, then ran an internal web server that trivially returned a hostname according to the requestor's IP address. The nodes would boot, curl the server for its name, then set it right up front in the cloud init. It's an afternoon to set up, but not hard. Overall didn't seem worth pursuing it, given the failure to run entirely in RAM.

I am running a k3os cluster in AWS, and probably would not recommend it for running iPXE, since it really wants an external high availability MySQL setup that it can keep its brains in. RDS is good for that, but running locally.... ehh.

@Just-Insane

Maybe I am misunderstanding, but k3os requires an external HA MySQL setup? I thought it was just a lighter weight version of RancherOS.

Hopefully someone on the OS team is able to take a look at that issue and provide a workaround or resolution.

What did you end up doing OS wise with your hosts you were trying to get working diskless?

@jhughes2112

I'm not sure if k3os and k3s are the same thing (I suspect not, on reflection), but to configure an HA Rancher management cluster, they recommend using k3s rather than ROS, which does need something to store shared data in. The easy way is MySQL, the hard way is etcd. https://rancher.com/docs/rancher/v2.x/en/installation/k8s-install/kubernetes-rke/

I ended up dropping in an older server SAS drive of about 80 GB, which is plenty large for an OS. I put Ubuntu 18.04 on it and did a simple install. Life became much easier after that.

@Just-Insane

Oh, my Rancher master cluster is on separate nodes, and my master nodes for this cluster have local storage, so I am fine on that front.

Ah, I might have to end up adding local storage I guess. Not looking forward to that cost lol.

@dividebysandwich

dividebysandwich commented May 28, 2020

I'm having the same issue: I added a third node to a cluster, worked great.
Added a fourth node, getting this error in share-mnt:

INFO: Arguments: -- share-root.sh docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.3.5 --server https://10.16.176.10:8443 --token REDACTED --ca-checksum e634f7f0d65cb60e815e4987830cdcd09317800827e683ca14239ec61f9eae1d --no-register --only-write-certs --node-name nbd-cluster-node4 /var/lib/kubelet /var/lib/rancher
+ trap 'exit 0' SIGTERM
++ grep :devices: /proc/self/cgroup
++ head -n1
++ awk -F/ '

{print $NF}

'
++ sed (.*)\.scope/\1/'
+ ID=3991efd40484f66f49956e5ed3d0e31049f7a164267a3170aa0440740bdede2f
++ docker inspect -f '.Config.Image' 3991efd40484f66f49956e5ed3d0e31049f7a164267a3170aa0440740bdede2f
+ IMAGE=rancher/rancher-agent:v2.3.5
+ bash -c 'docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.3.5 --server https://10.16.176.10:8443 --token 4c28cmskzpr4lbjzxtvsn67cpkwqnktcbzz6h7h6zldgd6j4h6g8jg --ca-checksum e634f7f0d65cb60e815e4987830cdcd09317800827e683ca14239ec61f9eae1d --no-register --only-write-certs'
92b2310fda72442305ad3d860d0ff83e7e9d4c7e6094181a52f91c1b0deb8f10
+ docker run --privileged --net host --pid host -v /:/host --rm --entrypoint /usr/bin/share-mnt rancher/rancher-agent:v2.3.5 --node-name nbd-cluster-node4 /var/lib/kubelet /var/lib/rancher – norun
Incorrect Usage.

NAME:
/usr/bin/share-mnt - A new cli application

USAGE:
/usr/bin/share-mnt [global options] command [command options] [arguments...]

VERSION:
1d97ce9

COMMANDS:
help, h Shows a list of commands or help for one command

GLOBAL OPTIONS:
--stage2
--help, -h show help
--version, -v print the version

time="2020-05-28T12:42:35Z" level=fatal msg="flag provided but not defined: -node-name"
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
+ sleep 2
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
+ sleep 2
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
+ sleep 2

I've checked the disk space, all OK. The hostname is also correct and set up correctly in the hosts file.

Docker ps output when this happens:

CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS              PORTS               NAMES
9e7d5c8d8b2d        rancher/rancher-agent:v2.3.5   "run.sh --server h..."   4 seconds ago       Up 3 seconds                            clever_davinci
18ec181d0456        rancher/rancher-agent:v2.3.5   "run.sh -- share-r..."   4 seconds ago       Up 3 seconds                            share-mnt
19f4d9d89a00        rancher/rancher-agent:v2.3.5   "run.sh --server h..."   15 seconds ago      Up 14 seconds                           zealous_cray

I also tried changing the hostname but no joy there. Any pointers would be greatly appreciated. And I join the chorus of those who ask for a better error message that lets us actually determine what's wrong.

EDIT: I think I'll create a separate issue because this thread is filled with lots of unrelated issues, and it's not even certain mine is the same as the original issue.

@andrezaycev

I have the same issue after every reboot of a worker node. Every time, what helps is docker prune, removing the folders, and re-adding the rancher agent. Waiting for a solution for how to fix this.

@dividebysandwich

Just FYI I haven't gotten to investigate or pull logs yet, but I will do that when I get some time. I've opened another issue for this which has been closed because I can't touch that system at the moment.

@jhughes2112

A failing kubelet is also an indicator that your node was added to the Rancher cluster before, failed for some other reason, and then you tried to add it again. If you don't use the cleanup script, you probably haven't gotten rid of a few items, which makes Rancher misinterpret the node as re-joining the cluster rather than being a new node. Hunt for them, you'll find them.
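
As a rough sketch of what to hunt for, based on the cleanup steps posted elsewhere in this thread (paths can vary by Rancher/RKE version):

# leftovers that make a node look like it is re-joining rather than new
ls -d /etc/kubernetes /var/lib/rancher /var/lib/kubelet /var/lib/etcd /var/lib/cni /opt/cni 2>/dev/null
docker ps -a --filter name=share-mnt --filter name=kubelet    # old agent/kubelet containers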

@xmh19936688
Contributor

xmh19936688 commented Jul 29, 2020

I got the same error while creating a new cluster and adding a node, and then resolved it!

Here is some info:
CentOS: 7.3.1611
kernel: 3.10.0-862
docker: 18.09.9
rancher: v2.2.2 (set up by rke)
rke: v0.2.4

Check steps:

  • Make sure /etc/hosts on all nodes (the local cluster's nodes and the new node) has a record for each node.
    vi /etc/hosts
  • Make sure each node (the local cluster's nodes and the new node) can ssh to the others without a password.
    ssh-keygen -t rsa && ssh-copy-id root@x.x.x.x
  • Make sure the system-default-registry (under Global -> Settings) does not end with '/'.
  • Make sure the private registry (under advanced options) does not end with '/'.

Maybe you need to clean up the new cluster and node after failed attempts.
Clean steps:

  • Delete the node in the Rancher UI and wait for it to disappear.
  • Delete the cluster in the Rancher UI and wait for it to disappear.
  • Run the scripts shown below and here to clean the node, then reboot.
docker ps -aq | xargs docker stop               # stop every container
docker ps -aq | xargs docker rm -v              # remove containers and their anonymous volumes
docker volume rm $(sudo docker volume ls -q)    # remove remaining named volumes
mount | grep '/var/lib/kubelet'| awk '{print $3}'|xargs umount   # unmount kubelet bind mounts
# delete Rancher/Kubernetes state directories
rm -rf /var/lib/etcd \
    /var/lib/cni \
    /var/run/calico \
    /etc/kubernetes/ssl \
    /etc/kubernetes/.tmp/ \
    /opt/cni \
    /var/lib/kubelet \
    /var/lib/rancher \
    /var/lib/calico

Finally, bring up the new cluster and GLHF! Please let me know if anything can be optimized, thanks!

@stale

stale bot commented Jul 10, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Jul 10, 2021
@stale stale bot closed this as completed Jul 28, 2021