Error: failed to start containers: kubelet #13314

Closed

sixcorners opened this issue May 6, 2018 · 91 comments

@sixcorners

sixcorners commented May 6, 2018

Rancher versions:
rancher/server or rancher/rancher: rancher/rancher:latest@sha256:38839bb19bdcac084a413a4edce7efb97ab99b6d896bda2f433dfacfd27f8770
rancher/agent or rancher/rancher-agent:
rancher/rancher-agent:v2.0.0
Infrastructure Stack versions:
whatever the defaults are

Docker version: (docker version,docker info preferred)
Client:
Version: 17.03.2-ce
API version: 1.27
Go version: go1.8.3
Git commit: f5ec1e2-snap-345b814
Built: Thu Jun 29 23:40:29 2017
OS/Arch: linux/amd64

Server:
Version: 17.03.2-ce
API version: 1.27 (minimum version 1.12)
Go version: go1.8.3
Git commit: f5ec1e2-snap-345b814
Built: Thu Jun 29 23:40:29 2017
OS/Arch: linux/amd64
Experimental: false
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
4.15.0-20-generic
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
ssdnodes/custom cluster
Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)
single node
Environment Template: (Cattle/Kubernetes/Swarm/Mesos)
can't set up node
Steps to Reproduce:
snap install docker --channel=17.03/stable
mkdir /etc/kubernetes
sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.0.0 --server https://rancher.sixcorners.info --token abc --ca-checksum xyz --worker
docker logs -f share-mnt
Results:
Found state.json: 931882e24ff0ef67b0e8744dbf1f7e04fd68afe714a29a2522293312824f3c51
time="2018-05-06T06:09:15Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/21787/ns/mnt -F -- /var/snap/docker/common/var-lib-docker/aufs/mnt/5d00bd40adec6662aaec8ea2a5f5ce6a332e9dbfad087a008c5c89b7cac4c22f/usr/bin/share-mnt --stage2 /var/lib/kubelet /var/lib/rancher -- norun]"
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet

@jonaskello

I got this when adding a new node. I started the agent container as per the command in the UI and it seems to have started a second agent container named share-mnt. At the time there were only two containers running, the original agent container and this one. After a while I did docker rm -f on the share-mnt container and that seems to have cleared it up.

@superseb
Contributor

superseb commented May 6, 2018

What is the output of docker ps -a when this happens? And docker logs --tail=all kubelet?

@jonaskello

jonaskello commented May 6, 2018

Here are some logs from my terminal:

jonkel@rancher2:~$ docker ps
CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS              PORTS               NAMES
7378e65f3edc        rancher/rancher-agent:v2.0.0   "run.sh -- share-r..."   5 minutes ago       Up 2 minutes                            share-mnt
117a4debc918        rancher/rancher-agent:v2.0.0   "run.sh --server h..."   5 minutes ago       Up 5 minutes                            tender_fermi
jonkel@rancher2:~$ docker logs 73
Found container ID: 1365e16b71514e04bcc4d4553f757f5684b55deeaa38523d71fda23d41ee678e
Checking root: /host/run/runc
Checking file: 117a4debc9180552ac578ab8e7145c4ceb9b384981c70e23ebb03a8251ff2909
Checking file: 1365e16b71514e04bcc4d4553f757f5684b55deeaa38523d71fda23d41ee678e
Found state.json: 1365e16b71514e04bcc4d4553f757f5684b55deeaa38523d71fda23d41ee678e
time="2018-05-06T20:48:25Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/10275/ns/mnt -F -- /var/lib/docker/aufs/mnt/25409162e1b2f1c3c0cbe62719f52bf6710bd4e97fe49ca612db6711d11a4a31/usr/bin/share-mnt --stage2 /var/lib/kubelet /var/lib/rancher -- norun]"
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}

There was no container named kubelet started at this time. Unfortunately I did not do a ps -a so I don't know if it had been started and stopped.

@superseb
Contributor

superseb commented May 6, 2018

Ok and docker logs 117a4debc918 ?

@jonaskello

I'm afraid it is too late to do that since that container is gone. Here are some more logs from what I did next, when I removed the stuck container:

jonkel@rancher2:~$ docker ps
CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS              PORTS               NAMES
7378e65f3edc        rancher/rancher-agent:v2.0.0   "run.sh -- share-r..."   8 minutes ago       Up 5 minutes                            share-mnt
117a4debc918        rancher/rancher-agent:v2.0.0   "run.sh --server h..."   8 minutes ago       Up 8 minutes                            tender_fermi
jonkel@rancher2:~$ docker rm -f 7378e65f3edc
7378e65f3edc
jonkel@rancher2:~$ docker ps
CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS              PORTS               NAMES
50db7bace022        rancher/hyperkube:v1.10.1-rancher2   "/opt/rke/entrypoi..."   10 seconds ago      Up 1 second                             kubelet
c6e4a2980730        rancher/hyperkube:v1.10.1-rancher2   "/opt/rke/entrypoi..."   21 seconds ago      Up 11 seconds                           kube-proxy
117a4debc918        rancher/rancher-agent:v2.0.0         "run.sh --server h..."   11 minutes ago      Up 11 minutes                           tender_fermi
jonkel@rancher2:~$ docker ps
CONTAINER ID        IMAGE                                COMMAND                  CREATED              STATUS              PORTS               NAMES
ebec6195fec5        d1a7302844b3                         "sh -c 'sysctl -w ..."   12 seconds ago       Up 1 second                             k8s_sysctl_nginx-ingress-controller-q5cgn_ingress-nginx_173d5bb8-5170-11e8-836e-000c2923739a_0
5e79d4008621        rancher/pause-amd64:3.1              "/pause"                 34 seconds ago       Up 2 seconds                            k8s_POD_canal-hdllj_kube-system_173d714f-5170-11e8-836e-000c2923739a_0
06d5b729ecf8        rancher/pause-amd64:3.1              "/pause"                 34 seconds ago       Up 4 seconds                            k8s_POD_cattle-node-agent-ttjfl_cattle-system_18853f76-5170-11e8-836e-000c2923739a_0
e78308215400        rancher/pause-amd64:3.1              "/pause"                 34 seconds ago       Up 13 seconds                           k8s_POD_nginx-ingress-controller-q5cgn_ingress-nginx_173d5bb8-5170-11e8-836e-000c2923739a_0
85e376a3857d        rancher/rke-tools:v0.1.4             "nginx-proxy CP_HO..."   58 seconds ago       Up 53 seconds                           nginx-proxy
50db7bace022        rancher/hyperkube:v1.10.1-rancher2   "/opt/rke/entrypoi..."   About a minute ago   Up 59 seconds                           kubelet
c6e4a2980730        rancher/hyperkube:v1.10.1-rancher2   "/opt/rke/entrypoi..."   About a minute ago   Up About a minute                       kube-proxy
117a4debc918        rancher/rancher-agent:v2.0.0         "run.sh --server h..."   12 minutes ago       Up 12 minutes                           tender_fermi
jonkel@rancher2:~$ docker ps
CONTAINER ID        IMAGE                                                                                                      COMMAND                  CREATED             STATUS              PORTS               NAMES
748f86982bd9        rancher/nginx-ingress-controller@sha256:58944b175505087dfa7afc444577d1933fd3bf1f1f668ef57b9eaaa8b36f59ce   "/usr/bin/dumb-ini..."   21 minutes ago      Up 21 minutes                           k8s_nginx-ingress-controller_nginx-ingress-controller-q5cgn_ingress-nginx_173d5bb8-5170-11e8-836e-000c2923739a_0
c63b7163f96b        rancher/coreos-flannel@sha256:93952a105b4576e8f09ab8c4e00483131b862c24180b0b7d342fb360bbe44f3d             "/opt/bin/flanneld..."   22 minutes ago      Up 22 minutes                           k8s_kube-flannel_canal-hdllj_kube-system_173d714f-5170-11e8-836e-000c2923739a_0
1f2189f72b05        rancher/calico-cni@sha256:cafcb06d6bd5ed1651e6cc7fe3f9a1848606be3950d7218cf4c9439634ca5342                 "/install-cni.sh"        24 minutes ago      Up 24 minutes                           k8s_install-cni_canal-hdllj_kube-system_173d714f-5170-11e8-836e-000c2923739a_0
cb8e27ed7d32        rancher/calico-node@sha256:21d581d7356f2dba648f2905502a38fd4ae325fd079d377bcf94028bcfa577a3                "start_runit"            27 minutes ago      Up 27 minutes                           k8s_calico-node_canal-hdllj_kube-system_173d714f-5170-11e8-836e-000c2923739a_0
e460bb3c4835        8cfec7659f1d                                                                                               "run.sh"                 28 minutes ago      Up 28 minutes                           k8s_agent_cattle-node-agent-ttjfl_cattle-system_18853f76-5170-11e8-836e-000c2923739a_0
5e79d4008621        rancher/pause-amd64:3.1                                                                                    "/pause"                 28 minutes ago      Up 28 minutes                           k8s_POD_canal-hdllj_kube-system_173d714f-5170-11e8-836e-000c2923739a_0
06d5b729ecf8        rancher/pause-amd64:3.1                                                                                    "/pause"                 28 minutes ago      Up 28 minutes                           k8s_POD_cattle-node-agent-ttjfl_cattle-system_18853f76-5170-11e8-836e-000c2923739a_0
e78308215400        rancher/pause-amd64:3.1                                                                                    "/pause"                 28 minutes ago      Up 28 minutes                           k8s_POD_nginx-ingress-controller-q5cgn_ingress-nginx_173d5bb8-5170-11e8-836e-000c2923739a_0
85e376a3857d        rancher/rke-tools:v0.1.4                                                                                   "nginx-proxy CP_HO..."   29 minutes ago      Up 29 minutes                           nginx-proxy
50db7bace022        rancher/hyperkube:v1.10.1-rancher2                                                                         "/opt/rke/entrypoi..."   29 minutes ago      Up 29 minutes                           kubelet
c6e4a2980730        rancher/hyperkube:v1.10.1-rancher2                                                                         "/opt/rke/entrypoi..."   29 minutes ago      Up 29 minutes                           kube-proxy

@Juju-62q

Juju-62q commented May 7, 2018

I got the same error when adding a second node to the k8s cluster.

Here are some logs.

xxx@localhost ~ $ docker ps
CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS              PORTS               NAMES
7e3995f9612e        rancher/rancher-agent:v2.0.0   "run.sh -- share-roo…"   7 minutes ago       Up 7 minutes                            share-mnt
3db29cd16328        rancher/rancher-agent:v2.0.0   "run.sh --server htt…"   7 minutes ago       Up 7 minutes                            laughing_goldwasser
xxx@localhost ~ $ docker logs 7e
Found container ID: 86e121fbbbbcff73e36fa17ca51b69a823bc3a0fc6a1312eab0bf7f694eeebe8
Checking root: /host/run/runc
Checking root: /host/var/run/runc
Checking root: /host/run/docker/execdriver/native
Checking root: /host/var/run/docker/execdriver/native
Checking root: /host/run/docker/runtime-runc/moby
Checking file: 3db29cd163283ea4f5d145d71c741e91b039648a45ce2480f95b25338e505009
Checking file: 7e3995f9612e592c21922d91ec72c29ef7a883fcff73ce029cfd3453277b32a2
Checking file: 86e121fbbbbcff73e36fa17ca51b69a823bc3a0fc6a1312eab0bf7f694eeebe8
Found state.json: 86e121fbbbbcff73e36fa17ca51b69a823bc3a0fc6a1312eab0bf7f694eeebe8
time="2018-05-07T05:33:56Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/12825/ns/mnt -F -- /var/lib/docker/overlay2/2193cefa0869f186d06ef729c7765e059d01ec8a32a848f3ab34e15435fe5312/merged/usr/bin/share-mnt --stage2 /var/lib/kubelet /var/lib/rancher -- norun]" 
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
xxx@localhost ~ $ docker logs 3d
-----BEGIN CERTIFICATE-----

-----END CERTIFICATE-----
time="2018-05-07T05:33:55Z" level=info msg="Option customConfig=map[address:<node IP Address> internalAddress:<node IP Address> roles:[worker]]"
time="2018-05-07T05:33:55Z" level=info msg="Option etcd=false"
time="2018-05-07T05:33:55Z" level=info msg="Option controlPlane=false"
time="2018-05-07T05:33:55Z" level=info msg="Option worker=true"
time="2018-05-07T05:33:55Z" level=info msg="Option requestedHostname=localhost"
time="2018-05-07T05:33:55Z" level=info msg="Connecting to wss://<server IP Address>/v3/connect/register with token <access token>"
time="2018-05-07T05:33:55Z" level=info msg="Connecting to proxy" url="wss://<server IP Address>/v3/connect/register"
time="2018-05-07T05:33:56Z" level=info msg="Starting plan monitor"

@sixcorners
Author

I've never seen anything interesting in the initial rancher agent's logs.
time="2018-05-07T05:59:14Z" level=info msg="Option requestedHostname=sn"
time="2018-05-07T05:59:14Z" level=info msg="Option customConfig=map[address:abc internalAddress: roles:[etcd worker controlplane]]"
time="2018-05-07T05:59:14Z" level=info msg="Option etcd=true"
time="2018-05-07T05:59:14Z" level=info msg="Option controlPlane=true"
time="2018-05-07T05:59:14Z" level=info msg="Option worker=true"
time="2018-05-07T05:59:14Z" level=info msg="Connecting to wss://rancher.sixcorners.info/v3/connect/register with token xyz"
time="2018-05-07T05:59:14Z" level=info msg="Connecting to proxy" url="wss://rancher.sixcorners.info/v3/connect/register"
time="2018-05-07T05:59:15Z" level=info msg="Starting plan monitor"

@jonaskello

Interestingly I also got this when adding my second node, but the third was added fine. Basically what I did was follow the quickstart to set everything up on a single VM, then added a second VM as a worker node and got the error. Then I added a third node as a worker and it went fine.

@Juju-62q

Juju-62q commented May 7, 2018

Hmm... I will try setting up the machine again (now using bare metal).
Or should I use a VM to set up the k8s cluster?

@sixcorners
Author

I'm not sure if bare metal vs. VM matters. I think I got similar errors when trying to deploy it on bare metal (my laptop).

@djanjic

djanjic commented May 8, 2018

Hi, I also had this error because my VMs had the same name. After changing the name and cleaning the entire VM (volume prune, system prune, etc.), everything went back to normal.

@sara4dev

sara4dev commented May 9, 2018

I also had the same issue. But it turned out my localhost wasn't set up right in /etc/hosts. Once I fixed that, it came up!

@Juju-62q

In my case, the problem was solved by changing the name. Thanks!

@nicklasfrahm

So it is currently not possible to run rancher/server and rancher/agent on the same host without additional VM overhead? Is there a fix planned, or is this an uncommon / unsupported use case? I also ran into this problem when I wanted to deploy the agent to the same bare-metal host as the server.

@mitchellmaler

I am getting the same error. I just tried to deploy a cluster. All the masters that I started have the 'share-mnt' containers saying:

Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet

All I ran was the docker run script generated by rancher to start the nodes.

@ChromoX

ChromoX commented May 17, 2018

I have seen this bug many times in many different places/circumstances. The only way I've found to fix it is by destroying the cluster and recreating it.

@mitchellmaler

mitchellmaler commented May 18, 2018

I created this cluster in OpenStack right before running the docker command on each host. If the error is actually caused by something else, like the network or the hostname as mentioned above, it would be nice if it said so. Right now the "cannot find kubelet" error isn't very helpful, since the node was able to pull hyperkube and start other containers.

@yotoobo

yotoobo commented May 24, 2018

I got the same error, then I destroyed the cluster and recreated it from the latest version. Then I added 3 nodes, and they are OK.

the system version:

CentOS Linux release 7.4.1708 (Core)

the docker version:

Server Version: 17.03.2-ce
Storage Driver: overlay
Backing Filesystem: xfs
Supports d_type: true

the rancher images:

REPOSITORY TAG IMAGE ID CREATED SIZE
rancher/rancher-agent v2.0.2 df087865517d 15 hours ago 228 MB
rancher/rancher latest 88526c7bea4e 15 hours ago 521 MB
rancher/rke-tools v0.1.8 5df9ccc4e588 7 days ago 136 MB


@ram-devsecops

I'm getting the same error. I tried destroying and recreating the cluster, but no luck. I'm using Rancher version 2.0.2.

Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}

@ram-devsecops

I used a registry which had authentication enabled; after making the rancher repo public, my issue was resolved.

@gazben

gazben commented Jun 8, 2018

Same here on Ubuntu 18.04 with Docker 17.03.2-ce (tried 18.X too, but no luck).
Can I get more logs out of the startup?

@MSandro

MSandro commented Jun 10, 2018

Hi, same problem here: DinD (Alpine Linux), Docker 18.03.1-ce, Rancher 2.0.2.
What can we do?

@ibrokethecloud
Contributor

I have run into the same issue as well.

[root@prv-verint-1.lab.devlabs kubernetes]# docker logs -f 3a43a93af75b
Found container ID: c066d5a54f9a3e2ff78c093be68ff9442cc065656f3f654f1487487f36acd337
Checking root: /host/run/runc
Checking file: 3a43a93af75be42ceb79aa1f0d1e9a6bb19c51983cd72537e6ac154c4074362c
Checking file: c066d5a54f9a3e2ff78c093be68ff9442cc065656f3f654f1487487f36acd337
Found state.json: c066d5a54f9a3e2ff78c093be68ff9442cc065656f3f654f1487487f36acd337
time="2018-06-12T23:18:36Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/10039/ns/mnt -F -- /var/lib/docker/overlay/a57306998356a5882524dddddc3268f986d2e18c394bee6e8db5d4d72d7cbbe3/merged/usr/bin/share-mnt --stage2 /var/lib/kubelet /var/lib/rancher -- norun]"
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
Error response from daemon: {"message":"No such container: kubelet"

I don't understand where it is finding the container ID from; I have checked the local mounts on the node.

I was testing some scenarios, so I tried to re-register a node after cleaning the local mounts.

I have also tried:

  • docker system prune -a -f
  • docker rm -f share-mnt

@ibrokethecloud
Contributor

The actual error comes from /usr/bin/share-root.sh:

#!/bin/bash

# Resolve the ID of the container this script is running in from its cgroup path
ID=$(grep :devices: /proc/self/cgroup | head -n1 | awk -F/ '{print $NF}' | sed -e 's/docker-\(.*\)\.scope/\1/')
# Look up the image that container was started from
IMAGE=$(docker inspect -f '{{.Config.Image}}' $ID)

# Run share-mnt against the host to set up the shared mount points
docker run --privileged --net host --pid host -v /:/host --rm --entrypoint /usr/bin/share-mnt $IMAGE "$@" -- norun
# Then retry until a container named "kubelet" exists and can be started
while ! docker start kubelet; do
    sleep 2
done

I am not really sure if this is an issue with the logic, but no container named kubelet was ever created by the agent, yet the agent is stuck trying to start it.
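
A quick way to confirm what that loop is waiting on is to check whether any container named kubelet exists at all, running or exited; a minimal check using standard docker CLI filters:

docker ps -a --filter name=kubelet --format '{{.ID}} {{.Names}} {{.Status}}'
# no output means the agent/RKE never created a kubelet container,
# which is exactly the state the while-loop above keeps retrying against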

@superseb
Contributor

The root cause can be found either in the other rancher/rancher-agent container, where it connects to the rancher/rancher container and registers itself, or in the rancher/rancher container, where it logs the cluster provisioning status for the node (and why it's not spawning the needed containers).
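
As a rough sketch of where to look (the container IDs are placeholders you would take from docker ps on the respective hosts):

# on the node: find the registering agent container and follow its log
docker ps --filter ancestor=rancher/rancher-agent:v2.0.0 --format '{{.ID}} {{.Names}}'
docker logs -f <agent-container-id>

# on the host running rancher/rancher: follow the provisioning messages for the cluster
docker logs -f <rancher-server-container-id> 2>&1 | grep -i provision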

@jeremyweber-np

I am seeing the same issue when attempting to re-register a node. The steps I use are:

  1. Create a custom cluster
  2. Register a node
  3. Delete the node from the rancher console
  4. Attempt to clean up the node with the following: https://pastebin.com/AuQUJiM4
  5. Reregister the node

In this case the rancher server reports:

2018/06/27 14:45:35 [ERROR] ClusterController c-v2998 [cluster-agent-controller] failed with : could not contact server: Get https://x.x.x.x:6443/version: dial tcp 127.0.0.1:6443: getsockopt: connection refused

The node has only 2 containers:

share-mnt: Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
quirky_hopper: ... time="2018-06-27T14:57:46Z" level=info msg="waiting for node to register"
time="2018-06-27T14:57:48Z" level=info msg="Starting plan monitor"

The only way to proceed at this point is to stop all activity on the node, delete the node from the Rancher UI, recreate the cluster, and re-register.

Running server 2.0.4 with only the one node, Docker version 17.03.2-ce on Ubuntu 16.04.

@wrsuarez

+1 for seeing the same thing on a brand new Rancher 2.0 deployment with a new set of CentOS7 nodes

@sannysoft

Same for me.
Ubuntu 16.04, Docker 17.03.2-ce
I have 2 hosts: Rancher, plus I'm creating a cluster with 2 machines, and I get the same error.
The Docker logs show:
dockerd[2710]: time="2018-07-05T13:59:37.185566233Z" level=error msg="Handler for POST /v1.24/containers/kubelet/start returned error: No such container: kubelet"

@CaptEmulation

I got this error on my first HA RKE install when adding my first cluster. Wiping nodes and re-creating the cluster had no effect. Removing the cluster and restarting rancher server nodes fixed it for me

@gknepper

I solved it by removing the hostname from the 127.0.0.1 line in /etc/hosts and adding the FQDN as well. The machine now resolves the hostname and FQDN to eth0 instead of loopback.

In summary:

192.168.1.X hostname FQDN
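
A minimal sketch of the resulting /etc/hosts, following the layout above (address and names are illustrative):

127.0.0.1      localhost
192.168.1.10   myhost myhost.example.com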

@dankeder

I hit this issue as well. In my case it started when the last node of my experimental cluster stopped working after I restarted into a new version of Docker and it couldn't get up and running again. I managed to "fix" it without deleting and creating a new cluster (as I had to several times before when I encountered the same issue).

Here's what I did:

  1. Cleaned the node by running
docker system prune
docker volume prune

Note that this will delete all the Docker volumes, take care if you have
important data in your volumes.

  2. Cleaned Rancher/Kubernetes runtime data on the node.
rm -rf /etc/cni/ /etc/kubernetes/ /opt/cni/ /var/lib/calico/ /var/lib/cni/ /var/lib/rancher/ /var/run/calico/

Note that official docs on node cleanup recommend also removal of /opt/rke and
/var/lib/etcd. I did NOT remove them because they contain cluster etcd
snapshots and data. This is especially important in case there's only one node
in the cluster.

  3. I exec-ed into the rancher container and hacked the cluster status (thx @ibrokethecloud for the hint):
docker exec -it rancher bash

Inside the container:

apt-get update && apt-get -y install vim
kubectl edit cluster c-XXXX  # replace the cluster-id with an actual cluster ID

Now in the editor I found the key apiEndpoint (it should be directly under the status key) and removed it; see the sketch after these steps. Exit the editor and container. Make sure kubectl says that it updated the cluster.

  4. From the Rancher UI I got the command for registering a new node. I set a different name for the node than before by adding --node-name to the docker run command (actually there's an edit box for this under advanced settings). It looked like this:
docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.2.6 \
  --server https://rancher.example.com --token XXXXXXXXXXXXXXX --node-name mynode2 \
  --etcd --controlplane --worker
  5. I ran the above command on the cleaned node, and finally it registered successfully and RKE started up all the kube-* and kubelet containers.
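
For step 3, a minimal sketch of how to locate the field before deleting it (the cluster ID is a placeholder; only the fact that apiEndpoint sits under status is taken from the description above):

# inside the rancher container
kubectl get cluster c-XXXX -o yaml | grep -n 'apiEndpoint'   # confirm the field is under .status
kubectl edit cluster c-XXXX                                   # delete the status.apiEndpoint line, save and quit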

I also tried to just clean up the node and register it again, but it always ended up with the "Error: failed to start containers: kubelet".

Given the above, I think Rancher doesn't handle well the case when all the cluster nodes become unresponsive. In this case nodes can't even be removed from the cluster. When I tried to remove the faulty node it got stuck in the "Removing" state indefinitely, probably because the cluster's etcd couldn't be reached.

@Flos

Flos commented Jul 31, 2019

I just created a new local VM with Ubuntu 18.04 to test if rancher 2.2.6 is 'stable' now. But I can't add the first worker with all 3 roles on my single machine setup because of this issue.

@zhming0

zhming0 commented Aug 2, 2019

"Failed to start kubelet" means something has stopped it from starting up.
In my case, the problem was that my etcd cluster lost quorum; removing the dead node from the etcd state manually using etcdctl fixed this for me, and the kubelet issue is gone.
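
For reference, a minimal sketch of that cleanup, run through the etcd container RKE creates (the member ID is a placeholder taken from the list output; depending on the etcd image you may also need ETCDCTL_API=3, --endpoints and the TLS cert flags):

docker exec etcd etcdctl member list            # note the hex ID of the dead member
docker exec etcd etcdctl member remove <member-id>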

@sixcorners
Author

Man.. there sure are a lot of posts here.
One of the first few responses was asking for more information. I also want more information because it seems like a lot of different kinds of problems can cause this. Could this error message be enhanced?

@sww0825521xy

I also hit this issue on Rancher stable 2.2.8:

+ sleep 2
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet

docker ps -a

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f428817311ac rancher/hyperkube:v1.14.6-rancher1 "/opt/rke-tools/entr…" 2 hours ago Up 2 hours kube-proxy
08f59cee9f37 rancher/hyperkube:v1.14.6-rancher1 "/opt/rke-tools/entr…" 2 hours ago Up 2 hours kubelet
e1ad2c121a66 rancher/hyperkube:v1.14.6-rancher1 "/opt/rke-tools/entr…" 2 hours ago Up 2 hours kube-scheduler
1088e8e2ef66 rancher/hyperkube:v1.14.6-rancher1 "/opt/rke-tools/entr…" 2 hours ago Up 2 hours kube-controller-manager
5c30c26352c1 rancher/hyperkube:v1.14.6-rancher1 "/opt/rke-tools/entr…" 2 hours ago Up 2 hours kube-apiserver
0465bfcce0d2 rancher/rke-tools:v0.1.42 "/bin/bash" 2 hours ago Created service-sidekick
809c08e9f9bf rancher/coreos-etcd:v3.3.10-rancher1 "/usr/local/bin/etcd…" 2 hours ago Up 2 hours etcd
12a2ff2ae478 rancher/rancher-agent:v2.2.8 "run.sh --server htt…" 2 hours ago Up 2 hours flamboyant_engelbart
ab18ff7b118c rancher/rancher-agent:v2.2.8 "run.sh -- share-roo…" 2 hours ago Exited (0) 2 hours ago share-mnt
b998e63a1a7a rancher/rancher-agent:v2.2.8 "run.sh --server htt…" 2 hours ago Up 2 hours affectionate_edison
6a8d364109ab rancher/rancher:stable "entrypoint.sh" 2 hours ago Up 2 hours 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp silly_mclean

@seedsam

seedsam commented Nov 4, 2019

I have run into the same issue as well, but solved it.
Look at the rancher server log, not the agent log.
Agent log:

+ sleep 2
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
+ sleep 2
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
+ sleep 2
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet

rancher server log:
2019/11/04 09:25:07 [INFO] kontainerdriver rancherkubernetesengine stopped
2019/11/04 09:25:07 [INFO] cluster [c-w6rtq] provisioning: [network] Deploying port listener containers
2019/11/04 09:25:07 [INFO] cluster [c-w6rtq] provisioning: [network] Pulling image [http://10.1.69.87/rancher/rke-tools:v0.1.34] on host [10.1.69.82]
2019/11/04 09:25:07 [ERROR] cluster [c-w6rtq] provisioning: [invalid reference format]
2019/11/04 09:25:07 [ERROR] ClusterController c-w6rtq [cluster-provisioner-controller] failed with : [invalid reference format]
2019-11-04 09:28:59.185541 I | mvcc: store.index: compact 5146

It turned out the image address has an extra http:// prefix, hence the "invalid reference format".

@jhughes2112

Spent days tracking this backwards, looking for answers. Anyone still getting this error: note first that the share-mnt container does this in a tight loop while your node is pulling the Kubernetes images and being set up by the remote Rancher server, so this isn't abnormal to see on first boot. In the case of a RAM-only machine, you'll always see this on boot until provisioning reaches the stage where Kubernetes is available. It's not necessarily stuck, it might just be waiting.

I have noticed that if you fail to provision a node, it will always fail well before it pulls the kubernetes container images, due to the cluster recognizing the node but assigning it a different certificate (or something), leaving it hung. I solved it by wiping my Rancher server and starting over. This has absolutely got to be fixed properly, somehow.

HOWEVER...

If you actually do get a container named kubelet created, and the error switches over to this:

Error response from daemon: {"message":"linux mounts: open /proc/self/mountinfo: no such file or directory"}
Error: failed to start containers: kubelet

from what I can tell, it means you're running a system where the OS is on a tmpfs volume (either diskless, or not installed to disk yet). There are other errors similar to this in the Docker and Kubernetes forums. If I get to the bottom of where it happens, I'll open a ticket.

@jhughes2112

Got to the bottom of it. I was running diskless iPXE booted RancherOS. Although rancher-agent started up, by the time it started installing Kubernetes onto the machine, it exhausted something in the tmpfs file system, causing a failure so bad that nothing shows up when you "ls -al /proc", and "df" or "du" would just fail. Three different perfectly good servers with almost exactly the same results. There may be a kernel setting that would expand the OS file system somehow, but by installing to hard drive and booting there instead, all the issues went away.

Still worth noting that if you are watching the logs of share-mnt while spinning up a node, quite a number of errors will pop up while k8s is installing, even on a completely healthy node.

@maiku1008

I bumped into the very same issue today.

I have basically been trying to add a worker node to rancher all morning.
In the end it seemed that using a certain hostname for the node didn't work, but using a different one registered the node straight away.

It seems in my tests I had been impatiently adding/removing nodes from the cluster by just deleting them from the UI. My suspicion is that the etcd cluster got dirty as a result.
A more graceful procedure should be followed, such as:

Cordon the node from Rancher
Drain the node from Rancher
Delete the node from Rancher (wait for the cluster to update)
SSH to the node and stop all containers: docker stop $(docker ps -q)
SSH to the node and remove all containers, including stopped ones: docker rm $(docker ps -aq)
Reboot the node

I did not rebuild the cluster from scratch, as in my case I knew exactly when the tests started. I merely restored an earlier etcd snapshot, tried again and it all worked as expected.

@Just-Insane

Just-Insane commented May 14, 2020

@jhughes2112

Got to the bottom of it. I was running diskless iPXE booted RancherOS. Although rancher-agent started up, by the time it started installing Kubernetes onto the machine, it exhausted something in the tmpfs file system, causing a failure so bad that nothing shows up when you "ls -al /proc", and "df" or "du" would just fail. Three different perfectly good servers with almost exactly the same results. There may be a kernel setting that would expand the OS file system somehow, but by installing to hard drive and booting there instead, all the issues went away.

Were you able to get worker nodes stood up on an iPXE booted RancherOS? I am currently trying to boot 10 nodes this way (without local storage), and it does not seem to be working (seeing the same errors you were).

Seeing the following in docker.log:

time="2020-05-14T22:12:30.797236245Z" level=warning msg="Error while cleaning up container resource mounts." container=776124b2138bdb63ea26b0b82cabbbaf518e1dc760cb255a9baf73494d710861 error="open /proc/self/mountinfo: no such file or directory"
time="2020-05-14T22:12:30.809112374Z" level=error msg="776124b2138bdb63ea26b0b82cabbbaf518e1dc760cb255a9baf73494d710861 cleanup: failed to delete container from containerd: no such container"
time="2020-05-14T22:12:30.809216451Z" level=error msg="Handler for POST /v1.24/containers/776124b2138bdb63ea26b0b82cabbbaf518e1dc760cb255a9baf73494d710861/start returned error: open /proc/self/mountinfo: no such file or directory"
time="2020-05-14T22:12:30.816105471Z" level=warning msg="Failed to parse cgroup information: open /proc/self/mountinfo: no such file or directory"
time="2020-05-14T22:12:30.817304838Z" level=warning msg="Failed to parse cgroup information: open /proc/self/mountinfo: no such file or directory"
time="2020-05-14T22:12:30.887537100Z" level=warning msg="Failed to parse cgroup information: open /proc/self/mountinfo: no such file or directory"
time="2020-05-14T22:12:30.904236485Z" level=warning msg="Falling back to default propagation for bind source in daemon root" container=f6a16097291ddd5d3c8bfcfb3dc4081349fe233204eb3d864a45bf3243b18e42 source=/var/lib
time="2020-05-14T22:12:30.965317948Z" level=warning msg="Error while cleaning up container resource mounts." container=f6a16097291ddd5d3c8bfcfb3dc4081349fe233204eb3d864a45bf3243b18e42 error="open /proc/self/mountinfo: no such file or directory"
time="2020-05-14T22:12:30.967717432Z" level=error msg="f6a16097291ddd5d3c8bfcfb3dc4081349fe233204eb3d864a45bf3243b18e42 cleanup: failed to delete container from containerd: no such container"
time="2020-05-14T22:12:30.967995651Z" level=error msg="Handler for POST /v1.24/containers/f6a16097291ddd5d3c8bfcfb3dc4081349fe233204eb3d864a45bf3243b18e42/start returned error: failed to open stdout fifo: couldn't stat /proc/self/fd/30: stat /proc/self/fd/30: no such file or directory"

@jhughes2112

jhughes2112 commented May 14, 2020

@Just-Insane Never got it working. From what I could tell, the tmpfs is exhausted of inodes or something, to the extent the OS stopped working. Someone with more time on their hands can maybe sort it out for the rest of us, but I just threw a small HDD in each box and installed ROS there instead. Worked fine. Ultimately gave up on iPXE because of it, unfortunately, and ROS as well, because 99% of the value of it, to me at least, was to treat nodes as extremely dumb cattle.

@Just-Insane

Ah, that's unfortunate.

My only thought is maybe switching to K3sOS, but that is rather poorly documented, especially for iPXE booting.

Yea, the only reason I'm using iPXE right now is because of these node's lack of local storage, otherwise, I'd just install an OS and it'd be fine.

Further to your note on dumb cattle, I am not sure rancher would handle non-persistent nodes trying to join with existing node names anyways, so that might not even work.

@jhughes2112

I had set up my DHCP to assign the same IP address to the machine's MAC addresses, then ran an internal web server that trivially returned a hostname according to the requestor's IP address. The nodes would boot, curl the server for its name, then set it right up front in the cloud init. It's an afternoon to set up, but not hard. Overall didn't seem worth pursuing it, given the failure to run entirely in RAM.

I am running a k3os cluster in AWS, and probably would not recommend it for running iPXE, since it really wants an external high availability MySQL setup that it can keep its brains in. RDS is good for that, but running locally.... ehh.

@Just-Insane

Maybe I am misunderstanding, but k3os requires an external HA MySQL setup? I thought it was just a lighter weight version of RancherOS.

Hopefully someone on the OS team is able to take a look at that issue and provide a workaround or resolution.

What did you end up doing OS wise with your hosts you were trying to get working diskless?

@jhughes2112

I'm not sure if k3os and k3s are the same thing (I suspect not, on reflection), but to configure an HA Rancher management cluster, they recommend using k3s rather than ROS, which does need something to store shared data in. The easy way is MySQL, the hard way is etcd. https://rancher.com/docs/rancher/v2.x/en/installation/k8s-install/kubernetes-rke/

I ended up dropping in an older server SAS drive of about 80 GB, which is plenty large for an OS. I put Ubuntu 18.04 on it and did a simple install. Life became much easier after that.

@Just-Insane

Oh, my Rancher master cluster is on separate nodes, and my master nodes for this cluster have local storage, so I am fine on that front.

Ah, I might have to end up adding local storage I guess. Not looking forward to that cost lol.

@dividebysandwich

dividebysandwich commented May 28, 2020

I'm having the same issue: I added a third node to a cluster, worked great.
Added a fourth node, getting this error in share-mnt:

INFO: Arguments: -- share-root.sh docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.3.5 --server https://10.16.176.10:8443 --token REDACTED --ca-checksum e634f7f0d65cb60e815e4987830cdcd09317800827e683ca14239ec61f9eae1d --no-register --only-write-certs --node-name nbd-cluster-node4 /var/lib/kubelet /var/lib/rancher
+ trap 'exit 0' SIGTERM
++ grep :devices: /proc/self/cgroup
++ head -n1
++ awk -F/ '

{print $NF}

'
++ sed (.*)\.scope/\1/'
+ ID=3991efd40484f66f49956e5ed3d0e31049f7a164267a3170aa0440740bdede2f
++ docker inspect -f '.Config.Image' 3991efd40484f66f49956e5ed3d0e31049f7a164267a3170aa0440740bdede2f
+ IMAGE=rancher/rancher-agent:v2.3.5
+ bash -c 'docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.3.5 --server https://10.16.176.10:8443 --token 4c28cmskzpr4lbjzxtvsn67cpkwqnktcbzz6h7h6zldgd6j4h6g8jg --ca-checksum e634f7f0d65cb60e815e4987830cdcd09317800827e683ca14239ec61f9eae1d --no-register --only-write-certs'
92b2310fda72442305ad3d860d0ff83e7e9d4c7e6094181a52f91c1b0deb8f10
+ docker run --privileged --net host --pid host -v /:/host --rm --entrypoint /usr/bin/share-mnt rancher/rancher-agent:v2.3.5 --node-name nbd-cluster-node4 /var/lib/kubelet /var/lib/rancher – norun
Incorrect Usage.

NAME:
/usr/bin/share-mnt - A new cli application

USAGE:
/usr/bin/share-mnt [global options] command [command options] [arguments...]

VERSION:
1d97ce9

COMMANDS:
help, h Shows a list of commands or help for one command

GLOBAL OPTIONS:
--stage2
--help, -h show help
--version, -v print the version

time="2020-05-28T12:42:35Z" level=fatal msg="flag provided but not defined: -node-name"
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
+ sleep 2
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
+ sleep 2
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
+ sleep 2

I've checked the disk space, all OK. The hostname is also correct and set up correctly in the hosts file.

Docker ps output when this happens:

CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS              PORTS               NAMES
9e7d5c8d8b2d        rancher/rancher-agent:v2.3.5   "run.sh --server h..."   4 seconds ago       Up 3 seconds                            clever_davinci
18ec181d0456        rancher/rancher-agent:v2.3.5   "run.sh -- share-r..."   4 seconds ago       Up 3 seconds                            share-mnt
19f4d9d89a00        rancher/rancher-agent:v2.3.5   "run.sh --server h..."   15 seconds ago      Up 14 seconds                           zealous_cray

I also tried changing the hostname but no joy there. Any pointers would be greatly appreciated. And I join the chorus of those who ask for a better error message that lets us actually determine what's wrong.

EDIT: I think I'll create a separate issue because this thread is filled with lots of unrelated issues, and it's not even certain mine is the same as the original issue.

@andrezaycev

I have the same issue after every reboot of a worker node. Every time, what helps is docker prune, removing the folders, and re-adding the rancher agent. Waiting for a solution for how to fix this.

@dividebysandwich

Just FYI I haven't gotten to investigate or pull logs yet, but I will do that when I get some time. I've opened another issue for this which has been closed because I can't touch that system at the moment.

@jhughes2112

A failing kubelet is also an indicator that your node was added to the Rancher cluster before, failed for some other reason, and then you tried to add it again. If you don't use the cleanup script, you probably haven't gotten rid of a few items, which makes Rancher misinterpret the node as re-joining the cluster rather than being a new node. Hunt for them, you'll find them.
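
As a rough sketch of what to hunt for, based on the cleanup steps posted elsewhere in this thread (paths can vary by Rancher/RKE version):

# leftovers that make a node look like it is re-joining rather than new
ls -d /etc/kubernetes /var/lib/rancher /var/lib/kubelet /var/lib/etcd /var/lib/cni /opt/cni 2>/dev/null
docker ps -a --filter name=share-mnt --filter name=kubelet    # old agent/kubelet containers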

@xmh19936688
Contributor

xmh19936688 commented Jul 29, 2020

I got the same error while creating a new cluster and adding a node, and then resolved it!

Here is some info:
CentOS: 7.3.1611
kernel: 3.10.0-862
docker: 18.09.9
rancher: v2.2.2 (set up by rke)
rke: v0.2.4

Check steps:

  • Make sure /etc/hosts on all nodes (the local cluster's nodes and the new node) has a record for each node.
    vi /etc/hosts
  • Make sure each node (the local cluster's nodes and the new node) can ssh to the others without a password.
    ssh-keygen -t rsa && ssh-copy-id root@x.x.x.x
  • Make sure the system-default-registry (under Global -> Settings) does not end with '/'.
  • Make sure the private registry (under advanced options) does not end with '/'.

Maybe you need to clean up the new cluster and node after failed attempts.
Clean steps:

  • Delete the node in the Rancher UI and wait for it to disappear.
  • Delete the cluster in the Rancher UI and wait for it to disappear.
  • Run the scripts shown below and here to clean the node, then reboot.
docker ps -aq | xargs docker stop               # stop every container
docker ps -aq | xargs docker rm -v              # remove containers and their anonymous volumes
docker volume rm $(sudo docker volume ls -q)    # remove remaining named volumes
mount | grep '/var/lib/kubelet'| awk '{print $3}'|xargs umount   # unmount kubelet bind mounts
# delete Rancher/Kubernetes state directories
rm -rf /var/lib/etcd \
    /var/lib/cni \
    /var/run/calico \
    /etc/kubernetes/ssl \
    /etc/kubernetes/.tmp/ \
    /opt/cni \
    /var/lib/kubelet \
    /var/lib/rancher \
    /var/lib/calico

Finally, bring up the new cluster and GLHF! Please let me know if anything can be optimized, thanks!

@stale

stale bot commented Jul 10, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Jul 10, 2021
@stale stale bot closed this as completed Jul 28, 2021