
Rancher single node won't start on M1 Mac with latest Docker Desktop #35930

Open
presidenten opened this issue Dec 20, 2021 · 10 comments
Labels
feature/rancher-docker-install release-note Note this issue in the milestone's release notes team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support [zube]: To Triage

Comments

presidenten commented Dec 20, 2021

Rancher Server Setup

  • Rancher version: 2.6.2
  • Installation option (Docker install):
    • Docker single node install
  • Proxy/Cert Details: none

Environment

  • Mac M1 on macOS 12.0.1
  • Docker Desktop for Mac 4.3.1, Docker Engine 20.10.11

Describe the bug
Rancher crashes repeatedly, usually after 10-15 seconds, and thus never finishes starting.

I am getting lots of `[FATAL] k3s exited with: exit status 1` and `Unexpected watch close - watch lasted less than a second and no items received`. The exact error depends on how many times it has restarted; sometimes it installs CRDs and such before crashing.

To Reproduce

  • Start Rancher on an M1 Mac with the latest Docker Desktop version:
docker run -d --privileged --restart=unless-stopped --name rancher -p 4080:80 -p 4443:443 rancher/rancher:v2.6.2

Result

                                                                           ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
CONTAINER ID   IMAGE                    COMMAND           CREATED          STATUS                  PORTS                                         NAMES
c6b9cc8302f5   rancher/rancher:v2.6.2   "entrypoint.sh"   30 minutes ago   Up Less than a second   0.0.0.0:4080->80/tcp, 0.0.0.0:4443->443/tcp   rancher
                                                                           ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑

And

2021/12/20 08:31:38 [INFO] Waiting for k3s to start
time="2021-12-20T08:31:38Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
time="2021-12-20T08:31:38Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/d57e75cb49c3cfd88307a8669e8adcf6b7740b66d6125a45c00aaa54301a5746"
2021/12/20 08:31:39 [INFO] Waiting for k3s to start
exit status 1
2021/12/20 08:31:46 [FATAL] k3s exited with: exit status 1

Edit:
I have now verified that Rancher 2.6.2 works on Docker Desktop for Mac 4.2.0, but if I update to 4.3.1, it stops working.
Moving back to 4.2.0 makes it work again.


Expected Result
I expect the Rancher single-node container to start without crashing.

I have tried to prune containers, prune volumes, and even factory reset docker desktop.
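
(For reference, a sketch of the cleanup commands described above; the factory reset itself is done from the Docker Desktop Troubleshoot UI, not the CLI.)

# Destructive: removes all stopped containers and all unused local volumes.
docker container prune -f
docker volume prune -f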

Full logs until first crash

2021/12/20 08:31:37 [INFO] Rancher version v2.6.2 (64c748d16) is starting
2021/12/20 08:31:37 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2021/12/20 08:31:37 [INFO] Listening on /tmp/log.sock
2021/12/20 08:31:37 [INFO] Running etcd --data-dir=management-state/etcd --heartbeat-interval=500 --election-timeout=5000
running etcd on unsupported architecture "arm64" since ETCD_UNSUPPORTED_ARCH is set
2021-12-20 08:31:37.558753 W | pkg/flags: unrecognized environment variable ETCD_URL=https://github.com/etcd-io/etcd/releases/download/v3.4.15/etcd-v3.4.15-linux-arm64.tar.gz
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-12-20 08:31:37.558790 I | etcdmain: etcd Version: 3.4.15
2021-12-20 08:31:37.558792 I | etcdmain: Git SHA: aa7126864
2021-12-20 08:31:37.558810 I | etcdmain: Go Version: go1.12.17
2021-12-20 08:31:37.558816 I | etcdmain: Go OS/Arch: linux/arm64
2021-12-20 08:31:37.558820 I | etcdmain: setting maximum number of CPUs to 5, total number of available CPUs is 5
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-12-20 08:31:37.559130 I | embed: name = default
2021-12-20 08:31:37.559140 I | embed: data dir = management-state/etcd
2021-12-20 08:31:37.559142 I | embed: member dir = management-state/etcd/member
2021-12-20 08:31:37.559143 I | embed: heartbeat = 500ms
2021-12-20 08:31:37.559144 I | embed: election = 5000ms
2021-12-20 08:31:37.559146 I | embed: snapshot count = 100000
2021-12-20 08:31:37.559151 I | embed: advertise client URLs = http://localhost:2379
2021-12-20 08:31:37.559208 W | pkg/fileutil: check file permission: directory "management-state/etcd" exist, but the permission is "drwxr-xr-x". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
2021-12-20 08:31:37.564482 I | etcdserver: starting member 8e9e05c52164694d in cluster cdf818194e3a8c32
raft2021/12/20 08:31:37 INFO: 8e9e05c52164694d switched to configuration voters=()
raft2021/12/20 08:31:37 INFO: 8e9e05c52164694d became follower at term 0
raft2021/12/20 08:31:37 INFO: newRaft 8e9e05c52164694d [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
raft2021/12/20 08:31:37 INFO: 8e9e05c52164694d became follower at term 1
raft2021/12/20 08:31:37 INFO: 8e9e05c52164694d switched to configuration voters=(10276657743932975437)
2021-12-20 08:31:37.565972 W | auth: simple token is not cryptographically signed
2021-12-20 08:31:37.567814 I | etcdserver: starting server... [version: 3.4.15, cluster version: to_be_decided]
2021-12-20 08:31:37.568141 I | etcdserver: 8e9e05c52164694d as single-node; fast-forwarding 9 ticks (election ticks 10)
raft2021/12/20 08:31:37 INFO: 8e9e05c52164694d switched to configuration voters=(10276657743932975437)
2021-12-20 08:31:37.568395 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
2021-12-20 08:31:37.568794 I | embed: listening for peers on 127.0.0.1:2380
raft2021/12/20 08:31:38 INFO: 8e9e05c52164694d is starting a new election at term 1
raft2021/12/20 08:31:38 INFO: 8e9e05c52164694d became candidate at term 2
raft2021/12/20 08:31:38 INFO: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 2
raft2021/12/20 08:31:38 INFO: 8e9e05c52164694d became leader at term 2
raft2021/12/20 08:31:38 INFO: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 2
2021-12-20 08:31:38.067422 I | etcdserver: setting up the initial cluster version to 3.4
2021-12-20 08:31:38.070460 N | etcdserver/membership: set the initial cluster version to 3.4
2021-12-20 08:31:38.070547 I | etcdserver/api: enabled capabilities for version 3.4
2021-12-20 08:31:38.070585 I | etcdserver: published {Name:default ClientURLs:[http://localhost:2379]} to cluster cdf818194e3a8c32
2021-12-20 08:31:38.071103 I | embed: ready to serve client requests
2021-12-20 08:31:38.074404 N | embed: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
2021/12/20 08:31:38 [INFO] Waiting for k3s to start
time="2021-12-20T08:31:38Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
time="2021-12-20T08:31:38Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/d57e75cb49c3cfd88307a8669e8adcf6b7740b66d6125a45c00aaa54301a5746"
2021/12/20 08:31:39 [INFO] Waiting for k3s to start
exit status 1
2021/12/20 08:31:46 [FATAL] k3s exited with: exit status 1
Oats87 (Contributor) commented Dec 20, 2021

@presidenten Are you able to run `docker cp <rancher-container>:/var/lib/rancher/k3s.log .` to copy k3s.log out of the Rancher container, and inspect those logs to determine why K3s won't start?

Alternatively, you can upload that log (or the crash from it) here.
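
(With the container name rancher from the reproduction command above, that would be:)

docker cp rancher:/var/lib/rancher/k3s.log .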

presidenten (Author) commented:

@Oats87, sure, I'll do so tomorrow, since I'm stuck in meetings all day today.

presidenten (Author) commented Dec 21, 2021

@Oats87 I found some space in the schedule today after all.

I start Rancher with:

docker run -d --privileged --restart=unless-stopped --name rancher -p 4080:80 -p 4443:443 rancher/rancher:v2.6.2

Then I copied the log after a while. After that I deleted the Docker container and updated Docker Desktop to 4.3.1, so I continued with the same Docker image.

For the 4.3.1 case I also set the Rancher log level to debug after starting the container:

docker exec rancher loglevel --set debug

k3s-docker-desktop-4.2.0.log
k3s-docker-desktop-4.3.1.log
docker-logs-rancher-docker-desktop-4.3.1.log

I noticed these lines today, which I seem to have missed when I opened the issue:

2021/12/21 14:48:56 [INFO] Running etcd --data-dir=management-state/etcd --heartbeat-interval=500 --election-timeout=5000
running etcd on unsupported architecture "arm64" since ETCD_UNSUPPORTED_ARCH is set
2021-12-21 14:48:56.147450 W | pkg/flags: unrecognized environment variable ETCD_URL=https://github.com/etcd-io/etcd/releases/download/v3.4.15/etcd-v3.4.15-linux-arm64.tar.gz

Seems pretty sus to me.

I couldn't find it in k3s.log, but I saw it in the terminal while looking at the logs, so I added those as well.

It's strange, because I run the exact same command to start Rancher.

For completeness I also deleted the Docker image and pulled a new one with `docker image pull rancher/rancher:v2.6.2 --platform linux/arm64/v8`, but I still get the same result of Rancher crashing all the time.

presidenten (Author) commented Dec 21, 2021

So I also did:

docker image pull rancher/rancher:v2.6.2 --platform linux/x86_64

Then I redid the test with log level debug, but I still hit the same issue.

Here are the logs:
k3s-docker-desktop-4.3.1-image-archx86.log

docker-logs-rancher-docker-desktop-4.3.1-archx86.log

@snasovich snasovich added the team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support label Dec 22, 2021
@snasovich snasovich added this to the v2.6.4 - Triaged milestone Dec 22, 2021
throrin19 commented:

Same problem here on a MacBook Pro 2016 with Docker Desktop 4.3.2.

Oats87 (Contributor) commented Jan 3, 2022

This appears to be occurring with Docker Desktop > 4.3.0, as it has moved to cgroups v2, which is not directly supported by Rancher yet.
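
(A quick way to confirm which cgroup version the Docker engine is using; the format template assumes Docker 20.10+, which exposes CgroupVersion in docker info:)

docker info --format '{{.CgroupVersion}}'
# prints 2 on Docker Desktop >= 4.3.0, 1 on earlier versions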

The options to fix/work around this are:

  1. Build a custom rancher/rancher container with a modified entrypoint.sh that will evacuate the root cgroup. Instructions for this are below in this comment, but this is NOT a supported workaround and can easily lead to other problems. It has only been tested with v2.6.3 and will not work with v2.5.x versions of Rancher, as they currently use a version of K3s that is too old to support cgroups v2.
  2. Downgrade to a Docker Desktop version < 4.3.0.
  3. Use Rancher Desktop, as it still operates with cgroups v1 and will do so until Rancher supports cgroups v2.

It seems the best mitigation is going to be logic that attempts to evacuate the root cgroup when running in a containerized environment... and this will likely need to land in norman and get bumped into Rancher.

It's relatively easy to create a custom rancher/rancher container that will evacuate the root cgroup, thus allowing Rancher to start up.

Create an entrypoint.sh with evacuation logic cribbed from https://github.com/rancher/k3d/pull/579/files#diff-71e760f22ea8192fe65294b2330d4bd29fc3888fbf283ba4ac69fda1af3878dd and mark it executable, i.e. chmod +x entrypoint.sh:

#!/bin/bash
set -e

# Refuse to run unprivileged outside of Kubernetes; K3s needs /dev/kmsg etc.
if [ ! -e /run/secrets/kubernetes.io/serviceaccount ] && [ ! -e /dev/kmsg ]; then
    echo "ERROR: Rancher must be run with the --privileged flag when running outside of Kubernetes"
    exit 1
fi
rm -f /var/lib/rancher/k3s/server/cred/node-passwd
if [ -e /var/lib/rancher/etcd ] && [ ! -e /var/lib/rancher/k3s/server/db/etcd ]; then
  mkdir -p /var/lib/rancher/k3s/server/db
  ln -sf /var/lib/rancher/etcd /var/lib/rancher/k3s/server/db/etcd
  echo -n 'default' > /var/lib/rancher/k3s/server/db/etcd/name
fi
if [ -e /var/lib/rancher/k3s/server/db/etcd ]; then
  # Temporarily disable errexit so a failed cluster reset is reported
  # instead of silently aborting the whole entrypoint (with set -e active,
  # the $? check below would otherwise be unreachable).
  set +e
  k3s server --cluster-init --cluster-reset &> ./k3s-cluster-reset.log
  if [ $? -ne 0 ]; then
    echo "ERROR:" && cat ./k3s-cluster-reset.log
    rm -f /var/lib/rancher/k3s/server/db/reset-flag
  fi
  set -e
fi
if [ -x "$(command -v update-ca-certificates)" ]; then
  update-ca-certificates
fi
if [ -x "$(command -v c_rehash)" ]; then
  c_rehash
fi
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
  echo "[$(date -Iseconds)] [CgroupV2 Fix] Evacuating Root Cgroup ..."
  # move the processes from the root group to the /init group,
  # otherwise writing subtree_control fails with EBUSY.
  mkdir -p /sys/fs/cgroup/init
  xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
  # enable controllers
  sed -e 's/ / +/g' -e 's/^/+/' <"/sys/fs/cgroup/cgroup.controllers" >"/sys/fs/cgroup/cgroup.subtree_control"
  echo "[$(date -Iseconds)] [CgroupV2 Fix] Done"
fi
exec tini -- rancher --http-listen-port=80 --https-listen-port=443 --audit-log-path=${AUDIT_LOG_PATH} --audit-level=${AUDIT_LEVEL} --audit-log-maxage=${AUDIT_LOG_MAXAGE} --audit-log-maxbackup=${AUDIT_LOG_MAXBACKUP} --audit-log-maxsize=${AUDIT_LOG_MAXSIZE} "${@}"

then create a Dockerfile:

FROM rancher/rancher:v2.6.3
COPY entrypoint.sh /usr/bin/entrypoint.sh

and running a docker build will produce a cgroups v2 compatible Rancher container.
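
For example (the image tag and host ports here are illustrative, not canonical):

docker build -t rancher-cgv2:v2.6.3 .
docker run -d --privileged --restart=unless-stopped --name rancher \
  -p 4080:80 -p 4443:443 rancher-cgv2:v2.6.3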

I have an example container at oats87/rancher:v2.6.3-cgv2, but I cannot stress enough that you should NOT use this container, as I cannot support it.

luctrate commented Mar 31, 2022

Any updates?
Same error on Debian 11:

2022/03/31 08:00:53 [INFO] Listening on /tmp/log.sock
2022/03/31 08:00:53 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6443/version?timeout=15m0s": dial tcp 127.0.0.1:6443: connect: connection refused
2022/03/31 08:00:55 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6443/version?timeout=15m0s": dial tcp 127.0.0.1:6443: connect: connection refused
2022/03/31 08:00:57 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6443/version?timeout=15m0s": dial tcp 127.0.0.1:6443: connect: connection refused
2022/03/31 08:00:59 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6443/version?timeout=15m0s": dial tcp 127.0.0.1:6443: connect: connection refused
2022/03/31 08:01:01 [INFO] Waiting for server to become available: the server is currently unable to handle the request
2022/03/31 08:01:03 [INFO] Waiting for server to become available: an error on the server ("apiserver not ready") has prevented the request from succeeding
2022/03/31 08:01:05 [INFO] Waiting for server to become available: an error on the server ("apiserver not ready") has prevented the request from succeeding
2022/03/31 08:01:07 [INFO] Waiting for server to become available: an error on the server ("apiserver not ready") has prevented the request from succeeding
2022/03/31 08:01:09 [INFO] Waiting for server to become available: an error on the server ("apiserver not ready") has prevented the request from succeeding
2022/03/31 08:01:19 [FATAL] k3s exited with: exit status 1

@snasovich snasovich modified the milestones: v2.6.5, v2.6.x Apr 15, 2022
yjqg6666 commented:

Any update on this issue? It is stopping me from trying Rancher out.

xrow commented Apr 28, 2022

The workaround also works on EL9:

podman run -d --restart=unless-stopped \
  --name rancher \
  -p 80:80 -p 443:443 \
  --privileged \
  docker.io/oats87/rancher:v2.6.3-cgv2

xrow commented Jul 21, 2022

I ended up using K3s, testing with the Rancher 2.6.7-rc3 Helm chart, on CentOS 9.
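
(A sketch of what such a Helm install typically looks like; the hostname is illustrative, cert-manager is assumed to already be installed, and --devel is assumed to be needed since 2.6.7-rc3 is a pre-release chart version.)

helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
helm repo update
kubectl create namespace cattle-system
helm install rancher rancher-latest/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com \
  --devel --version 2.6.7-rc3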

@Jono-SUSE-Rancher Jono-SUSE-Rancher added this to the v2.x - Backlog milestone Oct 30, 2023