Rancher container is restarting every 15 seconds on Ubuntu 22.04 #36238

Closed
fedos3d opened this issue Jan 23, 2022 · 77 comments
Labels: kind/bug (Issues that are defects reported by users or that we know have reached a real release), priority/0, QA/XS, team/hostbusters (The team that is responsible for provisioning/managing downstream clusters + K8s version support)

Comments

@fedos3d

fedos3d commented Jan 23, 2022

Rancher Server Setup

  • Rancher version: v2.6.3
  • Installation option (Docker install/Helm Chart): Docker install

Describe the bug

After I restarted my Ubuntu VM, my Rancher Docker container started restarting every 15 seconds.

Here is the log:

2022/01/22 11:40:19 [INFO] Rancher version v2.6.3 (3c1d5fac3) is starting
2022/01/22 11:40:19 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2022/01/22 11:40:19 [INFO] Listening on /tmp/log.sock
2022/01/22 11:40:19 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6443/version?timeout=15m0s": dial tcp 127.0.0.1:6443: connect: connection refused
2022/01/22 11:40:21 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6443/version?timeout=15m0s": dial tcp 127.0.0.1:6443: connect: connection refused
2022/01/22 11:40:23 [INFO] Waiting for server to become available: the server is currently unable to handle the request
2022/01/22 11:40:25 [INFO] Waiting for server to become available: an error on the server ("apiserver not ready") has prevented the request from succeeding
2022/01/22 11:40:27 [INFO] Waiting for server to become available: an error on the server ("apiserver not ready") has prevented the request from succeeding
2022/01/22 11:40:42 [INFO] Running in single server mode, will not peer connections
2022/01/22 11:40:43 [INFO] Applying CRD features.management.cattle.io
2022/01/22 11:41:29 [ERROR] unable to update feature harvester in initialize features: Put "https://127.0.0.1:6443/apis/management.cattle.io/v3/features/harvester": EOF
2022/01/22 11:41:29 [FATAL] k3s exited with: exit status 255

Any advice on how to make it work again?
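
For anyone comparing their own logs with the excerpt above, the recurring messages can be tallied with a small filter. This is a minimal sketch, not an official Rancher tool: `summarise_rancher_log` is a hypothetical helper name, and the three patterns are simply copied from the log lines in this report.

```shell
#!/usr/bin/env bash
# Summarise a Rancher container log read from stdin.
# Hypothetical helper: the patterns are taken from the log excerpt in this thread.
summarise_rancher_log() {
    awk '
        /connection refused/  { refused++ }
        /apiserver not ready/ { not_ready++ }
        /\[FATAL\]/           { fatal++; last_fatal = $0 }
        END {
            # Uninitialised awk counters print as 0 with %d.
            printf "connection_refused=%d\n", refused
            printf "apiserver_not_ready=%d\n", not_ready
            printf "fatal=%d\n", fatal
            if (fatal > 0) print "last_fatal: " last_fatal
        }
    '
}
```

Assuming the container is named "rancher", it could be fed with `sudo docker logs rancher 2>&1 | summarise_rancher_log`. A non-zero `fatal` count together with a `k3s exited` message matches the failure described here.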

@kha7iq

kha7iq commented Jan 25, 2022

Having the same issue running v2.6.3 in a single-node Docker container.
If you stop the container and start it again, it fails.

@Curtis7015

This has been happening to me for about a month and I cannot solve it. I can restore snapshots via Proxmox and it'll be good for about 3 days and randomly back to the reboot loop. Prior to the random event, everything on the node looks healthy.

@fedos3d
Author

fedos3d commented Jan 26, 2022

> This has been happening to me for about a month and I cannot solve it. I can restore snapshots via Proxmox and it'll be good for about 3 days and randomly back to the reboot loop. Prior to the random event, everything on the node looks healthy.

Same thing.

@Acidherr

Acidherr commented Jan 27, 2022

This has also been happening to me. I installed version 2.4 and it worked, but the latest stable release has the behavior described above.

@Curtis7015

Curtis7015 commented Jan 27, 2022

> This has also been happening to me. I installed Version 2.4 and it worked but the latest stable release has the behavior described above.

I'm going to give this a shot. Would you mind dropping the tag you used?

@Acidherr

Sorry I wasn't more specific, I installed using the docker container with the tag v2.4.9. It works!

@kha7iq

kha7iq commented Jan 29, 2022

> Sorry I wasn't more specific, I installed using the docker container with the tag v2.4.9. It works!

Unfortunately, using older versions does not help me: my downstream clusters are k8s 1.22, so I can't import them.

@themowski

Our team is seeing this behavior as well with single-node Rancher v2.6.3-patch1 running on AlmaLinux 8.5 (one of the Enterprise Linux distros).

For the Rancher devs -- this is basically the same behavior reported in #35892 and #36047, which my team is also seeing -- the error messages we get alternate between the ones reported here and the ones in those tickets. The comments on #35892 seem to indicate that this is a problem with updates to various systemd packages in EL 8.5, and that the problems did not occur in EL 8.4.

@h0jeZvgoxFepBQ2C

h0jeZvgoxFepBQ2C commented Apr 8, 2022

We have the same problem on a fresh Debian 11 single node server :(

@crushing-stegosaur

Same problem here on a fresh OpenSUSE server for v2.6. Downgrading to v2.4.9 worked.

@IanJSaul

^ Worked for me as well.

I've briefly tried Rancher so many times in the 1st quarter of this year, and never managed to get one to stay up long enough to USE it.

WHY is this STILL going on?

@koraykutanoglu

Hello, I have the same problem on Ubuntu 22.04 LTS and Rancher is not working.

2022/04/26 20:33:36 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6443/version?timeout=15m0s": dial tcp 127.0.0.1:6443: connect: connection refused

@kmarji

kmarji commented Apr 28, 2022

I'm having the same issue. Downgrading to version 2.5 works, but some of my clusters are running 1.22 and I cannot import them.
After SUSE bought Rancher it has become the worst product ever. Version 2.6 is very buggy and unstable. I'm willing to rebuild my cluster on version 1.21 rather than move to 2.6.

@Indirectelex

Waiting for server to become available: an error on the server ("apiserver not ready") has prevented the request from succeeding

@wbarnard81

wbarnard81 commented May 4, 2022

> hello i have this same problem on ubuntu 22.04 LTS and rancher is not working.
>
> 2022/04/26 20:33:36 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6443/version?timeout=15m0s": dial tcp 127.0.0.1:6443: connect: connection refused

Same here. I will try Ubuntu 20.04 LTS, as that worked last time.

Edit: Downgraded to Ubuntu 20.04 LTS and that solved the problem for me. So the issue with Ubuntu 22.04 LTS needs to be fixed.

@lucserre

lucserre commented May 5, 2022

Fresh Debian 11 images. Same issue.

@SyriiAdvent

SyriiAdvent commented May 6, 2022

VMware, Ubuntu Server 20.10 to 22.10: same issues.
Even tried Rancher v2.5.x: all the same issues.

@IanJSaul

IanJSaul commented May 7, 2022

Ubuntu 20.04 - no issues with all latest releases of Rancher. This has proven to be the solution for all my issues so far.

@rsteckler

Ubuntu 18.04
Docker 20.10.15
Tried rancher latest, tried 2.6.4, tried 2.5.2.

Same thing every time. It runs long enough to get itself set up, then restarts every few minutes.

I can't understand how this has been happening for months with no engagement from the devs?

@GuilhermeViterboGalvao

The solution for me was changing from Ubuntu to CentOS 7 (from AWS Marketplace).

I'm using AWS and installed Rancher as a single node (for the main/local cluster).

Here are the steps I followed.

1-) Install Docker on CentOS:
sudo yum install docker

2-) Install vim editor:
sudo yum install vim

3-) Create the file daemon.json:
sudo vi /etc/docker/daemon.json

3.1-) File content:
{ "group": "dockerroot" }

4-) Restart docker:
sudo systemctl restart docker

5-) Enable docker:
sudo systemctl enable docker

6-) Add centos user to group dockerroot:
sudo usermod -aG dockerroot centos

7-) Last but not least, install Rancher:
sudo docker run -d --restart=unless-stopped -p 80:80 -p 443:443 -v /rancher:/var/lib/rancher --privileged rancher/rancher

@pdavies91

Downgrading from Debian 11 to Debian 10 works fine.

@FelipeSanchezCalzada

I also tried Debian 10 and it works. With Debian 11 it doesn't work.

@mrsmall

mrsmall commented May 25, 2022

Rancher 2.6.5 on Ubuntu 22.04 (Hetzner Cloud VM): same problem.
Reinstalled the VM with Debian 10 and Rancher works fine.

@ognjen011

I had this issue on Ubuntu 22.04 and fixed it by editing the /etc/default/grub file. I added these values to GRUB_CMDLINE_LINUX:

GRUB_CMDLINE_LINUX="cgroup_memory=1 cgroup_enable=memory swapaccount=1 systemd.unified_cgroup_hierarchy=0"

then did:

sudo update-grub
sudo reboot

This resolved my problem, but I spent an entire day on it. Maybe it helps someone else as well. I think the same thing works for Debian 11.
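
The `systemd.unified_cgroup_hierarchy=0` parameter switches systemd back to the legacy cgroup v1 hierarchy, so it can help to confirm which hierarchy a host is on before editing GRUB. This is a minimal sketch, assuming the usual convention that `stat -fc %T /sys/fs/cgroup` reports `cgroup2fs` on a unified (v2) host and `tmpfs` on a v1 or hybrid host. Both function names are hypothetical helpers, not part of any tool.

```shell
#!/usr/bin/env bash
# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup mode.
# Assumption: "cgroup2fs" means unified (v2), "tmpfs" means v1 or hybrid.
cgroup_mode_from_fstype() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# Succeed when a kernel command line (the content of /proc/cmdline)
# already carries the switch back to the legacy hierarchy.
has_legacy_cgroup_flag() {
    case " $1 " in
        *" systemd.unified_cgroup_hierarchy=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On a live host these could be called as `cgroup_mode_from_fstype "$(stat -fc %T /sys/fs/cgroup)"` and `has_legacy_cgroup_flag "$(cat /proc/cmdline)"`. After the reboot described above, the first would be expected to report `v1` and the second to succeed.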

@Kemichal

@ognjen011 Thank you, this GRUB config worked for me on Ubuntu 21.10.

@NotNorom

NotNorom commented Jun 1, 2022

After days of struggling with the same problem on 22.04 the GRUB_CMDLINE_LINUX thing fixed it :D Thanks @ognjen011 <3

@andrijaaspire

Thank you @ognjen011 ! After days of struggling with Ubuntu 22.04 this fixed the issue!

@ragnaros2046

> I had this issue on Ubuntu 22.04 and fixed it by editing the /etc/default/grub file. I added these values to GRUB_CMDLINE_LINUX:
>
> GRUB_CMDLINE_LINUX="cgroup_memory=1 cgroup_enable=memory swapaccount=1 systemd.unified_cgroup_hierarchy=0"
>
> then did:
>
> sudo update-grub
> sudo reboot
>
> This resolved my problem, but I spent an entire day on it. Maybe it helps someone else as well. I think the same thing works for Debian 11.

It works for Debian 11 (debian-11.3.0-amd64-DVD-1.iso). You saved my day, bro!

@lerminou

@Sahota1225 not only Ubuntu is impacted, Redhat too.

@kinarashah
Member

Available to test with v2.7-head once https://drone-publish.rancher.io/rancher/rancher/8317/1/1 is green

@houshym

houshym commented Sep 19, 2022

@kinarashah It works for me. Thanks.

@wildcommunist

This comment was marked as off-topic.

@thaneunsoo
Contributor

thaneunsoo commented Sep 21, 2022

Test Environment:

Rancher version: v2.7-head
Rancher cluster type: HA
Docker version: 20.10

Downstream cluster type: Custom


Testing:

Tested this issue with the following steps:

  1. On Ubuntu 22.04 instance install Rancher using docker
  2. Verify install is successful
  3. Verify that rancher container does not restart every 15 seconds
  4. Reboot the instance
  5. Verify that rancher container does not restart every 15 seconds

Result
The Rancher container is no longer restarting on Ubuntu 22.04.

However, the Rancher container is still not running successfully: I now see the following error in the docker logs and am unable to reach Rancher.

[ERROR] available chart version (1.0.5+up0.2.6) for rancher-webhook is less than the min version (1.0.6+up0.2.7-rc4) 
2022/09/21 21:33:09 [ERROR] Failed to find system chart rancher-webhook will try again in 5 seconds: no chart name found

I am tracking the issue here and will close this ticket once it is resolved and Rancher is able to run successfully.

@thaneunsoo
Contributor

Test Environment:

Rancher version: v2.7-head eab28dd
Rancher cluster type: single-node docker install
Docker version: 20.10


Testing:

Tested this issue with the following steps:

  1. On Ubuntu 22.04 instance install Rancher using docker
  2. Provision RKE1 cluster
  3. Provision RKE2 cluster

Result - Pass
The Rancher container is running successfully, and the RKE1 and RKE2 clusters come up fine as well.
(three screenshots attached)

zube bot closed this as completed Oct 4, 2022

@jloisel

jloisel commented Oct 19, 2022

Backport to 2.6.x ?

@kinarashah
Member

@jloisel The PR has also been backported to v2.6.x #38925 and should be fixed in v2.6.9.
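
To check whether a given Rancher tag is at or above the release that carries the backport, a version comparison can be sketched as follows. This is a minimal sketch under stated assumptions: `version_at_least` is a hypothetical helper, it relies on GNU `sort -V`, and it assumes that every release at or above v2.6.9 includes the fix.

```shell
#!/usr/bin/env bash
# Succeed when version $1 is at or above version $2.
# Hypothetical helper; relies on GNU sort's -V (version sort).
# A leading "v" is stripped from both arguments before comparing.
version_at_least() {
    local have="${1#v}" want="${2#v}"
    # The smaller of the two versions sorts first; if that is the wanted
    # minimum, then the installed version is equal or newer.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}
```

For example, `version_at_least v2.6.3 v2.6.9 || echo "predates the backport"` would flag the version from the original report. Pre-release suffixes such as `-rc1` are not handled specially here.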

zube bot removed the [zube]: Done label Jan 3, 2023

@HaigNalbandian

Still not working on EL9, correct? I can't get either stable or v2.6.9 running on Rocky Linux 9.

@sachaw

sachaw commented Apr 21, 2023

> Still not working in el9, correct? I can't get stable nor v2.6.9 running on Rocky Linux 9

Same issue on EL 9.1, even with the latest 2.7.3-rc1. Any workarounds?

@hhruszka

hhruszka commented May 21, 2023

I managed to resolve this issue with the following procedure. It turned out that iptables was not installed and the required modules were not loaded into the kernel. I tested it on Rocky 8.7 with Rancher 2.7.3.

SOLUTION:

  1. Install iptables.
  2. Make sure that the missing iptables modules are loaded into the kernel. Add both modules to the file:

sudo tee /etc/modules-load.d/modules.conf >/dev/null <<EOF
iptable_nat
iptable_filter
EOF

and reboot your host:

sudo reboot

  3. Execute:

sudo docker run -d --restart=unless-stopped -p 80:80 -p 443:443 --privileged --name=rancher rancher/rancher

TROUBLESHOOTING:

Please note that I named the Rancher container "rancher" (the --name=rancher flag of docker run).

Create a directory on your host:

sudo mkdir /rancher
sudo chmod 777 /rancher

Run the Rancher container with the following command. It is essential to mount volumes for debugging purposes.

sudo docker run -d --restart=unless-stopped -p 80:80 -p 443:443 -v /rancher:/var/lib/rancher --privileged --net=host --name=rancher rancher/rancher

Execute:

docker logs -f rancher

If Rancher fails to start with the connection error "[INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused", investigate /rancher/k3s.log on your host.

I found that iptables was missing on the host.
After installing iptables I had to manually load two modules that k3s complained about in k3s.log:

sudo modprobe iptable_nat
sudo modprobe iptable_filter

Add both modules to the file:

sudo tee /etc/modules-load.d/modules.conf >/dev/null <<EOF
iptable_nat
iptable_filter
EOF

and reboot your host:

sudo reboot

Then I hit an x509 certificate issue, which was resolved by the following commands (please note that I named my container "rancher"):

sudo docker exec -it rancher sh -c "rm /var/lib/rancher/k3s/server/tls/dynamic-cert.json"
sudo docker exec -it rancher k3s kubectl --insecure-skip-tls-verify=true delete secret -n kube-system k3s-serving
sudo docker restart rancher

When you finish troubleshooting and have the missing iptables modules loaded into the kernel, you can run the container as shown in the solution section, without the volume mount and host networking. The invalid x509 certificate issue could be a side effect of troubleshooting.
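
Whether the two modules are actually loaded after the reboot can be checked against the kernel's module listing. This is a minimal sketch: `missing_modules` is a hypothetical helper that reads a file in the `/proc/modules` format, where the module name is the first field on each line. A module compiled directly into the kernel does not appear in that listing, so a reported miss is a hint and not proof.

```shell
#!/usr/bin/env bash
# Print every requested module that is absent from a /proc/modules-style
# listing. Hypothetical helper.
# Usage: missing_modules <listing-file> <module>...
missing_modules() {
    local listing="$1"
    shift
    local mod
    for mod in "$@"; do
        # awk exits 0 only if some line has the module name as its first field.
        if ! awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$listing"; then
            echo "$mod"
        fi
    done
}
```

On the host this could be run as `missing_modules /proc/modules iptable_nat iptable_filter`. Empty output means both names were found in the listing.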

@jackyting825

@hhruszka's procedure above works fine on Rocky Linux 9.3.

@prbakhsh

@hhruszka your solution works for my oracle linux 9.3
