Microk8s in permanent failed state on reboot ( raspberry pi 4, affects single and multi node cluster ) #2204

Closed
horvatic opened this issue Apr 26, 2021 · 19 comments · Fixed by #2207

@horvatic

Running on a Raspberry Pi 4 with a fresh install of Ubuntu 20.04 LTS. After a reboot, MicroK8s can end up in a permanent failed state. When this happens, the following message appears: microk8s is not running. Use microk8s inspect for a deeper inspection.

Running microk8s inspect shows no errors.
Running microk8s start, or microk8s stop; microk8s start, does not solve the issue.
Running microk8s reset does not solve the issue either.
Running microk8s stop before the reboot seems to help, but it still fails on some snap updates.
Rebooting again does not solve the issue either.

The issue seems to happen at random: I can reboot once or twice without hitting the failed state. Once a node is in the failed state, there is no way to recover other than reinstalling MicroK8s.

@ktsakalozos
Member

Hi @horvatic, could you share an inspection tarball?

@horvatic
Author

horvatic commented Apr 26, 2021

> Hi @horvatic, could you share an inspection tarball?

I reinstalled MicroK8s, so let me try to trigger the failed state again.

@horvatic
Author

horvatic commented Apr 26, 2021

@horvatic
Author

Another tarball from another node:

I uninstalled my current setup, deleted my snap/microk8s folder, and reinstalled. After reinstalling I confirmed it was working, then rebooted. After the reboot the node was in the failed state. NOTE: HA was set up, but the issue also happens when HA is not set up.

inspection-report-20210426_181428.tar.gz

@ktsakalozos
Member

Looking at the containerd logs (journalctl -u snap.microk8s.daemon-containerd) I see this:

Apr 26 17:40:42 cedar microk8s.daemon-containerd[2141]: ++ date -r /proc/1 +%s
Apr 26 17:40:42 cedar microk8s.daemon-containerd[2138]: + boot_time=1
Apr 26 17:40:42 cedar microk8s.daemon-containerd[2138]: + echo 'Last time service started was 1619458244 and the host booted at 1'
Apr 26 17:40:42 cedar microk8s.daemon-containerd[2138]: Last time service started was 1619458244 and the host booted at 1

Could you confirm that date -r /proc/1 +%s returns 1?

@horvatic
Author

What will that tell us? Is it just an issue with cedar? I attached tarballs from redwood and another node as well.

@ktsakalozos
Member

When containerd starts, it tries to recover any running containers. It does so by looking at the container state kept in /var/snap/microk8s/common/run/containerd. After a reboot there are no containers running, so we need to clear any pre-existing container state. To find out when the machine booted, we check the modification time of /proc/1 (process 1). In your case it seems date -r /proc/1 +%s reports that process 1 was created at time 1. Maybe getting the boot time from process 1 is not safe, or it might be an issue specific to the RPi. Either way we need to fix this.
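The check described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in utils.sh: the function name `should_clear_state` is hypothetical, and the "healthy" boot timestamp is a made-up value for comparison. Only 1619458244 and the boot time of 1 come from the log.

```shell
#!/usr/bin/env bash
# Sketch of the stale-state check (hypothetical helper, not the utils.sh code).

# should_clear_state LAST_START BOOT_TIME
# Succeeds (exit 0) when the host booted after the service last started,
# which means any saved container state predates the reboot.
should_clear_state() {
  local last_start="$1" boot_time="$2"
  [ "$boot_time" -gt "$last_start" ]
}

# Hypothetical healthy host: the boot happened after the last service start,
# so the old state is recognised as stale.
if should_clear_state 1619458244 1619460000; then
  echo "stale state: clear it"
fi

# Values from the log: a boot time of 1 is never newer than the last start,
# so the stale state is kept and containerd tries to recover dead containers.
if ! should_clear_state 1619458244 1; then
  echo "state kept: boot time looks older than the last start"
fi
```

With a boot time of 1 the comparison can never succeed, which would explain why the node stays in the failed state across further reboots.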

@horvatic
Author

> When containerd starts, it tries to recover any running containers. It does so by looking at the container state kept in /var/snap/microk8s/common/run/containerd. After a reboot there are no containers running, so we need to clear any pre-existing container state. To find out when the machine booted, we check the modification time of /proc/1 (process 1). In your case it seems date -r /proc/1 +%s reports that process 1 was created at time 1. Maybe getting the boot time from process 1 is not safe, or it might be an issue specific to the RPi. Either way we need to fix this.

Ah, I see. Do you want me to run the command on all my nodes: redwood, cedar, and willow?

@ktsakalozos
Member

ktsakalozos commented Apr 26, 2021

I am fairly certain you will get 1 if you call date -r /proc/1 +%s.

Maybe this line [1] needs to be updated with a better way to detect the boot time. Is there anything special about your setup (kernel, hardware, distribution)? I need to reproduce this. Maybe grep btime /proc/stat is a better choice.

[1] https://github.com/ubuntu/microk8s/blob/master/microk8s-resources/actions/common/utils.sh#L683
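As a rough sketch of the btime idea: on Linux, /proc/stat contains a line of the form `btime <seconds>` giving the boot time in seconds since the epoch. The helper below is hypothetical (the name `boot_time_from_stat` and the optional file argument are assumptions for illustration) and is not necessarily how the fix was implemented.

```shell
#!/usr/bin/env bash
# boot_time_from_stat [FILE]
# Prints the boot time (seconds since the epoch) taken from the "btime" line
# of a /proc/stat-style file. FILE defaults to /proc/stat.
boot_time_from_stat() {
  local stat_file="${1:-/proc/stat}"
  # Print the second field of the first line whose first field is "btime".
  awk '$1 == "btime" { print $2; exit }' "$stat_file"
}

# Usage on a Linux host:
#   boot_time=$(boot_time_from_stat)
```

Reading the value from the kernel's own counter avoids relying on the timestamp of /proc/1, which is what returned 1 on the nodes in this report.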

@horvatic
Author

horvatic commented Apr 26, 2021

Here are the results

horvatic@redwood:~$ date -r /proc/1 +%s
1
horvatic@cedar:~$ date -r /proc/1 +%s
1
horvatic@willow:~$ date -r /proc/1 +%s
1

I also did about five reboots to confirm these numbers are correct, and checked the status of MicroK8s each time. If it was not running, I ran microk8s start and then checked the status again.

It looks like if you keep rebooting and running microk8s start, it comes back online about 2 out of 5 times.

NOTE: This is not a matter of waiting long enough either, as I had a node in this state for six days.

@horvatic
Author

@ktsakalozos My setup is

3 Raspberry Pis, each with 4 GB of RAM and a 64 GB SD card, running Ubuntu 20.04 LTS on a wired connection. Nothing else is installed except MicroK8s.

@ktsakalozos
Member

@horvatic, referenced in this issue you will find a PR with a fix. As soon as the PR gets merged it should be on 1.21/edge within the day. A build of the .snap file is available now as an artifact in [1]. Any feedback would be appreciated.

As a mitigation for now you can either:

  • clean the containerd state with rm /var/snap/microk8s/common/run/containerd/* when the machine boots, or
  • move the containerd state to a tmpfs location. This can be done by editing /var/snap/microk8s/current/args/containerd and setting --state /run/containerd.

[1] https://github.com/ubuntu/microk8s/pull/2207/checks?check_run_id=2442807158
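The first mitigation above could be wrapped in a small helper like the one below. This is only a sketch: the function name `clear_containerd_state` is hypothetical, and it uses find rather than the rm command quoted above so that hidden files and subdirectories are also removed while the directory itself is kept.

```shell
#!/usr/bin/env bash
# clear_containerd_state DIR
# Removes everything inside DIR (the containerd state directory) but keeps
# DIR itself. Returns 1 if DIR is empty or is not an existing directory.
clear_containerd_state() {
  local state_dir="$1"
  # Refuse to run on an empty argument or a missing directory.
  [ -n "$state_dir" ] && [ -d "$state_dir" ] || return 1
  # -mindepth 1 skips the directory itself; -delete removes its contents
  # depth-first, so nested directories are emptied before being removed.
  find "$state_dir" -mindepth 1 -delete
}

# Usage, as root and before containerd starts:
#   clear_containerd_state /var/snap/microk8s/common/run/containerd
```

How to hook this into the boot sequence (for example a unit ordered before the containerd service) is left open here, since the thread does not specify it.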

@horvatic
Author

horvatic commented Apr 27, 2021

@ktsakalozos I'll install the microk8s.snap package tonight and test it out. Will post results once testing is done!

@horvatic
Author

horvatic commented Apr 27, 2021

@ktsakalozos

I am having this error:
error: cannot install snap file: snap "microk8s" supported architectures (amd64) are incompatible with this system (arm64)

Can you link me to the ARM build?

Installed using: sudo snap install --dangerous --classic microk8s.snap

@ktsakalozos
Member

> I am having this error:
> error: cannot install snap file: snap "microk8s" supported architectures (amd64) are incompatible with this system (arm64)

Unfortunately you need an ARM64 build, and the GitHub action produces an AMD64 snap.
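To avoid this mismatch before downloading an artifact, the host architecture can be checked first. The sketch below maps the machine name printed by `uname -m` to the labels used in the snap error message. The function name `snap_arch_for` is hypothetical, and only the two architectures mentioned in this thread are covered.

```shell
#!/usr/bin/env bash
# snap_arch_for MACHINE
# Maps a `uname -m` machine name to the architecture label seen in the snap
# error message. Prints "unknown" and returns 1 for anything else.
snap_arch_for() {
  case "$1" in
    x86_64)  echo amd64 ;;
    aarch64) echo arm64 ;;
    *)       echo unknown; return 1 ;;
  esac
}

# Usage:
#   snap_arch_for "$(uname -m)"
```

On a 64-bit Ubuntu install on a Raspberry Pi 4, `uname -m` reports aarch64, so an amd64-only snap file will be rejected as shown above.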

@horvatic
Author

@ktsakalozos I will be unable to test until an arm build is produced. Should I just wait until the edge release?

@ktsakalozos
Member

@horvatic I just merged the fix. We should have an arm64 build within the day.

@ktsakalozos ktsakalozos reopened this Apr 27, 2021
@horvatic
Author

@ktsakalozos OK, I'll test it tomorrow at 5 pm CST and report the results.

@horvatic
Author

@ktsakalozos All tested

Rebooted 3 times, and MicroK8s started successfully after every reboot!

NOTE: There is a delay of about 10 seconds after a reboot. If you run microk8s status immediately after a reboot, it will show that MicroK8s isn't running. If this issue gets reported again, the user may just need to wait 10 seconds or so. I do not count this as a bug, as the services take time to start up.
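For scripts that run right after boot, the startup delay noted above can be absorbed with a small polling helper. This is a generic sketch: `wait_until` is a hypothetical name, and the usage line assumes that microk8s status exits non-zero while the services are still starting. MicroK8s also offers microk8s status --wait-ready for the same purpose.

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND about once per second until it succeeds. Returns 1 if it
# is still failing after roughly TIMEOUT_SECONDS seconds.
wait_until() {
  local timeout="$1"; shift
  local waited=0
  until "$@" >/dev/null 2>&1; do
    # Give up once the time budget is used.
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
}

# Usage (assumes a non-zero exit status while MicroK8s is not yet running):
#   wait_until 60 microk8s status || echo "MicroK8s did not come up in time"
```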
