snap auto-refresh breaks cluster #1022
Comments
I've left one worker node in the stuck state in case that's useful for troubleshooting, and have now come across a well-known issue in that pods running on that |
Thank you for reporting this @eug48. I opened an issue/topic with the snap team at [1]. One note here is that you cannot hold snap refreshes forever ( [1] https://forum.snapcraft.io/t/snap-refresh-breaks-microk8s-cluster/15906 |
Thanks very much for raising that @ktsakalozos and correcting my incorrect assumption that |
@eug48 sorry for the trouble and thanks for the report. The fact that it hangs during the copy-data phase is curious. I think you mentioned you have one node in the bad state? It looks like the data in /var/snap/microk8s/ did not even start to get copied, i.e. the new snap's data dir did not even get created. Is this correct? |
@mvo5 yes, /snap/microk8s/1254 got created but in /var/snap/microk8s/ there is only 1176. Upon further investigation I've probably found the cause. I've been trying out rook-ceph and there is still a volume mounted with it:
However trying to
So this is already a complex and therefore rather brittle set-up, and I think having snap auto-refreshes added to the mix makes failure much more likely. Having an option for it to be turned off permanently so that users can upgrade manually and fix these kinds of problems would be great for production use. |
For anyone reading this with the same issue, TL;DR: Personally, the fact that |
I also have problems with my microk8s cluster that may be related to this issue. I am experiencing regular service failures that start almost exactly at 2am, at an interval I have not yet worked out (some days). The exact service is a VerneMQ MQTT server that uses MariaDB for authentication. The result is that authentication stops working after that (unknown) event. This event could relate to snap activity, as I discovered microk8s being restarted by snap around that time. I would also like to disable auto-refresh for microk8s to investigate the problem further and to prove my assumption. |
Hi @skobow, is it possible you were following the |
Hi @ktsakalozos, I am using 1.19/stable channel which currently installs v1.19.2 |
Nothing got released on |
Find the tarball attached.
The timestamps match when the service stopped working. Even though there might not be any updates, something happens anyway. Could that be related? |
Hi! Fyi: exactly the same happened tonight at the same time. @ktsakalozos what are these service commands and why are they run? |
I am not sure why snapd decides to restart MicroK8s. Could you attach the snapd log |
@ktsakalozos find the log attached! |
@ktsakalozos Any news on this topic? |
@skobow in the snapd.log I see these failures:
If you do not know what might be causing this we will go to https://forum.snapcraft.io/ and ask there. |
Hello. I believe I'm running into the same issue here as well.
It appears that whenever snap decides to auto-refresh, microk8s hangs on the copy step and never completes (taking the cluster down). The only things that seemed to be effective were either rebooting the machine or, as I recently discovered, aborting the auto-refresh:
However, eventually the auto-refresh happens again...
Reading this thread gave me the idea to look for unusual mounts that were lingering. While I wasn't able to find references to libceph, I did see that the NFS mounts from the NFS provisioner running in my cluster were erroring out.
Not sure if this is the actual cause but thought I'd share in case it was helpful to anyone. Did anyone find a resolution to this problem? |
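For anyone wanting to check for the lingering mounts described in this thread (a rook-ceph volume earlier, NFS provisioner mounts here), a minimal sketch follows. The helper name `network_mounts` and the list of filesystem types and device prefixes it matches are assumptions for illustration, not something snapd or MicroK8s defines; adjust them to the storage your cluster actually uses.

```shell
#!/usr/bin/env bash
# List mounts that look like network or cluster storage.
# Assumption: CephFS mounts have type "ceph", Ceph RBD volumes appear as
# /dev/rbd* devices, and NFS mounts have type "nfs" or "nfs4".
# Reads /proc/mounts by default; pass another file in the same format instead.
network_mounts() {
    local mounts_file="${1:-/proc/mounts}"
    # /proc/mounts fields: device, mount point, filesystem type, options, dump, pass
    awk '$1 ~ /^\/dev\/rbd/ || $3 ~ /^(ceph|nfs|nfs4)$/ { print $3, $2 }' "$mounts_file"
}

# Example: print "<type> <mount point>" for each candidate.
# network_mounts
```

This only lists candidates; whether any of them is what blocks the data copy still has to be confirmed on the affected node.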
I experienced the same issue. Pods remain in terminating states. New ones are created but fail to run due to a connectivity issue that I can't reproduce outside of the pod. Removing deployments and services before recreating them does not help. I had to reinstall the whole cluster after resetting all nodes.
|
Came across a probably related problem - snap refreshed microk8s and took the cluster down - all pods are in "sandbox changed" state then. Same goes for node reboots, btw.
It would be great to tackle one of those two. Happy to provide any log, as I can easily reproduce the sandbox-issue. I'd be careful with "reliable production-ready Kubernetes distribution" (from https://ubuntu.com/blog/introduction-to-microk8s-part-1-2) until then :) |
I also came across this issue today, also running rook-ceph in a 3 node cluster. The Rook/Ceph cluster works perfectly fine otherwise. |
Great, thanks snap autorefresh, you have crashed my entire cluster with this: |
In case it's helpful to anyone here, I was able to permanently disable the auto-refresh by disabling the snapd service.
Since doing this, I haven't had any stability issues with my cluster at all. This is my temporary fix until I have time to migrate my cluster to k3s or something that actually works. |
None of these are permanent fixes. snap will never implement disabling auto updates, so this will always be a problem. I just suggest not touching microk8s at all: use it only for development purposes, and ban it for all production purposes. |
The Kubernetes project ships a few releases every month [1]. These releases include security, bug and regression fixes. Every production grade Kubernetes distribution should have a mechanism to release such fixes even before they are released from upstream. For MicroK8s this mechanism is the snaps. Snaps allow us to keep your Kubernetes infrastructure up to date not only with fresh Kubernetes binaries but also update/fix integrations with underlying system and the Kubernetes ecosystem. If you do not want to take the risk of automated refreshes you have at least two options:
[1] https://github.com/kubernetes/kubernetes/releases |
@ktsakalozos the point of "security" is pretty moot if it breaks everything while updating it, it's defeating its own purpose. |
@ShadowJonathan, I am not sure why you mention only security and in quotes. Any update that breaks the cluster is defeating its own purpose. For anyone that wants to contribute back to this project we would be grateful if you could run non-production clusters with the candidate channels of the track you follow, for example |
Your point was that security is paramount and absolute, that it should be the excuse that makes this problem okay. It's not; it's an excuse that only exacerbates this problem and the whole of snap for servers in general. Snaps are fine for user apps, those can deal with being restarted, crashing, shutting down, again and again. Server apps need more delicacy, planning, and oversight. No admin/operator wants the developer to control when, how, and why something will update; they want complete control over their systems, and the snaps auto-updating feature is a complete insult to that.
I'm glad you agree, then? I'd rather have a cluster which is outdated and vulnerable, and possibly get hacked, if it's about my own oversight and my own fault (at least then i can tune it to my own schedule and my own system). With auto-update, and even the update window, that control is taken away from me, as now i have to scramble to make sure the eventual update will not fuck with my system, and then to do it manually, safe, and controlled to make sure it does not fuck over the data. (which it did for me, 1.2TB of scraping data, all corrupted because docker didnt want to close within 30 seconds, after which it got SIGKILLd) As a sysadmin, I control a developer's software, when, where, and how. The developer doesn't control my system, unless I tell it to. And even then, only on my own conditions. Snaps violated this principle, and that's why I'm incredibly displeased with them. |
I think the same... but they say it is production ready... really??? |
@lfdominguez branding |
But if microk8s moved away from snap, or used another method such as a self-contained executable (like k3s or k0s), I think that would be better; you would get out of the insane snap auto-refresh.... |
@a-hahn the intended use of MicroK8s is to run it with updates on. In the documentation of the product we want to recommend the intended use.
The issue under question here is why the intended use is to run Kubernetes with updates on. I hope this will become clear with an example. K8s v1.23.0 came out on the 7th of Dec, v1.23.1 on the 16th of Dec, v1.23.2 on the 19th of January, v1.23.3 on the 26th of January [2]. In parallel we have updates on underlying components such as containerd, runc etc. You can check what some of these updates are about in [1] and after reviewing them you may say that you are not interested in them and that is absolutely fine. But on our side, the side of a K8s distribution, we cannot just ignore updates. We will continue providing updates and we would like these updates to be applied to already existing deployments because we strongly believe if they are not the end product will degrade in quality. It would have been irresponsible for us to ship a product knowing that within 15 days would be out of date and thus would hurt Canonical's reputation as a first class solution provider much more. In the same way you allow updates on your OS, your phone, your tablet you are expected (intended use) to allow updates on this K8s distribution. I hope this clarifies why we take this position in respect to updates and why we are not "misguiding people". For the case where the user wants to have full control over what updates get applied and when the recommendation is to not disconnect the deployment from updates but to use a proxy to filter-out and block updates that you deem harmful. This process is described in our official docs [3], and, yes, it is not as easy as download the snap binary and install it but it is a process that makes you aware of updates. After all this, if you still want to download the snap binary and install it the way you suggest the snap ecosystem will not stop you. You are given the Also, please allow me this last point. 
I find the "paywall" comment unfair for the company that has been giving you all these years Ubuntu, cloud-init, Launchpad, MaaS, LXD, juju OpenStack, Multipass, MicroK8s, Charmed K8s, snapcraft and more. [1] https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.23.md |
@a-hahn Hi. In respect of the documentation, no matter how many warnings or comments are added to things, sadly many people don't read them. There is no hidden agenda, this is simply a matter of responsibility. If you buy a car and want to disconnect all the warning lights, that is also up to you, but you wouldn't expect to find the instructions to do so printed in the owners manual. Please add your method as a comment on discourse if you like. |
That'll be nothing more than black-holing the concern. |
What is the actual concern? |
@ktsakalozos So much time spent on arguing. This simple one-liner |
@a-hahn Hello Andreas, thank you for your engagement on this, and for making a contribution to the documentation for this project. I'm a Director of Engineering at Canonical, and my responsibility is documentation - everything to do with documentation and the way we produce it, for all Canonical products and projects (if you'd like to learn more about what my work is, I invite you to read The future of documentation at Canonical). Needless to say, I think that documentation is extremely important. I think that's generally true, but particularly so for an open-source software organisation like Canonical, and for open-source software projects themselves. That's because documentation is part of the contract with an open-source community. Documentation is one of the most important ways of sealing the relationship between a product and its community. Community members understandably feel strongly about documentation. You put time and effort into making an improvement to the MicroK8s documentation, and it was declined - for reasons that you don't agree with - by one of the maintainers. I can see that you are angry and frustrated that the contribution you made was reversed, and your reasoning about it also not accepted. Can I ask, are you upset because you think @evilnick made a wrong technical decision about what should or should not be documented? Or would you say you feel more upset because you need to feel a different relationship to exist between you (as a community member), and the project, and Canonical? Unfortunately I am not in a position to comment on the technical aspect of this. I am not an expert in MicroK8s security. However, one of the things I would like to achieve in my work is improved community engagement through documentation and improved experiences for documentation contributors, so I'd be very happy if you would like to talk more about that. 
Either way, one thing I would like to say in the meantime is that I do know that Nick is a very community-minded person. Before making this decision, we discussed it together (nor was I the only person it was raised with), because it's a hard thing to do to turn down someone's contribution. It was not done lightly. So it's also a hard thing to be in that position, and to receive angry criticism for it, or to be accused of not respecting the code of conduct. Personally, I would be upset by that myself. |
@evildmp if I were to guess, a large share of the frustration is not personal, but rather aimed at snap's packaging in-of-itself, which is then a main root to cause this problem. The solution offered is a hack, an explicit circumvention of the problem, which does not do much to offer a satisfying resolution, nor does it help lighten the burden that the problem caused, it only cripples the effectiveness of the platform, while a better solution is available from snap's side, while they do not wish to give developers those tools, out of political and ideological reasons, that explicitly tear away control from users in a patriarchal fashion, in the sense that the developers would like to think they know their users' systems better than the users would. (Which, imo, that is maybe true for normal application users, this becomes far less true for developers, and very much not true for system administrators, for which snaps all have the same attitude) I don't want to perpetuate the cycle here, at the very least know this; it wasn't personal, the frustration is high, and this issue is just one part of the knot where the pressure became too high. |
The approach of the MicroK8s team is not professional. The sole purpose of Kubernetes is to provide a platform for running fault-tolerant services, and it is completely ruined by packaging it into the totally unsuitable, and DNA broken "snap" tool. It's hard to say more. I personally got rid of all microk8s on my servers and migrated to k3s. Next is to replace Ubuntu with plain Debian. |
This is not an issue, or criticism of code, or even criticism of someone's behaviour. It is an abusive remark. There is nothing constructive that anyone can do in response to this comment. Sometimes people get angry about open-source software projects they participate in, which is OK. It's also OK to express anger sometimes. It is not OK, and it is explicitly against the Ubuntu code of conduct adopted by the MicroK8s project, to make abusive comments. I politely request that you delete that part of your comment. Thank you. |
@evildmp Do you really think you gave a satisfying professional response to my statement? You have missed a chance to clarify: 'As a director of engineering at Canonical I'd like to assure you that you can expect professional and complete documentation for our products. And this also includes controversial fixes or alternative or internal usage instructions for our products from our staff or our users even if we don't recommend those to our customers. We encourage and enforce transparency. We are committed to leaving the choice to our users and customers to use and deploy our products in a way that best suits their needs even if we don't agree with it or consider it harmful. Of course we will flag those with a big fat warning label'. As long as you don't say that, the best answer to this issue still comes from your former employee [see: Bypassing store refresh]. As long as insiders don't have the courage or the company's permission to give some really helpful background information, I'm afraid it's not cynical to say that hopefully more people will leave the company to finally make up their own minds. That'd be a really sad conclusion for the friends of Canonical, Ubuntu and MicroK8s. |
I think some of the last comments are off-topic relative to the title of the issue..... We have a problem... the autorefresh system of snap breaks microk8s, we need an option to disable autorefresh... only that... |
@lfdominguez you cannot disable autorefresh. You can download and install microk8s manually, but then updating it is manual as well. |
Yes, I understand that... so if it is the policy of Canonical & snap not to change that (really, I think it is a wrong idea not to let the user disable it)... why then waste time on this issue??? The snap team is not listening to users. Without a workaround like the ones discussed in this issue... or better, go to the official docs of the microk8s Kubernetes distribution and install it on a production cluster, and everything will get messed up when autorefresh changes something.... From my sysadmin point of view, that is faaaaaaaaaar from a production-grade system. |
@evildmp ,
|
Said differently, why is there so much reluctance to improve user experience with a feature so many people requests? Is there something to do with data collected from auto updates? |
@vazir I'd also like to note that the current Ubuntu server installer, when entering the "additional software" screen, installs these through snap. So, if you've installed docker on that screen, it'll go ahead and install that in a snap container, including any other software, such as microk8s. |
@ShadowJonathan - this mindless practice effectively and rapidly moves Ubuntu as distribution out of servers. I do not believe, they do not understand it. So, there is the only conclusion - someone slowly killing Ubuntu from inside. Nokia way |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
A note that is probably useful for people that have commented in the issue: Starting with snapd 2.58, it is possible to indefinitely hold MicroK8s (and any other installed snap packages) from updating with the following command:
See also the "Control updates" section in the snapd documentation Not stale |
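The command itself did not survive in this copy of the comment. The sketch below assumes it is the `snap refresh --hold` form described in the "Control updates" section of the snapd documentation, and adds a version check for the 2.58 requirement mentioned above. The function names `snapd_supports_hold` and `hold_microk8s` are hypothetical helpers, not part of snapd.

```shell
#!/usr/bin/env bash
# Return success if the given snapd version is at least 2.58,
# the release named above for indefinite holds. Requires sort -V.
snapd_supports_hold() {
    local version="$1"
    # sort -V orders version strings; the smaller of the two comes first.
    [ "$(printf '%s\n' "2.58" "$version" | sort -V | head -n 1)" = "2.58" ]
}

# Hold MicroK8s refreshes indefinitely when the installed snapd allows it.
hold_microk8s() {
    local installed
    # `snap version` prints one "name  version" pair per line.
    installed="$(snap version | awk '$1 == "snapd" { print $2 }')"
    if snapd_supports_hold "$installed"; then
        sudo snap refresh --hold microk8s
    else
        echo "snapd $installed is older than 2.58; indefinite hold unavailable" >&2
        return 1
    fi
}

# Usage:
# hold_microk8s
# To resume refreshes later:
# sudo snap refresh --unhold microk8s
```

A held snap stops receiving fixes until the hold is removed, so upgrades then have to be scheduled by hand, one node at a time if the cluster has several.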
I can only say "finally"... But I dropped microk8s, switched to k3s, also no UBUNTU any more anywhere, servers and desktops. Back to Debian. It is hard to express how ANNOYING to get those "Pending update of 'BLA BLA SNAP' close the app to avoid disruption". Ubuntu is trying to mimic worst parts of the damned windows... |
Had switched to Rancher/k3s already. No further bothering on unexpected auto updates. |
Changed to rancher rke2 (very stable), no snap anymore i hope |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This isn't stale, this is still a problem |
this just happened to me, failed on 2/3 nodes |
This morning a close-to-production cluster fell over after snap's auto-refresh "feature" failed on 3 of 4 worker nodes. It looks like it hung at the Copy snap "microk8s" data step. microk8s could be restarted after aborting the auto-refresh, but this only worked after manually killing snapd. For a production-ready Kubernetes distribution I really think this is far from an acceptable default. Perhaps until snapd allows disabling auto-refreshes, microk8s scripts could recommend running
sudo snap set system refresh.hold=2050-01-01T15:04:05Z
or similar. Also, a Kubernetes-native integration with snapd refreshes could be considered (e.g. a prometheus/grafana dashboard/alert) to prompt manual updates, presumably one node at a time to begin with.
Otherwise microk8s is working rather well, so thank you very much.
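As a rough illustration of the refresh.hold suggestion above: a maintainer notes elsewhere in this thread that refreshes cannot be held forever with this setting, so the sketch computes a date a bounded number of days ahead instead of hard-coding a far-future one. The helper name `hold_until` and the 30-day default are assumptions for illustration, and GNU date is required.

```shell
#!/usr/bin/env bash
# Print a UTC timestamp N days from now, in the same format as the
# example above (2050-01-01T15:04:05Z). Requires GNU date.
hold_until() {
    local days="${1:-30}"   # 30 is an arbitrary default, not a snapd value
    date --utc --date="+${days} days" '+%Y-%m-%dT%H:%M:%SZ'
}

# Usage: postpone refreshes, then re-run before the hold expires.
# sudo snap set system refresh.hold="$(hold_until 30)"
# Check the schedule snapd reports afterwards:
# snap refresh --time
```

The hold has to be renewed periodically, for example from a cron job or systemd timer, which is also a natural point to plan a manual upgrade.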
More details about the outage:
microk8s is disabled..
Data copy appears hung
There doesn't seem to be much to copy anyway:
Starting microk8s fails
Fails to abort..
snapd service hangs when trying to stop it...
have to resort to manually stopping the process
finally change is undone..
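The commands and output for the steps above are missing from this copy of the report. A minimal sketch of the abort step, assuming `snap changes` prints its usual ID, Status, Spawn, Ready, Summary columns and that a stuck refresh shows the status "Doing", might look like this. The function names are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Read `snap changes` output on stdin and print the IDs of changes that are
# still in progress ("Doing") and mention microk8s in their summary.
# Assumes the column order: ID, Status, Spawn, Ready, Summary.
stuck_microk8s_changes() {
    awk '$2 == "Doing" && /microk8s/ { print $1 }'
}

# Abort each stuck change. As the report describes, snapd itself may be
# hung and need to be stopped and restarted before the abort goes through.
abort_stuck_refresh() {
    local id
    for id in $(snap changes | stuck_microk8s_changes); do
        sudo snap abort "$id"
    done
}

# Usage:
# abort_stuck_refresh
# snap changes    # the aborted change should eventually stop showing "Doing"
```

This does not address whatever blocked the data copy in the first place, so the same hang can recur at the next auto-refresh unless refreshes are held.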
Nothing much in snapd logs except for a polkit error - unsure if related: