restart_on_update considered harmful #288

Closed
nicwaller opened this issue Mar 10, 2016 · 6 comments

Comments

@nicwaller

This cookbook restarts Consul if the configuration changes. That seems reasonable at first.

But Consul relies heavily on maintaining quorum; if quorum is not maintained, the Consul cluster goes into an outage. All that's needed is for ceil(N/2) servers to restart at the same time. Allowing Chef to restart Consul servers dramatically increases the likelihood of encountering this scenario. I believe that I have experienced this several times now. Here's what happens.

  1. A new version of Consul is released. Nice!
  2. Tweak attributes in Chef to obtain the latest version of Consul.
  3. Send SIGUSR1 to chef-client on all nodes to trigger an immediate run and pick up the update.
  4. Two or more Consul servers restart at the same time, and quorum is lost.

At that point, my only option is to rebuild the Consul cluster. Not good.

I recommend that this cookbook make one of the following changes:

  • Never restart Consul servers. Let admins do that using other tools. Restarting agents is always safe.
  • Add an attribute to toggle whether servers are restarted when the configuration changes. This should default to false and come with a stern warning. (A rough sketch follows below.)
  • Support some kind of cluster mutex (using Consul, Redis, etc.) to guarantee safety when restarting servers.

Isn't it strange that Consul doesn't have a built-in way of safely restarting individual servers, or even the entire cluster?
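
Here is a minimal sketch of the second option, using a hypothetical node attribute; the attribute name, default, and file paths are illustrative, not something the cookbook exposes today:

```ruby
# attributes/default.rb (hypothetical): default to the safe behavior.
default['consul']['restart_on_config_change'] = false

# recipes/default.rb (sketch): only wire up a restart notification when the
# operator explicitly opts in.
template '/etc/consul/consul.json' do
  source 'consul.json.erb'
  if node['consul']['restart_on_config_change']
    # Risky on server nodes: restarting ceil(N/2) servers at once loses quorum.
    notifies :restart, 'service[consul]', :delayed
  end
  # Otherwise the running server is left alone; operators restart out of band.
end

service 'consul' do
  action [:enable, :start]
end
```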

@nicwaller
Author

Perhaps a better solution for server nodes would be to send SIGHUP instead of restarting the process. That would allow changes to services, health checks, etc. to get picked up, but still avoid the risk of losing quorum.
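
Something like the following would express that in a recipe, assuming the init script or systemd unit maps the service's reload action to SIGHUP (the resource and file names here are illustrative):

```ruby
# Configuration changes trigger a reload (SIGHUP) so new service and check
# definitions are picked up without the server ever leaving the raft quorum.
template '/etc/consul/consul.json' do
  source 'consul.json.erb'
  notifies :reload, 'service[consul]', :delayed
end

service 'consul' do
  supports reload: true   # the init script / unit file must map reload to SIGHUP
  action [:enable, :start]
end
```

If the init integration has no reload action, an execute resource running `consul reload` against the local agent should achieve the same effect.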

@shortdudey123
Contributor

Support some kind of cluster mutex (using Consul, Redis, etc.) to guarantee safety when restarting servers.

I have not used it before, but here is one thought: https://github.com/websterclay/chef-dominodes
"A Chef resource for mutual exclusion of blocks of recipe code. Useful for cross-cluster rolling restarts."

@johnbellone
Contributor

I am comfortable with making the change to using SIGHUP.

@johnbellone
Contributor

I think that having any kind of cluster mutex is a code smell. I would much rather leave it to the system administrators to upgrade the agent. What I'll probably end up doing is making sure that the configuration resource sends a reload (SIGHUP) and that updates to the installation don't actually change the running service.

@jasonmcintosh

Please note that Consul itself is designed to handle and support these use cases. See the link below for a method that would prevent multiple nodes from restarting simultaneously.
https://www.consul.io/docs/guides/semaphore.html
Chef itself supports this through event handlers (I'm working on open-sourcing an example), which allow semaphore and mutex locking.

Second, and this is more of a policy issue: IMHO, at the end of the day, restricting service restarts and associated behavior is NOT the responsibility of individual cookbooks (e.g. the consul-cookbook); or, if there is a need, it should be configurable with a default of restarting. It is up to the END user to manage server/cluster state safely, not a specific cookbook. PagerDuty uses etcd to orchestrate cluster changes; see the RFC for an example of where they added cluster-state management support to Chef: https://github.com/chef/chef-rfc/blob/master/rfc039-event-handler-dsl.md

This way, for systems that do NOT care about such a restart, this becomes a moot point, and it continues to work for those of us who DO have cluster state tackled already ;)
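
As a rough illustration of that kind of orchestration, a restart could be serialized through Consul's own locking. The lock prefix, restart command, and template name below are assumptions, and a real implementation would also need to account for the lock session being held through the very agent that is being restarted:

```ruby
# Serialize restarts through a cluster-wide lock in Consul's KV store, so at
# most one server is down at any moment and quorum is preserved.
execute 'rolling-consul-restart' do
  command 'consul lock locks/consul-restart "systemctl restart consul"'
  action :nothing
  subscribes :run, 'template[/etc/consul/consul.json]', :delayed
end
```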

@lock

lock bot commented Apr 25, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Apr 25, 2020