restart_on_update considered harmful #288

Closed
nicwaller opened this issue Mar 10, 2016 · 6 comments

Comments

@nicwaller

This cookbook restarts Consul if the configuration changes. That seems reasonable at first.

But Consul relies heavily on maintaining quorum; if quorum is not maintained, the Consul cluster goes into an outage. All that's needed is for ceil(N/2) servers to restart at the same time. Allowing Chef to restart Consul servers dramatically increases the likelihood of encountering this scenario. I believe that I have experienced this several times now. Here's what happens.

  1. A new version of Consul is released. Nice!
  2. Tweak attributes in Chef to obtain the latest version of Consul.
  3. Send SIGUSR1 to chef-client on all nodes to trigger an immediate run and pick up the update.
  4. Two or more Consul servers restart at the same time, and quorum is lost.

At that point, my only option is to rebuild the Consul cluster. Not good.

I recommend that this cookbook make one of the following changes:

  • Never restart Consul servers. Let admins do that using other tools. Restarting agents is always safe.
  • Add an attribute to toggle whether servers are restarted when the configuration changes. This should default to false and come with a stern warning. (A rough sketch follows below.)
  • Support some kind of cluster mutex (using Consul, Redis, etc.) to guarantee safety when restarting servers.

Isn't it strange that Consul doesn't have a built-in way of safely restarting individual servers, or even the entire cluster?
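
Here is a minimal sketch of the second option, using a hypothetical node attribute; the attribute name, default, and file paths are illustrative, not something the cookbook exposes today:

```ruby
# attributes/default.rb (hypothetical): default to the safe behavior.
default['consul']['restart_on_config_change'] = false

# recipes/default.rb (sketch): only wire up a restart notification when the
# operator explicitly opts in.
template '/etc/consul/consul.json' do
  source 'consul.json.erb'
  if node['consul']['restart_on_config_change']
    # Risky on server nodes: restarting ceil(N/2) servers at once loses quorum.
    notifies :restart, 'service[consul]', :delayed
  end
  # Otherwise the running server is left alone; operators restart out of band.
end

service 'consul' do
  action [:enable, :start]
end
```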

@nicwaller
Author

Perhaps a better solution for server nodes would be to send SIGHUP instead of restarting the process. That would allow changes to services, health checks, etc. to get picked up, but still avoid the risk of losing quorum.
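
Something like the following would express that in a recipe, assuming the init script or systemd unit maps the service's reload action to SIGHUP (the resource and file names here are illustrative):

```ruby
# Configuration changes trigger a reload (SIGHUP) so new service and check
# definitions are picked up without the server ever leaving the raft quorum.
template '/etc/consul/consul.json' do
  source 'consul.json.erb'
  notifies :reload, 'service[consul]', :delayed
end

service 'consul' do
  supports reload: true   # the init script / unit file must map reload to SIGHUP
  action [:enable, :start]
end
```

If the init integration has no reload action, an execute resource running `consul reload` against the local agent should achieve the same effect.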

@shortdudey123
Contributor

Support some kind of cluster mutex (using Consul, Redis, etc.) to guarantee safety when restarting servers.

I have not used it before, but here is one thought: https://github.com/websterclay/chef-dominodes
"A Chef resource for mutual exclusion of blocks of recipe code. Useful for cross-cluster rolling restarts."

@johnbellone
Contributor

I am comfortable with making the change to using SIGHUP.

@johnbellone
Contributor

I think that having any kind of cluster mutex is a code smell. I would much rather leave it to the system administrators to upgrade the agent. What I'll probably end up doing is making sure that the configuration resource sends a reload (SIGHUP) and that updates to the installation don't actually change the running service.

@jasonmcintosh

Please note that Consul itself is designed to handle and support these use cases. See the link below for a method that would prevent multiple nodes from restarting simultaneously.
https://www.consul.io/docs/guides/semaphore.html
Chef itself supports this through event handlers (I'm working on open-sourcing an example), which allow semaphore and mutex locking.

Second, and this is more of a policy issue: IMHO, at the end of the day, restricting service restarts and associated behavior is NOT the responsibility of individual cookbooks (e.g. the consul-cookbook); or, if there is a need, it should be configurable with a default of restarting. It is up to the END user to manage server/cluster state safely, not a specific cookbook. PagerDuty uses etcd to orchestrate cluster changes; see the RFC for an example of where they added cluster-state management support to Chef: https://github.com/chef/chef-rfc/blob/master/rfc039-event-handler-dsl.md

This way, for systems that do NOT care about such a restart, this becomes a moot point, and it continues to work for those of us who DO have cluster state tackled already ;)
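
As a rough illustration of that kind of orchestration, a restart could be serialized through Consul's own locking. The lock prefix, restart command, and template name below are assumptions, and a real implementation would also need to account for the lock session being held through the very agent that is being restarted:

```ruby
# Serialize restarts through a cluster-wide lock in Consul's KV store, so at
# most one server is down at any moment and quorum is preserved.
execute 'rolling-consul-restart' do
  command 'consul lock locks/consul-restart "systemctl restart consul"'
  action :nothing
  subscribes :run, 'template[/etc/consul/consul.json]', :delayed
end
```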

@lock

lock bot commented Apr 25, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Apr 25, 2020