Zero-downtime restart #22

vassilevsky · 2018-10-22T12:27:59Z

Hi :)

Zero-downtime restart has always been a PITA in both Unicorn and Puma.

What's Falcon's story?

ioquatix · 2018-10-22T21:08:22Z

That's a really good question and one I've thought about. Here is a brain dump:

async-container is responsible for starting and stopping child tasks.

I recently implemented using a shared file descriptor for multiple processes which works pretty well. That means it's possible to kill old processes and start new ones without rejecting incoming connections.
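To make the shared file descriptor idea concrete, here is a minimal sketch using only Ruby's standard library. It is not how async-container is implemented; `spawn_worker`, `request` and `demo_shared_socket_restart` are hypothetical names. The parent binds the listening socket once, forked workers inherit it, and a connection made while no worker is running waits in the kernel's accept queue instead of being refused.

```ruby
require "socket"

# Hypothetical helper: fork a worker that accepts on the inherited listening socket.
def spawn_worker(server)
  fork do
    # Exit immediately on TERM, skipping any at_exit handlers inherited from the parent.
    trap(:TERM) { exit!(0) }
    loop do
      client = server.accept
      client.puts("handled by #{Process.pid}")
      client.close
    end
  end
end

# Hypothetical helper: connect, read a one-line response, and disconnect.
def request(port)
  socket = TCPSocket.new("127.0.0.1", port)
  socket.gets.to_s.chomp
ensure
  socket&.close
end

def demo_shared_socket_restart
  # The parent owns the listening socket for the whole lifetime of the demo.
  server = TCPServer.new("127.0.0.1", 0) # port 0: let the OS pick a free port
  port = server.addr[1]

  old_pid = spawn_worker(server)
  before = request(port)

  # Stop the old worker. The listening socket stays open in the parent.
  Process.kill(:TERM, old_pid)
  Process.wait(old_pid)

  # No worker is running now, yet this connection is queued rather than refused.
  pending = TCPSocket.new("127.0.0.1", port)

  # The replacement worker picks the queued connection up.
  new_pid = spawn_worker(server)
  during = pending.gets.to_s.chomp
  pending.close

  Process.kill(:TERM, new_pid)
  Process.wait(new_pid)
  server.close

  { old_pid: old_pid, new_pid: new_pid, before: before, during: during }
end
```

Calling `demo_shared_socket_restart` returns the two worker PIDs together with the responses received before and during the restart. The sketch only covers new connections: a request that a worker is still handling when it is killed would be lost.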

Right now I'm working on an opinionated virtual hosting solution for Falcon and part of this will be figuring out how to do zero-downtime restarts.

The idea right now is to have both a signal and perhaps an IPC mechanism for restarting specific hosts (could be as simple as touch tmp/restart.txt).
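As an illustration of the tmp/restart.txt idea, a supervisor could poll the file's modification time and treat any change as a restart request. The sketch below is an assumption about how that might look, not Falcon's implementation; the `RestartFile` class and its method names are hypothetical.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical watcher: reports a restart request once per change of the file's mtime.
class RestartFile
  def initialize(path)
    @path = path
    @seen = current_mtime
  end

  # True exactly once after the file has been created or touched.
  def restart_requested?
    latest = current_mtime
    return false if latest.nil? || latest == @seen

    @seen = latest
    true
  end

  private

  # Returns nil when the file does not exist (yet).
  def current_mtime
    File.mtime(@path)
  rescue Errno::ENOENT
    nil
  end
end

# Usage: explicit mtimes keep the example independent of filesystem timestamp resolution.
Dir.mktmpdir do |dir|
  path = File.join(dir, "restart.txt")
  watcher = RestartFile.new(path)

  FileUtils.touch(path, mtime: Time.at(1_000))
  puts watcher.restart_requested? # first touch is seen as a request
  puts watcher.restart_requested? # no change since the last check
end
```

A real supervisor would call `restart_requested?` from its main loop and could combine it with a signal handler so that either mechanism triggers the same restart path.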

I think part of this can involve more reliability at the container level.

So, one limitation with async-container is that there is no logic for "keep 8 processes running at all times". If you start a container with 8 processes, they won't be restarted when they crash. If there were a policy for restarting them, it would be easy to have another process gracefully terminate them.
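A restart policy of the kind described above could look roughly like the following sketch. It uses plain `fork` and `Process.wait` rather than async-container, and the `KeepAlive` class is a hypothetical name introduced for illustration.

```ruby
# Hypothetical supervisor: keeps `count` worker processes alive.
class KeepAlive
  attr_reader :pids

  def initialize(count, &work)
    @count = count
    @work = work
    @pids = []
  end

  # Fork workers until the desired count is reached.
  def start
    spawn_one while @pids.size < @count
  end

  # Wait until one of our workers has exited, then replace it.
  # Returns [exited_pid, replacement_pid].
  def reap_and_replace
    loop do
      @pids.each do |pid|
        next unless Process.wait(pid, Process::WNOHANG)

        @pids.delete(pid)
        return [pid, spawn_one]
      end
      sleep 0.01
    end
  end

  # Kill and reap every remaining worker.
  def stop
    @pids.each { |pid| Process.kill(:KILL, pid) }
    @pids.each { |pid| Process.wait(pid) }
    @pids.clear
  end

  private

  def spawn_one
    pid = fork do
      trap(:TERM) { exit!(0) } # treat TERM as a request to exit cleanly
      @work.call
      exit!(0)
    end
    @pids << pid
    pid
  end
end
```

With such a policy in place, a rolling restart reduces to sending TERM to one worker at a time and letting `reap_and_replace` start its successor.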

With HTTP/2 we need to be careful to use GOAWAY to ensure that we gracefully shut down existing connections.
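For reference, a GOAWAY frame is small: a 9-byte frame header on stream 0 followed by the last stream ID the server will process and an error code. The sketch below builds those bytes by hand to show the layout; the `goaway_frame` helper is hypothetical, and a real server would rely on its HTTP/2 library to send the frame rather than packing it manually.

```ruby
GOAWAY_TYPE = 0x7 # frame type for GOAWAY
NO_ERROR = 0x0    # error code used for a graceful shutdown

# Hypothetical helper: encode a GOAWAY frame without optional debug data.
def goaway_frame(last_stream_id, error_code = NO_ERROR)
  # Payload: 31-bit last stream ID (reserved bit cleared) and a 32-bit error code.
  payload = [last_stream_id & 0x7fffffff, error_code].pack("NN")

  # Header: 24-bit length, 8-bit type, 8-bit flags, 32-bit stream ID (always 0 here).
  length = payload.bytesize
  header = [length >> 16, length & 0xffff, GOAWAY_TYPE, 0, 0].pack("CnCCN")

  header + payload
end
```

After sending the frame, the server keeps serving streams up to `last_stream_id` and closes the connection once they finish, while the client opens a new connection for any further requests.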

guard-falcon already implements sort of zero-downtime restarts in a very limited context. It uses a shared socket in the parent process and restarts the application server processes as required. It's not perfect but it does work for its intended purpose. It doesn't do rolling restarts.

Maybe what you can do to help is clarify exactly how you think zero-downtime restarts should work and ideally what the process model should be so I can get a handle on exactly how this should work from your POV.

ioquatix · 2018-10-22T21:18:33Z

Guard::Falcon uses a shared endpoint and restarts the entire container every time a file changes.

https://github.com/socketry/guard-falcon/blob/42fdf89439c7e077182f2affeca2bed4bc97dd21/lib/guard/falcon/controller.rb#L73-L123

There are potential options for optimising this - e.g. pre-fork which is only killed when files that affect the actual app server code change.

ioquatix · 2018-10-22T23:51:40Z

I think this is also a related issue: #17

ioquatix · 2018-10-24T20:28:08Z

I am going to brain dump some more things I've been thinking about.

Right now, the Falcon::Hosts is used to implement virtual hosting.

I feel like it's a very wide design. As in, the surface area of the API is proportional to the features required. As a test case, I'm thinking about how to add support for Let's Encrypt. This requires letting a user optionally specify, per host, whether they want to provision a certificate, plus the appropriate infrastructure to update the certs periodically and to gracefully restart the appropriate processes when the certs are updated.

I always liked the Rack way of layering middleware. I'm trying to figure out if rather than going "wide" we can go "deep" like Rack, for the specification of web hosts. To me, it seems like in order to go deep, some level of "configuration abstraction" is required. e.g. using a hash to define the configuration, then using a collection of these hashes (one per host?) to start the appropriate daemons. It would function like a database of configuration and then a stack of host middleware could act on it to do the right thing. To me, though, it seems a little bit too much indirection.

Configuration hashes are hard to document, hard to reason about, and it might just be simpler to go with some "wide" class like how Falcon::Host currently works, e.g. host.ssl_certificate_path = xyz.
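The two shapes being weighed here can be put side by side in a short sketch. Everything in it is hypothetical and for illustration only: `WideHost`, `BaseLayer`, `LetsEncryptLayer`, the `:lets_encrypt` key and the certs/ path are not part of Falcon.

```ruby
# "Wide": one class grows an attribute for every feature.
WideHost = Struct.new(:authority, :ssl_certificate_path, :lets_encrypt, keyword_init: true)

# "Deep": each layer wraps the next one and only touches the keys it understands.
class BaseLayer
  def call(config)
    config
  end
end

class LetsEncryptLayer
  def initialize(inner)
    @inner = inner
  end

  def call(config)
    config = @inner.call(config)
    return config unless config[:lets_encrypt]

    # Hypothetical convention: provisioned certificates live under certs/.
    config.merge(ssl_certificate_path: "certs/#{config[:authority]}.pem")
  end
end

stack = LetsEncryptLayer.new(BaseLayer.new)
puts stack.call({ authority: "example.com", lets_encrypt: true })[:ssl_certificate_path]
```

The deep version keeps each feature in its own layer, but the set of valid keys is only visible by reading every layer, which is the documentation problem mentioned above. The wide version lists every option in one place at the cost of a growing class.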

As always, thoughts are welcome.

hmspider · 2019-01-06T16:16:21Z

Virtual hosting seems IMHO to be in opposition to self-contained web apps.
That being said, I think you'd need some kind of supervisor, however lightweight, to coordinate graceful hot restarts and shutdowns.
You might want to have a look at Sidekiq: The Manager actor monitors the state of working actors and besides the restart/shutdown thing it adds some overall resilience by killing stuck actors and reassigning their tasks (jobs).

vassilevsky · 2019-01-16T16:28:10Z

I work with Erlang now and what I see is a lot of supervision going on :) Practically every process has a supervisor that detects when a child is terminated and restarts it if needed. The entire application is like a tree of processes, and only leaf processes do the heavy work. The rest are supervising. That very well might be the reason why Erlang is considered rock solid. It’s actually tree solid, or tree enduring.

So yes, I would advise a supervising process.

ioquatix · 2019-06-27T00:30:31Z

In theory this is now possible. If you kill a process, it will be restarted. If the process has a connection in flight, that connection will be dropped, but if you use falcon virtual (experimental) the request is retried up to 3 times (internally) if it is idempotent. You can access this functionality with falcon supervisor restart, and the examples show how to do it, e.g. falcon virtual examples/hello/falcon.rb. Feel free to try it out.
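The retry behaviour described above can be sketched as follows. This is not Falcon's code: `with_retries` and `IDEMPOTENT_METHODS` are hypothetical names, and reading "retried up to 3 times" as three retries after the first attempt is an assumption.

```ruby
# HTTP methods that are defined as idempotent and therefore safe to repeat.
IDEMPOTENT_METHODS = %w[GET HEAD PUT DELETE OPTIONS TRACE].freeze

# Hypothetical helper: run the block, retrying on connection failures
# only when the HTTP method is idempotent.
def with_retries(http_method, retries: 3)
  failures = 0
  begin
    yield
  rescue IOError, SystemCallError
    # EOFError is an IOError; Errno::ECONNRESET and friends are SystemCallErrors.
    failures += 1
    retry if IDEMPOTENT_METHODS.include?(http_method.to_s.upcase) && failures <= retries
    raise
  end
end

# Usage: the first two attempts hit a connection reset, the third succeeds.
attempts = 0
response = with_retries("GET") do
  attempts += 1
  raise Errno::ECONNRESET if attempts < 3
  "ok"
end
puts "#{response} after #{attempts} attempts"
```

A POST is never retried in this sketch, because repeating a non-idempotent request could apply its side effects twice.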

ioquatix added enhancement help wanted labels Nov 10, 2018

ioquatix closed this as completed Jun 27, 2019

trevorturk mentioned this issue Feb 21, 2023: Gracefully restart running application #188 (Open)
