Zero-downtime restart #22

Closed
vassilevsky opened this issue Oct 22, 2018 · 7 comments

Comments

@vassilevsky

Hi :)

Zero-downtime restart has always been a PITA in both Unicorn and Puma.

What's Falcon's story?

@ioquatix
Member

ioquatix commented Oct 22, 2018

That's a really good question and one I've thought about. Here is a brain dump:

async-container is responsible for starting and stopping child tasks.

I recently implemented using a shared file descriptor for multiple processes which works pretty well. That means it's possible to kill old processes and start new ones without rejecting incoming connections.
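A minimal sketch of the shared-descriptor idea, using plain fork and TCPServer from the Ruby standard library rather than async-container (whose internals are not shown in this thread). The helper names spawn_worker, stop_worker and ask are hypothetical. The parent opens the listening socket once; every forked worker inherits it, so connections that arrive while no worker is accepting wait in the kernel backlog instead of being refused.

```ruby
require "socket"

# Hypothetical helper: fork a worker that accepts connections on the
# listening socket it inherits from the parent.
def spawn_worker(server, label)
  fork do
    # Exit immediately on TERM, skipping any at_exit hooks of the parent.
    trap(:TERM) { exit!(0) }

    loop do
      client = server.accept
      client.write("#{label}\n") # stand-in for a real response
      client.close
    end
  end
end

# Hypothetical helper: terminate a worker and reap it.
def stop_worker(pid)
  Process.kill(:TERM, pid)
  Process.wait(pid)
end

# Hypothetical helper: connect once and read the worker's one-line reply.
def ask(port)
  TCPSocket.open("127.0.0.1", port) { |socket| socket.gets&.chomp }
end
```

To swap workers, start the replacement with spawn_worker before calling stop_worker on the old pid; the listening socket stays open in the parent the whole time. This sketch does not drain in-flight requests, so a worker killed mid-response would still drop that one connection.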

Right now I'm working on an opinionated virtual hosting solution for Falcon and part of this will be figuring out how to do zero-downtime restarts.

The idea right now is to have both a signal and perhaps an IPC mechanism for restarting specific hosts (could be as simple as touch tmp/restart.txt).
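The tmp/restart.txt variant could be sketched roughly as below. RestartMarker is a hypothetical class, not part of Falcon; it only detects that the file was touched, and what happens next (signalling workers, rolling them over) is left to the caller.

```ruby
# Hypothetical watcher: reports a restart request whenever the marker
# file's modification time changes.
class RestartMarker
  def initialize(path)
    @path = path
    @last_seen = mtime
  end

  # Returns true once for each observed change of the marker file.
  def restart_requested?
    current = mtime
    return false if current.nil? || current == @last_seen

    @last_seen = current
    true
  end

  private

  # Modification time of the marker, or nil if it does not exist yet.
  def mtime
    File.mtime(@path)
  rescue Errno::ENOENT
    nil
  end
end
```

A supervising process would poll restart_requested? on a timer. Two touches within the filesystem's timestamp resolution are seen as one request, which is usually acceptable for a restart trigger.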

I think part of this can involve more reliability at the container level.

So, one limitation with async-container is that there is no logic for "keep 8 processes running at all times". If you start a container with 8 processes, they won't be restarted when they crash. If there were a policy for restarting them, it would be easy to have another process gracefully terminate them.
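To make the "keep N processes running" policy concrete, here is a rough sketch with plain fork and Process.wait. WorkerPool is a hypothetical name and this is not how async-container is structured; it only illustrates the policy of replacing any worker that exits.

```ruby
# Hypothetical supervisor: keeps `count` forked workers alive and replaces
# any worker that exits.
class WorkerPool
  attr_reader :pids

  def initialize(count, &work)
    @count = count
    @work = work
    @pids = []
  end

  # Fork workers until the pool is back at full size.
  def top_up
    @pids << fork_worker while @pids.size < @count
  end

  # Block until one child exits, then start a replacement.
  # Returns the pid of the child that exited.
  def replace_one_exited
    pid = Process.wait # waits for any child of this process
    @pids.delete(pid)
    top_up
    pid
  end

  # Terminate and reap every worker.
  def stop
    @pids.each { |pid| Process.kill(:TERM, pid) }
    @pids.each { |pid| Process.wait(pid) }
    @pids.clear
  end

  private

  def fork_worker
    fork do
      trap(:TERM) { exit!(0) }
      @work.call
    end
  end
end
```

With such a policy in place, a restart becomes "terminate one worker, let the pool replace it, repeat", which is the rolling behaviour discussed in this thread. A real policy would also need back-off so that a worker that crashes on boot is not respawned in a tight loop.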

With HTTP/2 we need to be careful to use GOAWAY to ensure that we gracefully shut down existing connections.
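For reference, this is the wire layout of a GOAWAY frame as defined in RFC 7540, section 6.8, sketched in plain Ruby. The goaway_frame helper is hypothetical and is not the API of any HTTP/2 library; a real server would send the frame through its protocol library.

```ruby
GOAWAY_TYPE = 0x7 # frame type for GOAWAY
NO_ERROR = 0x0    # error code for a graceful shutdown

# Hypothetical helper: encode a GOAWAY frame. The frame is always sent on
# stream 0 and carries the highest stream id the sender will still process.
def goaway_frame(last_stream_id, error_code = NO_ERROR, debug_data = "")
  # Payload: reserved bit + 31-bit last stream id, 32-bit error code,
  # then optional opaque debug data.
  payload = [last_stream_id & 0x7FFF_FFFF, error_code].pack("NN") + debug_data.b
  length = payload.bytesize

  # Frame header: 24-bit length, 8-bit type, 8-bit flags, 32-bit stream id.
  header = [length >> 16, length & 0xFFFF, GOAWAY_TYPE, 0, 0].pack("CnCCN")
  header + payload
end
```

On a graceful shutdown the server sends this with NO_ERROR, lets streams up to last_stream_id finish, and only then closes the connection. Clients can safely retry anything above that id on a new connection.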

guard-falcon already implements sort of zero-downtime restarts in a very limited context. It uses a shared socket in the parent process and restarts the application server processes as required. It's not perfect but it does work for its intended purpose. It doesn't do rolling restarts.

Maybe what you can do to help is clarify exactly how you think zero-downtime restarts should work and ideally what the process model should be so I can get a handle on exactly how this should work from your POV.

@ioquatix
Member

Guard::Falcon uses a shared endpoint and restarts the entire container every time a file changes.

https://github.com/socketry/guard-falcon/blob/42fdf89439c7e077182f2affeca2bed4bc97dd21/lib/guard/falcon/controller.rb#L73-L123

There are potential options for optimising this - e.g. pre-fork which is only killed when files that affect the actual app server code change.

@ioquatix
Member

I think this is also a related issue: #17

@ioquatix
Member

I am going to brain dump some more things I've been thinking about.

Right now, Falcon::Hosts is used to implement virtual hosting.

I feel like it's a very wide design. As in, the surface area of the API is proportional to the features required. As a test case, I'm thinking about how to add support for Let's Encrypt. This requires that a user can optionally specify, per host, whether they want to provision a certificate, plus the appropriate infrastructure to update the certs periodically and to gracefully restart the appropriate processes when the certs are updated.

I always liked the Rack way of layering middleware. I'm trying to figure out if rather than going "wide" we can go "deep" like Rack, for the specification of web hosts. To me, it seems like in order to go deep, some level of "configuration abstraction" is required. e.g. using a hash to define the configuration, then using a collection of these hashes (one per host?) to start the appropriate daemons. It would function like a database of configuration and then a stack of host middleware could act on it to do the right thing. To me, though, it seems a little bit too much indirection.

Configuration hashes are hard to document, hard to reason about, and it might just be simpler to go with some "wide" class like how Falcon::Host currently works, e.g. host.ssl_certificate_path = xyz.
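To make the "deep" versus "wide" contrast concrete, here is a purely illustrative sketch. None of these classes exist in Falcon; ServeLayer, TLSLayer, WideHost and the configuration keys are all hypothetical names.

```ruby
# "Deep": each layer receives a host's configuration hash, acts on the keys
# it understands, and delegates the rest to the layer it wraps.
class ServeLayer
  def call(config)
    ["serve #{config.fetch(:authority)}"]
  end
end

class TLSLayer
  def initialize(inner)
    @inner = inner
  end

  # Adds a certificate step only for hosts that configure one.
  def call(config)
    actions = @inner.call(config)
    path = config[:ssl_certificate_path]
    path ? ["load certificate #{path}"] + actions : actions
  end
end

# "Wide": one class that grows an attribute per feature.
WideHost = Struct.new(:authority, :ssl_certificate_path, keyword_init: true)
```

In the deep version, Let's Encrypt support would be one more layer wrapping the stack, and hosts without the relevant key are unaffected. The cost is the one named above: the hash keys become an undocumented interface, whereas WideHost shows every option in one place.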

As always, thoughts are welcome.

@hmspider

hmspider commented Jan 6, 2019

Virtual hosting seems IMHO to be in opposition to self-contained web apps.
That being said, I think you'd need some kind of supervisor, however lightweight, to coordinate graceful hot restarts and shutdowns.
You might want to have a look at Sidekiq: The Manager actor monitors the state of working actors and besides the restart/shutdown thing it adds some overall resilience by killing stuck actors and reassigning their tasks (jobs).

@vassilevsky
Author

I work with Erlang now and what I see is a lot of supervision going on :) Practically every process has a supervisor that detects when a child is terminated and restarts it if needed. The entire application is like a tree of processes, and only leaf processes do the heavy work. The rest are supervising. That very well might be the reason why Erlang is considered rock solid. It’s actually tree solid, or tree enduring.

So yes, I would advise a supervising process.

@ioquatix
Member

In theory this is now possible. If you kill a process, it will restart. If the process has a connection in-flight it will be dropped, but if you use falcon virtual (experimental) the request would be retried up to 3 times (internally) if it is idempotent. You can access this functionality via falcon supervisor restart, and the examples show how to do it, e.g. falcon virtual examples/hello/falcon.rb. Feel free to try it out.
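The retry rule can be sketched as follows. This is not Falcon's proxy code: forward_with_retry and BackendGone are hypothetical names, and the sketch reads "up to 3 times" as three attempts in total, which is an assumption. The list of idempotent methods follows RFC 7231, section 4.2.2.

```ruby
IDEMPOTENT_METHODS = %w[GET HEAD PUT DELETE OPTIONS TRACE].freeze
MAX_ATTEMPTS = 3 # assumption: three attempts in total

# Hypothetical error raised when the backend drops the connection.
class BackendGone < StandardError; end

# Yields the attempt number to the block that forwards the request.
# Retries only idempotent methods, and only while attempts remain.
def forward_with_retry(http_method, max_attempts: MAX_ATTEMPTS)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue BackendGone
    retry if IDEMPOTENT_METHODS.include?(http_method) && attempts < max_attempts
    raise
  end
end
```

A POST is never replayed here because the backend may already have acted on it before the connection dropped; the error is passed straight back to the caller.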


3 participants