
Graceful shutdown #71

Open
atitan opened this issue Aug 22, 2019 · 21 comments

@atitan (Contributor) commented Aug 22, 2019

Hello,

We made the transition from puma to falcon recently, and it works really well.

We're seeing some random Interrupt exceptions in the logs, which turned out to be caused by auto scaling's scale-in events, which send signals to containers to stop them.

I'd like to know whether falcon is capable of handling graceful shutdown like puma does.

@ioquatix (Member)

Not yet, but it's on track for the 1.0 release. If you have some time to help out, maybe you can explain your use case in a bit more detail.

@atitan (Contributor, Author) commented Aug 22, 2019

We're using AWS EC2+ECS for production.
Every EC2 instance has 4 cores, and ECS places one task (container) on each of them.
Inside the container, Falcon forks 5 processes to make maximum use of the CPUs.

For deployment, we use a full rolling update, which starts a new container on every instance while the old one is still running.
ELB+ECS will eventually drain the old containers and switch to the new ones, so the signal exception does not cause errors here.

When it comes to instance auto scaling, it operates directly on instances because it is an EC2 feature.
At the time an instance gets terminated, it may still have requests being processed.

Hope the described scenario helps.

@ioquatix (Member)

Thanks, that is helpful.

Are you using falcon serve and SIGINT?

@ioquatix (Member)

Also, it would be great to add a list of websites that use or have used falcon. If you want to make a PR including a link to your website, we can add it to the README.md.

@ioquatix (Member)

I've got a cold, so I may be inattentive on this issue for the next week, but hopefully I can pick it up soon. I've just spent some time playing around with async-io to flesh out what is required to support graceful shutdown in general, which can then be used by falcon.

@atitan (Contributor, Author) commented Aug 22, 2019

Take care.

We use dumb-init as the container entrypoint to rewrite SIGTERM to SIGINT and forward it.
This is because Puma once had an issue handling SIGTERM, which made GKE unhappy about the container status.
Using SIGINT makes Puma return 0 on exit.
The outcome seemed great, so we have adopted this practice on every project since then.
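
For illustration, the entrypoint part of this setup looks roughly like the following; the server command is just a placeholder for whatever the container actually runs.

# Illustrative Dockerfile fragment only; the CMD is a placeholder.
# dumb-init's --rewrite maps SIGTERM (15) to SIGINT (2) before forwarding it.
ENTRYPOINT ["dumb-init", "--rewrite", "15:2"]
CMD ["bundle", "exec", "falcon", "serve"]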

@veesahni commented Nov 1, 2021

I have a similar need for a rolling restart. It's slightly different from what @atitan described, so I'd like to describe the use case:

  • We have nginx proxying to multiple ruby servers
  • We do a rolling restart of the ruby servers
  • falcon would be managed by systemd
  • systemd would be configured to send a signal defined by falcon (SIGQUIT, SIGTERM, etc) to the controller process to trigger a graceful shutdown
  • Upon receiving the signal, it is expected that falcon immediately stops listening for and accepting new connections, and thus nginx will take the node out of rotation automatically
  • Upon receiving the signal, it is also expected that any idle child processes shut themselves down
  • Once all idle processes are gone, it's expected that the main controller will shut down
  • If the main controller is still up after 90s (this is the systemd default), systemd will send a SIGKILL to forcefully bring it down
  • Once falcon is shut down (either gracefully or via SIGKILL), systemd will start the process again

In practice, we set up systemd with a KillSignal parameter configured for a graceful shutdown.
Thus systemctl stop falcon would shut down gracefully,
and systemctl restart falcon would be nothing special: just a stop followed by a start.

More details about systemd's standard restart procedure are described in man systemd.kill (also available online); a sketch of the relevant unit settings follows below.
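
For illustration, the relevant unit settings would look roughly like this; the paths, the start command and the exact signal are placeholders rather than anything falcon prescribes.

# Sketch of the systemd settings described above (values are examples only).
[Service]
ExecStart=/usr/local/bin/falcon serve
# Signal sent by `systemctl stop`/`restart`; use whichever signal falcon defines for graceful shutdown.
KillSignal=SIGINT
# How long systemd waits before escalating to SIGKILL (90s is the default).
TimeoutStopSec=90
Restart=always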

@yard commented Mar 14, 2024

Hey guys! I have played a bit with graceful shutdown for falcon host, trying to make it work locally, but unfortunately didn't manage to make it work reliably. I'll dump my approach and findings here; maybe someone (well, @ioquatix most likely :) will be able to suggest how to proceed.

Essentially, I decided to approach the graceful shutdown in the dumbest way possible: upon receiving a signal (I was handling INT, but it doesn't really matter and can be made configurable anyway) I invoke a (newly added) drain! method on the SharedEndpoint instance. This method, in turn, is expected to stop accepting new connections immediately, wait for currently running requests to complete, and shut down any active connections hanging around (there could be some even after completing all the requests, due to Keep-Alive).

I managed to get SharedEndpoint to stop accepting new connections by running calls to the accept method in a loop with a small timeout until a flag is set (which gets enabled in the drain! method). Once we finish handling the request(s), the connection(s) get closed, the loop exits, the application gets killed and everything is fine.
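
A minimal sketch of that flag-plus-timeout accept loop, using a plain TCPServer for clarity (drain! here is my hypothetical addition, not anything that exists in falcon or async-io):

require "socket"
require "io/wait"

class DrainableAcceptLoop
  def initialize(server)
    @server = server   # an already-bound TCPServer, standing in for the shared endpoint
    @draining = false
  end

  def drain!
    @draining = true
  end

  def run(poll_interval: 0.1)
    until @draining
      # Wait briefly for a readable socket so the drain flag is re-checked often.
      next unless @server.wait_readable(poll_interval)
      yield @server.accept
    end
  end
end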

What I didn't manage to handle properly is exactly the Keep-Alive case. Whenever a connection is not closed by the client, it just stays in the #each loop waiting for new requests and does not die. I was thinking about adding another signal like drain! to the connection object inside the Async::HTTP::Server accept loop, but figured it might be better to ask first.

To sum up, I think the main challenge with my implementation is building a proper signalling mechanism that can stop all these nested loops – much like we use Thread.current.raise(Interrupt) to stop whatever code is running on an INT signal normally (but we need something more fine-grained here, since we don't want to stop the client logic). Adding a drain! method throughout the whole stack seems weird, and I believe there might be a better pattern?

@ioquatix (Member) commented Mar 14, 2024

I agree with improving this. The changes will need to take place in falcon and async-container. Using signals for everything might be tricky - there aren't really enough, so it might be the case we need to use some kind of RPC for detailed rolling restarts + status information.

I believe for the current implementation, sending SIGINT should start the shutdown process, followed by SIGTERM (or SIGKILL) to stop everything.

@yard commented Mar 15, 2024

Well, as far as graceful shutdown is concerned, I think signals should be enough. All in all, the child processes/threads (spawned by a Container instance) should gracefully tear down, and the parent process should exit once these are done; doing that in reaction to a signal seems alright.

The area I am a bit stuck at is terminating the connections gracefully. A SharedEndpoint instance created in the parent process gets passed down to child threads, which run an accept loop on top of it, where every accepted connection is wrapped into a socket, and a stream is created off the socket (if I am correct about the hierarchy). This makes it a tad hard to handle properly, as I don't see a nice (and preferably universal) way to signal every stream to tear down – is there one, though? Or do we have to push some sort of drain! chain of calls down to the very last level of nesting?

@ioquatix (Member)

I will revisit this problem this week. I understand the use case.

@ioquatix (Member)

Okay, here is where I'm landing on this:

Async::IO::SharedEndpoint (and IO::Endpoint::BoundEndpoint) both support #close which terminates the accept loop. See socketry/async-io#80 and socketry/io-endpoint#6 for details (and tests showing how it works).

Async::Container::Controller will introduce a graceful parameter which controls the shutdown sequence:

  • false: instant SIGTERM.
  • true: 1 second timeout.
  • Integer/Float: the timeout.

If a timeout is set, the existing logic handles it as follows:

  1. Send child processes SIGINT and wait for the given timeout.
  2. Send child processes SIGTERM.

We do not attempt to send multiple SIGINT signals. That means, with graceful = 5 for example, the child process will receive SIGINT followed by SIGTERM 5 seconds later if it has not exited.
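
As a rough sketch (illustrative only, not the actual Async::Container code), the parent-side sequence would be something like:

# Simplified sketch of the described shutdown sequence; not the real implementation.
def graceful_stop(pid, graceful)
  timeout =
    case graceful
    when false then nil    # no grace period: go straight to SIGTERM
    when true  then 1.0    # default 1 second grace period
    else graceful.to_f     # explicit timeout in seconds
    end

  if timeout
    Process.kill(:INT, pid) # ask the child to stop accepting and drain
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    loop do
      return if Process.waitpid(pid, Process::WNOHANG) # exited gracefully
      break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      sleep(0.05)
    end
  end

  # Grace period disabled or expired: terminate forcefully.
  begin
    Process.kill(:TERM, pid)
    Process.wait(pid)
  rescue Errno::ESRCH, Errno::ECHILD
    # Child already exited.
  end
end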

Falcon::Service::Server will intercept SIGINT and call @bound_endpoint.close. New connections won't be accepted. After the timeout, the server will receive SIGTERM which will immediately and forcefully terminate the event loop.

One part that's missing from this is broadcasting SIGINT further, i.e. if you have persistent connections like WebSockets, you might like to begin the shutdown sequence. It's not realistic for every WebSocket connection to intercept SIGINT... In other words, not accepting new connections is one piece. However, for persistent connections, what we want to do is also stop serving new requests, and for streaming responses, we want to start the shutdown process. I need to think a bit more about how this signalling mechanism should work.

@ioquatix (Member) commented Mar 25, 2024

I think we will need to mark the request/response life cycle in order to defer the shutdown sequence, e.g. a bit like Thread#handle_interrupt.

In other words:

while request = read_request
  Async::Task.defer_stop do
    write_response(generate_response)
  end
  # If a shutdown caused this task to stop, we will stop here. If we take too long, SIGTERM will kill the above block.
end

I'll need to think about the design carefully, but this would make sense for most code which has to handle graceful shutdown.

@ioquatix (Member)

socketry/async-container#31 adds tests and support for graceful stop. It was already implemented, but not exposed to user control.

@zarqman commented Mar 28, 2024

@ioquatix, I added graceful stop to our async-http powered app here and thought I'd share a couple of notes on the current design in case it helps.

To handle keep-alive and websockets, we added a new field, Async::Task#shutdown_data. Its value is an indicator of how to handle the shutdown for that connection / task.

HTTP1 connections set it to :h1_connection right before waiting for a new request line, and clear it once the HTTP1 request has been received and is actively processing. When present, it indicates the underlying HTTP1 connection can be terminated freely, i.e. when waiting for a new request during a keep-alive connection.

HTTP2 connections set it to [:h2_connection, connection] inside our HttpService < Async::HTTP::Server class. Since the HTTP2 protocol has its own graceful close facility, this is set at the very beginning of the connection / request cycle.

Finally, for websockets, it's set to :ws_stream on both the up and down tasks. (We use dedicated tasks to handle each side of the established WS connection.)

When a graceful stop is triggered, our HttpService < Async::HTTP::Server class traverses its entire task list, checking for tasks that are running? and have shutdown_data. If present, it calls subtask.stop for both HTTP1 and WS, and for HTTP2, calls conn.send_goaway(0) unless conn.closed?. Tasks that don't have shutdown_data are ignored. (WS may eventually get a more proper shutdown than simply stopping the tasks and closing the connections, but that's sufficient for our present needs.)
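
In rough pseudocode, that graceful-stop pass over the task list looks something like this (shutdown_data is our own addition, and the helper names are simplified):

# Rough sketch of the traversal described above; `shutdown_data` is our own
# addition to Async::Task, not an upstream API.
def graceful_stop_tasks(tasks)
  tasks.each do |task|
    next unless task.running?

    case data = task.shutdown_data
    when nil
      # In-flight request (or unrelated task): leave it alone to finish.
    when :h1_connection, :ws_stream
      # Idle keep-alive HTTP1 connections and WS pump tasks can simply be stopped.
      task.stop
    when Array # [:h2_connection, connection]
      _tag, connection = data
      # HTTP2 has its own graceful close: tell the client to stop opening new streams.
      connection.send_goaway(0) unless connection.closed?
    end
  end
end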

Additionally, on HTTP1, at the end of the response generation, if our HttpService < Async::HTTP::Server class has been marked stopping (aka in graceful shutdown), connection.persistent = false is forced right before returning, which catches the in-flight requests that the task list traversal above would have skipped.

I won't speak to the trigger mechanism as our app here is threads-only and doesn't use async-container, so it's probably mostly not applicable. Same as your recent merges, it closes the inbound shared endpoints first, then triggers the graceful stop on each thread.

There might be a better place to add shutdown_data than on Async::Task (and a better name), but that seemed handy enough for our purposes. If placing it on Async::Task is warranted, perhaps making it into a callback would keep it from being http-specific.

Worth noting that we also rely on socket timeouts everywhere and never have sockets without them. I don't recall whether we're relying on this separately from the above anywhere. FWIW, our graceful shutdown timeout is longer than our socket timeouts.

Anyway, hope that provides food for thought. Happy to answer questions or share code snippets if such would be helpful.

@ioquatix (Member) commented Mar 29, 2024

Thanks @zarqman that's extremely helpful to understand your use case.

It will take some iterations to get the exact behaviour correct, but I do like the simplicity of the defer_stop model. We probably just need to ensure that it's scoped around the right parts of the code, typically the response part.

I suppose any kind of graceful shutdown is best effort, but I'd love it if our best graceful shutdown was extremely good and left clients in a good state. As you said, this also includes things like WebSockets.

For WebSockets, it's a little different, and I imagine we want something like this:

begin
  websocket = receive_inbound_websocket
  while message = queue.next_message
    websocket.send(message)
  end
ensure
  websocket&.close # When SIGINT -> Async::Stop occurs, we will trigger this (safe navigation in case the handshake never completed).
  # We should ensure that `websocket.close` correctly shuts down the connection.
end

My current thinking is that the above logic is sufficient for correctly shutting down a WebSocket. The ensure block will execute in response to the Async::Stop, but internally we might still block during the ensure block while sending the close frame.

@yard commented Apr 3, 2024

My 5c: I like the defer_stop approach, but a small clarification would be helpful. This would only prevent #stop from functioning on the top-level task (effectively, in response to a shutdown sequence), right? The user's code can still rely on #stop, and it's not a good idea to "mask" all such invocations, as otherwise processing requests during graceful shutdown might break.

Another bit to watch out for is keep-alive connections. IIRC closing the bound endpoint won't do anything to those, so we still need to stop listening for new HTTP requests on connections which are currently active (maybe close them as soon as the response to the last request is written, or something along those lines).

@ioquatix (Member) commented Apr 4, 2024

My 5c: I like the defer_stop approach, but a small clarification would be helpful. This would only prevent #stop from functioning on the top-level task (effectively, in response to a shutdown sequence), right? The user's code can still rely on #stop, and it's not a good idea to "mask" all such invocations, as otherwise processing requests during graceful shutdown might break.

More generally, there is nothing stopping user code doing this:

while true
  begin
    sleep
  rescue Async::Stop
    # ignore
  end
end

There is simply no guarantee that it's possible to prevent stopping. However, Ruby 3.3 introduced Fiber#kill, which is much harder to ignore. In the future, that may be used in Async::Node#terminate.

Therefore, I don't think it really matters: bad code is bad code, good code is good code, and stopping will always be tricky. Additional features to mask user-level stop won't solve the problem, but will make things more complicated, which I'd like to avoid.

Another bit to watch out for is keep-alive connections. IIRC closing the bound endpoint won't do anything to those, so we still need to stop listening for new HTTP requests on connections which are currently active (maybe close them as soon as the response to the last request is written, or something along those lines).

When we issue stop, it will cause the accept loop to shut down, so no new connections will be accepted. In addition, the HTTP/1 connection request/response loop will terminate after the response is finished sending. HTTP/2 is a little more tricky, but conceptually similar. In fact, in most cases, you don't need to worry about closing the bound endpoint, as you actually want the new process to start accepting the inbound connections, and thus don't want to close the bound endpoint (but of course the accept loop in the old processes is terminated).

@yard commented Apr 4, 2024

More generally, there is nothing stopping user code doing this:

You are right, but my point is that the user code should not be stopped during graceful termination. Sure thing, when the timeout expires and the code still runs, we are happy to kill everything, but while there is time left, the user's code should "just work" – including awaiting the responses from APIs, database and whatnot. Otherwise, it's not exactly a graceful termination :)

In addition, the HTTP/1 connection request/response loop will terminate after the response is finished sending.

What would happen if a request is still being read when graceful termination is initiated? Will it be read in full and then handled, or will the read bail out with an error?

@ioquatix (Member) commented Apr 5, 2024

You are right, but my point is that the user code should not be stopped during graceful termination. Sure thing, when the timeout expires and the code still runs, we are happy to kill everything, but while there is time left, the user's code should "just work" – including awaiting the responses from APIs, database and whatnot. Otherwise, it's not exactly a graceful termination :)

Yes, agreed.

What would happen if a request is still being read when graceful termination is initiated? Will it be read in full and then handled, or will the read bail out with an error?

This is much more tricky.

I suppose for HTTP/1, we need to respond at least once with connection: close, so that the client does not try to send any more requests, or in other words, the client is informed clearly that the connection is shutting down.

For HTTP/2, there might be a similar mechanic we can use.

@zarqman commented Apr 5, 2024

In addition, the HTTP/1 connection request/response loop will terminate after the response is finished sending.

This needs to happen, but the defer_stop model cannot do this alone. The HTTP1 server will need a stopping = true flag that will cause it to not loop in Async::HTTP::Protocol::HTTP1::Server#each after the end of the present request. We can also use this flag at the end of the request to override persistent? to false and send connection: close.

It may be possible to reuse/override @persistent = false (instead of a new stopping flag) on each open connection. There is a slight risk of a race here though in the early stages of the request (after read_request_line and before @persistent=... finishes in read_request.)
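
Conceptually, against a simplified request loop (this is not the real #each implementation; read_request, write_response and @persistent stand in for the internals):

# Conceptual sketch only; `stop_gracefully!` and `@stopping` are hypothetical.
module GracefulHTTP1
  def stop_gracefully!
    @stopping = true
  end

  def each
    while request = read_request
      response = yield(request)

      if @stopping
        # Mark the connection non-persistent so the response carries
        # `connection: close` and the client stops reusing this socket.
        @persistent = false
      end

      write_response(response)

      # Exit the keep-alive loop once the in-flight request has been answered.
      break if @stopping || !@persistent
    end
  end
end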

What would happen if a request is still being read when graceful termination is initiated? Will it be read in full and then handled, or will the read bail out with an error?

With a flag, as long as the request has started (the first line has been read), it's no problem to let it finish.

I suppose for HTTP/1, we need to respond at least once with connection: close, so that the client does not try to send any more requests, or in other words, the client is informed clearly that the connection is shutting down.

If there's an active request, we should definitely respond with connection: close. However, if we're already waiting for a 2nd+ (ie: @count >= 1) request (specifically, waiting in h1's read_request_line), we should just close the connection immediately. But, if it's the first request (@count == 0), which just snuck in before the graceful stop, we should still process it.

In my experience, h1.1 clients are generally aware that the 2nd+ request might fail and often have quick retry mechanisms. For the first request, that's not a given though (doubly so for h1.0 clients).
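
As pseudocode (again, @count and the helper names just mirror the description above rather than an exact API):

# Rough per-connection decision on graceful stop; helper names are illustrative.
def on_graceful_stop
  if processing_request?
    @persistent = false   # finish the request, respond with `connection: close`
  elsif @count >= 1
    close                 # idle keep-alive connection waiting for a 2nd+ request
  else
    # @count == 0: a first request may have just snuck in; let it be read and
    # processed, then close via the non-persistent path above.
  end
end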

HTTP/2 is a little more tricky, but conceptually similar.

Shouldn't we trigger h2's own graceful stop mechanism (send_goaway(0)) for each existing connection too? Otherwise existing h2 connections will be allowed to continue initiating new streams (aka keep-alive) and thus graceful shutdown will either never complete or eventually have to be forcefully shut down.

This is particularly important when async-http or falcon is fronted by an LB that multiplexes inbound requests into a smaller number of outbound h2 requests to the backend and reuses those h2 connections extensively. Actually, the same is true for h1 LBs that also perform connection reuse.
