More ergonomic support for graceful shutdown #147

Open
merrellb opened this Issue May 2, 2017 · 17 comments

@merrellb

merrellb commented May 2, 2017

[Edit: This issue started as a discussion of the API for cancel scope shielding, motivated by @merrellb using it to catch Cancelled errors and do clean up activities like sending "goodbye" messages in #143, but then morphed into a more general discussion of how trio could help users implement "graceful shutdown" functionality in general, so let's make it the standard issue for that. Original text follows. – @njsmith]


The documentation suggests assigning a cancel scope and setting shield as two separate steps:

with move_on_after(CLEANUP_TIMEOUT) as cleanup_scope:
    cleanup_scope.shield = True
    await conn.send_goodbye_msg()

However trio.open_cancel_scope has a shield argument and it seems like it would also make sense with the convenience functions (trio.move_on_after, trio.move_on_at, trio.fail_after and trio.fail_at). eg:

with move_on_after(CLEANUP_TIMEOUT, shield=True):
    await conn.send_goodbye_msg()
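The change itself is small. As a sketch (using a toy stand-in for trio's cancel scope, since the real one lives in trio's internals), the convenience wrapper would simply forward the keyword:

```python
import math

class CancelScope:
    """Toy stand-in for trio's cancel scope, for illustration only."""
    def __init__(self, *, deadline=math.inf, shield=False):
        self.deadline = deadline
        self.shield = shield
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False

def move_on_after(seconds, *, shield=False):
    # The proposal: accept shield= up front instead of requiring a
    # separate attribute assignment after entering the scope.
    return CancelScope(deadline=seconds, shield=shield)

with move_on_after(5, shield=True) as scope:
    assert scope.shield  # no second step needed
```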

I would be happy to make the pull-request if this change makes sense.

@njsmith

Member

njsmith commented May 7, 2017

For the record, the reason I didn't do this in the first place is that I feel like shield should be a pretty obscure and rarely used feature. It's very easy to use in a way that breaks your program in subtle ways (e.g. by making it unresponsive to rare network failure conditions), it breaks the rule that timeout policy should be set as far up the stack as possible, and I don't want it to become an attractive nuisance. So I included it in open_cancel_scope since that's a kind of low-level "raw" API, but left it out of the high-level timeout APIs.

Maybe it will turn out that I'm wrong and shield really is something that gets used a lot, but I can't help but feel that that would indicate that we've messed up somehow.

In the meantime, I'm sort of feeling like maybe we should remove the kwarg from open_cancel_scope :-). Shielding should be a bit annoying to use...

While we're at it, here's another idea: maybe we can have a way to say "yes, I know I'm cancelled, but please stop raising it unless an operation actually blocks". Like a shield with a zero second per operation timeout. So you could attempt to do things like send a goodbye message, but if it doesn't succeed immediately, oh well, let's move on. It would at least remove the issue of having to pick some magic timeout value way down inside the code...

... Though, I guess this doesn't actually help for your use case with websockets, since the main reason websockets have a goodbye message is so the initiator can then wait until they see the corresponding goodbye message from their peer, and know that both sides successfully processed all the data that was sent. (Since otherwise TCP can lose some data at the end of a connection.) Sending a goodbye message and then immediately closing the socket is pretty useless, but that's all my idea above would allow for.

Otoh I'm not entirely sure I buy that it makes sense to send websocket goodbye messages on control-C in the first place :-). "Potential data loss" sounds bad, but here it just means that the connection might act like you hit control-C a little earlier than you actually did... and who's to say that's not what happened? (Unless you're carefully monitoring the connection state and choosing when to hit control-C with millisecond precision, which seems like a weird hobby.) But maybe the problem is more general – certainly there are a lot of protocols and applications that have some concept of a graceful shutdown.

@merrellb

merrellb commented May 8, 2017

At the application level, I think a graceful shutdown might be fairly common. For example, I may want to make a call to a local database (ie something with time constraints I understand fairly well), storing the last record acknowledged by a remote WebSocket client. With nursery cancellation as the best way to manage a group of tasks, I think this may come up in many circumstances other than control-C (eg explicitly deciding to close a single connection)

I am not sure I understand your concerns about adding shield to move_on_after. Isn't that exactly what we should be wrapping our graceful connection shutdown code within? I would almost want it to be automatically enabled when exiting a canceled scope :-) . With a server spawning a nursery per incoming connection, how could we handle this at a higher level?

@njsmith

Member

njsmith commented May 9, 2017

I think there are two things making me uneasy here:

  • I don't feel like I have a very clear picture of what the goal here should be. Cancelling work is inherently a messy process, networks are unreliable, and in distributed systems you rarely have complete information. Like in your example, you say you want to record the last record acknowledged by the remote WebSocket client, but for all you know there might be another acknowledgement that they already sent, it's just still sitting in a router somewhere and you haven't seen it yet. (This is the Byzantine generals problem.) I totally understand the desire to make this kind of thing as reliable as possible! I want that too. But in my experience, the first and most important step in making a distributed system reliable is to have a clear picture of what we {have to / are willing to} give up. Complexity is the number 1 enemy of reliability, so papering over individual problems as they arise just doesn't work. And even in your example of making a last call to a local database, I don't know how to judge whether the complexity cost is worth it. If the control-C had arrived 0.1 seconds earlier, then you wouldn't even know that there was a last record to write, you'd just have exited, and apparently that would have been OK – so is it really worth bending over backwards to handle it? Maybe it is, I don't know.

  • As a matter of software design, I don't like the idea of low-level code imposing policy around timeouts and shutdowns. What if later on you decide that actually, on control-C exiting in 2 seconds is not good enough, you really need to exit in 1 second (or vice-versa). Do you have to go around and modify all the different places in your code that clean up resources?

    What if you're using several libraries, that each have their own internal policy on how long they'll hold out when you try to cancel them? What if you really want to shut down ASAP? Or on the other hand, what if one library wants to wait for 5 seconds and another wants to wait for 10 seconds, so the whole system ends up waiting for 10 seconds and that's fine, but the first library gave up after 5 seconds, and loses data that could have been saved if it had known that it actually had 10 seconds to work with?

These two concerns are closely related, because if we're hard-coding timeouts into little pieces of the code that can't see the big picture, then we're very likely to make the wrong trade-offs; OTOH if timeouts are being configured as late as possible and at as high a level as possible, then it's much more plausible that we can make the right trade-offs.

I almost feel like what we want is for a top-down cancellation to be able to specify a "grace period" – so the code that calls cancel() can also specify how long it's willing to wait for the cancellation to complete. The problem then would be in communicating this to the code being cancelled. The simple version would be:

  • by default, code is immediately cancelled
  • but code that wants to allow for graceful shutdown can have something like a "soft shield" that shields against cancellation during the grace period, but not after the grace period expires
  • and there's some mechanism that code running under the "soft shield" can check whether a cancellation has occurred, so that it knows that it should shut down the connection or whatever, e.g. doing if currently_cancelled(): ... or something.

That's enough for cases where the connection is active: e.g. we can imagine an HTTP server that, every time it goes to send a response, checks currently_cancelled() and if so marks the response as being the last one of this connection and immediately closes it afterwards. But what it's missing is some way for the code that wants to support graceful cleanup to get notified when a cancellation happens – like if an HTTP server or your websocket server is sitting idle waiting for something to arrive, then when it's cancelled it should immediately wake up and shut down the connection without having to poll currently_cancelled().
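As a toy model of that third bullet (all names are hypothetical — nothing like this exists in trio today), the pending soft cancellation could be tracked as ambient state that soft-shielded code polls:

```python
import contextvars

# Hypothetical ambient flag: set when a graceful cancellation has been
# requested but soft-shielded code is still being allowed to run.
_soft_cancelled = contextvars.ContextVar("soft_cancelled", default=False)

def currently_cancelled():
    """Poll whether a soft cancellation is pending in this context."""
    return _soft_cancelled.get()

def deliver_soft_cancel():
    """What the framework would do when the grace period begins."""
    _soft_cancelled.set(True)
```

A request handler could then call currently_cancelled() at each natural stopping point (e.g. after sending a response) and begin its shutdown when it returns True.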

One obvious way to do this would be to send in an exception, and then code that wants to handle it gracefully can catch it and do whatever clever thing it wants. But... in general, you can't necessarily catch a Cancelled exception and do something clever with it. (This is an important point that I missed before, but is relevant to what you're trying to do in general!). The mere fact of delivering a Cancelled exception can itself break your connection irrecoverably. Like, consider if you were in the middle of doing a sendall to transmit a websocket message frame, and that got interrupted – you can't necessarily catch that exception and send a goodbye frame, because the previous frame might have been only half-transmitted and now you've lost synchronization. A Cancelled exception is better behaved than a generic KeyboardInterrupt, but it's still a bomb going off in the middle of your program state.

Maybe it's enough to make the soft-shield state toggleable, so you could do something like:

async def handler(ws_conn):
    with soft_shield_enabled:
        while True:
            try:
                with soft_shield_disabled:
                    msg = await ws_conn.next_message()
            except Cancelled:
                await do_cleanup(ws_conn)
                return
            await ws_conn.send_message(make_response(msg))

...Need to think about this some more. I feel like there's something here, but it isn't fully gelled yet :-).

@njsmith

Member

njsmith commented May 9, 2017

Note: if we do promote graceful cancellation to a first-class concept, then I guess we'll want to rethink restrict_keyboard_interrupt_to_checkpoints. Maybe something like an argument that (a) makes it so KI cancels everything (i.e. actually triggers a cancellation inside init), and (b) lets you specify the grace period on this cancellation?

@merrellb

merrellb commented May 9, 2017

  • I am thinking along the lines of long running WebSockets where I am not too worried about losing a few messages but rather several hours of accumulated state. A largely "in memory" scenario with lazy persistence for performance reasons. Being able to save a snapshot on disconnect is definitely worth some complexity vs potentially replaying hours of logs/journals.
  • As you can imagine I do like the idea of graceful cancellation being a first-class concept :-). I understand how it can be messy but I don't see why it needs to be messy for common cases, especially when one controls both ends of the connection.
  • I like your idea of a graceful shutdown period (from above) and soft-shield.
  • Do we also need to worry about partial receives when canceling? I could see potentially getting out of sync if we are expecting an ack as part of our shutdown. Is there some way to cancel only if the receive is idle?
@njsmith

Member

njsmith commented May 10, 2017

> I am thinking along the lines of long running WebSockets where I am not too worried about losing a few messages but rather several hours of accumulated state. A largely "in memory" scenario with lazy persistence for performance reasons. Being able to save a snapshot on disconnect is definitely worth some complexity vs potentially replaying hours of logs/journals.

I'm dubious about the wisdom of getting into the state where you have hours of valuable data accumulated in memory in the first place – maybe you should flush snapshots regularly on a timer? But that said, assuming there's a good reason to do things this way, I think you'd really want to implement this as a task that listens for control-C and then flushes a snapshot. There's nothing wrong with doing that if you have needs that are a little more unusual/extreme, and I think "I have hours of valuable data that I need to make sure get flushed to a database over an async connection while shutting down" counts.

> Do we also need to worry about partial receives when canceling? I could see potentially getting out of sync if we are expecting an ack as part of our shutdown. Is there some way to cancel only if the receive is idle?

Yeah, you'd have to be careful that the code you put inside the soft_shield_disabled block (like ws_conn.next_message in the example above) is written in a cancellation-safe manner. So for example if it looks like:

    async def next_message(self):
        while True:
            message = self._proto_state.get_next_message()
            if message is not None:
                return message
            else:
                # Receive some data and then try again
                data = await self._transport.recv(BUFSIZE)
                self._proto_state.receive_bytes(data)

then we're pretty much OK – the only await is in the call to self._transport.recv, and if it raises Cancelled then we'll exit early and leave behind any data that we already received in _proto_state's internal buffer to finish reading next time. So as long as recv itself is cancellation-safe then next_message will be too. And SocketType.recv is cancellation-safe.

However... this gets tricky for some protocols.

If self._transport is an SSL-wrapped socket instead of a regular socket, then it's possible that calling SSLStream.recv will need to send data on the underlying socket (because of a horrible thing called renegotiation). The version of SSLStream in my branch (#107) definitely does not guarantee that the connection is still usable after recv is cancelled.

A similar issue arises at the protocol level. For websockets, the next_message method looks to the user like it's just reading, but if it ends up reading a ping frame then it might want to send back a pong. And then if this gets cancelled half-way through then again we've lost sync.

Solving these problems seems really hard :-(. The way these protocols work, once you start trying to send something then you're really committed to doing so – like you just can't stop in the middle of a websocket Pong frame and switch to a Close frame. (For OpenSSL you're even committing to retrying the same OpenSSL API call repeatedly; it's not enough to make sure that the bytes it gave you eventually get sent. But let's ignore that for right now.)

So what would be annoying but possible is to switch the sending to always use a send-style API call (which can do partial sends and reports how much data was sent) instead of a sendall-call, and only remove bytes from the internal protocol object's outgoing buffer when send accepts them. (I wasn't planning to provide send in the high-level stream interface at all but let's assume we have it.) This would have the effect that if one operation gets interrupted during a send, then the next operation on the same object would pick it up and finish it before doing whatever else it needs to do. So e.g. if we get cancelled in the middle of sending a Ping, then when we go to send the Close it'll send the second half of the Ping + the Close.
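The mechanism above can be sketched concretely (hypothetical names, with a fake transport standing in for a real stream): bytes only leave the outgoing buffer once the partial send accepts them, so a cancelled operation leaves the remainder for the next call to drain first. (The sketch uses asyncio.run just as a driver; in trio the same logic would run under the nursery.)

```python
import asyncio

class FakeTransport:
    """Stand-in transport whose send() accepts at most 4 bytes at a time,
    mimicking partial sends on a real socket."""
    def __init__(self):
        self.received = bytearray()
    async def send(self, data):
        taken = data[:4]
        self.received += taken
        return len(taken)

class BufferedSender:
    """Hypothetical protocol-side send buffer."""
    def __init__(self, transport):
        self._transport = transport
        self._outgoing = bytearray()
    async def sendall(self, data):
        self._outgoing += data
        # Drain whatever is buffered -- including leftovers from a
        # previously cancelled call -- before returning. If we get
        # cancelled mid-loop, the un-sent tail stays in _outgoing.
        while self._outgoing:
            sent = await self._transport.send(bytes(self._outgoing))
            del self._outgoing[:sent]

transport = FakeTransport()
asyncio.run(BufferedSender(transport).sendall(b"ping-frame"))
assert bytes(transport.received) == b"ping-frame"
```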

Possibly this would become simpler if we change the semantics of sendall (in the high level stream interface, not for the raw socket interface) so that on cancellation it holds onto the data and automatically prepends it to the next call to sendall? I'm wary because this moves us back towards the infinite buffers of twisted-like systems and somewhat violates the no-implicit-concurrency rule, but if it's restricted to cancellations, where currently the only thing you can do is just close the connection, then maybe it's good. Basically this would be implementing the buffering logic inside the stream object instead of the protocol object. ...though, maybe it wouldn't work with SSL, because we could get in a state where we need to call sendall(b"") to flush out the internal send buffer before anything will work, but we don't know it and block waiting to receive instead.


Another possibility is that we add more shielding states. Specifically, a task could be in one of the four states listed down the left column here, and the table shows what checkpoints do in each case:

|                                | Soft-cancelled          | Hard-cancelled  |
| ------------------------------ | ----------------------- | --------------- |
| Default                        | raise Cancelled         | raise Cancelled |
| Soft-shield, exceptions OK     | raise CancelledIsComing | raise Cancelled |
| Soft-shield, exceptions not OK | nothing                 | raise Cancelled |
| Hard shield                    | nothing                 | nothing         |

But we don't set these states directly; instead, there are two context managers: one that marks its contents as "soft-cancellation enabled", and one that marks its contents as "state may become corrupted if a cancellation happens". Then the states are:

  • if there are any shield=True scopes on the stack below the cancelled scope, then we're in state "Hard shield"

  • else, if there is at least one "soft-cancellation enabled" manager in effect, and no "state may become corrupted if a cancellation happens" managers in effect, then we're in state "Soft-shield, exceptions OK"

  • else, if there is at least one "soft-cancellation enabled" manager in effect, and at least one "state may become corrupted if a cancellation happens" manager in effect, then we're in state "Soft-shield, exceptions not OK"

  • else, we're in state "Default"
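The decision procedure in those bullets can be written down directly (terminology from the table above; this is a restatement of the rules, not a proposed API):

```python
def shield_state(*, hard_shield, soft_enabled, fragile):
    """Resolve the effective shielding state from three facts:
    hard_shield  -- a shield=True scope sits below the cancelled scope
    soft_enabled -- a "soft-cancellation enabled" manager is in effect
    fragile      -- a "state may become corrupted" manager is in effect
    """
    if hard_shield:
        return "hard shield"
    if soft_enabled and not fragile:
        return "soft-shield, exceptions OK"
    if soft_enabled and fragile:
        return "soft-shield, exceptions not OK"
    return "default"
```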

So... this is really complicated. But the idea is that in the end, a method like SSLStream.recv puts the "state may become corrupted" manager around its calls to sendall, and by default this doesn't do anything, but now it can guarantee that if some code that supports soft-cancellation calls it, and a soft-cancellation happens, then SSLStream.recv will leave the stream in a consistent state.

Hmm.

@njsmith njsmith changed the title from Add shield argument to move_on_after and other convenience functions to More ergonomic support for graceful shutdown May 10, 2017

@merrellb

merrellb commented May 10, 2017

Can we have a "task that listens for control-C"? I thought this was usually caught by the outer nursery itself after all tasks had canceled. While control-C is an important special case, I would think a problem with a single connection and the cancellation of its associated nursery would be the more routine scenario. I would want a graceful shutdown of these intermediate nurseries, giving a chance to persist a snapshot of the connection's associated application state.

Of the options you mention, the four-shielding-state approach sounds the most elegant, although I realize this is beginning to also threaten the elegant simplicity you currently have :-) Still, I do like this idea of providing extra protection where dragons lie while adding soft-cancellation to give time for a graceful shutdown (although that is easy for the guy not implementing it to say!)

@njsmith

Member

njsmith commented May 10, 2017

> Can we have a "task that listens for control-C"? I thought this was usually caught by the outer nursery itself after all tasks had canceled.

By default, KeyboardInterrupt gets delivered to some totally arbitrary place. In practice, it will often end up being delivered to the main task, which is sitting in the nursery __aexit__, and gets handled like any exception that happens inside the nursery scope (i.e., triggers a cancellation of everything in that nursery, and then gets propagated). The reason this is common is that it's what trio does if the control-C happens at a moment when the program is idle (no tasks currently running). But there's no guarantee of this – by default, it could be delivered into the middle of some delicate connection management code or whatever.

This blog post goes into a lot more detail about the trade-offs of different ways of handling control-C. In particular, there's discussion of why the default KeyboardInterrupt behavior (in both regular Python and in trio) is pretty much incompatible with guaranteeing a correct graceful cleanup, and also an example of how to write a task that listens for control-C. In the example it just does raise KeyboardInterrupt, but it could just as easily do an await program_state.take_snapshot() first and then cancel everything, or any other arbitrary thing you want.

> the four shielding state approach sounds the most elegant although I realize this is beginning to also threaten the elegant simplicity you currently have :-)

Yeah, I'm not sure... some kind of graceful shutdown is a pretty common/fundamental desire, so maybe it's worth it? But this is an area where trio's already ahead of the competition, and it's extremely tricky, and novel, so I'll probably let it stew for a while before making any decisions.

One thing that did occur to me is that I've already been thinking that in code like SSLStream that's "fragile" against cancellation (i.e., it can cause the object's internal state to become corrupted and unusable), it might be a good idea to have some guard that force-closes the stream on cancellation, like:

async def recv(self, max_bytes):
    try:
        # ... complicated stuff here ...
    except Cancelled:
        # our state is broken, but at least we can make sure that any attempt to use it fails fast
        self.forceful_close()
        raise

If I wrapped that up into a context manager like with cleanup_on_cancelled(self.forceful_close): ..., then that context manager could also set the flag that prevents soft cancellations from being delivered inside it if soft-cancellation handling is enabled – since these are two strategies for handling this kind of fragile state, they're needed in the same places.
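A minimal sketch of that context manager (with a stand-in Cancelled class, since the real one is trio.Cancelled, and the cleanup_on_cancelled name is hypothetical):

```python
from contextlib import contextmanager

class Cancelled(BaseException):
    """Stand-in for trio.Cancelled, for illustration only."""

@contextmanager
def cleanup_on_cancelled(cleanup_fn):
    try:
        yield
    except Cancelled:
        # Our state may be broken; force-close so later use fails fast,
        # then let the cancellation keep propagating.
        cleanup_fn()
        raise

closed = []
try:
    with cleanup_on_cancelled(lambda: closed.append(True)):
        raise Cancelled()
except Cancelled:
    pass
assert closed == [True]
```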

Hmm.

@njsmith

Member

njsmith commented May 10, 2017

I wonder if we should care that in principle someone might want to do some sort of graceful shutdown but doesn't care about the integrity of the stream they're reading from right now, it's some other object that needs cleanup.

@njsmith

Member

njsmith commented Jul 10, 2017

This seems like a relevant paper for the whole "cancellation fragility" issue: Kill-Safe Synchronization Abstractions

@njsmith

Member

njsmith commented Aug 10, 2017

Was thinking about this again in the shower today. It seems like we could maybe simplify the whole "soft shield" idea by making the default state be soft-shielded. That is, the semantics would be something like:

cancel_scope.graceful_cancel(timeout), cancel_scope.graceful_cancel_at(time): set the deadline to min(current_deadline, implied_deadline), and also set the gracefully_cancelled flag
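A toy model of those two calls (names and semantics as proposed here; none of this is real trio API) makes the min() and the flag explicit:

```python
import math

class CancelScope:
    """Toy model of the proposed graceful-cancel API."""
    def __init__(self):
        self.deadline = math.inf
        self.gracefully_cancelled = False

    def graceful_cancel_at(self, time):
        # Only ever tighten the deadline, and raise the flag so that
        # opted-in code can see the grace period has started.
        self.deadline = min(self.deadline, time)
        self.gracefully_cancelled = True

    def graceful_cancel(self, timeout, *, now=0.0):
        # `now` stands in for trio.current_time().
        self.graceful_cancel_at(now + timeout)
```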

Code that's cancellation-oblivious gets the full time to complete. This seems pretty reasonable -- it's a bit annoying for code that's cancellation oblivious and runs forever, because it will use up the full graceful period before getting killed, but it's definitely a better default for any code that normally runs for a finite amount of time, and if infinite loops want good graceful cancellation then they generally need to do something special anyway.

If you don't want to set a timeout, you can use inf, but probably almost everyone wants to set some timeout so might as well make it an opt-out thing in the API.

Then code that wants to do something clever on graceful cancellation can explicitly register some interest -- I guess by flipping a flag to say "please consider graceful cancellations to be cancellations". (That's sufficient to implement wait_for_graceful_cancellation().) We might also want a currently_in_grace_period() function to poll that state – not sure.

Certainly that's enough for an HTTP server that wants to stop accepting new keepalive requests when the graceful shutdown is triggered. For the case of the websocket server that wants to send a goodbye over TLS... well, one option is to say that grace periods just don't mix well with renegotiation, sorry. In general renegotiation and similar is even less well-supported than I realized when I wrote the above. Or we could make SSLStream.receive_some guaranteed to survive cancellation; that's #198. Or we could have the ability to toggle the "please consider graceful cancellations to be cancellations" back to disabled, when calling transport_stream.send_all() from inside ssl_stream.recv_some.

@njsmith

Member

njsmith commented Aug 10, 2017

The terminology here could use some work... graceful is on the tip of my tongue because I just stripped it out of AsyncResource, but we still have trio.aclose_forcefully, which is pretty unrelated. Grace period? cancel_with_grace_period(timeout)?

@njsmith

Member

njsmith commented Aug 21, 2017

Another thing I just realized: for the minimal version of this idea where setting a grace period acts as (a) setting an overall deadline, and (b) setting a flag indicating that everyone should hurry up, and (c) letting specific pieces of code opt-in to receiving a cancellation when this flag is set, then it's actually totally possible to prototype as a third-party library. Having it built in would potentially have extra advantages, specifically in terms of letting the built-in server code integrate and letting SSLStream do clever things with renegotiation, and also being a standard thing that the whole ecosystem can coordinate around, but I don't think those are necessary for a prototype to be useful.

@njsmith

Member

njsmith commented Dec 23, 2017

There's some interesting discussion of graceful shutdown in Martin Sustrik's Structured Concurrency post, from a rather different perspective.

@cjerdonek

cjerdonek commented Jan 15, 2018

A few thoughts that could be useful (after seeing the thread here: https://mail.python.org/pipermail/async-sig/2018-January/000437.html):

First, I'd separate out the question of which "knobs" a library API might want to expose to its users. These could be things like "finish in this total amount of time no matter what (including clean-up)," "allow x time for clean-up step y (no matter what)," or "aim to finish in this amount of time (excluding clean-up)." It would then be up to the library author to use that information as they see fit. For example, if a user specified 30 seconds absolute total time and 10 seconds for clean-up, then the library would compute and allot 20 seconds for the "success" case. Or the library could even tell the user their knob values are incompatible. Lastly, it would be up to the framework (trio in this case) to expose to the library author the primitives to support using that information. One consequence of this way of thinking is that it seems like it would be best if the framework didn't even have to know about concepts like "shutdown" and "graceful." Instead, it could expose lower-level operations like increasing or decreasing the allotted time, or executing a block of code within a computed amount of time, and it would be up to the library author to map those operations onto graceful shutdown, etc.
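The budget arithmetic in that example is the easy part; as a sketch (split_budget is a hypothetical helper, not any library's API), the library's job would be deriving the success-phase allotment from the user's knobs and rejecting impossible combinations:

```python
def split_budget(total_time, cleanup_time):
    """Derive (success_budget, cleanup_budget) from user-facing knobs."""
    if cleanup_time > total_time:
        # The user's knob values are incompatible; say so up front.
        raise ValueError("incompatible knobs: cleanup exceeds total time")
    return total_time - cleanup_time, cleanup_time

# 30s absolute total with 10s reserved for clean-up leaves 20s for
# the success case, as in the example above.
success, cleanup = split_budget(30, 10)
```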

@njsmith

Member

njsmith commented Jan 15, 2018

I think once you get into defining a set of knobs whose semantics are specific to the particular function being called, then those are what we call function arguments :-).

In general, there's a big trade-off between semantic richness and universality -- if Trio adds something to the core cancellation semantics, then everyone using trio has to deal with it to some extent, whether it makes sense in their case or not. This can be good – the reason trio has core cancellation semantics in the first place is that libraries that don't support cancellation/timeouts are essentially impossible to use correctly, and getting it right requires some global coordination across different libraries, so it's actually good that we force everyone to think about it and provide a common vocabulary for doing so.

The need for universality definitely constrains what we can do though. We can't in general phrase things as "make sure to be finished by time X", because this requires being able to estimate ahead of time how long cleanup operations will take, and there's no way to do that for arbitrary code. And just in general I don't want to provide tons of bells and whistles that nobody uses, or are only useful in corner cases.

The semantic difference between regular execution and a "soft/graceful cancel" is that you're saying you want things to finish up, but allowing them to take their natural time. This is particularly relevant for operations that run forever, like a server accept loop – for them a graceful cancel is basically the same as a cancel. But for an operation with (hopefully) bounded time, like running an HTTP request handler, it can keep going. So it's... at least kinda general? And it definitely has some important use cases (HTTP servers!). But I'm not sure yet whether it's a good generic abstraction or not.

@njsmith

Member

njsmith commented Aug 21, 2018

> Another thing I just realized: for the minimal version of this idea where setting a grace period acts as (a) setting an overall deadline, and (b) setting a flag indicating that everyone should hurry up, and (c) letting specific pieces of code opt-in to receiving a cancellation when this flag is set, then it's actually totally possible to prototype as a third-party library

Prototype here in case anyone wants to experiment with this: https://gist.github.com/njsmith/1c83788289aaed49e091c8281d85a85e

(Please do, that would be very helpful for figuring out if it's something we really want to do :-))
