New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to cancel long-standing operations #3826
Comments
More or less. This is about individual, internal operations, whereas that issue is about cancelling a higher-level operation, composed of many streaming requests. Right now we can't cancel those individual streaming requests properly (I think), but using the using the mechanisms proposed they would be. |
View updates are sent with a timeout of 5 minutes, unrelated to any user-defined value and meant as a protection mechanism. During normal operation we don’t benefit from timing out view writes and offloading them to the hinted-handoff queue, since they are an internal, non-real time workload that we already spent resources on. Refs scylladb#2538 Refs scylladb#3826 Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View updates are sent with a timeout of 5 minutes, unrelated to any user-defined value and meant as a protection mechanism. During normal operation we don’t benefit from timing out view writes and offloading them to the hinted-handoff queue, since they are an internal, non-real time workload that we already spent resources on. Refs scylladb#2538 Refs scylladb#3826 Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View updates are sent with a timeout of 5 minutes, unrelated to any user-defined value and meant as a protection mechanism. During normal operation we don’t benefit from timing out view writes and offloading them to the hinted-handoff queue, since they are an internal, non-real time workload that we already spent resources on. Refs scylladb#2538 Refs scylladb#3826 Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View updates are sent with a timeout of 5 minutes, unrelated to any user-defined value and meant as a protection mechanism. During normal operation we don’t benefit from timing out view writes and offloading them to the hinted-handoff queue, since they are an internal, non-real time workload that we already spent resources on. Refs scylladb#2538 Refs scylladb#3826 Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Another example is a view replica being restarted. Gossip can detect this through the change in generation. We should also be able to detect this, and cancel the relevant operations (i.e., pending view updates). |
This issue is particularly important for view updates, which have a high timeout. Fixing it helps with memory consumption and decreases shutdown time in some cases. We should give it higher priority. |
View updates are sent with a timeout of 5 minutes, unrelated to any user-defined value and meant as a protection mechanism. During normal operation we don’t benefit from timing out view writes and offloading them to the hinted-handoff queue, since they are an internal, non-real time workload that we already spent resources on. This value should be increases further, but that change depends on Refs scylladb#2538 Refs scylladb#3826 Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View updates are sent with a timeout of 5 minutes, unrelated to any user-defined value and meant as a protection mechanism. During normal operation we don’t benefit from timing out view writes and offloading them to the hinted-handoff queue, since they are an internal, non-real time workload that we already spent resources on. This value should be increases further, but that change depends on Refs scylladb#2538 Refs scylladb#3826 Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View updates are sent with a timeout of 5 minutes, unrelated to any user-defined value and meant as a protection mechanism. During normal operation we don’t benefit from timing out view writes and offloading them to the hinted-handoff queue, since they are an internal, non-real time workload that we already spent resources on. This value should be increases further, but that change depends on Refs scylladb#2538 Refs scylladb#3826 Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View updates are sent with a timeout of 5 minutes, unrelated to any user-defined value and meant as a protection mechanism. During normal operation we don’t benefit from timing out view writes and offloading them to the hinted-handoff queue, since they are an internal, non-real time workload that we already spent resources on. This value should be increases further, but that change depends on Refs scylladb#2538 Refs scylladb#3826 Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View updates are sent with a timeout of 5 minutes, unrelated to any user-defined value and meant as a protection mechanism. During normal operation we don’t benefit from timing out view writes and offloading them to the hinted-handoff queue, since they are an internal, non-real time workload that we already spent resources on. This value should be increases further, but that change depends on Refs scylladb#2538 Refs scylladb#3826 Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View updates are sent with a timeout of 5 minutes, unrelated to any user-defined value and meant as a protection mechanism. During normal operation we don’t benefit from timing out view writes and offloading them to the hinted-handoff queue, since they are an internal, non-real time workload that we already spent resources on. This value should be increases further, but that change depends on Refs scylladb#2538 Refs scylladb#3826 Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View updates are sent with a timeout of 5 minutes, unrelated to any user-defined value and meant as a protection mechanism. During normal operation we don’t benefit from timing out view writes and offloading them to the hinted-handoff queue, since they are an internal, non-real time workload that we already spent resources on. This value should be increases further, but that change depends on Refs scylladb#2538 Refs scylladb#3826 Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The three dtests mentioned in #4019 are a good example of why this feature is needed, and can be used to test it. We want the view-write timeout to be as high as possible, so as to allow long view-update backlogs, but because RPC currently may wait this entire duration even if the remote node goes down, a shutdown can hang for this timeout while it is waiting for all pending view writes, or an ongoing view-build step, to complete. Since the tests mentioned in #4019 only wait 2 minutes for the shutdown, they can't currently work with a timeout in This feature (stopping instead of waiting for the full timeout) should be optional, in case somebody really needs the current behavior of continuing to wait even if the remote node goes down. But before we do that, we should verify that this waiting even make sense - is the RPC retried after the dead node goes back up? If not, continuing to wait is pointless. |
@duarten, I'm not sure what concrete proposal #3966 is making. According to @gleb-cloudius (if I understood what he said correctly...) the write that a view build is hung on in #4019 (a write which started when we didn't know the node went down) will never stop until the timeout. So we need a way to tell a write that has already started that it needs to give up the network send immediately. This is this issue, no? Then, indeed, once such a write it stops, it should indeed write to a hint as #3966 suggests, but I assumed this will happen automatically anyway - because when the network send fails, the CL=ANY should try to write a hint. |
#3966 is about completing a shutdown in a timely fashion, regardless of nodes being down or just being slow. We can detect that there is a shutdown in progress without enlisting the gossiper. All we need to do in the For this, it doesn't matter whether a node is down or just being slow: we can start a shutdown timer (of, say, 1 minute), after which we'll write pending writes as hints. |
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#3966 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Refs scylladb#3826 Refs scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
In order to be able to act when node joins/leaves, storage proxy is registered as an endpoint lifecycle subscriber. Fixes scylladb#3826 Fixes scylladb#4028
" This series allows canceling view update requests when a node is discovered DOWN. View updates are sent in the background with long timeout (5 minutes), and in case we discover that the node is unavailable, there's no point in waiting that long for the request to finish. What's more, waiting for these requests occurs on shutdown, which may result in waiting 5 minutes until Scylla properly shuts down, which is bad for both users and dtests. This series implements storage_proxy as a lifecycle subscriber, so it can react to membership changes. It also keeps track of all "interruptible" writes per endpoint, so once a node is detected as DOWN, an artificial timeout can be triggered for all aforementioned write requests. Fixes #3826 Fixes #3966 Fixes #4028 " * 'write_hints_for_view_updates_on_shutdown_3' of https://github.com/psarna/scylla: service: remove unused stop_hints_manager storage_proxy: add drain_on_shutdown implementation main: register storage proxy as lifecycle subscriber storage_proxy: add endpoint_lifecycle_subscriber interface storage_proxy: register view update handlers for view write type storage_proxy: add intrusive list of view write handlers storage_proxy: add view_update_write_response_handler
" This series allows canceling view update requests when a node is discovered DOWN. View updates are sent in the background with long timeout (5 minutes), and in case we discover that the node is unavailable, there's no point in waiting that long for the request to finish. What's more, waiting for these requests occurs on shutdown, which may result in waiting 5 minutes until Scylla properly shuts down, which is bad for both users and dtests. This series implements storage_proxy as a lifecycle subscriber, so it can react to membership changes. It also keeps track of all "interruptible" writes per endpoint, so once a node is detected as DOWN, an artificial timeout can be triggered for all aforementioned write requests. Fixes #3826 Fixes #3966 Fixes #4028 " * 'write_hints_for_view_updates_on_shutdown_4' of https://github.com/psarna/scylla: service: remove unused stop_hints_manager storage_proxy: add drain_on_shutdown implementation main: register storage proxy as lifecycle subscriber storage_proxy: add endpoint_lifecycle_subscriber interface storage_proxy: register view update handlers for view write type storage_proxy: add intrusive list of view write handlers storage_proxy: add view_update_write_response_handler
" This series allows canceling view update requests when a node is discovered DOWN. View updates are sent in the background with long timeout (5 minutes), and in case we discover that the node is unavailable, there's no point in waiting that long for the request to finish. What's more, waiting for these requests occurs on shutdown, which may result in waiting 5 minutes until Scylla properly shuts down, which is bad for both users and dtests. This series implements storage_proxy as a lifecycle subscriber, so it can react to membership changes. It also keeps track of all "interruptible" writes per endpoint, so once a node is detected as DOWN, an artificial timeout can be triggered for all aforementioned write requests. Fixes #3826 Fixes #3966 Fixes #4028 " * 'write_hints_for_view_updates_on_shutdown_4' of https://github.com/psarna/scylla: service: remove unused stop_hints_manager storage_proxy: add drain_on_shutdown implementation main: register storage proxy as lifecycle subscriber storage_proxy: add endpoint_lifecycle_subscriber interface storage_proxy: register view update handlers for view write type storage_proxy: add intrusive list of view write handlers storage_proxy: add view_update_write_response_handler
" This series allows canceling view update requests when a node is discovered DOWN. View updates are sent in the background with long timeout (5 minutes), and in case we discover that the node is unavailable, there's no point in waiting that long for the request to finish. What's more, waiting for these requests occurs on shutdown, which may result in waiting 5 minutes until Scylla properly shuts down, which is bad for both users and dtests. This series implements storage_proxy as a lifecycle subscriber, so it can react to membership changes. It also keeps track of all "interruptible" writes per endpoint, so once a node is detected as DOWN, an artificial timeout can be triggered for all aforementioned write requests. Fixes #3826 Fixes #3966 Fixes #4028 " * 'write_hints_for_view_updates_on_shutdown_4' of https://github.com/psarna/scylla: service: remove unused stop_hints_manager storage_proxy: add drain_on_shutdown implementation main: register storage proxy as lifecycle subscriber storage_proxy: add endpoint_lifecycle_subscriber interface storage_proxy: register view update handlers for view write type storage_proxy: add intrusive list of view write handlers storage_proxy: add view_update_write_response_handler (cherry picked from commit 2718c90)
… Piotr" This series introduces regression in dtests materialized_views_test/TestMaterializedViews/interrupt_build_process_with_resharding_*. This reverts commit b2227c7. Ref #3826 Ref #3966 Ref #4028 Signed-off-by: Shlomi Livne <shlomi@scylladb.com> Message-Id: <a3aea137bfde956241acc6d57e1c387a8202486c.1554116404.git.shlomi@scylladb.com>
Some operations are executed with no timeout (or equivalently, a minutes-scale timeout). These are mostly internal operations for which we've already done work that doesn't benefit from being thrown away due to a timeout, such as sending view updates.
We should add support for cancelling such operations when they are no longer relevant, for example if the failure detector marks a view replica as unavailable while there are pending view updates.
We can leverage the
seastar::abort_source
to implement such cancellation. In the particular case of view updates, a possible solution would be to associate aseastar::abort_source
with thegms::init_address
of the view replica we're updating, and request cancellation from aservice::endpoint_lifecycle_subscriber
when view replica is marked as unavailable.The text was updated successfully, but these errors were encountered: