Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to cancel long-standing operations #3826

Closed
duarten opened this issue Oct 8, 2018 · 8 comments
Closed

Add ability to cancel long-standing operations #3826

duarten opened this issue Oct 8, 2018 · 8 comments
Labels
area/coordination Changes related to the coordination layer (i.e., storage_proxy). area/materialized views feature/enhancement symptom/performance Issues causing performance problems
Milestone

Comments

@duarten
Copy link
Contributor

duarten commented Oct 8, 2018

Some operations are executed with no timeout (or equivalently, a minutes-scale timeout). These are mostly internal operations for which we've already done work that doesn't benefit from being thrown away due to a timeout, such as sending view updates.

We should add support for cancelling such operations when they are no longer relevant, for example if the failure detector marks a view replica as unavailable while there are pending view updates.

We can leverage the seastar::abort_source to implement such cancellation. In the particular case of view updates, a possible solution would be to associate a seastar::abort_source with the gms::init_address of the view replica we're updating, and request cancellation from a service::endpoint_lifecycle_subscriber when view replica is marked as unavailable.

@duarten duarten added feature/enhancement symptom/performance Issues causing performance problems area/materialized views area/coordination Changes related to the coordination layer (i.e., storage_proxy). labels Oct 8, 2018
@tzach
Copy link
Contributor

tzach commented Oct 11, 2018

@duarten does #2159 counts as such operation?

@tzach tzach added this to the 3.x milestone Oct 11, 2018
@duarten
Copy link
Contributor Author

duarten commented Oct 11, 2018

More or less. This is about individual, internal operations, whereas that issue is about cancelling a higher-level operation, composed of many streaming requests. Right now we can't cancel those individual streaming requests properly (I think), but using the using the mechanisms proposed they would be.

duarten added a commit to duarten/scylla that referenced this issue Oct 19, 2018
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.

Refs scylladb#2538
Refs scylladb#3826

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
duarten added a commit to duarten/scylla that referenced this issue Oct 20, 2018
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.

Refs scylladb#2538
Refs scylladb#3826

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
duarten added a commit to duarten/scylla that referenced this issue Dec 3, 2018
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.

Refs scylladb#2538
Refs scylladb#3826

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
duarten added a commit to duarten/scylla that referenced this issue Dec 5, 2018
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.

Refs scylladb#2538
Refs scylladb#3826

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
@duarten
Copy link
Contributor Author

duarten commented Dec 6, 2018

Another example is a view replica being restarted. Gossip can detect this through the change in generation. We should also be able to detect this, and cancel the relevant operations (i.e., pending view updates).

@duarten
Copy link
Contributor Author

duarten commented Dec 6, 2018

This issue is particularly important for view updates, which have a high timeout. Fixing it helps with memory consumption and decreases shutdown time in some cases. We should give it higher priority.

duarten added a commit to duarten/scylla that referenced this issue Dec 13, 2018
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.

This value should be increases further, but that change depends on

Refs scylladb#2538
Refs scylladb#3826

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
duarten added a commit to duarten/scylla that referenced this issue Dec 13, 2018
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.

This value should be increases further, but that change depends on

Refs scylladb#2538
Refs scylladb#3826

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
duarten added a commit to duarten/scylla that referenced this issue Dec 16, 2018
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.

This value should be increases further, but that change depends on

Refs scylladb#2538
Refs scylladb#3826

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
duarten added a commit to duarten/scylla that referenced this issue Dec 17, 2018
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.

This value should be increases further, but that change depends on

Refs scylladb#2538
Refs scylladb#3826

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
duarten added a commit to duarten/scylla that referenced this issue Dec 17, 2018
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.

This value should be increases further, but that change depends on

Refs scylladb#2538
Refs scylladb#3826

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
duarten added a commit to duarten/scylla that referenced this issue Dec 17, 2018
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.

This value should be increases further, but that change depends on

Refs scylladb#2538
Refs scylladb#3826

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
duarten added a commit to duarten/scylla that referenced this issue Dec 19, 2018
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.

This value should be increases further, but that change depends on

Refs scylladb#2538
Refs scylladb#3826

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
@nyh
Copy link
Contributor

nyh commented Dec 25, 2018

The three dtests mentioned in #4019 are a good example of why this feature is needed, and can be used to test it. We want the view-write timeout to be as high as possible, so as to allow long view-update backlogs, but because RPC currently may wait this entire duration even if the remote node goes down, a shutdown can hang for this timeout while it is waiting for all pending view writes, or an ongoing view-build step, to complete. Since the tests mentioned in #4019 only wait 2 minutes for the shutdown, they can't currently work with a timeout in storage_proxy::send_to_endpoint() higher than 2 minutes. But with the approach proposed here, we could allow a much longer timeout with the addition that if the remote node goes down (even temporarily), we immediately stop the rpc attempt and instead write a hint.

This feature (stopping instead of waiting for the full timeout) should be optional, in case somebody really needs the current behavior of continuing to wait even if the remote node goes down. But before we do that, we should verify that this waiting even make sense - is the RPC retried after the dead node goes back up? If not, continuing to wait is pointless.

@duarten
Copy link
Contributor Author

duarten commented Dec 25, 2018

We don't need this for #4019, we need #3966.

@nyh
Copy link
Contributor

nyh commented Dec 25, 2018

@duarten, I'm not sure what concrete proposal #3966 is making. According to @gleb-cloudius (if I understood what he said correctly...) the write that a view build is hung on in #4019 (a write which started when we didn't know the node went down) will never stop until the timeout. So we need a way to tell a write that has already started that it needs to give up the network send immediately. This is this issue, no? Then, indeed, once such a write it stops, it should indeed write to a hint as #3966 suggests, but I assumed this will happen automatically anyway - because when the network send fails, the CL=ANY should try to write a hint.

@duarten
Copy link
Contributor Author

duarten commented Dec 25, 2018

#3966 is about completing a shutdown in a timely fashion, regardless of nodes being down or just being slow.

We can detect that there is a shutdown in progress without enlisting the gossiper. All we need to do in the storage_proxy is to complete any outstanding writes with an exception, which for the case of view writes (but not only), will result in a hint being stored.

For this, it doesn't matter whether a node is down or just being slow: we can start a shutdown timer (of, say, 1 minute), after which we'll write pending writes as hints.

psarna added a commit to psarna/scylla that referenced this issue Jan 31, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#3966
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 1, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 1, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 4, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 4, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 4, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 4, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 4, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 4, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 4, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 4, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 4, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 4, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 4, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 5, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 5, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 5, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Refs scylladb#3826
Refs scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Feb 6, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
psarna added a commit to psarna/scylla that referenced this issue Mar 6, 2019
In order to be able to act when node joins/leaves, storage proxy
is registered as an endpoint lifecycle subscriber.

Fixes scylladb#3826
Fixes scylladb#4028
duarten added a commit that referenced this issue Mar 6, 2019
"
This series allows canceling view update requests when a node is
discovered DOWN. View updates are sent in the background with long
timeout (5 minutes), and in case we discover that the node is
unavailable, there's no point in waiting that long for the request
to finish. What's more, waiting for these requests occurs on shutdown,
which may result in waiting 5 minutes until Scylla properly shuts down,
which is bad for both users and dtests.

This series implements storage_proxy as a lifecycle subscriber,
so it can react to membership changes. It also keeps track of all
"interruptible" writes per endpoint, so once a node is detected as DOWN,
an artificial timeout can be triggered for all aforementioned write
requests.

Fixes #3826
Fixes #3966
Fixes #4028
"

* 'write_hints_for_view_updates_on_shutdown_3' of https://github.com/psarna/scylla:
  service: remove unused stop_hints_manager
  storage_proxy: add drain_on_shutdown implementation
  main: register storage proxy as lifecycle subscriber
  storage_proxy: add endpoint_lifecycle_subscriber interface
  storage_proxy: register view update handlers for view write type
  storage_proxy: add intrusive list of view write handlers
  storage_proxy: add view_update_write_response_handler
duarten added a commit that referenced this issue Mar 7, 2019
"
This series allows canceling view update requests when a node is
discovered DOWN. View updates are sent in the background with long
timeout (5 minutes), and in case we discover that the node is
unavailable, there's no point in waiting that long for the request
to finish. What's more, waiting for these requests occurs on shutdown,
which may result in waiting 5 minutes until Scylla properly shuts down,
which is bad for both users and dtests.

This series implements storage_proxy as a lifecycle subscriber,
so it can react to membership changes. It also keeps track of all
"interruptible" writes per endpoint, so once a node is detected as DOWN,
an artificial timeout can be triggered for all aforementioned write
requests.

Fixes #3826
Fixes #3966
Fixes #4028
"

* 'write_hints_for_view_updates_on_shutdown_4' of https://github.com/psarna/scylla:
  service: remove unused stop_hints_manager
  storage_proxy: add drain_on_shutdown implementation
  main: register storage proxy as lifecycle subscriber
  storage_proxy: add endpoint_lifecycle_subscriber interface
  storage_proxy: register view update handlers for view write type
  storage_proxy: add intrusive list of view write handlers
  storage_proxy: add view_update_write_response_handler
duarten added a commit that referenced this issue Mar 8, 2019
"
This series allows canceling view update requests when a node is
discovered DOWN. View updates are sent in the background with long
timeout (5 minutes), and in case we discover that the node is
unavailable, there's no point in waiting that long for the request
to finish. What's more, waiting for these requests occurs on shutdown,
which may result in waiting 5 minutes until Scylla properly shuts down,
which is bad for both users and dtests.

This series implements storage_proxy as a lifecycle subscriber,
so it can react to membership changes. It also keeps track of all
"interruptible" writes per endpoint, so once a node is detected as DOWN,
an artificial timeout can be triggered for all aforementioned write
requests.

Fixes #3826
Fixes #3966
Fixes #4028
"

* 'write_hints_for_view_updates_on_shutdown_4' of https://github.com/psarna/scylla:
  service: remove unused stop_hints_manager
  storage_proxy: add drain_on_shutdown implementation
  main: register storage proxy as lifecycle subscriber
  storage_proxy: add endpoint_lifecycle_subscriber interface
  storage_proxy: register view update handlers for view write type
  storage_proxy: add intrusive list of view write handlers
  storage_proxy: add view_update_write_response_handler
avikivity pushed a commit that referenced this issue Mar 24, 2019
"
This series allows canceling view update requests when a node is
discovered DOWN. View updates are sent in the background with long
timeout (5 minutes), and in case we discover that the node is
unavailable, there's no point in waiting that long for the request
to finish. What's more, waiting for these requests occurs on shutdown,
which may result in waiting 5 minutes until Scylla properly shuts down,
which is bad for both users and dtests.

This series implements storage_proxy as a lifecycle subscriber,
so it can react to membership changes. It also keeps track of all
"interruptible" writes per endpoint, so once a node is detected as DOWN,
an artificial timeout can be triggered for all aforementioned write
requests.

Fixes #3826
Fixes #3966
Fixes #4028
"

* 'write_hints_for_view_updates_on_shutdown_4' of https://github.com/psarna/scylla:
  service: remove unused stop_hints_manager
  storage_proxy: add drain_on_shutdown implementation
  main: register storage proxy as lifecycle subscriber
  storage_proxy: add endpoint_lifecycle_subscriber interface
  storage_proxy: register view update handlers for view write type
  storage_proxy: add intrusive list of view write handlers
  storage_proxy: add view_update_write_response_handler

(cherry picked from commit 2718c90)
avikivity pushed a commit that referenced this issue Apr 1, 2019
… Piotr"

This series introduces regression in dtests
materialized_views_test/TestMaterializedViews/interrupt_build_process_with_resharding_*.

This reverts commit b2227c7.

Ref #3826
Ref #3966
Ref #4028

Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <a3aea137bfde956241acc6d57e1c387a8202486c.1554116404.git.shlomi@scylladb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/coordination Changes related to the coordination layer (i.e., storage_proxy). area/materialized views feature/enhancement symptom/performance Issues causing performance problems
Projects
None yet
Development

No branches or pull requests

3 participants