New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cql_tracing_test.TestCqlTracing.tracing_startup_test hangs on shutdown in dtest debug #8079
Comments
Looks like we hit it here: https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/1613471599651_nodetool_additional_test.TestNodetool.concurrent_restart_test/node1.log
(forced) core dump of scylla here: https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/1613471599651_nodetool_additional_test.TestNodetool.concurrent_restart_test/node1-scylla.28002.1613471598.core.gz |
One of hints managers didn't stop. The hint-related tasks (obtained with
|
I see |
Here's the wait sequence that fits the list of tasks:
the |
@elcallio can you please have a look |
from Calle
|
This wasn't hit since added53 so let's close the issue and consider reopening if hit again. |
I saw a similar stop hung in https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-cdc-100gb-4h-test/207/
This version doesn't contain the fix (added53) Note: hung for ~15 mins and reached the systemd stop timeout of scylla-server
|
We're also hitting I'm reopening this issue. |
@DoronArazii I suggest to remove the master/high label as this issue is hit rarely and is only blocking the shutdown path. |
I'm gonna try investigating the hangs as part of the quality effort. |
Reproducer: This is what happens:
This is pretty much what happens in the CI runs, except that the write is done internally to
unfortunately the hint write uses the crazy 5 minutes timeout from possible solutions?
|
Cancelling hint writes in shutdown sounds like a good idea to me but how difficult is it ? Can we just abandon them, without cancelling ? |
I don't see any other way than abandoning them - cancelling must mean abandoning in some cases since we may request cancel after the message is out, so we just cancel the waiting for response part.
|
Please see eb79e45 and the following patches in https://github.com/scylladb/scylladb/pull/12296/commits for a useful mechanism for canceling outstanding messages that also have a timeout |
The `view_update_write_response_handler` class, which is a subclass of `abstract_write_response_handler`, was created for a single purpose: to make it possible to 'retire' a handler for a view update write, which means we stop waiting for a response to the write, timing out the handler immediately. This was done to solve issue with node shutdown hanging because it was waiting for a view update to finish; view updates were configured with 5 minute timeout. See scylladb#3966, scylladb#4028. Now we're having a similar problem with hint updates causing shutdown to hang in tests (scylladb#8079). `view_update_write_response_handler` implements retiring by adding itself to an intrusive list which we then iterate over to timeout each handler when we shutdown or when gossiper notifies `storage_proxy` that a node is down. To make it possible to reuse this algorithm for other handlers, move the functionality into `abstract_write_response_handler`. We inherit from `bi::list_base_hook` so it introduces small memory overhead to each write handler (2 pointers) which was only present for view update handlers before. But those handlers are already quite large, the overhead is small compared to their size. Not all handlers are added to the retiring list, this is controlled by the `retirable` parameter passed to the constructor. For now we're only retiring view handlers as before. In following commits we'll also retire hint handlers.
Whether a write handler should be retirable is now controlled by a parameter passed to `create_write_response_handler`. We plumb it down from `send_to_endpoint` which is called by hints manager. This will cause hint write handlers to immediately timeout when we shutdown or when a destination node is marked as dead. Fixes scylladb#8079
…r types The `view_update_write_response_handler` class, which is a subclass of `abstract_write_response_handler`, was created for a single purpose: to make it possible to 'retire' a handler for a view update write, which means we stop waiting for a response to the write, timing out the handler immediately. This was done to solve issue with node shutdown hanging because it was waiting for a view update to finish; view updates were configured with 5 minute timeout. See scylladb#3966, scylladb#4028. Now we're having a similar problem with hint updates causing shutdown to hang in tests (scylladb#8079). `view_update_write_response_handler` implements retiring by adding itself to an intrusive list which we then iterate over to timeout each handler when we shutdown or when gossiper notifies `storage_proxy` that a node is down. To make it possible to reuse this algorithm for other handlers, move the functionality into `abstract_write_response_handler`. We inherit from `bi::list_base_hook` so it introduces small memory overhead to each write handler (2 pointers) which was only present for view update handlers before. But those handlers are already quite large, the overhead is small compared to their size. Not all handlers are added to the retiring list, this is controlled by the `retirable` parameter passed to the constructor. For now we're only retiring view handlers as before. In following commits we'll also retire hint handlers.
Whether a write handler should be retirable is now controlled by a parameter passed to `create_write_response_handler`. We plumb it down from `send_to_endpoint` which is called by hints manager. This will cause hint write handlers to immediately timeout when we shutdown or when a destination node is marked as dead. Fixes scylladb#8079
…types The `view_update_write_response_handler` class, which is a subclass of `abstract_write_response_handler`, was created for a single purpose: to make it possible to cancel a handler for a view update write, which means we stop waiting for a response to the write, timing out the handler immediately. This was done to solve issue with node shutdown hanging because it was waiting for a view update to finish; view updates were configured with 5 minute timeout. See scylladb#3966, scylladb#4028. Now we're having a similar problem with hint updates causing shutdown to hang in tests (scylladb#8079). `view_update_write_response_handler` implements cancelling by adding itself to an intrusive list which we then iterate over to timeout each handler when we shutdown or when gossiper notifies `storage_proxy` that a node is down. To make it possible to reuse this algorithm for other handlers, move the functionality into `abstract_write_response_handler`. We inherit from `bi::list_base_hook` so it introduces small memory overhead to each write handler (2 pointers) which was only present for view update handlers before. But those handlers are already quite large, the overhead is small compared to their size. Not all handlers are added to the cancelling list, this is controlled by the `cancellable` parameter passed to the constructor. For now we're only cancelling view handlers as before. In following commits we'll also cancel hint handlers.
…from Kamil Braun The `view_update_write_response_handler` class, which is a subclass of `abstract_write_response_handler`, was created for a single purpose: to make it possible to cancel a handler for a view update write, which means we stop waiting for a response to the write, timing out the handler immediately. This was done to solve issue with node shutdown hanging because it was waiting for a view update to finish; view updates were configured with 5 minute timeout. See #3966, #4028. Now we're having a similar problem with hint updates causing shutdown to hang in tests (#8079). `view_update_write_response_handler` implements cancelling by adding itself to an intrusive list which we then iterate over to timeout each handler when we shutdown or when gossiper notifies `storage_proxy` that a node is down. To make it possible to reuse this algorithm for other handlers, move the functionality into `abstract_write_response_handler`. We inherit from `bi::list_base_hook` so it introduces small memory overhead to each write handler (2 pointers) which was only present for view update handlers before. But those handlers are already quite large, the overhead is small compared to their size. Use this new functionality to also cancel hint write handlers when we shutdown. This fixes #8079. Closes #14047 * github.com:scylladb/scylladb: test: reproducer for hints manager shutdown hang test: pylib: ScyllaCluster: generalize config type for `server_add` test: pylib: scylla_cluster: add explicit timeout for graceful server stop service: storage_proxy: make hint write handlers cancellable service: storage_proxy: rename `view_update_handlers_list` service: storage_proxy: make it possible to cancel all write handler types
…types The `view_update_write_response_handler` class, which is a subclass of `abstract_write_response_handler`, was created for a single purpose: to make it possible to cancel a handler for a view update write, which means we stop waiting for a response to the write, timing out the handler immediately. This was done to solve issue with node shutdown hanging because it was waiting for a view update to finish; view updates were configured with 5 minute timeout. See scylladb#3966, scylladb#4028. Now we're having a similar problem with hint updates causing shutdown to hang in tests (scylladb#8079). `view_update_write_response_handler` implements cancelling by adding itself to an intrusive list which we then iterate over to timeout each handler when we shutdown or when gossiper notifies `storage_proxy` that a node is down. To make it possible to reuse this algorithm for other handlers, move the functionality into `abstract_write_response_handler`. We inherit from `bi::list_base_hook` so it introduces small memory overhead to each write handler (2 pointers) which was only present for view update handlers before. But those handlers are already quite large, the overhead is small compared to their size. Not all handlers are added to the cancelling list, this is controlled by the `cancellable` parameter passed to the constructor. For now we're only cancelling view handlers as before. In following commits we'll also cancel hint handlers.
Whether a write handler should be cancellable is now controlled by a parameter passed to `create_write_response_handler`. We plumb it down from `send_to_endpoint` which is called by hints manager. This will cause hint write handlers to immediately timeout when we shutdown or when a destination node is marked as dead. Fixes scylladb#8079
…types The `view_update_write_response_handler` class, which is a subclass of `abstract_write_response_handler`, was created for a single purpose: to make it possible to cancel a handler for a view update write, which means we stop waiting for a response to the write, timing out the handler immediately. This was done to solve issue with node shutdown hanging because it was waiting for a view update to finish; view updates were configured with 5 minute timeout. See scylladb#3966, scylladb#4028. Now we're having a similar problem with hint updates causing shutdown to hang in tests (scylladb#8079). `view_update_write_response_handler` implements cancelling by adding itself to an intrusive list which we then iterate over to timeout each handler when we shutdown or when gossiper notifies `storage_proxy` that a node is down. To make it possible to reuse this algorithm for other handlers, move the functionality into `abstract_write_response_handler`. We inherit from `bi::list_base_hook` so it introduces small memory overhead to each write handler (2 pointers) which was only present for view update handlers before. But those handlers are already quite large, the overhead is small compared to their size. Not all handlers are added to the cancelling list, this is controlled by the `cancellable` parameter passed to the constructor. For now we're only cancelling view handlers as before. In following commits we'll also cancel hint handlers.
Whether a write handler should be cancellable is now controlled by a parameter passed to `create_write_response_handler`. We plumb it down from `send_to_endpoint` which is called by hints manager. This will cause hint write handlers to immediately timeout when we shutdown or when a destination node is marked as dead. Fixes scylladb#8079
Seems like this issue mainly affects tests, so not backporting. |
Regression started at https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/721/artifact/logs-debug.2/1612888868753_cql_tracing_test.TestCqlTracing.tracing_startup_test/node2.log
Scylla version c16e4a0
Last passed at e1261d1
git log --oneline e1261d10f17c2d219f159c01ddf7fd2b122f9d6a..c16e4a04239798c41fa140c14eabb250b5422983
c16e4a0 migration_manager: Propagate schema changes with reads like we do on writes
4082f57 Merge 'Make commitlog disk limit a hard limit.' from Calle Wilund
af2d1fa Update abseil submodule
b9a5aff distributed_loader: drop execute_futures function
a05adb8 database: Remove global storage proxy reference
8490c9f transport: Remove global storage service reference
4498bb0 API: Fix aggregation in column_familiy
4be718e commitlog: Force earlier cycle/flush iff segment reserve is empty
be8c359 commitlog: Make segment allocation wait iff disk usage > max
ab55a1b commitlog: Do partial (memtable) flushing based on threshold
7c84b16 commitlog: Make flush threshold configurable
c3d9581 table: Add a flush RP mark to table, and shortcut if not above
The text was updated successfully, but these errors were encountered: