Topology tests: node bootstrap sometimes fails due to Operation timed out for system_distributed_everywhere.cdc_generation_descriptions_v2
#13302
This patch looks like it should have fixed the problem, at least from the description. But it seems that the… Perhaps we should move it earlier, before this line:

`cdc_gen_id = cdc_gen_service.make_new_generation(bootstrap_tokens, !is_first_node()).get0();`

cc @asias
storage_service: wait for normal state handlers earlier in the boot procedure

The `wait_for_normal_state_handled_on_boot` function waits until `handle_state_normal` finishes for the given set of nodes. It was used in `run_bootstrap_ops` and `run_replace_ops` to wait until NORMAL states of existing nodes in the cluster are processed by the joining node before continuing the joining process. One reason to do it is that at the end of `handle_state_normal` the joining node might drop connections to the NORMAL nodes in order to reestablish new connections using correct encryption settings. In tests we observed that the connection drop was happening in the middle of repair/streaming, causing repair/streaming to abort.

Unfortunately, calling `wait_for_normal_state_handled_on_boot` in `run_bootstrap_ops`/`run_replace_ops` is too late to fix all problems. Before either of these two functions, we create a new CDC generation and write the data to `system_distributed_everywhere.cdc_generation_descriptions_v2`. In tests, the connections were sometimes dropped while this write was in flight. This would cause the write to never arrive at the other nodes, and the joining node would time out waiting for confirmations.

To fix this, call `wait_for_normal_state_handled_on_boot` earlier in the boot procedure, before the `make_new_generation` call which performs the write.

Fixes: scylladb#13302
I don't think it's test specific. The tests show that a lot of bootstrap code is not prepared for RPC errors. The bootstrap simply fails in this case; moreover, the reason is hard to nail down. An external user will never figure out from the logs that this is a flaky failure, and won't know how to recover from it. Besides, going forward, for topology testing we will have a lot of concurrent node operations going on. How are you planning to make these tests reliable unless the problem is fixed in Scylla code, not in the testing code?
@avikivity please provide your input |
Note that this will not happen in production because of the `ring_delay` / `wait_for_gossip_to_settle` sleeps.
… boot procedure' from Kamil Braun

Fixes: #13302
Closes #13317

* github.com:scylladb/scylladb:
  * storage_service: wait for normal state handlers earlier in the boot procedure
  * storage_service: bootstrap: wait for normal tokens to arrive in all cases
  * storage_service: extract get_nodes_to_sync_with helper
  * storage_service: return unordered_set from get_ignore_dead_nodes_for_replace
This happens rarely in CI builds / next promotions and is very hard to reproduce, but I think I found the cause.

When a node is booting, it tries to write some data to `system_distributed_everywhere.cdc_generation_descriptions_v2`. The write is sent to all existing nodes and uses CL=ALL.

Sometimes in topology tests, when we create a cluster and boot the second node, the write sent by the second node to the first node is lost: the mutation RPC call never arrives at the first node. Because it's a one-way RPC, `storage_proxy` is not "aware" of this and waits asynchronously for the response, then times out after 60s (which is the hardcoded timeout for this write).

After some investigation, I'm pretty sure that this is caused by the second node dropping connections to the first node in this code:

I think this is the cause because after putting a log message there, I observed in the failed runs that the message is printed in the same millisecond that the mutation write RPC is sent.
This `notify_joined` function is called when a node observes that another node is in NORMAL state. In this case, the joining node observes that the first node is indeed NORMAL: `gossiper` calls `storage_service::handle_state_normal`, which performs some operations and then calls `storage_service::notify_joined`. This races with the bootstrap procedure, and sometimes happens exactly at the moment of writing the CDC data, causing the write to be lost.

This doesn't happen in production clusters, where a lot of sleeps are included in the boot procedure, so by the time we make the CDC write, `handle_state_normal` had been called and finished a "long time ago". The problem is test specific: we set `skip_wait_for_gossip_to_settle` and `ring_delay: 0`.

I'd say that this is another example of the larger umbrella issue: #12015.