
Topology tests: node bootstrap sometimes fails due to Operation timed out for system_distributed_everywhere.cdc_generation_descriptions_v2 #13302

Closed
kbr-scylla opened this issue Mar 23, 2023 · 4 comments · Fixed by #13317
Labels: area/cdc, area/test, symptom/ci stability, tests/test.py

@kbr-scylla (Contributor)

This happens rarely in CI builds / next promotions and is very hard to reproduce, but I think I found the cause.

When a node is booting, it tries to write the description of its new CDC generation to system_distributed_everywhere.cdc_generation_descriptions_v2. The write is sent to all existing nodes and uses CL=ALL.

Sometimes in topology tests, when we create a cluster and boot the second node, the write sent by the second node to the first node is lost: the mutation RPC call never arrives at the first node. Because it's a one-way RPC, storage_proxy is not aware of the loss; it waits asynchronously for the response and times out after 60s, which is the hardcoded timeout for this write.
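
To make the failure mode concrete, here is a minimal sketch in plain C++ (illustrative only, not Scylla's actual storage_proxy code; the timeout is shortened so the example terminates quickly): the sender fires the mutation as a one-way message and separately waits on a response future, so if the connection is torn down right after the send, nothing ever fulfills that future and the wait can only end in a timeout.

// Sketch only: "one-way send, then wait for the response with a timeout".
#include <chrono>
#include <future>
#include <iostream>

int main() {
    std::promise<void> response;    // fulfilled when the peer's confirmation arrives
    std::future<void> done = response.get_future();

    // send_one_way(mutation);     // fire-and-forget: send errors never surface here
    // Connection dropped right after the send: the peer never received the
    // mutation, so nothing will ever call response.set_value().

    if (done.wait_for(std::chrono::seconds(1)) == std::future_status::timeout) {
        std::cout << "Operation timed out\n";   // what the booting node reports after 60s
    }
}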

After some investigation, I'm pretty sure that this is caused by the second node dropping connections to the first node in this code:

future<> storage_service::notify_joined(inet_address endpoint) {
...
    co_await container().invoke_on_all([endpoint] (auto&& ss) {
        // Drops all RPC connections to `endpoint`, including those for verbs
        // that ignore topology changes; new clients are created on next use.
        ss._messaging.local().remove_rpc_client_with_ignored_topology(netw::msg_addr{endpoint, 0});
        // Notify lifecycle subscribers that `endpoint` has joined.
        return ss._lifecycle_notifier.notify_joined(endpoint);
    });
...
}

I believe this is the cause: after putting a log message there, I observed in the failed runs that the message is printed in the same millisecond in which the mutation write RPC is sent.

This notify_joined function is called when a node observes that another node is in NORMAL state. In this case, the joining node observes that the first node is indeed NORMAL: the gossiper calls storage_service::handle_state_normal, which performs some operations and then calls storage_service::notify_joined. This races with the bootstrap procedure and sometimes happens exactly at the moment of the CDC data write, causing the write to be lost.
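
The race can be pictured with a toy sketch in plain C++ (illustrative only, not Scylla code; the two threads stand in for the bootstrap fiber and the gossip handler): whether the write survives depends entirely on which thread touches the shared "connection" first.

// Sketch only: the bootstrap write racing against the connection drop.
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<bool> connection_open{true};
std::atomic<bool> write_delivered{false};

int main() {
    std::thread bootstrap([] {
        // The joining node sends the CDC generation write over the connection.
        if (connection_open.load()) {
            write_delivered.store(true);    // the mutation reached the peer
        }                                   // else: the one-way RPC is silently lost
    });
    std::thread gossip([] {
        // handle_state_normal() -> notify_joined(): drop the connection so it
        // can be re-established with the correct settings.
        connection_open.store(false);
        connection_open.store(true);        // reconnect; too late for a lost write
    });
    bootstrap.join();
    gossip.join();
    std::cout << (write_delivered ? "write delivered\n"
                                  : "write lost; sender will time out\n");
}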

This doesn't happen in production clusters, where a lot of sleeps are included in the boot procedure, so by the time we make the CDC write, handle_state_normal has long since been called and finished. The problem is test-specific: the tests set skip_wait_for_gossip_to_settle and ring_delay: 0, which removes those sleeps.

I'd say that this is another example of the larger umbrella issue: #12015.

@kbr-scylla added the symptom/ci stability, area/test, area/cdc, and tests/test.py labels on Mar 23, 2023
@kbr-scylla (Contributor, Author)

This patch looks like it should have fixed the problem, at least from the description:
5363616

But it seems that the wait_for_normal_state_handled_on_boot call happens too late: it is made in run_bootstrap_ops, which runs after the CDC write is done.

Perhaps we should move it earlier, before this line:

            cdc_gen_id = cdc_gen_service.make_new_generation(bootstrap_tokens, !is_first_node()).get0();
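
A sketch of that reordering (abbreviated and illustrative; get_nodes_to_sync_with is the helper the eventual fix extracts, and its signature here is assumed):

// Sketch only: wait for handle_state_normal to finish for the existing nodes
// *before* creating the generation, so notify_joined cannot drop connections
// while the CDC write is in flight.
auto sync_nodes = get_nodes_to_sync_with(/* existing NORMAL nodes */);
wait_for_normal_state_handled_on_boot(sync_nodes).get();

cdc_gen_id = cdc_gen_service.make_new_generation(bootstrap_tokens, !is_first_node()).get0();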

cc @asias

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Mar 24, 2023
storage_service: wait for normal state handlers earlier in the boot procedure

The `wait_for_normal_state_handled_on_boot` function waits until
`handle_state_normal` finishes for the given set of nodes. It was used
in `run_bootstrap_ops` and `run_replace_ops` to wait until NORMAL states
of existing nodes in the cluster are processed by the joining node
before continuing the joining process. One reason to do this is that at
the end of `handle_state_normal` the joining node might drop connections
to the NORMAL nodes in order to re-establish them with the correct
encryption settings. In tests we observed that the connection
drop was happening in the middle of repair/streaming, causing
repair/streaming to abort.

Unfortunately, calling `wait_for_normal_state_handled_on_boot` in
`run_bootstrap_ops`/`run_replace_ops` is too late to fix all problems.
Before either of these two functions runs, we create a new CDC generation and
write the data to `system_distributed_everywhere.cdc_generation_descriptions_v2`.
In tests, the connections were sometimes dropped while this write was
in flight. This would cause the write to never arrive at the other nodes,
and the joining node would time out waiting for confirmations.

To fix this, call `wait_for_normal_state_handled_on_boot` earlier in the
boot procedure, before the `make_new_generation` call, which does the write.

Fixes: scylladb#13302
@kostja (Contributor)

kostja commented Mar 26, 2023

I don't think it's test-specific. The tests show that a lot of bootstrap code is not prepared for RPC errors. The bootstrap simply fails in this case; moreover, the reason is hard to nail down. An external user will never figure out from the logs that this is a flaky failure, and won't know how to recover from it.
Bootstrap code should tolerate RPC errors. This includes both the gossiper and the internal writes to the system_distributed tables. Writing it off as a testing issue won't cut it.

Besides, going forward, topology testing will involve a lot of concurrent node operations. How are you planning to make these tests reliable unless the problem is fixed in the Scylla code rather than in the testing code?
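
One shape such tolerance could take, sketched as a generic pattern in plain C++ (not Scylla code; the attempt callable, error type, and parameters are placeholders): bound each internal write with its own timeout and retry on failure, instead of waiting once for 60s and aborting the bootstrap.

// Sketch only: generic bounded-retry wrapper for an RPC-style operation.
#include <chrono>
#include <stdexcept>
#include <thread>

template <typename Fn>
void retry_rpc(Fn&& attempt, int max_attempts, std::chrono::milliseconds backoff) {
    for (int i = 1; i <= max_attempts; ++i) {
        try {
            attempt();    // e.g. re-send the write and wait (briefly) for confirmation
            return;       // success
        } catch (const std::runtime_error&) {   // e.g. a timeout or a dropped connection
            if (i == max_attempts) {
                throw;    // give up and surface a diagnosable error
            }
            std::this_thread::sleep_for(backoff);   // back off, then retry
        }
    }
}

// Hypothetical usage: retry_rpc([&] { write_cdc_generation_data(); }, 3,
//                               std::chrono::milliseconds(500));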

@kostja (Contributor)

kostja commented Mar 26, 2023

@avikivity please provide your input

@kbr-scylla (Contributor, Author)

> An external user will never figure out from the logs that this is a flaky failure, and won't know how to recover from it.

Note that this will not happen in production because of the ring_delay / wait_for_gossip_to_settle sleeps.

@DoronArazii added this to the 5.3 milestone on Mar 27, 2023
tgrabiec added a commit that referenced this issue Mar 27, 2023
Merge 'storage_service: wait for normal state handlers earlier in the boot procedure' from Kamil Braun


Fixes: #13302

Closes #13317

* github.com:scylladb/scylladb:
  storage_service: wait for normal state handlers earlier in the boot procedure
  storage_service: bootstrap: wait for normal tokens to arrive in all cases
  storage_service: extract get_nodes_to_sync_with helper
  storage_service: return unordered_set from get_ignore_dead_nodes_for_replace