
Topology tests: node bootstrap sometimes fails due to Operation timed out for system_distributed_everywhere.cdc_generation_descriptions_v2 #13302

Closed
kbr-scylla opened this issue Mar 23, 2023 · 4 comments · Fixed by #13317
Labels: area/cdc, area/test, symptom/ci stability, tests/test.py

@kbr-scylla (Contributor)

This happens rarely in CI builds / next promotions and is very hard to reproduce, but I think I found the cause.

When a node is booting, it tries to write the description of its new CDC generation to system_distributed_everywhere.cdc_generation_descriptions_v2. The write is sent to all existing nodes and uses CL=ALL.

Sometimes in topology tests, when we create a cluster and boot the second node, the write sent by the second node to the first node is lost: the mutation RPC call never arrives at the first node. Because it's a one-way RPC, storage_proxy is not aware of the loss; it waits asynchronously for the response and times out after 60s, which is the hardcoded timeout for this write.
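
To make the failure mode concrete, here is a minimal sketch in plain C++ (illustrative only, not Scylla's actual storage_proxy code; the timeout is shortened so the example terminates quickly): the sender fires the mutation as a one-way message and separately waits on a response future, so if the connection is torn down right after the send, nothing ever fulfills that future and the wait can only end in a timeout.

// Sketch only: "one-way send, then wait for the response with a timeout".
#include <chrono>
#include <future>
#include <iostream>

int main() {
    std::promise<void> response;    // fulfilled when the peer's confirmation arrives
    std::future<void> done = response.get_future();

    // send_one_way(mutation);     // fire-and-forget: send errors never surface here
    // Connection dropped right after the send: the peer never received the
    // mutation, so nothing will ever call response.set_value().

    if (done.wait_for(std::chrono::seconds(1)) == std::future_status::timeout) {
        std::cout << "Operation timed out\n";   // what the booting node reports after 60s
    }
}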

After some investigation, I'm pretty sure that this is caused by the second node dropping connections to the first node in this code:

future<> storage_service::notify_joined(inet_address endpoint) {
...
    co_await container().invoke_on_all([endpoint] (auto&& ss) {
        // Drops all RPC connections to `endpoint`, including those for verbs
        // that ignore topology changes; new clients are created on next use.
        ss._messaging.local().remove_rpc_client_with_ignored_topology(netw::msg_addr{endpoint, 0});
        // Notify lifecycle subscribers that `endpoint` has joined.
        return ss._lifecycle_notifier.notify_joined(endpoint);
    });
...
}

I believe this is the cause: after putting a log message there, I observed in the failed runs that the message is printed in the same millisecond in which the mutation write RPC is sent.

This notify_joined function is called when a node observes that another node is in NORMAL state. In this case, the joining node observes that the first node is indeed NORMAL: the gossiper calls storage_service::handle_state_normal, which performs some operations and then calls storage_service::notify_joined. This races with the bootstrap procedure and sometimes happens exactly at the moment of the CDC data write, causing the write to be lost.
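
The race can be pictured with a toy sketch in plain C++ (illustrative only, not Scylla code; the two threads stand in for the bootstrap fiber and the gossip handler): whether the write survives depends entirely on which thread touches the shared "connection" first.

// Sketch only: the bootstrap write racing against the connection drop.
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<bool> connection_open{true};
std::atomic<bool> write_delivered{false};

int main() {
    std::thread bootstrap([] {
        // The joining node sends the CDC generation write over the connection.
        if (connection_open.load()) {
            write_delivered.store(true);    // the mutation reached the peer
        }                                   // else: the one-way RPC is silently lost
    });
    std::thread gossip([] {
        // handle_state_normal() -> notify_joined(): drop the connection so it
        // can be re-established with the correct settings.
        connection_open.store(false);
        connection_open.store(true);        // reconnect; too late for a lost write
    });
    bootstrap.join();
    gossip.join();
    std::cout << (write_delivered ? "write delivered\n"
                                  : "write lost; sender will time out\n");
}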

This doesn't happen in production clusters, where a lot of sleeps are included in the boot procedure, so by the time we make the CDC write, handle_state_normal has long since been called and finished. The problem is test-specific: the tests set skip_wait_for_gossip_to_settle and ring_delay: 0, which removes those sleeps.

I'd say that this is another example of the larger umbrella issue: #12015.

@kbr-scylla added the symptom/ci stability, area/test, area/cdc, and tests/test.py labels on Mar 23, 2023
@kbr-scylla (Contributor, Author)

This patch looks like it should have fixed the problem, at least from the description:
5363616

But it seems that the wait_for_normal_state_handled_on_boot call happens too late: it is made in run_bootstrap_ops, which runs after the CDC write is done.

Perhaps we should move it earlier, before this line:

            cdc_gen_id = cdc_gen_service.make_new_generation(bootstrap_tokens, !is_first_node()).get0();
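
A sketch of that reordering (abbreviated and illustrative; get_nodes_to_sync_with is the helper the eventual fix extracts, and its signature here is assumed):

// Sketch only: wait for handle_state_normal to finish for the existing nodes
// *before* creating the generation, so notify_joined cannot drop connections
// while the CDC write is in flight.
auto sync_nodes = get_nodes_to_sync_with(/* existing NORMAL nodes */);
wait_for_normal_state_handled_on_boot(sync_nodes).get();

cdc_gen_id = cdc_gen_service.make_new_generation(bootstrap_tokens, !is_first_node()).get0();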

cc @asias

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Mar 24, 2023
storage_service: wait for normal state handlers earlier in the boot procedure

The `wait_for_normal_state_handled_on_boot` function waits until
`handle_state_normal` finishes for the given set of nodes. It was used
in `run_bootstrap_ops` and `run_replace_ops` to wait until NORMAL states
of existing nodes in the cluster are processed by the joining node
before continuing the joining process. One reason to do this is that at
the end of `handle_state_normal` the joining node might drop connections
to the NORMAL nodes in order to re-establish them with the correct
encryption settings. In tests we observed that the connection
drop was happening in the middle of repair/streaming, causing
repair/streaming to abort.

Unfortunately, calling `wait_for_normal_state_handled_on_boot` in
`run_bootstrap_ops`/`run_replace_ops` is too late to fix all problems.
Before either of these two functions runs, we create a new CDC generation and
write the data to `system_distributed_everywhere.cdc_generation_descriptions_v2`.
In tests, the connections were sometimes dropped while this write was
in flight. This would cause the write to never arrive at the other nodes,
and the joining node would time out waiting for confirmations.

To fix this, call `wait_for_normal_state_handled_on_boot` earlier in the
boot procedure, before the `make_new_generation` call, which does the write.

Fixes: scylladb#13302
@kostja (Contributor)

kostja commented Mar 26, 2023

I don't think it's test-specific. The tests show that a lot of bootstrap code is not prepared for RPC errors. The bootstrap simply fails in this case; moreover, the reason is hard to nail down. An external user will never figure out from the logs that this is a flaky failure, and won't know how to recover from it.
Bootstrap code should tolerate RPC errors. This includes both the gossiper and the internal writes to the system_distributed tables. Writing it off as a testing issue won't cut it.

Besides, going forward, topology testing will involve a lot of concurrent node operations. How are you planning to make these tests reliable unless the problem is fixed in the Scylla code rather than in the testing code?
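
One shape such tolerance could take, sketched as a generic pattern in plain C++ (not Scylla code; the attempt callable, error type, and parameters are placeholders): bound each internal write with its own timeout and retry on failure, instead of waiting once for 60s and aborting the bootstrap.

// Sketch only: generic bounded-retry wrapper for an RPC-style operation.
#include <chrono>
#include <stdexcept>
#include <thread>

template <typename Fn>
void retry_rpc(Fn&& attempt, int max_attempts, std::chrono::milliseconds backoff) {
    for (int i = 1; i <= max_attempts; ++i) {
        try {
            attempt();    // e.g. re-send the write and wait (briefly) for confirmation
            return;       // success
        } catch (const std::runtime_error&) {   // e.g. a timeout or a dropped connection
            if (i == max_attempts) {
                throw;    // give up and surface a diagnosable error
            }
            std::this_thread::sleep_for(backoff);   // back off, then retry
        }
    }
}

// Hypothetical usage: retry_rpc([&] { write_cdc_generation_data(); }, 3,
//                               std::chrono::milliseconds(500));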

@kostja (Contributor)

kostja commented Mar 26, 2023

@avikivity please provide your input

@kbr-scylla (Contributor, Author)

> An external user will never figure out from the logs that this is a flaky failure, and won't know how to recover from it.

Note that this will not happen in production because of the ring_delay / wait_for_gossip_to_settle sleeps.

@DoronArazii added this to the 5.3 milestone on Mar 27, 2023
tgrabiec added a commit that referenced this issue Mar 27, 2023
Merge 'storage_service: wait for normal state handlers earlier in the boot procedure' from Kamil Braun


Fixes: #13302

Closes #13317

* github.com:scylladb/scylladb:
  storage_service: wait for normal state handlers earlier in the boot procedure
  storage_service: bootstrap: wait for normal tokens to arrive in all cases
  storage_service: extract get_nodes_to_sync_with helper
  storage_service: return unordered_set from get_ignore_dead_nodes_for_replace