[dtest] test failed due to a node getting stuck in bootstrap #15393

Closed · Tracked by #17493
patjed41 opened this issue Sep 13, 2023 · 12 comments · Fixed by #18184
Labels: Backport candidate, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework), tests/dtest

@patjed41 (Contributor)

Build https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3602/ failed on resharding_test.TestReshardingVariants.test_resharding_by_smp_increase[4-SizeTieredCompactionStrategy-15].

Attached logs:
dtest-gw4.log
1694518082201_resharding_test.py TestReshardingVariants test_resharding_by_smp_increase[4-SizeTieredCompactionStrategy-15].zip

This test runs on a 4-node cluster. Before the test even started, the bootstrap procedure of the 3rd node got stuck during "waiting for schema information to complete":

INFO  2023-09-12 11:13:52,529 [shard 0:stre] storage_service - waiting for schema information to complete
INFO  2023-09-12 11:13:52,530 [shard 0:stre] schema_tables - Schema version changed to 4fd068ed-d970-3076-a296-1f6f23a86d17
...
INFO  2023-09-12 11:27:59,193 [shard 0:main] compaction_manager - Asked to stop
@kbr-scylla (Contributor)

Taking off the infra? label; the issue reproduces.

I posted some observations in #15911.

@kbr-scylla (Contributor) commented Nov 9, 2023

KQL query for ElasticSearch to find similar failures:

outcome: failure and failure_message : "Missing.*Starting listening for CQL clients" and failure_message : "raise TimeoutError" and not outcome : error

Looks like it's happening both with Raft enabled and with Raft disabled.

I don't understand why "waiting for schema information to complete" hangs --- according to the logs, the schema version is the same on all nodes.

I see two possible explanations:

  • whatever is reported by the logs is not true --- maybe there's a code path that changes the schema version but doesn't cause a message to be printed
  • or there's some problem with gossiping schema versions; migration_manager::have_schema_agreement (which we're checking in a loop until it returns true) needs to obtain gms::application_state::SCHEMA from other nodes, and if it's not there for some node, it will return false (see the sketch below).
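
For reference, a minimal standalone sketch of the agreement check described in the second bullet (the names and shapes here are assumptions for illustration, not the actual migration_manager implementation): agreement requires every live peer to have gossiped a SCHEMA value, and that value must equal our own.

```cpp
#include <map>
#include <optional>
#include <string>

using schema_version = std::string;

// Hypothetical stand-in for a peer's gossip state; only the SCHEMA
// application state matters for this check.
struct peer_gossip_state {
    std::optional<schema_version> schema;   // gms::application_state::SCHEMA, if gossiped
};

// True only if every live peer has gossiped a SCHEMA value equal to ours.
// A peer with no SCHEMA entry makes the check fail, just like a mismatch does.
bool have_schema_agreement(const schema_version& local_version,
                           const std::map<std::string, peer_gossip_state>& live_peers) {
    for (const auto& [addr, state] : live_peers) {
        if (!state.schema || *state.schema != local_version) {
            return false;                   // missing or mismatching SCHEMA for this peer
        }
    }
    return true;
}
```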

Also, this looks kinda similar to #16004, but there it was not the global schema digest being different, it was the per-table versions (which we don't wait to be synchronized on bootstrap).

@kbr-scylla (Contributor)

Perhaps different calls of migration_manager::passive_announce are racing with each other, and a node may end up gossiping an old version. I don't see what prevents this from happening during the feature-enabling phase on boot, where we're concurrently calling recalculate_schema_version in the background. One of these background tasks, having calculated a schema version without all features, could update the gossiper after a task that calculated a schema version with all features.

cc @tgrabiec

@kbr-scylla (Contributor)

On the other hand, maybe not?

    auto schema_change_announce = _db.local().observable_schema_version().observe([this] (table_schema_version schema_version) mutable {
        _migration_manager.local().passive_announce(std::move(schema_version));
    });
void migration_manager::passive_announce(table_schema_version version) {
    _schema_version_to_publish = version;
    (void)_schema_push.trigger().handle_exception([version = std::move(version)] (std::exception_ptr ex) {
        mlogger.warn("Passive announcing of version {} failed: {}. Ignored.", version, ex);
    });
}

Let's assume that this observe mechanism serializes the callbacks, so that the newest version will always be handled last. In that case, the last passive_announce should set _schema_version_to_publish to the latest version, and that version will be published in the end.
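
As a toy model of that assumption (hypothetical names, not the actual trigger machinery): if every announce only overwrites a single "version to publish" slot, and the publish step reads the slot when it actually runs, then whichever announce ran last determines what gets published, and earlier racing announces are harmless.

```cpp
#include <cstdio>
#include <string>
#include <utility>

// Toy stand-in for the passive_announce / _schema_push pattern quoted above.
struct schema_publisher {
    std::string version_to_publish;           // plays the role of _schema_version_to_publish

    void passive_announce(std::string v) {    // may be called many times before publishing
        version_to_publish = std::move(v);
    }

    void publish() {                          // plays the role of the deferred push work
        std::printf("publishing schema version %s\n", version_to_publish.c_str());
    }
};

int main() {
    schema_publisher p;
    p.passive_announce("version-without-all-features");  // earlier, racing announce
    p.passive_announce("version-with-all-features");     // latest announce wins
    p.publish();                                          // prints the latest version only
}
```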

But I'm not sure.

@kbr-scylla (Contributor) commented Nov 9, 2023

Maybe the race could happen inside here:

static
future<> update_schema_version_and_announce(sharded<db::system_keyspace>& sys_ks, distributed<service::storage_proxy>& proxy, schema_features features) {
    auto uuid = co_await calculate_schema_digest(proxy, features);
    co_await sys_ks.local().update_schema_version(uuid);
    co_await proxy.local().get_db().invoke_on_all([uuid] (replica::database& db) {
        db.update_version(uuid);
    });
    slogger.info("Schema version changed to {}", uuid);
}

If a task with an obsolete schema version hung before the db.update_version step, we would end up with that version eventually.
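
A toy interleaving of that hypothesis (single-threaded sketch with hypothetical values; the real code is the coroutine above): the task that computed the obsolete digest resumes after the newer one has already been applied, and overwrites it.

```cpp
#include <cstdio>
#include <string>

int main() {
    std::string in_memory_version = "initial";

    // Task 1 computed digest "old" (e.g. without all features), then got suspended
    // before reaching the db.update_version step.
    const std::string task1_uuid = "old";

    // Task 2 computed digest "new" (with all features) and ran to completion first.
    const std::string task2_uuid = "new";
    in_memory_version = task2_uuid;          // db.update_version("new")

    // Task 1 finally resumes and applies its stale digest last.
    in_memory_version = task1_uuid;          // db.update_version("old")

    std::printf("final in-memory schema version: %s\n", in_memory_version.c_str());  // "old"
}
```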

Edit: but then we'd see a "Schema version changed to" message with a different version at the end of the log... which we don't...

@kbr-scylla (Contributor)

It could be that the node is actually leaving the have_schema_agreement loop and getting stuck in update_topology_change_info trying to acquire the merge lock --- obviously, to confuse the enemy, there is no log message in between. But what would be holding the merge lock? cc @bhalevy

I feel stuck in the investigation, so I'll send a PR to improve the logging. Next time the issue reproduces in one of our CI runs, we'll have more details.

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Nov 9, 2023
We're observing nodes getting stuck during bootstrap inside
`storage_service::wait_for_ring_to_settle()`, which periodically checks
`migration_manager::have_schema_agreement()` until it becomes `true`:
scylladb#15393.

There is no obvious reason why that happens -- according to the nodes'
logs, their latest in-memory schema version is the same.

So either the gossiped schema version is for some reason different
(perhaps there is a race in publishing `application_state::SCHEMA`) or
missing entirely.

Alternatively, `wait_for_ring_to_settle` is leaving the
`have_schema_agreement` loop and getting stuck in
`update_topology_change_info` trying to acquire a lock.

Modify logging inside `have_schema_agreement` so that details about missing
schema or version mismatch are logged on INFO level, and an INFO level
message is printed before we return `true`. To prevent the logs from getting
spammed, rate-limit the periodic messages to once every 5 seconds. This
will still show the reason in our tests, which allow the node to hang for
many minutes before timing out. Also, these schema agreement checks are
done on relatively rare occasions such as bootstrap, so the additional
logs should not be harmful.

Furthermore, when publishing the schema version to gossip, log it on INFO
level. This happens at most once per schema change, so it's a rare
message. If there's a race in publishing schema versions, this should
allow us to observe it.

Ref: scylladb#15393
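
To picture the rate limiting described in the commit above, here is a hedged standalone sketch (the helper and its shape are assumptions for illustration, not Scylla's actual logger API): at most one periodic message per 5-second window, with a count of the messages dropped in between.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical helper illustrating the behaviour described in the commit message:
// emit at most one message per window and report how many similar ones were dropped.
struct rate_limited_logger {
    std::chrono::steady_clock::duration window = std::chrono::seconds(5);
    std::chrono::steady_clock::time_point last{};   // epoch start, so the first call always logs
    int dropped = 0;

    void log(const char* msg) {
        const auto now = std::chrono::steady_clock::now();
        if (now - last < window) {
            ++dropped;                               // within the window: count and drop
            return;
        }
        if (dropped > 0) {
            std::printf("(rate limiting dropped %d similar messages) ", dropped);
        }
        std::printf("%s\n", msg);
        last = now;
        dropped = 0;
    }
};

int main() {
    rate_limited_logger log;
    for (int i = 0; i < 1000; ++i) {
        log.log("Schema mismatch for 127.0.3.2");    // only the first call in the window prints
    }
}
```

This mirrors the "(rate limiting dropped N similar messages)" pattern visible in the reproduction logs later in the thread.
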
@kbr-scylla (Contributor)

I feel stuck in the investigation, so I'll send a PR to improve the logging. Next time the issue reproduces in one of our CI runs, we'll have more details.

#16021

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Nov 13, 2023
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Nov 13, 2023
denesb pushed a commit that referenced this issue Nov 14, 2023

Closes #16021
@kbr-scylla (Contributor)

Issue reproduced with more logs:
node1.log
node2.log
node3.log
node4.log

@kbr-scylla (Contributor)

node4 got stuck during boot, printing in a loop:

INFO  2023-11-26 20:02:33,901 [shard 0:strm] migration_manager - (rate limiting dropped 497 similar messages) Schema mismatch for 127.0.3.2 (5269ca7d-7b22-3edc-8795-73f8591117d9 != 30a31e52-f9e0-33e9-8241-6f648af22fbe).

The last "Schema version changed to" message on node 4 is:

INFO  2023-11-26 20:02:28,915 [shard 0:strm] schema_tables - Schema version changed to 5269ca7d-7b22-3edc-8795-73f8591117d9

which is consistent with the mismatch message above.

The last "Schema version changed to" message on node 2 (127.0.3.2) is:

INFO  2023-11-26 20:02:01,931 [shard 0:strm] schema_tables - Schema version changed to 5269ca7d-7b22-3edc-8795-73f8591117d9

which is not consistent with the mismatch message.

So why is node 4 claiming that node 2's schema version is 30a31e52-f9e0-33e9-8241-6f648af22fbe?

This version is indeed one of the previous versions:

INFO  2023-11-26 20:02:01,877 [shard 0:strm] schema_tables - Schema version changed to 30a31e52-f9e0-33e9-8241-6f648af22fbe

@raphaelsc (Member)

@kbr-scylla (Contributor)

I put this under #17493; now we have to wait and see whether we observe it again after 111264e.

@kbr-scylla (Contributor)

Removing it from the raft-topology backlog - required; there's nothing we can currently do about it other than waiting.

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Apr 4, 2024
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen; 2 nodes are sufficient, so let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in the first
  phase it copies the updated `local_state` to all shards into a
  separate map. In the second phase the values from the separate map are
  used to overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the first phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  than the ones we used for the states added above. Let's call this new
  version X. Note that X is larger than the versions used by the
  application_states added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (now including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
  the endpoint_state with the ones it saved in phase 1. In particular, it
  reverts heart_beat back to a smaller value, but the larger problem is that it
  saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in
38c2347, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0; there could be no `heart_beat` bump in between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent (~100%)
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before the second phase of replicate, to
increase the chance that the gossiper loop will execute and bump the
heart_beat version during the yield. A further commit adds a test based on that.

The fix is to bump the heart_beat under the local endpoint lock, which is
also taken by `replicate`.

Fixes: scylladb#15393
Fixes: scylladb#15602
Fixes: scylladb#16668
Fixes: scylladb#16902
Fixes: scylladb#17493
Fixes: scylladb#18118
Fixes: scylladb/scylla-enterprise#3720
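
A minimal standalone model of the interleaving above (all names and structures here are simplifications I'm assuming for illustration; this is not Scylla code): a stale snapshot written back after the heart_beat bump leaves the new application state below the peer's "max seen version" watermark, so the peer never picks it up.

```cpp
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>

static long version_source = 0;                    // stands in for gms/version_generator.cc
static long next_version() { return ++version_source; }

struct endpoint_state {
    long heart_beat = 0;
    std::map<std::string, long> app_states;        // application state name -> version

    long max_version() const {
        long m = heart_beat;
        for (const auto& kv : app_states) {
            m = std::max(m, kv.second);
        }
        return m;
    }
};

int main() {
    endpoint_state live_state;                     // node B's state, as seen by gossip messages
    long peer_watermark = 0;                       // node A: max version it has seen from B

    // add_local_application_state: copy the endpoint state and bump STATUS in the copy.
    endpoint_state snapshot = live_state;
    snapshot.app_states["STATUS"] = next_version();        // version 1

    // replicate() yields between its two phases; during the yield the gossiper
    // loop bumps heart_beat using the same monotonic generator.
    live_state.heart_beat = next_version();                // version 2 == X

    // A SYN/ACK round trip now tells node A that B's max version is X.
    peer_watermark = live_state.max_version();             // X == 2

    // replicate() wakes up and overwrites the live state with the stale snapshot:
    // heart_beat goes back down, and STATUS carries version 1 < X.
    live_state = snapshot;

    // Node A only requests/applies states from B with versions greater than its watermark.
    const bool peer_sees_status = live_state.app_states["STATUS"] > peer_watermark;
    std::printf("node A sees B's new STATUS: %s\n", peer_sees_status ? "yes" : "no");  // "no"
}
```

Under the fix described in the commit, the heart_beat bump takes the same per-endpoint lock as replicate, so it cannot slip in between taking the snapshot and writing it back.
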
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Apr 4, 2024
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Apr 4, 2024
denesb added a commit that referenced this issue Apr 16, 2024
…amil Braun

Closes #18184

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
mergify bot pushed a commit that referenced this issue Apr 16, 2024
(cherry picked from commit a0b331b)
kbr-scylla added a commit that referenced this issue Apr 16, 2024
(cherry picked from commit a0b331b)
kbr-scylla added a commit that referenced this issue Apr 17, 2024
…rt_beat' from ScyllaDB

(cherry picked from commit a0b331b)

(cherry picked from commit 7295509)

Refs #18184

Closes #18245

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024