[dtest] test failed due to a node getting stuck in bootstrap #15393

Closed · Tracked by #17493
patjed41 opened this issue Sep 13, 2023 · 12 comments · Fixed by #18184
Labels: Backport candidate, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework), tests/dtest

@patjed41 (Contributor)

Build https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3602/ failed on resharding_test.TestReshardingVariants.test_resharding_by_smp_increase[4-SizeTieredCompactionStrategy-15].

Attached logs:
dtest-gw4.log
1694518082201_resharding_test.py TestReshardingVariants test_resharding_by_smp_increase[4-SizeTieredCompactionStrategy-15].zip

This test runs on a 4-node cluster. Before the test even started, the bootstrap procedure of the 3rd node got stuck during "waiting for schema information to complete":

INFO  2023-09-12 11:13:52,529 [shard 0:stre] storage_service - waiting for schema information to complete
INFO  2023-09-12 11:13:52,530 [shard 0:stre] schema_tables - Schema version changed to 4fd068ed-d970-3076-a296-1f6f23a86d17
...
INFO  2023-09-12 11:27:59,193 [shard 0:main] compaction_manager - Asked to stop
@kbr-scylla (Contributor)

Taking off the infra? label; the issue reproduces.

I posted some observations in #15911.

@kbr-scylla (Contributor) commented Nov 9, 2023

KQL query for ElasticSearch to find similar failures:

outcome: failure and failure_message : "Missing.*Starting listening for CQL clients" and failure_message : "raise TimeoutError" and not outcome : error

Looks like it's happening both with Raft enabled and with Raft disabled.

I don't understand why "waiting for schema information to complete" hangs --- according to the logs, the schema version is the same on all nodes.

I see two possible explanations:

  • whatever is reported by the logs is not true --- maybe there's a code path that changes the schema version but doesn't cause a message to be printed
  • or there's some problem with gossiping schema versions; migration_manager::have_schema_agreement (which we're checking in a loop until it returns true) needs to obtain gms::application_state::SCHEMA from other nodes, and if it's not there for some node, it will return false (see the sketch below).
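
For reference, a minimal standalone sketch of the agreement check described in the second bullet (the names and shapes here are assumptions for illustration, not the actual migration_manager implementation): agreement requires every live peer to have gossiped a SCHEMA value, and that value must equal our own.

```cpp
#include <map>
#include <optional>
#include <string>

using schema_version = std::string;

// Hypothetical stand-in for a peer's gossip state; only the SCHEMA
// application state matters for this check.
struct peer_gossip_state {
    std::optional<schema_version> schema;   // gms::application_state::SCHEMA, if gossiped
};

// True only if every live peer has gossiped a SCHEMA value equal to ours.
// A peer with no SCHEMA entry makes the check fail, just like a mismatch does.
bool have_schema_agreement(const schema_version& local_version,
                           const std::map<std::string, peer_gossip_state>& live_peers) {
    for (const auto& [addr, state] : live_peers) {
        if (!state.schema || *state.schema != local_version) {
            return false;                   // missing or mismatching SCHEMA for this peer
        }
    }
    return true;
}
```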

Also, this looks kinda similar to #16004, but there it was not the global schema digest being different, it was the per-table versions (which we don't wait to be synchronized on bootstrap).

@kbr-scylla (Contributor)

Perhaps different calls of migration_manager::passive_announce are racing with each other, and a node may end up gossiping an old version. I don't see what prevents this from happening during the feature-enabling phase on boot, where we're concurrently calling recalculate_schema_version in the background. One of these background tasks, having calculated a schema version without all features, could update the gossiper after a task that calculated a schema version with all features.

cc @tgrabiec

@kbr-scylla (Contributor)

On the other hand, maybe not?

    auto schema_change_announce = _db.local().observable_schema_version().observe([this] (table_schema_version schema_version) mutable {
        _migration_manager.local().passive_announce(std::move(schema_version));
    });
void migration_manager::passive_announce(table_schema_version version) {
    _schema_version_to_publish = version;
    (void)_schema_push.trigger().handle_exception([version = std::move(version)] (std::exception_ptr ex) {
        mlogger.warn("Passive announcing of version {} failed: {}. Ignored.", version, ex);
    });
}

Let's assume that this observe mechanism serializes the callbacks, so that the newest version will always be handled last. In that case, the last passive_announce should set _schema_version_to_publish to the latest version, and that version will be published in the end.
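
As a toy model of that assumption (hypothetical names, not the actual trigger machinery): if every announce only overwrites a single "version to publish" slot, and the publish step reads the slot when it actually runs, then whichever announce ran last determines what gets published, and earlier racing announces are harmless.

```cpp
#include <cstdio>
#include <string>
#include <utility>

// Toy stand-in for the passive_announce / _schema_push pattern quoted above.
struct schema_publisher {
    std::string version_to_publish;           // plays the role of _schema_version_to_publish

    void passive_announce(std::string v) {    // may be called many times before publishing
        version_to_publish = std::move(v);
    }

    void publish() {                          // plays the role of the deferred push work
        std::printf("publishing schema version %s\n", version_to_publish.c_str());
    }
};

int main() {
    schema_publisher p;
    p.passive_announce("version-without-all-features");  // earlier, racing announce
    p.passive_announce("version-with-all-features");     // latest announce wins
    p.publish();                                          // prints the latest version only
}
```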

But I'm not sure.

@kbr-scylla (Contributor) commented Nov 9, 2023

Maybe the race could happen inside here:

static
future<> update_schema_version_and_announce(sharded<db::system_keyspace>& sys_ks, distributed<service::storage_proxy>& proxy, schema_features features) {
    auto uuid = co_await calculate_schema_digest(proxy, features);
    co_await sys_ks.local().update_schema_version(uuid);
    co_await proxy.local().get_db().invoke_on_all([uuid] (replica::database& db) {
        db.update_version(uuid);
    });
    slogger.info("Schema version changed to {}", uuid);
}

If a task with an obsolete schema version hung before the db.update_version step, we would end up with that version eventually.
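
A toy interleaving of that hypothesis (single-threaded sketch with hypothetical values; the real code is the coroutine above): the task that computed the obsolete digest resumes after the newer one has already been applied, and overwrites it.

```cpp
#include <cstdio>
#include <string>

int main() {
    std::string in_memory_version = "initial";

    // Task 1 computed digest "old" (e.g. without all features), then got suspended
    // before reaching the db.update_version step.
    const std::string task1_uuid = "old";

    // Task 2 computed digest "new" (with all features) and ran to completion first.
    const std::string task2_uuid = "new";
    in_memory_version = task2_uuid;          // db.update_version("new")

    // Task 1 finally resumes and applies its stale digest last.
    in_memory_version = task1_uuid;          // db.update_version("old")

    std::printf("final in-memory schema version: %s\n", in_memory_version.c_str());  // "old"
}
```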

Edit: but then we'd see a "Schema version changed to" message with a different version at the end of the log... which we don't...

@kbr-scylla (Contributor)

It could be that the node is actually leaving the have_schema_agreement loop and getting stuck in update_topology_change_info trying to acquire the merge lock --- obviously, to confuse the enemy, there is no log message in between. But what would be holding the merge lock? cc @bhalevy

I feel stuck in the investigation, so I'll send a PR to improve the logging. Next time the issue reproduces in one of our CI runs, we'll have more details.

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Nov 9, 2023
We're observing nodes getting stuck during bootstrap inside
`storage_service::wait_for_ring_to_settle()`, which periodically checks
`migration_manager::have_schema_agreement()` until it becomes `true`:
scylladb#15393.

There is no obvious reason why that happens -- according to the nodes'
logs, their latest in-memory schema version is the same.

So either the gossiped schema version is for some reason different
(perhaps there is a race in publishing `application_state::SCHEMA`) or
missing entirely.

Alternatively, `wait_for_ring_to_settle` is leaving the
`have_schema_agreement` loop and getting stuck in
`update_topology_change_info` trying to acquire a lock.

Modify logging inside `have_schema_agreement` so that details about missing
schema or version mismatch are logged on INFO level, and an INFO level
message is printed before we return `true`. To prevent the logs from getting
spammed, rate-limit the periodic messages to once every 5 seconds. This
will still show the reason in our tests, which allow the node to hang for
many minutes before timing out. Also, these schema agreement checks are
done on relatively rare occasions such as bootstrap, so the additional
logs should not be harmful.

Furthermore, when publishing the schema version to gossip, log it on INFO
level. This happens at most once per schema change, so it's a rare
message. If there's a race in publishing schema versions, this should
allow us to observe it.

Ref: scylladb#15393
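
To picture the rate limiting described in the commit above, here is a hedged standalone sketch (the helper and its shape are assumptions for illustration, not Scylla's actual logger API): at most one periodic message per 5-second window, with a count of the messages dropped in between.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical helper illustrating the behaviour described in the commit message:
// emit at most one message per window and report how many similar ones were dropped.
struct rate_limited_logger {
    std::chrono::steady_clock::duration window = std::chrono::seconds(5);
    std::chrono::steady_clock::time_point last{};   // epoch start, so the first call always logs
    int dropped = 0;

    void log(const char* msg) {
        const auto now = std::chrono::steady_clock::now();
        if (now - last < window) {
            ++dropped;                               // within the window: count and drop
            return;
        }
        if (dropped > 0) {
            std::printf("(rate limiting dropped %d similar messages) ", dropped);
        }
        std::printf("%s\n", msg);
        last = now;
        dropped = 0;
    }
};

int main() {
    rate_limited_logger log;
    for (int i = 0; i < 1000; ++i) {
        log.log("Schema mismatch for 127.0.3.2");    // only the first call in the window prints
    }
}
```

This mirrors the "(rate limiting dropped N similar messages)" pattern visible in the reproduction logs later in the thread.
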
@kbr-scylla (Contributor)

I feel stuck in the investigation, so I'll send a PR to improve the logging. Next time the issue reproduces in one of our CI runs, we'll have more details.

#16021

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Nov 13, 2023
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Nov 13, 2023
denesb pushed a commit that referenced this issue Nov 14, 2023

Closes #16021
@kbr-scylla (Contributor)

Issue reproduced with more logs:
node1.log
node2.log
node3.log
node4.log

@kbr-scylla (Contributor)

node4 got stuck during boot, printing in a loop:

INFO  2023-11-26 20:02:33,901 [shard 0:strm] migration_manager - (rate limiting dropped 497 similar messages) Schema mismatch for 127.0.3.2 (5269ca7d-7b22-3edc-8795-73f8591117d9 != 30a31e52-f9e0-33e9-8241-6f648af22fbe).

The last "Schema version changed to" message on node 4 is:

INFO  2023-11-26 20:02:28,915 [shard 0:strm] schema_tables - Schema version changed to 5269ca7d-7b22-3edc-8795-73f8591117d9

which is consistent with the mismatch message above.

The last "Schema version changed to" message on node 2 (127.0.3.2) is:

INFO  2023-11-26 20:02:01,931 [shard 0:strm] schema_tables - Schema version changed to 5269ca7d-7b22-3edc-8795-73f8591117d9

which is not consistent with the mismatch message.

So why is node 4 claiming that node 2's schema version is 30a31e52-f9e0-33e9-8241-6f648af22fbe?

This version is indeed one of the previous versions:

INFO  2023-11-26 20:02:01,877 [shard 0:strm] schema_tables - Schema version changed to 30a31e52-f9e0-33e9-8241-6f648af22fbe

@raphaelsc (Member)

@kbr-scylla (Contributor)

I put this under #17493; now we have to wait and see whether we observe it again after 111264e.

@kbr-scylla (Contributor)

Removing it from the raft-topology backlog - required; there's nothing we can currently do about it other than waiting.

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Apr 4, 2024
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen; 2 nodes are sufficient, so let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in the first
  phase it copies the updated `local_state` to all shards into a
  separate map. In the second phase the values from the separate map are
  used to overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the first phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  than the ones we used for the states added above. Let's call this new
  version X. Note that X is larger than the versions used by the
  application_states added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (now including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
  the endpoint_state with the ones it saved in phase 1. In particular, it
  reverts heart_beat back to a smaller value, but the larger problem is that it
  saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in
38c2347, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0; there could be no `heart_beat` bump in between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent (~100%)
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before the second phase of replicate, to
increase the chance that the gossiper loop will execute and bump the
heart_beat version during the yield. A further commit adds a test based on that.

The fix is to bump the heart_beat under the local endpoint lock, which is
also taken by `replicate`.

Fixes: scylladb#15393
Fixes: scylladb#15602
Fixes: scylladb#16668
Fixes: scylladb#16902
Fixes: scylladb#17493
Fixes: scylladb#18118
Fixes: scylladb/scylla-enterprise#3720
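
A minimal standalone model of the interleaving above (all names and structures here are simplifications I'm assuming for illustration; this is not Scylla code): a stale snapshot written back after the heart_beat bump leaves the new application state below the peer's "max seen version" watermark, so the peer never picks it up.

```cpp
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>

static long version_source = 0;                    // stands in for gms/version_generator.cc
static long next_version() { return ++version_source; }

struct endpoint_state {
    long heart_beat = 0;
    std::map<std::string, long> app_states;        // application state name -> version

    long max_version() const {
        long m = heart_beat;
        for (const auto& kv : app_states) {
            m = std::max(m, kv.second);
        }
        return m;
    }
};

int main() {
    endpoint_state live_state;                     // node B's state, as seen by gossip messages
    long peer_watermark = 0;                       // node A: max version it has seen from B

    // add_local_application_state: copy the endpoint state and bump STATUS in the copy.
    endpoint_state snapshot = live_state;
    snapshot.app_states["STATUS"] = next_version();        // version 1

    // replicate() yields between its two phases; during the yield the gossiper
    // loop bumps heart_beat using the same monotonic generator.
    live_state.heart_beat = next_version();                // version 2 == X

    // A SYN/ACK round trip now tells node A that B's max version is X.
    peer_watermark = live_state.max_version();             // X == 2

    // replicate() wakes up and overwrites the live state with the stale snapshot:
    // heart_beat goes back down, and STATUS carries version 1 < X.
    live_state = snapshot;

    // Node A only requests/applies states from B with versions greater than its watermark.
    const bool peer_sees_status = live_state.app_states["STATUS"] > peer_watermark;
    std::printf("node A sees B's new STATUS: %s\n", peer_sees_status ? "yes" : "no");  // "no"
}
```

Under the fix described in the commit, the heart_beat bump takes the same per-endpoint lock as replicate, so it cannot slip in between taking the snapshot and writing it back.
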
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Apr 4, 2024
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Apr 4, 2024
denesb added a commit that referenced this issue Apr 16, 2024
…amil Braun

Closes #18184

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
mergify bot pushed a commit that referenced this issue Apr 16, 2024
(cherry picked from commit a0b331b)
kbr-scylla added a commit that referenced this issue Apr 16, 2024
(cherry picked from commit a0b331b)
kbr-scylla added a commit that referenced this issue Apr 17, 2024
…rt_beat' from ScyllaDB

(cherry picked from commit a0b331b)

(cherry picked from commit 7295509)

Refs #18184

Closes #18245

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024