gossiper, a new node doesn't get properly notified about other node restart #14042
Idea for reproducer: add a check into gossiper code: immediately after the first gossiper round -- which should contact all of the seed nodes -- verify that gossiper considers all the nodes as NORMAL. The CL=ALL write is racing with gossiper rounds; if the second round fixes the problem it will mask it, making it harder to reproduce (in runs where the write happens after the second round, the test would pass).
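A minimal standalone sketch of the kind of check meant here (illustrative only - `gossiper_model`, `app_status`, and `run_round` are made-up names for the example, not the real Scylla gossiper API):

```cpp
// Illustrative sketch only -- not the real Scylla gossiper code.
// Models the proposed reproducer check: right after the first gossip round
// (which contacts every seed), assert that every known endpoint is NORMAL.
#include <cassert>
#include <map>
#include <string>

enum class app_status { UNKNOWN, SHUTDOWN, NORMAL };

struct endpoint_state {
    app_status status = app_status::UNKNOWN;
};

struct gossiper_model {
    std::map<std::string, endpoint_state> endpoint_state_map;
    int rounds_completed = 0;

    void run_round() {
        // ... exchange digests with the seeds, merge newer endpoint states ...
        ++rounds_completed;
        if (rounds_completed == 1) {
            check_all_normal_after_first_round();
        }
    }

    void check_all_normal_after_first_round() const {
        for (const auto& [endpoint, state] : endpoint_state_map) {
            // The first round talked to every seed, so the newest status
            // (NORMAL) should already have been learned for all nodes.
            assert(state.status == app_status::NORMAL &&
                   "node still not NORMAL after the first gossip round");
            (void)endpoint; // only the status matters for this check
        }
    }
};
```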
@gusev-p did you ever see this happen again?
Nope. I tried to construct a concrete scenario of how this could happen, but got bogged down in all the gossiper details. There are too many things over there that depend on time, so I believe this can happen if some activities took longer than expected. We can leave this issue assigned to me; if it comes up again, that will add motivation to investigate further.
The new node has this:
The skip is due to the shutdown status. The cause is that the new node didn't wait for the gossip round (we run tests with skip_wait_for_gossip_to_settle=0), so it didn't have time to learn that the status is no longer shutdown.
The problem is that the node should never have received the shutdown status in the first place, because it performs a gossip round with every seed, and at least one of them already knows that the node is no longer shutdown but NORMAL.
Well, to be precise, it may see that it's shutdown after contacting one of the seeds, but there is at least one seed which should have the newest version saying NORMAL (like the node that restarted itself) - and IIUC it performs a round with all seeds.
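To make the argument concrete, here is a tiny standalone model (not Scylla code) of why contacting every seed should be enough: endpoint states are merged by version, and at least one seed holds the newest version, which carries NORMAL:

```cpp
// Standalone model (not Scylla code) of version-based state merging.
#include <cassert>
#include <string>
#include <vector>

struct versioned_status {
    long version;
    std::string status;
};

// Merge rule: keep whichever status carries the higher version.
versioned_status merge(versioned_status local, versioned_status remote) {
    return remote.version > local.version ? remote : local;
}

int main() {
    versioned_status local{0, "unknown"};
    // One seed still remembers "shutdown"; another (e.g. the restarted node
    // itself, if it is a seed) already has the newer "NORMAL" state.
    std::vector<versioned_status> seeds = {{5, "shutdown"}, {9, "NORMAL"}};
    for (const auto& s : seeds) {
        local = merge(local, s);
    }
    assert(local.status == "NORMAL"); // the highest version wins
    return 0;
}
```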
Happened again in CI: did the frequency increase?
Could this also be influencing #14274? @kbr-scylla the endpoint state in gossip is shutdown; the booting node sends its new IP address; the update/notification is skipped since the node is not NORMAL.
The node sends its NORMAL status eventually on restart, so the IP should be updated.
Happened again |
The bootstrap procedure is racing with gossiper marking nodes as UP. If gossiper is quick enough, everything will be fine. If not, problems may arise such as streaming or repair failing due to nodes still being marked as DOWN, or the CDC generation write failing. In general, we need all NORMAL nodes to be up for bootstrap to proceed. One exception is replace where we ignore the replaced node. The `sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot` takes this into account, so we use it. Fixes: scylladb#14042
The bootstrap procedure is racing with gossiper marking nodes as UP. If gossiper is quick enough, everything will be fine. If not, problems may arise such as streaming or repair failing due to nodes still being marked as DOWN, or the CDC generation write failing. In general, we need all NORMAL nodes to be up for bootstrap to proceed. One exception is replace where we ignore the replaced node. The `sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot` takes this into account, so we use it.

Refs: scylladb#14042

This doesn't completely fix scylladb#14042 yet because it's specific to the gossiper-based topology mode only. For Raft-based topology, the node joining procedure will be coordinated by the topology coordinator right from the start, and it will be the coordinator who issues the 'wait for node to see other live nodes'.
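A sketch of the idea under the stated assumptions (illustrative only - `make_sync_nodes` and `wait_for_nodes_up` are hypothetical helpers, not the real `storage_service` or gossiper API): compute the set of NORMAL nodes that must be seen as UP, excluding the replaced node, and wait for all of them:

```cpp
// Illustrative model only -- make_sync_nodes() and wait_for_nodes_up() are
// hypothetical helpers, not the real storage_service/gossiper API.
#include <chrono>
#include <functional>
#include <optional>
#include <set>
#include <string>
#include <thread>

using node_id = std::string;

// All NORMAL nodes must be UP before bootstrap proceeds, except the node
// being replaced (if any), which we ignore.
std::set<node_id> make_sync_nodes(const std::set<node_id>& normal_nodes,
                                  const std::optional<node_id>& replaced_node) {
    std::set<node_id> sync_nodes = normal_nodes;
    if (replaced_node) {
        sync_nodes.erase(*replaced_node);
    }
    return sync_nodes;
}

// Wait (with a timeout) until every sync node is marked UP.
bool wait_for_nodes_up(const std::set<node_id>& sync_nodes,
                       const std::function<bool(const node_id&)>& is_up,
                       std::chrono::seconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        bool all_up = true;
        for (const auto& n : sync_nodes) {
            if (!is_up(n)) { all_up = false; break; }
        }
        if (all_up) {
            return true;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    return false;
}

int main() {
    std::set<node_id> normal = {"n1", "n2", "n3"};
    // Replace scenario: n3 is being replaced, so we do not wait for it.
    auto sync = make_sync_nodes(normal, std::optional<node_id>{"n3"});
    bool ok = wait_for_nodes_up(sync, [](const node_id&) { return true; },
                                std::chrono::seconds(1));
    return ok ? 0 : 1;
}
```

The real code would react to gossiper UP notifications rather than polling; the polling loop here just keeps the model self-contained.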
…ter enabling gossiping' from Kamil Braun

`handle_state_normal` may drop connections to the handled node. This causes spurious failures if there's an ongoing concurrent operation. This problem was already solved twice in the past in different contexts: first in 5363616, then in 79ee381. Time to fix it for the third time. Now we do this right after enabling gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although the transfer is retried and eventually succeeds, the failed transfer is wasted work and causes an annoying ERROR message in the log which dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()` call before `setup_group0()`. But for the wait to work correctly we must first ensure that gossiper sees an alive node, so we precede it with `wait_for_live_node_to_show_up()` (before this commit, the call site of `wait_for_normal_state_handled_on_boot` was already after this wait).

There is another problem: the bootstrap procedure is racing with gossiper marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee that they are also UP. If gossiper is quick enough, everything will be fine. If not, problems may arise such as streaming or repair failing due to nodes still being marked as DOWN, or the CDC generation write failing. In general, we need all NORMAL nodes to be up for bootstrap to proceed. One exception is replace where we ignore the replaced node. The `sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot` takes this into account, so we also use it to wait for nodes to be UP.

As explained in commit messages and comments, we only do these waits outside raft-based-topology mode. This should improve CI stability.

Fixes: #12972
Refs: #14042

Closes #14354

* github.com:scylladb/scylladb:
  - messaging_service: print which connections are dropped due to missing topology info
  - storage_service: wait for nodes to be UP on bootstrap
  - storage_service: wait for NORMAL state handler before `setup_group0()`
  - storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`
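A simplified model of the boot ordering described above (stub functions stand in for the real `storage_service` members, and the exact placement of the UP wait relative to `setup_group0()` is an assumption of this sketch):

```cpp
// Simplified model of the boot ordering; free-function stubs stand in for
// the real storage_service members, and the placement of the UP wait is an
// assumption of this sketch.
#include <iostream>

void start_gossiping()                       { std::cout << "gossip enabled\n"; }
void wait_for_live_nodes_to_show_up()        { std::cout << "saw a live node\n"; }
void wait_for_normal_state_handled_on_boot() { std::cout << "NORMAL states handled\n"; }
void wait_for_sync_nodes_to_be_up()          { std::cout << "sync nodes are UP\n"; }
void setup_group0()                          { std::cout << "group 0 set up\n"; }

void boot(bool raft_topology_mode) {
    start_gossiping();
    if (!raft_topology_mode) {
        // First make sure gossiper already sees an alive node, otherwise the
        // NORMAL-state wait below could stall.
        wait_for_live_nodes_to_show_up();
        // Moved before setup_group0() so that handle_state_normal() cannot
        // drop connections in the middle of the group 0 snapshot transfer.
        wait_for_normal_state_handled_on_boot();
        // Waiting for NORMAL does not imply UP, so also wait for that.
        wait_for_sync_nodes_to_be_up();
    }
    setup_group0();
    // ... bootstrap / streaming / CDC generation write follow ...
}

int main() { boot(/*raft_topology_mode=*/false); }
```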
Got closed due to this part of the commit message, I guess:
@kbr-scylla - so what is the next step here?
Ok, we're not seeing this after 51cec2b. No point in keeping this open, I think.
Seems it happened also in 5.2/next. @kbr-scylla - is some backport missing? Link to the log: https://jenkins.scylladb.com/job/scylla-5.2/job/next/289/artifact/testlog/x86_64/release/topology.test_topology_schema.3.log
@kbr-scylla ping, seen again on 5.2: https://jenkins.scylladb.com/view/scylla-5.2/job/scylla-5.2/job/next/337/
The backport is nontrivial and there are nontrivial prerequisites that would also have to be backported. Since this is a test issue (it doesn't happen in prod with all the sleeps), the backport is not strictly necessary. Still, we should deal somehow with the test flakiness... The problem is that this can happen any time we start a cluster in test.py. There is no single test to disable to ignore it. I don't know how to deal with it :/ @avikivity should I prepare a large backport just to unflake the tests? Another way would be to just ignore the flakiness...
Seen also on 2024.1 - is a backport to 2024.1 required?
@Annamikhlin this was caused by a different issue.
This is #15747, indeed a missing backport. I'll prepare it.
The fix was only backported to 5.4 (#15747 (comment)).
Three nodes in a cluster, trying to add a new node after some other node has been restarted, getting
Initially reproduced on this CI job.
The failed test is `Tests / Unit Tests / non-boost tests.topology.test_topology_schema.release.1`, failing on the line `await manager.server_add()`.

Three existing nodes: (121, 127.206.68.28), (132, 127.206.68.11), (127, 127.206.68.26) - this last node has been restarted. The new node is (149, 127.206.68.56).
scylla-121.log
scylla-121.yaml.txt
scylla-132.log
scylla-132.yaml.txt
scylla-127.log
scylla-127.yaml.txt
scylla-149.log
scylla-149.yaml.txt
`gossiper::run` at 14:01:40,479 should have contacted all three seed nodes here, since `_live_endpoints` are empty - we cleared them in `check_for_endpoint_collision`.

Looks like there's some problem with `heart_beat_version`, like they are not compared correctly or (de)serialized.
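For context, gossip-style protocols usually order endpoint states by (generation, version): a higher generation (bumped on restart) wins, and only within the same generation does `heart_beat_version` decide. A standalone illustration of that comparison (not Scylla's actual code) - a bug here, or in (de)serializing the version, would let a stale shutdown state survive a round with a seed that already has the NORMAL one:

```cpp
// Generic gossip-style comparison, shown standalone -- not Scylla's code.
#include <cassert>

struct heart_beat_state {
    int generation;  // bumped when the node restarts
    long version;    // bumped on every state change / gossip round
};

// "a is newer than b": higher generation wins; only within the same
// generation does heart_beat_version decide.
bool is_newer(const heart_beat_state& a, const heart_beat_state& b) {
    if (a.generation != b.generation) {
        return a.generation > b.generation;
    }
    return a.version > b.version;
}

int main() {
    heart_beat_state before_restart{1, 100};
    heart_beat_state after_restart{2, 3};
    // The restarted node has a higher generation, so its state must win even
    // though its per-generation version is smaller. Getting this comparison
    // (or its serialization) wrong keeps the stale "shutdown" state around.
    assert(is_newer(after_restart, before_restart));
    return 0;
}
```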