raft - remote_verb_error (connection is closed) when transferring snapshot #12972
Comments
@bhalevy - I think you had some pending work to elaborate more on what the failure might be in such situations?
There's #12799 but it's for the current node operations, unrelated to raft.
@gleb-cloudius raft snapshot transfer hits the unpaged query memory limit. Shall we increase the limit, or is it time to switch to paged transfer?
We use the existing schema_pull RPC call, which is also called without raft. Why don't we see it without raft? Something is fishy here. As for the limit, it should not be enforced for internal queries.
BTW, this is a soft limit anyway. It should not cause any exception to be thrown. I am not sure the connection dropping is related. It may be the same issue with the gossiper aborting existing connections when a node becomes normal that caused RBNO to fail the tests.
@kbr-scylla / @gleb-cloudius: is this blocking 5.2 for you?
No.
I confirmed that this is indeed the case. |
@soyacz for now please mask the error in the log. I have no idea yet how to fix all of the issues that dropping RPC connections during boot generates.
@mykaul please create an 'rpc' label or similar. Since @xemul's 7bdad47 we have gotten a lot of flaky tests, because connections can be dropped in the middle of an operation. In this case the test itself is not flaky, but there is an unwanted error message in the log. I don't know how to fix it elegantly yet. Generally, I think dropping messaging service connections should be transparent to clients; currently it isn't.
Due to scylladb/scylladb#12972 the SCT test fails. But this issue is not severe and should not cause the test to fail. Reduce the severity of `raft - transfer snapshot error` to `WARNING`, as per the developers' request, until the issue is fixed.
Due to scylladb/scylladb#12972 the SCT test fails. But this issue is not severe and should not cause the test to fail. Reduce the severity of `raft - transfer snapshot error` to `WARNING`, as per the developers' request, until the issue is fixed. (cherry picked from commit 9667ee1)
`handle_state_normal` may drop connections to the handled node. This causes spurious failures if there's an ongoing concurrent operation. This problem was already solved twice in the past in different contexts: first in 5363616, then in 79ee381. Time to fix it for the third time. Now we do this right after enabling gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although the transfer is retried and eventually succeeds, the failed transfer is wasted work and causes an annoying ERROR message in the log which dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()` call before `setup_group0()`. But for the wait to work correctly we must first ensure that gossiper sees an alive node, so we precede it with `wait_for_live_node_to_show_up()` (before this commit, the call site of `wait_for_normal_state_handled_on_boot` was already after this wait).

Also ensure that `wait_for_normal_state_handled_on_boot()` doesn't hang in raft-topology mode by adding `_normal_state_handled_on_boot.insert(endpoint);` in the raft-topology branch of `handle_state_normal`.

Fixes: scylladb#12972
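A minimal sketch of the reordering described above, assuming a Seastar coroutine boot path; the call names come from the commit message, but the surrounding function, signatures, and stub bodies are placeholders rather than the actual `storage_service` code:

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Placeholder stand-ins for the storage_service/gossiper members named in the
// commit message; only the ordering below matters for this illustration.
seastar::future<> wait_for_live_node_to_show_up() { return seastar::make_ready_future<>(); }
seastar::future<> wait_for_normal_state_handled_on_boot() { return seastar::make_ready_future<>(); }
seastar::future<> setup_group0() { return seastar::make_ready_future<>(); }

// Boot-time ordering after the fix: any connections dropped by
// handle_state_normal are dropped before group 0 (and its snapshot
// transfers) is set up, so a transfer can no longer be interrupted by it.
seastar::future<> join_after_fix() {
    co_await wait_for_live_node_to_show_up();          // gossiper must see an alive node first
    co_await wait_for_normal_state_handled_on_boot();  // NORMAL-state handling (and its connection drops) completes
    co_await setup_group0();                           // only now join group 0
}
```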
…ter enabling gossiping' from Kamil Braun

`handle_state_normal` may drop connections to the handled node. This causes spurious failures if there's an ongoing concurrent operation. This problem was already solved twice in the past in different contexts: first in 5363616, then in 79ee381. Time to fix it for the third time. Now we do this right after enabling gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although the transfer is retried and eventually succeeds, the failed transfer is wasted work and causes an annoying ERROR message in the log which dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()` call before `setup_group0()`. But for the wait to work correctly we must first ensure that gossiper sees an alive node, so we precede it with `wait_for_live_node_to_show_up()` (before this commit, the call site of `wait_for_normal_state_handled_on_boot` was already after this wait).

There is another problem: the bootstrap procedure is racing with gossiper marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee that they are also UP. If gossiper is quick enough, everything will be fine. If not, problems may arise such as streaming or repair failing due to nodes still being marked as DOWN, or the CDC generation write failing. In general, we need all NORMAL nodes to be up for bootstrap to proceed. One exception is replace, where we ignore the replaced node. The `sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot` takes this into account, so we also use it to wait for nodes to be UP.

As explained in commit messages and comments, we only do these waits outside raft-based-topology mode. This should improve CI stability.

Fixes: #12972
Refs: #14042
Closes #14354

* github.com:scylladb/scylladb:
  messaging_service: print which connections are dropped due to missing topology info
  storage_service: wait for nodes to be UP on bootstrap
  storage_service: wait for NORMAL state handler before `setup_group0()`
  storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`
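A rough sketch of the second half of the change, the `sync_nodes`/UP wait, under stated assumptions: node identities are plain strings and `is_alive` stands in for a gossiper liveness query; neither is the real API, and the polling loop is only illustrative:

```cpp
#include <algorithm>
#include <chrono>
#include <string>
#include <unordered_set>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>

// Placeholder liveness query; the real code consults the gossiper.
static bool is_alive(const std::string&) { return true; }

// sync_nodes = all NORMAL nodes minus the node being replaced. Bootstrap
// should not proceed (streaming, repair, CDC generation write) until the
// gossiper marks every one of them UP.
seastar::future<> wait_for_sync_nodes_up(std::unordered_set<std::string> sync_nodes,
                                         const std::string* replaced_node = nullptr) {
    if (replaced_node) {
        sync_nodes.erase(*replaced_node);  // replace: ignore the replaced node
    }
    while (!std::all_of(sync_nodes.begin(), sync_nodes.end(),
                        [](const std::string& n) { return is_alive(n); })) {
        co_await seastar::sleep(std::chrono::seconds(1));
    }
}
```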
@kbr-scylla please evaluate for backport
The snapshot transfer error is harmless (we automatically recover). The 5.2 tests are already ignoring it (scylladb/scylla-cluster-tests@4cdffb4). So no need to backport.
When a node joins the cluster, it closes connections after learning topology information from other nodes, in order to reopen them with correct encryption, compression, etc. In ScyllaDB 5.2, this mechanism may interrupt an ongoing Raft snapshot transfer. This was fixed in later versions by putting some order into the bootstrap process with 50e8ec7, but the fix was not backported due to many prerequisites and complexity.

Raft automatically recovers from an interrupted snapshot transfer by eventually retrying it, and everything works. However, an ERROR is reported due to that one failed snapshot transfer, and dtests don't like ERRORs -- they report the test case as failed if an ERROR happened in any node's logs, even if the test passed otherwise.

Here we apply a simple workaround to please dtests -- in this particular scenario, turn the ERROR into a WARN.
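A sketch of what such a workaround could look like; the helper function and its call site are hypothetical, and only `seastar::rpc::remote_verb_error` and the log levels come from the discussion above:

```cpp
#include <exception>
#include <seastar/rpc/rpc_types.hh>
#include <seastar/util/log.hh>

// Hypothetical reporting helper: keep ERROR for unexpected failures, but
// downgrade the known-benign "connection is closed" case to WARN, since
// raft retries the snapshot transfer and recovers on its own (#12972).
void report_snapshot_transfer_failure(seastar::logger& log, std::exception_ptr eptr) {
    try {
        std::rethrow_exception(eptr);
    } catch (const seastar::rpc::remote_verb_error& e) {
        log.warn("snapshot transfer failed, will retry: {}", e.what());
    } catch (...) {
        log.error("snapshot transfer failed: {}", std::current_exception());
    }
}
```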
Backports required to fix #16683 in 5.2:
- when creating the first group 0 server, create a snapshot with a non-empty ID, and start it at index 1 instead of 0 to force snapshot transfer to servers that join group 0
- add an API to trigger a Raft snapshot
- use the API when we restart and see that the existing snapshot is at index 0, to trigger a new one -- in order to fix broken deployments that already bootstrapped with an index-0 snapshot.

Closes #17087

* github.com:scylladb/scylladb:
  test_raft_snapshot_request: fix flakiness (again)
  test_raft_snapshot_request: fix flakiness
  Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun
  Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun
  raft: server: add workaround for #12972
  raft: Store snapshot update and truncate log atomically
  service: raft: force initial snapshot transfer in new cluster
  raft_sys_table_storage: give initial snapshot a non zero value
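A tiny sketch of the restart-time check described above (trigger a new snapshot when the existing one is at index 0); the `snapshot_info` type and `trigger_snapshot()` stub are placeholders, not the actual signatures added by the backport:

```cpp
#include <cstdint>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Placeholder for the persisted group 0 snapshot metadata.
struct snapshot_info {
    uint64_t idx = 0;   // log index covered by the snapshot
};

// Placeholder for the "trigger a Raft snapshot" API added by the backport.
seastar::future<> trigger_snapshot() { return seastar::make_ready_future<>(); }

// On restart: a deployment that bootstrapped before the fix still has an
// index-0 snapshot, so ask the Raft server to take a fresh one. New clusters
// now start with a non-empty snapshot ID at index 1 and never hit this path.
seastar::future<> maybe_repair_initial_snapshot(const snapshot_info& snap) {
    if (snap.idx == 0) {
        co_await trigger_snapshot();
    }
}
```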
Issue description
In the test, we create 1 node, then create 5000 tables, and then add more nodes to the cluster.
When adding the first node, the existing node's log shows an error (with some context):
In the meantime, the added node shows:
Impact
Unknown; later we can see the tables being populated and the cluster works.
But this causes SCT to raise an error and fail the test.
If
`seastar::rpc::remote_verb_error (connection is closed)`
is caught and raft is retrying, maybe we should hide it or rephrase it so as not to scare the user.
How frequently does it reproduce?
It happened for all added nodes (5) in this test.
Installation details
Kernel Version: 5.15.0-1028-aws
Scylla version (or git commit hash):
2023.1.0~rc1-20230208.fe3cc281ec73
with build-id ff20df9822b5b6397724a6ff6caadde419b383e6
Cluster size: 1 node (i3.8xlarge)
Scylla Nodes used in this run:
OS / Image: ami-056165f482cc8e0d8 (aws: eu-west-1)
Test:
scale-5000-tables-test
Test id:
d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda
Test name:
enterprise-2023.1/scale/scale-5000-tables-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda
$ hydra investigate show-logs d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda
Logs:
Jenkins job URL