raft - remote_verb_error (connection is closed) when transferring snapshot #12972
Comments
@bhalevy - I think you had some pending work to elaborate more on what the failure might be in such situations?
There's #12799 but it's for the current node operations, unrelated to raft.
@gleb-cloudius raft snapshot transfer hits the unpaged query memory limit. Shall we increase the limit, or is it time to switch to paged transfer?
We use the existing schema_pull RPC call, which is also called without raft. Why don't we see it without raft? Something is fishy here. As for the limit, it should not be enforced for internal queries.
BTW, this is a soft limit anyway. It should not cause any exception to be thrown. I am not sure the connection dropping is related. It may be the same issue with the gossiper aborting existing connections when a node becomes normal that caused RBNO to fail the tests.
@kbr-scylla / @gleb-cloudius: is this blocking 5.2 for you?
No.
I confirmed that this is indeed the case. |
@soyacz for now please mask the error in the log. I have no idea yet how to fix all of the issues that dropping RPC connections during boot generates.
@mykaul please create an 'rpc' label or similar. Since @xemul's 7bdad47 we have gotten a lot of flaky tests, because connections can be dropped in the middle of an operation. In this case the test itself is not flaky, but there is an unwanted error message in the log. I don't know how to fix it elegantly yet. Generally, I think dropping messaging service connections should be transparent to clients; currently it isn't.
Due to scylladb/scylladb#12972 the SCT test fails. But this issue is not severe and should not cause the test to fail. Reduce the severity of `raft - transfer snapshot error` to `WARNING`, as per the developers' request, until the issue is fixed.
Due to scylladb/scylladb#12972 the SCT test fails. But this issue is not severe and should not cause the test to fail. Reduce the severity of `raft - transfer snapshot error` to `WARNING`, as per the developers' request, until the issue is fixed. (cherry picked from commit 9667ee1)
`handle_state_normal` may drop connections to the handled node. This causes spurious failures if there's an ongoing concurrent operation. This problem was already solved twice in the past in different contexts: first in 5363616, then in 79ee381. Time to fix it for the third time. Now we do this right after enabling gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although the transfer is retried and eventually succeeds, the failed transfer is wasted work and causes an annoying ERROR message in the log which dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()` call before `setup_group0()`. But for the wait to work correctly we must first ensure that gossiper sees an alive node, so we precede it with `wait_for_live_node_to_show_up()` (before this commit, the call site of `wait_for_normal_state_handled_on_boot` was already after this wait).

Also ensure that `wait_for_normal_state_handled_on_boot()` doesn't hang in raft-topology mode by adding `_normal_state_handled_on_boot.insert(endpoint);` in the raft-topology branch of `handle_state_normal`.

Fixes: scylladb#12972
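A minimal sketch of the reordering described above, assuming a Seastar coroutine boot path; the call names come from the commit message, but the surrounding function, signatures, and stub bodies are placeholders rather than the actual `storage_service` code:

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Placeholder stand-ins for the storage_service/gossiper members named in the
// commit message; only the ordering below matters for this illustration.
seastar::future<> wait_for_live_node_to_show_up() { return seastar::make_ready_future<>(); }
seastar::future<> wait_for_normal_state_handled_on_boot() { return seastar::make_ready_future<>(); }
seastar::future<> setup_group0() { return seastar::make_ready_future<>(); }

// Boot-time ordering after the fix: any connections dropped by
// handle_state_normal are dropped before group 0 (and its snapshot
// transfers) is set up, so a transfer can no longer be interrupted by it.
seastar::future<> join_after_fix() {
    co_await wait_for_live_node_to_show_up();          // gossiper must see an alive node first
    co_await wait_for_normal_state_handled_on_boot();  // NORMAL-state handling (and its connection drops) completes
    co_await setup_group0();                           // only now join group 0
}
```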
…ter enabling gossiping' from Kamil Braun

`handle_state_normal` may drop connections to the handled node. This causes spurious failures if there's an ongoing concurrent operation. This problem was already solved twice in the past in different contexts: first in 5363616, then in 79ee381. Time to fix it for the third time. Now we do this right after enabling gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although the transfer is retried and eventually succeeds, the failed transfer is wasted work and causes an annoying ERROR message in the log which dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()` call before `setup_group0()`. But for the wait to work correctly we must first ensure that gossiper sees an alive node, so we precede it with `wait_for_live_node_to_show_up()` (before this commit, the call site of `wait_for_normal_state_handled_on_boot` was already after this wait).

There is another problem: the bootstrap procedure is racing with gossiper marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee that they are also UP. If gossiper is quick enough, everything will be fine. If not, problems may arise such as streaming or repair failing due to nodes still being marked as DOWN, or the CDC generation write failing. In general, we need all NORMAL nodes to be up for bootstrap to proceed. One exception is replace, where we ignore the replaced node. The `sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot` takes this into account, so we also use it to wait for nodes to be UP.

As explained in commit messages and comments, we only do these waits outside raft-based-topology mode. This should improve CI stability.

Fixes: #12972
Refs: #14042
Closes #14354

* github.com:scylladb/scylladb:
  messaging_service: print which connections are dropped due to missing topology info
  storage_service: wait for nodes to be UP on bootstrap
  storage_service: wait for NORMAL state handler before `setup_group0()`
  storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`
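A rough sketch of the second half of the change, the `sync_nodes`/UP wait, under stated assumptions: node identities are plain strings and `is_alive` stands in for a gossiper liveness query; neither is the real API, and the polling loop is only illustrative:

```cpp
#include <algorithm>
#include <chrono>
#include <string>
#include <unordered_set>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>

// Placeholder liveness query; the real code consults the gossiper.
static bool is_alive(const std::string&) { return true; }

// sync_nodes = all NORMAL nodes minus the node being replaced. Bootstrap
// should not proceed (streaming, repair, CDC generation write) until the
// gossiper marks every one of them UP.
seastar::future<> wait_for_sync_nodes_up(std::unordered_set<std::string> sync_nodes,
                                         const std::string* replaced_node = nullptr) {
    if (replaced_node) {
        sync_nodes.erase(*replaced_node);  // replace: ignore the replaced node
    }
    while (!std::all_of(sync_nodes.begin(), sync_nodes.end(),
                        [](const std::string& n) { return is_alive(n); })) {
        co_await seastar::sleep(std::chrono::seconds(1));
    }
}
```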
@kbr-scylla please evaluate for backport
The snapshot transfer error is harmless (we automatically recover). The 5.2 tests are already ignoring it (scylladb/scylla-cluster-tests@4cdffb4). So no need to backport.
When a node joins the cluster, it closes connections after learning topology information from other nodes, in order to reopen them with correct encryption, compression, etc. In ScyllaDB 5.2, this mechanism may interrupt an ongoing Raft snapshot transfer. This was fixed in later versions by putting some order into the bootstrap process with 50e8ec7, but the fix was not backported due to many prerequisites and complexity.

Raft automatically recovers from an interrupted snapshot transfer by eventually retrying it, and everything works. However, an ERROR is reported due to that one failed snapshot transfer, and dtests don't like ERRORs -- they report the test case as failed if an ERROR happened in any node's logs, even if the test passed otherwise.

Here we apply a simple workaround to please dtests -- in this particular scenario, turn the ERROR into a WARN.
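A sketch of what such a workaround could look like; the helper function and its call site are hypothetical, and only `seastar::rpc::remote_verb_error` and the log levels come from the discussion above:

```cpp
#include <exception>
#include <seastar/rpc/rpc_types.hh>
#include <seastar/util/log.hh>

// Hypothetical reporting helper: keep ERROR for unexpected failures, but
// downgrade the known-benign "connection is closed" case to WARN, since
// raft retries the snapshot transfer and recovers on its own (#12972).
void report_snapshot_transfer_failure(seastar::logger& log, std::exception_ptr eptr) {
    try {
        std::rethrow_exception(eptr);
    } catch (const seastar::rpc::remote_verb_error& e) {
        log.warn("snapshot transfer failed, will retry: {}", e.what());
    } catch (...) {
        log.error("snapshot transfer failed: {}", std::current_exception());
    }
}
```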
Backports required to fix #16683 in 5.2:
- when creating the first group 0 server, create a snapshot with a non-empty ID, and start it at index 1 instead of 0 to force snapshot transfer to servers that join group 0
- add an API to trigger a Raft snapshot
- use the API when we restart and see that the existing snapshot is at index 0, to trigger a new one -- in order to fix broken deployments that already bootstrapped with an index-0 snapshot.

Closes #17087

* github.com:scylladb/scylladb:
  test_raft_snapshot_request: fix flakiness (again)
  test_raft_snapshot_request: fix flakiness
  Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun
  Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun
  raft: server: add workaround for #12972
  raft: Store snapshot update and truncate log atomically
  service: raft: force initial snapshot transfer in new cluster
  raft_sys_table_storage: give initial snapshot a non zero value
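A tiny sketch of the restart-time check described above (trigger a new snapshot when the existing one is at index 0); the `snapshot_info` type and `trigger_snapshot()` stub are placeholders, not the actual signatures added by the backport:

```cpp
#include <cstdint>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Placeholder for the persisted group 0 snapshot metadata.
struct snapshot_info {
    uint64_t idx = 0;   // log index covered by the snapshot
};

// Placeholder for the "trigger a Raft snapshot" API added by the backport.
seastar::future<> trigger_snapshot() { return seastar::make_ready_future<>(); }

// On restart: a deployment that bootstrapped before the fix still has an
// index-0 snapshot, so ask the Raft server to take a fresh one. New clusters
// now start with a non-empty snapshot ID at index 1 and never hit this path.
seastar::future<> maybe_repair_initial_snapshot(const snapshot_info& snap) {
    if (snap.idx == 0) {
        co_await trigger_snapshot();
    }
}
```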
Issue description
In the test, we create 1 node, then create 5000 tables, and then add more nodes to the cluster.
When adding the first node, the existing node's log shows an error (with some context):
In the meantime, the added node shows:
Impact
Unknown; later we can see the tables being populated and the cluster works.
But this causes SCT to raise an error and fail the test.
If
`seastar::rpc::remote_verb_error (connection is closed)`
is caught and raft is retrying, maybe we should hide it or rephrase it so as not to scare the user.
How frequently does it reproduce?
It happened for all added nodes (5) in this test.
Installation details
Kernel Version: 5.15.0-1028-aws
Scylla version (or git commit hash):
2023.1.0~rc1-20230208.fe3cc281ec73
with build-id ff20df9822b5b6397724a6ff6caadde419b383e6
Cluster size: 1 node (i3.8xlarge)
Scylla Nodes used in this run:
OS / Image: ami-056165f482cc8e0d8 (aws: eu-west-1)
Test:
scale-5000-tables-test
Test id:
d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda
Test name:
enterprise-2023.1/scale/scale-5000-tables-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda
$ hydra investigate show-logs d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda
Logs:
Jenkins job URL