
raft - remote_verb_error (connection is closed) when transferring snapshot #12972

Closed · 1 of 2 tasks
soyacz opened this issue Feb 23, 2023 · 12 comments · Fixed by #14354
Labels: area/rpc RPC related issue

Comments

soyacz (Contributor) commented Feb 23, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

In this test, we create 1 node, create 5000 tables, and then add more nodes to the cluster.
When adding the first new node, the existing node's log shows an error (with some context):

Feb 21 16:30:46 longevity-5000-tables-2023-1-db-node-d89f20f6-1 scylla[5704]:  [shard  1] raft_group_registry - Raft server id a1018c3f-739c-4fea-8d2b-94fdaf448142 cannot be translated to an IP address.
Feb 21 16:30:47 longevity-5000-tables-2023-1-db-node-d89f20f6-1 scylla[5704]:  [shard  0] gossip - InetAddress 10.4.2.158 is now UP, status = UNKNOWN
Feb 21 16:30:47 longevity-5000-tables-2023-1-db-node-d89f20f6-1 scylla[5704]:  [shard  0] raft_group_registry - marking Raft server a1018c3f-739c-4fea-8d2b-94fdaf448142 as alive for raft groups
Feb 21 16:30:47 longevity-5000-tables-2023-1-db-node-d89f20f6-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
Feb 21 16:30:47 longevity-5000-tables-2023-1-db-node-d89f20f6-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
Feb 21 16:30:47 longevity-5000-tables-2023-1-db-node-d89f20f6-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
Feb 21 16:30:47 longevity-5000-tables-2023-1-db-node-d89f20f6-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
Feb 21 16:30:48 longevity-5000-tables-2023-1-db-node-d89f20f6-1 scylla[5704]:  [shard  0] mutation_partition - Memory usage of unpaged query exceeds soft limit of 1048576 (configured via max_memory_for_unlimited_query_soft_limit)
Feb 21 16:30:48 longevity-5000-tables-2023-1-db-node-d89f20f6-1 scylla[5704]:  [shard  0] raft - [0478b1fd-32de-4906-8071-bec9c81b1fe2] Transferring snapshot to a1018c3f-739c-4fea-8d2b-94fdaf448142 failed with: seastar::rpc::remote_verb_error (connection is closed)

In the meantime, the added node shows:

2023-02-21T16:30:46+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard  0] storage_service - Starting up server gossip
2023-02-21T16:30:46+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard 10] compaction - [Compact system.local 180a6b30-b205-11ed-b3bb-03052efc5e67] Compacting [/var/lib/scylla/data/system/local-7ad54392bcdd35a684174e047860b377/mc-40-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/system/local-7ad54392bcdd35a684174e047860b377/mc-70-big-Data.db:level=0:origin=memtable]
2023-02-21T16:30:46+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard  0] gossip - failure_detector_loop: Started main loop
2023-02-21T16:30:46+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard  0] raft_group0 - setup_group0: joining group 0...
2023-02-21T16:30:46+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard  0] raft_group0 - server a1018c3f-739c-4fea-8d2b-94fdaf448142 found no local group 0. Discovering...
2023-02-21T16:30:46+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard  0] raft_group0 - server a1018c3f-739c-4fea-8d2b-94fdaf448142 found group 0 with group id a00c7430-b1ae-11ed-bbd9-7bf2770c6d3b, leader 0478b1fd-32de-4906-8071-bec9c81b1fe2
2023-02-21T16:30:46+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard  0] raft_group0 - Server a1018c3f-739c-4fea-8d2b-94fdaf448142 is starting group 0 with id a00c7430-b1ae-11ed-bbd9-7bf2770c6d3b
2023-02-21T16:30:46+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard 10] compaction - [Compact system.local 180a6b30-b205-11ed-b3bb-03052efc5e67] Compacted 2 sstables to [/var/lib/scylla/data/system/local-7ad54392bcdd35a684174e047860b377/mc-100-big-Data.db:level=0]. 28kB to 15kB (~53% of original) in 8ms = 1MB/s. ~256 total partitions merged to 1.
2023-02-21T16:30:46+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard  0] raft_group0 - server a1018c3f-739c-4fea-8d2b-94fdaf448142 joined group 0 with group id a00c7430-b1ae-11ed-bbd9-7bf2770c6d3b
2023-02-21T16:30:46+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard  0] raft_group0 - setup_group0: successfully joined group 0.
2023-02-21T16:30:46+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard  0] raft_group0 - setup_group0: ensuring that the cluster has fully upgraded to use Raft...
2023-02-21T16:30:48+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard  0] storage_service - Set host_id=0478b1fd-32de-4906-8071-bec9c81b1fe2 to be owned by node=10.4.0.247
2023-02-21T16:30:48+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard  0] gossip - InetAddress 10.4.0.247 is now UP, status = NORMAL
2023-02-21T16:30:48+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard  0] migration_manager - Requesting schema pull from 10.4.0.247:0
2023-02-21T16:30:48+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard  0] migration_manager - Pulling schema from 10.4.0.247:0
2023-02-21T16:30:48+00:00 longevity-5000-tables-2023-1-db-node-d89f20f6-2     !INFO | scylla[5812]:  [shard  0] migration_manager - Requesting schema pull from 10.4.0.247:0

Impact

Unknown; later we can see tables being populated and the cluster works.
But this causes SCT to raise an error and fail the test.
If seastar::rpc::remote_verb_error (connection is closed) is caught and Raft retries, maybe we should hide the message or rephrase it so it doesn't scare the user.

How frequently does it reproduce?

It happened for all added nodes (5) in this test.

Installation details

Kernel Version: 5.15.0-1028-aws
Scylla version (or git commit hash): 2023.1.0~rc1-20230208.fe3cc281ec73 with build-id ff20df9822b5b6397724a6ff6caadde419b383e6

Cluster size: 1 node (i3.8xlarge)

Scylla Nodes used in this run:

  • longevity-5000-tables-2023-1-db-node-d89f20f6-6 (34.252.200.114 | 10.4.2.130) (shards: 30)
  • longevity-5000-tables-2023-1-db-node-d89f20f6-5 (54.246.45.102 | 10.4.0.228) (shards: 30)
  • longevity-5000-tables-2023-1-db-node-d89f20f6-4 (34.243.98.114 | 10.4.0.238) (shards: 30)
  • longevity-5000-tables-2023-1-db-node-d89f20f6-3 (34.244.54.19 | 10.4.1.23) (shards: 30)
  • longevity-5000-tables-2023-1-db-node-d89f20f6-2 (54.195.136.190 | 10.4.2.158) (shards: 30)
  • longevity-5000-tables-2023-1-db-node-d89f20f6-1 (52.50.126.236 | 10.4.0.247) (shards: 30)

OS / Image: ami-056165f482cc8e0d8 (aws: eu-west-1)

Test: scale-5000-tables-test
Test id: d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda
Test name: enterprise-2023.1/scale/scale-5000-tables-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda

Logs:

Jenkins job URL

mykaul (Contributor) commented Feb 23, 2023

@bhalevy - I think you had some pending work to elaborate more on what the failure might be in such situations?

bhalevy (Member) commented Feb 23, 2023

There's #12799, but it's for the current node operations, unrelated to Raft.

kostja self-assigned this Feb 28, 2023
kostja (Contributor) commented Feb 28, 2023

@gleb-cloudius the Raft snapshot transfer hits the unpaged query memory limit. Shall we increase the limit, or is it time to switch to a paged transfer?

gleb-cloudius (Contributor):

> @gleb-cloudius the Raft snapshot transfer hits the unpaged query memory limit. Shall we increase the limit, or is it time to switch to a paged transfer?

We use the existing schema_pull RPC call, which is called without Raft as well. Why don't we see this without Raft? Something is fishy here. As for the limit, it should not be enforced for internal queries.

gleb-cloudius (Contributor):

BTW, this is a soft limit anyway; it should not cause any exception to be thrown. I am not sure the connection dropping is related. It may be the same issue that caused the RBNO tests to fail: the gossiper aborting existing connections when a node becomes NORMAL.

DoronArazii added this to the 5.x milestone Mar 5, 2023
roydahan modified the milestones: 5.x, 5.2 Mar 9, 2023
DoronArazii:

@kbr-scylla / @gleb-cloudius is this blocking 5.2 for you?

gleb-cloudius (Contributor):

> @kbr-scylla / @gleb-cloudius is this blocking 5.2 for you?

No.

mykaul modified the milestones: 5.2, 5.3 Mar 13, 2023
gleb-cloudius (Contributor):

> It may be the same issue that caused the RBNO tests to fail: the gossiper aborting existing connections when a node becomes NORMAL.

I confirmed that this is indeed the case.

kostja (Contributor) commented May 2, 2023

@soyacz for now, please mask the error in the log. I have no idea yet how to fix all of the issues that dropping RPC connections during boot generates.

kostja removed the area/raft label May 3, 2023
kostja modified the milestones: 5.3, 5.x May 3, 2023
kostja (Contributor) commented May 3, 2023

@mykaul please create an 'rpc' label or similar. Since @xemul's 7bdad47 we have gotten a lot of flaky tests, because connections can be dropped in the middle of an operation. In this case the test itself is not flaky, but there is an unwanted error message in the log. I don't know how to fix it elegantly yet. Generally, I think dropping messaging service connections should be transparent to clients; currently it isn't.
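To illustrate, purely hypothetically, what such transparency could look like, here is a small, compilable C++ sketch in which an RPC wrapper retries on closed connections so the caller never observes the drop. None of these names come from ScyllaDB's messaging_service; they are stand-ins.

```cpp
// Purely illustrative sketch (not ScyllaDB's messaging_service API) of making
// connection drops invisible to callers: the wrapper retries when the failure
// is a closed-connection error.
#include <cstdio>
#include <stdexcept>
#include <string>

struct connection_closed_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Stand-in for sending an RPC verb; the first attempt simulates the connection
// being dropped while topology information is being (re)applied.
static std::string send_verb(const std::string& verb, int attempt) {
    if (attempt == 0) {
        throw connection_closed_error("connection is closed");
    }
    return verb + ": ok";
}

// Hypothetical wrapper: retry on closed connections so callers never see them.
static std::string send_verb_transparently(const std::string& verb, int max_attempts = 3) {
    for (int attempt = 0; ; ++attempt) {
        try {
            return send_verb(verb, attempt);
        } catch (const connection_closed_error&) {
            if (attempt + 1 >= max_attempts) {
                throw; // give up only after several reconnect attempts
            }
            // A real implementation would re-establish the connection here,
            // with the correct encryption/compression settings.
        }
    }
}

int main() {
    std::printf("%s\n", send_verb_transparently("raft_send_snapshot").c_str());
}
```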

mykaul added the area/rpc (RPC related issue) label May 3, 2023
soyacz added a commit to soyacz/scylla-cluster-tests that referenced this issue May 5, 2023
Due to scylladb/scylladb#12972, the SCT test fails.
But this issue is not severe and should not cause the test to fail.

Reduce the severity of `raft - transfer snapshot error` to `WARNING`, per the
developers' request, until the issue is fixed.
fruch pushed a commit to scylladb/scylla-cluster-tests that referenced this issue May 7, 2023
Due to scylladb/scylladb#12972, the SCT test fails.
But this issue is not severe and should not cause the test to fail.

Reduce the severity of `raft - transfer snapshot error` to `WARNING`, per the
developers' request, until the issue is fixed.

(cherry picked from commit 9667ee1)
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jun 21, 2023
`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 5363616, then in
79ee381.

Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).

Also ensure that `wait_for_normal_state_handled_on_boot()` doesn't hang
in raft-topology mode by adding
`_normal_state_handled_on_boot.insert(endpoint);` in the raft-topology
branch of `handle_state_normal`.

Fixes: scylladb#12972
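To make the ordering change easier to follow, here is a minimal, compilable sketch of the boot sequence the commit message above describes. Only the three function names mentioned in the message (`wait_for_live_node_to_show_up`, `wait_for_normal_state_handled_on_boot`, `setup_group0`) come from it; everything else is a stub, and the real code is asynchronous Seastar code in storage_service.

```cpp
// Schematic sketch of the fixed boot ordering; all bodies are stubs and the
// real implementation is asynchronous. Only the ordering reflects the commit.
#include <iostream>

static void start_gossiping()                       { std::cout << "gossiping enabled\n"; }
static void wait_for_live_node_to_show_up()         { std::cout << "gossiper sees an alive node\n"; }
static void wait_for_normal_state_handled_on_boot() { std::cout << "NORMAL state handlers done (any connection drops happened here)\n"; }
static void setup_group0()                          { std::cout << "group 0 setup; snapshot transfer runs on stable connections\n"; }

int main() {
    start_gossiping();
    // The fix: both waits run *before* setup_group0(), so handle_state_normal
    // has finished dropping/re-opening connections before the group 0
    // snapshot transfer starts.
    wait_for_live_node_to_show_up();
    wait_for_normal_state_handled_on_boot();
    setup_group0();
}
```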
kbr-scylla added further commits to kbr-scylla/scylladb referencing this issue on Jun 21, 22, 27, and 28, 2023, all with the same commit message.
tgrabiec added a commit that referenced this issue Jun 28, 2023
…ter enabling gossiping' from Kamil Braun

`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 5363616, then in
79ee381.

Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).

There is another problem: the bootstrap procedure is racing with gossiper
marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee
that they are also UP. If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to nodes
still being marked as DOWN, or the CDC generation write failing.

In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we also use it to wait for nodes to be UP.

As explained in commit messages and comments, we only do these
waits outside raft-based-topology mode.

This should improve CI stability.
Fixes: #12972
Refs: #14042

Closes #14354

* github.com:scylladb/scylladb:
  messaging_service: print which connections are dropped due to missing topology info
  storage_service: wait for nodes to be UP on bootstrap
  storage_service: wait for NORMAL state handler before `setup_group0()`
  storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`
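Alongside the ordering sketch above, here is an equally schematic, compilable illustration of the additional step this merge describes: waiting until every node in the `sync_nodes` set (NORMAL nodes minus a replaced node) is marked UP before bootstrap proceeds. The helpers and the polling loop are assumptions, not ScyllaDB's gossiper API.

```cpp
// Stubbed sketch of "wait until all sync_nodes are UP"; sync_nodes is the set
// of NORMAL nodes excluding a node being replaced, mirroring the set used for
// wait_for_normal_state_handled_on_boot. Helpers here are illustrative only.
#include <algorithm>
#include <chrono>
#include <set>
#include <string>
#include <thread>

using node_id = std::string;

// Illustrative stand-in: in ScyllaDB this would ask the gossiper whether the
// endpoint is currently marked UP.
static bool is_alive(const node_id&) { return true; }

static void wait_for_sync_nodes_up(const std::set<node_id>& sync_nodes) {
    while (!std::all_of(sync_nodes.begin(), sync_nodes.end(), is_alive)) {
        // Poll until every remaining node is UP; the real code waits on
        // gossiper notifications instead of polling.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}

int main() {
    std::set<node_id> sync_nodes = {"node-1", "node-2", "replaced-node"};
    sync_nodes.erase("replaced-node"); // replace operations ignore the replaced node
    wait_for_sync_nodes_up(sync_nodes);
}
```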
DoronArazii modified the milestones: 5.x, 5.4 Jun 29, 2023
DoronArazii added the backport/5.2 and Requires-Backport-to-5.3 labels Jun 29, 2023
avikivity (Member):

@kbr-scylla please evaluate for backport

kbr-scylla (Contributor):

The snapshot transfer error is harmless (we automatically recover). The 5.2 tests are already ignoring it (scylladb/scylla-cluster-tests@4cdffb4). So no need to backport.

kbr-scylla removed the Backport candidate and backport/5.2 labels Nov 2, 2023
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Feb 2, 2024
When a node joins the cluster, it closes connections after learning
topology information from other nodes, in order to reopen them with
correct encryption, compression etc.

In ScyllaDB 5.2, this mechanism may interrupt an ongoing Raft snapshot
transfer. This was fixed in later versions by putting some order into
the bootstrap process with 50e8ec7 but
the fix was not backported due to many prerequisites and complexity.

Raft automatically recovers from an interrupted snapshot transfer by
eventually retrying it, and everything works. However, an ERROR is
reported due to that one failed snapshot transfer, and dtests don't like
ERRORs -- they report the test case as failed if an ERROR appears in
any node's logs, even if the test passed otherwise.

Here we apply a simple workaround to please dtests -- in this particular
scenario, turn the ERROR into a WARN.
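A rough, compilable illustration of the workaround idea (the exception type, logging call, and function names are made up; the real change lives in ScyllaDB's Raft server code): when the snapshot transfer fails with a transport-level "connection is closed" error that Raft will retry anyway, report it at WARN rather than ERROR.

```cpp
// Illustrative only: demote a retryable snapshot-transfer failure from ERROR
// to WARN so log-scanning test harnesses do not fail the run. Names here are
// stand-ins, not ScyllaDB's raft::server internals.
#include <cstdio>
#include <stdexcept>
#include <string>

struct remote_verb_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Stand-in for the RPC that streams the snapshot; in 5.2 the joining node may
// drop this connection while reopening it with the right encryption settings.
static void send_snapshot(const std::string& /*target*/) {
    throw remote_verb_error("connection is closed");
}

static void transfer_snapshot(const std::string& target) {
    try {
        send_snapshot(target);
    } catch (const remote_verb_error& e) {
        // Raft retries the transfer later, so the failure is noisy but not
        // fatal: log it at WARN instead of ERROR.
        std::fprintf(stderr, "WARN: transferring snapshot to %s failed with: %s (will be retried)\n",
                     target.c_str(), e.what());
    }
}

int main() {
    transfer_snapshot("a1018c3f-739c-4fea-8d2b-94fdaf448142");
}
```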
denesb added a commit that referenced this issue Feb 7, 2024
Backports required to fix #16683 in 5.2:
- when creating first group 0 server, create a snapshot with non-empty ID, and start it at index 1 instead of 0 to force snapshot transfer to servers that join group 0
- add an API to trigger Raft snapshot
- use the API when we restart and see that the existing snapshot is at index 0, to trigger a new one --- in order to fix broken deployments that already bootstrapped with index-0 snapshot.

Closes #17087

* github.com:scylladb/scylladb:
  test_raft_snapshot_request: fix flakiness (again)
  test_raft_snapshot_request: fix flakiness
  Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun
  Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun
  raft: server: add workaround for #12972
  raft: Store snapshot update and truncate log atomically
  service: raft: force initial snapshot transfer in new cluster
  raft_sys_table_storage: give initial snapshot a non zero value