
gossiper: a new node doesn't get properly notified about another node's restart #14042

Closed
gusev-p opened this issue May 26, 2023 · 21 comments
Assignees
Labels
area/gossip, P1 Urgent, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework)
Milestone

Comments

@gusev-p

gusev-p commented May 26, 2023

Three nodes in a cluster; when trying to add a new node after one of the other nodes has been restarted, we get:

ERROR 2023-05-24 14:01:41,480 [shard 0] init - Startup failed: exceptions::unavailable_exception (Cannot achieve consistency level for cl ALL. Requires 3, alive 2)

Initially reproduced on this CI job.

Revision: 98cc500cbd0aadde2996c784e11e41abcf0b4c23 
Repository: git@github.com:scylladb/scylla.git
origin/13884

The failed test is Tests / Unit Tests / non-boost tests.topology.test_topology_schema.release.1, failing on the line await manager.server_add().

Three existing nodes: (121, 127.206.68.28), (132, 127.206.68.11), (127, 127.206.68.26) - the last one (127) has been restarted. The new node is (149, 127.206.68.56).

scylla-121.log
scylla-121.yaml.txt

scylla-132.log
scylla-132.yaml.txt

scylla-127.log
scylla-127.yaml.txt

scylla-149.log
scylla-149.yaml.txt

14:01:36,460
    node 121 marked 127 as shutdown

    node 132 marked 127 as shutdown

14:01:38,902
    node 127 marked itself as normal

14:01:39,479
    node 149 triggered one second timer for gossiper::run

14:01:39,904
    node 121 marked 127 as up

~14:01:40,479
    node 149's timer triggers gossiper::run, which should have contacted all three seed nodes 121, 132, 127

    so at this point only 132 thinks 127 is shutdown, both 121 and 127 itself consider it normal, yet 149 for some reason
    decides that it's shutdown. I don't know why.

14:01:40,547
    node 132 marked 127 as up

14:01:40,547
    node 149 marked 127 as shutdown

gossiper::run at 14:01:40,479 should have contacted all three seed nodes here, since `_live_endpoints` is empty - we cleared it in `check_for_endpoint_collision`.

It looks like there's some problem with heart_beat_version - either the versions are not compared correctly or they are not (de)serialized properly.
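
For context, this is the invariant that seems to be violated - a conceptual Python sketch of gossip state reconciliation, not Scylla's actual C++ gossiper code (the names and structure are illustrative only): a restarted node bumps its generation, so its NORMAL state should always win the merge over a stale shutdown state, provided versions are compared and (de)serialized correctly.

```python
# Conceptual sketch only; names are illustrative, not Scylla's actual code.
from dataclasses import dataclass

@dataclass
class EndpointState:
    generation: int   # bumped on every restart of the node
    max_version: int  # highest heart_beat / application state version
    status: str       # e.g. "shutdown" or "NORMAL"

def newer(a: EndpointState, b: EndpointState) -> EndpointState:
    """Return the state that should win a gossip merge."""
    if a.generation != b.generation:
        return a if a.generation > b.generation else b
    return a if a.max_version >= b.max_version else b

# Node 127 restarted, so its own state has a newer generation and a NORMAL
# status; any seed that already merged it should propagate NORMAL to node 149.
stale = EndpointState(generation=1, max_version=50, status="shutdown")
fresh = EndpointState(generation=2, max_version=10, status="NORMAL")
assert newer(stale, fresh).status == "NORMAL"
```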

@kbr-scylla
Contributor

Idea for reproducer:
boot 3 nodes, restart one of them, immediately boot a 4th node

Add a check into the gossiper code: immediately after the first gossip round -- which should contact all of the seed nodes -- verify that the gossiper considers all the nodes as NORMAL.

The CL=ALL write is racing with the gossiper rounds; if the second round fixes the problem, it will mask it, making the bug harder to reproduce (in runs where the write happens after the second round, the test would pass).
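
A rough sketch of that reproducer as a test.py topology test, assuming the test.pylib manager API that appears later in this issue (`running_servers`, `server_restart`, `server_add`); the gossiper-side assertion described above would still have to be added in the C++ code:

```python
import pytest


@pytest.mark.asyncio
async def test_add_node_right_after_restart(manager):
    """Boot 3 nodes, restart one of them, immediately boot a 4th node."""
    servers = await manager.running_servers()           # the 3 pre-booted nodes
    await manager.server_restart(servers[1].server_id)  # restart one node
    # No settling delay on purpose: the new node must boot while some peers
    # may still carry the 'shutdown' status for the restarted node.
    await manager.server_add()
```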

@mykaul mykaul added P1 Urgent symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework labels May 29, 2023
@DoronArazii DoronArazii added this to the 5.3 milestone Jun 6, 2023
@kostja
Contributor

kostja commented Jun 12, 2023

@gusev-p did you ever see this happen again?

@gusev-p
Author

gusev-p commented Jun 12, 2023

Nope. I tried to construct a concrete scenario for how this could happen, but got bogged down in all the gossiper details. There are too many things there that depend on timing, so I believe this can happen if some activities take longer than expected.

We can leave this issue assigned to me; if it comes up again, that will add motivation to investigate further.

@kostja kostja modified the milestones: 5.3, 5.x Jun 12, 2023
@kostja kostja added symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework and removed P1 Urgent symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework labels Jun 12, 2023
@tgrabiec
Contributor

The new node has this:

WARN  2023-06-21 14:01:32,757 [shard 0] gossip - Skip marking node 127.95.60.12 with status = shutdown as UP

The skip is due to the shutdown status. The cause is that the new node didn't wait for the gossip round (we run tests with skip_wait_for_gossip_to_settle=0), so it didn't have time to learn that the status is no longer shutdown.
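
In other words, roughly this check (an illustrative Python sketch, not the actual gossiper C++): a reachable node whose last known status is still a "dead" state such as shutdown is skipped and not marked UP.

```python
# Illustrative sketch only; the state set and helper name are hypothetical.
DEAD_STATES = {"shutdown", "removed", "left"}

def maybe_mark_alive(endpoint: str, status: str) -> bool:
    """Return True if the endpoint was marked UP, False if skipped."""
    if status in DEAD_STATES:
        print(f"Skip marking node {endpoint} with status = {status} as UP")
        return False
    # ...mark the endpoint UP and notify subscribers...
    return True
```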

@kbr-scylla
Contributor

so didn't have time to learn that the status is no longer shutdown.

The problem is that the node should have never received the shutdown status in the first place, because it performs a gossip round with every seed, and at least one of them already knows that the node is no longer shutdown but NORMAL.

@kbr-scylla
Contributor

The problem is that the node should have never received the shutdown status in the first place, because it performs a gossip round with every seed, and at least one of them already knows that the node is no longer shutdown but NORMAL.

Well, to be precise, it may see that it's shutdown after contacting one of the seeds, but there is at least one seed which should have the newest version, which is NORMAL (e.g. the node that restarted itself) - and IIUC it performs a round with all seeds.

@kbr-scylla
Contributor

@kostja
Contributor

kostja commented Jun 21, 2023

Could this also be influencing #14274? @kbr-scylla the endpoint state in gossip is shutdown; the booting node sends its new IP address; the update/notification is skipped since the node is not NORMAL.

@kbr-scylla
Contributor

the booting node sends its new IP address; the update/notification is skipped since the node is not NORMAL

The node sends its NORMAL status eventually on restart so the IP should be updated.

@kbr-scylla
Contributor

Happened again:
https://jenkins.scylladb.com/job/scylla-master/job/next/6170/
The frequency has definitely increased.

@DoronArazii DoronArazii added the P1 Urgent label Jun 22, 2023
@DoronArazii DoronArazii modified the milestones: 5.x, 5.4, 5.3 Jun 22, 2023
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jun 22, 2023
The bootstrap procedure is racing with gossiper marking nodes as UP.
If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to
nodes still being marked as DOWN, or the CDC generation write failing.

In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we use it.

Fixes: scylladb#14042
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jun 28, 2023
The bootstrap procedure is racing with gossiper marking nodes as UP.
If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to
nodes still being marked as DOWN, or the CDC generation write failing.

In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we use it.

Refs: scylladb#14042

This doesn't completely fix scylladb#14042 yet because it's specific to
gossiper-based topology mode only. For Raft-based topology, the node
joining procedure will be coordinated by the topology coordinator right
from the start and it will be the coordinator who issues the 'wait for
node to see other live nodes'.
tgrabiec added a commit that referenced this issue Jun 28, 2023
…ter enabling gossiping' from Kamil Braun

`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 5363616, then in
79ee381.

Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).

There is another problem: the bootstrap procedure is racing with gossiper
marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee
that they are also UP. If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to nodes
still being marked as DOWN, or the CDC generation write failing.

In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we also use it to wait for nodes to be UP.

As explained in commit messages and comments, we only do these
waits outside raft-based-topology mode.

This should improve CI stability.
Fixes: #12972
Refs: #14042

Closes #14354

* github.com:scylladb/scylladb:
  messaging_service: print which connections are dropped due to missing topology info
  storage_service: wait for nodes to be UP on bootstrap
  storage_service: wait for NORMAL state handler before `setup_group0()`
  storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`
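
For readers following along, the shape of the added wait is roughly this (a minimal Python sketch with hypothetical helper names, not the actual storage_service C++): bootstrap does not proceed until every node in `sync_nodes` has been handled as NORMAL and is seen as UP, with `sync_nodes` already excluding the replaced node in the replace case.

```python
import asyncio

async def wait_for_sync_nodes_up(gossiper, sync_nodes, poll_interval=0.1):
    """Block until every node in sync_nodes is NORMAL-handled and alive."""
    while not all(gossiper.is_normal(n) and gossiper.is_alive(n)
                  for n in sync_nodes):
        await asyncio.sleep(poll_interval)
```
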
@kbr-scylla
Contributor

Got closed due to this part of the commit message, I guess:

This doesn't completely fix #14042 yet

@mykaul
Contributor

mykaul commented Aug 16, 2023

@kbr-scylla - so what is the next step here?

@kbr-scylla
Contributor

Ok, we're not seeing this after 51cec2b. No point in keeping this open I think.

@Annamikhlin

It seems this also happened in 5.2/next:
https://jenkins.scylladb.com/job/scylla-5.2/job/next/289/testReport/junit/(root)/test_topology_schema/Unit_Tests___test_topology_schema_changes/

@kbr-scylla - is some backport missing?

link to the log: https://jenkins.scylladb.com/job/scylla-5.2/job/next/289/artifact/testlog/x86_64/release/topology.test_topology_schema.3.log

=================================== FAILURES ===================================
_________________________ test_topology_schema_changes _________________________

self = <test.pylib.manager_client.ManagerClient object at 0x7fd90bb7e110>
replace_cfg = None, cmdline = None, config = None, start = True

    async def server_add(self, replace_cfg: Optional[ReplaceConfig] = None, cmdline: Optional[List[str]] = None, config: Optional[dict[str, str]] = None, start: bool = True) -> ServerInfo:
        """Add a new server"""
        try:
            data: dict[str, Any] = {'start': start}
            if replace_cfg:
                data['replace_cfg'] = replace_cfg._asdict()
            if cmdline:
                data['cmdline'] = cmdline
            if config:
                data['config'] = config
>           server_info = await self.client.put_json("/cluster/addserver", data, response_type="json",
                                                     timeout=ScyllaServer.TOPOLOGY_TIMEOUT)

test/pylib/manager_client.py:169: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test.pylib.rest_client.UnixRESTClient object at 0x7fd90bb7df90>
resource_uri = '/cluster/addserver', data = {'start': True}, host = None
port = None, params = None, response_type = 'json', timeout = 1000

    async def put_json(self, resource_uri: str, data: Mapping, host: Optional[str] = None,
                       port: Optional[int] = None, params: Optional[dict[str, str]] = None,
                       response_type: Optional[str] = None, timeout: Optional[float] = None) -> Any:
>       ret = await self._fetch("PUT", resource_uri, response_type = response_type, host = host,
                                port = port, params = params, json = data, timeout = timeout)

test/pylib/rest_client.py:104: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test.pylib.rest_client.UnixRESTClient object at 0x7fd90bb7df90>
method = 'PUT', resource = '/cluster/addserver', response_type = 'json'
host = None, port = None, params = None, json = {'start': True}, timeout = 1000

    async def _fetch(self, method: str, resource: str, response_type: Optional[str] = None,
                     host: Optional[str] = None, port: Optional[int] = None,
                     params: Optional[Mapping[str, str]] = None,
                     json: Optional[Mapping] = None, timeout: Optional[float] = None) -> Any:
        # Can raise exception. See https://docs.aiohttp.org/en/latest/web_exceptions.html
        assert method in ["GET", "POST", "PUT", "DELETE"], f"Invalid HTTP request method {method}"
        assert response_type is None or response_type in ["text", "json"], \
                f"Invalid response type requested {response_type} (expected 'text' or 'json')"
        # Build the URI
        port = port if port else self.default_port if hasattr(self, "default_port") else None
        port_str = f":{port}" if port else ""
        assert host is not None or hasattr(self, "default_host"), "_fetch: missing host for " \
                "{method} {resource}"
        host_str = host if host is not None else self.default_host
        uri = self.uri_scheme + "://" + host_str + port_str + resource
        logging.debug(f"RESTClient fetching {method} {uri}")
    
        client_timeout = ClientTimeout(total = timeout if timeout is not None else 300)
        async with request(method, uri,
                           connector = self.connector if hasattr(self, "connector") else None,
                           params = params, json = json, timeout = client_timeout) as resp:
            if resp.status != 200:
                text = await resp.text()
>               raise HTTPError(uri, resp.status, params, json, text)
E               test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/addserver, params: None, json: {'start': True}, body:
E               failed to start the node, server_id 217, IP 127.192.220.18, workdir scylla-217, host_id 6d82e977-9994-4f15-b8b8-aa91c99b7f33, cql [not connected]
E               Check the log files:
E               /jenkins/workspace/scylla-5.2/next/scylla/testlog/x86_64/test.py.debug-release-dev.log
E               /jenkins/workspace/scylla-5.2/next/scylla/testlog/x86_64/release/scylla-217.log

test/pylib/rest_client.py:69: HTTPError

The above exception was the direct cause of the following exception:

manager = <test.pylib.manager_client.ManagerClient object at 0x7fd90bb7e110>
random_tables = <test.pylib.random_tables.RandomTables object at 0x7fd90bb7fdd0>

    @pytest.mark.asyncio
    async def test_topology_schema_changes(manager, random_tables):
        """Test schema consistency with restart, add, and sudden stop of servers"""
        table = await random_tables.add_table(ncolumns=5)
        servers = await manager.running_servers()
    
        # Test add column after server restart
        await manager.server_restart(servers[1].server_id)
        await table.add_column()
        await random_tables.verify_schema()
    
        # Test add column after adding a server
>       await manager.server_add()

test/topology/test_topology_schema.py:26: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test.pylib.manager_client.ManagerClient object at 0x7fd90bb7e110>
replace_cfg = None, cmdline = None, config = None, start = True

    async def server_add(self, replace_cfg: Optional[ReplaceConfig] = None, cmdline: Optional[List[str]] = None, config: Optional[dict[str, str]] = None, start: bool = True) -> ServerInfo:
        """Add a new server"""
        try:
            data: dict[str, Any] = {'start': start}
            if replace_cfg:
                data['replace_cfg'] = replace_cfg._asdict()
            if cmdline:
                data['cmdline'] = cmdline
            if config:
                data['config'] = config
            server_info = await self.client.put_json("/cluster/addserver", data, response_type="json",
                                                     timeout=ScyllaServer.TOPOLOGY_TIMEOUT)
        except Exception as exc:
>           raise Exception("Failed to add server") from exc
E           Exception: Failed to add server

test/pylib/manager_client.py:172: Exception

@tgrabiec
Contributor

tgrabiec commented Nov 3, 2023

@kbr-scylla
Contributor

The backport is nontrivial and there are nontrivial prerequisites that would also have to be backported. Since this is a test-only issue (it doesn't happen in production, where all the sleeps are in place), the backport is not strictly necessary. Still, we should somehow deal with the test flakiness...

The problem is that this can happen any time we start a cluster in test.py. There is no single test to disable to ignore it.

I don't know how to deal with it :/

@avikivity should I prepare a large backport just to unflake the tests?
I'd have to backport at least:
5363616
#13317
#14354
maybe more. Those are large and potentially dangerous backports.

Another way would be to just ignore the flakiness...

@kbr-scylla
Contributor

@Annamikhlin this was caused by a different issue
https://jenkins.scylladb.com/job/enterprise-2024.1/job/build/1/artifact/testlog/x86_64/debug/scylla-429.log

INFO  2023-11-16 06:00:06,104 [shard 0:main] raft_group0 - gaining leadership
ERROR 2023-11-16 06:00:06,106 [shard 0:stre] storage_service - raft topology: failed to fence previous coordinator raft::request_aborted (Request is aborted by a caller)
INFO  2023-11-16 06:00:06,108 [shard 0:stre] storage_service - raft_state_monitor_fiber aborted with seastar::sleep_aborted (Sleep is aborted)

This is #15747 - indeed a missing backport. I'll prepare it.

@benipeled
Contributor

benipeled commented Dec 10, 2023

@mykaul
Contributor

mykaul commented Dec 10, 2023
