
gossiper: a new node doesn't get properly notified about another node's restart #14042

Closed
gusev-p opened this issue May 26, 2023 · 21 comments
Assignees
Labels
area/gossip, P1 Urgent, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework)
Milestone

Comments

@gusev-p

gusev-p commented May 26, 2023

Three nodes in a cluster; when trying to add a new node after one of the other nodes has been restarted, we get:

ERROR 2023-05-24 14:01:41,480 [shard 0] init - Startup failed: exceptions::unavailable_exception (Cannot achieve consistency level for cl ALL. Requires 3, alive 2)

Initially reproduced on this CI job.

Revision: 98cc500cbd0aadde2996c784e11e41abcf0b4c23 
Repository: git@github.com:scylladb/scylla.git
origin/13884

The failed test is Tests / Unit Tests / non-boost tests.topology.test_topology_schema.release.1, failing on the line await manager.server_add().

Three existing nodes: (121, 127.206.68.28), (132, 127.206.68.11), (127, 127.206.68.26) - the last one (127) has been restarted. The new node is (149, 127.206.68.56).

scylla-121.log
scylla-121.yaml.txt

scylla-132.log
scylla-132.yaml.txt

scylla-127.log
scylla-127.yaml.txt

scylla-149.log
scylla-149.yaml.txt

14:01:36,460
    node 121 marked 127 as shutdown

    node 132 marked 127 as shutdown

14:01:38,902
    node 127 marked itself as normal

14:01:39,479
    node 149 triggered one second timer for gossiper::run

14:01:39,904
    node 121 marked 127 as up

~14:01:40,479
    node 149's timer triggers gossiper::run, which should have contacted all three seed nodes 121, 132, 127

    so at this point only 132 thinks 127 is shutdown, both 121 and 127 itself consider it normal, yet 149 for some reason
    decides that it's shutdown. I don't know why.

14:01:40,547
    node 132 marked 127 as up

14:01:40,547
    node 149 marked 127 as shutdown

gossiper::run at 14:01:40,479 should have contacted all three seed nodes here, since `_live_endpoints` is empty - we cleared it in `check_for_endpoint_collision`.

It looks like there's some problem with heart_beat_version - either the versions are not compared correctly or they are not (de)serialized properly.
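
For context, this is the invariant that seems to be violated - a conceptual Python sketch of gossip state reconciliation, not Scylla's actual C++ gossiper code (the names and structure are illustrative only): a restarted node bumps its generation, so its NORMAL state should always win the merge over a stale shutdown state, provided versions are compared and (de)serialized correctly.

```python
# Conceptual sketch only; names are illustrative, not Scylla's actual code.
from dataclasses import dataclass

@dataclass
class EndpointState:
    generation: int   # bumped on every restart of the node
    max_version: int  # highest heart_beat / application state version
    status: str       # e.g. "shutdown" or "NORMAL"

def newer(a: EndpointState, b: EndpointState) -> EndpointState:
    """Return the state that should win a gossip merge."""
    if a.generation != b.generation:
        return a if a.generation > b.generation else b
    return a if a.max_version >= b.max_version else b

# Node 127 restarted, so its own state has a newer generation and a NORMAL
# status; any seed that already merged it should propagate NORMAL to node 149.
stale = EndpointState(generation=1, max_version=50, status="shutdown")
fresh = EndpointState(generation=2, max_version=10, status="NORMAL")
assert newer(stale, fresh).status == "NORMAL"
```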

@kbr-scylla
Contributor

Idea for reproducer:
boot 3 nodes, restart one of them, immediately boot a 4th node

Add a check into the gossiper code: immediately after the first gossip round -- which should contact all of the seed nodes -- verify that the gossiper considers all the nodes as NORMAL.

The CL=ALL write is racing with the gossiper rounds; if the second round fixes the problem, it will mask it, making the bug harder to reproduce (in runs where the write happens after the second round, the test would pass).
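
A rough sketch of that reproducer as a test.py topology test, assuming the test.pylib manager API that appears later in this issue (`running_servers`, `server_restart`, `server_add`); the gossiper-side assertion described above would still have to be added in the C++ code:

```python
import pytest


@pytest.mark.asyncio
async def test_add_node_right_after_restart(manager):
    """Boot 3 nodes, restart one of them, immediately boot a 4th node."""
    servers = await manager.running_servers()           # the 3 pre-booted nodes
    await manager.server_restart(servers[1].server_id)  # restart one node
    # No settling delay on purpose: the new node must boot while some peers
    # may still carry the 'shutdown' status for the restarted node.
    await manager.server_add()
```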

@mykaul mykaul added P1 Urgent symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework labels May 29, 2023
@DoronArazii DoronArazii added this to the 5.3 milestone Jun 6, 2023
@kostja
Contributor

kostja commented Jun 12, 2023

@gusev-p did you ever see this happen again?

@gusev-p
Author

gusev-p commented Jun 12, 2023

Nope. I tried to construct a concrete scenario for how this could happen, but got bogged down in all the gossiper details. There are too many things there that depend on timing, so I believe this can happen if some activities take longer than expected.

We can leave this issue assigned to me; if it comes up again, that will add motivation to investigate further.

@kostja kostja modified the milestones: 5.3, 5.x Jun 12, 2023
@kostja kostja added symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework and removed P1 Urgent symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework labels Jun 12, 2023
@tgrabiec
Contributor

The new node has this:

WARN  2023-06-21 14:01:32,757 [shard 0] gossip - Skip marking node 127.95.60.12 with status = shutdown as UP

The skip is due to the shutdown status. The cause is that the new node didn't wait for the gossip round (we run tests with skip_wait_for_gossip_to_settle=0), so it didn't have time to learn that the status is no longer shutdown.
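
In other words, roughly this check (an illustrative Python sketch, not the actual gossiper C++): a reachable node whose last known status is still a "dead" state such as shutdown is skipped and not marked UP.

```python
# Illustrative sketch only; the state set and helper name are hypothetical.
DEAD_STATES = {"shutdown", "removed", "left"}

def maybe_mark_alive(endpoint: str, status: str) -> bool:
    """Return True if the endpoint was marked UP, False if skipped."""
    if status in DEAD_STATES:
        print(f"Skip marking node {endpoint} with status = {status} as UP")
        return False
    # ...mark the endpoint UP and notify subscribers...
    return True
```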

@kbr-scylla
Contributor

so didn't have time to learn that the status is no longer shutdown.

The problem is that the node should have never received the shutdown status in the first place, because it performs a gossip round with every seed, and at least one of them already knows that the node is no longer shutdown but NORMAL.

@kbr-scylla
Contributor

The problem is that the node should have never received the shutdown status in the first place, because it performs a gossip round with every seed, and at least one of them already knows that the node is no longer shutdown but NORMAL.

Well, to be precise, it may see that it's shutdown after contacting one of the seeds, but there is at least one seed which should have the newest version, which is NORMAL (e.g. the node that restarted itself) - and IIUC it performs a round with all seeds.

@kbr-scylla
Contributor

@kostja
Contributor

kostja commented Jun 21, 2023

Could this also be influencing #14274? @kbr-scylla the endpoint state in gossip is shutdown; the booting node sends its new IP address; the update/notification is skipped since the node is not NORMAL.

@kbr-scylla
Contributor

the booting node sends its new IP address; the update/notification is skipped since the node is not NORMAL

The node sends its NORMAL status eventually on restart so the IP should be updated.

@kbr-scylla
Contributor

Happened again:
https://jenkins.scylladb.com/job/scylla-master/job/next/6170/
The frequency has definitely increased.

@DoronArazii DoronArazii added the P1 Urgent label Jun 22, 2023
@DoronArazii DoronArazii modified the milestones: 5.x, 5.4, 5.3 Jun 22, 2023
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jun 22, 2023
The bootstrap procedure is racing with gossiper marking nodes as UP.
If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to
nodes still being marked as DOWN, or the CDC generation write failing.

In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we use it.

Fixes: scylladb#14042
kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Jun 28, 2023
The bootstrap procedure is racing with gossiper marking nodes as UP.
If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to
nodes still being marked as DOWN, or the CDC generation write failing.

In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we use it.

Refs: scylladb#14042

This doesn't completely fix scylladb#14042 yet because it's specific to
gossiper-based topology mode only. For Raft-based topology, the node
joining procedure will be coordinated by the topology coordinator right
from the start and it will be the coordinator who issues the 'wait for
node to see other live nodes'.
tgrabiec added a commit that referenced this issue Jun 28, 2023
…ter enabling gossiping' from Kamil Braun

`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 5363616, then in
79ee381.

Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).

There is another problem: the bootstrap procedure is racing with gossiper
marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee
that they are also UP. If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to nodes
still being marked as DOWN, or the CDC generation write failing.

In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we also use it to wait for nodes to be UP.

As explained in commit messages and comments, we only do these
waits outside raft-based-topology mode.

This should improve CI stability.
Fixes: #12972
Refs: #14042

Closes #14354

* github.com:scylladb/scylladb:
  messaging_service: print which connections are dropped due to missing topology info
  storage_service: wait for nodes to be UP on bootstrap
  storage_service: wait for NORMAL state handler before `setup_group0()`
  storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`
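
For readers following along, the shape of the added wait is roughly this (a minimal Python sketch with hypothetical helper names, not the actual storage_service C++): bootstrap does not proceed until every node in `sync_nodes` has been handled as NORMAL and is seen as UP, with `sync_nodes` already excluding the replaced node in the replace case.

```python
import asyncio

async def wait_for_sync_nodes_up(gossiper, sync_nodes, poll_interval=0.1):
    """Block until every node in sync_nodes is NORMAL-handled and alive."""
    while not all(gossiper.is_normal(n) and gossiper.is_alive(n)
                  for n in sync_nodes):
        await asyncio.sleep(poll_interval)
```
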
@kbr-scylla
Contributor

Got closed due to this part of the commit message, I guess:

This doesn't completely fix #14042 yet

@mykaul
Contributor

mykaul commented Aug 16, 2023

@kbr-scylla - so what is the next step here?

@kbr-scylla
Contributor

Ok, we're not seeing this after 51cec2b. No point in keeping this open I think.

@Annamikhlin

It seems this also happened in 5.2/next:
https://jenkins.scylladb.com/job/scylla-5.2/job/next/289/testReport/junit/(root)/test_topology_schema/Unit_Tests___test_topology_schema_changes/

@kbr-scylla - is some backport missing?

link to the log: https://jenkins.scylladb.com/job/scylla-5.2/job/next/289/artifact/testlog/x86_64/release/topology.test_topology_schema.3.log

=================================== FAILURES ===================================
_________________________ test_topology_schema_changes _________________________

self = <test.pylib.manager_client.ManagerClient object at 0x7fd90bb7e110>
replace_cfg = None, cmdline = None, config = None, start = True

    async def server_add(self, replace_cfg: Optional[ReplaceConfig] = None, cmdline: Optional[List[str]] = None, config: Optional[dict[str, str]] = None, start: bool = True) -> ServerInfo:
        """Add a new server"""
        try:
            data: dict[str, Any] = {'start': start}
            if replace_cfg:
                data['replace_cfg'] = replace_cfg._asdict()
            if cmdline:
                data['cmdline'] = cmdline
            if config:
                data['config'] = config
>           server_info = await self.client.put_json("/cluster/addserver", data, response_type="json",
                                                     timeout=ScyllaServer.TOPOLOGY_TIMEOUT)

test/pylib/manager_client.py:169: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test.pylib.rest_client.UnixRESTClient object at 0x7fd90bb7df90>
resource_uri = '/cluster/addserver', data = {'start': True}, host = None
port = None, params = None, response_type = 'json', timeout = 1000

    async def put_json(self, resource_uri: str, data: Mapping, host: Optional[str] = None,
                       port: Optional[int] = None, params: Optional[dict[str, str]] = None,
                       response_type: Optional[str] = None, timeout: Optional[float] = None) -> Any:
>       ret = await self._fetch("PUT", resource_uri, response_type = response_type, host = host,
                                port = port, params = params, json = data, timeout = timeout)

test/pylib/rest_client.py:104: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test.pylib.rest_client.UnixRESTClient object at 0x7fd90bb7df90>
method = 'PUT', resource = '/cluster/addserver', response_type = 'json'
host = None, port = None, params = None, json = {'start': True}, timeout = 1000

    async def _fetch(self, method: str, resource: str, response_type: Optional[str] = None,
                     host: Optional[str] = None, port: Optional[int] = None,
                     params: Optional[Mapping[str, str]] = None,
                     json: Optional[Mapping] = None, timeout: Optional[float] = None) -> Any:
        # Can raise exception. See https://docs.aiohttp.org/en/latest/web_exceptions.html
        assert method in ["GET", "POST", "PUT", "DELETE"], f"Invalid HTTP request method {method}"
        assert response_type is None or response_type in ["text", "json"], \
                f"Invalid response type requested {response_type} (expected 'text' or 'json')"
        # Build the URI
        port = port if port else self.default_port if hasattr(self, "default_port") else None
        port_str = f":{port}" if port else ""
        assert host is not None or hasattr(self, "default_host"), "_fetch: missing host for " \
                "{method} {resource}"
        host_str = host if host is not None else self.default_host
        uri = self.uri_scheme + "://" + host_str + port_str + resource
        logging.debug(f"RESTClient fetching {method} {uri}")
    
        client_timeout = ClientTimeout(total = timeout if timeout is not None else 300)
        async with request(method, uri,
                           connector = self.connector if hasattr(self, "connector") else None,
                           params = params, json = json, timeout = client_timeout) as resp:
            if resp.status != 200:
                text = await resp.text()
>               raise HTTPError(uri, resp.status, params, json, text)
E               test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/addserver, params: None, json: {'start': True}, body:
E               failed to start the node, server_id 217, IP 127.192.220.18, workdir scylla-217, host_id 6d82e977-9994-4f15-b8b8-aa91c99b7f33, cql [not connected]
E               Check the log files:
E               /jenkins/workspace/scylla-5.2/next/scylla/testlog/x86_64/test.py.debug-release-dev.log
E               /jenkins/workspace/scylla-5.2/next/scylla/testlog/x86_64/release/scylla-217.log

test/pylib/rest_client.py:69: HTTPError

The above exception was the direct cause of the following exception:

manager = <test.pylib.manager_client.ManagerClient object at 0x7fd90bb7e110>
random_tables = <test.pylib.random_tables.RandomTables object at 0x7fd90bb7fdd0>

    @pytest.mark.asyncio
    async def test_topology_schema_changes(manager, random_tables):
        """Test schema consistency with restart, add, and sudden stop of servers"""
        table = await random_tables.add_table(ncolumns=5)
        servers = await manager.running_servers()
    
        # Test add column after server restart
        await manager.server_restart(servers[1].server_id)
        await table.add_column()
        await random_tables.verify_schema()
    
        # Test add column after adding a server
>       await manager.server_add()

test/topology/test_topology_schema.py:26: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <test.pylib.manager_client.ManagerClient object at 0x7fd90bb7e110>
replace_cfg = None, cmdline = None, config = None, start = True

    async def server_add(self, replace_cfg: Optional[ReplaceConfig] = None, cmdline: Optional[List[str]] = None, config: Optional[dict[str, str]] = None, start: bool = True) -> ServerInfo:
        """Add a new server"""
        try:
            data: dict[str, Any] = {'start': start}
            if replace_cfg:
                data['replace_cfg'] = replace_cfg._asdict()
            if cmdline:
                data['cmdline'] = cmdline
            if config:
                data['config'] = config
            server_info = await self.client.put_json("/cluster/addserver", data, response_type="json",
                                                     timeout=ScyllaServer.TOPOLOGY_TIMEOUT)
        except Exception as exc:
>           raise Exception("Failed to add server") from exc
E           Exception: Failed to add server

test/pylib/manager_client.py:172: Exception

@tgrabiec
Contributor

tgrabiec commented Nov 3, 2023

@kbr-scylla
Contributor

The backport is nontrivial and there are nontrivial prerequisites that would also have to be backported. Since this is a test-only issue (it doesn't happen in production, where all the sleeps are in place), the backport is not strictly necessary. Still, we should somehow deal with the test flakiness...

The problem is that this can happen any time we start a cluster in test.py. There is no single test to disable to ignore it.

I don't know how to deal with it :/

@avikivity should I prepare a large backport just to unflake the tests?
I'd have to backport at least:
5363616
#13317
#14354
maybe more. Those are large and potentially dangerous backports.

Another way would be to just ignore the flakiness...

@kbr-scylla
Contributor

@Annamikhlin this was caused by a different issue
https://jenkins.scylladb.com/job/enterprise-2024.1/job/build/1/artifact/testlog/x86_64/debug/scylla-429.log

INFO  2023-11-16 06:00:06,104 [shard 0:main] raft_group0 - gaining leadership
ERROR 2023-11-16 06:00:06,106 [shard 0:stre] storage_service - raft topology: failed to fence previous coordinator raft::request_aborted (Request is aborted by a caller)
INFO  2023-11-16 06:00:06,108 [shard 0:stre] storage_service - raft_state_monitor_fiber aborted with seastar::sleep_aborted (Sleep is aborted)

This is #15747 - indeed a missing backport. I'll prepare it.

@benipeled
Contributor

benipeled commented Dec 10, 2023

@mykaul
Contributor

mykaul commented Dec 10, 2023
