Test "test_tablets.test_topology_changes" failed in next - TimeoutError #15728

Closed
roydahan opened this issue Oct 16, 2023 · 10 comments
Labels: P1 Urgent, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework), triage/master (Looking for assignee)

Comments

roydahan commented Oct 16, 2023

https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/staging/job/next/22/testReport/(root)/test_tablets/Tests___Unit_Tests___test_topology_changes_2/

manager = <test.pylib.manager_client.ManagerClient object at 0x7f9e8364d850>

    @pytest.mark.asyncio
    async def test_topology_changes(manager: ManagerClient):
        logger.info("Bootstrapping cluster")
        servers = [await manager.server_add(), await manager.server_add(), await manager.server_add()]
    
        cql = manager.get_cql()
        await cql.run_async("CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', "
                      "'replication_factor': 1, 'initial_tablets': 32};")
        await cql.run_async("CREATE TABLE test.test (pk int PRIMARY KEY, c int);")
    
        logger.info("Populating table")
    
        keys = range(256)
        await asyncio.gather(*[cql.run_async(f"INSERT INTO test.test (pk, c) VALUES ({k}, {k});") for k in keys])
    
        async def check():
            logger.info("Checking table")
            rows = await cql.run_async("SELECT * FROM test.test;")
            assert len(rows) == len(keys)
            for r in rows:
                assert r.c == r.pk
    
        await inject_error_on(manager, "tablet_allocator_shuffle", servers)
    
        logger.info("Adding new server")
        await manager.server_add()
    
        await check()
    
        logger.info("Adding new server")
        await manager.server_add()
    
        await check()
        time.sleep(5) # Give load balancer some time to do work
        await check()
    
>       await manager.decommission_node(servers[0].server_id)

test/topology_experimental_raft/test_tablets.py:181: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test/pylib/manager_client.py:219: in decommission_node
    await self.client.put_json(f"/cluster/decommission-node/{server_id}",
test/pylib/rest_client.py:104: in put_json
    ret = await self._fetch("PUT", resource_uri, response_type = response_type, host = host,
test/pylib/rest_client.py:64: in _fetch
    async with request(method, uri,
/usr/lib64/python3.11/site-packages/aiohttp/client.py:1190: in __aenter__
    self._resp = await self._coro
/usr/lib64/python3.11/site-packages/aiohttp/client.py:561: in _request
    await resp.start(conn)
/usr/lib64/python3.11/site-packages/aiohttp/client_reqrep.py:894: in start
    with self._timer:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <aiohttp.helpers.TimerContext object at 0x7f9e80192010>
exc_type = <class 'asyncio.exceptions.CancelledError'>
exc_val = CancelledError(), exc_tb = <traceback object at 0x7f9e82541440>

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[TracebackType],
    ) -> Optional[bool]:
        if self._tasks:
            self._tasks.pop()
    
        if exc_type is asyncio.CancelledError and self._cancelled:
>           raise asyncio.TimeoutError from None
E           TimeoutError

/usr/lib64/python3.11/site-packages/aiohttp/helpers.py:721: TimeoutError
roydahan added the symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework) and triage/master (Looking for assignee) labels on Oct 16, 2023
mykaul (Contributor) commented Oct 17, 2023

@kbr-scylla - does that belong to your team?

mykaul (Contributor) commented Oct 17, 2023

(Running on spider2)

kbr-scylla (Contributor) commented Oct 17, 2023

It's a tablets test; @tgrabiec, please take a look.

If you find that this is some generic problem with the topology coordinator, then we can take over.

tgrabiec (Contributor) commented:

The test failed because the decommission request timed out (the traceback is the same as in the description above).


The timeout is ~16 min. Streaming finished successfully within 1 sec, but the process seems to be stuck somewhere after giving up leadership. This node should shut down, but it is not shutting down:

04:15:46.232 DEBUG> RESTClient fetching PUT http+unix://api/cluster/decommission-node/1209

scylla-1209.log:

INFO  2023-10-11 04:15:47,831 [shard 0:stre] storage_service - raft topology: coordinator fiber found a node to work on id=f62def06-592a-4b73-afd7-65726c4bf35c state=left_token_ring
INFO  2023-10-11 04:15:47,831 [shard 0:stre] storage_service - raft topology: coordinator is decommissioning; giving up leadership
INFO  2023-10-11 04:15:47,831 [shard 0:stre] raft_group0 - becoming a non-voter (my id = f62def06-592a-4b73-afd7-65726c4bf35c)...
INFO  2023-10-11 04:15:47,834 [shard 0:stre] raft_group0 - became a non-voter.
INFO  2023-10-11 04:15:47,834 [shard 0:stre] raft_group0 - losing leadership
------ Ending test dev/topology_experimental_raft.test_tablets.2::test_topology_changes ------

tgrabiec (Contributor) commented:

The node should be told to shut down by the new leader. It looks like the new leader fails, in which case the topology change fiber exits and does not react to anything else:

INFO  2023-10-11 04:15:47,837 [shard 0:stre] raft_group0 - gaining leadership
INFO  2023-10-11 04:15:47,837 [shard 0:stre] storage_service - raft topology: start topology coordinator fiber
INFO  2023-10-11 04:15:47,837 [shard 0:stre] raft_group0 - losing leadership
INFO  2023-10-11 04:15:47,837 [shard 0:stre] raft_group0 - gaining leadership
ERROR 2023-10-11 04:15:47,838 [shard 0:stre] storage_service - raft topology: failed to fence previous coordinator raft::request_aborted (Request is aborted by a caller)
INFO  2023-10-11 04:15:47,838 [shard 0:stre] storage_service - raft_state_monitor_fiber aborted with raft::request_aborted (Request is aborted by a caller)

The fiber probably shouldn't exit if coordinator->run() throws. @gleb-cloudius

Annamikhlin commented:

Failed on build/5.4:
https://jenkins.scylladb.com/job/scylla-5.4/job/build/2/testReport/junit/(root)/test_raft_cluster_features/Build__Test__Publish___buildTestPublishArm___test_rolling_upgrade_happy_path/

@avikivity - could you please backport it to 5.4? Currently it is blocking us during the release process of 5.4.0-rc1.
@mykaul - FYI

mykaul added the backport/5.4 (Issues that should be backported to the 5.4 branch once they are fixed) and P1 Urgent labels on Oct 29, 2023
avikivity pushed a commit that referenced this issue Oct 29, 2023
…s_coordinator()

Throwing an error kills the topology coordinator monitor fiber. Instead, we
retry the operation until it succeeds or the node loses its leadership.
This is fine because, for the operation to succeed, a quorum is needed, and
if the quorum is not available the node should relinquish its leadership.

Fixes #15728

(cherry picked from commit 65bf587)
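
For readers unfamiliar with the coordinator code, here is a minimal standalone C++ sketch of the retry pattern the commit message describes: keep retrying the fencing operation instead of letting a single failure kill the monitor fiber, and stop once the node is no longer leader. It is an illustration only, not Scylla code; `try_fence_previous_coordinator` and `is_leader` are hypothetical stand-ins, and `request_aborted` here is a plain exception type standing in for `raft::request_aborted`.

```cpp
#include <chrono>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for raft::request_aborted.
struct request_aborted : std::runtime_error {
    request_aborted() : std::runtime_error("Request is aborted by a caller") {}
};

// Retry the operation until it succeeds or the node loses leadership,
// instead of letting one failed attempt terminate the monitor fiber.
void fence_previous_coordinator(const std::function<void()>& try_fence_previous_coordinator,
                                const std::function<bool()>& is_leader) {
    while (is_leader()) {
        try {
            try_fence_previous_coordinator();
            return;                      // fenced successfully
        } catch (const request_aborted&) {
            return;                      // we were asked to stop; exit quietly
        } catch (const std::exception& e) {
            std::cerr << "failed to fence previous coordinator: " << e.what()
                      << ", retrying" << std::endl;
        }
        std::this_thread::sleep_for(std::chrono::seconds(1)); // back off before retrying
    }
    // Leadership lost: nothing more to do here.
}

int main() {
    int attempts = 0;
    // Simulated operation that fails twice (no quorum), then succeeds.
    fence_previous_coordinator(
        [&] { if (++attempts < 3) throw std::runtime_error("quorum not available"); },
        [] { return true; });
    std::cout << "fenced after " << attempts << " attempt(s)" << std::endl;
}
```
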
avikivity (Member) commented:

Backported to 5.4. Earlier versions not vulnerable.

denesb removed the backport/5.4 (Issues that should be backported to the 5.4 branch once they are fixed) label on Oct 30, 2023
denesb (Contributor) commented Oct 30, 2023

Removed Requires-Backport-to-5.4, already in branch-5.4.

tchaikov (Contributor) commented Nov 7, 2023

I am still seeing a similar failure; see https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4533/testReport/junit/(root)/test_tablets/Tests___Unit_Tests___test_topology_changes/. Due to #15973, the link might not lead directly to the failure; refresh the browser a few times to see it.

manager = <test.pylib.manager_client.ManagerClient object at 0x7f2547076490>

    @pytest.mark.asyncio
    async def test_topology_changes(manager: ManagerClient):
        logger.info("Bootstrapping cluster")
        servers = [await manager.server_add(), await manager.server_add(), await manager.server_add()]
    
        cql = manager.get_cql()
        await cql.run_async("CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', "
                      "'replication_factor': 1, 'initial_tablets': 32};")
        await cql.run_async("CREATE TABLE test.test (pk int PRIMARY KEY, c int);")
    
        logger.info("Populating table")
    
        keys = range(256)
        await asyncio.gather(*[cql.run_async(f"INSERT INTO test.test (pk, c) VALUES ({k}, {k});") for k in keys])
    
        async def check():
            logger.info("Checking table")
            rows = await cql.run_async("SELECT * FROM test.test;")
            assert len(rows) == len(keys)
            for r in rows:
                assert r.c == r.pk
    
        await inject_error_on(manager, "tablet_allocator_shuffle", servers)
    
        logger.info("Adding new server")
        await manager.server_add()
    
        await check()
    
        logger.info("Adding new server")
        await manager.server_add()
    
        await check()
        time.sleep(5) # Give load balancer some time to do work
        await check()
    
>       await manager.decommission_node(servers[0].server_id)

test/topology_experimental_raft/test_tablets.py:197: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test/pylib/manager_client.py:222: in decommission_node
    await self.client.put_json(f"/cluster/decommission-node/{server_id}", data,
test/pylib/rest_client.py:104: in put_json
    ret = await self._fetch("PUT", resource_uri, response_type = response_type, host = host,
test/pylib/rest_client.py:64: in _fetch
    async with request(method, uri,
/usr/lib64/python3.11/site-packages/aiohttp/client.py:1190: in __aenter__
    self._resp = await self._coro
/usr/lib64/python3.11/site-packages/aiohttp/client.py:561: in _request
    await resp.start(conn)
/usr/lib64/python3.11/site-packages/aiohttp/client_reqrep.py:894: in start
    with self._timer:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <aiohttp.helpers.TimerContext object at 0x7f25442c0f50>
exc_type = <class 'asyncio.exceptions.CancelledError'>
exc_val = CancelledError(), exc_tb = <traceback object at 0x7f2544239100>

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[TracebackType],
    ) -> Optional[bool]:
        if self._tasks:
            self._tasks.pop()
    
        if exc_type is asyncio.CancelledError and self._cancelled:
>           raise asyncio.TimeoutError from None
E           TimeoutError

/usr/lib64/python3.11/site-packages/aiohttp/helpers.py:721: TimeoutError

kbr-scylla added a commit that referenced this issue Nov 7, 2023
…s_coordinator' from Piotr Dulikowski

When the topology coordinator tries to fence the previous coordinator, it
performs a group0 operation. The current topology coordinator might be
aborted in the meantime, which results in a `raft::request_aborted`
exception being thrown. After the fix to #15728 was merged, the exception
is caught, but then `sleep_abortable` is called, which immediately throws
`abort_requested_exception` because it uses the same abort source as the
group0 operation. The `fence_previous_coordinator` function, which does all
those things, is not supposed to throw exceptions; if it does, it causes
`raft_state_monitor_fiber` to exit, completely disabling the topology
coordinator functionality on that node.

Modify the code in the following way:

- Catch `abort_requested_exception` thrown from `sleep_abortable` and exit
  the function if it happens. In addition to the described issue, this also
  handles the case where an abort is requested while `sleep_abortable` is in
  progress.
- Catch `raft::request_aborted` thrown from the group0 operation, log the
  exception with lower verbosity, and exit the function explicitly.

Finally, wrap both the `fence_previous_coordinator` and `run` functions in a
`try` block with `on_fatal_internal_error` in the catch handler, in order
to implement the behavior that adding `noexcept` was originally supposed
to introduce.

Fixes: #15747

Closes #15948

* github.com:scylladb/scylladb:
  raft topology: catch and abort on exceptions from topology_coordinator::run
  Revert "storage_service: raft topology: mark topology_coordinator::run function as noexcept"
  raft topology: don't print an error when fencing previous coordinator is aborted
  raft topology: handle abort exceptions from sleeping in fence_previous_coordinator
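
Again purely as an illustration of the structure described by the commit message above (an assumed shape, not the actual Seastar/Scylla code): abort-type exceptions from either the fencing operation or the back-off sleep end the function quietly, while anything that still escapes the fiber body is turned into a deliberate process abort, mimicking `on_fatal_internal_error`. The types `request_aborted` and `sleep_aborted` and the helper `fatal_internal_error` are hypothetical stand-ins.

```cpp
#include <cstdlib>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for raft::request_aborted and seastar::sleep_aborted.
struct request_aborted : std::runtime_error {
    request_aborted() : std::runtime_error("Request is aborted by a caller") {}
};
struct sleep_aborted : std::runtime_error {
    sleep_aborted() : std::runtime_error("Sleep is aborted") {}
};

// Stand-in for on_fatal_internal_error(): log and abort the process.
[[noreturn]] void fatal_internal_error(const std::string& msg) {
    std::cerr << "fatal internal error: " << msg << std::endl;
    std::abort();
}

// Keep trying to fence; exit quietly whenever an abort is requested,
// whether it surfaces from the group0 operation or from the back-off sleep.
void fence_previous_coordinator(const std::function<void()>& group0_fence,
                                const std::function<void()>& sleep_before_retry) {
    while (true) {
        try {
            group0_fence();
            return;                          // fenced successfully
        } catch (const request_aborted&) {
            return;                          // logged at lower verbosity in the real code
        } catch (const std::exception& e) {
            std::cerr << "failed to fence previous coordinator: " << e.what() << std::endl;
        }
        try {
            sleep_before_retry();            // stand-in for sleep_abortable()
        } catch (const sleep_aborted&) {
            return;                          // abort requested while sleeping
        }
    }
}

// The fiber body is wrapped so that any unexpected exception crashes loudly
// instead of silently leaving the node without a topology coordinator.
void raft_state_monitor_fiber(const std::function<void()>& group0_fence,
                              const std::function<void()>& sleep_before_retry,
                              const std::function<void()>& run_coordinator) {
    try {
        fence_previous_coordinator(group0_fence, sleep_before_retry);
        run_coordinator();
    } catch (...) {
        fatal_internal_error("unexpected exception in topology coordinator fiber");
    }
}

int main() {
    int attempts = 0;
    // Simulated run: the first fence attempt fails, the retry succeeds.
    raft_state_monitor_fiber(
        [&] { if (++attempts < 2) throw std::runtime_error("quorum not available"); },
        [] { /* pretend to sleep briefly */ },
        [] { std::cout << "coordinator running" << std::endl; });
}
```
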
kbr-scylla (Contributor) commented:

@tchaikov

ERROR 2023-11-06 13:03:01,895 [shard 0:strm] storage_service - raft topology: failed to fence previous coordinator raft::request_aborted (Request is aborted by a caller)
INFO  2023-11-06 13:03:01,895 [shard 0:strm] storage_service - raft_state_monitor_fiber aborted with seastar::sleep_aborted (Sleep is aborted)

This looks like #15747.

kbr-scylla added a commit that referenced this issue Nov 16, 2023
…s_coordinator' from Piotr Dulikowski


(cherry picked from commit 07e9522)