Test "test_tablets.test_topology_changes" failed in next - TimeoutError #15728

Closed
roydahan opened this issue Oct 16, 2023 · 10 comments
Labels: P1 Urgent, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework), triage/master (Looking for assignee)

Comments

roydahan commented Oct 16, 2023

https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/staging/job/next/22/testReport/(root)/test_tablets/Tests___Unit_Tests___test_topology_changes_2/

manager = <test.pylib.manager_client.ManagerClient object at 0x7f9e8364d850>

    @pytest.mark.asyncio
    async def test_topology_changes(manager: ManagerClient):
        logger.info("Bootstrapping cluster")
        servers = [await manager.server_add(), await manager.server_add(), await manager.server_add()]
    
        cql = manager.get_cql()
        await cql.run_async("CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', "
                      "'replication_factor': 1, 'initial_tablets': 32};")
        await cql.run_async("CREATE TABLE test.test (pk int PRIMARY KEY, c int);")
    
        logger.info("Populating table")
    
        keys = range(256)
        await asyncio.gather(*[cql.run_async(f"INSERT INTO test.test (pk, c) VALUES ({k}, {k});") for k in keys])
    
        async def check():
            logger.info("Checking table")
            rows = await cql.run_async("SELECT * FROM test.test;")
            assert len(rows) == len(keys)
            for r in rows:
                assert r.c == r.pk
    
        await inject_error_on(manager, "tablet_allocator_shuffle", servers)
    
        logger.info("Adding new server")
        await manager.server_add()
    
        await check()
    
        logger.info("Adding new server")
        await manager.server_add()
    
        await check()
        time.sleep(5) # Give load balancer some time to do work
        await check()
    
>       await manager.decommission_node(servers[0].server_id)

test/topology_experimental_raft/test_tablets.py:181: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test/pylib/manager_client.py:219: in decommission_node
    await self.client.put_json(f"/cluster/decommission-node/{server_id}",
test/pylib/rest_client.py:104: in put_json
    ret = await self._fetch("PUT", resource_uri, response_type = response_type, host = host,
test/pylib/rest_client.py:64: in _fetch
    async with request(method, uri,
/usr/lib64/python3.11/site-packages/aiohttp/client.py:1190: in __aenter__
    self._resp = await self._coro
/usr/lib64/python3.11/site-packages/aiohttp/client.py:561: in _request
    await resp.start(conn)
/usr/lib64/python3.11/site-packages/aiohttp/client_reqrep.py:894: in start
    with self._timer:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <aiohttp.helpers.TimerContext object at 0x7f9e80192010>
exc_type = <class 'asyncio.exceptions.CancelledError'>
exc_val = CancelledError(), exc_tb = <traceback object at 0x7f9e82541440>

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[TracebackType],
    ) -> Optional[bool]:
        if self._tasks:
            self._tasks.pop()
    
        if exc_type is asyncio.CancelledError and self._cancelled:
>           raise asyncio.TimeoutError from None
E           TimeoutError

/usr/lib64/python3.11/site-packages/aiohttp/helpers.py:721: TimeoutError
roydahan added the symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework) and triage/master (Looking for assignee) labels on Oct 16, 2023
mykaul (Contributor) commented Oct 17, 2023

@kbr-scylla - does that belong to your team?

mykaul (Contributor) commented Oct 17, 2023

(Running on spider2)

kbr-scylla (Contributor) commented Oct 17, 2023

It's a tablets test; @tgrabiec, please take a look.

If you find that this is some generic problem with the topology coordinator, then we can take over.

tgrabiec (Contributor) commented:

The test failed because the decommission request timed out (the traceback is the same as in the description above).


The timeout is ~16 min. Streaming finished successfully within 1 sec, but the process seems to be stuck somewhere after giving up leadership. This node should shut down, but it is not shutting down:

04:15:46.232 DEBUG> RESTClient fetching PUT http+unix://api/cluster/decommission-node/1209

scylla-1209.log:

INFO  2023-10-11 04:15:47,831 [shard 0:stre] storage_service - raft topology: coordinator fiber found a node to work on id=f62def06-592a-4b73-afd7-65726c4bf35c state=left_token_ring
INFO  2023-10-11 04:15:47,831 [shard 0:stre] storage_service - raft topology: coordinator is decommissioning; giving up leadership
INFO  2023-10-11 04:15:47,831 [shard 0:stre] raft_group0 - becoming a non-voter (my id = f62def06-592a-4b73-afd7-65726c4bf35c)...
INFO  2023-10-11 04:15:47,834 [shard 0:stre] raft_group0 - became a non-voter.
INFO  2023-10-11 04:15:47,834 [shard 0:stre] raft_group0 - losing leadership
------ Ending test dev/topology_experimental_raft.test_tablets.2::test_topology_changes ------

tgrabiec (Contributor) commented:

The node should be told to shut down by the new leader. It looks like the new leader fails, in which case the topology change fiber exits and does not react to anything else:

INFO  2023-10-11 04:15:47,837 [shard 0:stre] raft_group0 - gaining leadership
INFO  2023-10-11 04:15:47,837 [shard 0:stre] storage_service - raft topology: start topology coordinator fiber
INFO  2023-10-11 04:15:47,837 [shard 0:stre] raft_group0 - losing leadership
INFO  2023-10-11 04:15:47,837 [shard 0:stre] raft_group0 - gaining leadership
ERROR 2023-10-11 04:15:47,838 [shard 0:stre] storage_service - raft topology: failed to fence previous coordinator raft::request_aborted (Request is aborted by a caller)
INFO  2023-10-11 04:15:47,838 [shard 0:stre] storage_service - raft_state_monitor_fiber aborted with raft::request_aborted (Request is aborted by a caller)

The fiber probably shouldn't exit if coordinator->run() throws. @gleb-cloudius

Annamikhlin commented:

Failed on build/5.4:
https://jenkins.scylladb.com/job/scylla-5.4/job/build/2/testReport/junit/(root)/test_raft_cluster_features/Build__Test__Publish___buildTestPublishArm___test_rolling_upgrade_happy_path/

@avikivity - could you please backport it to 5.4? Currently it is blocking us during the release process of 5.4.0-rc1.
@mykaul - FYI

mykaul added the backport/5.4 (Issues that should be backported to the 5.4 branch once they are fixed) and P1 Urgent labels on Oct 29, 2023
avikivity pushed a commit that referenced this issue Oct 29, 2023
…s_coordinator()

Throwing an error kills the topology coordinator monitor fiber. Instead, we
retry the operation until it succeeds or the node loses its leadership.
This is fine because, for the operation to succeed, a quorum is needed, and
if the quorum is not available the node should relinquish its leadership.

Fixes #15728

(cherry picked from commit 65bf587)
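
For readers unfamiliar with the coordinator code, here is a minimal standalone C++ sketch of the retry pattern the commit message describes: keep retrying the fencing operation instead of letting a single failure kill the monitor fiber, and stop once the node is no longer leader. It is an illustration only, not Scylla code; `try_fence_previous_coordinator` and `is_leader` are hypothetical stand-ins, and `request_aborted` here is a plain exception type standing in for `raft::request_aborted`.

```cpp
#include <chrono>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for raft::request_aborted.
struct request_aborted : std::runtime_error {
    request_aborted() : std::runtime_error("Request is aborted by a caller") {}
};

// Retry the operation until it succeeds or the node loses leadership,
// instead of letting one failed attempt terminate the monitor fiber.
void fence_previous_coordinator(const std::function<void()>& try_fence_previous_coordinator,
                                const std::function<bool()>& is_leader) {
    while (is_leader()) {
        try {
            try_fence_previous_coordinator();
            return;                      // fenced successfully
        } catch (const request_aborted&) {
            return;                      // we were asked to stop; exit quietly
        } catch (const std::exception& e) {
            std::cerr << "failed to fence previous coordinator: " << e.what()
                      << ", retrying" << std::endl;
        }
        std::this_thread::sleep_for(std::chrono::seconds(1)); // back off before retrying
    }
    // Leadership lost: nothing more to do here.
}

int main() {
    int attempts = 0;
    // Simulated operation that fails twice (no quorum), then succeeds.
    fence_previous_coordinator(
        [&] { if (++attempts < 3) throw std::runtime_error("quorum not available"); },
        [] { return true; });
    std::cout << "fenced after " << attempts << " attempt(s)" << std::endl;
}
```
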
avikivity (Member) commented:

Backported to 5.4. Earlier versions not vulnerable.

denesb removed the backport/5.4 (Issues that should be backported to the 5.4 branch once they are fixed) label on Oct 30, 2023
denesb (Contributor) commented Oct 30, 2023

Removed Requires-Backport-to-5.4, already in branch-5.4.

tchaikov (Contributor) commented Nov 7, 2023

I am still seeing a similar failure; see https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4533/testReport/junit/(root)/test_tablets/Tests___Unit_Tests___test_topology_changes/. Due to #15973, the link might not lead directly to the failure; refresh the browser a few times to see it.

manager = <test.pylib.manager_client.ManagerClient object at 0x7f2547076490>

    @pytest.mark.asyncio
    async def test_topology_changes(manager: ManagerClient):
        logger.info("Bootstrapping cluster")
        servers = [await manager.server_add(), await manager.server_add(), await manager.server_add()]
    
        cql = manager.get_cql()
        await cql.run_async("CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', "
                      "'replication_factor': 1, 'initial_tablets': 32};")
        await cql.run_async("CREATE TABLE test.test (pk int PRIMARY KEY, c int);")
    
        logger.info("Populating table")
    
        keys = range(256)
        await asyncio.gather(*[cql.run_async(f"INSERT INTO test.test (pk, c) VALUES ({k}, {k});") for k in keys])
    
        async def check():
            logger.info("Checking table")
            rows = await cql.run_async("SELECT * FROM test.test;")
            assert len(rows) == len(keys)
            for r in rows:
                assert r.c == r.pk
    
        await inject_error_on(manager, "tablet_allocator_shuffle", servers)
    
        logger.info("Adding new server")
        await manager.server_add()
    
        await check()
    
        logger.info("Adding new server")
        await manager.server_add()
    
        await check()
        time.sleep(5) # Give load balancer some time to do work
        await check()
    
>       await manager.decommission_node(servers[0].server_id)

test/topology_experimental_raft/test_tablets.py:197: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test/pylib/manager_client.py:222: in decommission_node
    await self.client.put_json(f"/cluster/decommission-node/{server_id}", data,
test/pylib/rest_client.py:104: in put_json
    ret = await self._fetch("PUT", resource_uri, response_type = response_type, host = host,
test/pylib/rest_client.py:64: in _fetch
    async with request(method, uri,
/usr/lib64/python3.11/site-packages/aiohttp/client.py:1190: in __aenter__
    self._resp = await self._coro
/usr/lib64/python3.11/site-packages/aiohttp/client.py:561: in _request
    await resp.start(conn)
/usr/lib64/python3.11/site-packages/aiohttp/client_reqrep.py:894: in start
    with self._timer:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <aiohttp.helpers.TimerContext object at 0x7f25442c0f50>
exc_type = <class 'asyncio.exceptions.CancelledError'>
exc_val = CancelledError(), exc_tb = <traceback object at 0x7f2544239100>

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[TracebackType],
    ) -> Optional[bool]:
        if self._tasks:
            self._tasks.pop()
    
        if exc_type is asyncio.CancelledError and self._cancelled:
>           raise asyncio.TimeoutError from None
E           TimeoutError

/usr/lib64/python3.11/site-packages/aiohttp/helpers.py:721: TimeoutError

kbr-scylla added a commit that referenced this issue Nov 7, 2023
…s_coordinator' from Piotr Dulikowski

When the topology coordinator tries to fence the previous coordinator, it
performs a group0 operation. The current topology coordinator might be
aborted in the meantime, which results in a `raft::request_aborted`
exception being thrown. After the fix to #15728 was merged, the exception
is caught, but then `sleep_abortable` is called, which immediately throws
`abort_requested_exception` because it uses the same abort source as the
group0 operation. The `fence_previous_coordinator` function, which does all
those things, is not supposed to throw exceptions; if it does, it causes
`raft_state_monitor_fiber` to exit, completely disabling the topology
coordinator functionality on that node.

Modify the code in the following way:

- Catch `abort_requested_exception` thrown from `sleep_abortable` and exit
  the function if it happens. In addition to the described issue, this also
  handles the case where an abort is requested while `sleep_abortable` is in
  progress.
- Catch `raft::request_aborted` thrown from the group0 operation, log the
  exception with lower verbosity, and exit the function explicitly.

Finally, wrap both the `fence_previous_coordinator` and `run` functions in a
`try` block with `on_fatal_internal_error` in the catch handler, in order
to implement the behavior that adding `noexcept` was originally supposed
to introduce.

Fixes: #15747

Closes #15948

* github.com:scylladb/scylladb:
  raft topology: catch and abort on exceptions from topology_coordinator::run
  Revert "storage_service: raft topology: mark topology_coordinator::run function as noexcept"
  raft topology: don't print an error when fencing previous coordinator is aborted
  raft topology: handle abort exceptions from sleeping in fence_previous_coordinator
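
Again purely as an illustration of the structure described by the commit message above (an assumed shape, not the actual Seastar/Scylla code): abort-type exceptions from either the fencing operation or the back-off sleep end the function quietly, while anything that still escapes the fiber body is turned into a deliberate process abort, mimicking `on_fatal_internal_error`. The types `request_aborted` and `sleep_aborted` and the helper `fatal_internal_error` are hypothetical stand-ins.

```cpp
#include <cstdlib>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for raft::request_aborted and seastar::sleep_aborted.
struct request_aborted : std::runtime_error {
    request_aborted() : std::runtime_error("Request is aborted by a caller") {}
};
struct sleep_aborted : std::runtime_error {
    sleep_aborted() : std::runtime_error("Sleep is aborted") {}
};

// Stand-in for on_fatal_internal_error(): log and abort the process.
[[noreturn]] void fatal_internal_error(const std::string& msg) {
    std::cerr << "fatal internal error: " << msg << std::endl;
    std::abort();
}

// Keep trying to fence; exit quietly whenever an abort is requested,
// whether it surfaces from the group0 operation or from the back-off sleep.
void fence_previous_coordinator(const std::function<void()>& group0_fence,
                                const std::function<void()>& sleep_before_retry) {
    while (true) {
        try {
            group0_fence();
            return;                          // fenced successfully
        } catch (const request_aborted&) {
            return;                          // logged at lower verbosity in the real code
        } catch (const std::exception& e) {
            std::cerr << "failed to fence previous coordinator: " << e.what() << std::endl;
        }
        try {
            sleep_before_retry();            // stand-in for sleep_abortable()
        } catch (const sleep_aborted&) {
            return;                          // abort requested while sleeping
        }
    }
}

// The fiber body is wrapped so that any unexpected exception crashes loudly
// instead of silently leaving the node without a topology coordinator.
void raft_state_monitor_fiber(const std::function<void()>& group0_fence,
                              const std::function<void()>& sleep_before_retry,
                              const std::function<void()>& run_coordinator) {
    try {
        fence_previous_coordinator(group0_fence, sleep_before_retry);
        run_coordinator();
    } catch (...) {
        fatal_internal_error("unexpected exception in topology coordinator fiber");
    }
}

int main() {
    int attempts = 0;
    // Simulated run: the first fence attempt fails, the retry succeeds.
    raft_state_monitor_fiber(
        [&] { if (++attempts < 2) throw std::runtime_error("quorum not available"); },
        [] { /* pretend to sleep briefly */ },
        [] { std::cout << "coordinator running" << std::endl; });
}
```
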
kbr-scylla (Contributor) commented:

@tchaikov

ERROR 2023-11-06 13:03:01,895 [shard 0:strm] storage_service - raft topology: failed to fence previous coordinator raft::request_aborted (Request is aborted by a caller)
INFO  2023-11-06 13:03:01,895 [shard 0:strm] storage_service - raft_state_monitor_fiber aborted with seastar::sleep_aborted (Sleep is aborted)

This looks like #15747.

kbr-scylla added a commit that referenced this issue Nov 16, 2023
…s_coordinator' from Piotr Dulikowski


(cherry picked from commit 07e9522)