Tests / Unit Tests / non-boost tests.topology_experimental_raft.test_raft_cluster_features.debug is flaky #15747

wmitros · 2023-10-17T22:00:50Z

Seen in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4217/testReport/junit/(root)/test_raft_cluster_features/Tests___Unit_Tests___test_rolling_upgrade_happy_path_5/

AssertionError: Deadline exceeded, failing test.

manager = <test.pylib.manager_client.ManagerClient object at 0x7fed911b51d0>

    @pytest.mark.asyncio
    async def test_rolling_upgrade_happy_path(manager: ManagerClient) -> None:
        for _ in range(3):
            await manager.server_add()
>       await test_cluster_features.test_rolling_upgrade_happy_path(manager)

test/topology_experimental_raft/test_raft_cluster_features.py:24: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test/topology/test_cluster_features.py:92: in test_rolling_upgrade_happy_path
    await asyncio.gather(*(wait_for_feature(TEST_FEATURE_NAME, cql, h, time.time() + 60) for h in hosts))
test/pylib/util.py:131: in wait_for_feature
    await wait_for(feature_is_enabled, deadline)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

pred = <function wait_for_feature.<locals>.feature_is_enabled at 0x7fed906d8a40>
deadline = 1697493961.5800846, period = 1

    async def wait_for(
            pred: Callable[[], Awaitable[Optional[T]]],
            deadline: float, period: float = 1) -> T:
        while True:
>           assert(time.time() < deadline), "Deadline exceeded, failing test."
E           AssertionError: Deadline exceeded, failing test.

test/pylib/util.py:45: AssertionError

Ran on an m5ad.8xlarge instance, so shouldn't be an infra problem even though it's a timeout
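
For readers unfamiliar with the helper in the traceback: only the first lines of `wait_for` are visible above. A minimal sketch of such a polling helper follows, assuming the rest of the body awaits `pred` and sleeps for `period` between attempts. The part after the `assert` is an assumption, not the actual contents of `test/pylib/util.py`.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def wait_for(pred: Callable[[], Awaitable[Optional[T]]],
                   deadline: float, period: float = 1) -> T:
    while True:
        # Same check as in the traceback: fail once the wall-clock deadline passes.
        assert time.time() < deadline, "Deadline exceeded, failing test."
        # Assumed remainder: a non-None result from pred ends the wait.
        res = await pred()
        if res is not None:
            return res
        await asyncio.sleep(period)
```

With this shape, a predicate that never returns a value (for example, because the feature is never enabled on a node) can only end in the `AssertionError` seen in the report.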


Annamikhlin · 2023-10-24T07:50:44Z

failed today in master/build -
https://jenkins.scylladb.com/job/scylla-master/job/build/1671/testReport/junit/(root)/test_raft_cluster_features/Build__Test__Publish___buildTestPublishArm___test_rolling_upgrade_happy_path/

mykaul · 2023-10-24T13:13:56Z

@kbr-scylla - please have someone look at the latest failure above.

piodul · 2023-10-24T15:41:36Z

The node logs from the original report are gone, but there are still some from the second report here.

This excerpt from https://jenkins.scylladb.com/job/scylla-master/job/build/1671/artifact/testlog/aarch64/debug/scylla-305.log looks very similar to #15728 (comment) and the explanation is the same: the topology coordinator failed to fence the previous coordinator and threw an exception, which caused the topology coordinator to exit and not do anything else.

INFO  2023-10-24 05:44:23,654 [shard 0:main] raft_group0 - gaining leadership
INFO  2023-10-24 05:44:23,660 [shard 0:stre] storage_service - raft topology: refreshing topology to check if it's synchronized with local metadata
ERROR 2023-10-24 05:44:23,672 [shard 0:stre] storage_service - raft topology: failed to fence previous coordinator raft::request_aborted (Request is aborted by a caller)
INFO  2023-10-24 05:44:23,676 [shard 0:stre] storage_service - raft_state_monitor_fiber aborted with raft::request_aborted (Request is aborted by a caller)

So it's a duplicate of #15728. It will be fixed by Gleb's patch which was already queued.

piodul · 2023-11-04T15:56:53Z

Unfortunately, it's not fixed yet, we had a similar failure today here: https://jenkins.scylladb.com/job/scylla-master/job/build/1684/

Logs are slightly different (there is a difference in the last message):

INFO  2023-11-04 09:13:14,211 [shard 0:strm] storage_service - raft topology: refreshing topology to check if it's synchronized with local metadata
INFO  2023-11-04 09:13:14,215 [shard 0:main] raft_group0 - gaining leadership
ERROR 2023-11-04 09:13:14,219 [shard 0:strm] storage_service - raft topology: failed to fence previous coordinator raft::request_aborted (Request is aborted by a caller)
INFO  2023-11-04 09:13:14,222 [shard 0:strm] storage_service - raft_state_monitor_fiber aborted with seastar::sleep_aborted (Sleep is aborted)

The fix mentioned above has a bug: although it catches the raft::request_aborted exception from start_operation/update_topology_state, it then calls sleep_abortable which throws an uncaught abort_requested_exception.

Will send a fix shortly.
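
The failure mode can be illustrated with a small Python analogy. This is a sketch only: the real code is C++ on Seastar, and every name below (`AbortSource`, `RequestAborted`, `AbortRequested`, `group0_operation`, `fence_previous_coordinator_buggy`) is a hypothetical stand-in rather than ScyllaDB's API. It assumes the fencing code retries in a loop with a sleep between attempts.

```python
import asyncio


class RequestAborted(Exception):
    """Stand-in for raft::request_aborted."""


class AbortRequested(Exception):
    """Stand-in for the abort exception thrown by sleep_abortable."""


class AbortSource:
    """Hypothetical abort source shared by the operation and the sleep."""

    def __init__(self) -> None:
        self.aborted = False


async def group0_operation(abort: AbortSource) -> None:
    # Fails with RequestAborted once the shared abort source has fired.
    if abort.aborted:
        raise RequestAborted("Request is aborted by a caller")


async def sleep_abortable(seconds: float, abort: AbortSource) -> None:
    # Throws immediately if the abort source has already fired.
    if abort.aborted:
        raise AbortRequested("Sleep is aborted")
    await asyncio.sleep(seconds)


async def fence_previous_coordinator_buggy(abort: AbortSource) -> None:
    while True:
        try:
            await group0_operation(abort)
            return
        except RequestAborted:
            pass  # caught, as after the fix for #15728
        # Same abort source as the operation: this raises right away
        # and nothing here catches it, so the exception escapes.
        await sleep_abortable(1, abort)
```

Because both calls observe the same abort source, catching the first exception only moves the failure to the sleep.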

kbr-scylla · 2023-11-08T13:25:16Z

We should backport to 5.4 to prevent flakiness.
I'll wait ~one week before backporting.

Annamikhlin · 2023-11-16T06:57:48Z

seen on 2024.1:
https://jenkins.scylladb.com/job/enterprise-2024.1/job/build/1/testReport/junit/(root)/test_cluster_features/Build__Test__Publish___buildTestPublishX86___test_rolling_upgrade_happy_path_2/

Requires a backport to 2024.1.

…s_coordinator' from Piotr Dulikowski

When topology coordinator tries to fence the previous coordinator it performs a group0 operation. The current topology coordinator might be aborted in the meantime, which will result in a `raft::request_aborted` exception being thrown. After the fix to #15728 was merged, the exception is caught, but then `sleep_abortable` is called which immediately throws `abort_requested_exception` as it uses the same abort source as the group0 operation. The `fence_previous_coordinator` function which does all those things is not supposed to throw exceptions, if it does - it causes `raft_state_monitor_fiber` to exit, completely disabling the topology coordinator functionality on that node.

Modify the code in the following way:
- Catch `abort_requested_exception` thrown from `sleep_abortable` and exit the function if it happens. In addition to the described issue, it will also handle the case when abort is requested while `sleep_abortable` happens,
- Catch `raft::request_aborted` thrown from group0 operation, log the exception with lower verbosity and exit the function explicitly.

Finally, wrap both `fence_previous_coordinator` and `run` functions in a `try` block with `on_fatal_internal_error` in the catch handler in order to implement the behavior that adding `noexcept` was originally supposed to introduce.

Fixes: #15747

Closes #15948

* github.com:scylladb/scylladb:
  raft topology: catch and abort on exceptions from topology_coordinator::run
  Revert "storage_service: raft topology: mark topology_coordinator::run function as noexcept"
  raft topology: don't print an error when fencing previous coordinator is aborted
  raft topology: handle abort exceptions from sleeping in fence_previous_coordinator

(cherry picked from commit 07e9522)
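
The two catch rules from the commit message can be sketched as a Python analogy. The actual change is in C++; all names here are hypothetical stand-ins, the retry-on-other-errors loop is an assumption, and the outer `on_fatal_internal_error` wrapper is not modeled.

```python
import asyncio
import logging
from typing import Awaitable, Callable

log = logging.getLogger("raft_topology_sketch")


class RequestAborted(Exception):
    """Stand-in for raft::request_aborted."""


class AbortRequested(Exception):
    """Stand-in for abort_requested_exception."""


class AbortSource:
    """Hypothetical abort source shared by the operation and the sleep."""

    def __init__(self) -> None:
        self.aborted = False


async def sleep_abortable(seconds: float, abort: AbortSource) -> None:
    # Throws if the abort fired before or during the sleep.
    if abort.aborted:
        raise AbortRequested("Sleep is aborted")
    await asyncio.sleep(seconds)
    if abort.aborted:
        raise AbortRequested("Sleep is aborted")


async def fence_previous_coordinator(op: Callable[[], Awaitable[None]],
                                     abort: AbortSource,
                                     retry_delay: float = 1.0) -> bool:
    """Returns True if fencing succeeded, False if it was aborted."""
    while True:
        try:
            await op()  # the group0 operation
            return True
        except RequestAborted as e:
            # Abort is expected: log with lower verbosity and exit explicitly.
            log.debug("fencing previous coordinator aborted: %s", e)
            return False
        except Exception as e:
            # Assumed: any other failure is logged and retried after a sleep.
            log.error("failed to fence previous coordinator: %s", e)
        try:
            await sleep_abortable(retry_delay, abort)
        except AbortRequested:
            # Abort requested before or during the sleep: exit quietly.
            return False
```

In this shape no abort-related exception leaves the function, so the caller (the analogue of `raft_state_monitor_fiber`) keeps running.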

kbr-scylla · 2023-11-16T11:46:40Z

Queued backport to 5.4 (will eventually end up in 2024.1).

mykaul · 2023-12-13T12:24:29Z

Queued backport to 5.4 (will eventually end up in 2024.1).

Removed labels.

wmitros added the symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework label Oct 17, 2023

wmitros mentioned this issue Oct 17, 2023

test: account for multiple flushes of commitlog segments #14610

Closed

mykaul added the area/raft label Oct 18, 2023

mykaul assigned kbr-scylla Oct 24, 2023

kbr-scylla assigned piodul Oct 24, 2023

piodul closed this as not planned Oct 24, 2023

piodul reopened this Nov 4, 2023

piodul mentioned this issue Nov 4, 2023

raft topology: handle abort exceptions better in fence_previous_coordinator #15948

Merged

mykaul added the P1 Urgent label Nov 5, 2023

mykaul added this to the 6.0 milestone Nov 5, 2023

scylladb-promoter closed this as completed in 07e9522 Nov 7, 2023

scylladb-promoter added the Backport candidate label Nov 7, 2023

kbr-scylla mentioned this issue Nov 8, 2023

Test "test_tablets.test_topology_changes" failed in next - TimeoutError #15728

Closed

kbr-scylla added the backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed label Nov 8, 2023

kbr-scylla mentioned this issue Nov 16, 2023

gossiper, a new node doesn't get properly notified about other node restart #14042

Closed

mykaul removed Backport candidate backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed labels Dec 13, 2023
