Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests / Unit Tests / non-boost tests.topology_experimental_raft.test_raft_cluster_features.debug is flaky #15747

Closed
wmitros opened this issue Oct 17, 2023 · 8 comments · Fixed by #15948
Assignees
Labels
area/raft P1 Urgent symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework
Milestone

Comments

@wmitros
Copy link
Contributor

wmitros commented Oct 17, 2023

Seen in https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/4217/testReport/junit/(root)/test_raft_cluster_features/Tests___Unit_Tests___test_rolling_upgrade_happy_path_5/

AssertionError: Deadline exceeded, failing test.

manager = <test.pylib.manager_client.ManagerClient object at 0x7fed911b51d0>

    @pytest.mark.asyncio
    async def test_rolling_upgrade_happy_path(manager: ManagerClient) -> None:
        for _ in range(3):
            await manager.server_add()
>       await test_cluster_features.test_rolling_upgrade_happy_path(manager)

test/topology_experimental_raft/test_raft_cluster_features.py:24: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test/topology/test_cluster_features.py:92: in test_rolling_upgrade_happy_path
    await asyncio.gather(*(wait_for_feature(TEST_FEATURE_NAME, cql, h, time.time() + 60) for h in hosts))
test/pylib/util.py:131: in wait_for_feature
    await wait_for(feature_is_enabled, deadline)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

pred = <function wait_for_feature.<locals>.feature_is_enabled at 0x7fed906d8a40>
deadline = 1697493961.5800846, period = 1

    async def wait_for(
            pred: Callable[[], Awaitable[Optional[T]]],
            deadline: float, period: float = 1) -> T:
        while True:
>           assert(time.time() < deadline), "Deadline exceeded, failing test."
E           AssertionError: Deadline exceeded, failing test.

test/pylib/util.py:45: AssertionError

Ran on an m5ad.8xlarge instance, so shouldn't be an infra problem even though it's a timeout

@wmitros wmitros added the symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework label Oct 17, 2023
@mykaul
Copy link
Contributor

mykaul commented Oct 24, 2023

@kbr-scylla - please have someone look at the latest failure above.

@piodul
Copy link
Contributor

piodul commented Oct 24, 2023

The node logs from the original report are gone, but there are still some from the second report here.

This excerpt from https://jenkins.scylladb.com/job/scylla-master/job/build/1671/artifact/testlog/aarch64/debug/scylla-305.log looks very similar to #15728 (comment) and the explanation is the same - the topology coordinator failed to fence the previous coordinator, it throws an exception which causes the topology coordinator to exit and not do anything else.

INFO  2023-10-24 05:44:23,654 [shard 0:main] raft_group0 - gaining leadership
INFO  2023-10-24 05:44:23,660 [shard 0:stre] storage_service - raft topology: refreshing topology to check if it's synchronized with local metadata
ERROR 2023-10-24 05:44:23,672 [shard 0:stre] storage_service - raft topology: failed to fence previous coordinator raft::request_aborted (Request is aborted by a caller)
INFO  2023-10-24 05:44:23,676 [shard 0:stre] storage_service - raft_state_monitor_fiber aborted with raft::request_aborted (Request is aborted by a caller)

So it's a duplicate of #15728. It will be fixed by Gleb's patch which was already queued.

@piodul piodul closed this as not planned Won't fix, can't repro, duplicate, stale Oct 24, 2023
@piodul
Copy link
Contributor

piodul commented Nov 4, 2023

Unfortunately, it's not fixed yet, we had a similar failure today here: https://jenkins.scylladb.com/job/scylla-master/job/build/1684/

Logs are slightly different (there is a difference in the last message):

INFO  2023-11-04 09:13:14,211 [shard 0:strm] storage_service - raft topology: refreshing topology to check if it's synchronized with local metadata
INFO  2023-11-04 09:13:14,215 [shard 0:main] raft_group0 - gaining leadership
ERROR 2023-11-04 09:13:14,219 [shard 0:strm] storage_service - raft topology: failed to fence previous coordinator raft::request_aborted (Request is aborted by a caller)
INFO  2023-11-04 09:13:14,222 [shard 0:strm] storage_service - raft_state_monitor_fiber aborted with seastar::sleep_aborted (Sleep is aborted)

The fix mentioned above has a bug: although it catches the raft::request_aborted exception from start_operation/update_topology_state, it then calls sleep_abortable which throws an uncaught abort_requested_exception.

Will send a fix shortly.

@kbr-scylla
Copy link
Contributor

We should backport to 5.4 to prevent flakiness.
I'll wait ~one week before backporting.

kbr-scylla added a commit that referenced this issue Nov 16, 2023
…s_coordinator' from Piotr Dulikowski

When topology coordinator tries to fence the previous coordinator it
performs a group0 operation. The current topology coordinator might be
aborted in the meantime, which will result in a `raft::request_aborted`
exception being thrown. After the fix to #15728 was
merged, the exception is caught, but then `sleep_abortable` is called
which immediately throws `abort_requested_exception` as it uses the same
abort source as the group0 operation. The `fence_previous_coordinator`
function which does all those things is not supposed to throw
exceptions, if it does - it causes `raft_state_monitor_fiber` to exit,
completely disabling the topology coordinator functionality on that
node.

Modify the code in the following way:

- Catch `abort_requested_exception` thrown from `sleep_abortable` and
  exit the function if it happens. In addition to the described issue,
it will also handle the case when abort is requested while
`sleep_abortable` happens,
- Catch `raft::request_aborted` thrown from group0 operation, log the
  exception with lower verbosity and exit the function explicitly.

Finally, wrap both `fence_previous_coordinator` and `run` functions in a
`try` block with `on_fatal_internal_error` in the catch handler in order
to implement the behavior that adding `noexcept` was originally supposed
to introduce.

Fixes: #15747

Closes #15948

* github.com:scylladb/scylladb:
  raft topology: catch and abort on exceptions from topology_coordinator::run
  Revert "storage_service: raft topology: mark topology_coordinator::run function as noexcept"
  raft topology: don't print an error when fencing previous coordinator is aborted
  raft topology: handle abort exceptions from sleeping in fence_previous_coordinator

(cherry picked from commit 07e9522)
@kbr-scylla
Copy link
Contributor

Queued backport to 5.4 (will eventually end up in 2024.1).

@mykaul mykaul removed Backport candidate backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed labels Dec 13, 2023
@mykaul
Copy link
Contributor

mykaul commented Dec 13, 2023

Queued backport to 5.4 (will eventually end up in 2024.1).

Removed labels.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/raft P1 Urgent symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants