
Timeout on ARM in RaftAvailabilityTest.test_leader_transfers_recovery.acks=-1 #6902

Open
Lazin opened this issue Oct 24, 2022 · 3 comments · Fixed by #7297
Labels: area/raft, kind/bug


Lazin commented Oct 24, 2022

Version & Environment

Redpanda version (use rpk version): dev

The transfer_leadership admin API call times out (504): https://buildkite.com/redpanda/redpanda/builds/17112#01840352-dae2-40a6-9e26-a1a068b8f1bb

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/raft_availability_test.py", line 422, in test_leader_transfers_recovery
    self._transfer_leadership(admin, "kafka", self.topic,
  File "/root/tests/rptest/tests/raft_availability_test.py", line 146, in _transfer_leadership
    admin.transfer_leadership_to(topic=topic,
  File "/root/tests/rptest/services/admin.py", line 694, in transfer_leadership_to
    ret = self._request('post', path=path, node=leader)
  File "/root/tests/rptest/services/admin.py", line 335, in _request
    r.raise_for_status()
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Timeout for url: http://docker-rp-30:9644/v1/raft/1/transfer_leadership?target=1

JIRA Link: CORE-1055

@Lazin Lazin added kind/bug Something isn't working area/raft ci-failure labels Oct 24, 2022
@Lazin Lazin changed the title Timeout in RaftAvailabilityTest.test_leader_transfers_recovery.acks=-1 Timeout in RaftAvailabilityTest.test_leader_transfers_recovery.acks=-1 Oct 24, 2022
@dlex dlex self-assigned this Nov 7, 2022

dlex commented Nov 10, 2022

The error code in the test means that leadership indeed cannot be transferred, at least at that moment.

INFO  2022-10-23 07:00:10,029 [shard 1] admin_api_server - admin_server.cc:1399 - Leadership transfer request for raft group 1 to node {1}
INFO  2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2937 - Starting leadership transfer from {id: {3}, revision: {18}} to {id: {1}, revision: {18}} in term 1
TRACE 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2792 - transfer leadership: preparing target={id: {1}, revision: {18}}, dirty_offset=0  
TRACE 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2798 - transfer leadership: cleared oplock               
DEBUG 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2821 - transfer leadership: starting node {id: {1}, revision: {18}} recovery          
INFO  2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2840 - transfer leadership: waiting for node {id: {1}, revision: {18}} to catch up  
TRACE 2022-10-23 07:00:10,030 [shard 1] raft - [follower: {id: {1}, revision: {18}}] [group_id:1, {kafka/topic-ffsauwcmof/0}] - recovery_stm.cc:535 - Finished recovery
INFO  2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2856 - transfer leadership: finished waiting on node {id: {1}, revision: {18}} recovery 
WARN  2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:3021 - Cannot transfer leadership: {id: {1}, revision: {18}} needs recovery (-9223372036854775808, -9223372036854775808, 0)
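
The three numbers in the warning are the follower's match_index, its last_dirty_log_index, and the leader's dirty_offset; the first two are still at the int64 minimum (i.e. uninitialized) while the leader's dirty offset is 0. As a rough sketch of why that trips the check, needs_recovery() presumably compares the follower's indexes against the leader's dirty offset; the struct and comparison below are assumptions for illustration, not the actual Redpanda source:

    #include <cstdint>

    // Hypothetical sketch only: a follower whose indexes trail the leader's
    // dirty offset still has entries to replicate, and an uninitialized
    // index (int64 min) always trails it.
    struct follower_meta {
        int64_t match_index;
        int64_t last_dirty_log_index;
    };

    bool needs_recovery(const follower_meta& meta, int64_t leader_dirty_offset) {
        return meta.match_index < leader_dirty_offset
               || meta.last_dirty_log_index < leader_dirty_offset;
    }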

Looking further to understand the root cause.

@jcsp jcsp changed the title Timeout in RaftAvailabilityTest.test_leader_transfers_recovery.acks=-1 Timeout on ARM in RaftAvailabilityTest.test_leader_transfers_recovery.acks=-1 Nov 11, 2022
@dlex dlex added the sev/high loss of availability, pathological performance degradation, recoverable corruption label Nov 11, 2022

jcsp commented Nov 22, 2022

Reopening because it's not clear if #7297 is the end of the story, or if there is more work here to deal with how the consensus object got into that state to begin with:
#7297 (comment)

@jcsp jcsp reopened this Nov 22, 2022
@jcsp jcsp removed ci-failure sev/high loss of availability, pathological performance degradation, recoverable corruption labels Nov 22, 2022

dlex commented Nov 22, 2022

A prior chat with @jcsp for reference.

Alexey Biryukov

As I can see from the log, recovery of the follower that leadership is being transferred to starts and immediately finishes, as if no recovery were needed. However, that is not the case: needs_recovery() still returns true.

In order for that to happen, recovery_stm::is_recovery_finished() must return true immediately. Its last statement cannot do that (we know the values from the log), and the follower's meta is still there (the log proves that), so it must come down to the preliminary checks. There is nothing about shutting down in the log, so _ptr->_as.abort_requested() || _ptr->_bg.is_closed() are false. The current leader was elected just a millisecond ago, but recovery_stm is created after that, so _term != _ptr->term() || !_ptr->is_elected_leader() are false. That leaves the _stop_requested flag, which in this case reflects either missing meta or is_recovery_finished() having returned false before.
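
As a compact restatement of those preliminary checks (the field and method names come from the message above; the flattened struct and helper below are hypothetical, not the actual recovery_stm source):

    #include <cstdint>

    // State the early-exit checks depend on, flattened for the sketch.
    struct recovery_stm_state {
        bool abort_requested;   // _ptr->_as.abort_requested()
        bool bg_closed;         // _ptr->_bg.is_closed()
        bool stop_requested;    // _stop_requested
        int64_t stm_term;       // _term, captured when the recovery_stm was created
        int64_t current_term;   // _ptr->term()
        bool elected_leader;    // _ptr->is_elected_leader()
    };

    // Any one of these makes recovery look "finished" before the follower's
    // offsets are ever compared against the leader's.
    bool preliminary_checks_finish_recovery(const recovery_stm_state& s) {
        return s.abort_requested || s.bg_closed  // shutting down: ruled out by the log
               || s.stm_term != s.current_term   // term changed: stm was created after the election
               || !s.elected_leader               // lost leadership: also ruled out
               || s.stop_requested;               // the remaining suspect
    }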

John Spray

The leader thinks recovery is finished, but has not updated the follower stats to reflect that. It updates them 130µs later, the next time it gets a heartbeat reply from the node.

I think we have two bugs here:

  • A node's follower stats should be updated when we exit recovery; maybe recovery_stm should be doing this? I haven't looked in detail; maybe you can look and/or talk to Michal about it next week.
  • We are returning 504 instead of 503 in this case: it's not really a timeout, it's a flap (we thought recovery was done but it isn't). Even without the first bug this can happen, and it should be a 503 (please retry) rather than a 504 (we couldn't do it in time).

For the second one, can you open a PR to change this block:

          auto& meta = _fstats.get(target_rni);
          if (needs_recovery(meta, _log.offsets().dirty_offset)) {
              vlog(
                _ctxlog.warn,
                "Cannot transfer leadership: {} needs recovery ({}, {}, "
                "{})",
                target_rni,
                meta.match_index,
                meta.last_dirty_log_index,
                _log.offsets().dirty_offset);

              return seastar::make_ready_future<std::error_code>(
                make_error_code(errc::timeout));
          }

If we return errc::exponential_backoff there instead, then admin_server.cc will convert it to a 503 instead of a 504.
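
For reference, the suggested change amounts to swapping the returned error code at the end of that block, roughly as follows (a sketch of the suggestion above, not a tested patch):

              // Ask the caller to retry rather than reporting a timeout;
              // per the comment above, admin_server.cc maps this error code
              // to HTTP 503 instead of 504.
              return seastar::make_ready_future<std::error_code>(
                make_error_code(errc::exponential_backoff));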
