
Timeout on ARM in RaftAvailabilityTest.test_leader_transfers_recovery.acks=-1 #6902

Open
Lazin opened this issue Oct 24, 2022 · 3 comments · Fixed by #7297
Labels: area/raft, kind/bug


Lazin commented Oct 24, 2022

Version & Environment

Redpanda version (use rpk version): dev

The transfer_leadership admin API call times out (504): https://buildkite.com/redpanda/redpanda/builds/17112#01840352-dae2-40a6-9e26-a1a068b8f1bb

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/raft_availability_test.py", line 422, in test_leader_transfers_recovery
    self._transfer_leadership(admin, "kafka", self.topic,
  File "/root/tests/rptest/tests/raft_availability_test.py", line 146, in _transfer_leadership
    admin.transfer_leadership_to(topic=topic,
  File "/root/tests/rptest/services/admin.py", line 694, in transfer_leadership_to
    ret = self._request('post', path=path, node=leader)
  File "/root/tests/rptest/services/admin.py", line 335, in _request
    r.raise_for_status()
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Timeout for url: http://docker-rp-30:9644/v1/raft/1/transfer_leadership?target=1

JIRA Link: CORE-1055

@Lazin Lazin added kind/bug Something isn't working area/raft ci-failure labels Oct 24, 2022
@Lazin Lazin changed the title Timeout in RaftAvailabilityTest.test_leader_transfers_recovery.acks=-1 Timeout in RaftAvailabilityTest.test_leader_transfers_recovery.acks=-1 Oct 24, 2022
@dlex dlex self-assigned this Nov 7, 2022

dlex commented Nov 10, 2022

The error code in the test means that leadership indeed cannot be transferred, at least at that moment.

INFO  2022-10-23 07:00:10,029 [shard 1] admin_api_server - admin_server.cc:1399 - Leadership transfer request for raft group 1 to node {1}
INFO  2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2937 - Starting leadership transfer from {id: {3}, revision: {18}} to {id: {1}, revision: {18}} in term 1
TRACE 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2792 - transfer leadership: preparing target={id: {1}, revision: {18}}, dirty_offset=0  
TRACE 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2798 - transfer leadership: cleared oplock               
DEBUG 2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2821 - transfer leadership: starting node {id: {1}, revision: {18}} recovery          
INFO  2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2840 - transfer leadership: waiting for node {id: {1}, revision: {18}} to catch up  
TRACE 2022-10-23 07:00:10,030 [shard 1] raft - [follower: {id: {1}, revision: {18}}] [group_id:1, {kafka/topic-ffsauwcmof/0}] - recovery_stm.cc:535 - Finished recovery
INFO  2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:2856 - transfer leadership: finished waiting on node {id: {1}, revision: {18}} recovery 
WARN  2022-10-23 07:00:10,030 [shard 1] raft - [group_id:1, {kafka/topic-ffsauwcmof/0}] consensus.cc:3021 - Cannot transfer leadership: {id: {1}, revision: {18}} needs recovery (-9223372036854775808, -9223372036854775808, 0)
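
The three numbers in the warning are the follower's match_index, its last_dirty_log_index, and the leader's dirty_offset; the first two are still at the int64 minimum (i.e. uninitialized) while the leader's dirty offset is 0. As a rough sketch of why that trips the check, needs_recovery() presumably compares the follower's indexes against the leader's dirty offset; the struct and comparison below are assumptions for illustration, not the actual Redpanda source:

    #include <cstdint>

    // Hypothetical sketch only: a follower whose indexes trail the leader's
    // dirty offset still has entries to replicate, and an uninitialized
    // index (int64 min) always trails it.
    struct follower_meta {
        int64_t match_index;
        int64_t last_dirty_log_index;
    };

    bool needs_recovery(const follower_meta& meta, int64_t leader_dirty_offset) {
        return meta.match_index < leader_dirty_offset
               || meta.last_dirty_log_index < leader_dirty_offset;
    }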

Looking further to understand the root cause.

@jcsp jcsp changed the title Timeout in RaftAvailabilityTest.test_leader_transfers_recovery.acks=-1 Timeout on ARM in RaftAvailabilityTest.test_leader_transfers_recovery.acks=-1 Nov 11, 2022
@dlex dlex added the sev/high loss of availability, pathological performance degradation, recoverable corruption label Nov 11, 2022

jcsp commented Nov 22, 2022

Reopening because it's not clear if #7297 is the end of the story, or if there is more work here to deal with how the consensus object got into that state to begin with:
#7297 (comment)

@jcsp jcsp reopened this Nov 22, 2022
@jcsp jcsp removed ci-failure sev/high loss of availability, pathological performance degradation, recoverable corruption labels Nov 22, 2022

dlex commented Nov 22, 2022

A prior chat with @jcsp for reference.

Alexey Biryukov

As I can see from the log, recovery of the follower that leadership is being transferred to starts and immediately finishes, as if no recovery were needed. However, that is not the case: needs_recovery() still returns true.

In order for that to happen, recovery_stm::is_recovery_finished() must return true immediately. Its last statement cannot do that (we know the values from the log), and the follower's meta is still there (the log proves that), so it must come down to the preliminary checks. There is nothing about shutting down in the log, so _ptr->_as.abort_requested() || _ptr->_bg.is_closed() are false. The current leader was elected just a millisecond ago, but recovery_stm is created after that, so _term != _ptr->term() || !_ptr->is_elected_leader() are false. That leaves the _stop_requested flag, which in this case reflects either missing meta or is_recovery_finished() having returned false before.
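
As a compact restatement of those preliminary checks (the field and method names come from the message above; the flattened struct and helper below are hypothetical, not the actual recovery_stm source):

    #include <cstdint>

    // State the early-exit checks depend on, flattened for the sketch.
    struct recovery_stm_state {
        bool abort_requested;   // _ptr->_as.abort_requested()
        bool bg_closed;         // _ptr->_bg.is_closed()
        bool stop_requested;    // _stop_requested
        int64_t stm_term;       // _term, captured when the recovery_stm was created
        int64_t current_term;   // _ptr->term()
        bool elected_leader;    // _ptr->is_elected_leader()
    };

    // Any one of these makes recovery look "finished" before the follower's
    // offsets are ever compared against the leader's.
    bool preliminary_checks_finish_recovery(const recovery_stm_state& s) {
        return s.abort_requested || s.bg_closed  // shutting down: ruled out by the log
               || s.stm_term != s.current_term   // term changed: stm was created after the election
               || !s.elected_leader               // lost leadership: also ruled out
               || s.stop_requested;               // the remaining suspect
    }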

John Spray

The leader thinks recovery is finished, but has not updated the follower stats to reflect that. It updates them 130µs later, the next time it gets a heartbeat reply from the node.

I think we have two bugs here:

  • A node's follower stats should be updated when we exit recovery; maybe recovery_stm should be doing this? I haven't looked in detail; maybe you can look and/or talk to Michal about it next week.
  • We are returning 504 instead of 503 in this case: it's not really a timeout, it's a flap (we thought recovery was done but it isn't). Even without the first bug this can happen, and it should be a 503 (please retry) rather than a 504 (we couldn't do it in time).

For the second one, can you open a PR to change this block:

          auto& meta = _fstats.get(target_rni);
          if (needs_recovery(meta, _log.offsets().dirty_offset)) {
              vlog(
                _ctxlog.warn,
                "Cannot transfer leadership: {} needs recovery ({}, {}, "
                "{})",
                target_rni,
                meta.match_index,
                meta.last_dirty_log_index,
                _log.offsets().dirty_offset);

              return seastar::make_ready_future<std::error_code>(
                make_error_code(errc::timeout));
          }

If we return errc::exponential_backoff there instead, then admin_server.cc will convert it to a 503 instead of a 504.
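
For reference, the suggested change amounts to swapping the returned error code at the end of that block, roughly as follows (a sketch of the suggestion above, not a tested patch):

              // Ask the caller to retry rather than reporting a timeout;
              // per the comment above, admin_server.cc maps this error code
              // to HTTP 503 instead of 504.
              return seastar::make_ready_future<std::error_code>(
                make_error_code(errc::exponential_backoff));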
