Inconsistent partition state between brokers during transfer of leadership #8724

Closed
JFlath opened this issue Feb 8, 2023 · 1 comment · Fixed by #8741
Labels: area/raft · kind/bug (Something isn't working) · sev/high (loss of availability, pathological performance degradation, recoverable corruption)
Milestone: v23.1

@JFlath (Contributor) commented Feb 8, 2023

Version & Environment

Redpanda version (rpk version): 22.2.7

What went wrong?

Observed in Cloud V1 that a partition was unable to successfully hand over leadership from node 2 to node 0.

When node 2 originally became leader, it reported:

2023-02-08T11:49:08.016557115Z stderr F INFO  2023-02-08 11:49:08,016 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] vote_stm.cc:276 - unable to replicate configuration as a leader - error code: 10 - raft::errc::timeout 

2023-02-08T11:49:06.239229932Z stderr F INFO  2023-02-08 11:49:06,239 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] vote_stm.cc:279 - became the leader term: 46

On a subsequent attempt to transfer leadership back to node 0, node 2 reported a loop of:

2023-02-08T14:01:27.415852486Z stderr F WARN  2023-02-08 14:01:27,415 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:2847 - transfer leadership: timed out waiting on node {id: {0}, revision: {45}} recovery
2023-02-08T14:01:19.188822006Z stderr F TRACE 2023-02-08 14:01:19,188 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:298 - Append entries response: {node_id: {id: {0}, revision: {45}}, target_node_id{id: {2}, revision: {45}}, group: {6}, term:{46}, last_dirty_log_index:{1444711}, last_flushed_log_index:{1444711}, last_term_base_offset:{-9223372036854775808}, result: failure}
2023-02-08T14:01:19.18865755Z stderr F TRACE 2023-02-08 14:01:19,188 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:498 - Updated node {id: {1}, revision: {45}} match 1444711 and next 1444712 indices
2023-02-08T14:01:19.188654974Z stderr F TRACE 2023-02-08 14:01:19,188 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:378 - Updated node {id: {1}, revision: {45}} last committed log index: 1444711
2023-02-08T14:01:19.188642488Z stderr F TRACE 2023-02-08 14:01:19,188 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:298 - Append entries response: {node_id: {id: {1}, revision: {45}}, target_node_id{id: {2}, revision: {45}}, group: {6}, term:{46}, last_dirty_log_index:{1444711}, last_flushed_log_index:{1444711}, last_term_base_offset:{-9223372036854775808}, result: success}
2023-02-08T14:01:19.188077944Z stderr F TRACE 2023-02-08 14:01:19,188 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:3166 - should_reconnect_follower({id: {0}, revision: {45}}): 0/3 fails, last ok 150ms ago
2023-02-08T14:01:19.038077907Z stderr F TRACE 2023-02-08 14:01:19,038 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:298 - Append entries response: {node_id: {id: {0}, revision: {45}}, target_node_id{id: {2}, revision: {45}}, group: {6}, term:{46}, last_dirty_log_index:{1444711}, last_flushed_log_index:{1444711}, last_term_base_offset:{-9223372036854775808}, result: failure}
2023-02-08T14:01:19.03792332Z stderr F TRACE 2023-02-08 14:01:19,037 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:498 - Updated node {id: {1}, revision: {45}} match 1444711 and next 1444712 indices
2023-02-08T14:01:19.037919199Z stderr F TRACE 2023-02-08 14:01:19,037 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:378 - Updated node {id: {1}, revision: {45}} last committed log index: 1444711

Node 0 reported:

2023-02-08T14:03:50.870559162Z stderr F DEBUG 2023-02-08 14:03:50,870 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:1721 - Rejecting append entries. mismatching entry term at offset: 1444711, current term: 45 request term: 46
2023-02-08T14:03:50.870555824Z stderr F TRACE 2023-02-08 14:03:50,870 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:1646 - Append entries request: {raft_group:{6}, commit_index:{1444711}, term:{46}, prev_log_index:{1444711}, prev_log_term:{46}, last_visible_index:{1444711}}
2023-02-08T14:03:50.719710032Z stderr F DEBUG 2023-02-08 14:03:50,719 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:1721 - Rejecting append entries. mismatching entry term at offset: 1444711, current term: 45 request term: 46
2023-02-08T14:03:50.719696017Z stderr F TRACE 2023-02-08 14:03:50,719 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:1646 - Append entries request: {raft_group:{6}, commit_index:{1444711}, term:{46}, prev_log_index:{1444711}, prev_log_term:{46}, last_visible_index:{1444711}}
2023-02-08T14:03:50.568880962Z stderr F DEBUG 2023-02-08 14:03:50,568 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:1721 - Rejecting append entries. mismatching entry term at offset: 1444711, current term: 45 request term: 46
2023-02-08T14:03:50.568877811Z stderr F TRACE 2023-02-08 14:03:50,568 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:1646 - Append entries request: {raft_group:{6}, commit_index:{1444711}, term:{46}, prev_log_index:{1444711}, prev_log_term:{46}, last_visible_index:{1444711}}
2023-02-08T14:03:50.418050098Z stderr F DEBUG 2023-02-08 14:03:50,418 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:1721 - Rejecting append entries. mismatching entry term at offset: 1444711, current term: 45 request term: 46
2023-02-08T14:03:50.418040572Z stderr F TRACE 2023-02-08 14:03:50,417 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:1646 - Append entries request: {raft_group:{6}, commit_index:{1444711}, term:{46}, prev_log_index:{1444711}, prev_log_term:{46}, last_visible_index:{1444711}}
2023-02-08T14:03:50.267242895Z stderr F DEBUG 2023-02-08 14:03:50,267 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:1721 - Rejecting append entries. mismatching entry term at offset: 1444711, current term: 45 request term: 46
2023-02-08T14:03:50.267232016Z stderr F TRACE 2023-02-08 14:03:50,267 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:1646 - Append entries request: {raft_group:{6}, commit_index:{1444711}, term:{46}, prev_log_index:{1444711}, prev_log_term:{46}, last_visible_index:{1444711}}
2023-02-08T14:03:50.116419554Z stderr F DEBUG 2023-02-08 14:03:50,116 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:1721 - Rejecting append entries. mismatching entry term at offset: 1444711, current term: 45 request term: 46
2023-02-08T14:03:50.116404928Z stderr F TRACE 2023-02-08 14:03:50,116 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:1646 - Append entries request: {raft_group:{6}, commit_index:{1444711}, term:{46}, prev_log_index:{1444711}, prev_log_term:{46}, last_visible_index:{1444711}}
2023-02-08T14:03:49.965604693Z stderr F DEBUG 2023-02-08 14:03:49,965 [shard 6] raft - [group_id:6, {kafka/__consumer_offsets/1}] consensus.cc:1721 - Rejecting append entries. mismatching entry term at offset: 1444711, current term: 45 request term: 46
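
For context, the rejection above is the standard Raft log-consistency check: the leader's append-entries request carries prev_log_index 1444711 with prev_log_term 46, while node 0's log holds an entry at that offset written in term 45, so every round fails. Normally the leader would then step back and resend the conflicting entries, which is the follower recovery that appears stuck here. A minimal sketch of that check (generic Raft logic in C++, not Redpanda's actual consensus.cc code; all names are illustrative):

```cpp
// Generic Raft append-entries consistency check (illustrative only,
// not Redpanda's implementation).
#include <cstdint>
#include <map>

struct append_entries_request {
    int64_t term;            // leader's term (46 in the logs above)
    int64_t prev_log_index;  // 1444711
    int64_t prev_log_term;   // term the leader recorded at that offset (46)
};

// 'log_terms' maps offset -> term of the entry the follower has stored
// at that offset.
bool accept_append_entries(const std::map<int64_t, int64_t>& log_terms,
                           const append_entries_request& r) {
    auto it = log_terms.find(r.prev_log_index);
    // Node 0 has offset 1444711 recorded under term 45 while the request
    // carries prev_log_term 46, so this returns false on every retry and
    // the follower keeps logging "Rejecting append entries. mismatching
    // entry term".
    return it != log_terms.end() && it->second == r.prev_log_term;
}
```
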
JFlath added the kind/bug label Feb 8, 2023
piyushredpanda added this to the v23.1 milestone Feb 8, 2023
piyushredpanda added the area/raft and sev/high labels Feb 8, 2023
@mmaslankaprv (Member) commented:
It looks like this is related to an issue I saw a few times in tests but wasn't able to reproduce, i.e. a reader stuck during recovery.

I analyzed the logs again and it looks like the current leader has follower 0 marked as being recovered (i.e. the leader is delivering missing records to that follower). I will add a timeout to the recovery read, logged at error level, so that we are protected and can see whether the issue occurs in tests.
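
A minimal sketch of the kind of guard described above, assuming Seastar-style futures (which Redpanda is built on); `do_recovery_read()`, the logger name, and the 10s deadline are hypothetical placeholders rather than the actual recovery_stm API:

```cpp
// Illustrative sketch only: wrap a (hypothetical) recovery read in a
// timeout and log at error level when it fires, so that a stuck reader
// shows up in the logs instead of silently stalling leadership transfer.
#include <chrono>
#include <seastar/core/future.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/with_timeout.hh>
#include <seastar/util/log.hh>

static seastar::logger raft_log("raft");

// Hypothetical placeholder for the read the leader issues while
// recovering a follower.
seastar::future<> do_recovery_read();

seastar::future<> guarded_recovery_read() {
    using namespace std::chrono_literals;
    // Assumed deadline; the real value would likely be a configuration knob.
    auto deadline = seastar::lowres_clock::now() + 10s;
    return seastar::with_timeout(deadline, do_recovery_read())
      .handle_exception_type([](const seastar::timed_out_error&) {
          // Error-level log so a stuck recovery reader is clearly visible.
          raft_log.error("recovery read timed out");
      });
}
```
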
