
Fixed unavailability during follower & leader isolation #2856

Merged

Conversation

@mmaslankaprv (Member) commented Nov 2, 2021

Cover letter

Fixed an availability issue occurring during follower isolation. The follower isolation chaos test isolates one of the followers from the other group members (it uses iptables to drop all packets targeting the isolated follower). In this situation the leader is supposed to stop sending requests to the isolated follower. The bug that has been fixed caused the leader to keep sending append entries requests to the follower even though the follower was unreachable. This triggered backpressure propagation (based on a constant limit on the number of in-flight requests per follower). The backpressure was propagated to the client, which resulted in timeouts.
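To make the mechanism concrete, here is a minimal C++ sketch of the idea behind the fix. All names (`follower_state`, `is_responsive`, `should_dispatch_append_entries`) and the timeout policy are hypothetical illustrations, not Redpanda's actual implementation: the point is only that the last sent and last received append entries timestamps are tracked separately, so a follower that stops replying is eventually treated as unresponsive and the leader stops piling up in-flight requests toward it.

```cpp
#include <chrono>
#include <iostream>

using clock_type = std::chrono::steady_clock;

// Hypothetical per-follower bookkeeping; the real structure is richer,
// this only demonstrates separating "sent" from "received".
struct follower_state {
    clock_type::time_point last_sent_append_entries_req_timestamp{};
    clock_type::time_point last_received_reply_timestamp{};
    int in_flight_requests{0};
};

// A follower is considered responsive only if a reply arrived recently.
// Because sending a request no longer touches the reply timestamp, an
// unreachable follower cannot look "alive" just because we keep sending.
bool is_responsive(const follower_state& f,
                   clock_type::time_point now,
                   clock_type::duration timeout) {
    return now - f.last_received_reply_timestamp <= timeout;
}

// Dispatch policy sketch: stop queuing append entries for unresponsive
// followers so the per-follower in-flight limit does not back-pressure
// the Kafka client.
bool should_dispatch_append_entries(const follower_state& f,
                                    clock_type::time_point now,
                                    clock_type::duration timeout,
                                    int max_in_flight) {
    return is_responsive(f, now, timeout)
           && f.in_flight_requests < max_in_flight;
}

int main() {
    follower_state f;
    auto now = clock_type::now();
    f.last_received_reply_timestamp = now - std::chrono::seconds(30);
    f.last_sent_append_entries_req_timestamp = now; // we kept sending...
    std::cout << std::boolalpha
              << should_dispatch_append_entries(
                     f, now, std::chrono::seconds(10), 8)
              << '\n'; // false: replies stopped, so do not queue more
}
```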

Metadata response heuristics

We use a simple heuristic to tolerate isolation of a node hosting both a partition leader and a follower. Kafka clients request a metadata refresh when they receive an error related to stale metadata, e.g. NOT_LEADER. A metadata request can be processed by any broker and there is no general rule for which broker to refresh metadata from (e.g. the Java Kafka client picks the least loaded broker among its active connections). This may lead to a situation in which the client always asks the same broker for metadata. When that broker is isolated from the rest of the cluster it will never update its metadata view, so the client will keep receiving stale metadata.

This behavior may lead to a livelock in the event of a network partition. If the current partition leader is isolated from the cluster, it will keep answering with its own id in the leader_id field for that partition (according to the policy of returning the former leader; from that broker's perspective there is no leader, it is a candidate). The client will retry the produce or fetch request and receive a NOT_LEADER error, which will force it to request a metadata update; the broker will respond with the same metadata, and the whole cycle will loop indefinitely.

In order to break the loop and force the client to make progress we use the following heuristics:

  1. when the current leader is unknown, return the former leader (Kafka behavior)
  2. when the current leader is unknown and the previous leader is equal to the current node id, select a random replica_id as the leader (indicating leader isolation)

With those heuristics we always force the client to communicate with nodes that may not be partitioned.
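For illustration, a minimal C++ sketch of such a heuristic; the function `metadata_leader_id` and its signature are hypothetical and not the actual Redpanda API:

```cpp
#include <cstdint>
#include <iostream>
#include <optional>
#include <random>
#include <vector>

using node_id = int32_t;

// Hypothetical helper implementing the two rules above: prefer the known
// leader, otherwise fall back to the previous leader, and if the previous
// leader is this very node (likely an isolated ex-leader), report a random
// other replica so the client refreshes metadata from a reachable broker.
std::optional<node_id> metadata_leader_id(
  std::optional<node_id> current_leader,
  std::optional<node_id> previous_leader,
  node_id self,
  const std::vector<node_id>& replicas) {
    if (current_leader) {
        return current_leader;
    }
    if (previous_leader == self) {
        std::vector<node_id> others;
        for (auto r : replicas) {
            if (r != self) {
                others.push_back(r);
            }
        }
        if (!others.empty()) {
            static thread_local std::mt19937 gen{std::random_device{}()};
            std::uniform_int_distribution<size_t> dist(0, others.size() - 1);
            return others[dist(gen)];
        }
    }
    // Rule 1 (Kafka behavior): return the former leader, if any.
    return previous_leader;
}

int main() {
    // Leader unknown and this node (id 1) was the previous leader:
    // report some other replica so the client contacts node 2 or 3.
    auto leader = metadata_leader_id(std::nullopt, 1, 1, {1, 2, 3});
    std::cout << *leader << '\n';
}
```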

Release notes

Release note: [1-2 sentences of what this PR changes]

@jcsp (Contributor) commented Nov 2, 2021

Does this issue require the follower be network partitioned, or can it just be down? If just down, then this seems like something we should be able to cover in a ducktape test (there's RaftAvailabilityTest.test_one_node_down that might be a good starting point)

@@ -210,7 +210,7 @@ In order to make it possible new fields have to be added to
     // next index to send to this follower
     model::offset next_index;
     // timestamp of last append_entries_rpc call
-    clock_type::time_point last_append_timestamp;
+    clock_type::time_point last_sent_append_entries_req_timesptamp;
A contributor commented on this change:

Big 👍 for clearer names

jcsp previously approved these changes Nov 2, 2021

@jcsp (Contributor) left a comment:

The code change LGTM, would be interested in thoughts on whether we can encapsulate this in a ducktape test (chaos tests are good, but this seems like a simple enough situation to have a dedicated test for)

@dotnwat (Member) commented Nov 3, 2021

> Does this issue require the follower be network partitioned, or can it just be down? If just down, then this seems like something we should be able to cover in a ducktape test (there's RaftAvailabilityTest.test_one_node_down that might be a good starting point)

Are you thinking about network partition vs. down in terms of being able to create the test in ducktape, or something more nuanced? Assuming the test would drive everything implicitly through the leader and that the client can reach the leader, then from a Raft perspective a network partition seems like it would be the same as the follower being down.

@dotnwat self-requested a review November 3, 2021 21:23
@dotnwat (Member) left a comment:

There is a connection missing for me going from the problem:

> the leader was propagating backpressure to the Kafka API even though the follower was down

to the solution:

> do not update the reply timestamp when sending a request

Is it because the leader would wait on a quorum of responses that never arrived?

  1. clarifying this in the PR would be really useful for future rafters
  2. is there a related chaos test we can reference as a reproducer?

@jcsp (Contributor) commented Nov 3, 2021

> Are you thinking about network partition vs. down in terms of being able to create the test in ducktape, or something more nuanced? Assuming the test would drive everything implicitly through the leader and that the client can reach the leader, then from a Raft perspective a network partition seems like it would be the same as the follower being down.

I was thinking of network partitions in the general sense of including cases where a node was only isolated from some of its peers (not sure how crazy the chaos tests get with that kind of thing). It seems like this is really just a straight "node down" situation though.

@dotnwat (Member) commented Nov 4, 2021

> > Are you thinking about network partition vs. down in terms of being able to create the test in ducktape, or something more nuanced? Assuming the test would drive everything implicitly through the leader and that the client can reach the leader, then from a Raft perspective a network partition seems like it would be the same as the follower being down.
>
> I was thinking of network partitions in the general sense of including cases where a node was only isolated from some of its peers (not sure how crazy the chaos tests get with that kind of thing). It seems like this is really just a straight "node down" situation though.

Got it. IIRC, in Raft there is only leader <-> follower communication, so if we're thinking about a single Raft group it might be the same scenario here.

@mmaslankaprv (Member, Author) commented:

> There is a connection missing for me going from the problem:
>
> > the leader was propagating backpressure to the Kafka API even though the follower was down
>
> to the solution:
>
> > do not update the reply timestamp when sending a request
>
> Is it because the leader would wait on a quorum of responses that never arrived?
>
> 1. clarifying this in the PR would be really useful for future rafters
> 2. is there a related chaos test we can reference as a reproducer?

1. Done
2. There is a failure scenario called isolate_follower; I mentioned it in the PR description.

@dotnwat self-requested a review November 4, 2021 14:18
dotnwat previously approved these changes Nov 4, 2021
@mmaslankaprv dismissed stale reviews from dotnwat and jcsp via eb60c02 November 4, 2021 16:40
@mmaslankaprv added this to the v21.10.1 milestone Nov 4, 2021
@mmaslankaprv force-pushed the raft-availability-fixes branch 5 times, most recently from b11d060 to 6622484 on November 5, 2021 11:14
rystsov previously approved these changes Nov 6, 2021

@rystsov (Contributor) left a comment:

Looks good, the changes are crisp, very easy to read!

Signed-off-by: Michal Maslanka <michal@vectorized.io>
Do not update last received append entries reply timestamp when updating
last sent append entries timestamp.

Signed-off-by: Michal Maslanka <michal@vectorized.io>
Kafka Metadata API responses should contain a leader id even though the partition leader is unknown at a given instant of time. In Kafka, the previous partition leader is returned when the leader id is unknown. In order to act according to Kafka's behavior without losing the ability to track the current leader state, the previous leader id is now stored in the partition leaders table.

Signed-off-by: Michal Maslanka <michal@vectorized.io>
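For illustration only, a hypothetical sketch of what tracking a previous leader next to the current one could look like; `leader_info` and its fields are invented names, not the real partition leaders table:

```cpp
#include <cstdint>
#include <optional>

using node_id = int32_t;

// Hypothetical leader-table entry: remembering the previous leader lets the
// Kafka Metadata handler answer with a former leader while the current
// leadership state (possibly "unknown") is still tracked accurately.
struct leader_info {
    std::optional<node_id> current_leader;  // empty while leadership is unknown
    std::optional<node_id> previous_leader; // last known leader, if any
    int64_t term{0};

    void update(std::optional<node_id> new_leader, int64_t new_term) {
        // Remember the outgoing leader before overwriting it, so a later
        // "leader unknown" state can still report who held leadership last.
        if (current_leader) {
            previous_leader = current_leader;
        }
        current_leader = new_leader;
        term = new_term;
    }
};

int main() {
    leader_info info;
    info.update(node_id{1}, 1);               // node 1 elected in term 1
    info.update(std::nullopt, 2);             // leadership lost: current unknown
    return info.previous_leader == 1 ? 0 : 1; // previous leader is remembered
}
```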
@mmaslankaprv changed the title from "Fixed unavailability during follower isolation" to "Fixed unavailability during follower & leader isolation" Nov 8, 2021
Signed-off-by: Michal Maslanka <michal@vectorized.io>
Signed-off-by: Michal Maslanka <michal@vectorized.io>
We use a simple heuristic to tolerate isolation of a node hosting both a partition leader and a follower.

Kafka clients request a metadata refresh when they receive an error related to stale metadata, e.g. NOT_LEADER. A metadata request can be processed by any broker and there is no general rule for which broker to refresh metadata from (e.g. the Java Kafka client picks the least loaded broker among its active connections). This may lead to a situation in which the client always asks the same broker for metadata. When that broker is isolated from the rest of the cluster it will never update its metadata view, so the client will keep receiving stale metadata.

This behavior may lead to a livelock in the event of a network partition. If the current partition leader is isolated from the cluster, it will keep answering with its own id in the leader_id field for that partition (according to the policy of returning the former leader; from that broker's perspective there is no leader, it is a candidate). The client will retry the produce or fetch request and receive a NOT_LEADER error, which will force it to request a metadata update; the broker will respond with the same metadata, and the whole cycle will loop indefinitely.

In order to break the loop and force the client to make progress we use the following heuristics:

1) when the current leader is unknown, return the former leader (Kafka behavior)
2) when the current leader is unknown and the previous leader is equal to the current node id, select a random replica_id as the leader (indicating leader isolation)

With those heuristics we always force the client to communicate with nodes that may not be partitioned.

Signed-off-by: Michal Maslanka <michal@vectorized.io>
Signed-off-by: Michal Maslanka <michal@vectorized.io>
Signed-off-by: Michal Maslanka <michal@vectorized.io>
Signed-off-by: Michal Maslanka <michal@vectorized.io>
@emaxerrno (Contributor) commented:

@mmaslankaprv and @rystsov - great work! let's backport to 21.10.x and 21.11.x

@mmaslankaprv merged commit 1c2a530 into redpanda-data:dev Nov 9, 2021
dotnwat added a commit that referenced this pull request Nov 11, 2021