raft: amend error code when leadership transfer can't proceed due to recovery #7297

Merged
merged 1 commit into from
Nov 22, 2022

@dlex (Contributor) commented Nov 15, 2022

Cover letter

In do_transfer_leadership(), if a follower is still not caught up after prepare_transfer_leadership() completes, a timeout was returned. However, it is not really a timeout; it is a flap (we thought recovery was done, but it was not). This commit changes the error to exponential_backoff so that the admin API returns a 503 (please retry) in that case rather than a 504 (we couldn't do it in time).
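A minimal sketch of the reinterpretation described above, assuming a simplified error enum and a direct error-to-status mapping. The names `transfer_errc`, `http_status_for`, and `result_after_prepare` are hypothetical and are not taken from the Redpanda source; only the `timeout`/504 and `exponential_backoff`/503 pairing comes from this PR.

```cpp
// Hypothetical, simplified error codes for illustration only.
enum class transfer_errc {
    success,
    timeout,             // the operation ran out of time
    exponential_backoff, // transient condition; the caller should retry later
};

// Assumed mapping from error code to HTTP status, following the cover letter:
// exponential_backoff -> 503 (please retry), timeout -> 504 (not done in time).
constexpr int http_status_for(transfer_errc ec) {
    switch (ec) {
    case transfer_errc::success:
        return 200;
    case transfer_errc::exponential_backoff:
        return 503;
    case transfer_errc::timeout:
        return 504;
    }
    return 500; // unreachable for the enumerators above
}

// Sketch of the decision made once the catch-up phase has finished.
// `follower_needs_recovery` stands in for the real follower-state check.
constexpr transfer_errc result_after_prepare(bool follower_needs_recovery) {
    if (follower_needs_recovery) {
        // Before this change the equivalent branch reported a timeout;
        // after it, the condition is reported as a retryable backoff.
        return transfer_errc::exponential_backoff;
    }
    return transfer_errc::success;
}
```

Under these assumptions, `http_status_for(result_after_prepare(true))` yields 503, whereas the old behaviour would have surfaced as 504.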

Fixes #6902

This is not a fix for the root cause of the issue, only a change in how the error is interpreted.

Backport Required

  • not a bug fix
  • issue does not exist in previous branches
  • papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX changes

  • Change HTTP error code when leadership transfer can't proceed due to recovery from 504 to 503

Release notes

Improvements

  • Change HTTP error code when leadership transfer can't proceed due to recovery from 504 to 503
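For callers of the admin API, the practical effect is that this condition can now be treated as retryable. The following is a rough client-side sketch, not part of this PR: `transfer_with_retry` and the `send_request` callback are hypothetical, and the attempt count and delays are arbitrary. It retries only on 503 and treats every other status, including 504, as final.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: retries a leadership transfer request while the
// server answers 503, doubling the delay between attempts.
// `send_request` is an assumed callback that performs the HTTP call and
// returns the status code.
inline int transfer_with_retry(
  const std::function<int()>& send_request,
  int max_attempts = 5,
  std::chrono::milliseconds initial_delay = std::chrono::milliseconds(100)) {
    auto delay = initial_delay;
    int status = 0;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        status = send_request();
        if (status != 503) {
            // Success or an error that this sketch does not retry (e.g. 504).
            return status;
        }
        if (attempt + 1 < max_attempts) {
            std::this_thread::sleep_for(delay);
            delay *= 2; // exponential backoff
        }
    }
    return status; // still 503 after the last attempt
}
```

A caller would pass a lambda that issues the actual request; whether 504 should also be retried is a policy decision left out of this sketch.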

In do_transfer_leadership(), if a follower is still not caught up
after prepare_transfer_leadership() is done, a `timeout` was returned.
However, it's not really a timeout, it's a flap (we thought recovery
was done but it's not). This commit changes it to `exponential_backoff`
so that the admin API returns a 503 (please retry) for that rather than
a 504 (we couldn't do it in time).
@dlex dlex requested a review from jcsp November 16, 2022 20:23
@jcsp jcsp changed the title from "It is a backoff if a follower still needs recovery after an attempt of catching up to transfer leadership" to "raft: amend error code when leadership transfer can't proceed due to recovery" Nov 16, 2022
@jcsp (Contributor) commented Nov 16, 2022

I amended the title to be a bit more succinct.

Did you mean to mark this as Fixes #6902, or are you still looking at the root cause of how we exited recovery with follower stats not up to date?

@dlex (Contributor, Author) commented Nov 21, 2022

I was looking into the failure of the k8s-operator CI tests (with only limited support from devprod), but I have been unable to identify the root cause so far. However, I strongly think that the failure is unrelated, so retrying...

@dlex (Contributor, Author) commented Nov 22, 2022

/backport v22.3.x

@dlex (Contributor, Author) commented Nov 22, 2022

/backport v22.2.x

@andrewhsu (Member) commented Nov 23, 2022

FYI, I changed the release notes section of the PR body from "none", since it looks like this PR changes how an API behaves.

@dlex (Contributor, Author) commented Nov 24, 2022

/backport v22.1.x

Successfully merging this pull request may close these issues.

Timeout on ARM in RaftAvailabilityTest.test_leader_transfers_recovery.acks=-1