Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster: fix leader balancer using low level leadership transfer interface #8941

Merged
merged 7 commits into from
Feb 17, 2023

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Feb 16, 2023

The behavior in #8560 relied on leadership transfers flowing through cluster::partition, but the leader balancer was using an RPC that calls directly into consensus::transfer_leadership. The cluster::partition path was only being taken on maintenance mode leadership drains and explicit admin API leader transfers.

This would also have meant that the leader balancer was skipping the graceful transfer of transaction state.

In this PR:

  • Add debug logging to make it more obvious what is happening if the archiver part of leadership transfer doesn't appear to be working properly.
  • Create a new RPC in controller_service to do leadership transfer via cluster::partition
  • Use this new RPC from leader balancer, with a fallback to the old RPC if the new one is not available. This will make upgrades work without requiring a feature flag, which is important to enable backporting this.
  • Tweak the logic in ntp_archiver to continue with uploading a manifest after uploading segments, even if in the process of a leadership transfer.

Fixes: #8745

Backports Required

  • none - too onerous to backport
  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

None

Release Notes

Bug fixes

  • A bug is fixed where background leadership balancing could unexpectedly interrupt transactional workloads.

@jcsp
Copy link
Contributor Author

jcsp commented Feb 16, 2023

/ci-repeat 5 debug skip-unit

@jcsp jcsp force-pushed the debug-archival-leadership-transfer branch from 875220d to 5491c26 Compare February 17, 2023 10:48
This makes it easier to see what happened in a test log if we
unexpectedly fail to gracefully quiesce the archiver during
leadership transfer.
This will call cluster::partition::transfer_leadership,
unlike the existing raft API that calls directly into
consensus::transfer_leadership.

This RPC will be safe for transactions and tiered storage,
which rely on extra preparation in cluster::partition
before doing the raft transfer.
This gives it flexibility to use raft client or
cluster client depending on which leadership transfer
API it will use (new or legacy).
...with a fallback to the legacy RPC if the new RPC is
not found.

This is done with a try+fallback approach rather than a feature
flag, so that the change is backportable without having to
bump the cluster version on a stable branch.
Previously we would drop out between writing segments
and writing manifest if we were transferring leadership.

What we want during a transfer is for the node to finish its
segment uploads, update the archival_metadata_stm so that
the new leader does not forget about these uploaded segments,
and upload the manifest to S3 so that the written segment
becomes visible promptly without having to wait for the
new leader to do its next segment upload to write the manifest.
@jcsp jcsp force-pushed the debug-archival-leadership-transfer branch from 5491c26 to 4b0785d Compare February 17, 2023 10:57
@jcsp
Copy link
Contributor Author

jcsp commented Feb 17, 2023

/ci-repeat 5 debug skip-unit

@jcsp jcsp changed the title cluster: debug logging around archival leadership transfer cluster: fix leader balancer using low level leadership transfer interface Feb 17, 2023
@jcsp jcsp marked this pull request as ready for review February 17, 2023 11:01
@jcsp jcsp added area/controller area/cloud-storage Shadow indexing subsystem and removed area/redpanda labels Feb 17, 2023
@jcsp jcsp merged commit bafa0e5 into redpanda-data:dev Feb 17, 2023
@jcsp jcsp deleted the debug-archival-leadership-transfer branch February 17, 2023 14:43
@vshtokman
Copy link
Contributor

/backport v22.3.x

@vbotbuildovich
Copy link
Collaborator

Failed to run cherry-pick command. I executed the below command:

git cherry-pick -x d7a66b8d74485ead58355f05442fadae7552614f fd8cae2ba84295e2b94235102f6abe22817071cb 42a224fdc1f11dad0e5b15cba50e57d9feb314e0 440a1976a324685473e07564de92d435f5eb08b1 c3b4218c87661ccdf9e96aa91d5535621579fbd4 6b4c45ae6fbd1b96d0838c30173c697f719503c4 4b0785deabb21da2f37d0ad5eb8291a0bca5ec62

Workflow run logs.

@vshtokman
Copy link
Contributor

@jcsp , could you look into backporting this when you have a chance?

@jcsp
Copy link
Contributor Author

jcsp commented Mar 8, 2023

When I marked this for backport, I must have forgotten that in v22.3.x we have a different code structure that makes this fix impractical (archivers do not belong to cluster::partition objects, so can't be cleanly handled during a leadership transfer). I think we're going to have to leave v22.3.x with the bad old behavior that leaks more orphan objects than we would like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cloud-storage Shadow indexing subsystem area/controller
Projects
None yet
4 participants