
cluster: enable controller replay if last_applied is ahead of log #5703

Merged
4 commits merged into redpanda-data:dev on Nov 28, 2022

Conversation

@jcsp (Contributor) commented Jul 28, 2022

Cover letter

This situation happens if someone deletes controller log
segments while leaving the kvstore in place.

Previously, the kvstore last_applied would cause the
node to hang waiting for the controller log to replay
to that offset. This is not a Redpanda bug per se, as it only happens
when the underlying system violates invariants about storage,
but it is a case where we can be more helpful.

Now, we log an error about the apparent inconsistency,
and proceed.

In general we do not want to ignore data inconsistency, but
this is a special case: deleting the controller log is something
a user might legitimately do in order to work around another
issue and force Redpanda to rebuild the local copy of the
controller log.
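The recovery behavior described above can be sketched as a small decision function. This is a hypothetical illustration, not the actual Redpanda code (which is C++): the function name, the `log_dirty_offset` parameter, and the message text are all invented for the sketch.

```python
# Hypothetical sketch of the startup decision described in the cover letter.
# If the kvstore's saved last_applied offset is ahead of the highest offset
# actually present in the controller log, segments were deleted externally;
# rather than hang waiting for replay to reach last_applied, log an error
# and replay only what the log actually contains.

def choose_replay_target(kvstore_last_applied: int, log_dirty_offset: int) -> int:
    """Return the offset that controller replay should wait for."""
    if kvstore_last_applied > log_dirty_offset:
        print(
            f"Inconsistency: kvstore last_applied {kvstore_last_applied} is "
            f"ahead of controller log end {log_dirty_offset}; controller log "
            "segments were probably deleted. Proceeding with remaining log."
        )
        return log_dirty_offset
    return kvstore_last_applied

# Normal case: the log contains everything the kvstore has applied.
assert choose_replay_target(100, 150) == 100
# Damaged case: segments deleted, so the replay target is clamped.
assert choose_replay_target(100, 40) == 40
```

Before this change, the equivalent of the damaged case would wait indefinitely for offset 100 to appear in a log that ends at 40.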

Fixes #4950

UX changes

None

Release notes

Improvements

  • Controller log replay is more resilient to unexpected removal of log on disk.

@jcsp jcsp added kind/enhance New feature or request area/controller labels Jul 28, 2022
@jcsp jcsp force-pushed the controller-replay-after-dataloss branch 2 times, most recently from 30887d4 to 2089526 Compare August 2, 2022 13:08
@jcsp jcsp marked this pull request as ready for review August 2, 2022 13:09
@jcsp jcsp force-pushed the controller-replay-after-dataloss branch from 2089526 to d1e3ff2 Compare November 24, 2022 20:04
@jcsp jcsp requested review from mmaslankaprv and removed request for dotnwat, ztlpn, NyaliaLui and VadimPlh November 24, 2022 20:55
Noticed this while writing a test with an off-by-one on
a node id. Internally the error handling is safe, but
in the API we're returning a 500 instead of a 400.

For tests that want to know "did node X log message Y" rather
than just "was message Y logged anywhere".

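The per-node log assertion that commit describes could look like the following. This is a hypothetical sketch, not the actual test helper from the PR: `node_logged` and its arguments are invented names.

```python
# Hypothetical test helper: scan one specific node's captured log lines
# for a pattern, so a test can assert "node X logged Y" instead of only
# "Y was logged somewhere in the cluster".
import re

def node_logged(log_lines, pattern):
    """Return True if any line of this node's log matches `pattern`."""
    rx = re.compile(pattern)
    return any(rx.search(line) for line in log_lines)

node_a = ["INFO boot complete", "ERROR apparent log inconsistency"]
node_b = ["INFO boot complete"]
assert node_logged(node_a, r"inconsistency")
assert not node_logged(node_b, r"inconsistency")
```

A per-node check matters here because only the node whose controller log was deleted should emit the inconsistency error.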
This test validates that it is possible to reset the
controller log on a single node by removing it, a procedure
occasionally used in the field when a cluster ends up in a
split-brain situation resulting from interference with a
node's storage.
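The reset the test exercises can be sketched as follows. This is a simulation against a scratch directory, not a live broker, and the on-disk partition paths (`redpanda/controller/0_0`, `redpanda/kvstore/0_0`) are assumptions about the data-directory layout that may differ between deployments.

```python
# Hypothetical sketch of the field procedure: with the broker stopped,
# remove only the controller log and leave the kvstore in place, then
# restart; the broker logs the inconsistency and rebuilds its local
# controller log. Simulated here on a throwaway directory tree.
import shutil
import tempfile
from pathlib import Path

data_dir = Path(tempfile.mkdtemp())  # stand-in for the broker's data dir
controller = data_dir / "redpanda" / "controller" / "0_0"  # assumed path
kvstore = data_dir / "redpanda" / "kvstore" / "0_0"        # assumed path
controller.mkdir(parents=True)
kvstore.mkdir(parents=True)
(controller / "0-1-v1.log").touch()  # placeholder for a log segment

# The procedure itself: delete the controller log, keep the kvstore.
shutil.rmtree(data_dir / "redpanda" / "controller")

assert not (data_dir / "redpanda" / "controller").exists()
assert kvstore.is_dir()  # kvstore intact: the state this PR now tolerates
```

Leaving the kvstore behind is exactly what produces the "last_applied ahead of log" condition this PR handles.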
@jcsp jcsp force-pushed the controller-replay-after-dataloss branch from d1e3ff2 to 0eda42c Compare November 25, 2022 15:29
@mmaslankaprv mmaslankaprv self-requested a review November 28, 2022 10:33
@jcsp jcsp merged commit 79a8e3e into redpanda-data:dev Nov 28, 2022
Merging this pull request may close: "Simulating disk corruption - broker stuck on startup" (#4950)