k/group: recover leader epoch on leader change #17260
Conversation
new failures in https://buildkite.com/redpanda/redpanda/builds/46616#018e668c-9526-49f5-97e6-de141da9ca44:
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46616#018e669e-a59b-4cd6-9f6d-56626702fc2e
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46640#018e681b-7814-4b5c-a686-1dcbb45692dc
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46640#018e681f-84c3-4a5d-a9b7-ed837d00e5be
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46718#018e7701-2d62-4e94-8fe3-c807d2de1e0e
/dt
failing test with verifiable consumer; without write caching... https://ci-artifacts.dev.vectorized.cloud/redpanda/46616/018e668c-9526-49f5-97e6-de141da9ca44/vbuild/ducktape/results/final/report.html
Nice. When you request reviewers, I think we'll want to get a few groups: bharath/michal, and probably someone from the enterprise team.
/ci-repeat 10
VerifiableConsumer is KIP-320 compliant now, at least to the degree the Java Kafka client is. This will facilitate testing of the Redpanda implementation of KIP-320 in group recovery (redpanda-data#17260) and data loss handling when write caching is enabled.
8e82687
to
0ed5c6b
Compare
VerifiableConsumer is KIP-320 compliant now, at least to the degree the Java Kafka client is. This will facilitate testing of the Redpanda implementation of KIP-320 in group recovery (redpanda-data#17260) and data loss handling when write caching is enabled.
This tests consumer group commits. The test also exposes a bug in which the leader epoch is reset after a cluster restart; this is fixed in a subsequent commit.
This was discovered while testing the write caching feature. After a leadership change or node restart we would reply with the default field value `-2147483648`, which breaks the KIP-320 logic. `check_leader_epoch` in Redpanda treats negative epoch values as "not set" and, I believe, franz-go behaves the same. As a result, KIP-320 fencing is not applied and the client ends up with an `OFFSET_OUT_OF_RANGE` error.
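The failure mode above can be sketched in a few lines. This is a hedged, hypothetical Python model of the epoch check (the real `check_leader_epoch` is C++ inside Redpanda; the function name comes from the comment, everything else is illustrative): any negative epoch, including the int32-minimum sentinel the bug sends, is treated as "not set", so the fencing branch is never reached.

```python
# Sketch only: models the KIP-320 epoch check described in the comment.
# INT32_MIN is the default field value observed after leader change/restart.
INT32_MIN = -2147483648

def check_leader_epoch(requested_epoch: int, current_epoch: int) -> str:
    """Return the fencing decision for a request carrying a leader epoch."""
    if requested_epoch < 0:
        # Negative epoch is interpreted as "client did not set an epoch",
        # so KIP-320 fencing is silently skipped -- this is the bug path
        # when the broker itself hands back the -2147483648 sentinel.
        return "not_checked"
    if requested_epoch < current_epoch:
        # Stale epoch: fence the client so it can reconcile offsets
        # instead of hitting OFFSET_OUT_OF_RANGE later.
        return "fenced"
    if requested_epoch > current_epoch:
        # Client claims a newer epoch than this replica knows about.
        return "unknown_leader_epoch"
    return "ok"
```

With a recovered (non-negative) epoch, a stale client is fenced; with the sentinel, the check degenerates to "not_checked" and the protection is lost.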
0ed5c6b
to
501b9d3
Compare
LGTM, nice!
/backport v23.3.x
/backport v23.2.x
VerifiableConsumer is KIP-320 compliant now, at least to the degree the Java Kafka client is. This will facilitate testing of the Redpanda implementation of KIP-320 in group recovery (redpanda-data#17260) and data loss handling when write caching is enabled. (cherry picked from commit 9316309)
From my understanding, this can cause a problem only when write caching is enabled.
It could also apply to acks=1 in an edge case, but I haven't thought it through. We make data available only after it has been majority-replicated, so it's very unlikely for a truncation to happen after that and trigger KIP-320.
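The write-caching connection can be illustrated with a small sketch. This is a hypothetical Python model (all names are invented for illustration, not Redpanda or Kafka client API): with a valid committed leader epoch a client can distinguish log truncation (cached-but-unreplicated data lost on leader change) from an ordinary out-of-range offset; with the epoch missing, both collapse into `OFFSET_OUT_OF_RANGE`.

```python
# Sketch only: how a KIP-320 client classifies a fetch failure after a
# leadership change, given the epoch it committed alongside its offset.
def classify_fetch_failure(committed_offset: int,
                           committed_epoch,  # int or None if never set
                           log_end_offset: int,
                           current_epoch: int) -> str:
    if (committed_epoch is not None
            and committed_epoch < current_epoch
            and committed_offset > log_end_offset):
        # Epoch advanced and our offset no longer exists: the log was
        # truncated, e.g. write-cached data was lost on leader change.
        return "log_truncation"
    if committed_offset > log_end_offset:
        # Without epoch information the client only sees an invalid
        # offset and surfaces OFFSET_OUT_OF_RANGE.
        return "offset_out_of_range"
    return "ok"
```

This is why recovering the leader epoch matters: returning the `-2147483648` sentinel forces every client down the second, less informative branch.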
Backports Required
Release Notes
Release Notes