CORE-88 rptest: handle errors and retry them in list offsets request #17494
Conversation
```python
def list_offsets():
    try:
        test_admin = KafkaTestAdminClient(self.redpanda)
        return test_admin.list_offsets(
            group_id, [TopicPartition(self.topic_spec.name, 0)])
    except Exception as e:
        self.logger.debug(f"Failed to list offsets: {e}")
        return None
```
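Because this helper swallows exceptions and returns `None` on failure, the caller's retry has to be a polling loop. A minimal sketch of that pattern (the `retry_until_result` helper is hypothetical, standing in for ducktape's `wait_until`-style polling, not the actual test code):

```python
import time


def retry_until_result(fn, timeout_sec=30, backoff_sec=1):
    """Poll fn until it returns a non-None result or the timeout expires.

    Mirrors the pattern above: the helper returns None on any error,
    so the caller just keeps polling until it gets a real result.
    """
    deadline = time.time() + timeout_sec
    while True:
        result = fn()
        if result is not None:
            return result
        if time.time() >= deadline:
            raise TimeoutError("no result before timeout")
        time.sleep(backoff_sec)
```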
> redpanda will reply with error_code::not_coordinator

I'm not sure I understand this, because this error is a normal error which a Kafka client should retry transparently. The original CI failure had the `assert len(offsets) == 1` failing, suggesting that `list_offsets` wasn't failing with an exception. I think I'm missing something here about why this fixes the issue.
> suggesting that list_offsets wasn't failing with an exception

It wasn't failing because the top-level error wasn't handled; the commit message captures that. The test didn't handle the top-level error, and now it does. We also retry it.
In particular,
90b20eb#diff-a46aa013457bb4458e7cc5ccb16f3c593127c354be444c0c4cf03c0f36b1449aR604-R606
```diff
 def _list_offsets_send_process_response(self, response):
+    error_type = kerr.for_code(response.error_code)
+    if error_type is not kerr.NoError:
+        raise error_type("Error in list_offsets response")
```
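The check in the diff maps the response's top-level error code to an exception class and raises it if the code is non-zero. A self-contained sketch of that pattern is below; the class names mirror kafka-python's `kafka.errors` module, but this is a simplified illustration, not the library's actual implementation:

```python
class KafkaError(Exception):
    """Base class; illustrative stand-in for kafka.errors.KafkaError."""
    errno = None


class NoError(KafkaError):
    errno = 0


class NotCoordinatorError(KafkaError):
    errno = 16  # NOT_COORDINATOR in the Kafka protocol


_CODE_TO_ERROR = {cls.errno: cls for cls in (NoError, NotCoordinatorError)}


def for_code(error_code):
    # Unknown codes fall back to the generic base error.
    return _CODE_TO_ERROR.get(error_code, KafkaError)


def check_response(error_code):
    # The same shape as the diff above: raise unless the code means "no error".
    error_type = for_code(error_code)
    if error_type is not NoError:
        raise error_type("Error in list_offsets response")
```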
The "client part" here is implemented by me on top of the Python Kafka client, which has no infrastructure at all for retrying requests.
Sorry, I still don't understand. Are there actually multiple logical changes in the same commit?

> The test didn't handle the top level error, now it does.

What error is this? Is it different than not_coordinator?
Yes, there were two changes; I've split them now.

The error is `error_code::not_coordinator`.

Force-pushed 90b20eb to 97b4c33 with the updated commit message:

> After broker restart it might take a while before the group offsets are available, during which redpanda will reply with `error_code::not_coordinator`[^0]. We need to keep retrying until they become available. Fixes redpanda-data#17466

[^0]: https://github.com/redpanda-data/redpanda/blob/501b9d35882a303def6061bdc522f67f0502ac1c/src/v/kafka/server/group_manager.cc#L1653
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47380#018ea8ff-0efb-4956-85e2-225c781aebb9
Thanks, lgtm.

It would have been really helpful for my understanding originally to point out that the client in consumer_group_test.py is very low-level and, unlike a more capable client, doesn't handle not_coordinator, which led to

```
[INFO - 2024-03-28 01:22:53,309 - consumer_group_test - test_group_recovery - lineno:424]: Got offsets after restart: {}
```

instead of an error that the caller could handle. The comparison that then failed was `assert len(offsets) == 1` with `offsets == {}`.
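The silent-failure mode described here can be reduced to a minimal sketch (function names and response shape are illustrative, not the real client):

```python
def list_offsets_unchecked(response):
    # Old behaviour: the top-level error_code field is never inspected,
    # so a NOT_COORDINATOR reply just looks like "no offsets".
    return response["offsets"]


def list_offsets_checked(response):
    # New behaviour: surface the top-level error so the caller can retry.
    if response["error_code"] != 0:
        raise RuntimeError(f"top-level error {response['error_code']}")
    return response["offsets"]


# A NOT_COORDINATOR response carries no offsets; the unchecked variant
# returns {}, and the test later fails at `assert len(offsets) == 1`,
# far from the real cause. The checked variant fails fast instead.
not_coordinator = {"error_code": 16, "offsets": {}}
```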
/backport v23.3.x
/backport v23.2.x
After broker restart it might take a while before the group offsets are available, during which redpanda will reply with `error_code::not_coordinator`[^1]. The test didn't handle the top-level error; now it does. We also retry it. Introduced in #17260.

Fixes #17466
Backports Required
Release Notes
Footnotes

[^1]: https://github.com/redpanda-data/redpanda/blob/501b9d35882a303def6061bdc522f67f0502ac1c/src/v/kafka/server/group_manager.cc#L1653