Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-88 rptest: handle errors and retry them in list offsets request #17494

Merged
merged 2 commits into from
Apr 6, 2024

Conversation

nvartolomei
Copy link
Contributor

@nvartolomei nvartolomei commented Mar 29, 2024

After broker restart it might take a while before the group offsets are available during which redpanda will reply with
error_code::not_coordinator1. The test didn't handle the top level error, now it does. We also retry it.

Introduced in #17260
Fixes #17466

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

Footnotes

  1. https://github.com/redpanda-data/redpanda/blob/501b9d35882a303def6061bdc522f67f0502ac1c/src/v/kafka/server/group_manager.cc#L1653

Comment on lines 417 to 425
def list_offsets():
try:
test_admin = KafkaTestAdminClient(self.redpanda)
return test_admin.list_offsets(
group_id, [TopicPartition(self.topic_spec.name, 0)])
except Exception as e:
self.logger.debug(f"Failed to list offsets: {e}")
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redpanda will reply with error_code::not_coordinator

I'm not sure I understand this, because this error is a normal error which a Kafka client should retry transparently. The original CI failure had this assert len(offsets) == 1 failing, suggesting that list_offsets wasn't failing with an exception. I think I'm missing something here about why this fixes the issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggesting that list_offsets wasn't failing with an exception

It wasn't failing because the top level error wasn't handled.

The commit message captures that.

The test didn't handle the top level error, now it does. We also retry it.

In particular,

90b20eb#diff-a46aa013457bb4458e7cc5ccb16f3c593127c354be444c0c4cf03c0f36b1449aR604-R606

    def _list_offsets_send_process_response(self, response):
+        error_type = kerr.for_code(response.error_code)
+        if error_type is not kerr.NoError:
+            raise error_type("Error in list_offsets response")

The "client part" here is implemented by me on top of the python kafka client which doesn't have good infrastructure for retrying requests (none actually).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I still don't understand. Are there actually multiple logical changes in the same commit?

The test didn't handle the top level error, now it does.

What error is this? Is it different than not_coordinator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was 2 changes. Did split them now.
The error is error_code::not_coordinator.

After broker restart it might take a while before the group offsets are
available during which redpanda will reply with
`error_code::not_coordinator`[^0] using the message top-level error_code
field.

[^0]: https://github.com/redpanda-data/redpanda/blob/501b9d35882a303def6061bdc522f67f0502ac1c/src/v/kafka/server/group_manager.cc#L1653
After broker restart it might take a while before the group offsets
are available during which redpanda will reply with
`error_code::not_coordinator`[^0]. We need to keep retrying after they
become available.

Fixes redpanda-data#17466

[^0]: https://github.com/redpanda-data/redpanda/blob/501b9d35882a303def6061bdc522f67f0502ac1c/src/v/kafka/server/group_manager.cc#L1653
@vbotbuildovich
Copy link
Collaborator

@nvartolomei nvartolomei requested a review from dotnwat April 4, 2024 20:43
Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, lgtm.

It would have been really helpful for my understanding originally to point out that the client in consumer_group_test.py is very low-level and unlike a more capable client not_coordinator isn't handled, which led to

[INFO  - 2024-03-28 01:22:53,309 - consumer_group_test - test_group_recovery - lineno:424]: Got offsets after restart: {}

Instead of an error that the caller could handle. The comparison that failed was then assert len(offsets) == {}.

@dotnwat dotnwat merged commit 2c96d9b into redpanda-data:dev Apr 6, 2024
16 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v23.3.x

@vbotbuildovich
Copy link
Collaborator

/backport v23.2.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI Failure (assert len(offsets) == 1) in ConsumerGroupTest.test_group_recovery
3 participants