CORE-88 rptest: handle errors and retry them in list offsets request #17494
Conversation
```python
def list_offsets():
    try:
        test_admin = KafkaTestAdminClient(self.redpanda)
        return test_admin.list_offsets(
            group_id, [TopicPartition(self.topic_spec.name, 0)])
    except Exception as e:
        self.logger.debug(f"Failed to list offsets: {e}")
        return None
```
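Because this helper swallows exceptions and returns `None` on failure, the caller's retry has to be a polling loop. A minimal sketch of that pattern (the `retry_until_result` helper is hypothetical, standing in for ducktape's `wait_until`-style polling, not the actual test code):

```python
import time


def retry_until_result(fn, timeout_sec=30, backoff_sec=1):
    """Poll fn until it returns a non-None result or the timeout expires.

    Mirrors the pattern above: the helper returns None on any error,
    so the caller just keeps polling until it gets a real result.
    """
    deadline = time.time() + timeout_sec
    while True:
        result = fn()
        if result is not None:
            return result
        if time.time() >= deadline:
            raise TimeoutError("no result before timeout")
        time.sleep(backoff_sec)
```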
> redpanda will reply with error_code::not_coordinator

I'm not sure I understand this, because this error is a normal error which a Kafka client should retry transparently. The original CI failure had the `assert len(offsets) == 1` failing, suggesting that `list_offsets` wasn't failing with an exception. I think I'm missing something here about why this fixes the issue.
> suggesting that list_offsets wasn't failing with an exception

It wasn't failing because the top-level error wasn't handled; the commit message captures that. The test didn't handle the top-level error, and now it does. We also retry it.
In particular,
90b20eb#diff-a46aa013457bb4458e7cc5ccb16f3c593127c354be444c0c4cf03c0f36b1449aR604-R606
```diff
 def _list_offsets_send_process_response(self, response):
+    error_type = kerr.for_code(response.error_code)
+    if error_type is not kerr.NoError:
+        raise error_type("Error in list_offsets response")
```
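The check in the diff maps the response's top-level error code to an exception class and raises it if the code is non-zero. A self-contained sketch of that pattern is below; the class names mirror kafka-python's `kafka.errors` module, but this is a simplified illustration, not the library's actual implementation:

```python
class KafkaError(Exception):
    """Base class; illustrative stand-in for kafka.errors.KafkaError."""
    errno = None


class NoError(KafkaError):
    errno = 0


class NotCoordinatorError(KafkaError):
    errno = 16  # NOT_COORDINATOR in the Kafka protocol


_CODE_TO_ERROR = {cls.errno: cls for cls in (NoError, NotCoordinatorError)}


def for_code(error_code):
    # Unknown codes fall back to the generic base error.
    return _CODE_TO_ERROR.get(error_code, KafkaError)


def check_response(error_code):
    # The same shape as the diff above: raise unless the code means "no error".
    error_type = for_code(error_code)
    if error_type is not NoError:
        raise error_type("Error in list_offsets response")
```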
The "client part" here is implemented by me on top of the Python Kafka client, which has no infrastructure at all for retrying requests.
Sorry, I still don't understand. Are there actually multiple logical changes in the same commit?

> The test didn't handle the top level error, now it does.

What error is this? Is it different than not_coordinator?
Yes, there were two changes; I've split them now.

The error is `error_code::not_coordinator`.

Force-pushed 90b20eb to 97b4c33 with the updated commit message:

> After broker restart it might take a while before the group offsets are available, during which redpanda will reply with `error_code::not_coordinator`[^0]. We need to keep retrying until they become available. Fixes redpanda-data#17466

[^0]: https://github.com/redpanda-data/redpanda/blob/501b9d35882a303def6061bdc522f67f0502ac1c/src/v/kafka/server/group_manager.cc#L1653
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47380#018ea8ff-0efb-4956-85e2-225c781aebb9
Thanks, lgtm.

It would have been really helpful for my understanding originally to point out that the client in consumer_group_test.py is very low-level and, unlike a more capable client, doesn't handle not_coordinator, which led to

```
[INFO - 2024-03-28 01:22:53,309 - consumer_group_test - test_group_recovery - lineno:424]: Got offsets after restart: {}
```

instead of an error that the caller could handle. The comparison that then failed was `assert len(offsets) == 1` with `offsets == {}`.
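The silent-failure mode described here can be reduced to a minimal sketch (function names and response shape are illustrative, not the real client):

```python
def list_offsets_unchecked(response):
    # Old behaviour: the top-level error_code field is never inspected,
    # so a NOT_COORDINATOR reply just looks like "no offsets".
    return response["offsets"]


def list_offsets_checked(response):
    # New behaviour: surface the top-level error so the caller can retry.
    if response["error_code"] != 0:
        raise RuntimeError(f"top-level error {response['error_code']}")
    return response["offsets"]


# A NOT_COORDINATOR response carries no offsets; the unchecked variant
# returns {}, and the test later fails at `assert len(offsets) == 1`,
# far from the real cause. The checked variant fails fast instead.
not_coordinator = {"error_code": 16, "offsets": {}}
```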
/backport v23.3.x
/backport v23.2.x
After broker restart it might take a while before the group offsets are available, during which redpanda will reply with `error_code::not_coordinator`[^1]. The test didn't handle the top-level error; now it does. We also retry it. Introduced in #17260.

Fixes #17466
Backports Required
Release Notes
Footnotes

[^1]: https://github.com/redpanda-data/redpanda/blob/501b9d35882a303def6061bdc522f67f0502ac1c/src/v/kafka/server/group_manager.cc#L1653