rptest: partially deflake test_stress_consumer_group_commits
#26185
This test can race with adjacent segment compaction while collecting a
consumer offsets summary, with a backtrace like
```
File "/root/tests/rptest/tests/consumer_group_test.py", line 1690, in test_stress_consumer_group_commits
summary = olv.consumer_offsets_summary(n)
File "/root/tests/rptest/clients/offline_log_viewer.py", line 59, in consumer_offsets_summary
return self._json_cmd(node, "--type consumer_offsets_summary")
File "/root/tests/rptest/clients/offline_log_viewer.py", line 34, in _json_cmd
json_out = node.account.ssh_output(cmd, combine_stderr=False)
File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 41, in wrapper
return method(self, *args, **kwargs)
File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 421, in ssh_output
raise RemoteCommandError(self, cmd, exit_status, stderr.read())
FileNotFoundError: [Errno 2] No such file or directory: \'/var/lib/redpanda/data/kafka/__consumer_offsets/0_22/2239422-4-v1.log\'\n'
```
Use a `wait_until()` wrapper instead to retry this case.
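A minimal sketch of the retry pattern, under stated assumptions: the `wait_until` helper below is a stand-in written for illustration (it mirrors the `timeout_sec`, `backoff_sec`, and `retry_on_exc` parameters visible in this PR's diff, but it is not the ducktape implementation), and `flaky_summary` is a hypothetical substitute for `olv.consumer_offsets_summary(n)` that fails twice as if compaction had removed a segment mid-read.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, retry_on_exc=False):
    """Stand-in retry helper (illustrative, not the ducktape implementation)."""
    deadline = time.time() + timeout_sec
    last_exc = None
    while True:
        try:
            if condition():
                return
            last_exc = None
        except Exception as e:
            if not retry_on_exc:
                raise
            last_exc = e  # remember the failure and try again
        if time.time() >= deadline:
            break
        time.sleep(backoff_sec)
    raise TimeoutError("condition not met in time") from last_exc


attempts = {"count": 0}
summaries = {}


def flaky_summary(node):
    # Hypothetical stand-in for olv.consumer_offsets_summary(node): the first
    # two calls fail as if compaction removed a segment file mid-read.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FileNotFoundError("segment removed by compaction")
    return {"node": node, "groups": {}}


def get_summary(node):
    # Store the result and report success so wait_until stops polling.
    summaries[node] = flaky_summary(node)
    return True


wait_until(lambda: get_summary("n1"),
           timeout_sec=5,
           backoff_sec=0.01,
           retry_on_exc=True)
```

Note that with `retry_on_exc=True` every exception raised inside the condition is retried, including `AssertionError`, which is the concern raised in the review below.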
This test can generate enough log output to fill the disks of the CI agent nodes:
```
[WARNING - 2025-05-18 15:24:36,604 - test - copy_service_logs - lineno:162]: Error copying log backtraces from
{'path': '/var/lib/redpanda/redpanda_backtrace.log', 'collect_default': True} to /build/tests/results/2025-05-18--002/ConsumerGroupOffsetResetTest/test_stress_consumer_group_commits/15/RedpandaService-0-140178931023632/docker-rp-18.service <RedpandaService-0-140178931023632: num_nodes: 3, nodes: ['docker-rp-16', 'docker-rp-17', 'docker-rp-18']>: [Errno 28] No space left on device
```
Use a custom `LoggingConfig` to reduce the amount of output generated.
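A rough sketch of how a per-test logging configuration can cut log volume. The `LoggingConfig` class below is a stand-in defined here for illustration only (the real rptest class and its constructor may differ), and the logger name and levels are illustrative assumptions, not the values chosen in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class LoggingConfig:
    # Stand-in for illustration only; not the rptest class.
    default_level: str
    logger_levels: dict = field(default_factory=dict)

    def to_args(self) -> str:
        # Render the config as Seastar-style log-level command-line flags.
        args = f"--default-log-level={self.default_level}"
        if self.logger_levels:
            levels = ":".join(f"{name}={level}"
                              for name, level in self.logger_levels.items())
            args += f" --logger-log-level={levels}"
        return args


# Illustrative choice: keep most loggers at 'info' and raise detail for one.
quiet_config = LoggingConfig("info", logger_levels={"kafka": "debug"})
print(quiet_config.to_args())
```

Lowering the default level while keeping verbose output only for the loggers under investigation is one way to trade diagnostic detail for disk usage.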
    lambda: get_summary(n),
    timeout_sec=60,
    backoff_sec=1,
    retry_on_exc=True,
This way we catch the asserts above. Are they unlikely to be transient?
Yikes, good point. I forget that asserts in Python just throw. I'll have a follow-up fix for this.
Though I doubt these errors are transient, we should fail the test on the first sighting of a failed assert.
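One way to address this concern, sketched under assumptions: retry only the known transient error inside the condition, and run the assertions after the wait so that a failed assert fails the test on first sight. The `wait_until` helper is a stand-in written for illustration, `read_summary` and `collect_summary` are hypothetical names, and `FileNotFoundError` stands in for whatever exception the real client raises; the actual follow-up fix may differ.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1):
    # Stand-in helper: polls until condition() is truthy; exceptions propagate.
    deadline = time.time() + timeout_sec
    while not condition():
        if time.time() >= deadline:
            raise TimeoutError("condition not met in time")
        time.sleep(backoff_sec)


calls = {"count": 0}


def read_summary(node):
    # Hypothetical reader: fails once as if a segment vanished, then succeeds.
    calls["count"] += 1
    if calls["count"] == 1:
        raise FileNotFoundError("segment removed by compaction")
    return {"groups": {"g1": {"offset": 10}}}


def collect_summary(node, timeout_sec=5):
    result = {}

    def try_read():
        try:
            result["summary"] = read_summary(node)
            return True
        except FileNotFoundError:
            # Only the known transient error is retried.
            return False

    wait_until(try_read, timeout_sec=timeout_sec, backoff_sec=0.01)
    return result["summary"]


summary = collect_summary("n1")
# Assertions run outside the retry loop, so a failure surfaces immediately.
assert summary["groups"]["g1"]["offset"] == 10
```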
This test can race with adjacent segment compaction while collecting a node's consumer offsets summary, with a backtrace like the one quoted in the commit message above.
Use a `wait_until()` wrapper instead to retry this exceptional case.
The second commit also adds a `LoggingConfig` argument to pass to the `RedpandaService` base class for this test, since it generates large enough logs to fill up the CI agent nodes' disks.
Backports Required
Release Notes