@WillemKauf WillemKauf commented May 17, 2025

This test can race with adjacent segment compaction while collecting a node's consumer offsets summary, producing a backtrace like:

File "/root/tests/rptest/tests/consumer_group_test.py", line 1690, in test_stress_consumer_group_commits
    summary = olv.consumer_offsets_summary(n)
  File "/root/tests/rptest/clients/offline_log_viewer.py", line 59, in consumer_offsets_summary
    return self._json_cmd(node, "--type consumer_offsets_summary")
  File "/root/tests/rptest/clients/offline_log_viewer.py", line 34, in _json_cmd
    json_out = node.account.ssh_output(cmd, combine_stderr=False)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 41, in wrapper
    return method(self, *args, **kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 421, in ssh_output
    raise RemoteCommandError(self, cmd, exit_status, stderr.read())
  FileNotFoundError: [Errno 2] No such file or directory: \'/var/lib/redpanda/data/kafka/__consumer_offsets/0_22/2239422-4-v1.log\'\n'

Use a `wait_until()` wrapper instead to retry this exceptional case.
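As a rough illustration of the retry idea, here is a minimal, self-contained sketch. The `wait_until` below is a simplified stand-in modeled on the keyword arguments visible in this PR (`timeout_sec`, `backoff_sec`, `retry_on_exc`), not ducktape's actual implementation. `FlakyViewer` and `collect_summary` are hypothetical names, and the sketch raises a plain `FileNotFoundError` where the real test sees it wrapped in a `RemoteCommandError`.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg="", retry_on_exc=False):
    """Simplified stand-in for a ducktape-style wait_until (assumption, not the real code)."""
    deadline = time.monotonic() + timeout_sec
    last_exc = None
    while time.monotonic() < deadline:
        try:
            if condition():
                return
        except Exception as e:
            # With retry_on_exc=False, any exception aborts the wait immediately.
            if not retry_on_exc:
                raise
            last_exc = e
        time.sleep(backoff_sec)
    raise TimeoutError(err_msg) from last_exc


class FlakyViewer:
    """Hypothetical viewer whose first reads hit a segment file that compaction removed."""

    def __init__(self, failures=2):
        self.calls = 0
        self.failures = failures

    def consumer_offsets_summary(self, node):
        self.calls += 1
        if self.calls <= self.failures:
            raise FileNotFoundError("segment removed by compaction")
        return {"node": node, "groups": {}}


def collect_summary(viewer, node, timeout_sec=5, backoff_sec=0.01):
    """Retry the summary read until it succeeds or the timeout expires."""
    result = {}

    def try_collect():
        result["summary"] = viewer.consumer_offsets_summary(node)
        return True

    wait_until(try_collect, timeout_sec=timeout_sec,
               backoff_sec=backoff_sec, retry_on_exc=True)
    return result["summary"]
```

In this sketch the closure stores the summary in `result` because the stand-in `wait_until` only reports success or failure, not the value the condition computed.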

The second commit also passes a `LoggingConfig` argument to the `RedpandaService` base class for this test, since the test generates enough log output to fill up the disks of the CI agent nodes.
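To make the logging change concrete, here is a hedged sketch of how per-logger levels might be turned into broker command-line arguments. `LogLevels` is a hypothetical stand-in and not rptest's `LoggingConfig`. The flag names and the `name=level` pairs joined by `:` are assumed to follow Seastar's `--default-log-level` and `--logger-log-level` options, and the logger names are only illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class LogLevels:
    """Hypothetical stand-in for a logging config; not the rptest class."""

    default_level: str = "info"
    logger_levels: dict = field(default_factory=dict)

    def to_args(self):
        # Assumed Seastar-style flags: a default level plus per-logger overrides.
        args = [f"--default-log-level={self.default_level}"]
        if self.logger_levels:
            pairs = ":".join(
                f"{name}={level}" for name, level in self.logger_levels.items()
            )
            args.append(f"--logger-log-level={pairs}")
        return args


# Quieter settings for a high-volume stress test (illustrative logger names).
quiet = LogLevels("info", {"kafka": "warn", "storage": "warn"})
```

Lowering the noisiest loggers while leaving the default at `info` keeps most diagnostics available and cuts the volume written to disk.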

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

  • none

@WillemKauf WillemKauf force-pushed the stress_group_consumer_offsets_fix branch from e961f3a to 12c29b8 Compare May 17, 2025 22:32
vbotbuildovich commented May 18, 2025

CI test results

test results on build#66142
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| CloudStorageScrubberTest | test_scrubber | {"cloud_storage_type": 2} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66142#0196e094-f563-45e9-94b2-d751b93ce66e | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 2, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": false, "with_iceberg": false, "with_tiered_storage": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66142#0196e094-f561-49b6-bc90-444930c27f5c | FLAKY | 20/21 | upstream reliability is '97.09302325581395'. current run reliability is '95.23809523809523'. drift is 1.85493 and the allowed drift is set to 50. The test should PASS |
test results on build#66150
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| PartitionReassignmentsTest | test_add_partitions_with_inprogress_reassignments | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66150#0196e35b-9b85-4941-82f4-01faeed2a2e3 | FLAKY | 17/21 | upstream reliability is '90.14084507042254'. current run reliability is '80.95238095238095'. drift is 9.18846 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 2, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": false, "with_iceberg": false, "with_tiered_storage": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66150#0196e35b-9b84-49bd-9115-42aaf93a9ca5 | FLAKY | 20/21 | upstream reliability is '97.09302325581395'. current run reliability is '95.23809523809523'. drift is 1.85493 and the allowed drift is set to 50. The test should PASS |
test results on build#66155
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| DataMigrationsApiTest | test_migrated_topic_data_integrity | {"params": {"cancellation": null, "use_alias": false}, "transfer_leadership": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66155#0196e592-0006-411a-b367-26d86e31cc0d | FLAKY | 20/21 | upstream reliability is '98.19494584837545'. current run reliability is '95.23809523809523'. drift is 2.95685 and the allowed drift is set to 50. The test should PASS |
| RestCatalogConnectionTest | test_redpanda_connection_to_rest_catalog | {"cloud_storage_type": 1} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66155#0196e592-0006-411a-b367-26d86e31cc0d | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| DeleteRecordsKafkaTest | test_delete_records_non_empty_topic | {"truncate_point": "start_offset"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66155#0196e592-0006-411a-b367-26d86e31cc0d | FLAKY | 20/21 | upstream reliability is '98.72881355932203'. current run reliability is '95.23809523809523'. drift is 3.49072 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 2, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": false, "with_iceberg": false, "with_tiered_storage": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66155#0196e533-cf3c-4e90-9a5d-21904220f8e2 | FLAKY | 20/21 | upstream reliability is '97.6878612716763'. current run reliability is '95.23809523809523'. drift is 2.44977 and the allowed drift is set to 50. The test should PASS |
| DisablingPartitionsTest | test_disable | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66155#0196e592-0005-4aad-8909-a12e3a6b40f4 | FLAKY | 14/21 | upstream reliability is '85.84070796460178'. current run reliability is '66.66666666666666'. drift is 19.17404 and the allowed drift is set to 50. The test should PASS |

@WillemKauf

/ci-repeat 1
skip-redpanda-build
skip-units
tests/rptest/tests/consumer_group_test.py::ConsumerGroupOffsetResetTest.test_stress_consumer_group_commits

@WillemKauf WillemKauf force-pushed the stress_group_consumer_offsets_fix branch from 12c29b8 to d775923 Compare May 18, 2025 11:27
@WillemKauf

/ci-repeat 1
skip-redpanda-build
skip-units
tests/rptest/tests/consumer_group_test.py::ConsumerGroupOffsetResetTest.test_stress_consumer_group_commits

@vbotbuildovich

Retry command for Build#66154

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/consumer_group_test.py::ConsumerGroupOffsetResetTest.test_stress_consumer_group_commits

This test can race with adjacent segment compaction while collecting a
consumer offsets summary, with a backtrace like

```
File "/root/tests/rptest/tests/consumer_group_test.py", line 1690, in test_stress_consumer_group_commits
    summary = olv.consumer_offsets_summary(n)
  File "/root/tests/rptest/clients/offline_log_viewer.py", line 59, in consumer_offsets_summary
    return self._json_cmd(node, "--type consumer_offsets_summary")
  File "/root/tests/rptest/clients/offline_log_viewer.py", line 34, in _json_cmd
    json_out = node.account.ssh_output(cmd, combine_stderr=False)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 41, in wrapper
    return method(self, *args, **kwargs)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 421, in ssh_output
    raise RemoteCommandError(self, cmd, exit_status, stderr.read())
  FileNotFoundError: [Errno 2] No such file or directory: \'/var/lib/redpanda/data/kafka/__consumer_offsets/0_22/2239422-4-v1.log\'\n'
```

Use a `wait_until()` wrapper instead to retry this case.

This test could generate enough output to fill up the agent nodes in CI:

```
[WARNING - 2025-05-18 15:24:36,604 - test - copy_service_logs - lineno:162]: Error copying log backtraces from
{'path': '/var/lib/redpanda/redpanda_backtrace.log', 'collect_default': True} to /build/tests/results/2025-05-18--002/ConsumerGroupOffsetResetTest/test_stress_consumer_group_commits/15/RedpandaService-0-140178931023632/docker-rp-18.service <RedpandaService-0-140178931023632: num_nodes: 3, nodes: ['docker-rp-16', 'docker-rp-17', 'docker-rp-18']>: [Errno 28] No space left on device
```

Use a custom `LoggingConfig` to reduce the amount of output generated.
@WillemKauf WillemKauf force-pushed the stress_group_consumer_offsets_fix branch from 63ad603 to a28cdec Compare May 18, 2025 20:29
@WillemKauf WillemKauf requested a review from mmaslankaprv May 19, 2025 03:10
@WillemKauf WillemKauf merged commit 88a043a into redpanda-data:dev May 20, 2025
16 checks passed
lambda: get_summary(n),
timeout_sec=60,
backoff_sec=1,
retry_on_exc=True,
Contributor

This way we catch the asserts above. Are they unlikely to be transient?

Contributor Author

Yikes, good point. I forget that asserts in Python just throw. I'll follow up with a fix for this.

@WillemKauf WillemKauf May 21, 2025

Though, I doubt these errors are transient. We should fail the test on the first sighting of a failed assert.
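One possible shape for such a follow-up, sketched here under assumptions and not taken from the actual fix: retry only the exception type known to be transient, and run the assertions outside the retried callable so a failed assert surfaces on its first occurrence. `retry_on`, `TransientReadError`, and `fetch_then_validate` are hypothetical names introduced for illustration.

```python
import time


class TransientReadError(Exception):
    """Hypothetical stand-in for the retryable failure (e.g. a segment file vanishing)."""


def retry_on(exc_types, fn, timeout_sec, backoff_sec):
    """Call fn until it returns, retrying only on the given exception types."""
    deadline = time.monotonic() + timeout_sec
    while True:
        try:
            return fn()
        except exc_types:
            # Any other exception, including AssertionError, propagates immediately.
            if time.monotonic() >= deadline:
                raise
            time.sleep(backoff_sec)


def fetch_then_validate(fetch, validate, timeout_sec=5, backoff_sec=0.01):
    """Retry the fetch, then validate once, outside the retry loop."""
    summary = retry_on((TransientReadError,), fetch, timeout_sec, backoff_sec)
    validate(summary)  # a failed assert here fails the caller right away
    return summary
```

The trade-off in this sketch is that the set of retryable exceptions has to be named explicitly, which is stricter than a blanket `retry_on_exc=True` but keeps assertion failures from being retried until the timeout.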
