Tests: KCL to retry reassignment request when clashing with partition balancer #26149

bashtanov · 2025-05-14T21:17:15Z

Sometimes explicitly requested partition reassignment will clash with a reassignment triggered by Partition Balancer. Retry in this case.

No backport as behaviour changes (more specific error code).

Backports Required

Release Notes

Improvements

In AlterPartitionReassignmentsResponse per-partition response REASSIGNMENT_IN_PROGRESS error code is used if a reassignment is requested while Partition Balancer is moving partition replicas.

It's in the header, so it was probably intended to be implemented. With it implemented we can print any command key which is helpful for adding a debug print of all commands going through controller with their types and keys. We're probably miles away from being able to print all values.

We already classify cluster::errc::no_update_in_progress as error_code::no_reassignment_in_progress. This is to do similar for cluster::errc::update_in_progress. This error code is used in node joining, but it is handled there specifically.

We already have 3 similar handlers and looking to add one more. Separate repeated code.

partition balancer may kick in at any time, so this error should be handled gracefully

vbotbuildovich · 2025-05-15T03:03:32Z

CI test results

test results on build#66014

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason
ArchivalTest	test_single_partition_leadership_transfer	{"cloud_storage_type": 2}	ducktape	https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e0-a87f-46cb-8204-66d84d11d95e	FLAKY	20/21	upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
CloudStorageChunkReadTest	test_read_chunks		ducktape	https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e2-b57d-4da4-add1-9d38e66ef0fc	FLAKY	19/21	upstream reliability is '98.31932773109243'. current run reliability is '90.47619047619048'. drift is 7.84314 and the allowed drift is set to 50. The test should PASS
DataTransformsTest	test_compression	{"compression_type": "gzip"}	ducktape	https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e2-b57d-4da4-add1-9d38e66ef0fc	FLAKY	20/21	upstream reliability is '99.14529914529915'. current run reliability is '95.23809523809523'. drift is 3.9072 and the allowed drift is set to 50. The test should PASS
DataTransformsTest	test_compression	{"compression_type": "snappy"}	ducktape	https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e2-b57d-4da4-add1-9d38e66ef0fc	FLAKY	20/21	upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
DataTransformsTest	test_identity	{"transactional": false, "wait_running": false}	ducktape	https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e2-b57d-4da4-add1-9d38e66ef0fc	FLAKY	20/21	upstream reliability is '99.14529914529915'. current run reliability is '95.23809523809523'. drift is 3.9072 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest	test_node_operations	{"cloud_storage_type": 2, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": false, "with_tiered_storage": false}	ducktape	https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e0-a87f-46cb-8204-66d84d11d95e	FLAKY	20/21	upstream reliability is '99.20634920634922'. current run reliability is '95.23809523809523'. drift is 3.96825 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest	test_node_operations	{"cloud_storage_type": 2, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": false, "with_tiered_storage": true}	ducktape	https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e0-a87f-46cb-8204-66d84d11d95e	FLAKY	20/21	upstream reliability is '98.7878787878788'. current run reliability is '95.23809523809523'. drift is 3.54978 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest	test_node_operations	{"cloud_storage_type": 1, "enable_failures": true, "mixed_versions": true, "with_chunked_compaction": true, "with_iceberg": false, "with_tiered_storage": false}	ducktape	https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e0-a880-4436-9a96-35d024ba78be	FLAKY	20/21	upstream reliability is '99.01960784313727'. current run reliability is '95.23809523809523'. drift is 3.78151 and the allowed drift is set to 50. The test should PASS

bashtanov · 2025-05-15T13:07:13Z

this should fix https://redpandadata.atlassian.net/browse/CORE-9958

bharathv

on "improvements" in release notes of this PR, I thought that is only reserved for user facing improvements? This is a test only change though.

bharathv · 2025-05-21T20:23:04Z

tests/rptest/clients/kcl.py

+                    return False
+
+                # A concurrent reassignment may be triggered by partition_balancer
+                if has_partition_err(l, "REASSIGNMENT_IN_PROGRESS"):


I'm wondering if the caller should handle the retry in this case. Reassignments are typically submitted with the assumption that the original assignment is still valid. However, the presence of a REASSIGNMENT_IN_PROGRESS status suggests that the original assignment may no longer apply, so it might be more appropriate for the caller to decide how to proceed.

I thought reassignment specifies desired state, and if a concurrent reassignment has changed something the desired state still applies. WDYT?

bashtanov · 2025-05-21T21:49:34Z

This is a test only change though

No, we actually change the behaviour of Redpanda by returning a different error code. I think users should be prepared to it.

bharathv · 2025-05-21T23:20:29Z

This is a test only change though

No, we actually change the behaviour of Redpanda by returning a different error code. I think users should be prepared to it.

duh! I totally missed that commit. 🙈

bashtanov · 2025-06-04T12:44:52Z

/backport v25.1.x v24.3.x v24.2.x

bashtanov · 2025-06-04T12:46:47Z

/backport v25.1.x

bashtanov · 2025-06-04T12:46:54Z

/backport v24.3.x

bashtanov · 2025-06-04T12:46:59Z

/backport v24.2.x

bashtanov added 4 commits May 14, 2025 16:21

tests/clients/kcl: refactor error handling

a038894

We already have 3 similar handlers and looking to add one more. Separate repeated code.

tests/clients/kcl: retry on REASSIGNMENT_IN_PROGRESS

cf4093c

partition balancer may kick in at any time, so this error should be handled gracefully

github-actions bot added the area/redpanda label May 14, 2025

bashtanov requested review from bharathv, mmaslankaprv and ztlpn May 15, 2025 07:33

bharathv reviewed May 21, 2025

View reviewed changes

bashtanov requested a review from bharathv June 2, 2025 08:22

bharathv approved these changes Jun 2, 2025

View reviewed changes

bashtanov merged commit fbd6eee into redpanda-data:dev Jun 3, 2025
18 of 21 checks passed

vbotbuildovich mentioned this pull request Jun 4, 2025

[v25.1.x] Tests: KCL to retry reassignment request when clashing with partition balancer #26347

Merged

bashtanov mentioned this pull request Jun 13, 2025

Deflake PartitionReassignmentsTest #26452

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Tests: KCL to retry reassignment request when clashing with partition balancer #26149

Tests: KCL to retry reassignment request when clashing with partition balancer #26149

Uh oh!

bashtanov commented May 14, 2025 •

edited

Loading

Uh oh!

vbotbuildovich commented May 15, 2025

Uh oh!

bashtanov commented May 15, 2025 •

edited

Loading

Uh oh!

bharathv left a comment

Uh oh!

bharathv May 21, 2025

Uh oh!

bashtanov May 21, 2025

Uh oh!

bashtanov commented May 21, 2025

Uh oh!

bharathv commented May 21, 2025

Uh oh!

Uh oh!

bashtanov commented Jun 4, 2025

Uh oh!

bashtanov commented Jun 4, 2025

Uh oh!

bashtanov commented Jun 4, 2025

Uh oh!

bashtanov commented Jun 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Tests: KCL to retry reassignment request when clashing with partition balancer #26149

Tests: KCL to retry reassignment request when clashing with partition balancer #26149

Uh oh!

Conversation

bashtanov commented May 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Backports Required

Release Notes

Improvements

Uh oh!

vbotbuildovich commented May 15, 2025

CI test results

Uh oh!

bashtanov commented May 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bharathv left a comment

Choose a reason for hiding this comment

Uh oh!

bharathv May 21, 2025

Choose a reason for hiding this comment

Uh oh!

bashtanov May 21, 2025

Choose a reason for hiding this comment

Uh oh!

bashtanov commented May 21, 2025

Uh oh!

bharathv commented May 21, 2025

Uh oh!

Uh oh!

bashtanov commented Jun 4, 2025

Uh oh!

bashtanov commented Jun 4, 2025

Uh oh!

bashtanov commented Jun 4, 2025

Uh oh!

bashtanov commented Jun 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bashtanov commented May 14, 2025 •

edited

Loading

bashtanov commented May 15, 2025 •

edited

Loading