Skip to content

Conversation

@bashtanov
Copy link
Contributor

@bashtanov bashtanov commented May 14, 2025

Sometimes explicitly requested partition reassignment will clash with a reassignment triggered by Partition Balancer. Retry in this case.

No backport as behaviour changes (more specific error code).

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

Improvements

  • In AlterPartitionReassignmentsResponse per-partition response REASSIGNMENT_IN_PROGRESS error code is used if a reassignment is requested while Partition Balancer is moving partition replicas.

bashtanov added 4 commits May 14, 2025 16:21
It's in the header, so it was probably intended to be implemented.

With it implemented we can print any command key which is helpful for
adding a debug print of all commands going through controller with their
types and keys. We're probably miles away from being able to print all
values.
We already classify cluster::errc::no_update_in_progress as
error_code::no_reassignment_in_progress. This is to do similar for
cluster::errc::update_in_progress. This error code is used in node
joining, but it is handled there specifically.
We already have 3 similar handlers and looking to add one more. Separate
repeated code.
partition balancer may kick in at any time, so this error should be
handled gracefully
@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#66014
test_class test_method test_arguments test_kind job_url test_status passed reason
ArchivalTest test_single_partition_leadership_transfer {"cloud_storage_type": 2} ducktape https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e0-a87f-46cb-8204-66d84d11d95e FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
CloudStorageChunkReadTest test_read_chunks ducktape https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e2-b57d-4da4-add1-9d38e66ef0fc FLAKY 19/21 upstream reliability is '98.31932773109243'. current run reliability is '90.47619047619048'. drift is 7.84314 and the allowed drift is set to 50. The test should PASS
DataTransformsTest test_compression {"compression_type": "gzip"} ducktape https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e2-b57d-4da4-add1-9d38e66ef0fc FLAKY 20/21 upstream reliability is '99.14529914529915'. current run reliability is '95.23809523809523'. drift is 3.9072 and the allowed drift is set to 50. The test should PASS
DataTransformsTest test_compression {"compression_type": "snappy"} ducktape https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e2-b57d-4da4-add1-9d38e66ef0fc FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
DataTransformsTest test_identity {"transactional": false, "wait_running": false} ducktape https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e2-b57d-4da4-add1-9d38e66ef0fc FLAKY 20/21 upstream reliability is '99.14529914529915'. current run reliability is '95.23809523809523'. drift is 3.9072 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": false, "with_tiered_storage": false} ducktape https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e0-a87f-46cb-8204-66d84d11d95e FLAKY 20/21 upstream reliability is '99.20634920634922'. current run reliability is '95.23809523809523'. drift is 3.96825 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": false, "with_tiered_storage": true} ducktape https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e0-a87f-46cb-8204-66d84d11d95e FLAKY 20/21 upstream reliability is '98.7878787878788'. current run reliability is '95.23809523809523'. drift is 3.54978 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 1, "enable_failures": true, "mixed_versions": true, "with_chunked_compaction": true, "with_iceberg": false, "with_tiered_storage": false} ducktape https://buildkite.com/redpanda/redpanda/builds/66014#0196d0e0-a880-4436-9a96-35d024ba78be FLAKY 20/21 upstream reliability is '99.01960784313727'. current run reliability is '95.23809523809523'. drift is 3.78151 and the allowed drift is set to 50. The test should PASS

@bashtanov
Copy link
Contributor Author

bashtanov commented May 15, 2025

Copy link
Contributor

@bharathv bharathv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on "improvements" in release notes of this PR, I thought that is only reserved for user facing improvements? This is a test only change though.

return False

# A concurrent reassignment may be triggered by partition_balancer
if has_partition_err(l, "REASSIGNMENT_IN_PROGRESS"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if the caller should handle the retry in this case. Reassignments are typically submitted with the assumption that the original assignment is still valid. However, the presence of a REASSIGNMENT_IN_PROGRESS status suggests that the original assignment may no longer apply, so it might be more appropriate for the caller to decide how to proceed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought reassignment specifies desired state, and if a concurrent reassignment has changed something the desired state still applies. WDYT?

@bashtanov
Copy link
Contributor Author

This is a test only change though

No, we actually change the behaviour of Redpanda by returning a different error code. I think users should be prepared to it.

@bharathv
Copy link
Contributor

This is a test only change though

No, we actually change the behaviour of Redpanda by returning a different error code. I think users should be prepared to it.

duh! I totally missed that commit. 🙈

@bashtanov bashtanov requested a review from bharathv June 2, 2025 08:22
@bashtanov bashtanov merged commit fbd6eee into redpanda-data:dev Jun 3, 2025
18 of 21 checks passed
@bashtanov
Copy link
Contributor Author

/backport v25.1.x v24.3.x v24.2.x

@bashtanov
Copy link
Contributor Author

/backport v25.1.x

@bashtanov
Copy link
Contributor Author

/backport v24.3.x

@bashtanov
Copy link
Contributor Author

/backport v24.2.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants