Flex partition assignment stage 1 #16617
Conversation
/ci-repeat
New failures in https://buildkite.com/redpanda/redpanda/builds/45073#018db180-a3f6-4091-9223-728f4285b24a
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45073#018db180-a3f3-40bf-96df-53328736e0fe
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45073#018db191-d49c-42c3-a804-12ea3c3fd9db
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45611#018e0995-8456-4ec0-9188-dd044f79dc2a
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45651#018e0b94-481d-4d5d-9b0c-33a98b3d4ed5
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45697#018e0f4e-f744-4cd9-b77f-f3cd1bdd3a01
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45879#018e1ee8-ff89-4c93-b9e5-a535df6d47af
// per-shard state
//
// node_hash_map for pointer stability
absl::node_hash_map<model::ntp, placement_state> _states;
consider using a hierarchical data structure to save memory
I am actually on the fence about this, as it would complicate the code considerably (we are constantly adding and removing entries to/from this map). And we need it to be per-ntp (not per-group, for example), because we have to be able to find the previous incarnation of the partition with the same ntp and delete it before creating the new instance.
Force-pushed from 18cc588 to a56c40a
And use it as a substitution for has_local_replicas
Force-pushed from a56c40a to 0343185
/// this field will contain the corresponding shard revision.
model::shard_revision_id _is_initial_at_revision;
/// If x-shard transfer is in progress, will hold the destination.
std::optional<ss::shard_id> _next;
Isn't that equal to `target.shard`?
Its purpose is to hold the destination if the target changes (so that we can finish an already-started transfer). I'll update the comment.
This is a helper struct that has various replica-related metadata all in one place, so that the API user doesn't have to query several maps manually.
Force-pushed from 0343185 to 0d081f3
src/v/cluster/types.h (Outdated)
/// partition.
struct shard_placement_target {
    model::revision_id log_revision;
    ss::shard_id shard = -1;
shard_id is unsigned; I think assigning -1 results in an underflow. Perhaps a good use case for std::optional.
That's intentional, I wanted a definite but invalid default value (I really dislike fields that can be initialized with garbage :)). std::optional is not really fitting here (the possibility of the target shard being absent should be expressed by std::optional<shard_placement_target>). Added a constructor instead to ensure that this field is always initialized.
This will make it easier to construct mock in_progress_update instances in unit tests.
This is an independent reconciliation dimension in addition to regular topic_table revision, i.e. shard_revision for a partition can change with revision remaining the same and controller_backend will have to reconcile it.
shard_placement_table is a node-local data structure tracking ntp -> shard mapping for each partition hosted on this node.
shard_balancer runs on shard 0 of each node and manages assignments of partitions hosted on this node to shards.
Wire up controller_backend with shard_placement_table and shard_balancer. Controller backend is notified by shard_balancer and then checks shard placement table for reconciliation actions it needs to perform.
Force-pushed from 0d081f3 to 8e8fd4a
🔥
Introduce shard_balancer and shard_placement_table and use them to drive reconciliation of partition placement across shards. Cross-shard movements are executed with the new "push" protocol. The assignments themselves still come from controller metadata.

Backports Required
Release Notes