Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flex partition assignment stage 1 #16617

Merged
merged 15 commits into from
Mar 11, 2024

Conversation

ztlpn
Copy link
Contributor

@ztlpn ztlpn commented Feb 16, 2024

Introduce shard_balancer and shard_placement_table and use them to drive reconciliation of partition placement across shards. Cross-shard movements are executed with the new “push” protocol. The assignments themselves still come from controller metadata.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

@ztlpn
Copy link
Contributor Author

ztlpn commented Feb 16, 2024

/ci-repeat

@ztlpn ztlpn marked this pull request as ready for review February 16, 2024 09:24
@vbotbuildovich
Copy link
Collaborator

new failures in https://buildkite.com/redpanda/redpanda/builds/45073#018db180-a3f6-4091-9223-728f4285b24a:

"rptest.tests.retention_policy_test.ShadowIndexingCloudRetentionTest.test_cloud_time_based_retention.cloud_storage_type=CloudStorageType.ABS"

// per-shard state
//
// node_hash_map for pointer stability
absl::node_hash_map<model::ntp, placement_state> _states;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider using a hierarchical data structure to save memory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am actually on a fence about this as it would complicate the code considerably (we are constantly adding and removing stuff to/from this map). And we need it to be per-ntp (not per-group for example), as we have to be able to find previous incarnation of the partition with the same ntp to delete it before creating the new instance.

@ztlpn ztlpn force-pushed the flex-assignment-shard-balancer branch 2 times, most recently from 18cc588 to a56c40a Compare March 4, 2024 20:52
@ztlpn ztlpn requested a review from mmaslankaprv March 5, 2024 08:42
@ztlpn ztlpn force-pushed the flex-assignment-shard-balancer branch from a56c40a to 0343185 Compare March 5, 2024 14:11
/// this field will contain the corresponding shard revision.
model::shard_revision_id _is_initial_at_revision;
/// If x-shard transfer is in progress, will hold the destination.
std::optional<ss::shard_id> _next;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't that equal to target.shard ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its purpose is to hold the destination if the target changes (so that we can finish started transfer). I'll update the comment.

This is a helper struct that has various replica-related metadata
all in one place, so that the API user doesn't have to query several
maps manually.
@ztlpn ztlpn force-pushed the flex-assignment-shard-balancer branch from 0343185 to 0d081f3 Compare March 6, 2024 22:46
@ztlpn ztlpn requested a review from mmaslankaprv March 6, 2024 22:49
/// partition.
struct shard_placement_target {
model::revision_id log_revision;
ss::shard_id shard = -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shard_id is unsigned, think assigning -1 results in an underflow, perhaps a good usecase for std::optional

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's intentional, I wanted to have a definite but invalid default value (I really dislike fields that can be initialized with garbage :)). std::optional is not really fitting here (a possibility for absence of the target shard should be expressed by std::optional<shard_placement_target>). Added a constructor instead to ensure that this field is always initialized.

This will make it easier to construct mock in_progress_update instances
in unit tests.
This is an independent reconciliation dimension in addition to regular
topic_table revision, i.e. shard_revision for a partition can change
with revision remaining the same and controller_backend will have to
reconcile it.
shard_placement_table is a node-local data structure tracking
ntp -> shard mapping for each partition hosted on this node.
shard_balancer runs on shard 0 of each node and manages assignments of
partitions hosted on this node to shards.
Wire up controller_backend with shard_placement_table and
shard_balancer. Controller backend is notified by shard_balancer and
then checks shard placement table for reconciliation actions it needs to
perform.
@ztlpn ztlpn force-pushed the flex-assignment-shard-balancer branch from 0d081f3 to 8e8fd4a Compare March 8, 2024 15:21
@ztlpn ztlpn requested a review from bharathv March 8, 2024 15:23
@ztlpn ztlpn merged commit 67348a1 into redpanda-data:dev Mar 11, 2024
16 checks passed
Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants