shard_placement_table: stress test and fixes #17194

Conversation
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46472#018e5935-7d6d-4145-a44e-ffea99a170fa

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46622#018e670d-c4e5-45b9-8309-b2a322f471c2

new failures in https://buildkite.com/redpanda/redpanda/builds/46622#018e670d-c4ea-4d8b-afc4-ee9eee8f1675

Test failure looks like an instance of #15261

/ci-repeat 1
auto dest_it = dest._states.find(ntp);
if (
  dest_it == dest._states.end()
  || dest_it->second.shard_revision < shard_rev) {
are there any anomalies that could be identified by a dest_it->second.shard_revision > shard_rev relationship? I assumed this would be equality, but I could also see > having the meaning that the destination shard has at least the minimum amount of information to pass this consistency check.
are there any anomalies that could be identified by a dest_it->second.shard_revision > shard_rev relationship?

This is not as big of a deal as the < anomaly. The important thing is to ensure that the current state is never newer than target (or assigned); if this is not true, reconciliation can go wrong. If OTOH target is a bit newer than expected, we'll just take a shortcut in the partition's path across shards.
But this is a partial fix anyway (the main one is 092fa5f)
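To make the cases concrete, here is a minimal sketch of the check's possible outcomes; the names and the three-way result type are illustrative assumptions, not the PR's actual API:

#include <cstdint>
#include <optional>

// Hypothetical sketch of the consistency cases discussed above. dest_rev
// is the shard revision the destination recorded for the ntp, shard_rev
// is the revision the source is transferring under.
using shard_revision_id = int64_t;

enum class dest_check {
    wait,     // destination is behind the source: not safe to transfer yet
    proceed,  // revisions match: source and destination are in sync
    shortcut, // destination is ahead: benign, the target moved on and the
              // partition just takes a shortcut across shards
};

dest_check check_destination(
  std::optional<shard_revision_id> dest_rev, shard_revision_id shard_rev) {
    if (!dest_rev || *dest_rev < shard_rev) {
        return dest_check::wait;
    }
    return *dest_rev == shard_rev ? dest_check::proceed
                                  : dest_check::shortcut;
}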
when we update the target before even creating the initial instance

is target the same as destination shard?
is target the same as destination shard?

Destination or initial. The shard where the partition is supposed to be.
/// If this shard is the initial shard for some incarnation of this
/// partition on this node, this field will contain the corresponding
/// log revision.
std::optional<model::revision_id> _is_initial_for;
Why is tracking the initial shard important? Is it because some residual state sticks around on the initial shard instead? I didn't look at the RFC; is this explained there? I'll go take a look at that.
Well, we have to remember that the initial shard can create the partition without waiting for anybody. But we can't do it by immediately setting the current state to "hosted", because the shard can still be occupied by the previous log revision. So _is_initial_for becomes an intermediate place where we can store this information while the previous log revision is being cleaned up.
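A minimal sketch of that idea, assuming simplified, hypothetical types (the real field lives in shard_placement_table's per-ntp state):

#include <cstdint>
#include <optional>

// The initial shard may create the partition without waiting for anybody,
// but only after any leftover state from a previous log revision of the
// same ntp is gone; is_initial_for holds the intent in the meantime.
using log_revision_id = int64_t;

struct placement_state {
    // log revision of the instance currently hosted on this shard, if any
    std::optional<log_revision_id> current;
    // set if this shard is the initial shard for some incarnation
    std::optional<log_revision_id> is_initial_for;
};

bool can_create_initial_instance(
  const placement_state& st, log_revision_id expected) {
    // A previous log revision still occupies this shard: cleanup first.
    if (st.current && *st.current != expected) {
        return false;
    }
    // Nothing hosted yet and we remembered we are the initial shard.
    return !st.current && st.is_initial_for == expected;
}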
// Revision check protects against race conditions between operations
// on instances of the same ntp with different log revisions (e.g. after
// a topic was deleted and then re-created). These operations can happen
// on different shards, therefore erase() corresponding to the old
// instance can happen after update() corresponding to the new one. Note
// that concurrent updates are not a problem during cross-shard
// transfers because even though corresponding erase() and update() will
// have the same log_revision, update() will always come after erase().
🙏
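A minimal sketch of the guard that comment describes, with hypothetical simplified types standing in for the real table:

#include <cstdint>
#include <map>
#include <string>

// An erase() belonging to an old incarnation of a topic may arrive after
// the update() for the new incarnation, so the erase checks the stored
// log revision before applying itself.
using log_revision_id = int64_t;

struct entry {
    log_revision_id log_revision;
};

std::map<std::string, entry> table; // keyed by ntp in this sketch

void erase_instance(const std::string& ntp, log_revision_id log_rev) {
    auto it = table.find(ntp);
    // Only erase the exact incarnation we were asked about; if a newer
    // update() already replaced the entry, the stale erase is a no-op.
    if (it != table.end() && it->second.log_revision == log_rev) {
        table.erase(it);
    }
}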
co_await spt.local().set_target(
I can't quite tell, but it looks like concurrent partition movements are allowed and occur in the stress test. Is that true?
In the test, a new shard assignment or a new log revision can be introduced concurrently while an old transfer is still ongoing. Transfers for several log revisions can be concurrent.
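A rough sketch of the event mix this implies, under the assumption (not taken from the PR's actual test code) that every new target bumps the shard revision and a topic delete + recreate bumps the log revision:

#include <cstdint>
#include <cstdio>
#include <random>

// While a transfer for one log revision may still be in flight, the test
// can issue a new shard assignment (same log revision, new shard
// revision) or bump the log revision, so transfers for several log
// revisions can overlap.
int main() {
    std::mt19937 rng{42};
    int64_t log_rev = 1;
    int64_t shard_rev = 1;
    for (int i = 0; i < 10; ++i) {
        if (std::uniform_int_distribution<int>(0, 3)(rng) == 0) {
            ++log_rev; // delete + recreate: new incarnation of the ntp
        }
        ++shard_rev; // every new target gets a fresh shard revision
        std::printf(
          "set_target(log_rev=%lld, shard_rev=%lld)\n",
          static_cast<long long>(log_rev),
          static_cast<long long>(shard_rev));
    }
}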
@@ -37,15 +37,33 @@ struct ntp_table {
    model::revision_id revision;
};

using partition_key = std::pair<model::ntp, model::revision_id>;
and that there is max 1 transfer in progress.

Oh, I just saw this in the next commit message!
It will be used in cases when shard_placement_table is in the middle of an update and controller_backend has to wait for it to finish before reconciling partitions.
Introduce additional check when preparing xshard-transfer that shard-local states at source and destination are consistent (i.e. that the destination knows that it is the destination)
Use log revision instead of shard revision for tracking the initial shard for a partition. This is important in the scenario when we update the target before even creating the initial instance. E.g.:
1. set initial target to s1 (log revision lr, shard revision sr1)
2. quickly update target to s2 (same log revision lr, shard revision sr2)
3. shard s1 must still be confident that it can create the partition, as there is no previous instance with the same log revision.
Add a function that calculates expected log revision of a partition on a particular node based on the corresponding partition_replicas_view
Introduce a helper that calculates the required reconciliation action for an NTP on this shard, and use it in controller_backend (a rough sketch follows these commit messages).
Concurrent updates are a problem only for partitions with different log revisions, as updates corresponding to x-shard movements (when log revision stays the same but the shard revision changes) are sequential.
Previously we stored the target shard for the ntp on all shards with state related to that ntp. This way, during a cross-shard transfer the source shard knows immediately where to transfer state to. The drawback is that it is hard to synchronize state updates coming from different controller_backend reconciliation fibers and from updating targets. Introduce a simple assignment marker instead that shows whether the partition should be present on the current shard (to get the destination shard we have to query the node-wide map on shard 0).
Even if we haven't yet started the partition instance anywhere (and therefore state.current is nullopt on all shards) we must still allow transfers to happen. In this case the role of state.current is played by state._is_initial_for.
Check that in the process of reconciliation we don't launch several partition instances concurrently; and that there is max 1 transfer in progress.
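A rough sketch of the reconciliation-action helper mentioned above; all names are hypothetical and the state is heavily simplified, but it also illustrates the assignment marker from the commits: a local flag saying whether the partition should be present on this shard, instead of storing the full target shard everywhere.

#include <cstdint>
#include <optional>

using log_revision_id = int64_t;

enum class shard_local_action { nothing, create, remove, transfer };

struct local_state {
    std::optional<log_revision_id> current; // hosted instance, if any
    bool assigned = false;                  // assignment marker
};

shard_local_action required_action(
  const local_state& st, std::optional<log_revision_id> expected) {
    // The ntp is no longer expected on this node, or a stale incarnation
    // is still hosted here: clean it up.
    if (!expected || (st.current && *st.current != *expected)) {
        return st.current ? shard_local_action::remove
                          : shard_local_action::nothing;
    }
    if (st.assigned) {
        // Expected on this shard: create if no instance is running yet.
        return st.current ? shard_local_action::nothing
                          : shard_local_action::create;
    }
    // The instance is here but the assignment moved to another shard
    // (looked up in the node-wide map on shard 0): hand it off.
    return st.current ? shard_local_action::transfer
                      : shard_local_action::nothing;
}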
Force-pushed from e05e6c3 to fe2f0c8 (Compare)

force-push to resolve conflicts
Add a stress test testing the logic in shard_placement_table and fix a few bugs that it found.

Backports Required

Release Notes