
shard_placement_table: stress test and fixes #17194

Merged: 12 commits from flex-assignment-stress-test into redpanda-data:dev, Mar 26, 2024

Conversation

ztlpn (Contributor) commented Mar 19, 2024

Add a stress test exercising the logic in shard_placement_table, and fix a few bugs that it found.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@ztlpn ztlpn changed the title from "Flexible partition core assignment: stress test and fixes" to "shard_placement_table: stress test and fixes" on Mar 19, 2024
vbotbuildovich (Collaborator) commented Mar 20, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/46622#018e670d-c4ea-4d8b-afc4-ee9eee8f1675:

"rptest.tests.consumer_group_recovery_test.ConsumerOffsetsRecoveryTest.test_consumer_offsets_partition_recovery"

ztlpn (Contributor, Author) commented Mar 22, 2024

Test failure looks like an instance of #15261

ztlpn (Contributor, Author) commented Mar 22, 2024

/ci-repeat 1
skip-units
skip-redpanda-build
dt-repeat=5
tests/rptest/tests/consumer_group_recovery_test.py

auto dest_it = dest._states.find(ntp);
if (
  dest_it == dest._states.end()
  || dest_it->second.shard_revision < shard_rev) {
Member commented:
Are there any anomalies that could be identified by a dest_it->second.shard_revision > shard_rev relationship? I assumed this would be equality, but I could also see > meaning that the destination shard has at least the minimum amount of information to pass this consistency check.

ztlpn (Contributor, Author) replied Mar 26, 2024:

> Are there any anomalies that could be identified by a dest_it->second.shard_revision > shard_rev relationship?

This is not as big of a deal as the < anomaly. The important thing is to ensure that the current state is never newer than the target (or assigned) state; if that is not true, reconciliation can go wrong. If, on the other hand, the target is a bit newer than expected, we'll just take a shortcut in the partition's path across shards.

But this is a partial fix anyway (the main one is 092fa5f)
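
To make the asymmetry concrete, here is a minimal sketch of the invariant being described; the type alias and function name are illustrative, not the actual shard_placement_table code:

#include <cstdint>

// Stand-in for the real shard revision type used in the PR.
using shard_revision = int64_t;

// The dangerous direction: the shard-local *current* state being newer
// than the *target* (assigned) state; reconciliation can go wrong here.
// The opposite direction (target newer than current) is benign: the
// partition simply takes a shortcut to the newer destination shard.
bool current_state_anomalous(shard_revision current, shard_revision target) {
    return current > target;
}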

Member commented:
> when we update the target before

Is target the same as destination shard?

ztlpn (Contributor, Author) replied:

> Is target the same as destination shard?

Destination or initial. The shard where the partition is supposed to be.

Comment on lines 107 to 132
/// If this shard is the initial shard for some incarnation of this
/// partition on this node, this field will contain the corresponding
/// log revision.
std::optional<model::revision_id> _is_initial_for;
Member commented:

Why is tracking the initial shard important? Is it because some residual state sticks around on the initial shard instead? I haven't looked at the RFC; is this explained there? I'll go take a look.

ztlpn (Contributor, Author) replied:

Well, we have to remember that the initial shard can create the partition without waiting for anybody. But we can't do that by immediately setting the current state to "hosted", because the shard can still be occupied by the previous log revision. So _is_initial_for becomes an intermediate place to store this information while the previous log revision is being cleaned up.
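
A rough sketch of that lifecycle, with hypothetical names (only _is_initial_for and model::revision_id come from the actual diff):

#include <optional>

struct placement_state {
    // Set once a partition instance actually exists on this shard.
    std::optional<model::revision_id> current;
    // Set when this shard may create the partition for a given log
    // revision, but the shard may still hold the previous incarnation,
    // so current can't be set to "hosted" yet.
    std::optional<model::revision_id> _is_initial_for;
};

// Hypothetical step: once cleanup of the previous log revision is done,
// the intermediate marker is promoted into the real hosted state.
void on_previous_revision_cleaned_up(placement_state& st) {
    if (st._is_initial_for) {
        st.current = st._is_initial_for;
        st._is_initial_for.reset();
    }
}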

Comment on lines +80 to +87
// Revision check protects against race conditions between operations
// on instances of the same ntp with different log revisions (e.g. after
// a topic was deleted and then re-created). These operations can happen
// on different shards, therefore erase() corresponding to the old
// instance can happen after update() corresponding to the new one. Note
// that concurrent updates are not a problem during cross-shard
// transfers because even though corresponding erase() and update() will
// have the same log_revision, update() will always come after erase().
Member commented:
🙏
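
For readers following along, a sketch of the guard that comment describes; the map layout and erase signature here are hypothetical (model::ntp and model::revision_id are real types from the diff):

#include <map>

struct entry_t {
    model::revision_id log_revision;
    // ... other per-ntp shard-local state ...
};

// A stale erase() for the old incarnation (deleted topic) may arrive
// after the update() for the new incarnation (re-created topic), since
// they can run on different shards. The revision check turns the stale
// erase() into a no-op instead of destroying the new entry.
void erase(
  std::map<model::ntp, entry_t>& states,
  const model::ntp& ntp,
  model::revision_id log_revision) {
    auto it = states.find(ntp);
    if (it != states.end() && it->second.log_revision <= log_revision) {
        states.erase(it);
    }
    // else: the stored entry belongs to a newer log revision; ignore.
}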

}
}

co_await spt.local().set_target(
Member commented:

I can't quite tell, but it looks like concurrent partition movements are allowed and occur in the stress test. Is that true?

ztlpn (Contributor, Author) replied:

In the test, a new shard assignment or a new log revision can be introduced while an old transfer is still ongoing, so transfers for several different log revisions can be in flight concurrently.

@@ -37,15 +37,33 @@ struct ntp_table {
model::revision_id revision;
};

using partition_key = std::pair<model::ntp, model::revision_id>;
Member commented:

> and that there is max 1 transfer in progress.

Oh, I just saw this in the next commit message!

ztlpn added 12 commits March 25, 2024 14:09

  • It will be used in cases when shard_placement_table is in the middle of an update and controller_backend has to wait for it to finish before reconciling partitions.
  • Introduce an additional check when preparing a cross-shard transfer that the shard-local states at the source and destination are consistent (i.e. that the destination knows that it is the destination).
  • Use the log revision instead of the shard revision for tracking the initial shard for a partition. This is important in the scenario when we update the target before even creating the initial instance. E.g.:
    1. set initial target to s1 (log revision lr, shard revision sr1)
    2. quickly update target to s2 (same log revision lr, shard revision sr2)
    3. shard s1 must still be confident that it can create the partition, as there is no previous instance with the same log revision.
  • Add a function that calculates the expected log revision of a partition on a particular node based on the corresponding partition_replicas_view.
  • Introduce a helper that calculates the required reconciliation action for an NTP on this shard, and use it in controller_backend.
  • Concurrent updates are a problem only for partitions with different log revisions, as updates corresponding to cross-shard movements (when the log revision stays the same but the shard revision changes) are sequential.
  • Previously we stored the target shard for the ntp on all shards with state related to that ntp. This way, during a cross-shard transfer the source shard knows immediately where to transfer state to. The drawback is that it is hard to synchronize state updates coming from different controller_backend reconciliation fibers and from updating targets. Introduce a simple assignment marker instead that shows whether the partition should be present on the current shard (to get the destination shard we have to query the node-wide map on shard 0); see the sketch after this list.
  • Even if we haven't yet started the partition instance anywhere (and therefore state.current is nullopt on all shards), we must still allow transfers to happen. In this case the role of state.current is played by state._is_initial_for.
  • Check that in the process of reconciliation we don't launch several partition instances concurrently, and that there is max 1 transfer in progress.
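
A sketch of the assignment-marker design from the commit above; shard_local_assignment, destination_shard and lookup_target_shard are hypothetical names, while ss::sharded and shard_placement_table appear in the PR itself:

#include <optional>
#include <seastar/core/coroutine.hh>
#include <seastar/core/sharded.hh>

namespace ss = seastar;

struct shard_local_assignment {
    // The only shard-local fact: is the partition expected here?
    bool assigned = false;
};

// The destination shard is no longer duplicated on every shard; it is
// obtained on demand from the node-wide map that lives on shard 0.
ss::future<std::optional<ss::shard_id>> destination_shard(
  ss::sharded<shard_placement_table>& spt, model::ntp ntp) {
    co_return co_await spt.invoke_on(
      0, [ntp = std::move(ntp)](shard_placement_table& table) {
          return table.lookup_target_shard(ntp); // hypothetical accessor
      });
}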
@ztlpn ztlpn force-pushed the flex-assignment-stress-test branch from e05e6c3 to fe2f0c8 on March 25, 2024 13:12
ztlpn (Contributor, Author) commented Mar 25, 2024

force-push to resolve conflicts

@ztlpn ztlpn merged commit a284b0f into redpanda-data:dev Mar 26, 2024
17 checks passed
@ztlpn ztlpn deleted the flex-assignment-stress-test branch March 26, 2024 12:11