Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[docdb] Do not remote bootstrap newly split tablets to raft peers who have accepted split op in parent tablet #8891

Closed
robertsami opened this issue Jun 14, 2021 · 0 comments
Assignees
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug

Comments

@robertsami
Copy link
Contributor

robertsami commented Jun 14, 2021

Suppose we are splitting a tablet X into children subtablets Y and Z. We have nodes in our cluster A, B, & C.

Let's say A is the leader of tablet X, and successfully commits the split operation. However, the operation is only applied successfully at A and B. Then, when node A tries to communicate raft consensus messages for tablet Y to node C, node C will respond with TabletServerErrorPB::TABLET_NOT_FOUND, and node A will attempt to trigger a remote bootstrap of tablet Y to node C: https://github.com/yugabyte/yugabyte-db/blob/master/src/yb/consensus/consensus_queue.cc#L1093

In many cases, this execution path can race against applying the split op at node C and cause us to remote bootstrap subtablets Y and Z to node C right before we have a chance to apply the split op locally. Applying the split op locally is a near zero-cost operation relative to remote bootstrapping tablets which requires streaming a bunch of data that node C already has anyway. In these cases, we should avoid remote bootstrapping tablets Y and Z to node C in order to give node C a chance to apply the split op itself locally.

In order to achieve this, we can modify StartRemoteBootstrapRequestPB to include an optional string parent_tablet_id. When present, in ConsensusServiceImpl::UpdateConsensus, we will check for a parent_tablet_id at node C's tablet manager. If the parent X is present, but the child Y is not, instead of proceeding with remote bootstrap, we can respond with a new enum value, TabletServerErrorPB::PARENT_NOT_SPLIT.

In this baseline proposal, node A will continue to send a StartRemoteBootstrapRequest to node C over and over. Eventually, we can also add logic such that when node A receives this error, its able to avoid re-initiating a remote bootstrap of tablet Y to node C, or at least wait before doing so again.

TODO: let's figure out how we want to make this backwards compatible. See if we can use the capabilities framework (https://github.com/yugabyte/yugabyte-db/blob/master/src/yb/util/capabilities.h#L31)

@bmatican bmatican added the area/docdb YugabyteDB core features label Jun 16, 2021
robertsami added a commit that referenced this issue Jul 3, 2021
…eers who have accepted split op in parent tablet

Summary:
When a split operation is raft committed, it will be asynchronously applied at each node. Suppose some nodes A and B apply the split op before another node C. Then for each subtablet, node A and B will form raft groups which include node C. If raft messages are exchanged and the leader of a raft group realized that node C is still not aware of a new subtablet, it will initiate a remote bootstrap, not realizing that node C has not had enough time to try applying the split op. On the other hand, applying the split op locally at node C would be much more efficient, since no data needs to be streamed.

To guard against remote bootstraps of newly split subtablets, we augment the remote bootstrap protocol such that StartRemoteBootstrapRequestPB includes an optional parent_tablet_id which is filled out if the tablet_id being bootstrapped is the result of a tablet split. In this case, we will reject the request at the tablet_service level if the local catalog manager is aware of the parent tablet id, since this means there's a chance for it to locally apply the raft-committed split op rather than remote bootstrapping to create the new subtablets.

Test Plan: ybd --cxx-test integration-tests_tablet-split-itest --gtest_filter TabletSplitExternalMiniClusterITest.FaultedSplitNodeRejectsRemoteBootstrap

Reviewers: bogdan, timur

Reviewed By: timur

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D12006
@ttyusupov ttyusupov added the kind/bug This issue is a bug label Jul 29, 2021
ttyusupov added a commit that referenced this issue Apr 11, 2022
Summary:
The following scenario was failing:
1. We have 3 nodes RF=3
2. TS-1 goes offline
3. Parent tablet receives SPLIT_OP
4. Parent tablet applies SPLIT_OP on TS-2 and TS-3 and flushed_op_id is updated to split_op_id, because SPLIT_OP is applied.
5. TS-4 is added
6. Parent tablet got RBSed to TS-4
7. Parent tablet on TS-4 doesn’t apply SPLIT_OP, because local bootstrap is only replaying ops starting with `flushed_op_id + 1`. Child tablets are not RBSed to TS-4 because we reject RBS of child tablets if the parent tablet is already present (#8891) and rely on the parent tablet to split locally, but it can’t because of `flushed_op_id`.
8. Parent tablet is not deleted, because not all replicas are split.

Implemented the following fix for this issue:
- Updated `ConsensusServiceImpl::StartRemoteBootstrap` to allow RBS of child tablets if parent tablet replica has been already split or remote bootstrapped from a replica that has been already split.
- Switched test flag `load_balancer_skip_inactive_tablets` to false by default, because it was added as a temporary workaround to reduce the risk of the issue being fixed.

Additional changes:
- Fixed tablet metadata split_op_id initialization for child tablets (should be empty) and added sanity checks.
- Fixed local tablet bootstrap to properly set op_id for SPLIT_OP being replayed.
- Improved some log messages.

Also updated `TabletSplitExternalMiniClusterITest.ReplaceNodeForParentTablet` to verify the scenario above. Added `TabletSplitITest.ParentRemoteBootstrapAfterWritesToChildren` that verifies that child tablets are not affected by split parent tablet bootstrap and replay of `SPLIT_OP`.

Test Plan: TabletSplitExternalMiniClusterITest.ReplaceNodeForParentTablet, TabletSplitITest.ParentRemoteBootstrapAfterWritesToChildren

Reviewers: bogdan, rsami

Reviewed By: rsami

Subscribers: jenkins-bot, ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D13617
ttyusupov added a commit that referenced this issue Jul 4, 2022
Summary:
The following scenario was failing:
1. We have 3 nodes RF=3
2. TS-1 goes offline
3. Parent tablet receives SPLIT_OP
4. Parent tablet applies SPLIT_OP on TS-2 and TS-3 and flushed_op_id is updated to split_op_id, because SPLIT_OP is applied.
5. TS-4 is added
6. Parent tablet got RBSed to TS-4
7. Parent tablet on TS-4 doesn’t apply SPLIT_OP, because local bootstrap is only replaying ops starting with `flushed_op_id + 1`. Child tablets are not RBSed to TS-4 because we reject RBS of child tablets if the parent tablet is already present (#8891) and rely on the parent tablet to split locally, but it can’t because of `flushed_op_id`.
8. Parent tablet is not deleted, because not all replicas are split.

Implemented the following fix for this issue:
- Updated `ConsensusServiceImpl::StartRemoteBootstrap` to allow RBS of child tablets if parent tablet replica has been already split or remote bootstrapped from a replica that has been already split.
- Switched test flag `load_balancer_skip_inactive_tablets` to false by default, because it was added as a temporary workaround to reduce the risk of the issue being fixed.

Additional changes:
- Fixed tablet metadata split_op_id initialization for child tablets (should be empty) and added sanity checks.
- Fixed local tablet bootstrap to properly set op_id for SPLIT_OP being replayed.
- Improved some log messages.

Also updated `TabletSplitExternalMiniClusterITest.ReplaceNodeForParentTablet` to verify the scenario above. Added `TabletSplitITest.ParentRemoteBootstrapAfterWritesToChildren` that verifies that child tablets are not affected by split parent tablet bootstrap and replay of `SPLIT_OP`.

Original commit: 9e056a1 / D13617

Test Plan: TabletSplitExternalMiniClusterITest.ReplaceNodeForParentTablet, TabletSplitITest.ParentRemoteBootstrapAfterWritesToChildren

Reviewers: rsami

Reviewed By: rsami

Subscribers: jenkins-bot, bogdan, yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D18009
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug
Projects
None yet
Development

No branches or pull requests

3 participants