[DocDB] Delete Snapshot keeps on retrying indefinitely after master failover #16367
Comments
…master failover

Summary:
When a delete_snapshot command is issued and (1) one of the tablets involved in the delete snapshot operation has been deleted (for example, due to tablet splitting) and (2) the master leader fails over (so the new master has no state for this deleted parent tablet), the master snapshot coordinator is unable to find a tserver replica for this tablet and keeps retrying the DELETE_ON_TABLET async task.

This should not be confused with [[ #16631 | GH-16631 ]], where the master had stale tablet info and was therefore able to send the RPC to a tserver. In this bug, the master is unable to send RPCs at all and keeps retrying the async task.

The diff fixes the issue by adding `HandleReplicaLookupFailure`, which optionally returns a `MonitoredTaskState` to transition to when the replica lookup for the task fails. By default, this function returns `nullopt`. For an AsyncTabletSnapshotOp, if the returned error status is "NOT_FOUND" and the `TabletSnapshotOpRequestPB` operation is `DELETE_ON_TABLET`, the task is considered completed, so the master snapshot coordinator stops rerunning `DELETE_ON_TABLET` RPCs for the already deleted tablet.

Jira Issue(s): DB-5784

Test Plan: ybd --cxx_test yb-backup-cross-feature-test --gtest-filter YBBackupTest.DeleteSnapshotAfterTabletSplittingAndMasterFailover

Reviewers: skedia, zdrudi

Reviewed By: zdrudi

Subscribers: bogdan, slingam, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D24645
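The hook described in the commit message can be illustrated with a minimal, self-contained sketch. This is not the actual YugabyteDB code: `RetryingTask`, `Status`, `StatusCode`, `SnapshotOp`, and `OnReplicaLookupFailed` are hypothetical stand-ins introduced only for illustration. Only the names `HandleReplicaLookupFailure`, `MonitoredTaskState`, `AsyncTabletSnapshotOp`, and `DELETE_ON_TABLET` come from the commit message, and their real signatures may differ.

```cpp
#include <optional>

// Hypothetical, simplified stand-ins for the real master-side types.
enum class MonitoredTaskState { kWaiting, kRunning, kComplete, kFailed };
enum class StatusCode { kOk, kNotFound, kTimedOut };
enum class SnapshotOp { kCreateOnTablet, kRestoreOnTablet, kDeleteOnTablet };

struct Status {
  StatusCode code = StatusCode::kOk;
  bool IsNotFound() const { return code == StatusCode::kNotFound; }
};

// Hypothetical base class for a task that retries until it reaches a replica.
class RetryingTask {
 public:
  virtual ~RetryingTask() = default;

  // Default behaviour: express no opinion, so the caller keeps retrying.
  virtual std::optional<MonitoredTaskState> HandleReplicaLookupFailure(
      const Status&) {
    return std::nullopt;
  }

  // Called when the replica lookup fails (hypothetical helper). If the hook
  // names a state, transition to it; otherwise go back to waiting for a retry.
  MonitoredTaskState OnReplicaLookupFailed(const Status& s) {
    if (auto next = HandleReplicaLookupFailure(s)) {
      state_ = *next;
    } else {
      state_ = MonitoredTaskState::kWaiting;
    }
    return state_;
  }

 protected:
  MonitoredTaskState state_ = MonitoredTaskState::kRunning;
};

class AsyncTabletSnapshotOp : public RetryingTask {
 public:
  explicit AsyncTabletSnapshotOp(SnapshotOp op) : op_(op) {}

  // A tablet that no longer exists has nothing left to delete, so a
  // NOT_FOUND lookup for a delete operation counts as completion.
  std::optional<MonitoredTaskState> HandleReplicaLookupFailure(
      const Status& s) override {
    if (s.IsNotFound() && op_ == SnapshotOp::kDeleteOnTablet) {
      return MonitoredTaskState::kComplete;
    }
    return std::nullopt;
  }

 private:
  SnapshotOp op_;
};
```

In this sketch, a delete operation whose lookup returns NOT_FOUND moves to the complete state, while any other operation or error code falls through to the default and is retried. Returning an optional state keeps the existing retry behaviour for every task type that does not override the hook.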
Any fix for this?
The issue is fixed with this commit.
@yamen-haddad thanks for the update. Thanks again.
The fix is currently on master.
@yamen-haddad thanks. Please comment here when backporting is done.
@yamen-haddad docker image yugabytedb/yugabyte:2.14.10.0-b23 was generated two days ago; does it contain the above fix?
@datavisorzhizhu The fix is waiting for review, so it has not made it into 2.14 yet.
…finitely after master failover Summary: Original commit: 2ad4b72 / D24645. The summary text is identical to the original commit message above. Jira Issue(s): DB-5784 Test Plan: ybd --cxx_test yb-backup-cross-feature-test --gtest-filter YBBackupTest.DeleteSnapshotAfterTabletSplittingAndMasterFailover Reviewers: skedia, zdrudi Reviewed By: zdrudi Subscribers: bogdan, slingam, ybase Differential Revision: https://phorge.dev.yugabyte.com/D25914
…finitely after master failover Summary: Original commit: 2ad4b72 / D24645. The summary text is identical to the original commit message above. Jira Issue(s): DB-5784 Test Plan: ybd --cxx_test yb-backup-cross-feature-test --gtest-filter YBBackupTest.DeleteSnapshotAfterTabletSplittingAndMasterFailover Reviewers: skedia, zdrudi Reviewed By: zdrudi Subscribers: ybase, slingam, bogdan Differential Revision: https://phorge.dev.yugabyte.com/D25912
@datavisorzhizhu The fix is now backported and will be in [2.18.1.0], [2.16.6.0], [2.19.0.0]. A backport to 2.14 is still in review.
@yamen-haddad [2.16.6.0] is also acceptable to me. Thanks.
…finitely after master failover

Summary:
Original commit: 2ad4b72 / D24645. The summary text is identical to the original commit message above.

Jira Issue(s): DB-5784

Test Plan: Tested manually, as many test utilities were not backported to 2.14. To reproduce, run the following:

./bin/yb-ctl destroy
./bin/yb-ctl create --rf=3
./bin/ysqlsh

-- Create a table with 3 pre-split tablets.
CREATE TABLE t (k INT, v INT) SPLIT INTO 3 TABLETS;
INSERT INTO t (k,v) SELECT i,i FROM generate_series(1,100) AS i;
\q

-- Create a database snapshot.
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 create_database_snapshot ysql.yugabyte

-- Split one of the tablets to make 4 tablets and wait for the parent tablet to be deleted.
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_tablets ysql.yugabyte t 0
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 split_tablet
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_snapshots SHOW_DELETED

-- Perform a master leader failover. The new master will not have any state for this deleted parent tablet.
./bin/yb-ctl status
kill -9 <master leader pid>

-- Delete the snapshot.
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 delete_snapshot <snapshot id>

Before the fix: The master keeps retrying the DELETE_ON_TABLET task for the already deleted tablet, and the snapshot stays in the DELETING state.
After the fix: The master stops retrying the delete RPC, and the snapshot is marked as DELETED.

Reviewers: skedia, zdrudi

Reviewed By: zdrudi

Subscribers: ybase, slingam, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D25915
@yamen-haddad, looks like this can be closed now?
@datavisorzhizhu Just to let you know that the fix will be available in 2.14.11.0.
@rthallamko3 Yes. Thanks.
@yamen-haddad thanks for the recent update.
Jira Link: DB-5784
Description
Observed this in the latest master. Steps to repro: