[DocDB] Delete Snapshot keeps on retrying indefinitely after master failover #16367

sanketkedia · 2023-03-09T21:27:58Z

Jira Link: DB-5784

Description

Observed this on the latest master. Steps to reproduce:

1. Create a database snapshot on a table (or multiple tables).
2. Split one of the tablets and wait for the parent tablet to be deleted.
3. Perform a master leader failover. The new master will not have any state for the deleted parent tablet.
4. Delete the snapshot created in step 1. The snapshot RPC to delete the parent tablet keeps retrying indefinitely because it cannot find a leader for this tablet. The snapshot entry in syscatalog persists the list of tablets, and after a split the parent tablet is deleted, so deleting the snapshot attempts to send RPCs to a non-existent tablet.

Warning: Please confirm that this issue does not contain any sensitive information

I confirm this issue does not contain any sensitive information.

…master failover

Summary:
When a delete_snapshot command is issued and (1) one of the tablets involved in the delete snapshot operation has been deleted (for example, due to tablet splitting) and (2) the master leader fails over (so the new master has no state for this deleted parent tablet), the master snapshot coordinator cannot find a tserver replica for this tablet and keeps retrying the DELETE_ON_TABLET async task.

This should not be confused with [[ #16631 | GH-16631 ]], where the master had stale tablet info and was therefore able to send the RPC to a tserver. In this bug, the master is unable to send RPCs and keeps retrying the async task.

The diff fixes the issue by adding `HandleReplicaLookupFailure`, which optionally returns a `MonitoredTaskState` to transition to when the replica lookup for the task fails. By default, this function returns `nullopt`. For an AsyncTabletSnapshotOp, if the returned error status is "NOT_FOUND" and the `TabletSnapshotOpRequestPB` operation is `DELETE_ON_TABLET`, the task is considered completed, so the master snapshot coordinator stops rerunning `DELETE_ON_TABLET` RPCs for the already deleted tablet.

Jira Issue(s): DB-5784

Test Plan: ybd --cxx_test yb-backup-cross-feature-test --gtest-filter YBBackupTest.DeleteSnapshotAfterTabletSplittingAndMasterFailover

Reviewers: skedia, zdrudi
Reviewed By: zdrudi
Subscribers: bogdan, slingam, ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D24645
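The hook described in this commit message can be sketched roughly as follows. This is a simplified illustration, not the actual YugabyteDB code: the enum values, the `Status` struct, the `SnapshotOp` enum, and the `NextStateAfterLookupFailure` driver are all assumptions made for the example; only the names `HandleReplicaLookupFailure`, `MonitoredTaskState`, `AsyncTabletSnapshotOp`, and `DELETE_ON_TABLET` come from the commit message.

```cpp
#include <optional>

// Illustrative stand-ins; the real types live in the YugabyteDB master code
// and are more involved than this.
enum class MonitoredTaskState { kRunning, kComplete, kFailed };
enum class StatusCode { kOk, kNotFound, kTimedOut };
enum class SnapshotOp { kCreateOnTablet, kDeleteOnTablet };

struct Status {
  StatusCode code;
  bool IsNotFound() const { return code == StatusCode::kNotFound; }
};

// Base task: by default the hook returns nullopt, so the caller keeps the
// previous behavior and schedules another retry.
class AsyncTask {
 public:
  virtual ~AsyncTask() = default;
  virtual std::optional<MonitoredTaskState> HandleReplicaLookupFailure(
      const Status& /*lookup_status*/) {
    return std::nullopt;
  }
};

// Snapshot task: a NOT_FOUND lookup during DELETE_ON_TABLET means the tablet
// is already gone, so there is nothing left to delete.
class AsyncTabletSnapshotOp : public AsyncTask {
 public:
  explicit AsyncTabletSnapshotOp(SnapshotOp op) : op_(op) {}

  std::optional<MonitoredTaskState> HandleReplicaLookupFailure(
      const Status& lookup_status) override {
    if (lookup_status.IsNotFound() && op_ == SnapshotOp::kDeleteOnTablet) {
      return MonitoredTaskState::kComplete;
    }
    return std::nullopt;
  }

 private:
  SnapshotOp op_;
};

// Hypothetical driver step: decide what happens after a failed replica lookup.
// kRunning here stands for "stay scheduled and retry".
MonitoredTaskState NextStateAfterLookupFailure(AsyncTask& task,
                                               const Status& lookup_status) {
  return task.HandleReplicaLookupFailure(lookup_status)
      .value_or(MonitoredTaskState::kRunning);
}
```

In this sketch, only the combination of a NOT_FOUND lookup and a delete operation ends the task. Any other failure, or any other task type, falls through to the default and is retried as before.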

datavisorzhizhu · 2023-05-21T09:45:17Z

Any fix for this?

yamen-haddad · 2023-05-22T17:12:06Z

The issue is fixed with this commit.
The ticket is still open as backporting the fix to stable releases is required.

datavisorzhizhu · 2023-05-22T22:56:26Z

@yamen-haddad thanks for the update.
If possible, could you list the stable releases that will contain the fix and when they are expected?

thanks again

yamen-haddad · 2023-05-24T17:15:30Z

The fix is currently on master.
It will be backported to 2.18, 2.16, 2.14. This should happen within this week and the next week as the fix is waiting on the review of another fix.
I can comment again when the fix has been successfully backported.

datavisorzhizhu · 2023-05-24T23:35:23Z

@yamen-haddad thanks. please comment here when backporting is done

datavisorzhizhu · 2023-06-01T06:26:08Z

@yamen-haddad The Docker image yugabytedb/yugabyte:2.14.10.0-b23 was generated two days ago. Does it contain the above fix?

yamen-haddad · 2023-06-02T01:06:19Z

@datavisorzhizhu The fix is still waiting for review, so it has not made it into 2.14 yet.
I will update here once the fix has landed.

…finitely after master failover

Summary: Original commit: 2ad4b72 / D24645. The summary and test plan repeat the original commit message above.

Jira Issue(s): DB-5784

Reviewers: skedia, zdrudi
Reviewed By: zdrudi
Subscribers: bogdan, slingam, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D25914

…finitely after master failover

Summary: Original commit: 2ad4b72 / D24645. The summary and test plan repeat the original commit message above.

Jira Issue(s): DB-5784

Reviewers: skedia, zdrudi
Reviewed By: zdrudi
Subscribers: ybase, slingam, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D25912

yamen-haddad · 2023-06-02T20:29:50Z

@datavisorzhizhu The fix is now backported and will be in [2.18.1.0], [2.16.6.0], [2.19.0.0]. A backport to 2.14 is still in review.
Is there a particular reason you are interested in this fix?

datavisorzhizhu · 2023-06-02T22:17:32Z

@yamen-haddad [2.16.6.0] is also acceptable to me. thanks

…finitely after master failover

Summary: Original commit: 2ad4b72 / D24645. The summary text repeats the original commit message above.

Jira Issue(s): DB-5784

Test Plan: Tested manually, as many test utilities were not backported to 2.14. To run the test, run the following:

./bin/yb-ctl destroy
./bin/yb-ctl create --rf=3
./bin/ysqlsh

--create a table with 3 pre-split tablets.
CREATE TABLE t (k INT, v INT) SPLIT INTO 3 TABLETS;
INSERT INTO t (k,v) SELECT i,i FROM generate_series(1,100) AS i;
\q

--create a database snapshot.
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 create_database_snapshot ysql.yugabyte

--split one of the tablets to make 4 tablets and wait for the parent tablet to be deleted
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_tablets ysql.yugabyte t 0
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 split_tablet
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_snapshots SHOW_DELETED

--perform a master leader failover - the new master will not have any state for this deleted parent tablet
./bin/yb-ctl status
kill -9 <master leader pid>

--delete the snapshot
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 delete_snapshot <snapshot id>

Before the fix: The master keeps retrying the DELETE_ON_TABLET task for the already deleted tablet, and the snapshot state stays DELETING.
After the fix: The master stops retrying the delete RPC, and the snapshot is marked as DELETED.

Reviewers: skedia, zdrudi
Reviewed By: zdrudi
Subscribers: ybase, slingam, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D25915

rthallamko3 · 2023-06-08T15:23:01Z

@yamen-haddad , Looks like this can be closed now?

yamen-haddad · 2023-06-08T21:47:26Z

@datavisorzhizhu Just to let you know that the fix will be available in 2.14.11.0.

yamen-haddad · 2023-06-08T21:47:53Z

@rthallamko3 Yes. Thanks.

datavisorzhizhu · 2023-06-09T07:02:15Z

@yamen-haddad thanks for recent update

sanketkedia added the area/docdb (YugabyteDB core features) and status/awaiting-triage (Issue awaiting triage) labels Mar 9, 2023

sanketkedia self-assigned this Mar 9, 2023

sanketkedia added this to To do in Master components via automation Mar 9, 2023

sanketkedia assigned lingamsandeep Mar 9, 2023

yugabyte-ci added the kind/bug (This issue is a bug) and priority/medium (Medium priority issue) labels Mar 9, 2023

yamen-haddad self-assigned this Mar 29, 2023

yamen-haddad changed the title ~~[DocDB] Delete Snapshot keeps on retrying indefinitely~~ [DocDB] Delete Snapshot keeps on retrying indefinitely after master failover Apr 20, 2023

yugabyte-ci unassigned sanketkedia Apr 20, 2023

yugabyte-ci removed the status/awaiting-triage (Issue awaiting triage) label Apr 20, 2023

yamen-haddad added the 2.14 Backport Required, 2.16 Backport Required, and 2.18 Backport Required labels May 24, 2023

yamen-haddad closed this as completed Jun 8, 2023

Master components automation moved this from To do to Done Jun 8, 2023
