[DocDB] Delete Snapshot keeps on retrying indefinitely after master failover #16367

Closed
1 task done
sanketkedia opened this issue Mar 9, 2023 · 13 comments

Comments

@sanketkedia
Contributor

sanketkedia commented Mar 9, 2023

Jira Link: DB-5784

Description

Observed this in the latest master. Steps to repro:

  1. Create a database snapshot on a table (or multiple tables)
  2. Split one of the tablets and wait for the parent tablet to be deleted
  3. Perform a master leader failover - the new master will not have any state for this deleted parent tablet
  4. Delete the snapshot created in (1). The snapshot entry in syscatalog persists the snapshot's list of tablets, and after the split the parent tablet has been deleted, so deleting the snapshot tries to send RPCs to a tablet that no longer exists. Because the master cannot find a leader for this tablet, the snapshot RPC that deletes the parent tablet's snapshot data is retried indefinitely.
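
The repro steps above can be modeled with a small, self-contained sketch. This is not YugabyteDB code: `SnapshotEntry`, `TabletMap`, `FindLeader`, and `TabletsStuckRetrying` are hypothetical names used only to illustrate why the persisted tablet list and the new master's tablet state disagree after the failover.

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the snapshot entry persisted in syscatalog.
struct SnapshotEntry {
  std::vector<std::string> tablet_ids;  // tablets recorded in the snapshot
};

// Hypothetical view of the tablets the current master leader knows about:
// tablet id -> address of the tablet's leader.
using TabletMap = std::unordered_map<std::string, std::string>;

// Returns the leader for a tablet, or nullopt if the master has no state for it.
std::optional<std::string> FindLeader(const TabletMap& tablets,
                                      const std::string& tablet_id) {
  auto it = tablets.find(tablet_id);
  if (it == tablets.end()) return std::nullopt;
  return it->second;
}

// Returns the tablets for which a delete RPC can never be sent, because the
// master cannot find a leader. Without special handling, each of these would
// be retried forever.
std::vector<std::string> TabletsStuckRetrying(const SnapshotEntry& snapshot,
                                              const TabletMap& tablets) {
  std::vector<std::string> stuck;
  for (const auto& id : snapshot.tablet_ids) {
    if (!FindLeader(tablets, id)) stuck.push_back(id);
  }
  return stuck;
}
```

In this model, a snapshot that still lists the split parent, evaluated against a tablet map that only contains the children and the untouched tablets, reports the parent as the one tablet that can never be reached.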

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@sanketkedia sanketkedia added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Mar 9, 2023
@sanketkedia sanketkedia self-assigned this Mar 9, 2023
@sanketkedia sanketkedia added this to To do in Master components via automation Mar 9, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Mar 9, 2023
@yamen-haddad yamen-haddad self-assigned this Mar 29, 2023
@yamen-haddad yamen-haddad changed the title [DocDB] Delete Snapshot keeps on retrying indefinitely [DocDB] Delete Snapshot keeps on retrying indefinitely after master failover Apr 20, 2023
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Apr 20, 2023
yamen-haddad added a commit that referenced this issue Apr 28, 2023
…master failover

Summary:
When a delete_snapshot command is issued and (1) one of the tablets involved in the delete snapshot operation has been deleted (for example, due to tablet splitting) and (2) the master leader fails over (so the new master has no state for the deleted parent tablet), the master snapshot coordinator cannot find a tserver replica for this tablet and keeps retrying the DELETE_ON_TABLET async task. This should not be confused with [[ #16631 | GH-16631 ]], where the master had stale tablet info and was therefore able to send the RPC to a tserver. In this bug, the master cannot send the RPC at all and keeps retrying the async task.
The diff fixes the issue by adding `HandleReplicaLookupFailure`, which optionally returns a `MonitoredTaskState` to transition to when the replica lookup for the task fails.
By default, this function returns `nullopt`.
For an AsyncTabletSnapshotOp, if the returned error status is "NOT_FOUND" and the `TabletSnapshotOpRequestPB` operation is `DELETE_ON_TABLET`, the task is considered completed, so the master snapshot coordinator stops rerunning `DELETE_ON_TABLET` RPCs for the already deleted tablet.
Jira Issue(s): DB-5784

Test Plan: ybd --cxx_test yb-backup-cross-feature-test --gtest-filter YBBackupTest.DeleteSnapshotAfterTabletSplittingAndMasterFailover

Reviewers: skedia, zdrudi

Reviewed By: zdrudi

Subscribers: bogdan, slingam, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D24645
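
The behavior described in the commit summary above can be sketched roughly as follows. This is a simplified illustration, not the actual diff: the enums, the `AsyncTask` base class, and the method signature are assumptions made so the example compiles on its own. Only the names `HandleReplicaLookupFailure`, `MonitoredTaskState`, `AsyncTabletSnapshotOp`, and `DELETE_ON_TABLET` come from the summary.

```cpp
#include <optional>

// Hypothetical, simplified stand-ins for the real YugabyteDB types.
enum class MonitoredTaskState { kWaiting, kRunning, kComplete, kFailed };
enum class StatusCode { kOk, kNotFound, kTimedOut };
enum class SnapshotOp { kCreateOnTablet, kRestoreOnTablet, kDeleteOnTablet };

// Assumed base class for retrying async tasks.
class AsyncTask {
 public:
  virtual ~AsyncTask() = default;

  // Called when the replica lookup for this task fails. Returning nullopt
  // means "no special handling", so the caller keeps its usual retry behavior.
  virtual std::optional<MonitoredTaskState> HandleReplicaLookupFailure(
      StatusCode /*lookup_status*/) {
    return std::nullopt;
  }
};

class AsyncTabletSnapshotOp : public AsyncTask {
 public:
  explicit AsyncTabletSnapshotOp(SnapshotOp op) : op_(op) {}

  // A tablet that cannot be found has nothing left to delete, so a
  // DELETE_ON_TABLET task is treated as complete instead of being retried.
  std::optional<MonitoredTaskState> HandleReplicaLookupFailure(
      StatusCode lookup_status) override {
    if (lookup_status == StatusCode::kNotFound &&
        op_ == SnapshotOp::kDeleteOnTablet) {
      return MonitoredTaskState::kComplete;
    }
    return std::nullopt;
  }

 private:
  SnapshotOp op_;
};
```

Returning an optional state keeps the default path unchanged for every other task type: only the combination of a not-found lookup and a delete operation short-circuits the retry loop in this sketch.
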
@datavisorzhizhu

Any fix for this?

@yamen-haddad
Member

yamen-haddad commented May 22, 2023

The issue is fixed with this commit.
The ticket is still open as backporting the fix to stable releases is required.

@datavisorzhizhu

@yamen-haddad Thanks for the update.
If possible, could you list the stable releases that will contain the fix and when they are expected?

Thanks again.

@yamen-haddad
Member

The fix is currently on master.
It will be backported to 2.18, 2.16, 2.14. This should happen within this week and the next week as the fix is waiting on the review of another fix.
I can comment again when the fix has been successfully backported.

@datavisorzhizhu

@yamen-haddad Thanks. Please comment here when the backport is done.

@datavisorzhizhu

@yamen-haddad The Docker image yugabytedb/yugabyte:2.14.10.0-b23 was generated two days ago. Does it contain the above fix?

@yamen-haddad
Member

@datavisorzhizhu The fix is still waiting for review, so it has not made it into 2.14 yet.
I will update here once the fix has landed.

yamen-haddad added a commit that referenced this issue Jun 2, 2023
…finitely after master failover

Summary:
Original commit: 2ad4b72 / D24645
Jira Issue(s): DB-5784

Test Plan: ybd --cxx_test yb-backup-cross-feature-test --gtest-filter YBBackupTest.DeleteSnapshotAfterTabletSplittingAndMasterFailover

Reviewers: skedia, zdrudi

Reviewed By: zdrudi

Subscribers: bogdan, slingam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D25914
yamen-haddad added a commit that referenced this issue Jun 2, 2023
…finitely after master failover

Summary:
Original commit: 2ad4b72 / D24645
Jira Issue(s): DB-5784

Test Plan: ybd --cxx_test yb-backup-cross-feature-test --gtest-filter YBBackupTest.DeleteSnapshotAfterTabletSplittingAndMasterFailover

Reviewers: skedia, zdrudi

Reviewed By: zdrudi

Subscribers: ybase, slingam, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D25912
@yamen-haddad
Member

@datavisorzhizhu The fix is now backported and will be in [2.18.1.0], [2.16.6.0], [2.19.0.0]. A backport to 2.14 is still in review.
Is there a particular reason you are interested in this fix?

@datavisorzhizhu

@yamen-haddad [2.16.6.0] is also acceptable to me. Thanks.

yamen-haddad added a commit that referenced this issue Jun 7, 2023
…finitely after master failover

Summary:
Original commit: 2ad4b72 / D24645
Jira Issue(s): DB-5784

Test Plan:
Tested manually, as many test utilities were not backported to 2.14.
To reproduce the test, run the following:
./bin/yb-ctl destroy
./bin/yb-ctl create --rf=3
./bin/ysqlsh

--create a table with 3 pre-split tablets.
CREATE TABLE t (k INT, v INT) SPLIT INTO 3 TABLETS;
INSERT INTO t (k,v) SELECT i,i FROM generate_series(1,100) AS i;
\q
--create a database snapshot.
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 create_database_snapshot ysql.yugabyte
--split one of the tablets to make 4 tablets and wait for the parent tablet to be deleted
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_tablets ysql.yugabyte t 0
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 split_tablet
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_snapshots SHOW_DELETED

--perform a master leader failover - the new master will not have any state for this deleted parent tablet
./bin/yb-ctl status
kill -9 <master leader pid>
--delete the snapshot
./build/latest/bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 delete_snapshot <snapshot id>

Before the fix: The master keeps retrying the DELETE_ON_TABLET task for the already deleted tablet, and the snapshot state stays DELETING.
After the fix: The master stops retrying the delete RPC, and the snapshot is marked as DELETED.

Reviewers: skedia, zdrudi

Reviewed By: zdrudi

Subscribers: ybase, slingam, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D25915
@rthallamko3
Contributor

@yamen-haddad, looks like this can be closed now?

@yamen-haddad
Member

@datavisorzhizhu Just to let you know that the fix will be available in 2.14.11.0.

@yamen-haddad
Member

@rthallamko3 Yes. Thanks.

Master components automation moved this from To do to Done Jun 8, 2023
@datavisorzhizhu

@yamen-haddad Thanks for the recent update.
