PITR: Mechanism to automatically take snapshots at predefined interval #7126

Closed
bmatican opened this issue Feb 5, 2021 · 0 comments

bmatican commented Feb 5, 2021

After #7125, we could improve on this by automatically taking snapshots, so that we don't end up increasing history retention to very large values by relying on users to create snapshots themselves.

bmatican added the area/docdb YugabyteDB core features label Feb 5, 2021
spolitov added a commit that referenced this issue Mar 2, 2021
Summary:
This diff adds the ability to create snapshot schedules on the master.
These schedules do nothing currently, but will be used as the main configuration for PITR.

With these, a user should be able to specify
- which tables should be snapshot together (i.e., individual tables, whole keyspaces, or the whole cluster)
- how frequently the snapshots should be taken
- how long the snapshots should be kept
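
To make the configuration surface above concrete, here is a minimal C++ sketch of what such a schedule's settings could hold; all names here are illustrative assumptions, not the actual master-side types:

```cpp
#include <chrono>
#include <string>
#include <vector>

// Minimal sketch of the per-schedule configuration described above.
// Struct and field names are illustrative assumptions only.
struct SnapshotScheduleOptions {
  // Filter: which tables are snapshot together (individual tables,
  // whole keyspaces, or the whole cluster).
  std::vector<std::string> table_filter;
  // How frequently snapshots should be taken.
  std::chrono::minutes interval{std::chrono::minutes(60)};
  // How long snapshots should be kept.
  std::chrono::minutes retention{std::chrono::minutes(600)};
};
```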

Test Plan: ybd --cxx-test snapshot-schedule-test

Reviewers: oleg, bogdan

Reviewed By: bogdan

Subscribers: amitanand, nicolas, zyu, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10726
spolitov added a commit that referenced this issue Mar 8, 2021
Summary:
This diff is a follow-up to D10726 / 0fde8db.

With this diff, the interval setting of schedules will be respected. We will now take snapshots:
- when a schedule is first created
- repeatedly, after the amount of time specified by each schedule's interval
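
A minimal sketch of that decision, assuming the scheduler tracks the time of the last snapshot per schedule (names and types are illustrative, not the actual implementation):

```cpp
#include <chrono>
#include <optional>

using Clock = std::chrono::system_clock;

// Illustrative sketch (not the actual master code) of the rule described
// above: snapshot immediately when the schedule is created, then again
// whenever the configured interval has elapsed since the previous snapshot.
bool ShouldTakeSnapshot(const std::optional<Clock::time_point>& last_snapshot_time,
                        std::chrono::minutes interval,
                        Clock::time_point now) {
  if (!last_snapshot_time) {
    return true;  // no snapshot yet: take the first one right after creation
  }
  return now - *last_snapshot_time >= interval;
}
```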

Test Plan: ybd --gtest_filter SnapshotScheduleTest.Snapshot

Reviewers: nicolas, amitanand, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10781
polarweasel pushed a commit to lizayugabyte/yugabyte-db that referenced this issue Mar 9, 2021
Summary:
This diff adds the ability to create snapshot schedules on the master.
These schedules do nothing currently, but will be used as the main configuration for PITR.

With these, a user should be able to specify
- which tables should be snapshot together (i.e., individual tables, whole keyspaces, or the whole cluster)
- how frequently the snapshots should be taken
- how long the snapshots should be kept

Test Plan: ybd --cxx-test snapshot-schedule-test

Reviewers: oleg, bogdan

Reviewed By: bogdan

Subscribers: amitanand, nicolas, zyu, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10726
spolitov added a commit that referenced this issue Mar 12, 2021
Summary:
This diff adds logic to clean up scheduled snapshots that fall outside our retention bounds.

We leverage the existing snapshot cleanup mechanism by deleting the oldest snapshot from every schedule in each polling interval.
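
A minimal sketch of the per-poll retention check described above, assuming each schedule exposes its oldest snapshot's creation time and its retention setting (illustrative names only):

```cpp
#include <chrono>
#include <optional>

using Clock = std::chrono::system_clock;

// Returns true when the schedule's oldest snapshot has aged out of the
// retention window, so it can be handed to the existing snapshot-deletion
// machinery on this polling pass. Purely illustrative.
bool ShouldDeleteOldestSnapshot(
    const std::optional<Clock::time_point>& oldest_snapshot_time,
    std::chrono::minutes retention,
    Clock::time_point now) {
  return oldest_snapshot_time.has_value() &&
         (now - *oldest_snapshot_time > retention);
}
```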

Test Plan: ybd --gtest_filter SnapshotScheduleTest.GC

Reviewers: amitanand, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10864
spolitov added a commit that referenced this issue Mar 16, 2021
Summary:
This diff adds snapshot schedule loading during tablet bootstrap, so schedules are loaded again after a master restart.

Test Plan: ybd --gtest_filter SnapshotScheduleTest.Restart

Reviewers: bogdan, amitanand, oleg

Reviewed By: oleg

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10916
spolitov added a commit that referenced this issue Mar 17, 2021
Summary:
This diff adds a new history retention mechanism for tablets that participate in point-in-time restore.

Normally, history retention is controlled by the gflag `timestamp_history_retention_interval_sec`, which allows cleaning up any values older than this interval.
After this diff, history for such tablets will only be cleaned up to the point before the latest PITR snapshot.

To achieve this, we need to know two things on the TS side:
1) Which schedule a tablet is a part of.
The master knows this information, based on the filters set on PITR schedules. It can send this schedule ID, together with any schedule-related snapshots, to any involved TS. We can then persist this information in the tablet metadata. Ideally, the first PITR-related snapshot will do this, but in case of any errors, snapshots are retried automatically by the master, so we have a guarantee that the TS will eventually update this persistent metadata.

2) What the history retention requirements are for each relevant schedule.
We use the TS heartbeat response to flow this information from the master to the TS. Then, on any compaction, we choose the minimum between the existing flag and the retention policies of any of the schedules the tablet is involved in.
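
A minimal sketch of that minimum-retention choice, using a plain integer as a stand-in for HybridTime (names and types are assumptions for illustration):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in for YugabyteDB's HybridTime; a larger value means a newer time.
using HybridTimeLike = uint64_t;

// Illustrative sketch of the choice described above: the effective history
// cutoff for a tablet is the most conservative (oldest) of the gflag-based
// cutoff and the cutoffs required by the schedules the tablet participates
// in, so compactions never discard history a schedule may still need.
HybridTimeLike EffectiveHistoryCutoff(
    HybridTimeLike flag_based_cutoff,
    const std::vector<HybridTimeLike>& schedule_cutoffs) {
  HybridTimeLike cutoff = flag_based_cutoff;
  for (HybridTimeLike schedule_cutoff : schedule_cutoffs) {
    cutoff = std::min(cutoff, schedule_cutoff);
  }
  return cutoff;
}
```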

Test Plan: ybd --gtest_filter SnapshotScheduleTest.Snapshot

Reviewers: amitanand, mbautin, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10861
spolitov added a commit that referenced this issue Mar 19, 2021
Summary: This diff adds logic to propagate the correct history retention to newly created tablets that participate in an existing snapshot schedule.

Test Plan: ybd --gtest_filter SnapshotScheduleTest.Index

Reviewers: bogdan, amitanand

Reviewed By: amitanand

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10968
spolitov added a commit that referenced this issue Mar 19, 2021
Summary: This diff adds logic to take a system catalog snapshot while taking a scheduled snapshot.

Test Plan: ybd --gtest_filter SnapshotScheduleTest.RestoreSchema

Reviewers: amitanand, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10957
spolitov added a commit that referenced this issue Apr 1, 2021
Summary:
This diff adds logic to restore table schema. After this, we should be able to undo an ALTER TABLE operation!

There are two important changes as part of this diff.
1) Restoring master side sys_catalog metadata.
2) Sending the restored version of the schema from the master to the TS, as part of the explicit command to restore the TS.

As part of applying the restore operation on the master, we add new state tracking, which can compute the diff between the current sys_catalog state and the state at the time to which we want to restore. This is done by restoring the corresponding sys_catalog snapshot into a temporary directory, with the HybridTime filter applied for the restore_at time. We then load the relevant TABLE and TABLET data into memory and overwrite the existing RocksDB data directly in memory. This is safe to do because
- It is done as part of the apply step of a raft operation, so it is already persisted and will be replayed accordingly at bootstrap, in case of a restart.
- It is done on both leader and follower.

Once the master state is rolled back, we then run the TS side of the restore operation. The master now sends over the restored schema information, as part of the Restore request. On the TS side, we update our tablet schema information on disk accordingly.

Note: In between the master state being rolled back and all the TS processing their respective restores, there is a time window in which the master can receive heartbeats from a TS with newer schema information than what the master has persisted. Currently, that seems to only lead to some log spew, but it will be investigated later, as part of fault tolerance testing.
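
To illustrate the in-memory diff-and-overwrite step at a very high level, here is a toy sketch over plain maps; it is only a conceptual analogy for the sys_catalog rewrite, and all names are assumptions:

```cpp
#include <map>
#include <string>

// Very simplified sketch of the diff-and-overwrite idea described above.
// The real code restores a sys_catalog snapshot with a HybridTime filter and
// rewrites RocksDB data; std::map of serialized entries is used here only to
// illustrate the step.
using CatalogEntries = std::map<std::string, std::string>;  // key -> serialized value

void ApplyRestoredState(const CatalogEntries& state_at_restore_time,
                        CatalogEntries* current_state) {
  // Overwrite (or re-add) entries so they match the restored point in time.
  for (const auto& [key, value] : state_at_restore_time) {
    (*current_state)[key] = value;
  }
  // Drop entries that did not exist at restore_at, e.g. tables created later.
  for (auto it = current_state->begin(); it != current_state->end();) {
    if (state_at_restore_time.count(it->first) == 0) {
      it = current_state->erase(it);
    } else {
      ++it;
    }
  }
}
```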

Test Plan: ybd --gtest_filter SnapshotScheduleTest.RestoreSchema

Reviewers: amitanand, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11013
spolitov added a commit that referenced this issue Apr 1, 2021
Summary:
This diff adds 2 commands to yb-admin:
1) create_snapshot_schedule <snapshot_interval_in_minutes> <snapshot_retention_in_minutes> <table> [<table>]...
   Where:
     snapshot_interval_in_minutes - snapshot interval specified in minutes.
     snapshot_retention_in_minutes - snapshot retention specified in minutes.
     Followed by the list of tables, specified as in other yb-admin commands.

2) list_snapshot_schedules [<schedule_id>]
   Where:
     schedule_id - optional argument specifying the ID of the schedule to list.
      When not specified, all schedules are listed.
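
For example, with illustrative interval and retention values of 60 and 600 minutes (table arguments take the same form as in other yb-admin commands), one could create a schedule and then inspect it:

     yb-admin create_snapshot_schedule 60 600 <table>
     yb-admin list_snapshot_schedules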

Test Plan: ybd --cxx-test yb-admin-test --gtest_filter AdminCliTest.SnapshotSchedule

Reviewers: bogdan, oleg

Reviewed By: oleg

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11077
spolitov added a commit that referenced this issue Apr 12, 2021
Summary:
This diff adds the admin command restore_snapshot_schedule to restore a snapshot schedule at a specified time.
It also adds the command list_snapshot_restorations to list snapshot restorations.

Test Plan: ybd --cxx-test yb-admin-test --gtest_filter AdminCliTest.SnapshotSchedule

Reviewers: bogdan, oleg

Reviewed By: oleg

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11177
spolitov added a commit that referenced this issue Apr 19, 2021
Summary:
It could happen that tablets or tables were created between the restoration point and the current cluster state.
In this case, we should remove them during restoration.

Since we currently cannot create a filter for not-yet-created tables, the new test only checks that a created index and its tablets are deleted after restore.

A test for reverting the creation of regular tablets should be added in upcoming diffs, when new filters are supported.

Test Plan: ybd --gtest_filter SnapshotScheduleTest.RemmoveNewTablets

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: jenkins-bot, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11240
spolitov added a commit that referenced this issue May 5, 2021
Summary:
Adds the ability to restore a table that was previously deleted.
When a tablet that participates in a snapshot schedule is deleted, it is marked as hidden instead of being actually deleted.
Such tablets reject reads and writes, but can be restored to some point in time.

Cleanup for such tables should be implemented in follow-up diffs.

Test Plan: ybd --gtest_filter YbAdminSnapshotScheduleTest.SnapshotScheduleUndeleteTable

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: rahuldesirazu, skedia, mbautin, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11389
spolitov added a commit that referenced this issue May 11, 2021
Summary:
If a table participates in a snapshot schedule, its tablets are not deleted immediately when the table is deleted.
Because those tablets could be restored by user request, we instead just mark them as hidden.

This diff adds logic to clean up such tablets when there is no schedule that could be used to restore them. This covers both cases:
- the tablet is still covered by some schedule's filter, but it is no longer within the retention interval for any of them
- the tablet is not covered by any schedule's filter anymore, as they've all been deleted

Also fixed a bug in `SnapshotState::TryStartDelete` when the snapshot did not have any tablets.
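
A minimal sketch of that cleanup condition, assuming each schedule exposes whether its filter covers the tablet and the oldest point it can still restore to (all names are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch (all names are assumptions) of the cleanup condition
// above: a hidden tablet may only be purged once no schedule could restore
// it, i.e. for every schedule the tablet is either outside the filter or
// already past the schedule's retention bound.
struct ScheduleSketch {
  bool covers_tablet = false;        // schedule's filter matches this tablet
  uint64_t retention_bound_ht = 0;   // oldest hybrid time the schedule can restore to
};

bool CanCleanupHiddenTablet(uint64_t tablet_hide_ht,
                            const std::vector<ScheduleSketch>& schedules) {
  for (const auto& schedule : schedules) {
    if (schedule.covers_tablet && tablet_hide_ht >= schedule.retention_bound_ht) {
      return false;  // this schedule could still restore the hidden tablet
    }
  }
  return true;  // no schedule can restore it; safe to actually delete
}
```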

Test Plan: ybd --gtest_filter YbAdminSnapshotScheduleTest.CleanupDeletedTablets

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11489
spolitov added a commit that referenced this issue May 14, 2021
Summary:
Adds the ability to restore a table that was previously deleted.
When a tablet that participates in a snapshot schedule is deleted, it is marked as hidden instead of being actually deleted.
Such tablets reject reads and writes, but can be restored to some point in time.

Cleanup for such tables should be implemented in follow-up diffs.

Original commit: D11389 / 9fd73c7

Test Plan:
ybd --gtest_filter YbAdminSnapshotScheduleTest.SnapshotScheduleUndeleteTable
Jenkins: rebase: 2.6

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: ybase, mbautin, skedia, rahuldesirazu

Differential Revision: https://phabricator.dev.yugabyte.com/D11594
spolitov added a commit that referenced this issue May 14, 2021
Summary:
If a table participates in a snapshot schedule, its tablets are not deleted immediately when the table is deleted.
Because those tablets could be restored by user request, we instead just mark them as hidden.

This diff adds logic to clean up such tablets when there is no schedule that could be used to restore them. This covers both cases:
- the tablet is still covered by some schedule's filter, but it is no longer within the retention interval for any of them
- the tablet is not covered by any schedule's filter anymore, as they've all been deleted

Also fixed a bug in `SnapshotState::TryStartDelete` when the snapshot did not have any tablets.

Original commit: D11489/4e9665ad7ee022ef0d118940a1086aac5ffd1110

Test Plan:
ybd --gtest_filter YbAdminSnapshotScheduleTest.CleanupDeletedTablets
Jenkins: rebase: 2.6

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11598
spolitov added a commit that referenced this issue May 14, 2021
Summary:
This diff adds handling of hidden tablets during master failover.

We introduce a new persistent hide_state on the Table objects in the master.
- When deleting a table covered by PITR, we leave it in the RUNNING state, but change its hide_state to HIDING.
- Once all of its tablets are also hidden, we transition the table's hide_state to HIDDEN.
- Once the table goes out of PITR scope, we then change it from RUNNING to DELETED.

This also buttons up all callsites that use GetTables, to ensure we don't display hidden tables to clients that do not care about them. This is relevant for YCQL system tables, for example. In the master UIs, we can keep displaying hidden tables as well.
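
A minimal sketch of that lifecycle (enum values and function names are illustrative assumptions, not the actual master types):

```cpp
// Sketch of the hide_state lifecycle described above.
enum class HideState { kVisible, kHiding, kHidden };
enum class TableState { kRunning, kDeleted };

struct TableSketch {
  TableState state = TableState::kRunning;
  HideState hide_state = HideState::kVisible;
};

// Deleting a PITR-covered table: stay RUNNING, start hiding.
void OnDeleteCoveredTable(TableSketch* table) {
  table->hide_state = HideState::kHiding;
}

// All tablets hidden: the table itself becomes HIDDEN.
void OnAllTabletsHidden(TableSketch* table) {
  table->hide_state = HideState::kHidden;
}

// Out of PITR scope: the table can finally move from RUNNING to DELETED.
void OnOutOfPitrScope(TableSketch* table) {
  table->state = TableState::kDeleted;
}
```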

Test Plan: ybd --gtest_filter YbAdminSnapshotScheduleTest.UndeleteTableWithRestart

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11563
spolitov added a commit that referenced this issue May 22, 2021
Summary:
Fixes issues uncovered by the YbAdminSnapshotScheduleTest.UndeleteIndex test.
1) DeleteTableInMemory could be called multiple times in the case of an index table.
   There is a check that just no-ops when the table was already deleted.
   Adjusted this check to do the same when the table is being hidden.
2) Don't remove the table from the names map during delete when it was previously hidden.
   Otherwise, it would crash with a fatal during cleanup.
3) DeleteTabletListAndSendRequests executes the delete on a tablet before committing the tablet info changes.
   As a result, the tablet could be deleted and the callback called before the info changes in memory,
   so the table would hang in the deleting state, because the callback would think the tablet is not being deleted.
4) Decreased log flooding when compactions are enabled in RocksDB.
   When compactions are enabled, we call SetOptions twice for each RocksDB instance, and each call dumps all current option values.
   Since we have both a regular and an intents DB, we get 4 dumps of all the RocksDB options.

Also added debug logging to `RWCLock::WriteLock()`: when it takes too long to acquire this lock, it logs the stack trace of the successful write lock acquisition.

Test Plan: ybd --gtest_filter YbAdminSnapshotScheduleTest.UndeleteIndex -n 20

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: amitanand, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11614
spolitov added a commit that referenced this issue May 25, 2021
Summary:
This diff adds handling of hidden tablets during master failover.

We introduce a new persistent hide_state on the Table objects in the master.
- When deleting a table covered by PITR, we leave it in the RUNNING state, but change its hide_state to HIDING.
- Once all of its tablets are also hidden, we transition the table's hide_state to HIDDEN.
- Once the table goes out of PITR scope, we then change it from RUNNING to DELETED.

This also buttons up all callsites that use GetTables, to ensure we don't display hidden tables to clients that do not care about them. This is relevant for YCQL system tables, for example. In the master UIs, we can keep displaying hidden tables as well.

Original commit: D11563 / c221319

Test Plan:
ybd --gtest_filter YbAdminSnapshotScheduleTest.UndeleteTableWithRestart
Jenkins: rebase: 2.6

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11697
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary:
This diff adds logic to clean up scheduled snapshots that fall outside our retention bounds.

We leverage the existing snapshot cleanup mechanism by deleting the oldest snapshot from every schedule in each polling interval.

Test Plan: ybd --gtest_filter SnapshotScheduleTest.GC

Reviewers: amitanand, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10864
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary:
This diff adds snapshot schedule loading during tablet bootstrap, so schedules are loaded again after a master restart.

Test Plan: ybd --gtest_filter SnapshotScheduleTest.Restart

Reviewers: bogdan, amitanand, oleg

Reviewed By: oleg

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10916
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary:
This diff adds a new history retention mechanism for tablets that participate in point-in-time restore.

Normally, history retention is controlled by the gflag `timestamp_history_retention_interval_sec`, which allows cleaning up any values older than this interval.
After this diff, history for such tablets will only be cleaned up to the point before the latest PITR snapshot.

To achieve this, we need to know two things on the TS side:
1) Which schedule a tablet is a part of.
The master knows this information, based on the filters set on PITR schedules. It can send this schedule ID, together with any schedule-related snapshots, to any involved TS. We can then persist this information in the tablet metadata. Ideally, the first PITR-related snapshot will do this, but in case of any errors, snapshots are retried automatically by the master, so we have a guarantee that the TS will eventually update this persistent metadata.

2) What the history retention requirements are for each relevant schedule.
We use the TS heartbeat response to flow this information from the master to the TS. Then, on any compaction, we choose the minimum between the existing flag and the retention policies of any of the schedules the tablet is involved in.

Test Plan: ybd --gtest_filter SnapshotScheduleTest.Snapshot

Reviewers: amitanand, mbautin, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10861
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary: This diff adds logic to propagate the correct history retention to newly created tablets that participate in an existing snapshot schedule.

Test Plan: ybd --gtest_filter SnapshotScheduleTest.Index

Reviewers: bogdan, amitanand

Reviewed By: amitanand

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10968
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary: This diff adds logic to take a system catalog snapshot while taking a scheduled snapshot.

Test Plan: ybd --gtest_filter SnapshotScheduleTest.RestoreSchema

Reviewers: amitanand, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D10957
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary:
This diff adds logic to restore table schema. After this, we should be able to undo an ALTER TABLE operation!

There are two important changes as part of this diff.
1) Restoring master side sys_catalog metadata.
2) Sending the restored version of the schema from the master to the TS, as part of the explicit command to restore the TS.

As part of applying the restore operation on the master, we add new state tracking, which can compute the diff between the current sys_catalog state and the state at the time to which we want to restore. This is done by restoring the corresponding sys_catalog snapshot into a temporary directory, with the HybridTime filter applied for the restore_at time. We then load the relevant TABLE and TABLET data into memory and overwrite the existing RocksDB data directly in memory. This is safe to do because
- It is done as part of the apply step of a raft operation, so it is already persisted and will be replayed accordingly at bootstrap, in case of a restart.
- It is done on both leader and follower.

Once the master state is rolled back, we then run the TS side of the restore operation. The master now sends over the restored schema information, as part of the Restore request. On the TS side, we update our tablet schema information on disk accordingly.

Note: In between the master state being rolled back and all the TS processing their respective restores, there is a time window in which the master can receive heartbeats from a TS with newer schema information than what the master has persisted. Currently, that seems to only lead to some log spew, but it will be investigated later, as part of fault tolerance testing.

Test Plan: ybd --gtest_filter SnapshotScheduleTest.RestoreSchema

Reviewers: amitanand, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11013
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
…edules

Summary:
This diff adds 2 commands to yb-admin:
1) create_snapshot_schedule <snapshot_interval_in_minutes> <snapshot_retention_in_minutes> <table> [<table>]...
   Where:
     snapshot_interval_in_minutes - snapshot interval specified in minutes.
     snapshot_retention_in_minutes - snapshot retention specified in minutes.
     Followed by the list of tables, specified as in other yb-admin commands.

2) list_snapshot_schedules [<schedule_id>]
   Where:
     schedule_id - optional argument specifying the ID of the schedule to list.
      When not specified, all schedules are listed.

Test Plan: ybd --cxx-test yb-admin-test --gtest_filter AdminCliTest.SnapshotSchedule

Reviewers: bogdan, oleg

Reviewed By: oleg

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11077
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary:
This diff adds the admin command restore_snapshot_schedule to restore a snapshot schedule at a specified time.
It also adds the command list_snapshot_restorations to list snapshot restorations.

Test Plan: ybd --cxx-test yb-admin-test --gtest_filter AdminCliTest.SnapshotSchedule

Reviewers: bogdan, oleg

Reviewed By: oleg

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11177
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary:
It could happen that tablets or tables were created between the restoration point and the current cluster state.
In this case, we should remove them during restoration.

Since we currently cannot create a filter for not-yet-created tables, the new test only checks that a created index and its tablets are deleted after restore.

A test for reverting the creation of regular tablets should be added in upcoming diffs, when new filters are supported.

Test Plan: ybd --gtest_filter SnapshotScheduleTest.RemmoveNewTablets

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: jenkins-bot, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11240
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary:
Adds the ability to restore a table that was previously deleted.
When a tablet that participates in a snapshot schedule is deleted, it is marked as hidden instead of being actually deleted.
Such tablets reject reads and writes, but can be restored to some point in time.

Cleanup for such tables should be implemented in follow-up diffs.

Test Plan: ybd --gtest_filter YbAdminSnapshotScheduleTest.SnapshotScheduleUndeleteTable

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: rahuldesirazu, skedia, mbautin, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11389
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary:
If a table participates in a snapshot schedule, its tablets are not deleted immediately when the table is deleted.
Because those tablets could be restored by user request, we instead just mark them as hidden.

This diff adds logic to clean up such tablets when there is no schedule that could be used to restore them. This covers both cases:
- the tablet is still covered by some schedule's filter, but it is no longer within the retention interval for any of them
- the tablet is not covered by any schedule's filter anymore, as they've all been deleted

Also fixed a bug in `SnapshotState::TryStartDelete` when the snapshot did not have any tablets.

Test Plan: ybd --gtest_filter YbAdminSnapshotScheduleTest.CleanupDeletedTablets

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11489
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary:
This diff adds handling of hidden tablets during master failover.

We introduce a new persistent hide_state on the Table objects in the master.
- When deleting a table covered by PITR, we leave it in the RUNNING state, but change its hide_state to HIDING.
- Once all of its tablets are also hidden, we transition the table's hide_state to HIDDEN.
- Once the table goes out of PITR scope, we then change it from RUNNING to DELETED.

This also buttons up all callsites that use GetTables, to ensure we don't display hidden tables to clients that do not care about them. This is relevant for YCQL system tables, for example. In the master UIs, we can keep displaying hidden tables as well.

Test Plan: ybd --gtest_filter YbAdminSnapshotScheduleTest.UndeleteTableWithRestart

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11563
YintongMa pushed a commit to YintongMa/yugabyte-db that referenced this issue May 26, 2021
Summary:
Fixes issues uncovered by the YbAdminSnapshotScheduleTest.UndeleteIndex test.
1) DeleteTableInMemory could be called multiple times in the case of an index table.
   There is a check that just no-ops when the table was already deleted.
   Adjusted this check to do the same when the table is being hidden.
2) Don't remove the table from the names map during delete when it was previously hidden.
   Otherwise, it would crash with a fatal during cleanup.
3) DeleteTabletListAndSendRequests executes the delete on a tablet before committing the tablet info changes.
   As a result, the tablet could be deleted and the callback called before the info changes in memory,
   so the table would hang in the deleting state, because the callback would think the tablet is not being deleted.
4) Decreased log flooding when compactions are enabled in RocksDB.
   When compactions are enabled, we call SetOptions twice for each RocksDB instance, and each call dumps all current option values.
   Since we have both a regular and an intents DB, we get 4 dumps of all the RocksDB options.

Also added debug logging to `RWCLock::WriteLock()`: when it takes too long to acquire this lock, it logs the stack trace of the successful write lock acquisition.

Test Plan: ybd --gtest_filter YbAdminSnapshotScheduleTest.UndeleteIndex -n 20

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: amitanand, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11614
spolitov added a commit that referenced this issue May 29, 2021
Summary:
Sometimes deleting a tablet can take a while, for instance because of remote bootstrap.
This could cause the SnapshotScheduleTest.RemoveNewTablets test to fail, because it expects the tablet to already be deleted.

This diff fixes SnapshotScheduleTest.RemoveNewTablets by adding a WaitFor to the check that all the necessary tablets have been deleted.
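
The fix relies on polling rather than a one-shot check. A generic sketch of that pattern (this is not the actual WaitFor signature, only the idea):

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Generic sketch of the "wait until condition or timeout" pattern referred to
// above; the real test uses YugabyteDB's WaitFor helper, whose exact signature
// may differ from this illustration.
bool WaitUntil(const std::function<bool()>& condition,
               std::chrono::milliseconds timeout,
               std::chrono::milliseconds poll_interval = std::chrono::milliseconds(100)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (condition()) {
      return true;
    }
    std::this_thread::sleep_for(poll_interval);
  }
  return condition();  // one last check at the deadline
}
```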

Test Plan: ybd --gtest_filter SnapshotScheduleTest.RemoveNewTablets -n 200

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11756
spolitov added a commit that referenced this issue Jun 15, 2021
Summary:
Fixes issues uncovered by the YbAdminSnapshotScheduleTest.UndeleteIndex test.
1) DeleteTableInMemory could be called multiple times in the case of an index table.
   There is a check that just no-ops when the table was already deleted.
   Adjusted this check to do the same when the table is being hidden.
2) Don't remove the table from the names map during delete when it was previously hidden.
   Otherwise, it would crash with a fatal during cleanup.
3) DeleteTabletListAndSendRequests executes the delete on a tablet before committing the tablet info changes.
   As a result, the tablet could be deleted and the callback called before the info changes in memory,
   so the table would hang in the deleting state, because the callback would think the tablet is not being deleted.
4) Decreased log flooding when compactions are enabled in RocksDB.
   When compactions are enabled, we call SetOptions twice for each RocksDB instance, and each call dumps all current option values.
   Since we have both a regular and an intents DB, we get 4 dumps of all the RocksDB options.

Also added debug logging to `RWCLock::WriteLock()`: when it takes too long to acquire this lock, it logs the stack trace of the successful write lock acquisition.

Original diff: D11614/5fc9ce1e301015b563af652868d18b1f5cbf4395

Test Plan:
ybd --gtest_filter YbAdminSnapshotScheduleTest.UndeleteIndex -n 20
Jenkins: rebase: 2.6

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: ybase, amitanand

Differential Revision: https://phabricator.dev.yugabyte.com/D11904
spolitov added a commit that referenced this issue Jun 17, 2021
Summary:
Sometimes deleting a tablet can take a while, for instance because of remote bootstrap.
This could cause the SnapshotScheduleTest.RemoveNewTablets test to fail, because it expects the tablet to already be deleted.

This diff fixes SnapshotScheduleTest.RemoveNewTablets by adding a WaitFor to the check that all the necessary tablets have been deleted.

Original commit: D11756 / 7537427

Test Plan:
ybd --gtest_filter SnapshotScheduleTest.RemoveNewTablets -n 200
Jenkins: rebase: 2.6

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D11951