[PITR-YCQL-Tablet Splitting] After restore, select data from table throws OperationTimedOut: error #10361

Arjun-yb · 2021-10-20T12:13:26Z

DB Version: 2.8.0.0-b2
Steps:

1. Create a database and a table (employees_1).
2. Create a snapshot schedule.
3. Record the time (t1) and the number of tablets (n).
4. Start a workload and observe that the tablet count increases through tablet splitting (> n).
5. Stop the workload and restore to the time (t1) recorded in step 3.
6. Select data from the table (employees_1):

ycqlsh:test> select count(*) from employees_1;
OperationTimedOut: errors={'A.B.C.D': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=A.B.C.D
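
For step 3, the time t1 could be captured as in the sketch below. It assumes the restore target is later passed as a Unix timestamp in microseconds; the issue does not state which format was used, and the helper name `now_micros` is hypothetical.

```python
import time


def now_micros() -> int:
    """Return the current Unix time in microseconds as an integer."""
    # time.time_ns() gives nanoseconds since the epoch; drop the last 3 digits.
    return time.time_ns() // 1_000


# Record t1 before starting the workload (step 3).
t1 = now_micros()
print(t1)
```

Recording t1 as an integer avoids rounding that a floating-point `time.time()` value could introduce when it is scaled to microseconds.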

Observations:

The tablet count increases after running the workload.
If the tablet count increases between the recorded time (t1) and the restore, selecting data from the table after the restore throws the error above.
If the tablet count is unchanged between the recorded time (t1) and the restore, the restore works and the user can select data from the table.


Summary: After tablet splitting, we currently don't update the partition list version, which can leave the meta cache stale and cause queries to fail. This diff increments the partition list version after restoration.

An existing unit test was also passing falsely. It had the following deficiencies:
1. Splitting never happened because there was too little data.
2. After splitting, we didn't touch the data, so the cache still held the old partitions. Thus, when we queried after restoration (we restore to a time before the split), it worked because the stale cache was technically correct.
3. The cluster had 3 tservers, and the leaders of the 2 tablets were on different tservers. So the cache invalidation in (2) happened on one tserver while the data was read from another tserver, and the test again passed falsely.

Fixed the unit test to address these problems.

Test Plan: ybd --cxx_test yb_admin_snapshot_schedule_test --gtest-filter YbAdminSnapshotScheduleTest.RestoreAfterSplit

Reviewers: sergei, bogdan, timur

Reviewed By: timur

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16303
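
The failure mode described in the summary can be illustrated with a toy model. This is a sketch and not YugabyteDB code: the names `Table`, `Client`, `StalePartitionsError`, and `demo` are hypothetical, and the model assumes that the server rejects a request whose partition list version differs from its own, which prompts the client to refresh its cached partitions.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


class StalePartitionsError(Exception):
    """Toy server-side signal that the client's partition list version is outdated."""


@dataclass
class Table:
    # Sorted partition start keys; each start key stands for one tablet.
    starts: list = field(default_factory=lambda: [0])
    version: int = 0

    def split(self, key):
        # A split adds a partition boundary and bumps the version.
        self.starts = sorted(self.starts + [key])
        self.version += 1

    def restore(self, old_starts, bump_version):
        # Restoring to a time before the split brings back the old partitions.
        self.starts = list(old_starts)
        if bump_version:  # the behavior the fix adds, as modeled here
            self.version += 1

    def read(self, tablet_start, client_version):
        if client_version != self.version:
            raise StalePartitionsError()
        if tablet_start not in self.starts:
            # The client addressed a tablet that no longer exists.
            raise TimeoutError("no such tablet")
        return "ok"


class Client:
    def __init__(self, table):
        self.table = table
        self.refresh()

    def refresh(self):
        # Reload the cached partition list and its version.
        self.cached_starts = list(self.table.starts)
        self.cached_version = self.table.version

    def read(self, key):
        for _ in range(2):  # one retry after a cache refresh
            idx = bisect_right(self.cached_starts, key) - 1
            try:
                return self.table.read(self.cached_starts[idx], self.cached_version)
            except StalePartitionsError:
                self.refresh()
        raise RuntimeError("cache still stale after refresh")


def demo(bump_version):
    table = Table()
    client = Client(table)
    client.read(70)        # cache holds the single-tablet layout
    table.split(50)
    client.read(70)        # version mismatch forces a refresh to two tablets
    table.restore([0], bump_version)
    try:
        return client.read(70)
    except TimeoutError:
        return "timeout"


if __name__ == "__main__":
    print("without version bump:", demo(bump_version=False))
    print("with version bump:", demo(bump_version=True))
```

In this model, restoring without a version bump leaves the client routing requests to a post-split tablet that no longer exists, while bumping the version makes the client refresh its cache and succeed.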

Arjun-yb assigned sanketkedia Oct 20, 2021

Arjun-yb added this to To do in PITR via automation Oct 20, 2021

sanketkedia closed this as completed Jun 1, 2022

PITR automation moved this from To do to Done Jun 1, 2022
