[PITR-YCQL-Tablet Splitting] After restore, select data from table throws OperationTimedOut: error #10361

Closed
Arjun-yb opened this issue Oct 20, 2021 · 0 comments

Arjun-yb commented Oct 20, 2021

DB Version: 2.8.0.0-b2
Steps:

  1. Create a database and a table (employees_1).
  2. Create a snapshot schedule.
  3. Record the time (t1) and the number of tablets (n).
  4. Start a workload and observe that tablet splitting increases the tablet count (> n).
  5. Stop the workload and restore to the time (t1) recorded in step 3.
  6. Select data from the table (employees_1):
ycqlsh:test> select count(*) from employees_1;
OperationTimedOut: errors={'A.B.C.D': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=A.B.C.D

Observations:

  1. The tablet count increases after running the workload.
  2. If the tablet count increased between the recorded time (t1) and the restore, selecting data from the table after the restore throws the error above.
  3. If the tablet count is unchanged between the recorded time (t1) and the restore, the restore works and the user can select data from the table.
Arjun-yb added this to To do in PITR via automation Oct 20, 2021
sanketkedia added a commit that referenced this issue Apr 12, 2022
Summary:
Currently, when we restore after a tablet split, we don't update the partition list version, which
can leave the meta cache stale and cause queries to fail.
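
The failure mode can be illustrated with a toy model. This is a minimal sketch, not YugabyteDB code: `ToyMaster`, `ToyClient`, `count_rows`, and `run_scenario` are hypothetical names, and the model assumes the client refreshes its cached tablet list only when the version it holds differs from the master's. Under that assumption, a restore that leaves the version unchanged keeps the client routing to tablets that no longer exist.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Hypothetical stand-in for the master's view of one table."""
    tablets: list
    partition_list_version: int = 0

    def split(self, parent, children):
        # A split replaces the parent tablet with its children and bumps the version.
        i = self.tablets.index(parent)
        self.tablets[i:i + 1] = children
        self.partition_list_version += 1

    def restore(self, tablets, bump_version):
        # Restoring to a time before the split brings the old tablet list back.
        self.tablets = list(tablets)
        if bump_version:
            self.partition_list_version += 1


@dataclass
class ToyClient:
    """Hypothetical client-side meta cache keyed on the partition list version."""
    cached_tablets: list = field(default_factory=list)
    cached_version: int = -1

    def count_rows(self, master):
        # Assumption: the cache is refreshed only when the versions differ.
        if self.cached_version != master.partition_list_version:
            self.cached_tablets = list(master.tablets)
            self.cached_version = master.partition_list_version
        missing = [t for t in self.cached_tablets if t not in master.tablets]
        if missing:
            raise TimeoutError(f"stale cache routes to missing tablets: {missing}")
        return "ok"


def run_scenario(bump_version):
    master = ToyMaster(tablets=["t1"])
    client = ToyClient()
    snapshot = list(master.tablets)          # tablet list at time t1
    master.split("t1", ["t1a", "t1b"])       # workload triggers a split
    client.count_rows(master)                # client caches the post-split tablets
    master.restore(snapshot, bump_version)   # restore to t1
    try:
        return client.count_rows(master)
    except TimeoutError:
        return "timeout"
```

In this model, `run_scenario(bump_version=False)` ends in the timeout branch, while `run_scenario(bump_version=True)` forces a cache refresh and the read succeeds.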

This diff increments the partition list version after restoration. An existing unit-test was also
passing falsely because of the following deficiencies:
1. Splitting never happened because there was too little data.
2. After splitting, we didn't touch the data, so the cache still held the old partition. Thus, when we
queried after restoration (we restore to a time before the split), it worked because the stale cache was technically correct.
3. The cluster had 3 tservers, and the leaders of the 2 tablets were on different tservers. So the
cache invalidation in (2) happened on one tserver while the data was read from
another tserver, again passing falsely.
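
Deficiency 2 can be sketched in the same spirit. This is a hypothetical toy, not the actual test: `read_ok` and `run_test` are made-up helpers, and a read is assumed to succeed only when every cached tablet still exists. If the client never touches the data after the split, its cache still holds the pre-split tablet, which happens to match the restored state.

```python
def read_ok(cached_tablets, live_tablets):
    # In this toy model a read succeeds only if every cached tablet still exists.
    return all(t in live_tablets for t in cached_tablets)


def run_test(touch_data_after_split):
    live = ["t1"]
    cache = list(live)        # client cached the pre-split partition
    pre_split = list(live)    # state we will restore to
    live = ["t1a", "t1b"]     # the split happens
    if touch_data_after_split:
        cache = list(live)    # touching the data refreshes the cache
    live = list(pre_split)    # restore without a version bump: cache is left as-is
    return read_ok(cache, live)
```

Here `run_test(False)` returns `True` even though nothing was fixed, while `run_test(True)` returns `False` and exposes the bug, which is why the test needs to touch the data after the split.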

Fixed the unit-test to address these problems.

Test Plan:
ybd --cxx_test yb_admin_snapshot_schedule_test --gtest-filter YbAdminSnapshotScheduleTest.RestoreAfterSplit

Reviewers: sergei, bogdan, timur

Reviewed By: timur

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16303
PITR automation moved this from To do to Done Jun 1, 2022