[DocDB] [xCluster] [Tablet Splitting] Replication is incomplete for range sharded table #15087

Closed
Arjun-yb opened this issue Nov 21, 2022 · 1 comment
Labels: 2.14.7_blocker, 2.16.1_blocker, area/docdb, kind/bug, priority/high, tablet-splitting, xCluster

Comments

Arjun-yb commented Nov 21, 2022

Jira Link: DB-4303

Description

Version: 2.16.0.0-b49

Steps:

  1. Create the source and target universes with the following gflags:
Master:
{
"tablet_split_high_phase_shard_count_per_node": 10000, 
"tablet_split_high_phase_size_threshold_bytes": 10485760,  # 10 MB
"tablet_split_low_phase_size_threshold_bytes": 2097152,  # 2 MB
"tablet_split_low_phase_shard_count_per_node": 16
}
  2. Create a table on both sides:
    create table table1(id int, name text, age int, description text, primary key(id ASC));
  3. Set up replication from source to target.
  4. Load 100K rows and observe the replication.

Actual:

  1. Replication is too slow (performance issue).
  2. Only 20-30% of the data is replicated.
    The following errors were observed at the source:
E1121 15:37:14.346189 22086 meta_cache.cc:502] T 1aafb64cca9d4273bded5b468e388757: Received error from GetTabletStatus: Not found (yb/tserver/ts_tablet_manager.cc:1836): Tablet 1aafb64cca9d4273bded5b468e388757 not found
E1121 15:37:14.350801 22085 meta_cache.cc:502] T 1aafb64cca9d4273bded5b468e388757: Received error from GetTabletStatus: Not found (yb/tserver/ts_tablet_manager.cc:1836): Tablet 1aafb64cca9d4273bded5b468e388757 not found
E1121 15:37:14.353968 22085 meta_cache.cc:502] T e345a49978bd4c708e38300f0099da8c: Received error from GetTabletStatus: Not found (yb/tserver/ts_tablet_manager.cc:1836): Tablet e345a49978bd4c708e38300f0099da8c not found
E1121 15:37:14.355094 22085 meta_cache.cc:502] T beb115ad3ed7422abd8c4ca79049dd4f: Received error from GetTabletStatus: Not found (yb/tserver/ts_tablet_manager.cc:1836): Tablet beb115ad3ed7422abd8c4ca79049dd4f not found
E1121 15:37:14.357215 22084 meta_cache.cc:502] T 1aafb64cca9d4273bded5b468e388757: Received error from GetTabletStatus: Not found (yb/tserver/ts_tablet_manager.cc:1836): Tablet 1aafb64cca9d4273bded5b468e388757 not found
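
To see which tablets are affected, the "not found" errors above can be tallied per tablet ID. This is a minimal sketch that assumes the log line format shown in this report; the function name `count_missing_tablets` is hypothetical and not part of any YugabyteDB tooling.

```python
import re
from collections import Counter

# Matches the "Tablet <32-hex-char id> not found" suffix seen in the log lines above.
NOT_FOUND = re.compile(r"Tablet ([0-9a-f]{32}) not found")


def count_missing_tablets(log_lines):
    """Return a Counter mapping tablet ID -> number of 'not found' errors."""
    counts = Counter()
    for line in log_lines:
        match = NOT_FOUND.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Tablet IDs that repeat across many lines are likely the ones the source keeps polling after they were split.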

Note: Replication works fine without tablet splitting.

Arjun-yb added the area/docdb and status/awaiting-triage labels on Nov 21, 2022.
yugabyte-ci added the kind/bug and priority/medium labels on Nov 21, 2022.
Arjun-yb added the xCluster and tablet-splitting labels on Nov 21, 2022.
yugabyte-ci added the priority/high label and removed the priority/medium and status/awaiting-triage labels on Nov 24, 2022.
hulien22 added a commit that referenced this issue Jan 6, 2023
Summary:
Fixing issues with xcluster + ranged keys + uneven tablet counts. Note that even though these issues have been around for a while, they have not been noticed due to uneven tablet counts not being common for ranged partitions - this is only more prevalent now with tablet splitting + xcluster.

Issues fixed:
- Issues with assigning pollers with uneven tablet counts and ranged partitions.
  - Removing hardcoded hash checks and assumptions
  - Introducing different overlap functions to handle both hash and ranged keys.
    - replacing GetOverlap with HasOverlap (used in `ProcessRecordForTabletRange`)
  - For finding best locality, now just search for whichever tablet has the lexicographic middle key
- Change the key that we transmit for GetChanges responses
  - For hash keys, send the encoded hash value (instead of sending a serialized number, then having the consumer cast and encode it)
  - For range keys (if the key has no hash value), we send the encoded key value
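
The overlap check for range partitions described above can be illustrated with a small sketch. It is not the actual YugabyteDB code: it assumes each tablet is a half-open range `[start, end)` of encoded key bytes, with `b""` meaning unbounded on that side, and the names `has_overlap` and `overlapping_producers` are hypothetical.

```python
from typing import List, Tuple

# (start, end): start is inclusive, end is exclusive; b"" means unbounded.
Range = Tuple[bytes, bytes]


def has_overlap(a: Range, b: Range) -> bool:
    """Return True if two half-open key ranges share at least one key."""
    a_start, a_end = a
    b_start, b_end = b
    # a must start before b ends, and b must start before a ends.
    a_before_b_end = b_end == b"" or a_start < b_end
    b_before_a_end = a_end == b"" or b_start < a_end
    return a_before_b_end and b_before_a_end


def overlapping_producers(consumer: Range, producers: List[Range]) -> List[int]:
    """Return indexes of producer tablets whose range overlaps the consumer tablet."""
    return [i for i, p in enumerate(producers) if has_overlap(consumer, p)]
```

With three producer tablets split at `g` and `p` and two consumer tablets split at `m`, the middle producer tablet overlaps both consumer tablets. That is the uneven-count case the commit addresses.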

Test Plan:
Added ranged tests and ranged tests with uneven partitions
```
ybd --cxx-test integration-tests_twodc_ysql-test --gtest_filter "*TwoDCYsqlTest.SimpleReplicationWithRangedPartitions/0"
ybd --cxx-test integration-tests_twodc_ysql-test --gtest_filter "*TwoDCYsqlTest.SimpleReplicationWithRangedPartitionsAndUnevenTabletCounts/0"
```

Also simple tests for the lexicographically ordered middle key algorithm:
```
ybd --cxx-test partition-test --gtest_filter PartitionTest.TestLexicographicMiddleKey
```
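
One way to picture a lexicographic middle key is to treat both bounds as big-endian integers of equal width and take their midpoint. The sketch below is an illustration under that assumption, not the algorithm exercised by `PartitionTest.TestLexicographicMiddleKey`; both function names are hypothetical, and `b""` is again assumed to mean an unbounded side.

```python
from typing import List, Optional, Tuple


def lexicographic_middle_key(start: bytes, end: bytes) -> bytes:
    """Return a key roughly halfway between start (inclusive) and end (exclusive)."""
    width = max(len(start), len(end), 1)
    # Right-pad with zero bytes so both bounds have the same width.
    lo = int.from_bytes(start.ljust(width, b"\x00"), "big")
    # An empty end means unbounded: use one past the largest key of this width.
    hi = int.from_bytes(end.ljust(width, b"\x00"), "big") if end else 256 ** width
    return ((lo + hi) // 2).to_bytes(width, "big")


def tablet_holding_middle_key(
    start: bytes, end: bytes, tablets: List[Tuple[bytes, bytes]]
) -> Optional[int]:
    """Return the index of the tablet whose range contains the middle key."""
    mid = lexicographic_middle_key(start, end)
    for i, (t_start, t_end) in enumerate(tablets):
        if t_start <= mid and (t_end == b"" or mid < t_end):
            return i
    return None
```

Picking the tablet that holds the middle key gives a single deterministic choice even when several tablets overlap the range.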

Reviewers: rahuldesirazu, nicolas, hsunder

Reviewed By: nicolas, hsunder

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D21398
hulien22 added a commit that referenced this issue Jan 6, 2023
Summary:
Original commit: b20ca0e / D21398
Fixing issues with xcluster + ranged keys + uneven tablet counts. Note that even though these issues have been around for a while, they have not been noticed due to uneven tablet counts not being common for ranged partitions - this is only more prevalent now with tablet splitting + xcluster.

Issues fixed:
- Issues with assigning pollers with uneven tablet counts and ranged partitions.
  - Removing hardcoded hash checks and assumptions
  - Introducing different overlap functions to handle both hash and ranged keys.
    - replacing GetOverlap with HasOverlap (used in `ProcessRecordForTabletRange`)
  - For finding best locality, now just search for whichever tablet has the lexicographic middle key
- Change the key that we transmit for GetChanges responses
  - For hash keys, send the encoded hash value (instead of sending a serialized number, then having the consumer cast and encode it)
  - For range keys (if the key has no hash value), we send the encoded key value

Test Plan:
Added ranged tests and ranged tests with uneven partitions
```
ybd --cxx-test integration-tests_twodc_ysql-test --gtest_filter "*TwoDCYsqlTest.SimpleReplicationWithRangedPartitions/0"
ybd --cxx-test integration-tests_twodc_ysql-test --gtest_filter "*TwoDCYsqlTest.SimpleReplicationWithRangedPartitionsAndUnevenTabletCounts/0"
```

Also simple tests for the lexicographically ordered middle key algorithm:
```
ybd --cxx-test partition-test --gtest_filter PartitionTest.TestLexicographicMiddleKey
```

Reviewers: rahuldesirazu, nicolas, hsunder

Reviewed By: hsunder

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D22125

hulien22 commented Jan 6, 2023

closed by b20ca0e
