[xCluster] Lag gets "stuck" for tablet splits #17025

Closed
hulien22 opened this issue Apr 25, 2023 · 0 comments
Assignees
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/high High Priority xCluster Label for xCluster related issues/improvements

Comments

Contributor

hulien22 commented Apr 25, 2023

Jira Link: DB-6339
A split parent tablet doesn't get any new writes, so its last record time stops increasing (writes go to its children). Lag is calculated as (last record time) - (last replicated time), so even if replication is paused and we keep writing to the parent tablet's key range, the reported lag won't increase.

In addition, if we are slow at processing this parent tablet, then once we reach the split it suddenly looks like we have missed replicating a large amount of data, which can cause a jump in lag.

Currently we don't generate any metrics for split children, but perhaps that's something we could do to fix this? We would need to distinguish bootstrapped streams from new split children streams.
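
To make the "stuck" behavior concrete, here is a minimal sketch of the lag formula described above. It is not the actual xCluster code: the `TabletTimes` class, its field names, and the millisecond values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TabletTimes:
    # Hypothetical fields; times are arbitrary millisecond values.
    last_record_time: int      # time of the newest record written to this tablet
    last_replicated_time: int  # time of the newest record already replicated


def lag(t: TabletTimes) -> int:
    # Lag as described in this issue: last record time - last replicated time.
    return t.last_record_time - t.last_replicated_time


# Parent tablet at the moment of the split, with replication paused.
parent = TabletTimes(last_record_time=1_000, last_replicated_time=400)
lag_at_split = lag(parent)

# New writes land on the children, so neither of the parent's times moves.
newest_child_write = 3_000
lag_after_writes = lag(parent)

# What we would like to report for this key range instead.
true_backlog = newest_child_write - parent.last_replicated_time

print(lag_at_split, lag_after_writes, true_backlog)
```

With these made-up numbers the parent's lag is the same before and after the new writes, while the backlog for the key range keeps growing.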

@hulien22 hulien22 added area/docdb YugabyteDB core features xCluster Label for xCluster related issues/improvements labels Apr 25, 2023
@hulien22 hulien22 self-assigned this Apr 25, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Apr 25, 2023
@yugabyte-ci yugabyte-ci added priority/high High Priority and removed priority/medium Medium priority issue labels Aug 15, 2023
hulien22 added a commit that referenced this issue Sep 27, 2023
… children

Summary:
For split tablet children that have yet to be polled, we end up with incorrect lag metrics for
that hierarchy of tablets. This inaccuracy arises because we calculate lag metrics as (time of last
record) - (time of last sent record). For the parent tablet, (time of last record) doesn't change
since we aren't writing to this tablet anymore; and for children tablets, we don't have a (time of
last sent record) as these tablets aren't being polled.

Ideally what we want is (time of last record on the split tablet) - (time of last sent record on the
parent tablet).

This fix implements this metric calculation. The lag for the parent tablet is still calculated in
the same way, but children tablets will fetch the parent tablet's last sent/committed record time
and use that for their lag calculation. Since we aggregate on stream in the end, we will then have
the children tablet metrics as the final value.
Jira: DB-6339
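
A rough sketch of the calculation described in this summary, for illustration only. The `Tablet` class, its fields, and both helper functions are hypothetical names rather than the real implementation, and aggregating per stream by taking the maximum is an assumption: the summary only says that metrics are aggregated on stream.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Tablet:
    # Hypothetical model of a tablet's replication timestamps (milliseconds).
    tablet_id: str
    last_record_time: int
    last_sent_time: Optional[int] = None  # None: tablet has not been polled yet
    parent_id: Optional[str] = None       # set only for split children


def tablet_lag(tablet: Tablet, tablets: Dict[str, Tablet]) -> Optional[int]:
    sent = tablet.last_sent_time
    # An unpolled child borrows its parent's last sent record time.
    if sent is None and tablet.parent_id is not None:
        sent = tablets[tablet.parent_id].last_sent_time
    if sent is None:
        return None  # nothing to compare against
    return tablet.last_record_time - sent


def stream_lag(tablets: Dict[str, Tablet]) -> int:
    # Assumption: the per-stream aggregate is the largest per-tablet lag.
    lags = []
    for t in tablets.values():
        value = tablet_lag(t, tablets)
        if value is not None:
            lags.append(value)
    return max(lags, default=0)


tablets = {
    "parent": Tablet("parent", last_record_time=1_000, last_sent_time=400),
    "child_a": Tablet("child_a", last_record_time=3_000, parent_id="parent"),
    "child_b": Tablet("child_b", last_record_time=2_500, parent_id="parent"),
}
print(stream_lag(tablets))
```

Under the max assumption, the stream-level value here comes from `child_a`, so it keeps growing as new writes land on the children even though the parent's own lag stays fixed.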

Test Plan:
```
ybd --cxx-test integration-tests_xcluster-tablet-split-itest --gtest_filter XClusterTabletSplitMetricsTest.VerifyReplicationLagMetricsOnChildren
```

Reviewers: hsunder, xCluster

Reviewed By: hsunder

Subscribers: ybase, ycdcxcluster

Differential Revision: https://phorge.dev.yugabyte.com/D28741
hulien22 added a commit that referenced this issue Sep 28, 2023
…or split tablet children

Summary:
Original commit: bac8402 / D28741
For split tablet children that have yet to be polled, we end up with incorrect lag metrics for
that hierarchy of tablets. This inaccuracy arises because we calculate lag metrics as (time of last
record) - (time of last sent record). For the parent tablet, (time of last record) doesn't change
since we aren't writing to this tablet anymore; and for children tablets, we don't have a (time of
last sent record) as these tablets aren't being polled.

Ideally what we want is (time of last record on the split tablet) - (time of last sent record on the
parent tablet).

This fix implements this metric calculation. The lag for the parent tablet is still calculated in
the same way, but children tablets will fetch the parent tablet's last sent/committed record time
and use that for their lag calculation. Since we aggregate on stream in the end, we will then have
the children tablet metrics as the final value.
Jira: DB-6339

Test Plan:
```
ybd --cxx-test integration-tests_xcluster-tablet-split-itest --gtest_filter XClusterTabletSplitMetricsTest.VerifyReplicationLagMetricsOnChildren
```

Reviewers: hsunder, xCluster

Reviewed By: hsunder

Subscribers: ycdcxcluster, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D28846
hulien22 added a commit that referenced this issue Sep 28, 2023
…or split tablet children

Summary:
Original commit: bac8402 / D28741
For split tablet children that have yet to be polled, we end up with incorrect lag metrics for
that hierarchy of tablets. This inaccuracy arises because we calculate lag metrics as (time of last
record) - (time of last sent record). For the parent tablet, (time of last record) doesn't change
since we aren't writing to this tablet anymore; and for children tablets, we don't have a (time of
last sent record) as these tablets aren't being polled.

Ideally what we want is (time of last record on the split tablet) - (time of last sent record on the
parent tablet).

This fix implements this metric calculation. The lag for the parent tablet is still calculated in
the same way, but children tablets will fetch the parent tablet's last sent/committed record time
and use that for their lag calculation. Since we aggregate on stream in the end, we will then have
the children tablet metrics as the final value.
Jira: DB-6339

Test Plan:
```
ybd --cxx-test integration-tests_xcluster-tablet-split-itest --gtest_filter XClusterTabletSplitMetricsTest.VerifyReplicationLagMetricsOnChildren
```

Reviewers: hsunder, xCluster

Reviewed By: hsunder

Subscribers: ybase, ycdcxcluster

Differential Revision: https://phorge.dev.yugabyte.com/D28847
@hulien22 hulien22 closed this as completed Oct 9, 2023

2 participants