[xCluster] Lag gets "stuck" for tablet splits #17025
Labels
area/docdb
YugabyteDB core features
kind/bug
This issue is a bug
priority/high
High Priority
xCluster
Label for xCluster related issues/improvements
Comments
hulien22
added
area/docdb
YugabyteDB core features
xCluster
Label for xCluster related issues/improvements
labels
Apr 25, 2023
yugabyte-ci
added
kind/bug
This issue is a bug
priority/medium
Medium priority issue
labels
Apr 25, 2023
yugabyte-ci
added
priority/high
High Priority
and removed
priority/medium
Medium priority issue
labels
Aug 15, 2023
hulien22
added a commit
that referenced
this issue
Sep 27, 2023
… children

Summary: For split tablet children that have not yet been polled, we end up with incorrect lag metrics for that hierarchy of tablets. This inaccuracy arises because we calculate lag metrics as (time of last record) - (time of last sent record). For the parent tablet, (time of last record) doesn't change, since we are no longer writing to this tablet; for the children tablets, we don't have a (time of last sent record), as these tablets aren't being polled yet.

Ideally, what we want is (time of last record on the split tablet) - (time of last sent record on the parent tablet). This fix implements this metric calculation. The lag for the parent tablet is still calculated in the same way, but children tablets will fetch the parent tablet's last sent/committed record time and use that for their lag calculation. Since we aggregate on stream in the end, we will then have the children tablet metrics as the final value.

Jira: DB-6339

Test Plan:
```
ybd --cxx-test integration-tests_xcluster-tablet-split-itest --gtest_filter XClusterTabletSplitMetricsTest.VerifyReplicationLagMetricsOnChildren
```

Reviewers: hsunder, xCluster
Reviewed By: hsunder
Subscribers: ybase, ycdcxcluster
Differential Revision: https://phorge.dev.yugabyte.com/D28741
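The calculation described in this commit summary can be sketched roughly as follows. This is only an illustrative Python model, not the actual YugabyteDB C++ code: the `Tablet` class, the `child_aware_lag` and `stream_lag` helpers, and the choice of `max` as the per-stream aggregation are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Tablet:
    """Hypothetical per-tablet bookkeeping; times are in seconds."""
    tablet_id: str
    last_record_time: float                 # time of the newest record on the tablet
    last_sent_time: Optional[float] = None  # None means the tablet has not been polled yet
    parent_id: Optional[str] = None         # set for split children


def child_aware_lag(tablet: Tablet, tablets: Dict[str, Tablet]) -> Optional[float]:
    """Lag = (time of last record) - (time of last sent record).

    An unpolled child borrows the parent's last sent time, as the summary describes.
    """
    sent = tablet.last_sent_time
    if sent is None and tablet.parent_id is not None:
        parent = tablets.get(tablet.parent_id)
        sent = parent.last_sent_time if parent is not None else None
    if sent is None:
        return None  # nothing to compare against, so no metric
    return max(0.0, tablet.last_record_time - sent)


def stream_lag(tablets: Dict[str, Tablet]) -> float:
    """Assumed aggregation: the stream reports the largest tablet lag."""
    lags = [child_aware_lag(t, tablets) for t in tablets.values()]
    return max((lag for lag in lags if lag is not None), default=0.0)


# The parent stopped receiving writes at t=100 and was replicated up to t=90.
# Its two children keep receiving writes but have not been polled yet.
tablets = {
    "p": Tablet("p", last_record_time=100.0, last_sent_time=90.0),
    "c1": Tablet("c1", last_record_time=160.0, parent_id="p"),
    "c2": Tablet("c2", last_record_time=130.0, parent_id="p"),
}

parent_lag = child_aware_lag(tablets["p"], tablets)
c1_lag = child_aware_lag(tablets["c1"], tablets)
c2_lag = child_aware_lag(tablets["c2"], tablets)
overall = stream_lag(tablets)
```

In this model the parent still reports its own lag unchanged, while the children's values dominate the aggregate, which matches the statement that the children tablet metrics become the final value.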
hulien22
added a commit
that referenced
this issue
Sep 28, 2023
…or split tablet children

Summary: Original commit: bac8402 / D28741

For split tablet children that have not yet been polled, we end up with incorrect lag metrics for that hierarchy of tablets. This inaccuracy arises because we calculate lag metrics as (time of last record) - (time of last sent record). For the parent tablet, (time of last record) doesn't change, since we are no longer writing to this tablet; for the children tablets, we don't have a (time of last sent record), as these tablets aren't being polled yet.

Ideally, what we want is (time of last record on the split tablet) - (time of last sent record on the parent tablet). This fix implements this metric calculation. The lag for the parent tablet is still calculated in the same way, but children tablets will fetch the parent tablet's last sent/committed record time and use that for their lag calculation. Since we aggregate on stream in the end, we will then have the children tablet metrics as the final value.

Jira: DB-6339

Test Plan:
```
ybd --cxx-test integration-tests_xcluster-tablet-split-itest --gtest_filter XClusterTabletSplitMetricsTest.VerifyReplicationLagMetricsOnChildren
```

Reviewers: hsunder, xCluster
Reviewed By: hsunder
Subscribers: ycdcxcluster, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D28846
hulien22
added a commit
that referenced
this issue
Sep 28, 2023
…or split tablet children

Summary: Original commit: bac8402 / D28741

For split tablet children that have not yet been polled, we end up with incorrect lag metrics for that hierarchy of tablets. This inaccuracy arises because we calculate lag metrics as (time of last record) - (time of last sent record). For the parent tablet, (time of last record) doesn't change, since we are no longer writing to this tablet; for the children tablets, we don't have a (time of last sent record), as these tablets aren't being polled yet.

Ideally, what we want is (time of last record on the split tablet) - (time of last sent record on the parent tablet). This fix implements this metric calculation. The lag for the parent tablet is still calculated in the same way, but children tablets will fetch the parent tablet's last sent/committed record time and use that for their lag calculation. Since we aggregate on stream in the end, we will then have the children tablet metrics as the final value.

Jira: DB-6339

Test Plan:
```
ybd --cxx-test integration-tests_xcluster-tablet-split-itest --gtest_filter XClusterTabletSplitMetricsTest.VerifyReplicationLagMetricsOnChildren
```

Reviewers: hsunder, xCluster
Reviewed By: hsunder
Subscribers: ybase, ycdcxcluster
Differential Revision: https://phorge.dev.yugabyte.com/D28847
Jira Link: DB-6339
A split parent tablet doesn't get any new writes (they go to its children), so its last record time stops increasing. As a result, even if replication is paused and we keep writing to this parent tablet's key range, its lag (calculated as last record time - last replicated time) won't increase, even though we are still writing to this partition range.
In addition, if we are slow at processing this parent tablet, then when we get to the split it suddenly looks like we have missed replicating a large amount of data, leading to a potential jump in lag.
Currently we don't generate any metrics for split children, but perhaps that's something we could do to fix this? We would need to distinguish bootstrapped streams from new split-children streams.
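To make the symptom concrete, here is a small Python model of the behaviour described in this report. It is a sketch under stated assumptions, not YugabyteDB code: the `TabletTimes` class and both helper functions are hypothetical, the stream value is assumed to be the maximum over tablets that report a lag, and a child is assumed to start being polled from the split point once the parent has been fully processed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TabletTimes:
    """Hypothetical per-tablet timestamps, in seconds."""
    last_record_time: float          # newest record written to the tablet
    last_sent_time: Optional[float]  # newest record replicated; None if never polled


def tablet_lag(t: TabletTimes) -> Optional[float]:
    """Lag as described in the report: last record time - last replicated time."""
    if t.last_sent_time is None:
        return None  # split children that are not polled produce no metric
    return max(0.0, t.last_record_time - t.last_sent_time)


def reported_stream_lag(tablets: List[TabletTimes]) -> float:
    """Assumed aggregation: largest lag among tablets that report one."""
    lags = [tablet_lag(t) for t in tablets]
    return max((lag for lag in lags if lag is not None), default=0.0)


# The tablet split at t=100 while replication had only reached t=90.
parent = TabletTimes(last_record_time=100.0, last_sent_time=90.0)
child = TabletTimes(last_record_time=100.0, last_sent_time=None)

# Replication is paused while writes keep landing on the child.
stuck = []
for now in (120.0, 140.0, 160.0):
    child.last_record_time = now
    stuck.append(reported_stream_lag([parent, child]))

# The poller finally drains the parent, reaches the split, and begins
# polling the child from the split point (an assumption of this model).
parent.last_sent_time = 100.0
child.last_sent_time = 100.0
after_split = reported_stream_lag([parent, child])
```

Under these assumptions the reported lag stays at 10 seconds for the whole pause and then jumps to 60 seconds as soon as the child is polled, even though the real backlog was growing steadily the entire time.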