[Platform] Add metrics for xCluster replication #3820

Open
ndeodhar opened this issue Mar 3, 2020 · 2 comments
Labels
area/cdc Change Data Capture area/platform Yugabyte Platform

@ndeodhar
Contributor

ndeodhar commented Mar 3, 2020

This is a master task for adding xCluster replication metrics to Platform. The following items need to be added.

Phase 1

✅ Expose the cluster-wide max-lag metric from YugabyteDB as a Prometheus metric
✅ Add the max lag metric as a metric graph into Platform
✅ Add a new tab for replication in Platform (universe details)
✅ Replication tab: show if the cluster is caught up or not (max lag is 0 if caught up).
✅ Replication tab: show the status by default on the source cluster. If not caught up, show the max lag as a time value (seconds, etc.).
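
The caught-up rule described in the Replication tab items above can be sketched roughly as follows. This is only an illustration, not Platform code: the function name `replication_status` and the input shape (a list of lag samples in microseconds) are assumptions made for the example.

```python
from typing import Iterable


def replication_status(lag_micros_samples: Iterable[float]) -> dict:
    """Summarize replication state from lag samples given in microseconds.

    Hypothetical helper: the cluster-wide max lag is the largest sample,
    and the cluster counts as caught up only when that max is 0.
    """
    samples = list(lag_micros_samples)
    # With no samples we cannot claim the cluster is caught up.
    if not samples:
        return {"caught_up": False, "max_lag_seconds": None}
    max_lag = max(samples)
    return {
        "caught_up": max_lag == 0,
        # Report the lag as time (seconds) rather than raw microseconds.
        "max_lag_seconds": max_lag / 1_000_000,
    }
```

Treating "no samples" as not caught up is a design choice for the sketch; the issue itself does not say how missing data should be displayed.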

Phase 2 - v2.3

⬜️ Send alerts on high replication lag
⬜️ Ability to configure replication lag thresholds for sending alerts
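
One possible shape for the Phase 2 items, as a sketch only: a configurable threshold checked against the max lag. The names `LagAlertConfig` and `check_lag_alert`, and the 60-second default, are hypothetical; the issue does not define how alerts or thresholds would be modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LagAlertConfig:
    # Hypothetical settings; the issue does not specify names or defaults.
    threshold_seconds: float = 60.0
    enabled: bool = True


def check_lag_alert(max_lag_micros: float, config: LagAlertConfig) -> Optional[str]:
    """Return an alert message when lag exceeds the threshold, else None."""
    if not config.enabled:
        return None
    lag_seconds = max_lag_micros / 1_000_000
    if lag_seconds > config.threshold_seconds:
        return (
            f"Replication lag {lag_seconds:.1f}s exceeds "
            f"threshold {config.threshold_seconds:.1f}s"
        )
    return None
```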

Phase 3

⬜️ Give table level breakdown for the above in the replication status page
⬜️ Max lag in terms of op ID at cluster level
⬜️ Max lag per table in terms of op ID

cc: @bmatican @ramkumarvs @rkarthik007 @schoudhury

Aha! Link: https://yugabyte-test.aha.io/features/PLATFORM-644

@ndeodhar ndeodhar added the area/platform Yugabyte Platform label Mar 3, 2020
@ndeodhar ndeodhar added the area/cdc Change Data Capture label Mar 3, 2020
@rkarthik007 rkarthik007 added this to Backlog in Platform Jun 1, 2020
@schoudhury schoudhury moved this from Backlog to To do in Platform Jun 1, 2020
@rkarthik007 rkarthik007 changed the title [YW] Add metrics for 2DC [Platform] Add metrics for xCluster replication Jun 1, 2020
@schoudhury
Contributor

Status update - currently in design.

@rkarthik007 rkarthik007 added this to the v2.2.x milestone Jun 5, 2020
@bmatican bmatican moved this from To do to In progress in Platform Jun 8, 2020
@rkarthik007 rkarthik007 modified the milestones: v2.2.x, v2.2 Jun 22, 2020
andrewc-dev pushed a commit that referenced this issue Jul 6, 2020
Summary:
Add a "Replication" tab to the universe overview, if enabled, and query for the replication lag metric as implemented in https://phabricator.dev.yugabyte.com/D8733
For now, show the latest value of the committed lag metric. Convert the metric value to human-readable units where possible, i.e., if the value is large enough, show the lag in seconds or minutes instead of microseconds.
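
The unit conversion described in this summary could look something like the sketch below. It is not the code from the commit (which lives in the Platform UI); `format_lag` and its rounding and unit labels are assumptions chosen for illustration.

```python
def format_lag(lag_micros: float) -> str:
    """Render a microsecond lag value in the largest unit that fits.

    Hypothetical formatter: falls back to microseconds when the value
    is too small to express as at least one second.
    """
    units = [
        ("min", 60_000_000),  # microseconds per minute
        ("s", 1_000_000),     # microseconds per second
    ]
    for name, size in units:
        if lag_micros >= size:
            return f"{lag_micros / size:.2f} {name}"
    return f"{lag_micros:.0f} us"
```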

Test Plan:
Note: User's feature config must set `universes.details.replication: 'available'` in order for Replication tab to appear.
Go to universe overview and then the Replication tab. Confirm that the page displays the
metric for `tserver_async_replication_lag_micros` as set in DocDB layer. Confirm in Prometheus that the
value is correct.
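
For the Prometheus check in the test plan, a small script along these lines could fetch the value for comparison. It uses the standard Prometheus HTTP instant-query endpoint (`/api/v1/query`); the base URL passed in and the choice of `max(...)` over all series are assumptions, since the issue does not say where Prometheus runs or how the metric is labeled.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

METRIC = "tserver_async_replication_lag_micros"


def build_query_url(prometheus_base: str) -> str:
    """Build an instant-query URL for the max lag across all series."""
    params = urlencode({"query": f"max({METRIC})"})
    return f"{prometheus_base.rstrip('/')}/api/v1/query?{params}"


def parse_max_lag(payload: dict) -> Optional[float]:
    """Extract the sample value from a Prometheus instant-query response."""
    if payload.get("status") != "success":
        return None
    result = payload.get("data", {}).get("result", [])
    if not result:
        return None
    # Each vector sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def fetch_max_lag(prometheus_base: str) -> Optional[float]:
    """Query Prometheus over HTTP; the base URL is supplied by the caller."""
    with urlopen(build_query_url(prometheus_base), timeout=10) as resp:
        return parse_max_lag(json.load(resp))
```

Calling `fetch_max_lag` with the address of the universe's Prometheus instance returns the lag in microseconds, or `None` if the query fails or no series exist.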

Example of Replication page with lag 0:
{F13767}
Example of metric number with large lag:
{F13769}

Reviewers: ram, rahuldesirazu, sshevchenko

Reviewed By: sshevchenko

Subscribers: ui, jenkins-bot

Differential Revision: https://phabricator.dev.yugabyte.com/D8738
@andrewc-dev
Contributor

Phase 1 is mostly complete. We currently do not show the status on the destination cluster, only on the source cluster. This seems to be a limitation of the CDC architecture as currently implemented. @ndeodhar

andrewc-dev pushed a commit that referenced this issue Jul 14, 2020
@rkarthik007 rkarthik007 modified the milestones: v2.2, v2.3 Jul 20, 2020
@rkarthik007 rkarthik007 moved this from In progress to To do in Platform Aug 4, 2020
@chirag-yb chirag-yb modified the milestones: v2.3, v2.2.x Aug 11, 2020
@streddy-yb streddy-yb moved this from To do to Backlog in Platform Aug 25, 2020
@bmatican bmatican added this to To do in xCluster replication via automation Jul 20, 2021
8 participants