[Platform] Add metrics for xCluster replication #3820

ndeodhar · 2020-03-03T23:53:49Z

This is a master task for adding xCluster replication metrics to Platform. The following items need to be added.

Phase 1

✅ Add the cluster-wide max-lag metric to the Prometheus metrics exported by YugabyteDB
✅ Add the max lag metric as a metric graph into Platform
✅ Add a new tab for replication in Platform (universe details)
✅ Replication tab: show if the cluster is caught up or not (max lag is 0 if caught up).
✅ Replication tab: show by default on the source cluster. If not caught up, we should show the max lag as time (seconds, etc).
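The "caught up" rule in the Replication tab items above (max lag is 0 if caught up) could be sketched roughly as follows. This is only an illustration: the function names and the idea of passing in one lag sample per tserver are assumptions, not the Platform implementation.

```python
from typing import Iterable, Optional


def cluster_max_lag_micros(per_tserver_lag_micros: Iterable[int]) -> Optional[int]:
    """Return the largest lag sample, or None when no samples were reported.

    Hypothetical helper: assumes the caller already fetched one lag value
    (in microseconds) per tserver, e.g. from Prometheus.
    """
    samples = list(per_tserver_lag_micros)
    if not samples:
        return None
    return max(samples)


def is_caught_up(max_lag_micros: Optional[int]) -> Optional[bool]:
    """A cluster counts as caught up only when the max lag is exactly 0.

    Returns None when the lag is unknown, so the UI can distinguish
    "no data" from "lagging".
    """
    if max_lag_micros is None:
        return None
    return max_lag_micros == 0
```

Returning `None` for missing data is a design choice made for this sketch; it avoids reporting a cluster as caught up when no metric was scraped at all.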

Phase 2 - v2.3

⬜️ Send alerts on high replication lag
⬜️ Ability to configure replication lag thresholds for sending alerts
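As a rough sketch of what configurable lag thresholds for alerting might look like: the `LagAlertConfig` class, its two threshold fields, and the severity names are all hypothetical and chosen only for illustration. Nothing here reflects an agreed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LagAlertConfig:
    """Hypothetical per-universe alert thresholds, in microseconds."""
    warning_micros: int
    critical_micros: int

    def __post_init__(self) -> None:
        # Reject configurations where the warning level would never fire.
        if self.warning_micros > self.critical_micros:
            raise ValueError("warning threshold must not exceed critical threshold")


def classify_lag(lag_micros: int, config: LagAlertConfig) -> str:
    """Map a lag value to an assumed severity: 'ok', 'warning', or 'critical'."""
    if lag_micros >= config.critical_micros:
        return "critical"
    if lag_micros >= config.warning_micros:
        return "warning"
    return "ok"
```

For example, with a 5-second warning and a 60-second critical threshold, a lag of 10 seconds would classify as `warning`.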

Phase 3

⬜️ Give table level breakdown for the above in the replication status page
⬜️ Max lag in terms of op ID at cluster level
⬜️ Max lag per table in terms of op ID
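One possible way to derive the op ID lag items above, sketched under heavy assumptions: each tablet is assumed to report the index of the last op written on the source and the index of the last op replicated to the destination, and the lag is their difference. The tuple layout and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed input shape: (table name, last op index written, last op index replicated)
TabletSample = Tuple[str, int, int]


def op_id_lag_by_table(tablets: Iterable[TabletSample]) -> Dict[str, int]:
    """Return the max op-index lag per table across that table's tablets."""
    lag: Dict[str, int] = defaultdict(int)
    for table, last_written, last_replicated in tablets:
        # Clamp at 0 in case the two samples were taken at slightly different times.
        tablet_lag = max(0, last_written - last_replicated)
        lag[table] = max(lag[table], tablet_lag)
    return dict(lag)
```

The cluster-level value would then be the maximum over the per-table values, for example `max(op_id_lag_by_table(samples).values(), default=0)`.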

cc: @bmatican @ramkumarvs @rkarthik007 @schoudhury

Aha! Link: https://yugabyte-test.aha.io/features/PLATFORM-644

schoudhury · 2020-06-05T17:39:23Z

Status update - currently in design.

Summary: Add a "Replication" tab to the universe overview, if enabled, and query for the replication lag metric as implemented in https://phabricator.dev.yugabyte.com/D8733. For now, show the latest value of the committed lag metric. Convert the metric value to human-readable units where possible, i.e. if the value is large enough, show the lag in minutes or seconds instead of microseconds.

Test Plan: Note: the user's feature config must set `universes.details.replication: 'available'` for the Replication tab to appear. Go to the universe overview and then the Replication tab. Confirm that the page displays the metric for `tserver_async_replication_lag_micros` as set in the DocDB layer. Confirm in Prometheus that the value is correct.

Example of Replication page with lag 0: {F13767}

Example of metric number with large lag: {F13769}

Reviewers: ram, rahuldesirazu, sshevchenko

Reviewed By: sshevchenko

Subscribers: ui, jenkins-bot

Differential Revision: https://phabricator.dev.yugabyte.com/D8738
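The unit conversion described in the summary (minutes or seconds instead of microseconds when the value is large enough) might look something like the sketch below. The function name, the unit cutoffs, and the two-decimal formatting are assumptions made for illustration; the actual UI code in D8738 may differ.

```python
def format_lag(lag_micros: int) -> str:
    """Render a lag value given in microseconds using the largest fitting unit.

    Hypothetical helper: cutoffs and output format are illustrative only.
    """
    if lag_micros < 0:
        raise ValueError("lag must be non-negative")

    # Units ordered from largest to smallest, with their size in microseconds.
    units = [("min", 60_000_000), ("s", 1_000_000), ("ms", 1_000)]
    for name, size in units:
        if lag_micros >= size:
            return f"{lag_micros / size:.2f} {name}"

    # Small values stay in microseconds, shown as a plain integer.
    return f"{lag_micros} us"
```

With these cutoffs, `format_lag(2_500_000)` yields `"2.50 s"` and a caught-up cluster with `format_lag(0)` yields `"0 us"`.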

andrewc-dev · 2020-07-06T19:02:21Z

Phase 1 is mostly complete. We currently do not show the status on the destination cluster, only on the source cluster. This seems to be a limitation of the CDC architecture as currently implemented. @ndeodhar

ndeodhar added the area/platform Yugabyte Platform label Mar 3, 2020

ndeodhar assigned ramkumarvs Mar 3, 2020

ndeodhar added the area/cdc Change Data Capture label Mar 3, 2020

rkarthik007 added this to Backlog in Platform Jun 1, 2020

schoudhury moved this from Backlog to To do in Platform Jun 1, 2020

rkarthik007 assigned rahuldesirazu and unassigned ramkumarvs Jun 1, 2020

rkarthik007 changed the title ~~[YW] Add metrics for 2DC~~ [Platform] Add metrics for xCluster replication Jun 1, 2020

rkarthik007 added this to the v2.2.x milestone Jun 5, 2020

bmatican moved this from To do to In progress in Platform Jun 8, 2020

rkarthik007 modified the milestones: v2.2.x, v2.2 Jun 22, 2020

rkarthik007 assigned andrewc-dev Jun 24, 2020

rkarthik007 modified the milestones: v2.2, v2.3 Jul 20, 2020

rkarthik007 assigned rao-vasireddy and unassigned rahuldesirazu and andrewc-dev Jul 20, 2020

rkarthik007 moved this from In progress to To do in Platform Aug 4, 2020

chirag-yb modified the milestones: v2.3, v2.2.x Aug 11, 2020

streddy-yb moved this from To do to Backlog in Platform Aug 25, 2020

bmatican added this to To do in xCluster replication via automation Jul 20, 2021
