[Platform] New Health check for Replication status in 2dc setup #4843

Closed
rao-vasireddy opened this issue Jun 21, 2020 · 0 comments
Labels: area/cdc (Change Data Capture), area/platform (Yugabyte Platform), kind/enhancement (This is an enhancement of an existing feature)


rao-vasireddy commented Jun 21, 2020

Raise an alert when the replication lag in a two-data-center (2DC) setup exceeds the configured threshold.

Aha! Link: https://yugabyte-test.aha.io/features/PLATFORM-264

@rao-vasireddy rao-vasireddy added kind/enhancement This is an enhancement of an existing feature area/cdc Change Data Capture area/platform Yugabyte Platform labels Jun 21, 2020
@rkarthik007 rkarthik007 added this to To do in Platform Jul 6, 2020
@streddy-yb streddy-yb moved this from To do to Backlog in Platform Aug 25, 2020
@streddy-yb streddy-yb assigned daniel-yb and unassigned andrewc-dev and chirag-yb Oct 12, 2020
@streddy-yb streddy-yb moved this from Backlog to To do in Platform Oct 12, 2020
@streddy-yb streddy-yb added this to the v2.3 milestone Oct 12, 2020
@streddy-yb streddy-yb moved this from To do to In progress in Platform Oct 12, 2020
@streddy-yb streddy-yb moved this from In progress to In Review in Platform Oct 26, 2020
@streddy-yb streddy-yb modified the milestones: v2.3, 2.5.1.0 Nov 11, 2020
daniel-yb added a commit that referenced this issue Nov 12, 2020
Summary:
Add the ability to configure alerts that are triggered by Prometheus queries.

Also add a UI for the replication lag alert. The alert can be turned on or off at runtime, and its threshold (in ms) is configurable by the user. The default is 3m.
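As a rough illustration of a query-driven alert, the sketch below builds a threshold expression and checks a Prometheus instant-query response. It is not the code from this commit. The metric name `replication_lag_ms`, the helper names, and reading the "3m" default as 180000 ms are all assumptions made for the example; only the Prometheus `/api/v1/query` endpoint and its response shape are standard.

```python
import json
import urllib.parse
import urllib.request

ALLOWED_COMPARATORS = (">", "<", ">=", "<=")


def build_alert_query(metric: str, threshold_ms: int, comparator: str = ">") -> str:
    """Build a PromQL filter expression such as 'replication_lag_ms > 180000'.

    The metric name is hypothetical; the issue does not name the real one.
    """
    if comparator not in ALLOWED_COMPARATORS:
        raise ValueError(f"unsupported comparator: {comparator!r}")
    return f"{metric} {comparator} {threshold_ms}"


def is_firing(response: dict) -> bool:
    """Return True if the instant query matched at least one series.

    A PromQL comparison without 'bool' drops series that fail the
    comparison, so an empty result vector means the alert is not firing.
    """
    if response.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {response}")
    return len(response["data"]["result"]) > 0


def query_prometheus(base_url: str, expr: str, timeout: float = 10.0) -> dict:
    """Run an instant query against the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": expr})
    url = f"{base_url.rstrip('/')}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Assumption: "3m" is read here as three minutes, i.e. 180000 ms.
DEFAULT_THRESHOLD_MS = 180_000
```

Passing `comparator="<"` gives the flipped form of the same expression, which is one way to force an alert into the resolved state during testing.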

Test Plan:
Use a proxy metric (cpu_utime) to test the end-to-end flow: the UI, triggering the alert, resolving the alert (by flipping the ">" to "<" in the query), and verifying that the corresponding emails are sent when the alert is triggered and resolved, and not sent when the setting is disabled in the UI.

```
---------- MESSAGE FOLLOWS ----------
Content-Type: multipart/alternative;
 boundary="===============1491229132144892902=="
MIME-Version: 1.0
Subject: Yugabyte Platform Alert - <[admin][admin]>
From:
To: daniel@yugabyte.com
X-Peer: 127.0.0.1

--===============1491229132144892902==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

Replication Lag Alert for daniel-test-backup-alert-oct-20-4 is firing.
--===============1491229132144892902==--
------------ END MESSAGE ------------
```
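A message with the same structure as the capture above can be put together with Python's standard `email` package. This is only a sketch of the message layout, not the platform's mailer. The function name and its parameters are hypothetical, and the text used for any state other than "firing" is not shown in the source.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_alert_email(universe_name: str, state: str, recipient: str,
                      customer_tag: str) -> MIMEMultipart:
    """Build a multipart/alternative alert email with one plain-text part.

    All parameter names are illustrative; 'state' would be 'firing' for
    the captured message above.
    """
    msg = MIMEMultipart("alternative")
    msg["Subject"] = f"Yugabyte Platform Alert - <{customer_tag}>"
    msg["To"] = recipient
    body = f"Replication Lag Alert for {universe_name} is {state}."
    # us-ascii plain text is emitted with Content-Transfer-Encoding: 7bit.
    msg.attach(MIMEText(body, "plain", "us-ascii"))
    return msg


if __name__ == "__main__":
    example = build_alert_email(
        "daniel-test-backup-alert-oct-20-4", "firing",
        "daniel@yugabyte.com", "[admin][admin]",
    )
    print(example.as_string())
```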

{F14427}

{F14428}

{F14429}

Reviewers: arnav, wesley, spotachev, sanketh, andrew

Reviewed By: sanketh, andrew

Subscribers: andrew, sshevchenko, jenkins-bot

Differential Revision: https://phabricator.dev.yugabyte.com/D9726
@streddy-yb streddy-yb moved this from In Review to Needs QA/Docs in Platform Nov 13, 2020
@streddy-yb streddy-yb moved this from Needs QA/Docs to Closed in Platform Feb 1, 2021