
mixin: Use sidecar's metric timestamp for healthcheck #3204

Merged
merged 1 commit from add-2m-delay-sidecar into thanos-io:main on Mar 5, 2021

Conversation

@hwoarang (Contributor) commented Sep 22, 2020

During Prometheus updates the alert was firing because the metric was
initialized to a value of '0' before the first heartbeat was sent. As a
result, evaluating the alert effectively took just the value of time()
into consideration, which led to misleading information about the health
of the sidecar.

As the thanos_sidecar_last_heartbeat_success_time_seconds metric is
effectively just a timestamp that resets on new deployments, we can
simply wrap it in the timestamp() function, which returns almost the
same value as the metric itself, with the added benefit that heartbeat
resets are ignored.

Signed-off-by: Markos Chandras markos@chandras.me
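
For illustration, the change described above can be sketched in PromQL. The metric name and the timestamp() wrapping come from the description; the label selector, the grouping, and any threshold are assumptions here, not the mixin's exact rule:

```promql
# Before (sketch): the alert compares wall-clock time against the metric value.
# Right after a deployment the sidecar exports the metric as 0 until the first
# heartbeat succeeds, so the expression briefly collapses to roughly `time() - 0`
# and the alert fires.
time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod)

# After (sketch): wrapping the metric in timestamp() compares against the time of
# the last scraped sample instead; that value stays close to the heartbeat
# timestamp while the sidecar is healthy, but it does not reset to 0 on new
# deployments, so restarts no longer trip the alert.
time() - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"})) by (job, pod)
```

Either expression would still be compared against a threshold in the actual alerting rule; the threshold is not part of this description, so it is omitted here.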

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

@hwoarang force-pushed the add-2m-delay-sidecar branch 2 times, most recently from 8f7b458 to e097018, on September 22, 2020 at 12:37
@hwoarang changed the title from "mixin: Use 2m interval for Thanos sidecar healthcheck" to "mixin: Use sidecar's metric timestamp for healthcheck" on Sep 24, 2020
@hwoarang requested a review from kakkoyun on October 1, 2020 at 08:28
stale bot commented Nov 30, 2020

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale bot added the stale label Nov 30, 2020
stale bot removed the stale label Nov 30, 2020
@hwoarang (Contributor, Author) commented

Update PR to match latest master branch

@kakkoyun any thoughts here?

stale bot commented Jan 30, 2021

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale bot added the stale label Jan 30, 2021
stale bot removed the stale label Feb 1, 2021
@hwoarang (Contributor, Author) commented Feb 1, 2021

@kakkoyun @bwplotka would it be possible to merge this? If not, do you have any feedback on how to bring it into a mergeable state? :) Thank you

@kakkoyun (Member) left a comment

LGTM

@kakkoyun (Member) commented Feb 1, 2021

@hwoarang This looks good to me in principle; however, you need to generate the docs and make sure this fulfils the expected behaviour with tests. CI already points out where it falls short. You can find all the necessary tasks as Make actions.

@kakkoyun (Member) left a comment

Requesting changes as commented above.

@hwoarang force-pushed the add-2m-delay-sidecar branch 3 times, most recently from 7dba321 to 4568191, on February 12, 2021 at 12:26
@hwoarang (Contributor, Author) commented

> @hwoarang This looks good to me in principle; however, you need to generate the docs and make sure this fulfils the expected behaviour with tests. CI already points out where it falls short. You can find all the necessary tasks as Make actions.

@kakkoyun thank you for your input and apologies for taking a while to address your concerns. Tests are passing now; I had to refactor them a little bit since we are now effectively testing for a different kind of alert. Please let me know your thoughts. Thank you

@hwoarang (Contributor, Author) commented

The only test that seems to fail in Circle CI is --- FAIL: TestBucketStore_ManyParts_e2e (0.20s), which I believe is not related to this PR.

Base automatically changed from master to main February 26, 2021 16:30
@kakkoyun enabled auto-merge (squash) March 1, 2021 08:52
@kakkoyun (Member) commented Mar 1, 2021

Hey @hwoarang, I enabled auto-merge and approved. Please fix the conflicts in the CHANGELOG.

During Prometheus updates the alert was firing because the metric was
initialized to a value of '0' before the first heartbeat was sent. As a
result, evaluating the alert effectively took just the value of time()
into consideration, which led to misleading information about the health
of the sidecar.

As the thanos_sidecar_last_heartbeat_success_time_seconds metric is
effectively just a timestamp that resets on new deployments, we can
simply wrap it in the timestamp() function, which returns almost the
same value as the metric itself, with the added benefit that heartbeat
resets are ignored.

This also refactors the relevant tests and drops the timeout to 4
minutes in order to ensure that we do not get hit by stale data if
the sidecar takes longer to start.

Signed-off-by: Markos Chandras <markos@chandras.me>
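
To make the last paragraph of the commit message above concrete, a rough sketch of the post-change expression with the 4-minute (240 s) cutoff might look as follows; the ThanosSidecarUnhealthy alert name is taken from the later follow-up (#4342), and the selector and grouping are assumptions rather than the exact generated rule:

```promql
# Sketch of the expression behind the ThanosSidecarUnhealthy alert after this
# change: fire when the timestamp of the most recently scraped heartbeat sample
# is more than 240 seconds (4 minutes) behind the current time.
time()
  - max(timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"})) by (job, pod)
>= 240
```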
@hwoarang (Contributor, Author) commented Mar 5, 2021

> Hey @hwoarang, I enabled auto-merge and approved. Please fix the conflicts in the CHANGELOG.

Hello @kakkoyun. Thank you so much. I will have this ready today.

@kakkoyun merged commit 80e0257 into thanos-io:main on Mar 5, 2021
kakkoyun added a commit that referenced this pull request Mar 5, 2021
@hwoarang deleted the add-2m-delay-sidecar branch on March 5, 2021 at 12:52
@kakkoyun (Member) commented Mar 5, 2021

@hwoarang This PR has introduced regressions around the pod-to-instance label renamings. It's my bad for marking it as auto-merge; I didn't anticipate this. I'll send a follow-up PR to fix the issues.

@hwoarang (Contributor, Author) commented Mar 5, 2021

@kakkoyun really sorry that I missed that locally. I assumed the green CI was a good indication :)

andrejbranch pushed a commit to andrejbranch/thanos that referenced this pull request Mar 11, 2021
dgrisonnet pushed a commit to dgrisonnet/thanos that referenced this pull request Mar 26, 2021
dgrisonnet pushed a commit to dgrisonnet/thanos that referenced this pull request Mar 26, 2021
dgrisonnet pushed a commit to dgrisonnet/thanos that referenced this pull request Mar 26, 2021
kakkoyun pushed a commit that referenced this pull request Mar 26, 2021
onprem pushed a commit that referenced this pull request Jun 23, 2021
fix(mixin): ThanosSidecarUnhealthy doesn't fire if the sidecar is never healthy (#4342)

* Revert "mixin: Use sidecar's metric timestamp for healthcheck (#3204) (#3979)"

This reverts commit 5139e33.

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

* fix(mixin): ThanosSidecarUnhealthy doesn't fire if the sidecar is never healthy

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>