
control-service: add counter to track data job watching task executions #692

Merged 1 commit into main on Feb 2, 2022

Conversation

tpalashki
Contributor

Currently, we lack monitoring of our data job watching task - the task
that monitors the K8s namespace for data job changes and updates the
execution and termination statuses of the data jobs, along with the
metrics exposed by the control service.

We have experienced cases where this task stops running. Given the
importance of this task, it is essential that we get an early alert when
this happens. This commit introduces a new metric (a counter) that exposes
the number of executions of this task. The counter can then be used in
dashboards to alert when the task stops executing for a period of time.

Testing done: new unit tests; manually starting the service and observing
the new, gradually increasing metric.

Signed-off-by: Tsvetomir Palashki <tpalashki@vmware.com>
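For illustration, below is a minimal sketch of how such an invocation counter could be registered and incremented with Micrometer. The class, method, and metric names here are hypothetical and do not necessarily mirror the actual control-service code; the exported Prometheus name also depends on the registry's naming convention.

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

// Hypothetical sketch, not the actual control-service class.
public class DataJobWatchTask {

    private final Counter invocationsCounter;

    public DataJobWatchTask(MeterRegistry meterRegistry) {
        // Registered once; exported to Prometheus under a name along the lines of
        // taurus_datajob_watch_task_invocations_counter.
        this.invocationsCounter =
                meterRegistry.counter("taurus.datajob.watch.task.invocations.counter");
    }

    public void watchJobs() {
        // Increment once per invocation, so a stalled task shows up as a flat counter.
        invocationsCounter.increment();
        // ... watch the K8s namespace and update execution/termination statuses ...
    }
}

A counter (rather than a gauge) is a natural fit here because dashboards can apply increase() or rate() over it and alert when it stops growing.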

@mivanov1988
Collaborator

LGTM

@antoniivanov
Collaborator

I'd like each new metric we add to have an example query showing how it will be used, as part of Testing Done. This is to make sure that the metric is useful and correct.

@tpalashki
Contributor Author

Below is an example usage of this metric. Note that the interval between adjacent invocations of the data job watch task is 5 minutes:

sum(increase(taurus_datajob_watch_task_invocations_counter[6m]))

If the task is invoked regularly (i.e. the counter is incremented), the above expression should produce a graph similar to the one below, with a value of 1 most of the time and a value of 2 every 4 minutes:

[screenshot: graph of the expression over time]

The expectation is that the above expression will evaluate to 0 if there are no counter increments in 6 minutes.
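Building on that, a possible alerting expression (hypothetical, not part of this change) would fire when the counter has not been incremented for a window comfortably larger than the 5-minute invocation interval, for example:

sum(increase(taurus_datajob_watch_task_invocations_counter[15m])) == 0

The 15m window is an arbitrary choice; any window larger than the invocation interval plus some slack for scrape delays would work.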

@tpalashki tpalashki merged commit 7053df7 into main Feb 2, 2022
@tpalashki tpalashki deleted the person/tpalashki/control-service branch February 2, 2022 21:02
doks5 pushed a commit that referenced this pull request Feb 3, 2022
…ns (#692)
