control-service: add counter to track data job watching task executions (#692)

We currently lack monitoring for our data job watching task. This is the
task that monitors the K8s namespace for data job changes and updates the
execution and termination statuses of the data jobs, along with the
metrics exposed by the control service.

We have seen cases where this task stops running. Given the importance of
the task, it is essential that we get an early alert when this happens.
This commit introduces a new counter metric that exposes the number of
executions of the task. The counter can then be used in dashboards to
alert when the task has not executed for a period of time.
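
As a rough illustration of the alerting idea, the sketch below flags a stall when a monotonically increasing invocation count stops growing for longer than a configured window. It is not part of this commit: `WatchTaskStallDetector` is a hypothetical helper, an `AtomicLong` stands in for the Micrometer counter, and timestamps are passed in explicitly to keep the example deterministic. In practice this check would more likely live in the monitoring system's alert rules than in application code.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not from the commit): reports a stall when a
 * monotonically increasing counter has not grown for longer than maxSilence.
 */
final class WatchTaskStallDetector {
    private final LongSupplier counterReader; // assumed source of the current counter value
    private final long maxSilenceMillis;
    private long lastSeenCount;
    private long lastIncreaseAtMillis;

    WatchTaskStallDetector(LongSupplier counterReader, Duration maxSilence, long nowMillis) {
        this.counterReader = counterReader;
        this.maxSilenceMillis = maxSilence.toMillis();
        this.lastSeenCount = counterReader.getAsLong();
        this.lastIncreaseAtMillis = nowMillis;
    }

    /** Samples the counter and returns true if it has been silent for too long. */
    boolean isStalled(long nowMillis) {
        long current = counterReader.getAsLong();
        if (current > lastSeenCount) {
            // The task ran at least once since the previous sample.
            lastSeenCount = current;
            lastIncreaseAtMillis = nowMillis;
        } else if (current < lastSeenCount) {
            // A lower value suggests the counter was reset (for example after
            // a restart); re-baseline without treating it as progress.
            lastSeenCount = current;
        }
        return nowMillis - lastIncreaseAtMillis > maxSilenceMillis;
    }
}
```

A caller would construct the detector with something like `invocations::get` as the reader and call `isStalled(System.currentTimeMillis())` on a timer. The window should be comfortably larger than the watch task's scheduling interval so that a single slow run does not raise a false alarm.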

Testing done: new unit tests; manually starting the service to observe
the new, gradually increasing metrics.

Signed-off-by: Tsvetomir Palashki <tpalashki@vmware.com>
tpalashki authored and doks5 committed Feb 3, 2022
1 parent d8dc457 commit 1bd40f7
Showing 4 changed files with 38 additions and 0 deletions.
@@ -122,6 +122,8 @@ Custom metrics are:
is allowed to be delayed from its schedule before an alert is triggered)
* tags:
* data_job - the data job name
* taurus.datajob.watch.task.invocations.counter (A counter that exposes the number of executions
of the data job monitoring task)


### Alerting
@@ -6,6 +6,7 @@
package com.vmware.taurus.service.monitoring;

import com.vmware.taurus.service.model.DataJob;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
@@ -33,6 +34,7 @@ public class DataJobMetrics {
    public static final String TAURUS_DATAJOB_INFO_METRIC_NAME = "taurus.datajob.info";
    public static final String TAURUS_DATAJOB_NOTIFICATION_DELAY_METRIC_NAME = "taurus.datajob.notification.delay";
    public static final String TAURUS_DATAJOB_TERMINATION_STATUS_METRIC_NAME = "taurus.datajob.termination.status";
    public static final String TAURUS_DATAJOB_WATCH_TASK_INVOCATIONS_COUNTER_NAME = "taurus.datajob.watch.task.invocations.counter";
    public static final String TAG_DATA_JOB = "data_job";
    public static final String TAG_EXECUTION_ID = "execution_id";
    public static final String TAG_TEAM = "team";
@@ -44,6 +46,7 @@ public class DataJobMetrics {
    public static final int DEFAULT_NOTIFICATION_DELAY_PERIOD_MINUTES = 240;

    private final MeterRegistry meterRegistry;
    private final Counter watchTaskInvocationsCounter;
    private final Map<String, Gauge> infoGauges = new ConcurrentHashMap<>();
    private final Map<String, Gauge> delayGauges = new ConcurrentHashMap<>();
    private final Map<String, Gauge> statusGauges = new ConcurrentHashMap<>();
@@ -53,6 +56,21 @@ public class DataJobMetrics {
    @Autowired
    public DataJobMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;

        watchTaskInvocationsCounter = Counter.builder(TAURUS_DATAJOB_WATCH_TASK_INVOCATIONS_COUNTER_NAME)
                .description("Counts the number of times the data jobs watching task is called.")
                .register(this.meterRegistry);
    }

    /**
     * Increments the counter used to track the number of times the {@link DataJobMonitor#watchJobs} method was invoked.
     */
    public void incrementWatchTaskInvocations() {
        try {
            watchTaskInvocationsCounter.increment();
        } catch (Exception e) {
            log.warn("Error while trying to increment counter.", e);
        }
    }

    /**
@@ -91,6 +91,7 @@ public DataJobMonitor(
            initialDelayString = "${datajobs.status.watch.initial.delay:10000}")
    @SchedulerLock(name = "watchJobs_schedulerLock")
    public void watchJobs() {
        dataJobMetrics.incrementWatchTaskInvocations();
        try {
            dataJobsKubernetesService.watchJobs(
                    labelsToWatch,
@@ -230,4 +230,21 @@ void testClearGauges_shouldClearAllGauges() {
        gauges = meterRegistry.find(DataJobMetrics.TAURUS_DATAJOB_TERMINATION_STATUS_METRIC_NAME).gauges();
        Assertions.assertEquals(0, gauges.size());
    }

    @Test
    @Order(13)
    void testIncrementWatchTaskInvocations() {
        dataJobMetrics.incrementWatchTaskInvocations();

        var counter = meterRegistry.counter(DataJobMetrics.TAURUS_DATAJOB_WATCH_TASK_INVOCATIONS_COUNTER_NAME);
        Assertions.assertEquals(1.0, counter.count(), 0.001);

        dataJobMetrics.incrementWatchTaskInvocations();
        dataJobMetrics.incrementWatchTaskInvocations();
        dataJobMetrics.incrementWatchTaskInvocations();
        dataJobMetrics.incrementWatchTaskInvocations();

        counter = meterRegistry.counter(DataJobMetrics.TAURUS_DATAJOB_WATCH_TASK_INVOCATIONS_COUNTER_NAME);
        Assertions.assertEquals(5.0, counter.count(), 0.001);
    }
}
