Skip to content

Latest commit

 

History

History
87 lines (63 loc) · 2.55 KB

README.md

File metadata and controls

87 lines (63 loc) · 2.55 KB

Prometheus health checks collector

This library includes a custom collector for the Prometheus JVM Client.

The HealthChecksCollector creates a GaugeMetricFamily of name health_check_status and a label system which holds the health-checker's name.

Possible values are:

  • 1.0: means the system is healthy.
  • 0.0: means the system is unhealthy.

To link the HealthChecks add them to the registered HealthChecksCollector.

class DbHealthCheck extends HealthCheck {
  @Override
  public HealthStatus check() throws Exception {
    return checkDbConnection() ? HealthStatus.HEALTHY : HealthStatus.UNHEALTHY;
  }
  boolean checkDbConnection() {}
}

HealthChecksCollector healthchecksMetrics = HealthChecksCollector.newInstance().register();

DbHealthCheck dbHealthCheck = new DbHealthCheck();
FsHealthCheck fsHealthCheck = // ...

healthchecksMetrics.addHealthCheck("database", dbHealthCheck)
    .addHealthCheck("filesystem", fsHealthCheck);

Resulting samples:

# HELP health_check_status Health check status results
# TYPE health_check_status gauge
health_check_status{system="database",} 1.0
health_check_status{system="filesystem",} 0.0

To customize the gauge data (name, label and help string) you can use the Builder class:

import com.github.strengthened.prometheus.healthchecks.HealthChecksCollector.Builder;

HealthChecksCollector collector = Builder.of().setGaugeMetricName("my_metric_name")
    .setGaugeMetricLabelName("my_label").build();

It also possible to make an asynchronous HealthCheck annotating it with @Async

@Async(initialState=HealthStatus.UNHEALTHY, unit=TimeUnit.SECONDS, period=1)
class TestHealthCheck extends HealthCheck {
  // Executed every second starting with an unhealthy status.
}

For further details see the API Document.

Bonus facts

Alertmanager

An example rules file with an alert would be:

groups:
- name: example
  rules:

  # Alert for any system that is unreachable for >2 minutes.
  - alert: HealthCheck
    expr: health_check_status < 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "System {{ $labels.system }} down"
      description: "System {{ $labels.system }} of instance {{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."

Grafana

You can use the Grafana's Discrete Panel to show health's state transitions in an horizontal graph!