stackhpc/stackhpc-monasca-agent-plugins

This project is no longer under active development

StackHPC Monasca-Agent plugins

A collection of Monasca-Agent plugins to gather metrics. This repo functions as an incubator, with the ultimate aim to merge any effective plugins into the Monasca-Agent.

Includes:

  • Slurm (proof-of-concept)
  • nVidia GPUs
  • Prometheus (proof-of-concept)

Prometheus plugin

This is an experimental plugin which extends the capability of the existing Prometheus plugin to make it more useful. The following configuration options are supported:

metric_endpoint

The Prometheus endpoint to scrape.

Example:

metric_endpoint: "http://ceph-host:9283/metrics"

remove_hostname

Strip the hostname from each metric. This is useful when scraping an endpoint which exposes metrics not specific to a host, for example RabbitMQ queue lengths or Ceph cluster health.

Example:

remove_hostname: true

default_dimensions

A dict of dimensions to include with all metrics scraped from the specified endpoint.

Example:

default_dimensions:
  cluster_tag: production

counters_to_rates

Automatically convert counters to rates. This works by buffering counters locally and then computing the derivative with respect to time when the buffer is flushed to the Monasca API. When enabled, this setting uses the Prometheus metric type to identify counters and generates a new rate metric from each one. The original counter metrics are still posted to the API unless the whitelist excludes them. Each rate metric is named by appending _rate to the counter's name. Note that the Prometheus convention is to append _total to all counters, so a counter named ceph_osd_op_w will become ceph_osd_op_w_total_rate when converted to a rate.

Example:

counters_to_rates: True

Defaults to True.
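As a rough illustration of the buffering approach described above, the sketch below derives a rate from the first and last buffered samples of a counter. The function name `counter_to_rate`, the `(timestamp, value)` sample format and the handling of counter resets are assumptions made for illustration, not the plugin's actual implementation.

```python
def counter_to_rate(name, samples):
    """Return (rate_name, rate) for buffered (timestamp, value) samples.

    Hypothetical helper: returns None when no rate can be computed.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed for a derivative
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0 or v1 < v0:
        # Assumed behaviour: skip bad timestamps and counter resets.
        return None
    return name + "_rate", (v1 - v0) / (t1 - t0)


# A counter that grew by 60 over 30 seconds yields a rate of 2.0 per second.
print(counter_to_rate("ceph_osd_op_w_total", [(0.0, 100.0), (30.0, 160.0)]))
```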

whitelist

A whitelist of regexes used to determine which metrics are posted to the Monasca API. Many Prometheus endpoints generate vast quantities of data, so this can be a useful way to cut back on the number of metrics posted to the Monasca API to improve performance.

Example:

whitelist:
  - ceph_cluster_total_used_bytes
  - ceph_cluster_total_bytes
  - ceph_osd_op.*
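The filtering step might look something like the sketch below. It assumes each regex must match the whole metric name (`fullmatch`); the plugin may instead use prefix or substring matching, so treat the matching mode and the helper name `filter_metrics` as illustrative assumptions.

```python
import re


def filter_metrics(metric_names, whitelist):
    """Keep only names matched in full by at least one whitelist regex."""
    patterns = [re.compile(p) for p in whitelist]
    return [n for n in metric_names
            if any(p.fullmatch(n) for p in patterns)]


whitelist = [
    "ceph_cluster_total_used_bytes",
    "ceph_cluster_total_bytes",
    "ceph_osd_op.*",
]
# The third name is a made-up example of a metric that is filtered out.
names = ["ceph_cluster_total_bytes", "ceph_osd_op_w_total", "ceph_mon_quorum_status"]
print(filter_metrics(names, whitelist))
```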

label_whitelist

A whitelist of labels can be provided to reduce the number of unique time series created in Monasca. This is useful for exporters such as cAdvisor which attach many highly variable labels to each metric, some of which may not even be valid dimensions in Monasca.

Example:

label_whitelist:
  - name
  - state
  - hostname
  - interface
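Conceptually, label whitelisting drops every label that is not on the list before the remaining labels become Monasca dimensions. The sketch below shows that idea only; the helper name `filter_labels` and the sample label values are hypothetical.

```python
def filter_labels(labels, label_whitelist):
    """Return only the labels whose names appear in the whitelist."""
    allowed = set(label_whitelist)
    return {k: v for k, v in labels.items() if k in allowed}


label_whitelist = ["name", "state", "hostname", "interface"]
# Hypothetical labels of the kind a container exporter might attach.
labels = {
    "name": "nginx",
    "id": "/docker/abc123",
    "image": "nginx:latest",
    "interface": "eth0",
}
print(filter_labels(labels, label_whitelist))
```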

derived_metrics

A dict of metrics to derive from existing metrics. Supported operations are divide, sum and counter.

divide

The divide operation divides one metric series by another. It enforces that the dimensions of the metrics match, to reduce the chance of an unphysical result. For example, in a Ceph cluster with two OSDs, the following metrics may exist:

['ceph_osd_total_bytes', 'dimensions': {'osd': 1}, 'value': '1234',
 'ceph_osd_total_bytes', 'dimensions': {'osd': 2}, 'value': '4567']

['ceph_osd_total_used_bytes', 'dimensions': {'osd': 1}, 'value': '891',
 'ceph_osd_total_used_bytes', 'dimensions': {'osd': 2}, 'value': '111']

To calculate the fractional amount of space used on each OSD you must divide ceph_osd_total_used_bytes by ceph_osd_total_bytes for osd: 1 and again for osd: 2. The plugin does this by hashing the dimensions for each metric and using the hash to find the equivalent metric. If the two metric series do not have common sets of dimensions, the operation will currently fail.

Example:

derived_metrics:
  ceph_cluster_usage:
    x: ceph_cluster_total_used_bytes
    y: ceph_cluster_total_bytes
    op: divide
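The dimension-matching idea can be sketched as follows. This is a minimal illustration under assumptions: metrics are represented as dicts with `dimensions` and `value` keys, a `frozenset` of the dimension items stands in for the hash, and the helper names are hypothetical rather than taken from the plugin.

```python
def dimension_key(dimensions):
    """Build a hashable, order-independent key from a dimensions dict."""
    return frozenset(dimensions.items())


def divide_series(xs, ys):
    """Divide each metric in xs by the metric in ys with identical dimensions."""
    ys_by_dims = {dimension_key(m["dimensions"]): m for m in ys}
    results = []
    for x in xs:
        y = ys_by_dims.get(dimension_key(x["dimensions"]))
        if y is None:
            # Mirrors the documented behaviour: unmatched dimensions fail.
            raise ValueError("no metric with dimensions %r" % x["dimensions"])
        results.append({
            "dimensions": x["dimensions"],
            "value": float(x["value"]) / float(y["value"]),
        })
    return results


used = [{"dimensions": {"osd": 1}, "value": "891"},
        {"dimensions": {"osd": 2}, "value": "111"}]
total = [{"dimensions": {"osd": 1}, "value": "1234"},
         {"dimensions": {"osd": 2}, "value": "4567"}]
for result in divide_series(used, total):
    print(result["dimensions"], round(result["value"], 3))
```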

sum

The sum operation sums all metrics in a series as a function of a specified dimension. For example, by specifying the osd dimension the total space used on all OSDs could be computed from the following metrics:

['ceph_osd_total_used_bytes', 'dimensions': {'osd': 1}, 'value': '891',
 'ceph_osd_total_used_bytes', 'dimensions': {'osd': 2}, 'value': '111']

If additional dimensions are present, these must remain the same for all metrics in the calculation. For example, it is not currently possible to create a sum on this hypothetical metric series:

['ceph_osd_total_used_bytes', 'dimensions': {'osd': 1, 'cluster': 'A'}, 'value': '891',
 'ceph_osd_total_used_bytes', 'dimensions': {'osd': 1, 'cluster': 'B'}, 'value': '111']

Example:

derived_metrics:
  ceph_osd_in_sum:
    series: ceph_osd_in
    key: ceph_daemon
    op: sum
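A minimal sketch of the sum operation, including the restriction on additional dimensions, is shown below. The dict-based metric representation and the helper name `sum_series` are assumptions for illustration; the plugin's internals may differ.

```python
def sum_series(metrics, key):
    """Sum metric values across the `key` dimension.

    All dimensions other than `key` must be identical for every metric.
    """
    total = 0.0
    common = None
    for m in metrics:
        rest = {k: v for k, v in m["dimensions"].items() if k != key}
        if common is None:
            common = rest
        elif rest != common:
            raise ValueError("dimensions other than %r differ" % key)
        total += float(m["value"])
    return {"dimensions": common or {}, "value": total}


used = [{"dimensions": {"osd": 1}, "value": "891"},
        {"dimensions": {"osd": 2}, "value": "111"}]
print(sum_series(used, "osd"))  # value is 891 + 111 = 1002.0
```

Applied to the hypothetical series with differing cluster dimensions shown above, this sketch would raise an error rather than produce a sum.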

counter

In many cases you will want to use counters_to_rates to automatically create rates from counters. As such this setting is enabled by default. However, sometimes Prometheus metrics may not be marked as counters correctly, or you may wish to calculate the rate of change of a gauge, or even of an existing rate.

To minimise user configuration, any metric ending with _total which is not marked as a counter will be converted automatically to a rate when counters_to_rates is True. This is because, by Prometheus convention, any metric ending with _total should be a counter. In this case the metric name will be appended with _rate to create the name of the new series, and the original series will remain.
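The naming rule described above can be sketched as a small function. The name `rate_metric_name` and its arguments are hypothetical, and the sketch covers only the naming decision, not the rate calculation itself.

```python
def rate_metric_name(name, prom_type, counters_to_rates=True):
    """Return the name of the generated rate series, or None if no rate
    series would be generated for this metric."""
    if not counters_to_rates:
        return None
    # A metric qualifies if it is typed as a counter, or if its name ends
    # in _total (which by Prometheus convention implies a counter).
    if prom_type == "counter" or name.endswith("_total"):
        return name + "_rate"
    return None


print(rate_metric_name("ceph_osd_op_w_total", "counter"))
print(rate_metric_name("ceph_pool_wr_bytes_total", "gauge"))
print(rate_metric_name("ceph_pool_wr_bytes", "gauge"))
```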

For metrics which do not end in _total and/or are not marked as counters it may still be useful to convert the series to a rate. For example, the rate of change of remaining capacity would be a useful derivative of a gauge on a Ceph cluster. In this case you can use the counter operation to generate a rate from an arbitrary metric. The new metric assumes the name specified by the configuration key. In the example below, a series of metrics called ceph_pool_wr_bytes_total_rate would be created from the metric series ceph_pool_wr_bytes.

Example:

derived_metrics:
  ceph_pool_wr_bytes_total:
    series: ceph_pool_wr_bytes
    op: counter

Note that this requires counters_to_rates to be enabled, which is the default. If the same name is used for the existing series, the existing series will be converted to a rate in situ, overwriting the existing counter.

Full example configuration

init_config:
  timeout: 10
instances:
  - metric_endpoint: 'http://ceph-node:9283/metrics'
    remove_hostname: true
    default_dimensions:
      cluster_tag: production
    counters_to_rates: True
    whitelist:
      - ceph_cluster_total_used_bytes
      - ceph_cluster_total_bytes
      - ceph_osd_op.*
    derived_metrics: |
      ceph_cluster_usage:
        x: ceph_cluster_total_used_bytes
        y: ceph_cluster_total_bytes
        op: divide
      ceph_osd_in_sum:
        series: ceph_osd_in
        key: ceph_daemon
        op: sum
      ceph_pool_wr_bytes_total:
        series: ceph_pool_wr_bytes
        op: counter
      ceph_pool_rd_bytes_total:
        series: ceph_pool_rd_bytes
        op: counter

Note that more than one endpoint can be monitored by adding further entries to the instances list.
