A collection of Monasca-Agent plugins to gather metrics. This repo functions as an incubator, with the ultimate aim of merging any effective plugins into the Monasca-Agent.
Includes:
- Slurm (proof-of-concept)
- NVIDIA GPUs
- Prometheus (proof-of-concept)
This is an experimental plugin which extends the existing Prometheus plugin with additional capabilities. The following configuration options are supported:

`metric_endpoint`: The Prometheus endpoint to scrape.

Example:

```yaml
metric_endpoint: "http://ceph-host:9283/metrics"
```
`remove_hostname`: Strip the hostname from each metric. This is useful when scraping an endpoint which exposes metrics that are not specific to a host, for example RabbitMQ queue lengths or Ceph cluster health.

Example:

```yaml
remove_hostname: true
```
`default_dimensions`: A dict of dimensions to include with all metrics scraped from the specified endpoint.

Example:

```yaml
default_dimensions:
  cluster_tag: production
```
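For illustration, a minimal sketch (not the plugin's actual code; the function and metric representation here are hypothetical) of how `remove_hostname` and `default_dimensions` might be applied to each scraped metric:

```python
# Hypothetical illustration of applying remove_hostname and
# default_dimensions to a single scraped metric.
def apply_instance_options(metric, instance_config):
    dimensions = dict(metric['dimensions'])
    # Drop the hostname dimension for metrics which are not host-specific.
    if instance_config.get('remove_hostname'):
        dimensions.pop('hostname', None)
    # Default dimensions apply to every metric from this endpoint, but are
    # assumed here not to override dimensions scraped from the endpoint.
    for key, value in instance_config.get('default_dimensions', {}).items():
        dimensions.setdefault(key, value)
    return {**metric, 'dimensions': dimensions}

metric = {'name': 'ceph_cluster_total_bytes',
          'dimensions': {'hostname': 'ceph-host'}, 'value': 1234}
config = {'remove_hostname': True,
          'default_dimensions': {'cluster_tag': 'production'}}
print(apply_instance_options(metric, config))
# {'name': 'ceph_cluster_total_bytes',
#  'dimensions': {'cluster_tag': 'production'}, 'value': 1234}
```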
`counters_to_rates`: Automatically convert counters to rates. This works by buffering counters locally and then computing the derivative with respect to time when the buffer is flushed to the Monasca API. When enabled, this setting uses the Prometheus metric type to automatically generate new rate metrics from counters. The original counter metrics are still posted to the API, provided that they are included in the whitelist. The rate metrics are named after the counters by appending `_rate` to the end of the metric name. Note that the Prometheus convention is to append `_total` to all counters, so a counter tracking `ceph_osd_op_w` will typically be exported as `ceph_osd_op_w_total` and become `ceph_osd_op_w_total_rate` when converted to a rate.

Example:

```yaml
counters_to_rates: True
```

Defaults to `True`.
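As a minimal sketch of the buffering approach described above (illustrative only; the class and method names are assumptions, not the plugin's actual code):

```python
import time

# Sketch of the counters-to-rates idea: remember the previous sample of
# each counter series and emit the time derivative on the next flush.
class RateBuffer:
    def __init__(self):
        # (name, frozenset of dimension items) -> (timestamp, value)
        self._last = {}

    def to_rate(self, name, dimensions, value, timestamp=None):
        timestamp = timestamp or time.time()
        key = (name, frozenset(dimensions.items()))
        previous = self._last.get(key)
        self._last[key] = (timestamp, value)
        if previous is None:
            return None  # two samples are needed before a rate exists
        prev_ts, prev_value = previous
        if timestamp <= prev_ts:
            return None
        # Prometheus counters conventionally end in _total, so a derived
        # series becomes e.g. ceph_osd_op_w_total -> ceph_osd_op_w_total_rate.
        return name + '_rate', (value - prev_value) / (timestamp - prev_ts)
```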
`whitelist`: A whitelist of regexes used to determine which metrics are posted to the Monasca API. Many Prometheus endpoints generate vast quantities of data, so this can be a useful way to cut back on the number of metrics posted to the Monasca API and improve performance.

Example:

```yaml
whitelist:
  - ceph_cluster_total_used_bytes
  - ceph_cluster_total_bytes
  - ceph_osd_op.*
```
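For illustration, a sketch of how such a regex whitelist might be applied; whether the plugin anchors patterns or uses `match` versus `fullmatch` is an implementation detail, and `fullmatch` is assumed here:

```python
import re

# Keep a metric if any whitelist regex matches its full name (assumption).
whitelist = [re.compile(p) for p in
             ['ceph_cluster_total_used_bytes',
              'ceph_cluster_total_bytes',
              'ceph_osd_op.*']]

def is_whitelisted(name):
    return any(p.fullmatch(name) for p in whitelist)

print(is_whitelisted('ceph_osd_op_w'))       # True, matches ceph_osd_op.*
print(is_whitelisted('ceph_pool_rd_bytes'))  # False, no pattern matches
```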
`label_whitelist`: A whitelist of labels can be provided to reduce the number of unique time series created in Monasca. This is useful for exporters such as cAdvisor, which produce many highly variable labels attached to each metric, some of which may not even be valid dimensions in Monasca.

Example:

```yaml
label_whitelist:
  - name
  - state
  - hostname
  - interface
```
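Similarly, a hypothetical sketch of filtering a metric's labels down to the whitelisted set before they become Monasca dimensions:

```python
# Discard high-cardinality or Monasca-invalid labels (assumed behaviour).
label_whitelist = {'name', 'state', 'hostname', 'interface'}

def filter_labels(labels):
    return {k: v for k, v in labels.items() if k in label_whitelist}

print(filter_labels({'name': 'cadvisor',
                     'id': '/docker/3f2a',    # hypothetical cAdvisor label
                     'state': 'running'}))
# {'name': 'cadvisor', 'state': 'running'}
```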
`derived_metrics`: A dict of metrics to derive from existing metrics. Supported operations are `divide`, `sum` and `counter`.

The `divide` operation divides two metric series by each other. It enforces that the dimensions of the metrics match, to reduce the chance of an unphysical result. For example, in a Ceph cluster with two OSDs, the following metrics may exist:

```
[{'name': 'ceph_osd_total_bytes', 'dimensions': {'osd': 1}, 'value': '1234'},
 {'name': 'ceph_osd_total_bytes', 'dimensions': {'osd': 2}, 'value': '4567'}]
[{'name': 'ceph_osd_total_used_bytes', 'dimensions': {'osd': 1}, 'value': '891'},
 {'name': 'ceph_osd_total_used_bytes', 'dimensions': {'osd': 2}, 'value': '111'}]
```
To calculate the fractional amount of space used on each OSD you must divide `ceph_osd_total_used_bytes` by `ceph_osd_total_bytes` for `osd: 1`, and again for `osd: 2`. The plugin does this by hashing the dimensions for each metric and using the hash to find the equivalent metric. If the two metric series do not have common sets of dimensions the operation will currently fail.

Example:
```yaml
derived_metrics:
  ceph_cluster_usage:
    x: ceph_cluster_total_used_bytes
    y: ceph_cluster_total_bytes
    op: divide
```
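A minimal sketch of the dimension-hashing idea described above, assuming metrics are represented as dicts; this is illustrative, not the plugin's actual code:

```python
# Pair numerator and denominator samples by a hashable key derived from
# their dimensions, so that only matching series are divided.
def divide_series(x_series, y_series):
    def by_dims(series):
        return {frozenset(m['dimensions'].items()): m['value'] for m in series}

    x_by_dims, y_by_dims = by_dims(x_series), by_dims(y_series)
    if x_by_dims.keys() != y_by_dims.keys():
        # Mirrors the documented behaviour: no common dimension sets -> fail.
        raise ValueError('dimension sets do not match; refusing to divide')
    return [{'dimensions': dict(dims),
             'value': x_by_dims[dims] / y_by_dims[dims]}
            for dims in x_by_dims]

used = [{'dimensions': {'osd': 1}, 'value': 891.0},
        {'dimensions': {'osd': 2}, 'value': 111.0}]
total = [{'dimensions': {'osd': 1}, 'value': 1234.0},
         {'dimensions': {'osd': 2}, 'value': 4567.0}]
print(divide_series(used, total))
# osd 1: 891/1234 ~= 0.722, osd 2: 111/4567 ~= 0.024
```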
The `sum` operation sums all metrics in a series as a function of a specified dimension. For example, by specifying the `osd` dimension, the total space used on all OSDs could be computed from the following metrics:

```
[{'name': 'ceph_osd_total_used_bytes', 'dimensions': {'osd': 1}, 'value': '891'},
 {'name': 'ceph_osd_total_used_bytes', 'dimensions': {'osd': 2}, 'value': '111'}]
```
If additional dimensions are present, these must remain the same for all metrics in the calculation. For example, it is not currently possible to create a `sum` over this hypothetical metric series:

```
[{'name': 'ceph_osd_total_used_bytes', 'dimensions': {'osd': 1, 'cluster': 'A'}, 'value': '891'},
 {'name': 'ceph_osd_total_used_bytes', 'dimensions': {'osd': 1, 'cluster': 'B'}, 'value': '111'}]
```
Example:

```yaml
derived_metrics:
  ceph_osd_in_sum:
    series: ceph_osd_in
    key: ceph_daemon
    op: sum
```
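A sketch of how such a sum over the key dimension might work, assuming the same dict representation as above (illustrative only):

```python
from collections import defaultdict

# Collapse a series over the key dimension ('osd' here), summing values.
# Note: the real plugin requires any additional dimensions to be identical
# across the series; this sketch simply groups by them.
def sum_series(series, key):
    groups = defaultdict(float)
    for m in series:
        rest = frozenset((k, v) for k, v in m['dimensions'].items() if k != key)
        groups[rest] += m['value']
    return [{'dimensions': dict(rest), 'value': total}
            for rest, total in groups.items()]

series = [{'dimensions': {'osd': 1}, 'value': 891.0},
          {'dimensions': {'osd': 2}, 'value': 111.0}]
print(sum_series(series, key='osd'))
# [{'dimensions': {}, 'value': 1002.0}]
```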
In many cases you will want to use `counters_to_rates` to automatically create rates from counters. As such, this setting is enabled by default. However, sometimes Prometheus metrics may not be marked as counters correctly, or you may wish to calculate the rate of change of a gauge, or even of an existing rate.

To minimise user configuration, any metric ending with `_total` which is not marked as a counter will be converted automatically to a rate when `counters_to_rates` is `True`. This is because, by Prometheus convention, any metric ending with `_total` should be a counter. In this case the metric name will be appended with `_rate` to create the name of the new series, and the original series will remain.
For metrics which do not end in `_total` and/or are not marked as counters, it may still be useful to convert the series to a rate. For example, the rate of change of remaining capacity would be a useful derivative of a gauge on a Ceph cluster. In this case you can use the `counter` operation to generate a rate from an arbitrary metric. The new metric assumes the name specified by the configuration key. For example, in this case, a series of metrics called `ceph_pool_wr_bytes_total_rate` would be created from the metric series `ceph_pool_wr_bytes`.

Example:

```yaml
derived_metrics:
  ceph_pool_wr_bytes_total:
    series: ceph_pool_wr_bytes
    op: counter
```
Note that this requires `counters_to_rates` to be enabled, which is the default. If the same name is used for the existing series, the existing series will be converted to a rate in situ, overwriting the existing counter.
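To summarise the naming rules described above, a hypothetical sketch (assuming `counters_to_rates` is enabled; `rate_metric_name` is an illustrative helper, not part of the plugin):

```python
# Derive the name of the rate series for a metric, per the rules above.
# derived_name is the derived_metrics config key when the `counter` op is
# used for this series, else None.
def rate_metric_name(name, is_counter, derived_name=None):
    if derived_name is not None:
        # The `counter` op renames the series to the config key; the rate
        # then gets the usual _rate suffix. Reusing the original name would
        # convert the series to a rate in place.
        return derived_name + '_rate'
    if is_counter or name.endswith('_total'):
        return name + '_rate'  # automatic conversion
    return None                # not converted

print(rate_metric_name('ceph_osd_op_w_total', is_counter=True))
# ceph_osd_op_w_total_rate
print(rate_metric_name('ceph_pool_wr_bytes', False, 'ceph_pool_wr_bytes_total'))
# ceph_pool_wr_bytes_total_rate
```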
A full example configuration:

```yaml
init_config:
  timeout: 10
instances:
  - metric_endpoint: 'http://ceph-node:9283/metrics'
    remove_hostname: true
    default_dimensions:
      cluster_tag: production
    counters_to_rates: True
    whitelist:
      - ceph_cluster_total_used_bytes
      - ceph_cluster_total_bytes
      - ceph_osd_op.*
    derived_metrics: |
      ceph_cluster_usage:
        x: ceph_cluster_total_used_bytes
        y: ceph_cluster_total_bytes
        op: divide
      ceph_osd_in_sum:
        series: ceph_osd_in
        key: ceph_daemon
        op: sum
      ceph_pool_wr_bytes_total:
        series: ceph_pool_wr_bytes
        op: counter
      ceph_pool_rd_bytes_total:
        series: ceph_pool_rd_bytes
        op: counter
```
Note that more than one endpoint can be monitored by adding additional entries to the `instances` list.