Use the Mesos plugin for collectd to monitor the following information about Mesos:
- Cluster status: number of activated slaves, schedulers and tasks
- CPU, disk and memory usage for Mesos
- Tasks finished, lost, and failed
Mesos Clusters: Overview of data from all Mesos clusters.
Mesos Cluster: Focus on a single Mesos cluster.
Mesos Master: Focus further on a single Mesos master.
Mesos Slave: Focus further on a single Mesos slave.
REQUIREMENTS AND DEPENDENCIES
This plugin requires:
- collectd 4.9+
- Python plugin for collectd (included with SignalFx collectd agent)
- Python 2.3+ (2.7.5+ for DC/OS strict mode)
- Mesos 0.19.0 or greater
Download the three Python modules for Mesos from the following URL: https://github.com/signalfx/collectd-mesos. Place them in a convenient spot (e.g. in
Modify the configuration file to contain values that make sense for your environment, as described below.
OPTIONAL: Follow this step only if the Mesos cluster being monitored runs under a DC/OS cluster operating in strict mode.
- Make a new user on DC/OS.
- Give the new user the following permission strings:
- Configure the plugin with the required options. See below.
The /system/health/v1 endpoint on port 1050 is not available when DC/OS operates in strict mode.
Using the example configuration files 10-mesos-master.conf or 10-mesos-slave.conf as a guide, provide values for the configuration options listed below that make sense for your environment and allow you to connect to the Mesos instance to be monitored.
|configuration option||definition||default value|
|ModulePath||Path on disk where collectd can find the Mesos python modules.||"/usr/share/collectd/mesos-collectd-plugin"|
|Cluster||The name of the cluster to which the Mesos instance belongs. Appears in the dimension
|Instance||The name of this Mesos master/slave instance. Appears in the dimension||"master-0" / "slave-0"|
|Path||The location of the mesos-master/mesos-slave binary.||"/usr/sbin"|
|scheme||Scheme the plugin needs to use to fetch metrics. It is either "http" or "https".||"http"|
|Host||The hostname or IP address of the Mesos instance to be monitored.||"%%%MASTER_IP%%%"|
|Port||The port on which the Mesos instance is listening for connections.||%%%MASTER_PORT%%%|
|Verbose||Enable verbose logging from this plugin to collectd's log file||false|
|IncludeSystemHealth||Enable the sending of DC/OS System Service Health Metrics (this option is only applicable for a DC/OS master)||false|
|ca_file_path||Path to CA file required for server verification. If not provided, verification is skipped (this option is only applicable if ssl is enabled)||"path/to/file"|
|dcos_sfx_username||New DC/OS username created for the plugin (this option is only applicable for DC/OS in strict mode)||sfx-collectd|
|dcos_sfx_password||Password of the above username (this option is only applicable for DC/OS in strict mode)||signalfx|
|dcos_url||The DC/OS authentication endpoint (this is an optional config and is only applicable for DC/OS in strict mode)||"https://leader.mesos/acs/api/v1/auth/login"|
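The way scheme, Host, Port and ca_file_path fit together can be sketched in Python 3. This is only an illustration, not the plugin's own code: the function names are hypothetical, and the /metrics/snapshot path is assumed to be the Mesos endpoint the values are fetched from. As the table states, server verification is skipped when no CA file is given.

```python
import ssl


def metrics_url(scheme, host, port):
    """Build the URL to poll. The /metrics/snapshot path is an assumption."""
    if scheme not in ("http", "https"):
        raise ValueError('scheme must be "http" or "https"')
    return "%s://%s:%d/metrics/snapshot" % (scheme, host, port)


def ssl_context(scheme, ca_file_path=None):
    """Return an SSL context for https, or None for plain http."""
    if scheme != "https":
        return None
    if ca_file_path:
        # Verify the server certificate against the given CA file.
        return ssl.create_default_context(cafile=ca_file_path)
    # No CA file: skip verification, mirroring the table's description.
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before CERT_NONE
    context.verify_mode = ssl.CERT_NONE
    return context


url = metrics_url("https", "10.0.142.190", 5050)
context = ssl_context("https")
```

Both values could then be passed to `urllib.request.urlopen(url, context=context)` to fetch the metrics as JSON.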
Note: (Applicable if operating DC/OS in strict mode) The default dcos_url makes use of the leader.mesos hostname provided by DC/OS. If that hostname does not exist in your environment, set dcos_url explicitly. See the example below.
Below is an example configuration:
<LoadPlugin "python">
  Globals true
</LoadPlugin>

<Plugin "python">
  ModulePath "/opt/collectd-mesos"
  Import "mesos-master"
  <Module "mesos-master">
    Cluster "cluster-0"
    Instance "master-0"
    Path "/usr/sbin"
    scheme "https"
    Host "10.0.142.190"
    Port 5050
    Verbose false
    IncludeSystemHealth false
    dcos_sfx_username "test-collectd"
    dcos_sfx_password "1234"
    # Note that https://sfx-dco-elasticl-qyuyl8k0dc99-1879689557.us-west-2.elb.amazonaws.com is
    # base URL of the DC/OS UI and /acs/api/v1/auth/login is the authentication endpoint the plugin
    # uses to obtain token for subsequent requests. By default dcos_url takes -
    # https://leader.mesos/acs/api/v1/auth/login
    dcos_url "https://sfx-dco-elasticl-qyuyl8k0dc99-1879689557.us-west-2.elb.amazonaws.com/acs/api/v1/auth/login"
  </Module>
</Plugin>
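The token exchange mentioned in the configuration comment can be sketched roughly as follows. This is a minimal Python 3 illustration, not the plugin's implementation. It assumes that the DC/OS login endpoint accepts a JSON body with `uid` and `password` fields and that later requests carry the token in an `Authorization: token=...` header. The helper names are hypothetical.

```python
import json
import urllib.request


def build_login_request(dcos_url, username, password):
    """Prepare (but do not send) the login request for the DC/OS endpoint.

    The "uid"/"password" field names are an assumption about the DC/OS API.
    """
    body = json.dumps({"uid": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        dcos_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def auth_header(token):
    """Header assumed to authenticate subsequent metric requests."""
    return {"Authorization": "token=%s" % token}


request = build_login_request(
    "https://leader.mesos/acs/api/v1/auth/login", "sfx-collectd", "signalfx"
)
```

Sending `request` with `urllib.request.urlopen` and reading the token from the JSON response would complete the exchange. The sketch stops short of that so it can run without a live cluster.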
Below are screen captures of dashboards created for this plugin by SignalFx, illustrating the metrics emitted by this plugin.
Monitoring Mesos clusters
It’s important to keep track of the status of tasks in the cluster. An increase in failed tasks for a master or slave can indicate a problem with a framework.
It can be important to analyze performance per Mesos host. An increase in failed tasks for many masters and slaves on a single host may indicate a hardware problem.
Track week-over-week growth of tasks in your cluster to be informed of changing workloads.
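Week-over-week growth is simple arithmetic. The short Python sketch below shows one way to compute it from two weekly task counts. The function name and the sample numbers are illustrative only and do not come from the plugin.

```python
def week_over_week_growth(previous_week, current_week):
    """Return the percentage change in task count, or None if undefined."""
    if previous_week == 0:
        # Growth relative to zero tasks is not meaningful.
        return None
    return (current_week - previous_week) / previous_week * 100.0


# Hypothetical counts: 200 tasks last week, 250 this week.
growth = week_over_week_growth(200, 250)
```

A sustained positive value across several weeks suggests the cluster's workload is growing and capacity may need review.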
Monitoring Mesos masters and slaves
An unexpectedly low number of connected slaves on a Mesos master can indicate a network problem preventing them from connecting. To verify this, check to see if there’s an unexpectedly high number of dropped messages in counter.master_dropped_messages.
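The check described above can be expressed as a small Python sketch. It assumes a metrics dictionary keyed by the Mesos names `master/slaves_connected` and `master/dropped_messages`. The expected slave count and the dropped-message threshold are hypothetical inputs you would choose for your own cluster.

```python
def diagnose_master(snapshot, expected_slaves, dropped_threshold=0):
    """Classify a master's slave connectivity from a metrics dictionary."""
    connected = snapshot.get("master/slaves_connected", 0)
    dropped = snapshot.get("master/dropped_messages", 0)
    if connected >= expected_slaves:
        return "ok"
    if dropped > dropped_threshold:
        # Few slaves plus many dropped messages points at the network.
        return "possible network problem"
    return "slaves missing, cause unclear"


# Hypothetical sample: 2 of 5 slaves connected, 40 dropped messages.
status = diagnose_master(
    {"master/slaves_connected": 2, "master/dropped_messages": 40}, 5
)
```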
On the Mesos master dashboard, you can view in detail the number of tasks that are finished, failed, lost or errored out. Monitoring connected and active frameworks can help you determine the health of your Mesos scheduler.
For additional information on how to monitor Mesos, check out Apache's guide here.
For documentation of the metrics and dimensions emitted by this plugin, click here.
This integration is released under the Apache 2.0 license. See LICENSE for more details.