Tutorial

Dieter Plaetinck edited this page Oct 27, 2013 · 8 revisions

Setup

Install according to the README, paying attention to the configuration sections.

Populate the database.

Just running the update_metrics.py script should be enough. It uses plugins that populate Elasticsearch with structured metrics, based on the "flat" (proto1) metric IDs it gets from Graphite. See "populating the database" for more details.

  • Metrics that the plugins understand yield structured metrics with clear tags, e.g.:
    servers.web9.cpu.cpu2.iowait becomes
    {server:web9, core: cpu2, unit:cpu_state, type:iowait, target_type:gauge_pct, plugin:cpu}
  • Unrecognized metrics (i.e. those for which you don't have a plugin) are matched by the catchall plugins, e.g.:
    stats.web9.request.web becomes
    {n1:web9, n2:request, n3:web, unit:unknown/s, source:statsd, plugin:catchall_statsd}
    As you can see, this is a rougher "copy" of the information, but for a lot of purposes it actually works quite well.
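To make the idea concrete, here is a minimal sketch of how a flat metric ID could be turned into tags. It is not Graph Explorer's actual plugin code: the function name `to_structured` and both regular expressions are assumptions made up for illustration, modelled only on the two examples above.

```python
import re

# Hypothetical patterns, modelled on the two examples above.
# These are NOT the real Graph Explorer plugin definitions.
CPU_RE = re.compile(
    r"^servers\.(?P<server>[^.]+)\.cpu\.(?P<core>cpu\d+)\.(?P<type>[^.]+)$"
)
STATSD_RE = re.compile(r"^stats\.(?P<rest>.+)$")


def to_structured(metric_id):
    """Turn a flat Graphite metric ID into a dict of tags (illustrative only)."""
    m = CPU_RE.match(metric_id)
    if m:
        # A plugin that understands the metric can name every field.
        tags = m.groupdict()
        tags.update(unit="cpu_state", target_type="gauge_pct", plugin="cpu")
        return tags
    m = STATSD_RE.match(metric_id)
    if m:
        # Catchall: the meaning of each node is unknown, so just number them.
        nodes = m.group("rest").split(".")
        tags = {"n%d" % (i + 1): node for i, node in enumerate(nodes)}
        tags.update(unit="unknown/s", source="statsd", plugin="catchall_statsd")
        return tags
    return None  # no pattern matched


print(to_structured("servers.web9.cpu.cpu2.iowait"))
print(to_structured("stats.web9.request.web"))
```

The first pattern names every field, while the catchall can only number the nodes, which is why its output is rougher.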

Your first queries.

The first thing you type into the query field is simply words/patterns that you know will match (parts of) the metrics you're looking for, so instead of following my exact example, use patterns that match your own collection of metrics. The best example metrics contain multiple dimensions/fields ("nodes" in Graphite speak), because you can leverage those in some interesting ways.

In my case, I want to gain insight into network consumption by a group of servers. I know I have metrics that match the words "network" and "dfvimeo" (which is part of the hostname), so let's start with that. If you expect to get a lot of metrics back, I recommend typing "list" first (which lists metrics instead of doing CPU-intensive graphing). This is what I get back:

screenshot

For details on what everything in the interface means, see the Query Interface page.

  • Let's say I'm only interested in the machines matching "dfvimeodfs". I can exclude the others with "!dfvimeotran", or just match more precisely on the term "dfvimeodfs" instead of "dfvimeo".
  • Similarly, it lists interfaces (em's) I don't care about. I only want to see bond interfaces, so I add the pattern "bond".
  • Looking at the end of the metric names makes it apparent that a lot of them represent stats about frames, errors, etc. For now, let's assume I only want to know about network bandwidth, so the tx_bit and rx_bit metrics seem the most interesting. I can select them in two ways: either add "(rx|tx)_bit" (which does pattern matching on the actual metric name) or leverage the tags. When I click on one of the metrics, I get the inspect view, shown below.

screenshot

This shows me the tags for the metric. The target_type, type and unit tags seem convenient. I can actually get what I want by using "unit=b/s", which narrows the metrics down to only those whose unit is bits per second. (Graph Explorer aims to use standardized naming for units.)

By now I've surely narrowed it down enough to remove the "list" part, so my query becomes "network dfvimeodfs bond unit=b/s" (note that the order of the patterns does not matter).
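The narrowing steps above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Graph Explorer's real query parser: the `matches` helper, the metric IDs and the "err/s" unit are hypothetical, and plain patterns are assumed to be matched against the metric ID only.

```python
import re


def matches(metric, query):
    """Return True if the metric satisfies every term of the query.

    Simplified rules, assumed for illustration:
      tag=value  -> the metric must carry that exact tag value
      !pattern   -> the metric ID must NOT match the pattern
      pattern    -> the metric ID must match the pattern (regex search)
    """
    for term in query.split():
        if "=" in term:
            tag, value = term.split("=", 1)
            if metric["tags"].get(tag) != value:
                return False
        elif term.startswith("!"):
            if re.search(term[1:], metric["id"]):
                return False
        elif not re.search(term, metric["id"]):
            return False
    return True


# Hypothetical metrics; the IDs and tag values are invented for this sketch.
metrics = [
    {"id": "servers.dfvimeodfs1.network.bond0.rx_bit", "tags": {"unit": "b/s"}},
    {"id": "servers.dfvimeodfs1.network.bond0.rx_errors", "tags": {"unit": "err/s"}},
    {"id": "servers.dfvimeodfs1.network.em1.tx_bit", "tags": {"unit": "b/s"}},
    {"id": "servers.dfvimeotran1.network.bond0.tx_bit", "tags": {"unit": "b/s"}},
]

query = "network dfvimeodfs bond unit=b/s"
selected = [m["id"] for m in metrics if matches(m, query)]
print(selected)
```

Because every term has to hold independently, shuffling the terms of the query selects the same metrics.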

screenshot

Below the query you'll find a query explanation section, some information about the result set, and the actual graphs. Again, for details, please see the Query Interface page.

The most important things to note here:

  • Group by "target_type=", "unit=", "server": this means that after collecting all metrics, it puts metrics together in one graph if they have the same value for the "target_type", "unit" and "server" tags. This makes sure the information on the same graph is "compatible". The "=" means that metrics must have this tag, i.e. metrics without a "target_type" or "unit" tag won't even be retrieved. (They would be pretty meaningless anyway, which is why these tags are mandatory. Note that the catchall plugins can't know what the unit of a metric is, so they use "Unknown".)
  • 40 targets were matched, and they are grouped into 10 graphs. (i.e. 10 unique combinations of "target_type", "unit" and "server").
  • There are two ways to alter this: 'GROUP BY' and 'group by'. The latter is the most common: by typing "group by <tag>[,<tag>,...]" you group by target_type, unit and the specified tags instead of server. For example, "group by type" gives you a graph for each combination of target_type, unit and type. Since target_type and unit are constant across all targets in this case, you get one graph for type rx and one for type tx, and targets with the same type but different values for server appear on the same graph.
  • Remember you can always narrow down further by adding patterns based on any of the tags you see; the constants and variables sections are a good source of inspiration.
  • In case you hadn't noticed, you can zoom and get detail popups by hovering over the graph.
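The grouping behaviour described in the list above can be illustrated with a small sketch. It only models the idea of "one graph per unique combination of tag values"; the `group_targets` helper, the target IDs and the "rate" value for target_type are assumptions made for this example, not Graph Explorer internals.

```python
from collections import defaultdict


def group_targets(targets, group_by=("target_type", "unit", "server")):
    """Put targets with identical values for the group_by tags on one graph."""
    graphs = defaultdict(list)
    for target in targets:
        key = tuple(target["tags"].get(tag) for tag in group_by)
        graphs[key].append(target["id"])
    return dict(graphs)


# Hypothetical targets: three servers, each with an rx and a tx metric.
targets = [
    {"id": "%s.%s_bit" % (server, direction),
     "tags": {"target_type": "rate", "unit": "b/s",
              "server": server, "type": direction}}
    for server in ("dfs1", "dfs2", "dfs3")
    for direction in ("rx", "tx")
]

# Default grouping: one graph per server, each holding rx and tx.
by_server = group_targets(targets)

# "group by type": server is replaced by type, so there is one graph
# for rx and one for tx, each holding all three servers.
by_type = group_targets(targets, ("target_type", "unit", "type"))

for key, ids in by_type.items():
    print(key, ids)
```

With the default grouping the six targets land on three graphs of two targets each; grouping by type yields two graphs of three targets each.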

From here on, it's only a matter of playing with the various GEQL features ("group by", "avg by", "avg over", "sum by", etc.) to unlock the power of Graph Explorer. Before you know it, you're composing graphs like the one below:

screenshot

A very useful learning exercise is studying this query and figuring out exactly what's going on. The most interesting new pieces here are the regex pattern, unit conversion (unit "b/s" to "b/d"), and aggregation across different values of the "device" tag. If you have any questions about this, please let us know.
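As a hint for the exercise, the arithmetic behind two of those pieces is simple. The sketch below is not how Graph Explorer implements them; the helper names and the sample values are invented. It only shows what converting "b/s" to "b/d" and summing across values of a tag amount to numerically.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def per_second_to_per_day(series):
    """Convert a rate from 'per second' to 'per day' (e.g. b/s to b/d)."""
    return [value * SECONDS_PER_DAY for value in series]


def sum_series(series_list):
    """Sum several equally long series point by point."""
    return [sum(points) for points in zip(*series_list)]


# Hypothetical b/s samples for two values of the "device" tag.
by_device = {
    "bond0": [1000, 2000],
    "bond1": [500, 1500],
}

total_bps = sum_series(list(by_device.values()))
total_bpd = per_second_to_per_day(total_bps)
print(total_bps)  # [1500, 3500]
print(total_bpd)  # [129600000, 302400000]
```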