
Probes and Instrumentation

Alan Malloy edited this page May 15, 2013 · 12 revisions

Computation requires a thread of execution. In synchronous code, this thread is explicit. In asynchronous code this thread is implicit, but no less real. Event-driven programming is the process of applying structure to events until they become ordered.

But what about the inverse operation, where structured execution of code is transformed into a collection of loosely-ordered events? We do this all the time: it’s called logging.

Logging is a lossy operation; log statements lose any context that isn’t explicitly captured. While we can sometimes recreate the context by considering all the log statements collectively, this is akin to archaeology; we’re trying to create a detailed picture of what happened using sparse, unconnected details and educated guesswork.

At the same time, logging is a necessary tool. It lets us understand our code without having to fundamentally change its structure, because the stream of events it generates exists outside our thread of execution. Lamina gives us powerful tools for transforming, aggregating, and analyzing streams of events. Therefore, if we can transform these log streams into something Lamina can consume, we have a powerful tool for understanding the inner workings of our software.

The lamina.trace namespace provides tools for instrumenting and understanding the flow of code.

probe-channels

A probe-channel is a permanent, grounded channel. This means that it cannot be closed or put into an error state, and that messages will never accumulate in its queue. A probe-channel is also named, which means that when we interact with probe-channel abc:def, that will always be the same channel.
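
A short sketch of both properties (the probe name :some:probe is arbitrary):

```clojure
(use 'lamina.core 'lamina.trace)

;; grounded: with no consumer attached, this message is simply dropped,
;; rather than accumulating in the channel's queue
(enqueue (probe-channel :some:probe) :dropped)

;; named: asking for the same name later yields the same channel,
;; so this consumer sees all subsequent messages
(receive-all (probe-channel :some:probe) #(println "got" %))
(enqueue (probe-channel :some:probe) :seen) ;; prints "got :seen"
```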

Probe-channels represent a stream of events, typically from an instrumented piece of code. The easiest way to enqueue a message into a probe-channel is by using trace:

(trace :foo:bar
  (million-digits-of-pi)) ;; hypothetical helper standing in for an expensive computation

This will send a million digits of pi into the probe-channel returned by (probe-channel :foo:bar). By convention, probe namespaces are colon-delimited. If the argument given to probe-channel is a sequence, its elements will be joined with colon separators. All of these function calls are equivalent:

(probe-channel :foo:bar)
(probe-channel "foo:bar")
(probe-channel [:foo :bar])
(probe-channel ["foo" :bar])

However, the body of the trace statement will only be executed if something is consuming messages from the probe-channel. Unless we do something with the output of (probe-channel :foo:bar), the above trace statement is effectively a no-op.
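
We can verify this laziness with a side effect in the trace body (a sketch; it assumes nothing is consuming :foo:baz):

```clojure
(def evaluated? (atom false))

;; nothing consumes :foo:baz, so the body is never evaluated
(trace :foo:baz
  (do (reset! evaluated? true)
      :expensive-result))

@evaluated? ;; => false
```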

We can also consume collections of probe-channels using select-probes. This takes either a wildcard string or a regex, and returns a channel representing the combined output of all probe-channels whose names match that pattern.

> (use 'lamina.core 'lamina.trace)
> (receive-all (select-probes "foo*") #(println "traced:" %))
> (trace :foo:bar 1)
traced: 1
> (trace :foo 2)
traced: 2
> (trace :bar 3)

Notice that trace returns true if the message was consumed, and false otherwise.
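
Continuing the session above, where all "foo*" probes have a consumer, we can check this directly:

```clojure
;; consumed by the "foo*" subscriber, so the body is evaluated
(trace :foo:bar :hello) ;; => true (and prints "traced: :hello")

;; no consumer matches :bar
(trace :bar :hello)     ;; => false
```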

instrumenting functions

While littering the code with trace statements won’t affect its semantics, it is a bit of an eyesore. Since most of the time meaningful events correlate with function calls, there’s an easy way to automatically instrument functions.

Calling lamina.trace/instrument on a function will create probes that emit an event each time the function is called, returns, or throws an exception. These are the enter, return, and error probes, respectively.

> (use 'clojure.pprint)
> (defn f [x y] (+ x y))
> (def f* (instrument f {:name :instrumented :capture :in-out}))
> (receive-all (probe-channel :instrumented:enter) #(do (println "ENTERED") (pprint %)))
> (receive-all (probe-channel :instrumented:return) #(do (println "RETURNED") (pprint %)))
> (f* 1 2)
ENTERED
{:name "instrumented",
 :timestamp 1367021930248,
 :args [1 2],
 :context {:host "xanadu", :pid 34403}}
RETURNED
{:name "instrumented",
 :context {:host "xanadu", :pid 34403},
 :timestamp 1367021930256,
 :offset 0,
 :compute-duration 20000,
 :duration 20000,
 :sub-tasks nil,
 :counts {},
 :args [1 2],
 :result 3}

Notice that we gave the instrumented function the name :instrumented, so the return probe can be accessed at :instrumented:return.
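
The error probe works the same way. Here is a sketch; this page doesn't show the error event's payload, so the :error key below is an assumption:

```clojure
(receive-all (probe-channel :instrumented:error)
  #(do (println "ERRORED") (pprint %)))

;; passing a non-number makes f* throw, which fires the error probe
;; (the event presumably carries the exception under a key such as :error)
(try
  (f* 1 "two")
  (catch Exception _))
```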

The fields emitted by the enter probe are:

:name the function’s name
:timestamp the timestamp, in milliseconds, of the event
:args the arguments passed to the function
:context a map describing the local context, defaulting to the :host and :pid of the process

The additional fields emitted by the return probe are:

:result the value returned by the function
:duration the elapsed time, in nanoseconds, since the function was called
:compute-duration the elapsed time, in nanoseconds, that was not spent waiting for the return value to resolve. If the return value was not an async-promise, this will always equal the :duration
:sub-tasks the return data for any instrumented functions called in the scope of the parent function
:counts values captured by increment-trace-counter or add-to-trace-counter (see below for further explanation)
:offset the elapsed time, in nanoseconds, from the topmost parent function’s invocation to this function’s invocation

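Since probes are ordinary Lamina channels, these events can be post-processed with the usual channel operators. For example, a sketch that turns the return probe from the earlier example into a stream of per-call durations in milliseconds:

```clojure
;; map* is from lamina.core; :duration is in nanoseconds
(def durations-ms
  (map* #(/ (:duration %) 1e6)
        (probe-channel :instrumented:return)))

(receive-all durations-ms #(println "call took" % "ms"))
```
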
Notice that we capture the return data for all other instrumented functions automatically. To explore this, let’s use defn-instrumented, a combination of defn and instrument.

> (defn-instrumented g [x y] (+ x y))
> (defn-instrumented f
	{:probes {:return (sink->> pprint)}}
	[x y]
	(+ (g x x) (g y y)))

In this, we have two instrumented functions, f and g. We do not need to explicitly define a :name in the function metadata, since it defaults to the namespace-qualified function name (here, :user:f and :user:g).

We have also defined a :probes map for f, which allows us to consume probes without explicitly calling probe-channel. The keys of the map are probe suffixes (in this case, just :return), and the values are channels that will consume messages from those probe-channels. Here, we create one with the sink->> function.

> (f 1 2) 
{:name "user:f",
 :context {:host "xanadu", :pid 34403},
 :timestamp 1367022313773,
 :offset 0,
 :compute-duration 72000,
 :duration 72000,
 :sub-tasks
 ({:name "user:g",
   :context {:host "xanadu", :pid 34403},
   :timestamp 1367022313773,
   :offset 17000,
   :compute-duration 22000,
   :duration 22000,
   :sub-tasks nil,
   :counts {},
   :args nil,
   :result nil}
  {:name "user:g",
   :context {:host "xanadu", :pid 34403},
   :timestamp 1367022313773,
   :offset 50000,
   :compute-duration 13000,
   :duration 13000,
   :sub-tasks nil,
   :counts {},
   :args nil,
   :result nil}),
 :counts {},
 :args nil,
 :result nil}

Notice that the :sub-tasks field now has two entries, one for each time g was called. Each of these has its own :args and :result, and an :offset measured from when f was initially invoked.

Notice also that neither the :args nor the :result fields have any values. This is because we have not specified what should be captured via the :capture option, which defaults to capturing neither. There are a variety of options that can be used when instrumenting functions:

:capture controls whether the input arguments or return value will be captured by the timing; may be :in, :out, :in-out, or :none. Defaults to :none.
:name the name which will be prefixed to the enter, return, and error probes; in the case of defn-instrumented this defaults to the function name prefixed by its namespace
:probes a map of sub-names (with :name implicitly prefixed) onto channels that consume those probes
:executor the executor which will invoke the function body; if defined the function will return an async-promise rather than a realized value
:timeout only used when :executor is defined; a function which takes the arguments and returns the timeout in milliseconds
:implicit? if true, when this function is called within the scope of another instrumented function its timing will be represented in the top-level timing’s :sub-tasks. Defaults to true.
:with-bindings only used when :executor is defined; if true, the thread-local bindings are conveyed to the executing thread. Defaults to false.
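
For instance, a sketch combining the :capture and :probes options in defn-instrumented's metadata (the function h is illustrative):

```clojure
;; capture both arguments and results, and pretty-print each return event
(defn-instrumented h
  {:capture :in-out
   :probes {:return (sink->> pprint)}}
  [x y]
  (* x y))

(h 3 4) ;; the return event now includes :args [3 4] and :result 12
```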

instrumented functions and executors

When we specify that a function should be executed on another thread, we’re inserting a delay between the invocation of the function and its execution. With normal thread pools it can be hard to differentiate between time spent on the pool’s queue and time spent actually executing the function. However, Lamina’s executors are specially instrumented, so the timing for any instrumented function that’s executed on an executor will have an :enqueued-duration field, which is the amount of time, in nanoseconds, that the function spent waiting to be executed. The :duration field, in this case, is the sum of the :enqueued-duration and time spent actually executing the function.

Using the with-instrumentation macro, which captures and returns the timing information for all instrumented functions, we can see this in action:

> (pprint
    (with-instrumentation
      (task
        (+ 1 1))))

{:name "with-instrumentation",
 :context {:host "xanadu.local", :pid 34403},
 :timestamp 1367022932809,
 :offset 0,
 :compute-duration 5470000,
 :duration 5470000,
 :sub-tasks
 ({:name "user:task",
   :context {:host "xanadu.local", :pid 34403},
   :offset 1872000,
   :timestamp 1367022932811,
   :duration 1381000,
   :enqueued-duration 1181000,
   :compute-duration 1381000,
   :sub-tasks nil,
   :args nil,
   :result nil,
   :counts {}}),
 :counts {},
 :args nil,
 :result 1}

formatting timing data

These nested timing data structures can be hard to read, so in many cases it’s useful to use the time* macro, which behaves like Clojure’s time macro but gives timing information for every instrumented function:

> (time* (task (+ 1 1)))
time - 657.0us
  user:task - 229.0us - enqueued for 177.0us
<< ... >>

Notice that the entry for task displays both the total time and the time spent waiting for execution. This printed format may be reproduced by passing the timing data into format-timing.
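
For example, a sketch combining with-instrumentation and format-timing:

```clojure
;; capture the timing data, then print it in the same format time* uses
(println
  (format-timing
    (with-instrumentation
      (task (+ 1 1)))))
```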

time* can be a very powerful way to understand the performance characteristics of our code, if important functions are instrumented. In Aleph, the http-request function is already instrumented, which lets us examine exactly where time is spent:

> (use 'aleph.http 'lamina.trace)
> (time*
    @(http-request
       {:method :get, :url ""}))
time - 62.0ms
  aleph:http-request - 61.9ms - computed for 658.0us, waited for 61.3ms
    aleph:http-connection - 16.4ms - computed for 620.0us, waited for 15.8ms

Since both the inner http-request and http-connection functions return async-promises, the printed output is able to differentiate between time spent computing and time spent waiting for the value to resolve. In this case, we can see that we spent ~16ms waiting for the TCP connection to open, and ~62ms for both the connection to open and the HTTP response to arrive.

probe selector syntax

There is a query language for writing basic transformers on probes (or other channels) using a specialized string-based syntax.

TODO: stomp endpoints