Skip to content
Peter Taoussanis edited this page May 23, 2024 · 3 revisions

Building observable systems can be tough, there's no magic solutions. But the benefits can be large, often dramatically outweighing the costs.

I'll present some basic/fundamental info here and can expand on this more in future if there's interest.

General guidance

Consider what info you (will) need

Try be as concrete as possible about what info is (or will be) most valuable about your system. Get agreement on examples.

Info may be needed for:

  • Debugging
  • Business intelligence
  • Testing/staging/validation
  • Customer support
  • Quality management
  • Etc.

Try be clear on:

  • Who exactly will need what information
  • In what time frame
  • In what form
  • And for what purpose (i.e. how will the information be actionable)

Always try involve the final consumer of information in your design process.

Always try map examples of expected information to expected actionable decisions, and document these mappings.

The more clear the expected actionable decisions, the more clear the information's value.

Consider data dependencies

Not all data is inherently useful (and so valuable).

Be clear on which data is:

  • Useful independently
  • Useful in the aggregate (e.g. statistically)
  • Useful in association with other related data (e.g. while tracing the flow of some logical activity)

Remember to always question assertions of usefulness!!

Useful for what precise purpose and by whom? Can a clear example be provided mapping information to decisions?

When aggregates or associations are needed- does a plan exist for producing them from the raw data? Some forethought (e.g. appropriate identifiers and/or indexes) can help avoid big headaches!

Consider cardinality

Too much low-value data is often actively harmful: expensive to process, to store, and to query. Adding noise just interferes with better data, harming your ability to understand your system.

Consider the quantities of data that'd best suit your needs for that data:

Telemere sampling

Telemere offers extensive filtering capabilities that help you easily express the conditions and quantities that make sense for your needs. Use these both for their effects and as a form of documentation.

Consider evolution

Both data and needs evolve over time.

Consider what is likely subject to change, and how that might impact your observability needs and therefore design.

Consider the downstream effects on data services/storage when something does change.

Use schemas when appropriate. Use version identifiers when reasonable.

Consider the differences between accretion and breakage.

Telemere usage tips

  • log! and event! are both good general-purpose signal creators.

  • Provide an id for all signals you create.

    Qualified keywords are perfect! Downstream behaviour (e.g. alerts) can then look for these ids rather than messages (which are harder to match and more likely to change).

  • Keep a documented index of all your signal ids under version control.

    This way you can see all your ids in one place, and precise info on when ids were added/removed/changed.

  • Use signal middleware to your advantage.

    The result of signal middleware is cached and shared between all handlers making it an efficient place to transform signals. For this reason - prefer signal middleware to handler middleware when possible/convenient.

  • Signal and handler sampling is multiplicative.

    Both signals and handlers can have independent sample rates, and these MULTIPLY!

    If a signal is created with 20% sampling and a handler handles 50% of received signals, then 10% of possible signals will be handled (50% of 20%).

    This multiplicative rate is helpfully reflected in each signal's final :sample-rate value, making it possible to estimate unsampled cardinalities in relevant cases.

    So for n randomly sampled signals matching some criteria, you'd have seen an estimated Σ(1.0/sample-rate_i) such signals without sampling, etc.

  • Middleware can return any type, but it's best to return only nil or a map.

  • Middleware can be used to filter signals by returning nil.

  • Middleware can be used to split signals.

    Your middleware can call signal creators like any other code. Return nil after to filter the source signal. Just be aware that new signals will re-enter your handler queue/s as would any other signal - and so may be subject to handling delay and normal handler queue back-pressure.

  • Levels can be arbitrary integers.

    See the value of level-aliases to see how the standard keywords (:info, :warn, etc.) map to integers.

  • Signals with an :error value need not have :error level and vice versa.

    Telemere doesn't couple the presence of an error value to signal level. This can be handy, but means that you need to be clear on what constitutes an "error signal" for your use case. See also the error-signal? util.

  • Signals may contain arbitrary user-level keys.

    Any non-standard options you give to a signal creator call will be added to the signal it creates:

    (t/with-signal (t/log! {:my-key "foo"} "My message")))
    ;; => {:my-key "foo", :kvs {:my-key "foo", ...}, ...}

    Note that all user kvs will also be available together under the signal's :kvs key.

    User kvs are typically not included in handler output, so are a great way of providing custom data/opts for use (only) by custom middleware or handlers.

  • Signal kind can be useful in advanced cases.

    Every signal has a kind key set by default by each signal creator- log! calls produce signals with a :log kind, etc.

    Most folks won't use this, but it can be handy in advanced environments since it allows you to specify a custom taxonomy of signals separate from ids and namespaces.

    Signals can be filtered by kind, and minimum levels specified by kind.