Telemetry #411

Merged: 19 commits merged into master from profiling on Jan 27, 2021
Conversation

@c0c0n3 (Member) commented Dec 8, 2020

Proposed changes

The telemetry package: thread-safe, efficient collection of time-varying quantities with a low memory footprint, in 286 lines of code. Using this module you can easily:

  • time the duration of selected code blocks;
  • turn on the Python built-in profiler (cProfile);
  • gather garbage collection and OS resource usage metrics: memory, user/kernel time, etc.

Telemetry data collection works seamlessly across process boundaries, so even if the main process forks children you can be sure there won't be nasty race conditions, nor the overhead you'd incur with a DB backend.

Duration, GC and OS measurements are assembled in time series. Every time you sample a duration, a corresponding measurement is added to an underlying duration series at the current time point. GC and OS metrics, if enabled, work similarly, except they're automatically gathered in a background thread every second. Notice we use a nanosecond-resolution, high-precision timer under the bonnet. Time series data are collected in a memory buffer which gets flushed to file as soon as the buffer fills. Files are written to a directory of your choice.
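
As an aside, the standard library has offered nanosecond-resolution timers since Python 3.7; the sketch below illustrates the kind of calls involved (illustrative only, not the package's actual implementation):

import time

# Epoch-based timestamp with nanosecond resolution, the kind of value
# used for time points.
timepoint = time.time_ns()

# Monotonic high-precision timer, better suited to measuring durations
# since it never jumps backwards like wall-clock time can.
start = time.perf_counter_ns()
...  # code block being timed
elapsed_seconds = (time.perf_counter_ns() - start) / 1e9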

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing
    functionality to not work as expected)

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • I have added tests that prove my fix is effective or that my
    feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in
    downstream modules

Further comments

Concurrency and performance

The whole data collection process, from memory to file, is designed to be efficient and have a low memory footprint in order to minimise impact on the runtime performance of the process being monitored and guarantee accurate measurements. At the same time, it is thread-safe and can even be used by multiple processes simultaneously, e.g. a Gunicorn WSGI app configured with a pre-forked server model. (See the documentation of the flush module about parallel reader/writer processes.)

As a frame of reference, average overhead for collecting a duration sample is 0.31 ms. Memory gets capped at 1 MiB as noted below. (You can use the overhead gauge script in the tests directory to experiment yourself.)
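
For a rough self-check along the same lines, something like the following back-of-the-envelope script would do (my own sketch, not the actual gauge script; the output directory is arbitrary):

import os
import time

import server.telemetry.monitor as monitor

os.makedirs('/tmp/telemetry', exist_ok=True)
monitor.start(monitoring_dir='/tmp/telemetry',
              with_runtime=False,
              with_profiler=False)

# Collect 10,000 duration samples of an empty block and average the
# wall-clock cost of each start/stop pair.
t0 = time.perf_counter_ns()
for _ in range(10_000):
    sample_id = monitor.start_duration_sample()
    monitor.stop_duration_sample('noop', sample_id)
elapsed_ms = (time.perf_counter_ns() - t0) / 10_000 / 1e6
print(f"{elapsed_ms:.3f} ms per sample")

monitor.stop()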

Output files

Time series data are collected in a memory buffer which gets flushed to file when the buffer's memory grows bigger than 1 MiB. Files are written to a directory of your choice, with file names starting with one of the following prefixes: DURATION_FILE_PREFIX for duration series, RUNTIME_FILE_PREFIX for GC & OS metrics, and PROFILER_FILE_PREFIX for profiler data. The file format is CSV, with fields arranged as follows:

  • Timepoint: time at which the measurement was taken, expressed as number of nanoseconds since the epoch. (Integer value.)
  • Measurement: sampled quantity. (Float value.)
  • Label: name used to identify a particular kind of measurement when sampling. (String value.)
  • PID: operating system ID of the process that sampled the quantity. (Integer value.)

For convenience, the CSV file starts with a header: Timepoint, Measurement, Label, PID.
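
Since the format is plain CSV, any consumer can read the data back; for instance, a throwaway script along these lines (the file name is hypothetical, actual names depend on the prefixes above):

import csv

# Read a telemetry CSV back in; keys come from the header row, and
# skipinitialspace strips the blanks after the commas in the header.
with open('_monitoring/duration.csv', newline='') as f:
    for row in csv.DictReader(f, skipinitialspace=True):
        timepoint_ns = int(row['Timepoint'])     # ns since the epoch
        measurement = float(row['Measurement'])  # sampled quantity
        label = row['Label']                     # series name
        pid = int(row['PID'])                    # sampling process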

Usage

You start a monitoring session with a call to the start function:

import server.telemetry.monitor as monitor

monitor.start(monitoring_dir='/my/output/',  # where to output files
              with_runtime=True,             # enable GC & OS sampling
              with_profiler=False)           # but keep profiler off

This function should be called exactly once, so it's best to call it from the main thread when the process starts. There's also a stop function you should call just before the process exits to make sure all memory buffers get flushed to file:

monitor.stop()

This function too should be called exactly once. If you'd like a guarantee that it runs even when the process exits through an unexpected code path, one option (my own sketch, not something the module mandates) is to register it as an exit hook instead of calling it by hand:
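
import atexit

import server.telemetry.monitor as monitor

# Flush any buffered telemetry data when the interpreter shuts down.
atexit.register(monitor.stop)

With set-up and tear-down out of the way, let's have a look at how to time a code block: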

sample_id = monitor.start_duration_sample()       # start timer
try:
    ...  # do stuff
finally:
    key = 'my code block id'      # unique name for this series
    monitor.stop_duration_sample(key, sample_id)  # stop timer

Now every time this block of code gets hit, a new duration sample ends up in the "my code block id" series. If you later open up the duration file where the series gets saved, you should be able to see something similar to

Timepoint, Measurement, Label, PID
    ...
1607092101580206000, 0.023, "my code block id", 5662
    ...
1607092101580275000, 0.029, "my code block id", 5662
    ...
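
If the try/finally dance feels noisy, you could also wrap it in a small context manager of your own (a hypothetical helper, not part of the package):

from contextlib import contextmanager

import server.telemetry.monitor as monitor

@contextmanager
def duration_sample(key):
    # Time whatever runs inside the with-block and record it in the
    # series named by key.
    sample_id = monitor.start_duration_sample()
    try:
        yield
    finally:
        monitor.stop_duration_sample(key, sample_id)

with duration_sample('my code block id'):
    ...  # do stuff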

Since timing functions is a pretty common thing to do, we have a shortcut for that:

from server.telemetry.monitor import time_it

@time_it(label='my_func')
def my_func():
    ...  # do stuff

The time_it decorator wraps the function you annotate so that it runs the same timing instructions we wired in manually earlier.

With runtime metrics collection enabled (see the start function), a background thread gathers GC and OS data (CPU time, memory, etc.) as detailed in the documentation of GCSampler and ProcSampler.

Another thing you can do is turn on the profiler when calling the start function. In that case, when the process exits you'll have a profile data file you can import into the Python profiler's stats console, e.g.

python -m pstats profiler.5662.data  # 5662 is the process' PID
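
The same file can also be loaded programmatically with the standard pstats API, e.g. to list the top functions by cumulative time:

import pstats

# Load the profiler dump and print the ten most expensive functions
# by cumulative time. (5662 stands in for the actual PID.)
stats = pstats.Stats('profiler.5662.data')
stats.sort_stats('cumulative').print_stats(10)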

Finally, here's a real-world example of using this module with Gunicorn to record the duration of each HTTP request in time series (one series for each combination of path and verb) as well as GC and OS metrics. To try it out yourself, start Gunicorn with a config file containing:

import os
import server.telemetry.monitor as monitor


bind = "0.0.0.0:8080"
workers = 2               # pre-fork model (two processes)
worker_class = 'gthread'  # with threads,
threads = 2               # two of them for each process
loglevel = 'debug'


monitoring_dir = '_monitoring'  # output files go in ./_monitoring

# init monitoring with duration and runtime samplers just after Gunicorn
# forks the worker process.
def post_worker_init(worker):
    os.makedirs(monitoring_dir, exist_ok=True)
    monitor.start(monitoring_dir=monitoring_dir,
                  with_runtime=True,
                  with_profiler=False)

# start the request timer just before Gunicorn hands off the request
# to the WSGI app; stash away the sample ID in the request object so
# we can use it later.
def pre_request(worker, req):
    req.duration_sample_id = monitor.start_duration_sample()

# stop the request timer as soon as the WSGI app is done with the
# request; record request duration in a time series named by combining
# HTTP path and verb.
def post_request(worker, req, environ, resp):
    key = f"{req.path} [{req.method}]"
    monitor.stop_duration_sample(key, req.duration_sample_id)

# flush any leftover time series data still buffered in memory just
# before the process exits.
def worker_exit(server, worker):
    monitor.stop()
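
Assuming you saved the above as gunicorn.conf.py and your WSGI entry point is app:application (both names hypothetical), you'd then launch:

$ gunicorn --config gunicorn.conf.py app:application

Duration series land in ./_monitoring, with one series per path/verb combination; for example, a GET on /foo gets recorded under the label "/foo [GET]".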

Customisation

For common telemetry scenarios (timing, profiling, GC) you should just be able to use the monitor module as is. See its documentation for details and usage.

For more advanced scenarios, or to write your own samplers, first familiarise yourself with the observation module (core functionality, with lots of examples), then have a look at the samplers in the sampler module to see how to write one. Finally, you can use the implementation of the monitor module as a starting point for wiring the building blocks together to fit your use case.

Prod monitoring

We can use this module as is to gather prod telemetry data. Ideally we'd add a couple of tweaks though, just to make it more convenient to use with K8s pods:

  • Make buffer flushing configurable: time intervals and memory thresholds.
  • Add a sidecar to move telemetry files to a place where they can be easily analysed. Importing into Crate or Timescale could be as simple as a COPY FROM. Another option is to copy them to a local directory and then import them into Pandas or any other data analytics framework.
  • Allow turning telemetry on and off without restarting containers.

Data analysis with Pandas

For convenience, I've added a pandas_import module to, you guessed it, import telemetry files into Pandas frames and series. This module doesn't really belong in the telemetry package, but I couldn't find a better home for it just now. If you want to use it to analyse data produced during benchmark sessions (see src/tests/benchmark), you'll have to install Pandas and Matplotlib, which you can do easily with pipenv since I've added those packages to the dev dependencies:

$ pipenv install --dev

Notice those packages and their dependencies won't wind up in the Docker image.
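
If you'd rather not depend on the helper module, the CSV layout is simple enough to load directly with stock Pandas (file name hypothetical):

import pandas as pd

# Load a duration series and convert the epoch-nanosecond Timepoint
# column into proper timestamps.
df = pd.read_csv('_monitoring/duration.csv', skipinitialspace=True)
df['Timepoint'] = pd.to_datetime(df['Timepoint'], unit='ns')

# Summary stats for each series, e.g. per path/verb combination.
print(df.groupby('Label')['Measurement'].describe())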

@github-actions bot (Contributor) commented Dec 8, 2020

CLA Assistant Lite bot: All contributors have signed the CLA ✍️

@c0c0n3 (Member, Author) commented Dec 8, 2020

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Dec 8, 2020
@c0c0n3 (Member, Author) commented Dec 9, 2020

@chicco785 I've just pushed a commit (8af5e67) with some cosmetic changes to keep Code Climate happy. But it looks like there's a problem with how Code Climate counts lines of code:

File observation.py has 299 lines of code (exceeds 250 allowed).
Consider refactoring.

That's bogus. The real count is 84:

$ cloc src/server/telemetry/observation.py
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                           1             93            248             84
-------------------------------------------------------------------------------

I've double-checked the figure by tallying up lines of code myself and still got 84. (Well, 80 if I exclude imports.) We should probably disable the check?

@chicco785 self-requested a review December 9, 2020 10:14
@chicco785 previously approved these changes Dec 9, 2020
@chicco785 (Contributor) commented:

@c0c0n3 can you update documentation about running load tests and refer to this addition and ditch k6?

https://github.com/smartsdk/ngsi-timeseries-api/blob/master/docs/manuals/admin/benchmarks.md

very briefly, just to let people know how to run benchmarks.

@c0c0n3 (Member, Author) commented Dec 10, 2020

I've added a new section to the manual to detail telemetry features and how to use them:

https://github.com/smartsdk/ngsi-timeseries-api/blob/profiling/docs/manuals/admin/telemetry.md

can you update documentation about running load tests and refer to this addition and ditch k6?

Sure, I can update that section too, but do you also want me to zap all the k6-related stuff in the code base, i.e. scripts & friends?

@chicco785 (Contributor) commented:

Sure, I can update that section too, but do you also want me to zap all the k6-related stuff in the code base, i.e. scripts & friends?

yep, clearly your shell script is less resource-intensive and gives bursts (k6 provides more of a user-like dynamic behaviour)

@chicco785 (Contributor) commented:

@c0c0n3 can we finalise this PR?

@c0c0n3 (Member, Author) commented Dec 24, 2020

can we finalise this PR?

I think it's ready to go. I haven't updated the benchmark manual section; I opened a separate issue for that: #415.

@chicco785 merged commit 32674f5 into master Jan 27, 2021
@github-actions bot locked and limited conversation to collaborators Jan 27, 2021
@c0c0n3 deleted the profiling branch January 28, 2021 08:13