Telemetry #411

Merged: 19 commits merged into master from profiling on Jan 27, 2021
Conversation

@c0c0n3 (Member) commented Dec 8, 2020

Proposed changes

The telemetry package: thread-safe, efficient collection of time-varying quantities with a low memory footprint, in 286 lines of code. Using this module you can easily:

  • time the duration of selected code blocks;
  • turn on the Python built-in profiler (cProfile);
  • gather garbage collection and OS resource usage metrics: memory, user/kernel time, etc.

Telemetry data collection works seamlessly across process boundaries, so even if the main process forks children you can be sure there won't be nasty race conditions, nor the overhead you'd incur with a DB backend.

Duration, GC and OS measurements are assembled in time series. Every time you sample a duration, a corresponding measurement is added to an underlying duration series at the current time point. GC and OS metrics, if enabled, work similarly, except they're automatically gathered in a background thread every second. Notice we use a nanosecond-resolution, high-precision timer under the bonnet. Time series data are collected in a memory buffer which gets flushed to file as soon as the buffer fills. Files are written to a directory of your choice.
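
As an aside, the standard library has offered nanosecond-resolution timers since Python 3.7; the sketch below illustrates the kind of calls involved (illustrative only, not the package's actual implementation):

import time

# Epoch-based timestamp with nanosecond resolution, the kind of value
# used for time points.
timepoint = time.time_ns()

# Monotonic high-precision timer, better suited to measuring durations
# since it never jumps backwards like wall-clock time can.
start = time.perf_counter_ns()
...  # code block being timed
elapsed_seconds = (time.perf_counter_ns() - start) / 1e9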

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing
    functionality to not work as expected)

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • I have added tests that prove my fix is effective or that my
    feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in
    downstream modules

Further comments

Concurrency and performance

The whole data collection process, from memory to file, is designed to be efficient and have a low memory footprint in order to minimise impact on the runtime performance of the process being monitored and guarantee accurate measurements. At the same time, it is thread-safe and can even be used by multiple processes simultaneously, e.g. a Gunicorn WSGI app configured with a pre-forked server model. (See the documentation of the flush module about parallel reader/writer processes.)

As a frame of reference, average overhead for collecting a duration sample is 0.31 ms. Memory gets capped at 1 MiB as noted below. (You can use the overhead gauge script in the tests directory to experiment yourself.)
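
For a rough self-check along the same lines, something like the following back-of-the-envelope script would do (my own sketch, not the actual gauge script; the output directory is arbitrary):

import os
import time

import server.telemetry.monitor as monitor

os.makedirs('/tmp/telemetry', exist_ok=True)
monitor.start(monitoring_dir='/tmp/telemetry',
              with_runtime=False,
              with_profiler=False)

# Collect 10,000 duration samples of an empty block and average the
# wall-clock cost of each start/stop pair.
t0 = time.perf_counter_ns()
for _ in range(10_000):
    sample_id = monitor.start_duration_sample()
    monitor.stop_duration_sample('noop', sample_id)
elapsed_ms = (time.perf_counter_ns() - t0) / 10_000 / 1e6
print(f"{elapsed_ms:.3f} ms per sample")

monitor.stop()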

Output files

Time series data are collected in a memory buffer which gets flushed to file when the buffer's memory grows bigger than 1 MiB. Files are written to a directory of your choice, with file names starting with one of the following prefixes: DURATION_FILE_PREFIX for duration series, RUNTIME_FILE_PREFIX for GC & OS metrics, and PROFILER_FILE_PREFIX for profiler data. The file format is CSV, with fields arranged as follows:

  • Timepoint: time at which the measurement was taken, expressed as number of nanoseconds since the epoch. (Integer value.)
  • Measurement: sampled quantity. (Float value.)
  • Label: name used to identify a particular kind of measurement when sampling. (String value.)
  • PID: operating system ID of the process that sampled the quantity. (Integer value.)

For convenience, the CSV file starts with a header: Timepoint, Measurement, Label, PID.
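
Since the format is plain CSV, any consumer can read the data back; for instance, a throwaway script along these lines (the file name is hypothetical, actual names depend on the prefixes above):

import csv

# Read a telemetry CSV back in; keys come from the header row, and
# skipinitialspace strips the blanks after the commas in the header.
with open('_monitoring/duration.csv', newline='') as f:
    for row in csv.DictReader(f, skipinitialspace=True):
        timepoint_ns = int(row['Timepoint'])     # ns since the epoch
        measurement = float(row['Measurement'])  # sampled quantity
        label = row['Label']                     # series name
        pid = int(row['PID'])                    # sampling process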

Usage

You start a monitoring session with a call to the start function:

import server.telemetry.monitor as monitor

monitor.start(monitoring_dir='/my/output/',  # where to output files
              with_runtime=True,             # enable GC & OS sampling
              with_profiler=False)           # but keep profiler off

This function should be called exactly once, so it's best to call it from the main thread when the process starts. There's also a stop function you should call just before the process exits to make sure all memory buffers get flushed to file:

monitor.stop()

This function too should be called exactly once. If you'd like a guarantee that it runs even when the process exits through an unexpected code path, one option (my own sketch, not something the module mandates) is to register it as an exit hook instead of calling it by hand:
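
import atexit

import server.telemetry.monitor as monitor

# Flush any buffered telemetry data when the interpreter shuts down.
atexit.register(monitor.stop)

With set-up and tear-down out of the way, let's have a look at how to time a code block: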

sample_id = monitor.start_duration_sample()       # start timer
try:
    ...  # do stuff
finally:
    key = 'my code block id'      # unique name for this series
    monitor.stop_duration_sample(key, sample_id)  # stop timer

Now every time this block of code gets hit, a new duration sample ends up in the "my code block id" series. If you later open up the duration file where the series gets saved, you should be able to see something similar to

Timepoint, Measurement, Label, PID
    ...
1607092101580206000, 0.023, "my code block id", 5662
    ...
1607092101580275000, 0.029, "my code block id", 5662
    ...
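
If the try/finally dance feels noisy, you could also wrap it in a small context manager of your own (a hypothetical helper, not part of the package):

from contextlib import contextmanager

import server.telemetry.monitor as monitor

@contextmanager
def duration_sample(key):
    # Time whatever runs inside the with-block and record it in the
    # series named by key.
    sample_id = monitor.start_duration_sample()
    try:
        yield
    finally:
        monitor.stop_duration_sample(key, sample_id)

with duration_sample('my code block id'):
    ...  # do stuff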

Since timing functions is a pretty common thing to do, we have a shortcut for that:

from server.telemetry.monitor import time_it

@time_it(label='my_func')
def my_func():
    ...  # do stuff

The time_it decorator wraps the function you annotate so that it runs the same timing instructions we wired in manually earlier.

With runtime metrics collection enabled (see the start function), a background thread gathers GC and OS data (CPU time, memory, etc.) as detailed in the documentation of GCSampler and ProcSampler.

Another thing you can do is turn on the profiler when calling the start function. In that case, when the process exits you'll have a profile data file you can import into the Python profiler's stats console, e.g.

python -m pstats profiler.5662.data  # 5662 is the process' PID
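
The same file can also be loaded programmatically with the standard pstats API, e.g. to list the top functions by cumulative time:

import pstats

# Load the profiler dump and print the ten most expensive functions
# by cumulative time. (5662 stands in for the actual PID.)
stats = pstats.Stats('profiler.5662.data')
stats.sort_stats('cumulative').print_stats(10)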

Finally, here's a real-world example of using this module with Gunicorn to record the duration of each HTTP request in time series (one series for each combination of path and verb) as well as GC and OS metrics. To try it out yourself, start Gunicorn with a config file containing:

import os
import server.telemetry.monitor as monitor


bind = "0.0.0.0:8080"
workers = 2               # pre-fork model (two processes)
worker_class = 'gthread'  # with threads,
threads = 2               # two of them for each process
loglevel = 'debug'


monitoring_dir = '_monitoring'  # output files go in ./_monitoring

# init monitoring with duration and runtime samplers just after Gunicorn
# forks the worker process.
def post_worker_init(worker):
    os.makedirs(monitoring_dir, exist_ok=True)
    monitor.start(monitoring_dir=monitoring_dir,
                  with_runtime=True,
                  with_profiler=False)

# start the request timer just before Gunicorn hands off the request
# to the WSGI app; stash away the sample ID in the request object so
# we can use it later.
def pre_request(worker, req):
    req.duration_sample_id = monitor.start_duration_sample()

# stop the request timer as soon as the WSGI app is done with the
# request; record request duration in a time series named by combining
# HTTP path and verb.
def post_request(worker, req, environ, resp):
    key = f"{req.path} [{req.method}]"
    monitor.stop_duration_sample(key, req.duration_sample_id)

# flush any leftover time series data still buffered in memory just
# before the process exits.
def worker_exit(server, worker):
    monitor.stop()
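
Assuming you saved the above as gunicorn.conf.py and your WSGI entry point is app:application (both names hypothetical), you'd then launch:

$ gunicorn --config gunicorn.conf.py app:application

Duration series land in ./_monitoring, with one series per path/verb combination; for example, a GET on /foo gets recorded under the label "/foo [GET]".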

Customisation

For common telemetry scenarios (timing, profiling, GC) you should just be able to use the monitor module as is. See its documentation for details and usage.

For more advanced scenarios, or to write your own samplers, first familiarise yourself with the observation module (core functionality, with lots of examples), then have a look at the samplers in the sampler module to see how to write one. Finally, you can use the implementation of the monitor module as a starting point for wiring the building blocks together to fit your use case.

Prod monitoring

We can use this module as is to gather prod telemetry data. Ideally we'd add a couple of tweaks though, just to make it more convenient to use with K8s pods:

  • Make buffer flushing configurable: time intervals and memory thresholds.
  • Add a sidecar to move telemetry files to a place where they can be easily analysed. Importing into Crate or Timescale could be as simple as a COPY FROM. Another option is to copy them to a local directory and then import them into Pandas or any other data analytics framework.
  • Allow turning telemetry on and off without restarting containers.

Data analysis with Pandas

For convenience, I've added a pandas_import module to, you guessed it, import telemetry files into Pandas frames and series. This module doesn't really belong in the telemetry package, but I couldn't find a better home for it just now. If you want to use it to analyse data produced during benchmark sessions (see src/tests/benchmark), you'll have to install Pandas and Matplotlib, which you can do easily with pipenv since I've added those packages to the dev dependencies:

$ pipenv install --dev

Notice those packages and their dependencies won't wind up in the Docker image.
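
If you'd rather not depend on the helper module, the CSV layout is simple enough to load directly with stock Pandas (file name hypothetical):

import pandas as pd

# Load a duration series and convert the epoch-nanosecond Timepoint
# column into proper timestamps.
df = pd.read_csv('_monitoring/duration.csv', skipinitialspace=True)
df['Timepoint'] = pd.to_datetime(df['Timepoint'], unit='ns')

# Summary stats for each series, e.g. per path/verb combination.
print(df.groupby('Label')['Measurement'].describe())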

@github-actions bot (Contributor) commented Dec 8, 2020

CLA Assistant Lite bot: All contributors have signed the CLA ✍️

@c0c0n3 (Member, Author) commented Dec 8, 2020

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Dec 8, 2020
@c0c0n3 (Member, Author) commented Dec 9, 2020

@chicco785 I've just pushed a commit (8af5e67) with some cosmetic changes to keep Code Climate happy. But it looks like there's a problem with how Code Climate counts lines of code:

File observation.py has 299 lines of code (exceeds 250 allowed).
Consider refactoring.

That's bogus. The real count is 84:

$ cloc src/server/telemetry/observation.py
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                           1             93            248             84
-------------------------------------------------------------------------------

I've double-checked the figure by tallying up lines of code myself and still got 84. (Well, 80 if I exclude imports.) We should probably disable the check?

@chicco785 self-requested a review December 9, 2020 10:14
@chicco785 previously approved these changes Dec 9, 2020
@chicco785 (Contributor) commented:

@c0c0n3 can you update documentation about running load tests and refer to this addition and ditch k6?

https://github.com/smartsdk/ngsi-timeseries-api/blob/master/docs/manuals/admin/benchmarks.md

very briefly, just to let people know how to run benchmarks.

@c0c0n3 (Member, Author) commented Dec 10, 2020

I've added a new section to the manual to detail telemetry features and how to use them:

https://github.com/smartsdk/ngsi-timeseries-api/blob/profiling/docs/manuals/admin/telemetry.md

can you update documentation about running load tests and refer to this addition and ditch k6?

Sure, I can update that section too, but do you also want me to zap all the k6-related stuff in the code base, i.e. scripts & friends?

@chicco785 (Contributor) commented:

Sure, I can update that section too, but do you also want me to zap all the k6-related stuff in the code base, i.e. scripts & friends?

yep, clearly your shell script is less resource-intensive and gives bursts (k6 provides more of a user-like dynamic behaviour)

@chicco785 (Contributor) commented:

@c0c0n3 can we finalise this PR?

@c0c0n3 (Member, Author) commented Dec 24, 2020

can we finalise this PR?

I think it's ready to go. I haven't updated the benchmark manual section; I opened a separate issue for that: #415.

@chicco785 merged commit 32674f5 into master Jan 27, 2021
@github-actions bot locked and limited conversation to collaborators Jan 27, 2021
@c0c0n3 deleted the profiling branch January 28, 2021 08:13