Telemetry #411
Conversation
CLA Assistant Lite bot: All contributors have signed the CLA ✍️
> I have read the CLA Document and I hereby sign the CLA
@chicco785 I've just pushed a commit (8af5e67) with some cosmetic changes to keep Code Climate happy. But it looks like there's a problem with how Code Climate counts lines of code: that figure is bogus. The real count is 84. I've double-checked it by tallying up lines of code myself and still got 84. (Well, 80 if I exclude imports.) We should probably disable the check?
@c0c0n3 can you update the documentation about running load tests to refer to this addition and ditch k6? https://github.com/smartsdk/ngsi-timeseries-api/blob/master/docs/manuals/admin/benchmarks.md Very briefly, just to let people know how to run benchmarks.
I've added a new section to the manual to detail telemetry features and how to use them: https://github.com/smartsdk/ngsi-timeseries-api/blob/profiling/docs/manuals/admin/telemetry.md
Sure, I can update that section too, but do you also want me to zap all the k6-related stuff in the code base, i.e. scripts & friends?
yep, clearly your shell script is less resource-intensive and gives bursts (k6 provides more of a user-like dynamic behaviour)
@c0c0n3 can we finalise this PR?
I think it's ready to go. I haven't updated the benchmark manual section, I opened a separate issue for that: #415. |
Proposed changes
The telemetry package: Thread-safe, low memory footprint, efficient collection of time-varying quantities in 286 lines of code. Using this module you can easily:
Telemetry data collection works seamlessly across process boundaries, so even if the main process forks children, you can still be sure there won't be nasty race conditions, and none of the overhead you'd incur with a DB backend either.
Duration, GC and OS measurements are assembled in time series. Every time you sample a duration, a corresponding measurement is added to an underlying duration series at the current time point. GC and OS metrics, if enabled, work similarly, except they're automatically gathered in a background thread every second. Notice we use a nanosecond-resolution, high-precision timer under the bonnet. Time series data are collected in a memory buffer which gets flushed to file as soon as the buffer fills. Files are written to a directory of your choice.
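The flush-on-full buffering described above can be sketched roughly as follows. This is an illustrative stand-in, not the package's actual code: the class name, the tiny item threshold, and the `StringIO` sink are assumptions for the example (the real module caps the buffer by memory, at 1 MiB).

```python
import csv
import io

# Sketch of the flush-on-full buffering idea (assumed names, not the
# package's actual implementation): measurements accumulate in memory
# and get written out as CSV once the buffer grows past a threshold.
class SeriesBuffer:
    def __init__(self, sink, max_items=3):
        self._sink = sink
        self._max = max_items  # stand-in for the real 1 MiB memory cap
        self._items = []

    def append(self, timepoint, measurement, label, pid):
        self._items.append((timepoint, measurement, label, pid))
        if len(self._items) >= self._max:
            self.flush()

    def flush(self):
        # write buffered measurements out as CSV rows, then reclaim memory
        csv.writer(self._sink).writerows(self._items)
        self._items.clear()

sink = io.StringIO()
buf = SeriesBuffer(sink)
for i in range(3):
    buf.append(i, 0.5, "demo", 123)
print(sink.getvalue().count("demo"))  # 3 rows flushed
```

In the real module the sink would be a file in the directory you chose, and flushing is triggered by buffer memory rather than item count.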
Further comments
Concurrency and performance
The whole data collection process, from memory to file, is designed to be efficient and have a low memory footprint in order to minimise the impact on the runtime performance of the process being monitored and guarantee accurate measurements. At the same time, it is thread-safe and can even be used by multiple processes simultaneously, e.g. a Gunicorn WSGI app configured with a pre-forked server model. (See the documentation of the `flush` module about parallel reader/writer processes.) As a frame of reference, the average overhead for collecting a duration sample is 0.31 ms. Memory gets capped at 1 MiB as noted below. (You can use the overhead gauge script in the tests directory to experiment yourself.)
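If you want a quick back-of-the-envelope check of per-sample overhead before reaching for the overhead gauge script, something along these lines works; the `sample` function here is an illustrative stand-in for a real collection call, not the package's API.

```python
import time

# Rough overhead gauge: time n calls of a stand-in sampling function
# and report the average cost per call. The sample() function below is
# an assumption for illustration, not the package's actual API.
def sample(series, label, value):
    # roughly what collecting one duration sample amounts to
    series.append((time.perf_counter_ns(), label, value))

series = []
n = 10_000
t0 = time.perf_counter_ns()
for _ in range(n):
    sample(series, "demo", 1.0)
overhead_ms = (time.perf_counter_ns() - t0) / n / 1e6
print(f"avg overhead per sample: {overhead_ms:.4f} ms")
```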
Output files
Time series data are collected in a memory buffer which gets flushed to file when the buffer's memory grows bigger than 1 MiB. Files are written to a directory of your choice, with file names having the following prefixes: the value of `DURATION_FILE_PREFIX` for duration series, the value of `RUNTIME_FILE_PREFIX` for GC & OS metrics, and `PROFILER_FILE_PREFIX` for profiler data. The file format is CSV and, for convenience, each file starts with a header of `Timepoint, Measurement, Label, PID`.
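Parsing such a file needs nothing beyond the stdlib. The rows below are made up for the example; only the header layout comes from the format described above.

```python
import csv
import io

# Illustrative rows only (invented for the example), laid out under the
# documented "Timepoint, Measurement, Label, PID" header.
sample = """Timepoint, Measurement, Label, PID
1607092101580206000, 0.0004197712, version, 5662
1607092101709131000, 0.0000858550, version, 5662
"""
# skipinitialspace strips the space after each comma
rows = list(csv.reader(io.StringIO(sample), skipinitialspace=True))
header, data = rows[0], rows[1:]
print(header)     # ['Timepoint', 'Measurement', 'Label', 'PID']
print(len(data))  # 2
```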
Usage
You start a monitoring session with a call to the `start` function. This function should be called exactly once, so it's best to call it from the main thread when the process starts. There's also a `stop` function you should call just before the process exits to make sure all memory buffers get flushed to file. This function too should be called exactly once. With set-up and tear-down out of the way, let's have a look at how to time a code block:
Now every time this block of code gets hit, a new duration sample ends up in the "my code block id" series. If you later open up the duration file where the series gets saved, you should be able to see something similar to
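The block-timing idea described above can be sketched with a context manager; this is an assumed implementation for illustration, not the package's actual API, but it shows how a labelled duration series accumulates one sample per hit.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Sketch only (assumed names): one duration series per label, sampled
# with the nanosecond-resolution timer.
series = defaultdict(list)

@contextmanager
def duration(label):
    # record how long the wrapped block takes, in seconds
    t0 = time.perf_counter_ns()
    try:
        yield
    finally:
        elapsed = (time.perf_counter_ns() - t0) / 1e9
        series[label].append((time.time_ns(), elapsed))

with duration("my code block id"):
    total = sum(range(1000))  # the code being timed

print(len(series["my code block id"]))  # one sample per hit
```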
Since timing functions is a pretty common thing to do, we have a shortcut for that: the `time_it` decorator wraps the function you annotate to run the same timing instructions we wired in manually earlier.
With runtime metrics collection enabled (see the `start` function), a background thread gathers GC and OS data (CPU time, memory, etc.) as detailed in the documentation of `GCSampler` and `ProcSampler`. Another thing you can do is turn on the profiler when calling the `start` function. In that case, when the process exits you'll have a profile data file you can import into the Python profiler's stats console, e.g. with `python -m pstats <file>`.
Finally, here's a real-world example of using this module with Gunicorn to record the duration of each HTTP request in time series (one series for each combination of path and verb) as well as GC and OS metrics. To try it out yourself, start Gunicorn with a config file that wires in the telemetry hooks.
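A Gunicorn config along these lines could wire the telemetry hooks in. This is a hedged sketch: `on_starting`, `on_exit`, `pre_request` and `post_request` are real Gunicorn server hooks, but the `telemetry.monitor` import path, the `monitoring_dir` argument, and the `sample_duration` call are assumptions, not necessarily the module's actual interface.

```python
# gunicorn.conf.py -- sketch only; the telemetry import path and call
# names below are assumptions, adapt them to the actual module.
import time

def on_starting(server):
    # one monitoring session for the whole pre-forked server
    from telemetry import monitor            # hypothetical import path
    monitor.start(monitoring_dir='/tmp/telemetry')

def on_exit(server):
    from telemetry import monitor            # hypothetical import path
    monitor.stop()                            # flush buffers before exiting

def pre_request(worker, req):
    # stash the start time on the request object (illustrative attribute)
    req._telemetry_t0 = time.perf_counter_ns()

def post_request(worker, req, environ, resp):
    # one series per path+verb combination
    from telemetry import monitor            # hypothetical import path
    elapsed = (time.perf_counter_ns() - req._telemetry_t0) / 1e9
    monitor.sample_duration(f"{req.method} {req.path}", elapsed)
```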
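A `time_it`-style decorator could be sketched as below. This is an assumed implementation for illustration (the decorator's real signature isn't shown here); it just replays the same timing logic we wired in manually.

```python
import functools
import time

# Sketch of a time_it-style decorator (assumed implementation, not the
# package's actual code): each call of the wrapped function drops a
# duration sample into a labelled series.
durations = {}

def time_it(label):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            t0 = time.perf_counter_ns()
            try:
                return fn(*args, **kwargs)
            finally:
                durations.setdefault(label, []).append(
                    (time.perf_counter_ns() - t0) / 1e9)
        return wrapper
    return decorator

@time_it("answer")
def answer():
    return 42

print(answer())                   # → 42
print(len(durations["answer"]))   # → 1
```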
Customisation
For common telemetry scenarios (timing, profiling, GC) you should just be able to use the `monitor` module as is. See there for details and usage.
For more advanced scenarios, or for writing your own samplers, familiarise yourself with the `observation` module (core functionality, comes with lots of examples) first, then have a look at the samplers in the `sampler` module to see how to write one. Finally, you can use the implementation of the `monitor` module as a starting point for wiring together the building blocks to make them fit your use case.
Prod monitoring
We can use this module as is to gather prod telemetry data. Ideally we'd add a couple of tweaks though, just to make it more convenient to use with K8s pods. Since the output files are plain CSV, one option for analysing the data is to bulk-load them into a DB, e.g. with `COPY FROM`. Another option is to copy them to a local directory and then import them into Pandas or any other data analytics framework.
Data analysis with Pandas
Out of convenience I've added a `pandas_import` module to, you guessed it, import telemetry files into Pandas frames and series. This module doesn't really belong in the telemetry package, but I couldn't find a better home for it just now. If you want to use it to analyse data produced during benchmark sessions (see `src/tests/benchmark`), you'll have to install Pandas and Matplotlib, which you can do easily with `pipenv` (e.g. `pipenv install --dev`) since I've added those packages to the dev dependencies. Notice those packages and their dependencies won't wind up in the Docker image.
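Since the `pandas_import` module's exact API isn't shown here, a plain-pandas sketch of loading a telemetry file conveys the idea; the rows are invented for the example, and only the column layout comes from the documented CSV header.

```python
import io
import pandas as pd

# Sketch only: loading a telemetry CSV with plain pandas instead of the
# pandas_import helper. The data rows are illustrative, laid out under
# the documented header.
sample = """Timepoint,Measurement,Label,PID
1607092101580206000,0.0004197712,version,5662
1607092101709131000,0.0000858550,version,5662
"""
df = pd.read_csv(io.StringIO(sample))
# Timepoints are nanoseconds since the epoch
df["Timepoint"] = pd.to_datetime(df["Timepoint"])
print(df.shape)  # (2, 4)
print(df["Measurement"].mean())
```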