Add statistics for CRUD operations on router #244

Add statistics module for collecting metrics of CRUD operations on router. Wrap all CRUD operation calls in the statistics collector. Statistics must be enabled manually with `crud.cfg`. They can be disabled, restarted or re-enabled later. This patch introduces `crud.cfg`. `crud.cfg` is a tool to set module configuration. It is similar to Tarantool `box.cfg`, although we don't need to call it to bootstrap the module -- it is used only to change configuration. `crud.cfg` is a callable table. To change configuration, call it: `crud.cfg{ stats = true }`. You can check table contents as with ordinary table, but do not change them directly -- use call instead. Table contents is immutable and use proxy approach (see [1, 2]). Iterating through `crud.cfg` with pairs is not supported yet, refer to #265. `crud.stats()` returns --- - spaces: my_space: insert: ok: latency: 0.002 count: 19800 time: 39.6 error: latency: 0.000001 count: 4 time: 0.000004 ... `spaces` section contains statistics for each observed space. If operation has never been called for a space, the corresponding field will be empty. If no requests has been called for a space, it will not be represented. Space data is based on client requests rather than storages schema, so requests for non-existing spaces are also collected. Possible statistics operation labels are `insert` (for `insert` and `insert_object` calls), `get`, `replace` (for `replace` and `replace_object` calls), `update`, `upsert` (for `upsert` and `upsert_object` calls), `delete`, `select` (for `select` and `pairs` calls), `truncate`, `len`, `count` and `borders` (for `min` and `max` calls). Each operation section consists of different collectors for success calls and error (both error throw and `nil, err`) returns. `count` is the total requests count since instance start or stats restart. `latency` is the average time of requests execution, `time` is the total time of requests execution. Since `pairs` request behavior differs from any other crud request, its statistics collection also has specific behavior. Statistics (`select` section) are updated after `pairs` cycle is finished: you either have iterated through all records or an error was thrown. If your pairs cycle was interrupted with `break`, statistics will be collected when pairs objects are cleaned up with Lua garbage collector. Statistics are preserved between package reloads. Statistics are preserved between Tarantool Cartridge role reloads [3] if CRUD Cartridge roles are used. 1. http://lua-users.org/wiki/ReadOnlyTables 2. tarantool/tarantool#2867 3. https://www.tarantool.io/en/doc/latest/book/cartridge/cartridge_api/modules/cartridge.roles/#reload Part of #224

In some cases LuaJit optimizes using gc_observer table to handle pairs object gc. It had lead to incorrect behavior (ignoring some pairs interrupted with break in stats) and tests fail in some cases (for example, if you run only stats unit tests). Part of #224

After this patch, statistics `select` section additionally contains `details` collectors. ``` crud.stats('my_space').select.details --- - map_reduces: 4 tuples_fetched: 10500 tuples_lookup: 238000 ... ``` `map_reduces` is the count of planned map reduces (including those not executed successfully). `tuples_fetched` is the count of tuples fetched from storages during execution, `tuples_lookup` is the count of tuples looked up on storages while collecting responses for calls (including scrolls for multibatch requests). Details data is updated as part of the request process, so you may get new details before `select`/`pairs` call is finished and observed with count, latency and time collectors. Part of #224

Use in-built `crud.stats()` info instead on `storage_stat` helper in tests to track map reduce calls. Part of #224

`crud.len` supports using space id instead of name. After this patch, stats wrapper support mapping id to name. Since using space id is a questionable pattern (see #255), this commit may be reverted later. Part of #224

If `metrics` [1] found, you can use metrics collectors to store statistics. `metrics >= 0.10.0` is required to use metrics driver. (`metrics >= 0.9.0` is required to use summary quantiles with age buckets. `metrics >= 0.5.0, < 0.9.0` is unsupported due to quantile overflow bug [2]. `metrics == 0.9.0` has bug that do not permits to create summary collector without quantiles [3]. In fact, user may use `metrics >= 0.5.0`, `metrics != 0.9.0` if he wants to use metrics without quantiles, and `metrics >= 0.9.0` if he wants to use metrics with quantiles. But this is confusing, so let's use a single restriction for both cases.) The metrics are part of global registry and can be exported together (e.g. to Prometheus) with default tools without any additional configuration. Disabling stats destroys the collectors. Metrics collectors are used by default if supported. To explicitly set driver, call `crud.cfg{ stats = true, stats_driver = driver }` ('local' or 'metrics'). To enable quantiles, call ``` crud.cfg{ stats = true, stats_driver = 'metrics', stats_quantiles = true, } ``` With quantiles, `latency` statistics are changed to 0.99 quantile of request execution time (with aging). Quantiles computations increases performance overhead up to 10% when used in statistics. Add CI matrix to run tests with `metrics` installed. To get full coverage on coveralls, #248 must be resolved. 1. https://github.com/tarantool/metrics 2. tarantool/metrics#235 3. tarantool/metrics#262 Closes #224

Before this patch, performance tests ran together with unit and integration with `--coverage` flag. Coverage analysis cropped the result of performance tests to 10-15 times. For metrics integration it resulted in timeout errors and drop of performance which is not reproduces with coverage disabled. Moreover, before this patch log capture was disabled and performance tests did not displayed any results after run. Now performance tests also run is separate CI job. After this patch, `make -C build coverage` will run lightweight version of performance test. `make -C build performance` will run real performance tests. You can paste output table to GitHub [1]. This path also reworks current performance test. It adds new cases to compare module performance with or without statistics, statistic wrappers and compare different metrics drivers and reports new info: average call time and max call time. Performance test result: overhead is 3-10% in case of `local` driver and 5-15% in case of `metrics` driver, up to 20% for `metrics` with quantiles. Based on several runs on HP ProBook 440 G7 i7/16Gb/256SSD. 1. https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/organizing-information-with-tables Closes #233, follows up #224

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add statistics for CRUD operations on router #244

Add statistics for CRUD operations on router #244

Commits on Feb 21, 2022

Commits on Feb 25, 2022