Add statistics for CRUD operations on router #244

DifferentialOrange · 2021-11-29T16:31:34Z

Add statistics module for collecting metrics of CRUD operations on
router. Wrap all CRUD operation calls in the statistics collector.
Statistics must be enabled manually with crud.cfg. They can be
disabled, restarted or re-enabled later.

crud.stats() returns

---
- spaces:
    my_space:
      insert:
        ok:
          latency: 0.002
          count: 19800
          time: 39.6
        error:
          latency: 0.000001
          count: 4
          time: 0.000004
      select:
        ok:
          latency: 0.032
          count: 43100
          time: 1379.2
        error:
          latency: 0.000001
          count: 2
          time: 0.000002
        details:
          map_reduces: 48
          tuples_fetched: 105000
          tuples_lookup: 2380000
---

The spaces section contains statistics for each observed space.
If an operation has never been called for a space, the corresponding
field will be empty. If no requests have been made for a
space, it will not be represented. Space data is based on
client requests rather than the storage schema, so requests
for non-existing spaces are also collected.

This patch introduces crud.cfg, a tool to set module
configuration. It is similar to Tarantool box.cfg, although we don't
need to call it to bootstrap the module -- it is used only to change
configuration. crud.cfg is a callable table. To change configuration,
call it: crud.cfg{ stats = true }. You can read table contents as
with an ordinary table, but do not change them directly -- use a call
instead. Table contents are immutable and use a proxy approach
(see [1, 2]). Iterating through crud.cfg with pairs is not supported
yet, refer to #265.
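
To illustrate the callable table with a read-only proxy described above, here is a minimal sketch in plain Lua. It is not the crud implementation: make_cfg and its error messages are hypothetical names chosen for the example, and only the stats option comes from this description.

```lua
-- Hypothetical sketch of a callable, read-only configuration table.
-- make_cfg is an illustrative name, not part of the crud API.
local function make_cfg(defaults)
    -- Actual values live in a hidden table; users only see the proxy.
    local values = {}
    for k, v in pairs(defaults) do
        values[k] = v
    end

    local proxy = {}
    return setmetatable(proxy, {
        -- Reads are forwarded to the hidden table.
        __index = values,
        -- Direct writes are rejected: the only way to change a value is a call.
        __newindex = function(_, key)
            error(('use cfg{ %s = value } instead of direct assignment')
                :format(tostring(key)), 2)
        end,
        -- cfg{ key = value } updates known options and returns the proxy.
        __call = function(self, opts)
            for k, v in pairs(opts or {}) do
                if defaults[k] == nil then
                    error(('unknown option %q'):format(tostring(k)), 2)
                end
                values[k] = v
            end
            return self
        end,
    })
end

local cfg = make_cfg({ stats = false })
cfg{ stats = true }
print(cfg.stats)
```

Because the proxy table itself stays empty, a plain pairs over it yields nothing in this sketch, which is one way such a design can end up without iteration support.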

Possible statistics operation labels are
insert (for insert and insert_object calls),
get, replace (for replace and replace_object calls), update,
upsert (for upsert and upsert_object calls), delete,
select (for select and pairs calls), truncate, len, count
and borders (for min and max calls).
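
The mapping from calls to statistics labels listed above can be written down as a lookup table. This is only a restatement of the list in Lua; the table name label_by_call is hypothetical.

```lua
-- Call name -> statistics operation label, as listed in the description.
-- label_by_call is an illustrative name, not a crud internal.
local label_by_call = {
    insert = 'insert', insert_object = 'insert',
    get = 'get',
    replace = 'replace', replace_object = 'replace',
    update = 'update',
    upsert = 'upsert', upsert_object = 'upsert',
    delete = 'delete',
    select = 'select', pairs = 'select',
    truncate = 'truncate',
    len = 'len',
    count = 'count',
    min = 'borders', max = 'borders',
}

print(label_by_call.pairs, label_by_call.max)
```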

Each operation section consists of separate collectors
for successful calls and for errors (both thrown errors and nil, err
returns). count is the total request count since instance start
or stats restart. latency is the average request execution time,
time is the total request execution time.
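
The relation between count, time and latency can be sketched as follows. This is an illustration under assumptions, not the module code: new_collector, observe and wrap are hypothetical names, and os.clock is only a stand-in for whatever clock the module really uses. In the sample output above, insert/ok has time 39.6 and count 19800, which gives the reported average latency of 0.002.

```lua
-- Hypothetical sketch of an ok/error collector; all names are illustrative.
local unpack = table.unpack or unpack -- Lua 5.2+ and Lua 5.1/LuaJIT

local function pack(...)
    return { n = select('#', ...), ... }
end

local function new_collector()
    return {
        ok = { count = 0, time = 0, latency = 0 },
        error = { count = 0, time = 0, latency = 0 },
    }
end

local function observe(collector, status, elapsed)
    local section = collector[status]
    section.count = section.count + 1
    section.time = section.time + elapsed
    -- Average latency is the total time divided by the request count.
    section.latency = section.time / section.count
end

-- Wrap a function so that both a thrown error and a `nil, err` return
-- are counted in the error section.
local function wrap(collector, func, clock)
    clock = clock or os.clock -- stand-in clock, an assumption of this sketch
    return function(...)
        local start = clock()
        local results = pack(pcall(func, ...))
        local elapsed = clock() - start

        if not results[1] then
            observe(collector, 'error', elapsed)
            error(results[2], 0) -- rethrow after recording
        end
        if results[2] == nil and results[3] ~= nil then
            observe(collector, 'error', elapsed)
        else
            observe(collector, 'ok', elapsed)
        end
        return unpack(results, 2, results.n)
    end
end

local collector = new_collector()
local safe_div = wrap(collector, function(a, b)
    if b == 0 then
        return nil, 'division by zero'
    end
    return a / b
end)

safe_div(6, 3)
safe_div(1, 0)
print(collector.ok.count, collector.error.count)
```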

Since pairs request behavior differs from any other crud request, its
statistics collection also has specific behavior. Statistics (select
section) are updated after the pairs cycle is finished: either you
have iterated through all records or an error was thrown.
If your pairs cycle was interrupted with break, statistics will
be collected when the pairs objects are cleaned up by the Lua garbage
collector.
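
A rough sketch of the first two cases (full iteration and a thrown error) is given below. It wraps a generic-for iterator and reports once when the loop ends; observe_iterator and on_finish are hypothetical names, and the garbage-collector fallback for break is deliberately left out of the sketch.

```lua
-- Hypothetical wrapper around a generic-for iterator triplet.
-- on_finish(status) is called at most once: with 'ok' when the iterator
-- is exhausted, or with 'error' when the generator throws.
local function observe_iterator(on_finish, gen, param, state)
    local finished = false
    local function finish(status)
        if not finished then
            finished = true
            on_finish(status)
        end
    end

    local function wrapped_gen(p, s)
        local ok, key, value = pcall(gen, p, s)
        if not ok then
            finish('error')
            error(key, 0) -- key holds the error object here
        end
        if key == nil then
            -- A nil control variable means the loop is over.
            finish('ok')
        end
        return key, value
    end

    return wrapped_gen, param, state
end

local calls = {}
local function record(status)
    table.insert(calls, status)
end

for _, value in observe_iterator(record, ipairs({ 'a', 'b' })) do
    print(value)
end
print(#calls, calls[1])
```

If the loop above ended with break, on_finish would not be called at all in this sketch; that is the case the description delegates to the garbage collector.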

Statistics are preserved between package reloads. Statistics are
preserved between Tarantool Cartridge role reloads [3] if CRUD Cartridge
roles are used.

The select section of statistics additionally contains
details collectors.
map_reduces is the count of planned map reduces (including those not
executed successfully). tuples_fetched is the count of tuples fetched
from storages during execution, tuples_lookup is the count of tuples
looked up on storages while collecting responses for calls (including
scrolls for multibatch requests). Details data is updated as part of
the request process, so you may get new details before a select/pairs
call is finished and observed with the count, latency and time collectors.
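
As a small usage sketch, the details collectors can be combined into derived values. The table below copies the select section from the sample output above; lookup_per_fetched is a hypothetical helper, not part of the module.

```lua
-- Sample shaped like the crud.stats() output above; values are copied from it.
local stats = {
    spaces = {
        my_space = {
            select = {
                ok = { latency = 0.032, count = 43100, time = 1379.2 },
                error = { latency = 0.000001, count = 2, time = 0.000002 },
                details = {
                    map_reduces = 48,
                    tuples_fetched = 105000,
                    tuples_lookup = 2380000,
                },
            },
        },
    },
}

-- Hypothetical helper: tuples looked up on storages per tuple fetched.
-- Returns nil if the space or its select details were never observed.
local function lookup_per_fetched(stats_table, space_name)
    local space = stats_table.spaces[space_name]
    local select_stats = space and space.select
    local details = select_stats and select_stats.details
    if details == nil or details.tuples_fetched == 0 then
        return nil
    end
    return details.tuples_lookup / details.tuples_fetched
end

print(lookup_per_fetched(stats, 'my_space'))
```

The nil checks follow the rule stated earlier: spaces and operations that were never requested are simply absent from the output.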

Use the built-in crud.stats() info instead of the storage_stat helper
in tests to track map reduce calls.

If metrics [4] is found, you can use metrics collectors to store
statistics. metrics >= 0.10.0 is required to use the metrics driver.
(metrics >= 0.9.0 is required to use summary quantiles with
age buckets. metrics >= 0.5.0, < 0.9.0 is unsupported
due to a quantile overflow bug [5]. metrics == 0.9.0 has a bug that does
not permit creating a summary collector without quantiles [6].
In fact, users may use metrics >= 0.5.0, metrics != 0.9.0
if they want to use metrics without quantiles, and metrics >= 0.9.0
if they want to use metrics with quantiles. But this is confusing,
so let's use a single restriction for both cases.)

The metrics are part of global registry and can be exported together
(e.g. to Prometheus) with default tools without any additional
configuration. Disabling stats destroys the collectors.

Metrics collectors are used by default if supported. To explicitly set
driver, call crud.cfg{ stats = true, stats_driver = driver }
('local' or 'metrics'). To enable quantiles, call

crud.cfg{
    stats = true,
    stats_driver = 'metrics',
    stats_quantiles = true,
}

With quantiles, latency statistics are changed to the 0.99 quantile
of request execution time (with aging). Quantile computation increases
performance overhead by up to 10% when used in statistics.

Add CI matrix to run tests with metrics installed. To get full
coverage on coveralls, #248 must be resolved.

Before this patch, performance tests ran together with unit and
integration tests with the --coverage flag. Coverage analysis reduced
performance test results by a factor of 10-15. For the metrics
integration it resulted in timeout errors and a performance drop which
is not reproduced with coverage disabled. Moreover, before this patch log
capture was disabled and performance tests did not display any
results after a run. Now performance tests run in a separate CI job.

After this patch, make -C build coverage runs a lightweight
version of the performance test. make -C build performance runs the real
performance tests.

You can paste output table to GitHub [7].

This patch also reworks the current performance test. It adds new cases to
compare module performance with and without statistics and statistics
wrappers, compares different metrics drivers, and reports new info:
average call time and max call time.

Performance test result: overhead is 3-10% in case of local driver and
5-15% in case of metrics driver, up to 20% for metrics with
quantiles. Based on several runs on HP ProBook 440 G7 i7/16Gb/256SSD.

Success requests per second

| | without stats wrapper | stats disabled | local stats | metrics stats (no quantiles) | metrics stats (with quantiles) |
| --- | --- | --- | --- | --- | --- |
| select by pk | 18818.04 | 18666.49 | 17057.17 | 16223.08 | 15919.78 |
| select gt by pk (limit 10) | 4439.22 | 4411.50 | 4345.72 | 4137.92 | 4134.32 |
| pairs gt by pk (limit 100) | 1667.76 | 1643.36 | 1485.64 | 1448.39 | 1470.19 |
| insert | 39808.06 | 39392.49 | 35940.60 | 34346.48 | 32155.64 |
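
The overhead figures can be checked against the requests-per-second table with simple arithmetic. The sketch below assumes the "without stats wrapper" column is the baseline, which the description does not state explicitly; overhead_percent and the field names are hypothetical, and the numbers are copied from two rows of the table.

```lua
-- Requests per second copied from the table above (two rows only).
local rps = {
    ['select by pk'] = {
        baseline = 18818.04, -- "without stats wrapper" (assumed baseline)
        local_stats = 17057.17,
        metrics = 16223.08,
        metrics_quantiles = 15919.78,
    },
    ['insert'] = {
        baseline = 39808.06,
        local_stats = 35940.60,
        metrics = 34346.48,
        metrics_quantiles = 32155.64,
    },
}

-- Relative throughput loss compared to the baseline, in percent.
local function overhead_percent(baseline, measured)
    return (1 - measured / baseline) * 100
end

for _, case in ipairs({ 'select by pk', 'insert' }) do
    local row = rps[case]
    print(('%s: local %.1f%%, metrics %.1f%%, metrics with quantiles %.1f%%'):format(
        case,
        overhead_percent(row.baseline, row.local_stats),
        overhead_percent(row.baseline, row.metrics),
        overhead_percent(row.baseline, row.metrics_quantiles)))
end
```

Under that baseline assumption, both rows fall inside the ranges quoted in the description for each driver.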

Max call time

| | without stats wrapper | stats disabled | local stats | metrics stats (no quantiles) | metrics stats (with quantiles) |
| --- | --- | --- | --- | --- | --- |
| select by pk | 55.865 ms | 54.517 ms | 51.106 ms | 57.375 ms | 45.661 ms |
| select gt by pk (limit 10) | 100.522 ms | 95.305 ms | 98.899 ms | 110.826 ms | 102.551 ms |
| pairs gt by pk (limit 100) | 111.484 ms | 149.179 ms | 125.325 ms | 165.374 ms | 124.922 ms |
| insert | 52.945 ms | 49.434 ms | 52.853 ms | 55.963 ms | 62.925 ms |

I didn't forget about

Tests
Changelog
Documentation

Closes #224, closes #233

DifferentialOrange · 2021-12-07T10:54:34Z

metrics integration leads to timeouts in perf test: https://github.com/tarantool/crud/runs/4442731842?check_suite_focus=true

DifferentialOrange · 2021-12-08T14:43:44Z

> metrics integration leads to timeouts in perf test: https://github.com/tarantool/crud/runs/4442731842?check_suite_focus=true

I tuned the summary parameters so they do not cause timeouts (on local runs). But the performance drop is still noticeable (2-3 times). I have discussed the issue with @yngvar-antonsson and filed a ticket (tarantool/metrics#331). The CI perf test does not run with the metrics driver for now (simply to avoid doubling the time it takes to run); I think I will add a separate perf test as a part of the #225 solution.

Since we have PR #244 it will be nice to collect statistics for batch operations too. To establish the effectiveness of `crud.batch_insert()` method compared to `crud.insert()`, perf tests were added. `crud.insert()` in the loop and `crud.batch_insert()` are compared for different batch sizes. Closes #193

`crud` module is cartridge-independent in nature, but provides cartridge roles, which are the most popular way to set up the module. The roles also do not use any modern cartridge features and should work with any cartridge version. But since crud.cfg was introduced [1], it was required to add some code to properly support roles reload [2]. Now the cartridge.hotreload module is unconditionally required, so roles cannot be used with cartridge older than 2.4.0. This patch fixes the behavior.

1. 6da4f56
2. tarantool/cartridge@941952e

Follows #244

Before this patch, tests were marked with xfail since there was a bug in the metrics module [1]. This bug is fixed in newer versions, so xfail is replaced with skip based on the metrics version.

1. tarantool/metrics#334

Follows #244

DifferentialOrange force-pushed the DifferentialOrange/gh-224-operation-stats branch 7 times, most recently from 2c3c332 to 04e57b7 Compare December 6, 2021 12:00

DifferentialOrange changed the title ~~Operation stats~~ Add statistics for CRUD operations on router Dec 6, 2021

DifferentialOrange force-pushed the DifferentialOrange/gh-224-operation-stats branch from 04e57b7 to 096c41f Compare December 6, 2021 13:25

DifferentialOrange marked this pull request as ready for review December 6, 2021 13:34

DifferentialOrange force-pushed the DifferentialOrange/gh-224-operation-stats branch 3 times, most recently from 6e1306c to a734377 Compare December 6, 2021 14:09

DifferentialOrange mentioned this pull request Dec 7, 2021

Send parallel builds coverage to coveralls #248

Open

DifferentialOrange force-pushed the DifferentialOrange/gh-224-operation-stats branch 7 times, most recently from 0504ed8 to 0b43ec6 Compare December 7, 2021 10:42

DifferentialOrange force-pushed the DifferentialOrange/gh-224-operation-stats branch from 0b43ec6 to a7324e8 Compare December 8, 2021 14:23

DifferentialOrange mentioned this pull request Dec 8, 2021

Summary performance drops tarantool/metrics#331

Closed

DifferentialOrange force-pushed the DifferentialOrange/gh-224-operation-stats branch 2 times, most recently from 5a6e025 to dad3d17 Compare December 8, 2021 14:38

DifferentialOrange requested review from Totktonada, ligurio and olegrok December 8, 2021 14:44

Totktonada mentioned this pull request Jun 3, 2022

Preview resolved PR comments refined-github/refined-github#5673

Closed

DifferentialOrange mentioned this pull request Feb 3, 2023

fix: pre-hotreload cartridge support #341

Merged
