Release 2.25.0.0-b203: [#24405] docdb : Optimize Server-Level Metric Attribute Map Creation · yugabyte/yugabyte-db

2.25.0.0-b203
0ad17a3


yusong-yan tagged this 24 Oct 22:55

Summary:
**Background:**

Currently, the attribute map of a table or tablet metric contains six elements: `table_id`, `table_name`, `table_type`, `namespace_name`, `exported_instance`, and `metric_type`.

**Issue:**

During Prometheus metric scraping, a new server level attribute map is created each time a table/tablet metric is aggregated to server level, even if the attribute map already exists in the cache (`aggregated_attributes_by_metric_type_`).
```
MetricEntity::AttributeMap new_attr = attr;
new_attr.erase("table_id");
new_attr.erase("table_name");
new_attr.erase("table_type");
new_attr.erase("namespace_name");
```
This needlessly creates and destroys a map on the stack on every aggregation. When the number of tables and tablets is very high, the cost adds up significantly.

For example, tests on a 4-core node with 4,000 tables and 18,000 tablets showed that a normal-mode scrape took about 18 seconds, of which server-level attribute handling accounted for about 5 seconds. With about 300 metrics being aggregated, the unnecessary map creation and deletion happened (4,000 + 18,000) * 300 ≈ 6.6 million times. Here is the time breakdown:
```
Server WriteForPrometheus took 18.33 seconds
1. Time taken by WriteSingleEntry: 10.27 seconds
1a. Time taken by AddAggregatedEntry attributes aggregation: 1.18 seconds
1b. Time taken by AddAggregatedEntry values aggregation: 1.46 seconds
1c. Time taken by flushing server metric: 0.01 seconds
1d. Time taken by creating server level attributes: 5.01 seconds <--- bottleneck
1e. Time taken by getting aggregation level: 0.79 seconds
2. Time taken by FlushAggregatedValues: 2.99 seconds
3. Time taken by rest of operations: 5.07 seconds
```
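As a back-of-the-envelope check of the figures above, the arithmetic can be written out as a short C++ sketch. The function names are hypothetical and exist only for this illustration; the inputs are the numbers quoted in the summary, where "about 300 metrics" is an approximation.

```cpp
#include <cstdint>

// Hypothetical helper: number of temporary attribute maps built per scrape
// when every aggregated metric of every table and tablet copies the map.
constexpr std::int64_t MapCopiesPerScrape(std::int64_t tables,
                                          std::int64_t tablets,
                                          std::int64_t metrics_per_entity) {
  return (tables + tablets) * metrics_per_entity;
}

// Hypothetical helper: fraction of the total scrape time spent in one step.
constexpr double ShareOfScrape(double step_seconds, double total_seconds) {
  return step_seconds / total_seconds;
}

// Figures from the breakdown: 4,000 tables, 18,000 tablets, ~300 metrics,
// 5.01 s spent creating server-level attributes out of 18.33 s in total.
constexpr std::int64_t kCopies = MapCopiesPerScrape(4000, 18000, 300);
constexpr double kAttrShare = ShareOfScrape(5.01, 18.33);
```

With these inputs, `kCopies` is 6,600,000 and `kAttrShare` is a little over a quarter of the scrape, which is consistent with step 1d being flagged as the bottleneck.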

More testing results: [[ https://docs.google.com/spreadsheets/d/1hjPYusvDr38m6JvKenAbbiJCGd2Mt0t98qk_d_JUkhw/edit?gid=0#gid=0 | Google Spread Sheet]]

**Affected Versions:**
This behavior is part of the server-level aggregation feature.
D18637 (starting from the 2.14.1 branch) introduced server-level aggregation, but it was not used by default.
D29217 (starting from the 2.18.6 branch) made server-level aggregation the default; it aggregates metrics that are excluded from `priority_regex` at the server level.

**Fix:**

Only create a new server-level attribute map if it doesn't already exist in the cache. Since only tablet and table metrics are aggregated to the server level, this fix reduces the number of times the attribute map creation logic runs to a maximum of two (once for all tablet metrics and once for all table metrics).
Jira: DB-13314
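
The idea behind the fix can be sketched as a check-the-cache-first lookup. This is a minimal, self-contained illustration and not the actual patch: `AttributeMap` stands in for `MetricEntity::AttributeMap`, `ServerLevelAttributeCache` and `GetOrCreate` are hypothetical names, and keying the cache by `metric_type` is an assumption inferred from the name `aggregated_attributes_by_metric_type_`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in for MetricEntity::AttributeMap (assumption for this sketch).
using AttributeMap = std::map<std::string, std::string>;

// Hypothetical cache playing the role of aggregated_attributes_by_metric_type_.
class ServerLevelAttributeCache {
 public:
  // Returns the server-level attributes for the entity's metric_type,
  // building them only on the first request for that type.
  const AttributeMap& GetOrCreate(const AttributeMap& attr) {
    const auto type_it = attr.find("metric_type");
    assert(type_it != attr.end() && "metric_type attribute is required");
    const std::string& metric_type = type_it->second;

    // Fast path: the map was already built for this metric type.
    const auto cached = cache_.find(metric_type);
    if (cached != cache_.end()) {
      return cached->second;
    }

    // Slow path: copy once and strip the table-specific attributes.
    AttributeMap server_attr = attr;
    server_attr.erase("table_id");
    server_attr.erase("table_name");
    server_attr.erase("table_type");
    server_attr.erase("namespace_name");
    ++maps_built_;
    return cache_.emplace(metric_type, std::move(server_attr)).first->second;
  }

  // Number of times the slow path ran; at most one per metric type.
  std::size_t maps_built() const { return maps_built_; }

 private:
  std::unordered_map<std::string, AttributeMap> cache_;
  std::size_t maps_built_ = 0;
};
```

In this sketch, calling `GetOrCreate` for any number of tablet and table entities builds at most two maps, one per metric type, which mirrors the "maximum of two" described above. Returning a reference is safe here because `std::unordered_map` keeps references to its elements valid across later insertions.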

Test Plan: On a 4-core node with 4,000 tables and 18,000 tablets, the normal-mode scrape time dropped from 18 seconds to 13 seconds after applying the optimization.

Reviewers: rthallam, esheng, sergei

Reviewed By: rthallam, esheng, sergei

Subscribers: svc_phabricator, rthallam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39118
