[DocDB] Emit server level aggregates for metrics not in the table-level whitelist #18078

yusong-yan · 2023-07-04T19:31:09Z

Jira Link: DB-7120

Description

YBA uses the URL parameter `priority_regex` as a table whitelist to reduce the amount of output. The downside is that tablet or table metrics not matched by the regex are not scraped by the endpoint at all. As we keep adding metrics, new ones will never be returned from the endpoint unless YBA modifies `priority_regex`. This makes debugging harder in many situations.
Currently, we have server aggregation code, but it is not used because of its performance cost: it requires two scrapes to fetch all server-level and table-level metrics. It also has consistency errors: the server-level aggregation is done at a separate time from the normal metrics and thus may not actually be an aggregation of the returned table-level metrics.
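
To make the whitelist problem concrete, here is a minimal Python sketch of the behavior described above. It is not the server's code: the function name, the metric names other than `async_replication_committed_lag_micros`, and the use of full-string matching are assumptions for illustration only.

```python
import re


def table_metrics_returned(metric_names, priority_regex):
    """Return the metric names that survive a whitelist regex.

    Hypothetical model of the described behavior: anything the regex
    does not match is dropped from the output entirely.
    Full-string matching is an assumption, not confirmed by the issue.
    """
    pattern = re.compile(priority_regex)
    return [name for name in metric_names if pattern.fullmatch(name)]


# Hypothetical set of registered metrics; "new_debug_metric" stands in
# for a metric added after the whitelist was last updated.
registered = [
    "rows_inserted",
    "async_replication_committed_lag_micros",
    "new_debug_metric",
]

returned = table_metrics_returned(
    registered, "rows_inserted|async_replication_.*"
)
print(returned)
```

In this sketch `new_debug_metric` never reaches the output until the caller changes the regex, which is the debugging gap the issue describes.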

Warning: Please confirm that this issue does not contain any sensitive information

I confirm this issue does not contain any sensitive information.

Summary:

**The issue with current metric aggregation:**
1. For each periodic metric collection, every aggregation level requires an entire scrape of all metrics. For example, if a collection requires some metrics at the table, server, and stream levels, the DB needs to loop through all metrics 3 times.
2. Not a good long-term API for future use cases. Every time we change this API, we have to write version-specific code for Prometheus URL generation (our DB versions are not linear).
3. The CSV can become very long and hit URL limits.

**New design:**
1. Redesigned metric level aggregation to do only one scrape for fetching server-level, stream-level, and table-level metrics. (Removed `entity_opts`)
2. Introduced new URL parameter regex filters. The caller can now specify a regex (whitelist, blacklist) at each level. The same metric can be exposed at different levels as aggregation_level="server" and aggregation_level="table". If a metric is in both the whitelist and the blacklist on the same level, the blacklist takes priority.
   - table_whitelist: Options (ALL, NONE, Regex), Default (ALL)
   - table_blacklist: Options (ALL, NONE, Regex), Default (NONE)
   - server_whitelist: Options (ALL, NONE, Regex), Default (ALL)
   - server_blacklist: Options (NONE, Regex), Default (NONE)
3. Old URL params `metrics` (CSV) and `priority_regex` (Regex) are still available, so the new version of the Prometheus endpoint will also be compatible with old versions of the platform.

**Moreover, most of the change in D24598 has been reversed, since D24598 is built on top of the old metric aggregation design.**

I will add another gflag that limits the number of metrics scraped for each call in another diff.

Jira: DB-7120

Test Plan:
Rewrote `MetricsTest.AggregationTest` and `MetricsTest.TestStreamLevelAggregation` to verify the current metric aggregation.
Created a universe with a single table, then ran `yb-admin create_cdc_stream tablename` twice, then ensured there were two separate entries for `async_replication_committed_lag_micros` with different stream ids.

Reviewers: amitanand, rthallam, arybochkin, jhe, esheng
Reviewed By: amitanand, arybochkin
Subscribers: yql, ycdcxcluster, ybase, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D26662
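
The whitelist/blacklist rules in the D26662 summary can be read as the following Python sketch. It is an interpretation of the commit message, not the actual implementation: the helper names `matches`, `should_emit`, and `levels_for` are hypothetical, and full-string regex matching is an assumption.

```python
import re


def matches(spec, name):
    """Evaluate one filter value: the keywords ALL / NONE, or a regex."""
    if spec == "ALL":
        return True
    if spec == "NONE":
        return False
    # Assumption: the regex must match the whole metric name.
    return re.fullmatch(spec, name) is not None


def should_emit(name, whitelist="ALL", blacklist="NONE"):
    """A metric is emitted if whitelisted and not blacklisted.

    The blacklist wins when a metric matches both, as stated in the summary.
    """
    return matches(whitelist, name) and not matches(blacklist, name)


def levels_for(name, table_whitelist="ALL", table_blacklist="NONE",
               server_whitelist="ALL", server_blacklist="NONE"):
    """Return the aggregation levels at which a metric would be exposed.

    Note: per the summary, server_blacklist accepts only NONE or a regex;
    this sketch does not enforce that restriction.
    """
    levels = []
    if should_emit(name, table_whitelist, table_blacklist):
        levels.append("table")
    if should_emit(name, server_whitelist, server_blacklist):
        levels.append("server")
    return levels


print(levels_for("rows_inserted"))
print(levels_for("rows_inserted", table_whitelist="NONE"))
```

With the defaults (both whitelists ALL, both blacklists NONE) every metric would be exposed at both levels, which matters for the double-counting concern raised later in this thread.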

…ic scrape" and "[#18078] docdb : Reconstruct prometheus metric aggregation"

Summary: This reverts commits e732233 and 67aba99.

67aba99 breaks the `/prometheus-metrics` endpoint when `priority_regex` (or equivalently, `table_whitelist`) is set, as follows:
- No table-level aggregated metrics are in the output
- As a result of the server_whitelist=ALL default, server-level aggregates for all metrics are in the output

As a result, we get an output where only server-level aggregates are present, regardless of what `priority_regex`/`table_whitelist` are set to.

e732233 makes changes dependent on 67aba99, so reverting that as well for the time being.

Jira: DB-7130, DB-7120

Test Plan: Jenkins

Reviewers: bogdan, rthallam
Reviewed By: bogdan
Subscribers: yql, ycdcxcluster, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D28606

rthallamko3 · 2023-11-30T19:13:20Z

Evaluating backports for this diff. 2.20 backport is a must. 2.18 is optional (depending on the complexity of the backport).

Summary:

**The issue with the current /prometheus-metrics endpoint:**
- YBA uses the URL param `priority_regex` as a table whitelist to reduce the amount of output. The downside is that tablet or table metrics not matched by the regex are not scraped by the endpoint at all. As we keep adding metrics, new ones will never be returned from the endpoint unless YBA modifies `priority_regex`. This makes debugging harder in many situations.
- Currently, we have server aggregation code, but it is not used because of its performance cost: it requires two scrapes to fetch all server-level and table-level metrics. It also has consistency errors: the server-level aggregation is done at a separate time from the normal metrics and thus may not actually be an aggregation of the returned table-level metrics.

**Fix:**
- Redesigned metric level aggregation to do only one scrape for fetching server-level, stream-level, and table-level metrics (removed `entity_opts`). Also fixed the consistency problem with the previous server-level aggregation code.

**New Output**
- We only generate server-level timeseries for tablet or table metrics that are filtered out by `priority_regex`. This means the endpoint output will have more tablet and table timeseries at the server level. @mlillibridge tested with 3000 tablets: compared with the current endpoint output, the new output contains 539 new timeseries (Master+TServer) in normal mode.
- The reason we only partially turn on server-level aggregation is to avoid breaking YBA and YBM charts. Even though the new /prometheus-metrics endpoint is capable of outputting metrics at both the server and table levels, the current and older versions of YBA can't distinguish between the same metric at the table level and at the server level, so generating server-level timeseries for every metric would cause double counting in YBA charts. The new output should be completely transparent to YBA and YBM, as they should have no references in charts or alerts to the new server-level timeseries.
- The new output can immediately help DB debugging, as many metrics that were not collected before will start appearing in the /prometheus-metrics output.

**Others:**
- Dropped URL parameter `max_tables_metrics_breakdowns`. Since it was not used, deleting it has no impact. A global limit is still necessary to prevent egregious cases, so I will add one in another diff.
- Dropped URL parameters `exclude_metrics`, `server_exclude_metrics`, `server_metrics`, `server_priority_regex`. Those were introduced in D19147 for the server aggregation. Since they are not used by YBA and YBM, and we are switching to a new server aggregation design, it's safe to remove them.
- Most of the change in D24598 has been reverted, since D24598 was built on top of the old metric aggregation design.
- Removed `external_json_metrics_cbs_` and `external_prometheus_metrics_cbs_` from MetricEntity, since they are not used anywhere.
- Previously, we had some issues with this project, so D28606 reverted all the changes. For a clean revert, D28606 also had to revert another diff, D27535, which fixes a wrong metric_type. This diff brings the metric_type fix from D27535 back.

Jira: DB-7120

Test Plan:
Jenkins
New unit test, `MetricsTest.PrometheusMetricFilterTest`
Rewrote `MetricsTest.AggregationTest` and `MetricsTest.TestStreamLevelAggregation` to verify the new metric aggregation.
Manually compared the current and new /prometheus-metrics outputs by running a customized [[ https://phorge.dev.yugabyte.com/P91 | python script ]]

Reviewers: amitanand, esheng, rthallam, mlillibridge
Reviewed By: mlillibridge
Subscribers: hsunder, jhe, amalyshev, ybase, yql, ycdcxcluster, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D29217
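
The "New Output" behavior in the D29217 summary can be sketched in Python as a single pass over the per-tablet samples. This is only an illustration under stated assumptions: the function `scrape_once`, the sample data, and the metric names are hypothetical; summation stands in for whatever aggregation the server applies per metric type; and full-string regex matching is assumed.

```python
import re
from collections import defaultdict


def scrape_once(tablet_samples, priority_regex):
    """Aggregate per-tablet samples in one pass.

    tablet_samples: iterable of (table_id, metric_name, value) tuples.
    Metrics matching priority_regex are aggregated per table; all other
    metrics are folded into a single server-level total instead of being
    dropped. Because both results come from the same pass, they are
    consistent with each other.
    """
    pattern = re.compile(priority_regex)
    table_level = defaultdict(int)
    server_level = defaultdict(int)
    for table_id, name, value in tablet_samples:
        if pattern.fullmatch(name):
            table_level[(table_id, name)] += value
        else:
            server_level[name] += value
    return dict(table_level), dict(server_level)


# Hypothetical samples: two tablets of table t1, one tablet of table t2.
samples = [
    ("t1", "rows_inserted", 5),
    ("t1", "rows_inserted", 7),
    ("t2", "rows_inserted", 1),
    ("t1", "new_debug_metric", 2),
    ("t2", "new_debug_metric", 3),
]

table_level, server_level = scrape_once(samples, "rows_inserted")
print(table_level)
print(server_level)
```

Each metric name lands at exactly one level in this sketch, so a chart that sums every series of `rows_inserted` sees no extra server-level series. That is the double-counting concern the summary gives for only partially enabling server-level aggregation.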

…ggregation

Summary: Original commit: 8f62730 / D29217

The remainder of the summary and the test plan repeat the D29217 commit message above.

Jira: DB-7120

Reviewers: amitanand, esheng, rthallam, mlillibridge
Reviewed By: mlillibridge
Subscribers: mlillibridge, bogdan, ycdcxcluster, yql, ybase, amalyshev, jhe, hsunder
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D30823

…ggregation

Summary: Original commit: 8f62730 / D29217

The remainder of the summary and the test plan repeat the D29217 commit message above.

Jira: DB-7120

lint

Reviewers: amitanand, esheng, rthallam, mlillibridge
Reviewed By: mlillibridge
Subscribers: mlillibridge, bogdan, ycdcxcluster, yql, ybase, amalyshev, jhe, hsunder
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D30971

yusong-yan added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Jul 4, 2023

yusong-yan self-assigned this Jul 4, 2023

yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jul 4, 2023

yusong-yan changed the title ~~[DocDB] Reconstruct metric aggregation~~ [DocDB] Reconstruct prometheus metric aggregation Jul 4, 2023

yusong-yan mentioned this issue Jul 5, 2023

[DocDB] Improve Prometheus endpoint #18010

Closed

3 tasks

yugabyte-ci added kind/enhancement This is an enhancement of an existing feature and removed status/awaiting-triage Issue awaiting triage kind/bug This issue is a bug labels Jul 5, 2023

yusong-yan added 2.14 Backport Required 2.16 Backport Required 2.18 Backport Required labels Sep 6, 2023

rthallamko3 mentioned this issue Sep 11, 2023

[DocDB] Changes to /prometheus-metrics endpoint for improving performance #13265

Closed

rthallamko3 added the 2.20 Backport Required label Oct 12, 2023

yusong-yan mentioned this issue Nov 14, 2023

[DocDB] New prometheus-metrics API to fully utilize metric server aggregation #19943

Closed

1 task

rthallamko3 removed 2.14 Backport Required 2.16 Backport Required labels Nov 30, 2023

rthallamko3 changed the title ~~[DocDB] Reconstruct prometheus metric aggregation~~ [DocDB] Emit server level aggregates for metrics not in the table-level whitelist Dec 8, 2023

rthallamko3 closed this as completed Dec 19, 2023
