New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DocDB] Emit server level aggregates for metrics not in the table-level whitelist #18078
Closed
1 task done
Labels
2.18 Backport Required
2.20 Backport Required
area/docdb
YugabyteDB core features
kind/enhancement
This is an enhancement of an existing feature
priority/medium
Medium priority issue
Comments
yusong-yan
added
area/docdb
YugabyteDB core features
status/awaiting-triage
Issue awaiting triage
labels
Jul 4, 2023
yugabyte-ci
added
kind/bug
This issue is a bug
priority/medium
Medium priority issue
labels
Jul 4, 2023
yusong-yan
changed the title
[DocDB] Reconstruct metric aggregation
[DocDB] Reconstruct prometheus metric aggregation
Jul 4, 2023
3 tasks
yugabyte-ci
added
kind/enhancement
This is an enhancement of an existing feature
and removed
status/awaiting-triage
Issue awaiting triage
kind/bug
This issue is a bug
labels
Jul 5, 2023
yusong-yan
added a commit
that referenced
this issue
Sep 5, 2023
Summary: **The issue with current metric aggregation:** 1. For each periodic metric collection, every aggregation level requires an entire scrape of all metrics. For example, if a collection requires some metrics on table level, server level, and stream levels, DB needs to loop through all metrics 3 times. 2. Not a good long-term API for future use cases. Every time we change this API, we have to write version-specific code for Prometheus url generation (our db versions are not linear) 3. CSV can become very long and hit url limits **New design:** 1. Redesigned metric level aggregation to only do one scraping for fetching server-level, stream-level, and table-level metrics. (Removed `entity_opts`) 2. Introduce new URL parameter regex filters. The caller can now specify regex (whitelist, blacklist) at each level. The same metric can be exposed at different levels as aggregation_level=”server” and aggregation_level=”table”. If there is a metric in both the whitelist and blacklist on the same level, we will prioritize the blacklist. - table_whitelist : Options( ALL, NONE, Regex), Default (ALL) - table_blacklist : Options( ALL, NONE, Regex), Default (NONE) - server_whitelist : Options( ALL, NONE, Regex), Default (ALL) - server_blacklist : Options( NONE, Regex), Default (NONE) 3. Old URL params `metrics`(CSV) and `priority_regex`(Regex) are still available, so the new version of the Prometheus endpoint will also be compatible with old versions of the platform. **Moreover, most of the change in D24598 has been reversed, since D24598 is built on top of the old metric aggregation design.** I will add another Gflag that limits the number of metrics scraped for each call in another diff. Jira: DB-7120 Test Plan: Rewrote `MetricsTest.AggregationTest` and `MetricsTest.TestStreamLevelAggregation` to verify the current metric aggregation. Created a universe with a single table, then run `yb-admin create_cdc_stream tablename` twice, then ensured two separate entries for `async_replication_committed_lag_micros` with different stream ids. Reviewers: amitanand, rthallam, arybochkin, jhe, esheng Reviewed By: amitanand, arybochkin Subscribers: yql, ycdcxcluster, ybase, bogdan Differential Revision: https://phorge.dev.yugabyte.com/D26662
es1024
added a commit
that referenced
this issue
Sep 19, 2023
…ic scrape" and "[#18078] docdb : Reconstruct prometheus metric aggregation" Summary: This reverts commits e732233 and 67aba99. 67aba99 breaks the `/prometheus-metrics` endpoint when `priority_regex` (or equivalently, `table_whitelist`) is set as follows: - No table-level aggregated metrics are in the output - As a result of server_whitelist=ALL defaults, server-level aggregates for all metrics are in the output As a result, we get an output where only server-level aggregates are present, regardless of what `priority_regex`/`table_whitelist` are set to. e732233 makes changes dependent on 67aba99, so reverting that as well for the time being. Jira: DB-7130, DB-7120 Test Plan: Jenkins Reviewers: bogdan, rthallam Reviewed By: bogdan Subscribers: yql, ycdcxcluster, ybase Differential Revision: https://phorge.dev.yugabyte.com/D28606
1 task
Evaluating backports for this diff. 2.20 backport is a must. 2.18 is optional (depending on the complexity of the backport). |
yusong-yan
added a commit
that referenced
this issue
Nov 30, 2023
Summary: **The issue with current /prometheus-metric endpoint:** - YBA uses URL param `priority_regex` as a table whitelist to reduce the amount of output, and the downside is, for those tablet or table metrics that are not in the regex list, the endpoint will not scrape them at all. As we keep adding metrics in the future, new metrics will never get returned from the endpoint unless YBA modify the priority_regex. This causes trouble for debugging in many situations. - Currently, we have server aggregation code, but it is not being used due to its performance issue. It requires two scrapes to fetch all server-level metrics + table-level metrics. It also has some consistency errors – the server-level aggregation is done at a separate time from the normal metrics and thus may not actually be an aggregation of the returned table-level metrics. **Fix:** - Redesigned metric level aggregation to only do one scraping for fetching server-level, stream-level, and table-level metrics(Removed `entity_opts`). Also fixed the consistency problem with the previous server-level aggregation code. **New Output** - We are only generating server-level timeseries for tablet or table metrics that are filtered out by `priority_regex`. This means the endpoint output will have more tablet and table timeseries on the server level. @mlillibridge tested with 3000 tablets, compared with the current endpoint output, the new output contains 539 new timeseries(Master+TServer) in normal mode. - The reason we only partially turn on server-level aggregation is to not break YBA and YBM charts: Even the new /prometheus endpoint is capable of outputting metrics on both server and table levels, the current and older versions of YBA can't distinguish between the same metric on the table and server level, so generating server-level time series will cause a double-counting issue for YBA charts. The new output should be completely transparent to YBA and YBM as they should have no references in charts or alerts to the new server-level timeseries. - The new output can immediately help DB debugging as many metrics that were not collected before will start appearing in the /prometheus output. **Others:** - Dropped URL parameter `max_tables_metrics_breakdowns`. Since it was not used, there is no impact to delete it. It is still necessary to have a global limit to prevent egregious cases, so I will have another diff for it. - Dropped URL parameters `exclude_metrics`, `server_exclude_metrics`, `server_metrics`, `server_priority_regex`. Those were introduced from D19147 for the server aggregation. Since they are not used by YBA and YBM, and we are switching to a new server aggregation design, it's safe to remove them. - Most of the change in D24598 has been reverted since D24598 was built on top of the old metric aggregation design. - Removed `external_json_metrics_cbs_` and `external_prometheus_metrics_cbs_` from MetricEntity, since they are not used anywhere. - Previously, we had some issues with this project, so D28606 reverted all the changes. For clean revert, D28606 also had to revert another diff D27535, which fixes wrong metric_type. This diff moves the metric_type fix D27535 back. Jira: DB-7120 Test Plan: Jenkins New unit test, `MetricsTest.PrometheusMetricFilterTest` Rewrote `MetricsTest.AggregationTest` and `MetricsTest.TestStreamLevelAggregation` to verify the new metric aggregation. Manually compared the current and new /prometheus-metrics outputs by running a customized [[ https://phorge.dev.yugabyte.com/P91 | python script ]] Reviewers: amitanand, esheng, rthallam, mlillibridge Reviewed By: mlillibridge Subscribers: hsunder, jhe, amalyshev, ybase, yql, ycdcxcluster, bogdan Differential Revision: https://phorge.dev.yugabyte.com/D29217
rthallamko3
changed the title
[DocDB] Reconstruct prometheus metric aggregation
[DocDB] Emit server level aggregates for metrics not in the table-level whitelist
Dec 8, 2023
yusong-yan
added a commit
that referenced
this issue
Dec 18, 2023
…ggregation Summary: Original commit: 8f62730 / D29217 **The issue with current /prometheus-metric endpoint:** - YBA uses URL param `priority_regex` as a table whitelist to reduce the amount of output, and the downside is, for those tablet or table metrics that are not in the regex list, the endpoint will not scrape them at all. As we keep adding metrics in the future, new metrics will never get returned from the endpoint unless YBA modify the priority_regex. This causes trouble for debugging in many situations. - Currently, we have server aggregation code, but it is not being used due to its performance issue. It requires two scrapes to fetch all server-level metrics + table-level metrics. It also has some consistency errors – the server-level aggregation is done at a separate time from the normal metrics and thus may not actually be an aggregation of the returned table-level metrics. **Fix:** - Redesigned metric level aggregation to only do one scraping for fetching server-level, stream-level, and table-level metrics(Removed `entity_opts`). Also fixed the consistency problem with the previous server-level aggregation code. **New Output** - We are only generating server-level timeseries for tablet or table metrics that are filtered out by `priority_regex`. This means the endpoint output will have more tablet and table timeseries on the server level. @mlillibridge tested with 3000 tablets, compared with the current endpoint output, the new output contains 539 new timeseries(Master+TServer) in normal mode. - The reason we only partially turn on server-level aggregation is to not break YBA and YBM charts: Even the new /prometheus endpoint is capable of outputting metrics on both server and table levels, the current and older versions of YBA can't distinguish between the same metric on the table and server level, so generating server-level time series will cause a double-counting issue for YBA charts. The new output should be completely transparent to YBA and YBM as they should have no references in charts or alerts to the new server-level timeseries. - The new output can immediately help DB debugging as many metrics that were not collected before will start appearing in the /prometheus output. **Others:** - Dropped URL parameter `max_tables_metrics_breakdowns`. Since it was not used, there is no impact to delete it. It is still necessary to have a global limit to prevent egregious cases, so I will have another diff for it. - Dropped URL parameters `exclude_metrics`, `server_exclude_metrics`, `server_metrics`, `server_priority_regex`. Those were introduced from D19147 for the server aggregation. Since they are not used by YBA and YBM, and we are switching to a new server aggregation design, it's safe to remove them. - Most of the change in D24598 has been reverted since D24598 was built on top of the old metric aggregation design. - Removed `external_json_metrics_cbs_` and `external_prometheus_metrics_cbs_` from MetricEntity, since they are not used anywhere. - Previously, we had some issues with this project, so D28606 reverted all the changes. For clean revert, D28606 also had to revert another diff D27535, which fixes wrong metric_type. This diff moves the metric_type fix D27535 back. Jira: DB-7120 Test Plan: Jenkins New unit test, `MetricsTest.PrometheusMetricFilterTest` Rewrote `MetricsTest.AggregationTest` and `MetricsTest.TestStreamLevelAggregation` to verify the new metric aggregation. Manually compared the current and new /prometheus-metrics outputs by running a customized [[ https://phorge.dev.yugabyte.com/P91 | python script ]] Reviewers: amitanand, esheng, rthallam, mlillibridge Reviewed By: mlillibridge Subscribers: mlillibridge, bogdan, ycdcxcluster, yql, ybase, amalyshev, jhe, hsunder Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D30823
yusong-yan
added a commit
that referenced
this issue
Dec 19, 2023
…ggregation Summary: Original commit: 8f62730 / D29217 **The issue with current /prometheus-metric endpoint:** - YBA uses URL param `priority_regex` as a table whitelist to reduce the amount of output, and the downside is, for those tablet or table metrics that are not in the regex list, the endpoint will not scrape them at all. As we keep adding metrics in the future, new metrics will never get returned from the endpoint unless YBA modify the priority_regex. This causes trouble for debugging in many situations. - Currently, we have server aggregation code, but it is not being used due to its performance issue. It requires two scrapes to fetch all server-level metrics + table-level metrics. It also has some consistency errors – the server-level aggregation is done at a separate time from the normal metrics and thus may not actually be an aggregation of the returned table-level metrics. **Fix:** - Redesigned metric level aggregation to only do one scraping for fetching server-level, stream-level, and table-level metrics(Removed `entity_opts`). Also fixed the consistency problem with the previous server-level aggregation code. **New Output** - We are only generating server-level timeseries for tablet or table metrics that are filtered out by `priority_regex`. This means the endpoint output will have more tablet and table timeseries on the server level. @mlillibridge tested with 3000 tablets, compared with the current endpoint output, the new output contains 539 new timeseries(Master+TServer) in normal mode. - The reason we only partially turn on server-level aggregation is to not break YBA and YBM charts: Even the new /prometheus endpoint is capable of outputting metrics on both server and table levels, the current and older versions of YBA can't distinguish between the same metric on the table and server level, so generating server-level time series will cause a double-counting issue for YBA charts. The new output should be completely transparent to YBA and YBM as they should have no references in charts or alerts to the new server-level timeseries. - The new output can immediately help DB debugging as many metrics that were not collected before will start appearing in the /prometheus output. **Others:** - Dropped URL parameter `max_tables_metrics_breakdowns`. Since it was not used, there is no impact to delete it. It is still necessary to have a global limit to prevent egregious cases, so I will have another diff for it. - Dropped URL parameters `exclude_metrics`, `server_exclude_metrics`, `server_metrics`, `server_priority_regex`. Those were introduced from D19147 for the server aggregation. Since they are not used by YBA and YBM, and we are switching to a new server aggregation design, it's safe to remove them. - Most of the change in D24598 has been reverted since D24598 was built on top of the old metric aggregation design. - Removed `external_json_metrics_cbs_` and `external_prometheus_metrics_cbs_` from MetricEntity, since they are not used anywhere. - Previously, we had some issues with this project, so D28606 reverted all the changes. For clean revert, D28606 also had to revert another diff D27535, which fixes wrong metric_type. This diff moves the metric_type fix D27535 back. Jira: DB-7120 lint Test Plan: Jenkins New unit test, `MetricsTest.PrometheusMetricFilterTest` Rewrote `MetricsTest.AggregationTest` and `MetricsTest.TestStreamLevelAggregation` to verify the new metric aggregation. Manually compared the current and new /prometheus-metrics outputs by running a customized [[ https://phorge.dev.yugabyte.com/P91 | python script ]] Reviewers: amitanand, esheng, rthallam, mlillibridge Reviewed By: mlillibridge Subscribers: mlillibridge, bogdan, ycdcxcluster, yql, ybase, amalyshev, jhe, hsunder Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D30971
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
2.18 Backport Required
2.20 Backport Required
area/docdb
YugabyteDB core features
kind/enhancement
This is an enhancement of an existing feature
priority/medium
Medium priority issue
Jira Link: DB-7120
Description
Warning: Please confirm that this issue does not contain any sensitive information
The text was updated successfully, but these errors were encountered: