Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] Emit server level aggregates for metrics not in the table-level whitelist #18078

Closed
1 task done
yusong-yan opened this issue Jul 4, 2023 · 1 comment
Closed
1 task done
Assignees
Labels
2.18 Backport Required 2.20 Backport Required area/docdb YugabyteDB core features kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue

Comments

@yusong-yan
Copy link
Contributor

yusong-yan commented Jul 4, 2023

Jira Link: DB-7120

Description

  • YBA uses URL param priority_regex as a table whitelist to reduce the amount of output, and the downside is, for those tablet or table metrics that are not in the regex list, the endpoint will not scrape them at all. As we keep adding metrics in the future, new metrics will never get returned from the endpoint unless YBA modify the priority_regex. This causes trouble for debugging in many situations.
  • Currently, we have server aggregation code, but it is not being used due to its performance issue. It requires two scrapes to fetch all server-level metrics + table-level metrics. It also has some consistency errors – the server-level aggregation is done at a separate time from the normal metrics and thus may not actually be an aggregation of the returned table-level metrics.

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@yusong-yan yusong-yan added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Jul 4, 2023
@yusong-yan yusong-yan self-assigned this Jul 4, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jul 4, 2023
@yusong-yan yusong-yan changed the title [DocDB] Reconstruct metric aggregation [DocDB] Reconstruct prometheus metric aggregation Jul 4, 2023
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature and removed status/awaiting-triage Issue awaiting triage kind/bug This issue is a bug labels Jul 5, 2023
yusong-yan added a commit that referenced this issue Sep 5, 2023
Summary:
**The issue with current metric aggregation:**
1. For each periodic metric collection, every aggregation level requires an entire scrape of all metrics. For example, if a collection requires some metrics on table level, server level, and stream levels, DB needs to loop through all metrics 3 times.
2. Not a good long-term API for future use cases. Every time we change this API, we have to write version-specific code for Prometheus url generation (our db versions are not linear)
3. CSV can become very long and hit url limits

**New design:**
1. Redesigned metric level aggregation to only do one scraping for fetching server-level, stream-level, and table-level metrics. (Removed `entity_opts`)
2. Introduce new URL parameter regex filters. The caller can now specify regex (whitelist, blacklist) at each level. The same metric can be exposed at different levels as aggregation_level=”server” and aggregation_level=”table”. If there is a metric in both the whitelist and blacklist on the same level, we will prioritize the blacklist.

 - table_whitelist : Options( ALL, NONE, Regex), Default (ALL)

 - table_blacklist : Options( ALL, NONE, Regex), Default (NONE)

 - server_whitelist : Options( ALL, NONE, Regex), Default (ALL)

 - server_blacklist : Options( NONE, Regex), Default (NONE)
3. Old URL params `metrics`(CSV) and `priority_regex`(Regex) are still available, so the new version of the Prometheus endpoint will also be compatible with old versions of the platform.

**Moreover, most of the change in D24598 has been reversed, since D24598 is built on top of the old metric aggregation design.**

I will add another Gflag that limits the number of metrics scraped for each call in another diff.
Jira: DB-7120

Test Plan:
Rewrote `MetricsTest.AggregationTest` and `MetricsTest.TestStreamLevelAggregation` to verify the current metric aggregation.

Created a universe with a single table, then run `yb-admin create_cdc_stream tablename` twice,
then ensured two separate entries for `async_replication_committed_lag_micros` with different stream ids.

Reviewers: amitanand, rthallam, arybochkin, jhe, esheng

Reviewed By: amitanand, arybochkin

Subscribers: yql, ycdcxcluster, ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D26662
es1024 added a commit that referenced this issue Sep 19, 2023
…ic scrape" and "[#18078] docdb : Reconstruct prometheus metric aggregation"

Summary:
This reverts commits e732233 and 67aba99.

67aba99 breaks the `/prometheus-metrics` endpoint when
`priority_regex` (or equivalently, `table_whitelist`) is set as follows:
- No table-level aggregated metrics are in the output
- As a result of server_whitelist=ALL defaults, server-level aggregates for all metrics are in the output

As a result, we get an output where only server-level aggregates are present, regardless of what
`priority_regex`/`table_whitelist` are set to.

e732233 makes changes dependent on 67aba99,
so reverting that as well for the time being.
Jira: DB-7130, DB-7120

Test Plan: Jenkins

Reviewers: bogdan, rthallam

Reviewed By: bogdan

Subscribers: yql, ycdcxcluster, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D28606
@rthallamko3
Copy link
Contributor

Evaluating backports for this diff. 2.20 backport is a must. 2.18 is optional (depending on the complexity of the backport).

yusong-yan added a commit that referenced this issue Nov 30, 2023
Summary:
**The issue with current /prometheus-metric endpoint:**
-  YBA uses URL param `priority_regex` as a table whitelist to reduce the amount of output, and the downside is, for those tablet or table metrics that are not in the regex list, the endpoint will not scrape them at all. As we keep adding metrics in the future, new metrics will never get returned from the endpoint unless YBA modify the priority_regex. This causes trouble for debugging in many situations.
- Currently, we have server aggregation code, but it is not being used due to its performance issue. It requires two scrapes to fetch all server-level metrics + table-level metrics.  It also has some consistency errors – the server-level aggregation is done at a separate time from the normal metrics and thus may not actually be an aggregation of the returned table-level metrics.

**Fix:**
- Redesigned metric level aggregation to only do one scraping for fetching server-level, stream-level, and table-level metrics(Removed `entity_opts`). Also fixed the consistency problem with the previous server-level aggregation code.

**New Output**
- We are only generating server-level timeseries for tablet or table metrics that are filtered out by `priority_regex`. This means the endpoint output will have more tablet and table timeseries on the server level. @mlillibridge tested with 3000 tablets, compared with the current endpoint output, the new output contains 539 new timeseries(Master+TServer) in normal mode.
- The reason we only partially turn on server-level aggregation is to not break YBA and YBM charts: Even the new /prometheus endpoint is capable of outputting metrics on both server and table levels, the current and older versions of YBA can't distinguish between the same metric on the table and server level, so generating server-level time series will cause a double-counting issue for YBA charts. The new output should be completely transparent to YBA and YBM as they should have no references in charts or alerts to the new server-level timeseries.
- The new output can immediately help DB debugging as many metrics that were not collected before will start appearing in the /prometheus output.

**Others:**
- Dropped URL parameter `max_tables_metrics_breakdowns`. Since it was not used, there is no impact to delete it. It is still necessary to have a global limit to prevent egregious cases, so I will have another diff for it.
- Dropped URL parameters `exclude_metrics`, `server_exclude_metrics`, `server_metrics`, `server_priority_regex`. Those were introduced from D19147 for the server aggregation. Since they are not used by YBA and YBM, and we are switching to a new server aggregation design, it's safe to remove them.
- Most of the change in D24598 has been reverted since D24598 was built on top of the old metric aggregation design.
- Removed `external_json_metrics_cbs_` and `external_prometheus_metrics_cbs_` from MetricEntity, since they are not used anywhere.
- Previously, we had some issues with this project, so D28606 reverted all the changes. For clean revert, D28606 also had to revert another diff D27535, which fixes wrong metric_type. This diff moves the metric_type fix D27535 back.
Jira: DB-7120

Test Plan:
Jenkins

New unit test, `MetricsTest.PrometheusMetricFilterTest`
Rewrote `MetricsTest.AggregationTest` and `MetricsTest.TestStreamLevelAggregation` to verify the new metric aggregation.

Manually compared the current and new /prometheus-metrics outputs by running a customized [[ https://phorge.dev.yugabyte.com/P91 | python script ]]

Reviewers: amitanand, esheng, rthallam, mlillibridge

Reviewed By: mlillibridge

Subscribers: hsunder, jhe, amalyshev, ybase, yql, ycdcxcluster, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D29217
@rthallamko3 rthallamko3 changed the title [DocDB] Reconstruct prometheus metric aggregation [DocDB] Emit server level aggregates for metrics not in the table-level whitelist Dec 8, 2023
yusong-yan added a commit that referenced this issue Dec 18, 2023
…ggregation

Summary:
Original commit: 8f62730 / D29217
**The issue with current /prometheus-metric endpoint:**
-  YBA uses URL param `priority_regex` as a table whitelist to reduce the amount of output, and the downside is, for those tablet or table metrics that are not in the regex list, the endpoint will not scrape them at all. As we keep adding metrics in the future, new metrics will never get returned from the endpoint unless YBA modify the priority_regex. This causes trouble for debugging in many situations.
- Currently, we have server aggregation code, but it is not being used due to its performance issue. It requires two scrapes to fetch all server-level metrics + table-level metrics.  It also has some consistency errors – the server-level aggregation is done at a separate time from the normal metrics and thus may not actually be an aggregation of the returned table-level metrics.

**Fix:**
- Redesigned metric level aggregation to only do one scraping for fetching server-level, stream-level, and table-level metrics(Removed `entity_opts`). Also fixed the consistency problem with the previous server-level aggregation code.

**New Output**
- We are only generating server-level timeseries for tablet or table metrics that are filtered out by `priority_regex`. This means the endpoint output will have more tablet and table timeseries on the server level. @mlillibridge tested with 3000 tablets, compared with the current endpoint output, the new output contains 539 new timeseries(Master+TServer) in normal mode.
- The reason we only partially turn on server-level aggregation is to not break YBA and YBM charts: Even the new /prometheus endpoint is capable of outputting metrics on both server and table levels, the current and older versions of YBA can't distinguish between the same metric on the table and server level, so generating server-level time series will cause a double-counting issue for YBA charts. The new output should be completely transparent to YBA and YBM as they should have no references in charts or alerts to the new server-level timeseries.
- The new output can immediately help DB debugging as many metrics that were not collected before will start appearing in the /prometheus output.

**Others:**
- Dropped URL parameter `max_tables_metrics_breakdowns`. Since it was not used, there is no impact to delete it. It is still necessary to have a global limit to prevent egregious cases, so I will have another diff for it.
- Dropped URL parameters `exclude_metrics`, `server_exclude_metrics`, `server_metrics`, `server_priority_regex`. Those were introduced from D19147 for the server aggregation. Since they are not used by YBA and YBM, and we are switching to a new server aggregation design, it's safe to remove them.
- Most of the change in D24598 has been reverted since D24598 was built on top of the old metric aggregation design.
- Removed `external_json_metrics_cbs_` and `external_prometheus_metrics_cbs_` from MetricEntity, since they are not used anywhere.
- Previously, we had some issues with this project, so D28606 reverted all the changes. For clean revert, D28606 also had to revert another diff D27535, which fixes wrong metric_type. This diff moves the metric_type fix D27535 back.
Jira: DB-7120

Test Plan:
Jenkins

New unit test, `MetricsTest.PrometheusMetricFilterTest`
Rewrote `MetricsTest.AggregationTest` and `MetricsTest.TestStreamLevelAggregation` to verify the new metric aggregation.

Manually compared the current and new /prometheus-metrics outputs by running a customized [[ https://phorge.dev.yugabyte.com/P91 | python script ]]

Reviewers: amitanand, esheng, rthallam, mlillibridge

Reviewed By: mlillibridge

Subscribers: mlillibridge, bogdan, ycdcxcluster, yql, ybase, amalyshev, jhe, hsunder

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30823
yusong-yan added a commit that referenced this issue Dec 19, 2023
…ggregation

Summary:
Original commit: 8f62730 / D29217
**The issue with current /prometheus-metric endpoint:**
-  YBA uses URL param `priority_regex` as a table whitelist to reduce the amount of output, and the downside is, for those tablet or table metrics that are not in the regex list, the endpoint will not scrape them at all. As we keep adding metrics in the future, new metrics will never get returned from the endpoint unless YBA modify the priority_regex. This causes trouble for debugging in many situations.
- Currently, we have server aggregation code, but it is not being used due to its performance issue. It requires two scrapes to fetch all server-level metrics + table-level metrics.  It also has some consistency errors – the server-level aggregation is done at a separate time from the normal metrics and thus may not actually be an aggregation of the returned table-level metrics.

**Fix:**
- Redesigned metric level aggregation to only do one scraping for fetching server-level, stream-level, and table-level metrics(Removed `entity_opts`). Also fixed the consistency problem with the previous server-level aggregation code.

**New Output**
- We are only generating server-level timeseries for tablet or table metrics that are filtered out by `priority_regex`. This means the endpoint output will have more tablet and table timeseries on the server level. @mlillibridge tested with 3000 tablets, compared with the current endpoint output, the new output contains 539 new timeseries(Master+TServer) in normal mode.
- The reason we only partially turn on server-level aggregation is to not break YBA and YBM charts: Even the new /prometheus endpoint is capable of outputting metrics on both server and table levels, the current and older versions of YBA can't distinguish between the same metric on the table and server level, so generating server-level time series will cause a double-counting issue for YBA charts. The new output should be completely transparent to YBA and YBM as they should have no references in charts or alerts to the new server-level timeseries.
- The new output can immediately help DB debugging as many metrics that were not collected before will start appearing in the /prometheus output.

**Others:**
- Dropped URL parameter `max_tables_metrics_breakdowns`. Since it was not used, there is no impact to delete it. It is still necessary to have a global limit to prevent egregious cases, so I will have another diff for it.
- Dropped URL parameters `exclude_metrics`, `server_exclude_metrics`, `server_metrics`, `server_priority_regex`. Those were introduced from D19147 for the server aggregation. Since they are not used by YBA and YBM, and we are switching to a new server aggregation design, it's safe to remove them.
- Most of the change in D24598 has been reverted since D24598 was built on top of the old metric aggregation design.
- Removed `external_json_metrics_cbs_` and `external_prometheus_metrics_cbs_` from MetricEntity, since they are not used anywhere.
- Previously, we had some issues with this project, so D28606 reverted all the changes. For clean revert, D28606 also had to revert another diff D27535, which fixes wrong metric_type. This diff moves the metric_type fix D27535 back.
Jira: DB-7120

lint

Test Plan:
Jenkins

New unit test, `MetricsTest.PrometheusMetricFilterTest`
Rewrote `MetricsTest.AggregationTest` and `MetricsTest.TestStreamLevelAggregation` to verify the new metric aggregation.

Manually compared the current and new /prometheus-metrics outputs by running a customized [[ https://phorge.dev.yugabyte.com/P91 | python script ]]

Reviewers: amitanand, esheng, rthallam, mlillibridge

Reviewed By: mlillibridge

Subscribers: mlillibridge, bogdan, ycdcxcluster, yql, ybase, amalyshev, jhe, hsunder

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30971
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.18 Backport Required 2.20 Backport Required area/docdb YugabyteDB core features kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue
Projects
None yet
Development

No branches or pull requests

3 participants