resource_control: add Prometheus metrics for RU consumption and bypassed requests#1929
YuhaoZhang00 wants to merge 3 commits into tikv:master
Add a `tikv_client_go_resource_control_ru_total` counter that records RRU and WRU consumed per request, labeled by resource group, request source, and RU type. This fills an observability gap where client-go computes RU but never exposes it: all existing RU metrics live on the PD server side after aggregation, with no per-source breakdown.

Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
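As a rough illustration of the label scheme described above, the sketch below accumulates RU per (resource group, source, type) using only the Go standard library. The `labeledCounter` type and the `recordRU` helper are hypothetical stand-ins for a Prometheus counter vector and for the interceptor hook; they are assumptions for illustration and are not taken from this PR's code.

```go
package main

import (
	"fmt"
	"sync"
)

// ruKey mirrors the three labels named in the commit message.
type ruKey struct {
	resourceGroup string
	source        string
	ruType        string // "rru" or "wru"
}

// labeledCounter is a hypothetical stand-in for a Prometheus counter vector.
type labeledCounter struct {
	mu     sync.Mutex
	values map[ruKey]float64
}

func newLabeledCounter() *labeledCounter {
	return &labeledCounter{values: make(map[ruKey]float64)}
}

// add increases the value for one label combination. Counters only grow,
// so non-positive deltas are ignored.
func (c *labeledCounter) add(k ruKey, delta float64) {
	if delta <= 0 {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[k] += delta
}

// get returns the accumulated value for one label combination.
func (c *labeledCounter) get(k ruKey) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.values[k]
}

// recordRU is a hypothetical hook that records the read and write RU of one request.
func recordRU(c *labeledCounter, group, source string, rru, wru float64) {
	c.add(ruKey{group, source, "rru"}, rru)
	c.add(ruKey{group, source, "wru"}, wru)
}

func main() {
	c := newLabeledCounter()
	recordRU(c, "default", "external_Select", 1.5, 0)
	recordRU(c, "default", "external_Insert", 0.5, 2)
	recordRU(c, "default", "external_Select", 2.5, 0)
	fmt.Println(c.get(ruKey{"default", "external_Select", "rru"}))
}
```

Keeping `source` as a label is what gives the per-source breakdown; the trade-off is one time series per distinct (group, source, type) combination.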
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver for each of the affected files.
Walkthrough
This PR adds resource control metrics instrumentation to the TiKV client request flow. It introduces Prometheus metrics that track RU (Request Unit) consumption, split into read (RRU) and write (WRU) units, as well as bypassed requests, and records them during request and response handling in both the synchronous and asynchronous paths.
…terceptor

Add a new `bypassed_request_total` counter to track requests that bypass resource control. Update `getResourceControlInfo` to return request info for bypassed requests so the interceptor can count them.

Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
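A minimal sketch of the counting path this commit describes. The real return type of `getResourceControlInfo` is not shown in this conversation, so `requestInfo`, its `bypass` flag, and the `interceptor` type below are hypothetical names used only for illustration; a plain map stands in for the Prometheus counter.

```go
package main

import "fmt"

// requestInfo is a hypothetical summary of what the interceptor learns about a request.
type requestInfo struct {
	resourceGroup string
	source        string
	bypass        bool // true when the request skips resource control
}

// bypassKey mirrors the two labels of the bypassed-request counter.
type bypassKey struct {
	resourceGroup string
	source        string
}

// interceptor keeps a plain map as a stand-in for a counter vector.
// It is not safe for concurrent use; a real counter would be.
type interceptor struct {
	bypassed map[bypassKey]int
}

// handle reports whether the request should go through resource control,
// counting it first when it does not.
func (i *interceptor) handle(info requestInfo) bool {
	if info.bypass {
		i.bypassed[bypassKey{info.resourceGroup, info.source}]++
		return false
	}
	return true
}

func main() {
	i := &interceptor{bypassed: make(map[bypassKey]int)}
	i.handle(requestInfo{"default", "internal_others", true})
	i.handle(requestInfo{"default", "external_Select", false})
	fmt.Println(i.bypassed[bypassKey{"default", "internal_others"}])
}
```

The point of returning request info even for bypassed requests is visible in `handle`: without the group and source, the bypass branch would have nothing to label the count with.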
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@metrics/metrics.go`:
- Around line 985-1001: The counters TiKVResourceControlRUCounter and
TiKVResourceControlBypassedCounter currently hardcode Subsystem:
"resource_control" which yields wrong metric names; update their CounterOpts to
use Subsystem: subsystem and move "resource_control" into the Name field so the
metrics become tikv_client_go_resource_control_ru_total and
tikv_client_go_resource_control_bypassed_request_total (i.e., set Name for
TiKVResourceControlRUCounter to "resource_control_ru_total" and for
TiKVResourceControlBypassedCounter to
"resource_control_bypassed_request_total"), leaving the rest (Help, ConstLabels,
label names LblResourceGroup/LblSource/LblType) unchanged.
Files selected for processing (2)
- internal/client/client_interceptor.go
- metrics/metrics.go
    TiKVResourceControlRUCounter = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Namespace:   namespace,
            Subsystem:   "resource_control",
            Name:        "ru_total",
            Help:        "Counter of resource units consumed by requests going through resource control.",
            ConstLabels: constLabels,
        }, []string{LblResourceGroup, LblSource, LblType})

    TiKVResourceControlBypassedCounter = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Namespace:   namespace,
            Subsystem:   "resource_control",
            Name:        "bypassed_request_total",
            Help:        "Counter of requests that bypassed resource control.",
            ConstLabels: constLabels,
        }, []string{LblResourceGroup, LblSource})
Analysis chain
Script executed:

    #!/bin/bash
    set -euo pipefail
    # Expect: default metrics use namespace=tikv and subsystem=client_go.
    rg -n 'initMetrics\("tikv",\s*"client_go",\s*nil\)' metrics/metrics.go
    # Expect: the new counters override that subsystem with "resource_control".
    rg -n 'TiKVResourceControl(RU|Bypassed)Counter|Subsystem:\s+"resource_control"|Name:\s+"(ru_total|bypassed_request_total)"' metrics/metrics.go

Repository: tikv/client-go
Length of output: 665
Change Subsystem and Name in these CounterOpts to match the tikv_client_go_* naming pattern.
With the default initMetrics("tikv", "client_go", nil), hardcoding Subsystem: "resource_control" exports tikv_resource_control_{ru,bypassed_request}_total instead of the expected tikv_client_go_resource_control_* naming. This breaks consistency with other metrics. Use Subsystem: subsystem and move resource_control into Name.
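The naming effect described above can be sketched with a small standard-library helper. `fqName` is a hypothetical function written for this illustration; it is assumed to mirror how the Prometheus Go client joins `Namespace`, `Subsystem`, and `Name` with underscores (the behaviour of `prometheus.BuildFQName`), and the inputs use the default `initMetrics("tikv", "client_go", nil)` values.

```go
package main

import (
	"fmt"
	"strings"
)

// fqName joins the non-empty parts with "_". An empty name yields an empty result.
// This is a stand-in written for illustration, not the client library itself.
func fqName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	// As submitted: the hardcoded subsystem replaces "client_go".
	fmt.Println(fqName("tikv", "resource_control", "ru_total"))
	// As proposed: keep the configured subsystem and prefix the name instead.
	fmt.Println(fqName("tikv", "client_go", "resource_control_ru_total"))
}
```

The first call yields `tikv_resource_control_ru_total` and the second `tikv_client_go_resource_control_ru_total`, which is the name the PR description advertises.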
Proposed fix

     TiKVResourceControlRUCounter = prometheus.NewCounterVec(
         prometheus.CounterOpts{
             Namespace:   namespace,
    -        Subsystem:   "resource_control",
    -        Name:        "ru_total",
    +        Subsystem:   subsystem,
    +        Name:        "resource_control_ru_total",
             Help:        "Counter of resource units consumed by requests going through resource control.",
             ConstLabels: constLabels,
         }, []string{LblResourceGroup, LblSource, LblType})

     TiKVResourceControlBypassedCounter = prometheus.NewCounterVec(
         prometheus.CounterOpts{
             Namespace:   namespace,
    -        Subsystem:   "resource_control",
    -        Name:        "bypassed_request_total",
    +        Subsystem:   subsystem,
    +        Name:        "resource_control_bypassed_request_total",
             Help:        "Counter of requests that bypassed resource control.",
             ConstLabels: constLabels,
         }, []string{LblResourceGroup, LblSource})
Summary
- `resource_control_ru_total` counter with `{resource_group, source, type}` labels to track RU consumption (RRU/WRU) by source at the client-go interceptor level.
- `resource_control_bypassed_request_total` counter with `{resource_group, source}` labels to count requests that bypass resource control (e.g., `InternalTxnOthers`).

Why
There is currently no per-source RU breakdown at the client side. PD receives aggregated consumption without source attribution. These metrics enable:

Metric
`tikv_client_go_resource_control_ru_total{resource_group, source, type}`
- `resource_group`: resource group name (e.g. `default`)
- `type`: `rru` or `wru`
- `source`: request source string (e.g. `leader_external_Select`, `internal_gc`, `external_Insert`)

`tikv_client_go_resource_control_bypassed_request_total{resource_group, source}`
- `resource_group`: resource group name (e.g. `default`)
- `source`: request source string of the bypassed request (e.g. `internal_others`)

Tests
Local test: ran INSERT + SELECT workloads, then checked TiDB's `/metrics` endpoint: