resource_control: add Prometheus metrics for RU consumption and bypassed requests#1929

Open
YuhaoZhang00 wants to merge 3 commits into tikv:master from YuhaoZhang00:rc/resource-control-metrics
@YuhaoZhang00 YuhaoZhang00 commented Mar 30, 2026

Summary

  • Add resource_control_ru_total counter with {resource_group, source, type} labels to track RU consumption (RRU/WRU) by source at the client-go interceptor level.
  • Add resource_control_bypassed_request_total counter with {resource_group, source} labels to count requests that bypass resource control (e.g., InternalTxnOthers).

Why

There is currently no per-source RU breakdown at the client side. PD receives aggregated consumption without source attribution. These metrics enable:

  • Identifying which request sources consume the most RU during oncall/debugging.
  • Observing how many requests bypass resource control.

Metric

tikv_client_go_resource_control_ru_total{resource_group, source, type}

  • resource_group: resource group name (e.g. default)
  • type: rru or wru
  • source: request source string (e.g. leader_external_Select, internal_gc, external_Insert)

tikv_client_go_resource_control_bypassed_request_total{resource_group, source}

  • resource_group: resource group name (e.g. default)
  • source: request source string of the bypassed request (e.g. internal_others)

Tests

Local test: ran INSERT + SELECT workloads, then checked TiDB's /metrics endpoint:

$ curl -s http://127.0.0.1:10080/metrics | grep 'ru_total{'
tidb_resource_control_ru_total{resource_group="default",source="external_Insert",type="wru"} 3.702734375
tidb_resource_control_ru_total{resource_group="default",source="internal_ddl",type="wru"} 3.4556640625
tidb_resource_control_ru_total{resource_group="default",source="internal_DDLNotifier",type="wru"} 19.392578124999996
tidb_resource_control_ru_total{resource_group="default",source="leader_external_Select",type="rru"} 0.6286392389322917
tidb_resource_control_ru_total{resource_group="default",source="leader_internal_ddl",type="rru"} 17.949227671549473
tidb_resource_control_ru_total{resource_group="default",source="leader_internal_DDLNotifier",type="rru"} 8.820397371744791
tidb_resource_control_ru_total{resource_group="default",source="leader_internal_DistTask",type="rru"} 141.66855745214764
tidb_resource_control_ru_total{resource_group="default",source="leader_internal_gc",type="rru"} 3.951403409505208
tidb_resource_control_ru_total{resource_group="default",source="leader_internal_stats",type="rru"} 25.386828162109392
tidb_resource_control_ru_total{resource_group="default",source="leader_internal_Timer",type="rru"} 1.0725829999999998
tidb_resource_control_ru_total{resource_group="default",source="leader_internal_TTL",type="rru"} 1.0018056666666666
tidb_resource_control_ru_total{resource_group="default",source="leader_unknown",type="rru"} 1.5273634033203125
tidb_resource_control_ru_total{resource_group="default",source="unknown",type="wru"} 14.598242187499999

$ curl -s http://127.0.0.1:10080/metrics | grep 'bypassed_request_total'
tidb_resource_control_bypassed_request_total{resource_group="default",source="internal_others"} 3
tidb_resource_control_bypassed_request_total{resource_group="default",source="leader_internal_others"} 194
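
For the oncall use case in the Why section, output like the above can be ranked by source with a few lines of Go. This is only a rough sketch, not part of the PR: `sumRUBySource` and `topSources` are hypothetical helper names, the parser handles only the simple `name{labels} value` line shape shown here (no timestamps, no escaped quotes), and it matches any metric whose name ends in `resource_control_ru_total` so the prefix does not matter.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strconv"
	"strings"
)

// sourceRe extracts the value of the source label from one exposition line.
var sourceRe = regexp.MustCompile(`source="([^"]*)"`)

// sumRUBySource adds up sample values per source label for every line whose
// metric name ends in "resource_control_ru_total" (hypothetical helper).
func sumRUBySource(metricsText string) map[string]float64 {
	totals := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(metricsText))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blanks and HELP/TYPE comments
		}
		brace := strings.IndexByte(line, '{')
		if brace < 0 || !strings.HasSuffix(line[:brace], "resource_control_ru_total") {
			continue
		}
		m := sourceRe.FindStringSubmatch(line)
		sp := strings.LastIndexByte(line, ' ')
		if m == nil || sp < 0 {
			continue
		}
		v, err := strconv.ParseFloat(line[sp+1:], 64)
		if err != nil {
			continue
		}
		totals[m[1]] += v // rru and wru of the same source are summed
	}
	return totals
}

// topSources returns source names ordered by descending total RU,
// breaking ties by name so the order is deterministic.
func topSources(totals map[string]float64) []string {
	names := make([]string, 0, len(totals))
	for name := range totals {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		if totals[names[i]] != totals[names[j]] {
			return totals[names[i]] > totals[names[j]]
		}
		return names[i] < names[j]
	})
	return names
}

func main() {
	sample := `tidb_resource_control_ru_total{resource_group="default",source="external_Insert",type="wru"} 3.702734375
tidb_resource_control_ru_total{resource_group="default",source="leader_internal_DistTask",type="rru"} 141.66855745214764
tidb_resource_control_ru_total{resource_group="default",source="leader_internal_gc",type="rru"} 3.951403409505208`
	totals := sumRUBySource(sample)
	for _, name := range topSources(totals) {
		fmt.Printf("%-28s %.3f\n", name, totals[name])
	}
}
```

In a real deployment the same ranking would more likely come from a Prometheus query over the scraped counters; the sketch just makes the label semantics concrete.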

Summary by CodeRabbit

  • New Features
    • Added metrics to monitor resource unit consumption, tracking both read and write usage by resource group and request source.
    • Added metrics to track resource control bypass events for improved visibility into request handling operations.

Add a `tikv_client_go_resource_control_ru_total` counter that records
RRU and WRU consumed per request, labeled by resource group, request
source, and RU type. This fills an observability gap where client-go
computes RU but never exposes it — all existing RU metrics live on the
PD server side after aggregation, with no per-source breakdown.

Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
@ti-chi-bot ti-chi-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has signed the dco. labels Mar 30, 2026
ti-chi-bot bot commented Mar 30, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign you06 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai bot commented Mar 30, 2026

📝 Walkthrough

Walkthrough

This PR adds resource control metrics instrumentation to the TiKV client request flow. It introduces Prometheus metrics to track RU (Request Unit) consumption, split into read (RRU) and write (WRU) units, and bypassed requests, recording these metrics during request and response handling in both synchronous and asynchronous paths.

Changes

Cohort / File(s) | Summary
Metrics Infrastructure: metrics/metrics.go | Adds two new Prometheus CounterVec metrics (TiKVResourceControlRUCounter for RU consumption tracking with resource_group/source/type labels, TiKVResourceControlBypassedCounter for bypass events with resource_group/source labels) and introduces the LblResourceGroup label constant.
Request Interceptor Instrumentation: internal/client/client_interceptor.go | Introduces recordResourceControlMetrics() helper to record RU consumption metrics. Instruments SendRequest() and SendRequestAsync() to invoke metric recording after request/response waits and increments bypass counter when requests bypass resource control checks. Updates getResourceControlInfo() to return resource group name and bypass info for metric recording on bypass paths.
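
The flow summarized above might look roughly like the following. This is a standard-library-only sketch, not the PR's code: `counterVec` is a stand-in for a labeled Prometheus counter, and `requestInfo`, `consumption`, `rcMetrics`, `recordRU`, and `sendRequest` are hypothetical names. Skipping zero-valued RU samples is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// counterVec is a stand-in for a labeled counter so the sketch needs only
// the standard library. It is not the client_golang CounterVec type.
type counterVec struct {
	mu     sync.Mutex
	values map[string]float64
}

func newCounterVec() *counterVec {
	return &counterVec{values: make(map[string]float64)}
}

// add increments the series identified by the given label values.
func (c *counterVec) add(delta float64, labels ...string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[strings.Join(labels, "|")] += delta
}

// get reads the current value of one series (0 if it was never touched).
func (c *counterVec) get(labels ...string) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.values[strings.Join(labels, "|")]
}

// requestInfo and consumption are hypothetical, simplified shapes.
type requestInfo struct {
	group, source string
	bypass        bool
}

type consumption struct{ rru, wru float64 }

type rcMetrics struct{ ru, bypassed *counterVec }

// recordRU adds the consumed RRU/WRU under {resource_group, source, type}.
// Zero values are skipped here (an assumption, to avoid empty series).
func recordRU(m *rcMetrics, req requestInfo, c consumption) {
	if c.rru > 0 {
		m.ru.add(c.rru, req.group, req.source, "rru")
	}
	if c.wru > 0 {
		m.ru.add(c.wru, req.group, req.source, "wru")
	}
}

// sendRequest counts bypassed requests and records RU for all others
// once the response (modeled by do) has returned.
func sendRequest(m *rcMetrics, req requestInfo, do func() consumption) {
	if req.bypass {
		m.bypassed.add(1, req.group, req.source)
		do()
		return
	}
	recordRU(m, req, do())
}

func main() {
	m := &rcMetrics{ru: newCounterVec(), bypassed: newCounterVec()}
	sendRequest(m, requestInfo{group: "default", source: "external_Insert"},
		func() consumption { return consumption{wru: 1.5} })
	sendRequest(m, requestInfo{group: "default", source: "internal_others", bypass: true},
		func() consumption { return consumption{} })
	fmt.Println(m.ru.get("default", "external_Insert", "wru"))
	fmt.Println(m.bypassed.get("default", "internal_others"))
}
```

The asynchronous path would presumably call the same recording helper from its completion callback instead of inline.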

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • ruv2: align bypass with ru v1 #1926: Shares modifications to SendRequest/SendRequestAsync request paths and reqInfo.Bypass() logic for handling resource control bypass decisions.

Suggested labels

lgtm, approved

Suggested reviewers

  • nolouch
  • ekexium

Poem

🐰 Hop along the metric trail,
RUs and bypasses now reveal their tale,
Prometheus charts dance in the light,
Resource control metrics shining bright!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.
Title check | ✅ Passed | The title directly and clearly describes the main change: adding Prometheus metrics for resource control with focus on RU consumption and bypassed requests.

@ti-chi-bot ti-chi-bot bot added contribution This PR is from a community contributor. first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. labels Mar 30, 2026
ti-chi-bot bot commented Mar 30, 2026

Welcome @YuhaoZhang00!

It looks like this is your first PR to tikv/client-go 🎉.

I'm the bot to help you request reviewers, add labels and more, See available commands.

We want to make sure your contribution gets all the attention it needs!



Thank you, and welcome to tikv/client-go. 😃

@ti-chi-bot ti-chi-bot bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 30, 2026
…terceptor

Add a new bypassed_request_total counter to track requests that bypass
resource control. Update getResourceControlInfo to return request info
for bypassed requests so the interceptor can count them.

Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
@YuhaoZhang00 YuhaoZhang00 changed the title resource_control: add Prometheus metric for RU consumption by source resource_control: add Prometheus metrics for RU consumption and bypassed requests Mar 30, 2026
@YuhaoZhang00 YuhaoZhang00 marked this pull request as ready for review March 31, 2026 03:00
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 31, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@metrics/metrics.go`:
- Around line 985-1001: The counters TiKVResourceControlRUCounter and
TiKVResourceControlBypassedCounter currently hardcode Subsystem:
"resource_control" which yields wrong metric names; update their CounterOpts to
use Subsystem: subsystem and move "resource_control" into the Name field so the
metrics become tikv_client_go_resource_control_ru_total and
tikv_client_go_resource_control_bypassed_request_total (i.e., set Name for
TiKVResourceControlRUCounter to "resource_control_ru_total" and for
TiKVResourceControlBypassedCounter to
"resource_control_bypassed_request_total"), leaving the rest (Help, ConstLabels,
label names LblResourceGroup/LblSource/LblType) unchanged.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bd655b94-3baa-438a-a3ab-39011ae52513

📥 Commits

Reviewing files that changed from the base of the PR and between 282ada6 and 7c31694.

📒 Files selected for processing (2)
  • internal/client/client_interceptor.go
  • metrics/metrics.go

Comment on lines +985 to +1001
TiKVResourceControlRUCounter = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Namespace:   namespace,
        Subsystem:   "resource_control",
        Name:        "ru_total",
        Help:        "Counter of resource units consumed by requests going through resource control.",
        ConstLabels: constLabels,
    }, []string{LblResourceGroup, LblSource, LblType})

TiKVResourceControlBypassedCounter = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Namespace:   namespace,
        Subsystem:   "resource_control",
        Name:        "bypassed_request_total",
        Help:        "Counter of requests that bypassed resource control.",
        ConstLabels: constLabels,
    }, []string{LblResourceGroup, LblSource})
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Expect: default metrics use namespace=tikv and subsystem=client_go.
rg -n 'initMetrics\("tikv",\s*"client_go",\s*nil\)' metrics/metrics.go

# Expect: the new counters override that subsystem with "resource_control".
rg -n 'TiKVResourceControl(RU|Bypassed)Counter|Subsystem:\s+"resource_control"|Name:\s+"(ru_total|bypassed_request_total)"' metrics/metrics.go

Repository: tikv/client-go

Length of output: 665


Change Subsystem and Name in these CounterOpts to match the tikv_client_go_* naming pattern.

With the default initMetrics("tikv", "client_go", nil), hardcoding Subsystem: "resource_control" exports tikv_resource_control_{ru,bypassed_request}_total instead of the expected tikv_client_go_resource_control_* naming. This breaks consistency with other metrics. Use Subsystem: subsystem and move resource_control into Name.
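
To make the naming effect concrete, here is a small standard-library sketch of how the fully qualified name is assembled. `buildFQName` is a local re-implementation that follows the joining rule of `prometheus.BuildFQName` in client_golang (non-empty namespace, subsystem, and name joined with underscores); it is written out here only so the example runs without the Prometheus dependency.

```go
package main

import "fmt"

// buildFQName follows the joining rule of prometheus.BuildFQName:
// an empty name yields "", otherwise the non-empty parts among
// namespace and subsystem are prefixed to name with "_".
func buildFQName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	switch {
	case namespace != "" && subsystem != "":
		return namespace + "_" + subsystem + "_" + name
	case namespace != "":
		return namespace + "_" + name
	case subsystem != "":
		return subsystem + "_" + name
	}
	return name
}

func main() {
	// Hardcoded subsystem, as in the current diff.
	fmt.Println(buildFQName("tikv", "resource_control", "ru_total"))
	// Configured subsystem with resource_control moved into Name.
	fmt.Println(buildFQName("tikv", "client_go", "resource_control_ru_total"))
}
```

The first call prints `tikv_resource_control_ru_total` and the second prints `tikv_client_go_resource_control_ru_total`. The same rule would also account for the `tidb_resource_control_*` names in the local test output, assuming the embedding application initializes the metrics with a `tidb` namespace.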

🔧 Proposed fix
  TiKVResourceControlRUCounter = prometheus.NewCounterVec(
      prometheus.CounterOpts{
          Namespace:   namespace,
-         Subsystem:   "resource_control",
-         Name:        "ru_total",
+         Subsystem:   subsystem,
+         Name:        "resource_control_ru_total",
          Help:        "Counter of resource units consumed by requests going through resource control.",
          ConstLabels: constLabels,
      }, []string{LblResourceGroup, LblSource, LblType})

  TiKVResourceControlBypassedCounter = prometheus.NewCounterVec(
      prometheus.CounterOpts{
          Namespace:   namespace,
-         Subsystem:   "resource_control",
-         Name:        "bypassed_request_total",
+         Subsystem:   subsystem,
+         Name:        "resource_control_bypassed_request_total",
          Help:        "Counter of requests that bypassed resource control.",
          ConstLabels: constLabels,
      }, []string{LblResourceGroup, LblSource})
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
TiKVResourceControlRUCounter = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Namespace:   namespace,
        Subsystem:   subsystem,
        Name:        "resource_control_ru_total",
        Help:        "Counter of resource units consumed by requests going through resource control.",
        ConstLabels: constLabels,
    }, []string{LblResourceGroup, LblSource, LblType})

TiKVResourceControlBypassedCounter = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Namespace:   namespace,
        Subsystem:   subsystem,
        Name:        "resource_control_bypassed_request_total",
        Help:        "Counter of requests that bypassed resource control.",
        ConstLabels: constLabels,
    }, []string{LblResourceGroup, LblSource})