
metrics: refine batch client metrics#1931

Merged
ti-chi-bot[bot] merged 23 commits into tikv:master from zyguan:dev/refine-batch-client-metrics
Apr 15, 2026

Conversation


@zyguan zyguan commented Mar 31, 2026

This PR refines the TiKV batch client metrics system:

  1. Stream-level metrics refinement: renames and restructures batch receive metrics to be stream-scoped.
    • Metrics such as TiKVBatchRecvLoopDuration become TiKVBatchStreamRecvLoopDuration, with additional labels (conn, forward) for finer granularity.
    • Adds four new counter metrics to track request flow through batch streams.
  2. Batch request stage latency refinement: renames TiKVBatchRequestDuration to TiKVBatchRequestStageDuration and adds a stage label to break down latency by request lifecycle stage.

Summary by CodeRabbit

  • New Features

    • Enhanced batch observability with per-store and per-connection metrics, per-stage (batched/send/recv) timing, outcome labels, and forwarding dimensions.
    • Added stream-level batch metrics and counters to track tracked/retired/completed/outdated responses.
    • Replaced legacy batch-duration metrics with a unified stage-duration model and removed deprecated shortcuts.
  • Tests

    • Added unit tests for stage timing, terminal outcome mapping, and emitted observations.

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@ti-chi-bot ti-chi-bot bot added dco-signoff: yes Indicates the PR's author has signed the dco. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 31, 2026

coderabbitai bot commented Mar 31, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Per-store batch-request stage metrics were added and wired through batch client paths: entries record atomic batched/sent/recv timestamps; per-store storeMetrics.batchReqStage observers replace global batch-duration observers; stream-level counters and per-client stream metrics were introduced and propagated via connection-index metadata.

Changes

  • Metric definitions (metrics/metrics.go, metrics/shortcuts.go): Replaced old batch recv/request metrics with stream-oriented recv metrics and new counters; introduced TiKVBatchRequestStageDuration labeled by store, stage, result; added LblConn, LblForward, LblStage; removed global batch-duration observer shortcuts.
  • Client core & store metrics (internal/client/client.go, internal/client/conn_pool.go): Client obtains per-store metrics via connPool.getStoreMetrics(storeID); removed connPool.updateRPCMetrics; storeMetrics now holds batchReqStage used for RPC and batch metrics.
  • Async client send (internal/client/client_async.go): SendRequestAsync attaches per-entry batchRequestMetrics, stores the stage timer in the entry, records completion via observeBatchRequestCompletion, and delegates RPC metrics updates to storeMetrics.
  • Batch command logic & instrumentation (internal/client/client_batch.go, internal/client/conn_batch.go): Added per-entry atomic timestamps (batchedNS, sentNS, recvNS), stage/outcome enums and helpers, ordered observation emission (including done-only-for-OK), per-client stream metrics, conn-index plumbing with batchConnIdxMetadataKey, and stage-aware timeout reason formatting.
  • Conn pool / client metrics init (internal/client/conn_pool.go, internal/client/conn_batch.go): Batch client metrics are initialized per connection index (initBatchCommandsClientMetrics(connIdx)); removed old recv-loop metric init; send instrumentation now records the batched stage via recordBatchRequestStage.
  • Tests (internal/client/client_test.go): Updated sendBatchRequest call sites for the new signature and added unit tests for stage recording, terminal outcome mapping, observation ordering, and timeout-reason normalization.
  • Other small edits (internal/client/client_async.go, import changes): Removed direct global metrics observer usage; metric access is routed via per-store/per-client metrics structs.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant ConnPool as ConnPool
    participant StoreMetrics as StoreMetrics
    participant BatchClient as BatchClient
    participant BatchEntry as BatchEntry
    participant Observer as Observer

    rect rgba(100,200,150,0.5)
        Note over Client,ConnPool: Start request — fetch per-store metrics
        Client->>ConnPool: sendRequest(req)
        ConnPool->>StoreMetrics: getStoreMetrics(storeID)
        ConnPool-->>Client: storeMetrics (includes batchReqStage)
    end

    rect rgba(150,150,200,0.5)
        Note over Client,BatchEntry: Create entry and record batched stage
        Client->>BatchEntry: create entry + attach storeMetrics.batchReqStage
        Client->>BatchEntry: recordBatchRequestStage(batched)
    end

    rect rgba(200,150,150,0.5)
        Note over Client,BatchClient: Send path records sent stage
        Client->>BatchClient: send via stream (connIdx metadata)
        BatchClient->>BatchEntry: recordBatchRequestStage(sent)
    end

    rect rgba(150,200,150,0.5)
        Note over BatchEntry,Observer: Recv path records recv and emits observations
        BatchClient->>BatchEntry: receive response -> recordBatchRequestStage(recv)
        BatchEntry->>Observer: visitBatchRequestObservations -> observe(stage,outcome,duration)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

size/XL

Suggested reviewers

  • cfzjywxk
  • ekexium

Poem

🐰 I nibbled logs and tracked each hop,
Batched, sent, received — I never stop.
Per-store I tally each stage and time,
Carrots for metrics, marshmallow rhyme,
Hopping through code, one small metric at a time.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 13.04%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): the title clearly describes the main objective: refining batch client metrics across stream-level and request-stage latency dimensions.


@zyguan zyguan force-pushed the dev/refine-batch-client-metrics branch from 80b2be9 to 723f512 Compare March 31, 2026 05:44
Signed-off-by: zyguan <zhongyangguan@gmail.com>
@zyguan zyguan force-pushed the dev/refine-batch-client-metrics branch from 723f512 to 4b701c5 Compare March 31, 2026 05:47

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/client/client.go`:
- Around line 342-346: The deferred call currently invokes the removed
connPool.updateRPCMetrics and must instead call the store-specific metrics
updater; capture elapsed as already done and replace
connPool.updateRPCMetrics(req, resp, elapsed) with
storeMetrics.updateRPCMetrics(req, resp, elapsed) so the sync path delegates to
the same storeMetrics implementation used by the async path (see
connPool.getStoreMetrics, start, req, resp, elapsed, and
storeMetrics.updateRPCMetrics).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 48eee403-94f7-4d36-bd24-b24f75b094c3

📥 Commits

Reviewing files that changed from the base of the PR and between 3805cb7 and 80b2be9.

📒 Files selected for processing (8)
  • internal/client/client.go
  • internal/client/client_async.go
  • internal/client/client_batch.go
  • internal/client/client_test.go
  • internal/client/conn_batch.go
  • internal/client/conn_pool.go
  • metrics/metrics.go
  • metrics/shortcuts.go
💤 Files with no reviewable changes (1)
  • metrics/shortcuts.go

Comment thread internal/client/client.go Outdated

zyguan commented Mar 31, 2026

/retest


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
internal/client/client.go (1)

833-883: Consider documenting that non-OK done stage outcomes are intentionally not observed.

The observe method only records doneOK for the batchRequestStageDone stage (lines 875-878), silently ignoring other outcomes. This appears intentional since failed requests would have already recorded their failure in a prior stage, but a brief comment would clarify this design choice for future maintainers.

💡 Suggested comment addition
 	case batchRequestStageDone:
+		// Only OK outcome is recorded for done stage; failure outcomes are captured in earlier stages
 		if outcome == batchRequestOutcomeOK {
 			observer = m.doneOK
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/client/client.go` around lines 833 - 883, The observe method on
batchRequestStageMetrics only records metrics for batchRequestStageDone when
outcome == batchRequestOutcomeOK and ignores non-OK outcomes; add a brief
explanatory comment inside observe near the batchRequestStageDone handling
(referencing batchRequestStageMetrics.observe, batchRequestStageDone,
batchRequestOutcomeOK, and the doneOK observer) stating that non-OK done
outcomes are intentionally not observed because failures are recorded in earlier
stages, so future maintainers understand this design choice.


📥 Commits

Reviewing files that changed from the base of the PR and between 80b2be9 and 4b701c5.

📒 Files selected for processing (8)
  • internal/client/client.go
  • internal/client/client_async.go
  • internal/client/client_batch.go
  • internal/client/client_test.go
  • internal/client/conn_batch.go
  • internal/client/conn_pool.go
  • metrics/metrics.go
  • metrics/shortcuts.go
💤 Files with no reviewable changes (1)
  • metrics/shortcuts.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • internal/client/client_test.go
  • internal/client/client_async.go
  • internal/client/client_batch.go

Signed-off-by: zyguan <zhongyangguan@gmail.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/client/client_batch.go (1)

252-257: String-based error matching is fragile.

Comparing cause.Error() against literal strings couples this code to exact error message text. If the error messages at creation sites change, this matching silently fails and falls through to batchRequestOutcomeFailed.

Consider defining sentinel errors and comparing by identity:

♻️ Suggested refactor using sentinel errors
+var (
+	errBatchConnClosed   = errors.New("batchConn closed")
+	errBatchClientClosed = errors.New("batch client closed")
+)
+
 func batchRequestTerminalOutcome(err error) batchRequestOutcome {
 	if err == nil {
 		return batchRequestOutcomeOK
 	}

 	cause := errors.Cause(err)
 	switch cause {
 	case context.DeadlineExceeded:
 		return batchRequestOutcomeTimeout
 	case context.Canceled:
 		return batchRequestOutcomeCanceled
+	case errBatchConnClosed, errBatchClientClosed:
+		return batchRequestOutcomeClosed
+	default:
+		return batchRequestOutcomeFailed
 	}
-
-	switch cause.Error() {
-	case "batchConn closed", "batch client closed":
-		return batchRequestOutcomeClosed
-	default:
-		return batchRequestOutcomeFailed
-	}
 }

Then update the error creation sites (e.g., line 1217, 1236, and in failAsyncRequestsOnClose) to use these sentinels.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/client/client_batch.go` around lines 252 - 257, The code is fragile
because it matches error strings from cause.Error(); replace this with sentinel
errors and identity/wrapping-aware checks: declare package-level sentinel vars
(e.g., ErrBatchConnClosed, ErrBatchClientClosed) and change the switch in the
handler to use errors.Is(cause, ErrBatchConnClosed) / errors.Is(cause,
ErrBatchClientClosed) (or direct equality if not wrapped) to return
batchRequestOutcomeClosed, else batchRequestOutcomeFailed; then update all
creation sites that currently return matching string errors (e.g., where errors
are created for closed connection/client and in failAsyncRequestsOnClose) to
return the corresponding sentinel error variables instead so the comparisons
remain robust.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/client/client_batch.go`:
- Around line 1044-1046: The canceled entry tail latency can be inflated when
entry.sentNS is still zero due to a race between send() and batchRecvLoop;
update the logic around canceledEntryTailLat.Observe to load entry.sentNS into a
local variable (e.g., sent := entry.sentNS.Load()), check that sent != 0 (or
sent < recvNS) and only observe (float64(recvNS-sent)/1e9) when sent is
non-zero/valid; otherwise skip the observation (or record a separate "unknown"
metric) so you don't record recvNS/1e9 when sentNS is uninitialized. Use the
existing symbols entry.sentNS, canceledEntryTailLat, batchRecvLoop and send() to
locate where to add this guard.



📥 Commits

Reviewing files that changed from the base of the PR and between 4b701c5 and 9e42b52.

📒 Files selected for processing (2)
  • internal/client/client_batch.go
  • metrics/metrics.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • metrics/metrics.go

Comment thread internal/client/client_batch.go
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Comment thread internal/client/client_batch.go Outdated
for _, requestID := range req.RequestIds {
	value, ok := c.batched.Load(requestID)
	if !ok {
		continue
Contributor

Does this mean the request was responded to very quickly? Will it cause any confusion if we skip updating entry.sentNS?

Contributor

Also, I see a similar skip in failRequestsByIDs:

func (c *batchCommandsClient) failRequestsByIDs(err error, requestIDs []uint64) {
	for _, requestID := range requestIDs {
		value, ok := c.batched.Load(requestID)
		if !ok {
			continue
		}
		c.failRequest(err, requestID, value.(*batchCommandsEntry))
	}
}

Do we need to assert intest.Assert(ok) here? And if ok is false, something may be wrong, so maybe we should also output some logs.

Comment thread internal/client/client_batch.go Outdated
Comment on lines +162 to +164
	batchedNS atomic.Int64
	sentNS    atomic.Int64
	recvNS    atomic.Int64
Contributor

Better to add some comments about these NS fields, i.e., what interval each one measures, the unit, etc.
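One way to address this comment is doc comments that pin down the reference point, the unit, and the zero sentinel for each field. The semantics below are inferred from the PR walkthrough (entries record atomic batched/sent/recv timestamps) and should be checked against the actual code before adopting the wording.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Sketch of the requested field documentation; semantics are assumptions.
type batchCommandsEntry struct {
	// batchedNS is the elapsed time, in nanoseconds since the entry was
	// created, at which the request was appended to a batch.
	// 0 means the stage has not been recorded yet.
	batchedNS atomic.Int64
	// sentNS is the elapsed nanoseconds (same reference point) at which the
	// batch containing this request was written to the gRPC stream.
	// 0 means the stage has not been recorded yet.
	sentNS atomic.Int64
	// recvNS is the elapsed nanoseconds (same reference point) at which the
	// response for this request was received from the stream.
	// 0 means the stage has not been recorded yet.
	recvNS atomic.Int64
}

func main() {
	var e batchCommandsEntry
	// All stages start at the 0 sentinel, i.e. "not yet recorded".
	fmt.Println(e.batchedNS.Load(), e.sentNS.Load(), e.recvNS.Load()) // prints "0 0 0"
}
```

Documenting the 0-means-unset convention matters because other review threads here rely on it (e.g. the max(1, elapsed) suggestion below exists precisely to keep real measurements from colliding with the sentinel).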

Comment thread internal/client/client_batch.go
Comment thread internal/client/client_batch.go Outdated
// this may be caused by batchCommandsClient#send hitting an ambiguous error: the request has been sent to TiKV but an error is still reported,
// then TiKV sends the response back through the stream and it reaches here.
outdatedRespCount.Inc()
logutil.BgLogger().Warn("batchRecvLoop receives outdated response", zap.Uint64("requestID", requestID), zap.String("forwardedHost", streamClient.forwardedHost))
Contributor

Better to add conn to the log fields.

Comment thread internal/client/client_batch.go Outdated
}

func recordBatchRequestStage(stage *atomic.Int64, start, now time.Time) int64 {
	elapsed := batchRequestElapsedNS(start, now)
Contributor

Could we set elapsed to max(1, elapsed) to distinguish "not set" from a very small duration?

Comment thread internal/client/client_batch.go

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
internal/client/client_batch.go (1)

251-257: String-based error matching is fragile and requires synchronization across multiple creation sites.

Comparing cause.Error() against string literals will silently break if error messages change in any of the 6 locations where "batchConn closed" and "batch client closed" are created. Consider defining sentinel errors at the package level and using identity comparison:

♻️ Suggested approach

Define sentinels once:

var (
    errBatchConnClosed   = errors.New("batchConn closed")
    errBatchClientClosed = errors.New("batch client closed")
)

Replace all six errors.New() calls in client_async.go and client_batch.go with these sentinels, then update the matching logic:

func batchRequestTerminalOutcome(err error) batchRequestOutcome {
    if err == nil {
        return batchRequestOutcomeOK
    }
    cause := errors.Cause(err)
    switch cause {
    case context.DeadlineExceeded:
        return batchRequestOutcomeTimeout
    case context.Canceled:
        return batchRequestOutcomeCanceled
    case errBatchConnClosed, errBatchClientClosed:
        return batchRequestOutcomeClosed
    default:
        return batchRequestOutcomeFailed
    }
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/client/client_batch.go` around lines 251 - 257, The code in
batchRequestTerminalOutcome currently matches on cause.Error() string values
which is brittle; define package-level sentinel errors (e.g., errBatchConnClosed
and errBatchClientClosed) and replace all occurrences of errors.New("batchConn
closed") / errors.New("batch client closed") in client_async.go and
client_batch.go with those sentinel variables, then update
batchRequestTerminalOutcome to compare the unwrapped cause against the sentinels
(using ==) instead of matching strings; keep existing handling for
context.DeadlineExceeded and context.Canceled and ensure errors.Cause(err) is
used before the equality checks.


📥 Commits

Reviewing files that changed from the base of the PR and between 9e42b52 and 79310ef.

📒 Files selected for processing (3)
  • internal/client/client.go
  • internal/client/client_batch.go
  • internal/client/conn_pool.go
✅ Files skipped from review due to trivial changes (1)
  • internal/client/conn_pool.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/client/client.go

Signed-off-by: zyguan <zhongyangguan@gmail.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/client/client_batch.go (1)

238-257: Use sentinel errors instead of string matching for close classification.

The batchRequestTerminalOutcome function maps literal error strings to the closed outcome. If any error message is changed at the return sites (lines 898, 1239, 1258 in client_batch.go or lines 159, 177, 184 in client_async.go), the metric will silently mislabel those requests as failed instead of closed. Replace the strings with a shared sentinel error variable to prevent this fragility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/client/client_batch.go` around lines 238 - 257,
batchRequestTerminalOutcome currently matches error text literals to map closed
outcomes; define a shared sentinel error (e.g., ErrBatchClientClosed) and
replace the string-based switch with an errors.Is or direct equality check
against that sentinel in batchRequestTerminalOutcome, and update the places that
create/return the "batchConn closed"/"batch client closed" errors (the return
sites in client_batch.go and client_async.go) to return or wrap that sentinel so
the classification is robust to message changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/client/client_batch.go`:
- Around line 1005-1009: The code currently uses recvLoopStartTime as the start
for both the "recv" and "process" measurements so the process metric
double-counts recv time; after computing recvDur and calling
recvDur.Observe(...) reset the timer (e.g., set processStart := time.Now()) and
use that new timestamp to measure/process duration
(processDur.Observe(processTime.Seconds())), ensuring recvLoopStartTime is only
used for the "recv" series; update locations around recvLoopStartTime,
respRecvTime, recvDur.Observe and the process measurement to use the new
processStart.



📥 Commits

Reviewing files that changed from the base of the PR and between 79310ef and 5a166bb.

📒 Files selected for processing (2)
  • internal/client/client_batch.go
  • internal/client/client_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/client/client_test.go

Comment thread internal/client/client_batch.go
Signed-off-by: zyguan <zhongyangguan@gmail.com>

zyguan commented Apr 1, 2026

/retest

zyguan added 3 commits April 2, 2026 02:43
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
…ient-metrics

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Comment thread internal/client/client.go
Signed-off-by: zyguan <zhongyangguan@gmail.com>
@ti-chi-bot ti-chi-bot bot added the approved label Apr 3, 2026

zyguan commented Apr 7, 2026

/retest


zyguan commented Apr 7, 2026

@lcwangchao @ekexium Could you PTAL?

zyguan added 3 commits April 8, 2026 06:16
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>

zyguan commented Apr 9, 2026

@hbisheng Could you PTAL?

Comment thread internal/client/client_batch.go Outdated
@ekexium ekexium left a comment (Contributor)

LGTM overall.
Have you run any benchmarks to check for performance regressions?

@hbisheng hbisheng left a comment (Member)

I reviewed the logic that sets ClientSendTimeNs before sending and measures recv_ts - resp.GetTikvSendTimeNs() after receiving, and it LGTM.

I also ran an AI-assisted review, which didn't surface any critical issues. A few minor suggestions it raised:

  • Cross-node clock assumptions: This relies on client/TiKV wall-clock sync (UnixNano() vs TikvSendTimeNs). Consider documenting this assumption and clamping or ignoring obvious skew (e.g. negative or extreme values) to avoid metric pollution.

  • Potential panic risk: In batchRecvLoop(), batchState := entry.batchState.Load() is dereferenced based on an invariant. A defensive nil check (warn + continue) could improve robustness against future regressions.

  • Metrics stability / label cardinality: This introduces batch_stream_* metrics with new labels. Please confirm target is a stable store address (avoid peer or ephemeral port strings causing high cardinality), and clarify whether any existing metrics are being replaced or renamed (call out if this is a breaking change).

zyguan added 2 commits April 9, 2026 07:25
Signed-off-by: zyguan <zhongyangguan@gmail.com>
…ient-metrics

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Comment thread internal/client/client_batch.go Outdated

epoch := atomic.LoadUint64(&c.epoch)
for {

Contributor

Suggested change

Comment thread internal/client/client_batch.go Outdated
	atomic.StoreInt64(&entry.recvLat, int64(respRecvTime.Sub(entry.start)))
	batchState := entry.batchState.Load()
	batchState.firstRespAfterBatchedNS.CompareAndSwap(0, max(respRecvTime.Sub(batchState.batchedAt).Nanoseconds(), int64(1)))
	entry.recvAfterStartNS.CompareAndSwap(0, max(respRecvTime.Sub(entry.start).Nanoseconds(), int64(1)))
Contributor

Better to also record tikvSendTailLat here?

Contributor Author

I prefer not to: all durations reported by EntryProgress are observed on the TiDB side, whereas tikvSendTailLat is an estimated duration calculated from cross-node wall clocks. Tracking it as a metric is fine, but printing the exact values to the log can be misleading.

Contributor

I think the benefit is: if only some of the requests in a batch have been responded to TiDB, all requests in the batch can still get the estimated TiKV latency, so we can confirm whether a timed-out request actually arrived at TiKV. Metrics, by comparison, only show the whole picture without identifying a specific request.

To reduce the risk of misleading output, we can record just two fields here:

  1. Whether this request is confirmed to have arrived at TiKV.
  2. Only if the request arrived at TiKV too late (for example, tikvTime - tidbTime > 5s) do we record the latency. Otherwise, the slow log does not display it.

Contributor Author

  1. Whether this request is confirmed to have arrived at TiKV.

It's already included in this PR, please search "ack:" for details.

  1. Only if the request arrived at TiKV too late (for example, tikvTime - tidbTime > 5s), we record the latency. Otherwise, the slow log does not display it.

I wonder how you would like to use tikvSendTailLat if it's recorded as a field in the entry. Log it? When?

Contributor

This info can be recorded in the error when waiting on the recvLoop times out:

case <-timer.C:
    atomic.StoreInt32(&entry.canceled, 1)
    reason := fmt.Sprintf("wait recvLoop timeout, timeout:%s", timeout)
    if sendLat := atomic.LoadInt64(&entry.sendLat); sendLat > 0 {
        reason += fmt.Sprintf(", send:%s", util.FormatDuration(time.Duration(sendLat)))
        if recvLat := atomic.LoadInt64(&entry.recvLat); recvLat > 0 {
            reason += fmt.Sprintf(", recv:%s", util.FormatDuration(time.Duration(recvLat-sendLat)))
        }
    }
    return nil, errors.WithMessage(context.DeadlineExceeded, reason)

Contributor

For example, if a batch contains 10 requests, then as soon as one of the requests responds, all the requests in this batch can be updated.

Contributor Author

> For example, if a batch contains 10 requests, then as soon as one of the requests responds, all the requests in this batch can be updated.

It's already included; please search for "ack:" as I mentioned in #1931 (comment). And I think this is not the tikvSendTailLat you originally mentioned, by the way.

If you mean recording the tidbRecvTimeNs - tikvSendTimeNs of the very first response returned within the batch into the current timed-out request, wouldn't that be even more misleading?

Contributor

My point is not that every request should log this. Rather, when tidbRecvTimeNs - tikvSendTimeNs is unusually large (for example, greater than 5000000000 ns, i.e. 5s), we could add a new log field, slowTiKVResponse, to print both tidbRecvTimeNs and tikvSendTimeNs. We do not need to record this in every entry; it is sufficient to log it only when the condition is met, so that we can retrieve this information when needed.

Contributor

But I'm still fine with approving this PR, because it adds a lot of statistics. Feel free to ask me for approval once you think this PR is ready to merge.

Contributor Author

f44bd00 PTAL. I would like to emphasize a crucial point: just because something does not typically occur does not mean it cannot happen. In this context, the condition for triggering log entries should not be based solely on tidbRecvTimeNs - tikvSendTimeNs exceeding a specific threshold; otherwise, should a significant clock skew arise between nodes, it would result in a massive volume of false positive log entries.

if dur := sendLoopEndTime.Sub(headRecvTime); dur > batchSendTailLatThreshold {
    a.metrics.batchSendTailLat.Observe(dur.Seconds())
}
if sendLoopEndTime.Sub(lastPendingInspectAt) >= batchRequestInspectInterval {
Contributor

I think we can also call inspectPendingRequests at line 233, after the loop exits for the idle reason.

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@zyguan
Contributor Author

zyguan commented Apr 9, 2026

/retest

@zyguan
Contributor Author

zyguan commented Apr 9, 2026

> LGTM overall. Have you run any benchmarks to check for performance regressions?

I've run the sysbench point-get workload, which is extremely sensitive to batch client regressions. The results show that this PR introduces very little overhead.

With the default point-get workload (unlimited rate), the baseline has a throughput of 73784.24 and this PR has a throughput of 73256.47, only about a 0.7% regression (typically smaller than the measurement error).
[screenshot: 20260409-172543]

With a fixed-rate point-get workload (TiDB CPU usage around 75%), the profile diff is as follows:
[screenshot: 20260409-170337]

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Comment thread on internal/client/client_batch.go (outdated)
return false
}
requestID := entry.requestID.Load()
return requestID > 0 && requestID < stream.maxRespReqID.Load()
Contributor

Should it be requestID <= stream.maxRespReqID.Load()?

Contributor Author

Yes. Since the request whose ID equals maxRespReqID must have firstRespAfterSendStartNS set, < here won't lead to that request being considered unconfirmed; however, to make the function's semantics more accurate, it would be better to use <=.
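The boundary being discussed can be shown with a tiny helper; isRetired and its argument names are illustrative only, not the PR's code:

```go
package main

import "fmt"

// isRetired reports whether a request can be considered confirmed-received by
// TiKV, given that responses up to and including maxRespReqID have come back.
// With <, the request whose ID equals maxRespReqID would be excluded, which is
// harmless here (that request carries its own first-response timestamp) but
// does not match the intended semantics; <= does.
func isRetired(requestID, maxRespReqID uint64) bool {
	return requestID > 0 && requestID <= maxRespReqID
}

func main() {
	fmt.Println(isRetired(7, 7)) // true with <=; the < version would say false
	fmt.Println(isRetired(8, 7)) // false: no response at or beyond this ID yet
	fmt.Println(isRetired(0, 7)) // false: ID 0 means "never assigned"
}
```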

zyguan added 2 commits April 15, 2026 06:37
…ient-metrics

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
@zyguan
Contributor Author

zyguan commented Apr 15, 2026

/retest

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@zyguan
Contributor Author

zyguan commented Apr 15, 2026

/retest

@ti-chi-bot ti-chi-bot bot added the lgtm label Apr 15, 2026
@ti-chi-bot

ti-chi-bot bot commented Apr 15, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk, ekexium

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 15, 2026
@ti-chi-bot

ti-chi-bot bot commented Apr 15, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-03 09:40:48.299763179 +0000 UTC m=+517253.505123235: ☑️ agreed by cfzjywxk.
  • 2026-04-15 11:51:59.156919504 +0000 UTC m=+1561924.362279551: ☑️ agreed by ekexium.

@ti-chi-bot ti-chi-bot bot merged commit 0eed1ff into tikv:master Apr 15, 2026
18 of 19 checks passed

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
