
metrics: refine batch client metrics#1931

Merged
ti-chi-bot[bot] merged 23 commits into tikv:master from zyguan:dev/refine-batch-client-metrics
Apr 15, 2026

Conversation


@zyguan zyguan commented Mar 31, 2026

This PR refines the TiKV batch client metrics system:

  1. Stream-level metrics refinement: renames and restructures batch receive metrics to be stream-scoped.
    • Metrics such as TiKVBatchRecvLoopDuration become TiKVBatchStreamRecvLoopDuration, with additional labels (conn, forward) for finer granularity.
    • Adds four new counter metrics to track request flow through batch streams.
  2. Batch request stage latency refinement: renames TiKVBatchRequestDuration to TiKVBatchRequestStageDuration and adds a stage label to break down latency by request lifecycle stage.

Summary by CodeRabbit

  • New Features

    • Enhanced batch observability with per-store and per-connection metrics, per-stage (batched/send/recv) timing, outcome labels, and forwarding dimensions.
    • Added stream-level batch metrics and counters to track tracked/retired/completed/outdated responses.
    • Replaced legacy batch-duration metrics with a unified stage-duration model and removed deprecated shortcuts.
  • Tests

    • Added unit tests for stage timing, terminal outcome mapping, and emitted observations.

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@ti-chi-bot ti-chi-bot bot added dco-signoff: yes Indicates the PR's author has signed the dco. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 31, 2026

coderabbitai bot commented Mar 31, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Per-store batch-request stage metrics were added and wired through batch client paths: entries record atomic batched/sent/recv timestamps; per-store storeMetrics.batchReqStage observers replace global batch-duration observers; stream-level counters and per-client stream metrics were introduced and propagated via connection-index metadata.

Changes

  • Metric definitions (metrics/metrics.go, metrics/shortcuts.go): Replaced old batch recv/request metrics with stream-oriented recv metrics and new counters; introduced TiKVBatchRequestStageDuration labeled by store, stage, result; added LblConn, LblForward, LblStage; removed global batch-duration observer shortcuts.
  • Client core & store metrics (internal/client/client.go, internal/client/conn_pool.go): Client obtains per-store metrics via connPool.getStoreMetrics(storeID); removed connPool.updateRPCMetrics; storeMetrics now holds batchReqStage used for RPC and batch metrics.
  • Async client send (internal/client/client_async.go): SendRequestAsync attaches per-entry batchRequestMetrics, stores the stage timer in the entry, records completion via observeBatchRequestCompletion, and delegates RPC metrics updates to storeMetrics.
  • Batch command logic & instrumentation (internal/client/client_batch.go, internal/client/conn_batch.go): Added per-entry atomic timestamps (batchedNS, sentNS, recvNS), stage/outcome enums and helpers, ordered observation emission (including done-only-for-OK), per-client stream metrics, conn-index plumbing with batchConnIdxMetadataKey, and stage-aware timeout reason formatting.
  • Conn pool / client metrics init (internal/client/conn_pool.go, internal/client/conn_batch.go): Batch client metrics are initialized per connection index (initBatchCommandsClientMetrics(connIdx)); removed old recv-loop metric init; send instrumentation now records the batched stage via recordBatchRequestStage.
  • Tests (internal/client/client_test.go): Updated sendBatchRequest call sites for the new signature and added unit tests for stage recording, terminal outcome mapping, observation ordering, and timeout-reason normalization.
  • Other small edits (internal/client/client_async.go, import changes): Removed direct global metrics observer usage; metric access is routed via per-store/per-client metrics structs.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant ConnPool as ConnPool
    participant StoreMetrics as StoreMetrics
    participant BatchClient as BatchClient
    participant BatchEntry as BatchEntry
    participant Observer as Observer

    rect rgba(100,200,150,0.5)
        Note over Client,ConnPool: Start request — fetch per-store metrics
        Client->>ConnPool: sendRequest(req)
        ConnPool->>StoreMetrics: getStoreMetrics(storeID)
        ConnPool-->>Client: storeMetrics (includes batchReqStage)
    end

    rect rgba(150,150,200,0.5)
        Note over Client,BatchEntry: Create entry and record batched stage
        Client->>BatchEntry: create entry + attach storeMetrics.batchReqStage
        Client->>BatchEntry: recordBatchRequestStage(batched)
    end

    rect rgba(200,150,150,0.5)
        Note over Client,BatchClient: Send path records sent stage
        Client->>BatchClient: send via stream (connIdx metadata)
        BatchClient->>BatchEntry: recordBatchRequestStage(sent)
    end

    rect rgba(150,200,150,0.5)
        Note over BatchEntry,Observer: Recv path records recv and emits observations
        BatchClient->>BatchEntry: receive response -> recordBatchRequestStage(recv)
        BatchEntry->>Observer: visitBatchRequestObservations -> observe(stage,outcome,duration)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

size/XL

Suggested reviewers

  • cfzjywxk
  • ekexium

Poem

🐰 I nibbled logs and tracked each hop,
Batched, sent, received — I never stop.
Per-store I tally each stage and time,
Carrots for metrics, marshmallow rhyme,
Hopping through code, one small metric at a time.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 13.04%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): the title clearly describes the main objective: refining batch client metrics across stream-level and request-stage latency dimensions.


@zyguan zyguan force-pushed the dev/refine-batch-client-metrics branch from 80b2be9 to 723f512 Compare March 31, 2026 05:44
Signed-off-by: zyguan <zhongyangguan@gmail.com>
@zyguan zyguan force-pushed the dev/refine-batch-client-metrics branch from 723f512 to 4b701c5 Compare March 31, 2026 05:47

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/client/client.go`:
- Around line 342-346: The deferred call currently invokes the removed
connPool.updateRPCMetrics and must instead call the store-specific metrics
updater; capture elapsed as already done and replace
connPool.updateRPCMetrics(req, resp, elapsed) with
storeMetrics.updateRPCMetrics(req, resp, elapsed) so the sync path delegates to
the same storeMetrics implementation used by the async path (see
connPool.getStoreMetrics, start, req, resp, elapsed, and
storeMetrics.updateRPCMetrics).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 48eee403-94f7-4d36-bd24-b24f75b094c3

📥 Commits

Reviewing files that changed from the base of the PR and between 3805cb7 and 80b2be9.

📒 Files selected for processing (8)
  • internal/client/client.go
  • internal/client/client_async.go
  • internal/client/client_batch.go
  • internal/client/client_test.go
  • internal/client/conn_batch.go
  • internal/client/conn_pool.go
  • metrics/metrics.go
  • metrics/shortcuts.go
💤 Files with no reviewable changes (1)
  • metrics/shortcuts.go

Comment thread internal/client/client.go Outdated

zyguan commented Mar 31, 2026

/retest


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
internal/client/client.go (1)

833-883: Consider documenting that non-OK done stage outcomes are intentionally not observed.

The observe method only records doneOK for the batchRequestStageDone stage (lines 875-878), silently ignoring other outcomes. This appears intentional since failed requests would have already recorded their failure in a prior stage, but a brief comment would clarify this design choice for future maintainers.

💡 Suggested comment addition
 	case batchRequestStageDone:
+		// Only OK outcome is recorded for done stage; failure outcomes are captured in earlier stages
 		if outcome == batchRequestOutcomeOK {
 			observer = m.doneOK
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/client/client.go` around lines 833 - 883, The observe method on
batchRequestStageMetrics only records metrics for batchRequestStageDone when
outcome == batchRequestOutcomeOK and ignores non-OK outcomes; add a brief
explanatory comment inside observe near the batchRequestStageDone handling
(referencing batchRequestStageMetrics.observe, batchRequestStageDone,
batchRequestOutcomeOK, and the doneOK observer) stating that non-OK done
outcomes are intentionally not observed because failures are recorded in earlier
stages, so future maintainers understand this design choice.


📥 Commits

Reviewing files that changed from the base of the PR and between 80b2be9 and 4b701c5.

📒 Files selected for processing (8)
  • internal/client/client.go
  • internal/client/client_async.go
  • internal/client/client_batch.go
  • internal/client/client_test.go
  • internal/client/conn_batch.go
  • internal/client/conn_pool.go
  • metrics/metrics.go
  • metrics/shortcuts.go
💤 Files with no reviewable changes (1)
  • metrics/shortcuts.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • internal/client/client_test.go
  • internal/client/client_async.go
  • internal/client/client_batch.go

Signed-off-by: zyguan <zhongyangguan@gmail.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/client/client_batch.go (1)

252-257: String-based error matching is fragile.

Comparing cause.Error() against literal strings couples this code to exact error message text. If the error messages at creation sites change, this matching silently fails and falls through to batchRequestOutcomeFailed.

Consider defining sentinel errors and comparing by identity:

♻️ Suggested refactor using sentinel errors
+var (
+	errBatchConnClosed   = errors.New("batchConn closed")
+	errBatchClientClosed = errors.New("batch client closed")
+)
+
 func batchRequestTerminalOutcome(err error) batchRequestOutcome {
 	if err == nil {
 		return batchRequestOutcomeOK
 	}

 	cause := errors.Cause(err)
 	switch cause {
 	case context.DeadlineExceeded:
 		return batchRequestOutcomeTimeout
 	case context.Canceled:
 		return batchRequestOutcomeCanceled
+	case errBatchConnClosed, errBatchClientClosed:
+		return batchRequestOutcomeClosed
+	default:
+		return batchRequestOutcomeFailed
 	}
-
-	switch cause.Error() {
-	case "batchConn closed", "batch client closed":
-		return batchRequestOutcomeClosed
-	default:
-		return batchRequestOutcomeFailed
-	}
 }

Then update the error creation sites (e.g., line 1217, 1236, and in failAsyncRequestsOnClose) to use these sentinels.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/client/client_batch.go` around lines 252 - 257, The code is fragile
because it matches error strings from cause.Error(); replace this with sentinel
errors and identity/wrapping-aware checks: declare package-level sentinel vars
(e.g., ErrBatchConnClosed, ErrBatchClientClosed) and change the switch in the
handler to use errors.Is(cause, ErrBatchConnClosed) / errors.Is(cause,
ErrBatchClientClosed) (or direct equality if not wrapped) to return
batchRequestOutcomeClosed, else batchRequestOutcomeFailed; then update all
creation sites that currently return matching string errors (e.g., where errors
are created for closed connection/client and in failAsyncRequestsOnClose) to
return the corresponding sentinel error variables instead so the comparisons
remain robust.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/client/client_batch.go`:
- Around line 1044-1046: The canceled entry tail latency can be inflated when
entry.sentNS is still zero due to a race between send() and batchRecvLoop;
update the logic around canceledEntryTailLat.Observe to load entry.sentNS into a
local variable (e.g., sent := entry.sentNS.Load()), check that sent != 0 (or
sent < recvNS) and only observe (float64(recvNS-sent)/1e9) when sent is
non-zero/valid; otherwise skip the observation (or record a separate "unknown"
metric) so you don't record recvNS/1e9 when sentNS is uninitialized. Use the
existing symbols entry.sentNS, canceledEntryTailLat, batchRecvLoop and send() to
locate where to add this guard.



📥 Commits

Reviewing files that changed from the base of the PR and between 4b701c5 and 9e42b52.

📒 Files selected for processing (2)
  • internal/client/client_batch.go
  • metrics/metrics.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • metrics/metrics.go

Comment thread internal/client/client_batch.go
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Comment thread internal/client/client_batch.go Outdated
for _, requestID := range req.RequestIds {
	value, ok := c.batched.Load(requestID)
	if !ok {
		continue
Contributor

Does this mean the request was responded to very quickly? Will it cause any confusion if we skip updating entry.sentNS?

Contributor

Also, I see a similar skip in failRequestsByIDs:

func (c *batchCommandsClient) failRequestsByIDs(err error, requestIDs []uint64) {
	for _, requestID := range requestIDs {
		value, ok := c.batched.Load(requestID)
		if !ok {
			continue
		}
		c.failRequest(err, requestID, value.(*batchCommandsEntry))
	}
}

Do we need to assert intest.Assert(ok) here? And if ok is false, something may be wrong, so maybe we should also output some logs.

Comment thread internal/client/client_batch.go Outdated
Comment on lines +162 to +164
	batchedNS atomic.Int64
	sentNS    atomic.Int64
	recvNS    atomic.Int64
Contributor

Better to add some comments about these NS fields, i.e., what interval each one measures, the unit, etc.
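One way to address this comment is doc comments that pin down the reference point, the unit, and the zero sentinel for each field. The semantics below are inferred from the PR walkthrough (entries record atomic batched/sent/recv timestamps) and should be checked against the actual code before adopting the wording.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Sketch of the requested field documentation; semantics are assumptions.
type batchCommandsEntry struct {
	// batchedNS is the elapsed time, in nanoseconds since the entry was
	// created, at which the request was appended to a batch.
	// 0 means the stage has not been recorded yet.
	batchedNS atomic.Int64
	// sentNS is the elapsed nanoseconds (same reference point) at which the
	// batch containing this request was written to the gRPC stream.
	// 0 means the stage has not been recorded yet.
	sentNS atomic.Int64
	// recvNS is the elapsed nanoseconds (same reference point) at which the
	// response for this request was received from the stream.
	// 0 means the stage has not been recorded yet.
	recvNS atomic.Int64
}

func main() {
	var e batchCommandsEntry
	// All stages start at the 0 sentinel, i.e. "not yet recorded".
	fmt.Println(e.batchedNS.Load(), e.sentNS.Load(), e.recvNS.Load()) // prints "0 0 0"
}
```

Documenting the 0-means-unset convention matters because other review threads here rely on it (e.g. the max(1, elapsed) suggestion below exists precisely to keep real measurements from colliding with the sentinel).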

Comment thread internal/client/client_batch.go
Comment thread internal/client/client_batch.go Outdated
// this may be caused by batchCommandsClient#send hitting an ambiguous error: the request has been sent to TiKV but an error is still reported,
// then TiKV sends the response back through the stream and it reaches here.
outdatedRespCount.Inc()
logutil.BgLogger().Warn("batchRecvLoop receives outdated response", zap.Uint64("requestID", requestID), zap.String("forwardedHost", streamClient.forwardedHost))
Contributor

Better to add conn to the log fields.

Comment thread internal/client/client_batch.go Outdated
}

func recordBatchRequestStage(stage *atomic.Int64, start, now time.Time) int64 {
	elapsed := batchRequestElapsedNS(start, now)
Contributor

Could we set elapsed to max(1, elapsed) to distinguish "not set" from a very small duration?

Comment thread internal/client/client_batch.go

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
internal/client/client_batch.go (1)

251-257: String-based error matching is fragile and requires synchronization across multiple creation sites.

Comparing cause.Error() against string literals will silently break if error messages change in any of the 6 locations where "batchConn closed" and "batch client closed" are created. Consider defining sentinel errors at the package level and using identity comparison:

♻️ Suggested approach

Define sentinels once:

var (
    errBatchConnClosed   = errors.New("batchConn closed")
    errBatchClientClosed = errors.New("batch client closed")
)

Replace all six errors.New() calls in client_async.go and client_batch.go with these sentinels, then update the matching logic:

func batchRequestTerminalOutcome(err error) batchRequestOutcome {
    if err == nil {
        return batchRequestOutcomeOK
    }
    cause := errors.Cause(err)
    switch cause {
    case context.DeadlineExceeded:
        return batchRequestOutcomeTimeout
    case context.Canceled:
        return batchRequestOutcomeCanceled
    case errBatchConnClosed, errBatchClientClosed:
        return batchRequestOutcomeClosed
    default:
        return batchRequestOutcomeFailed
    }
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/client/client_batch.go` around lines 251 - 257, The code in
batchRequestTerminalOutcome currently matches on cause.Error() string values
which is brittle; define package-level sentinel errors (e.g., errBatchConnClosed
and errBatchClientClosed) and replace all occurrences of errors.New("batchConn
closed") / errors.New("batch client closed") in client_async.go and
client_batch.go with those sentinel variables, then update
batchRequestTerminalOutcome to compare the unwrapped cause against the sentinels
(using ==) instead of matching strings; keep existing handling for
context.DeadlineExceeded and context.Canceled and ensure errors.Cause(err) is
used before the equality checks.


📥 Commits

Reviewing files that changed from the base of the PR and between 9e42b52 and 79310ef.

📒 Files selected for processing (3)
  • internal/client/client.go
  • internal/client/client_batch.go
  • internal/client/conn_pool.go
✅ Files skipped from review due to trivial changes (1)
  • internal/client/conn_pool.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/client/client.go

Signed-off-by: zyguan <zhongyangguan@gmail.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/client/client_batch.go (1)

238-257: Use sentinel errors instead of string matching for close classification.

The batchRequestTerminalOutcome function maps literal error strings to the closed outcome. If any error message is changed at the return sites (lines 898, 1239, 1258 in client_batch.go or lines 159, 177, 184 in client_async.go), the metric will silently mislabel those requests as failed instead of closed. Replace the strings with a shared sentinel error variable to prevent this fragility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/client/client_batch.go` around lines 238 - 257,
batchRequestTerminalOutcome currently matches error text literals to map closed
outcomes; define a shared sentinel error (e.g., ErrBatchClientClosed) and
replace the string-based switch with an errors.Is or direct equality check
against that sentinel in batchRequestTerminalOutcome, and update the places that
create/return the "batchConn closed"/"batch client closed" errors (the return
sites in client_batch.go and client_async.go) to return or wrap that sentinel so
the classification is robust to message changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/client/client_batch.go`:
- Around line 1005-1009: The code currently uses recvLoopStartTime as the start
for both the "recv" and "process" measurements so the process metric
double-counts recv time; after computing recvDur and calling
recvDur.Observe(...) reset the timer (e.g., set processStart := time.Now()) and
use that new timestamp to measure/process duration
(processDur.Observe(processTime.Seconds())), ensuring recvLoopStartTime is only
used for the "recv" series; update locations around recvLoopStartTime,
respRecvTime, recvDur.Observe and the process measurement to use the new
processStart.



📥 Commits

Reviewing files that changed from the base of the PR and between 79310ef and 5a166bb.

📒 Files selected for processing (2)
  • internal/client/client_batch.go
  • internal/client/client_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/client/client_test.go

Comment thread internal/client/client_batch.go
Signed-off-by: zyguan <zhongyangguan@gmail.com>

zyguan commented Apr 1, 2026

/retest

zyguan added 3 commits April 2, 2026 02:43
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
…ient-metrics

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Comment thread internal/client/client.go
Signed-off-by: zyguan <zhongyangguan@gmail.com>
@ti-chi-bot ti-chi-bot bot added the approved label Apr 3, 2026

zyguan commented Apr 7, 2026

/retest


zyguan commented Apr 7, 2026

@lcwangchao @ekexium Could you PTAL?

zyguan added 3 commits April 8, 2026 06:16
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>

zyguan commented Apr 9, 2026

@hbisheng Could you PTAL?

Comment thread internal/client/client_batch.go Outdated
@ekexium ekexium left a comment (Contributor)

LGTM overall.
Have you run any benchmarks to check for performance regressions?

@hbisheng hbisheng left a comment (Member)

I reviewed the logic that sets ClientSendTimeNs before sending and measures recv_ts - resp.GetTikvSendTimeNs() after receiving, and it LGTM.

I also ran an AI-assisted review, which didn't surface any critical issues. A few minor suggestions it raised:

  • Cross-node clock assumptions: This relies on client/TiKV wall-clock sync (UnixNano() vs TikvSendTimeNs). Consider documenting this assumption and clamping or ignoring obvious skew (e.g. negative or extreme values) to avoid metric pollution.

  • Potential panic risk: In batchRecvLoop(), batchState := entry.batchState.Load() is dereferenced based on an invariant. A defensive nil check (warn + continue) could improve robustness against future regressions.

  • Metrics stability / label cardinality: This introduces batch_stream_* metrics with new labels. Please confirm target is a stable store address (avoid peer or ephemeral port strings causing high cardinality), and clarify whether any existing metrics are being replaced or renamed (call out if this is a breaking change).

zyguan added 2 commits April 9, 2026 07:25
Signed-off-by: zyguan <zhongyangguan@gmail.com>
…ient-metrics

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Comment thread internal/client/client_batch.go Outdated

epoch := atomic.LoadUint64(&c.epoch)
for {

Contributor

Suggested change

Comment thread internal/client/client_batch.go Outdated
	atomic.StoreInt64(&entry.recvLat, int64(respRecvTime.Sub(entry.start)))
	batchState := entry.batchState.Load()
	batchState.firstRespAfterBatchedNS.CompareAndSwap(0, max(respRecvTime.Sub(batchState.batchedAt).Nanoseconds(), int64(1)))
	entry.recvAfterStartNS.CompareAndSwap(0, max(respRecvTime.Sub(entry.start).Nanoseconds(), int64(1)))
Contributor

Better to also record tikvSendTailLat here?

Contributor Author

I prefer not to: all durations reported by EntryProgress are observed on the TiDB side, whereas tikvSendTailLat is an estimated duration calculated from cross-node wall clocks. Tracking it as a metric is fine, but printing the exact values to the log can be misleading.

Contributor

I think the benefit is: if only some of the requests in a batch have been responded to TiDB, all requests in the batch can still get the estimated TiKV latency, so we can confirm whether a timed-out request actually arrived at TiKV. Metrics, by comparison, only show the whole picture without identifying a specific request.

To reduce the risk of misleading output, we can record just two fields here:

  1. Whether this request is confirmed to have arrived at TiKV.
  2. Only if the request arrived at TiKV too late (for example, tikvTime - tidbTime > 5s) do we record the latency. Otherwise, the slow log does not display it.

Contributor Author

  1. Whether this request is confirmed to have arrived at TiKV.

It's already included in this PR, please search "ack:" for details.

  1. Only if the request arrived at TiKV too late (for example, tikvTime - tidbTime > 5s), we record the latency. Otherwise, the slow log does not display it.

I wonder how you would like to use tikvSendTailLat if it's recorded as a field in the entry. Log it? When?

Contributor

This info can be recorded in the error when waiting on the recvLoop times out:

case <-timer.C:
    atomic.StoreInt32(&entry.canceled, 1)
    reason := fmt.Sprintf("wait recvLoop timeout, timeout:%s", timeout)
    if sendLat := atomic.LoadInt64(&entry.sendLat); sendLat > 0 {
        reason += fmt.Sprintf(", send:%s", util.FormatDuration(time.Duration(sendLat)))
        if recvLat := atomic.LoadInt64(&entry.recvLat); recvLat > 0 {
            reason += fmt.Sprintf(", recv:%s", util.FormatDuration(time.Duration(recvLat-sendLat)))
        }
    }
    return nil, errors.WithMessage(context.DeadlineExceeded, reason)

Contributor

For example, if a batch contains 10 requests, then as soon as one of the requests responds, all the requests in this batch can be updated.

Contributor Author

> For example, if a batch contains 10 requests, then as soon as one of the requests responds, all the requests in this batch can be updated.

It's already included; please search for "ack:" as I mentioned in #1931 (comment). And I think this is not the tikvSendTailLat you originally mentioned, by the way.

If you mean recording the tidbRecvTimeNs - tikvSendTimeNs of the very first response returned within the batch into the current timed-out request, wouldn't that be even more misleading?

Contributor

My point is not that every request should log this. Rather, when tidbRecvTimeNs - tikvSendTimeNs is unusually large (for example, greater than 5000000000 ns, i.e. 5s), we could add a new log field, slowTiKVResponse, to print both tidbRecvTimeNs and tikvSendTimeNs. We do not need to record this in every entry; it is sufficient to log it only when the condition is met, so that we can retrieve this information when needed.

Contributor

But I'm still fine with approving this PR, because it adds a lot of statistics. Feel free to ask me for approval once you think this PR is ready to merge.

Contributor Author

f44bd00 PTAL. I would like to emphasize a crucial point: just because something does not typically occur does not mean it cannot happen. In this context, the condition for triggering log entries should not be based solely on tidbRecvTimeNs - tikvSendTimeNs exceeding a specific threshold; otherwise, should a significant clock skew arise between nodes, it would result in a massive volume of false positive log entries.

if dur := sendLoopEndTime.Sub(headRecvTime); dur > batchSendTailLatThreshold {
    a.metrics.batchSendTailLat.Observe(dur.Seconds())
}
if sendLoopEndTime.Sub(lastPendingInspectAt) >= batchRequestInspectInterval {
Contributor

I think we can also call inspectPendingRequests at line 233, after the loop exits for the idle reason.

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@zyguan
Contributor Author

zyguan commented Apr 9, 2026

/retest

@zyguan
Contributor Author

zyguan commented Apr 9, 2026

> LGTM overall. Have you run any benchmarks to check for performance regressions?

I've run the sysbench point-get workload, which is extremely sensitive to batch client regressions. The results show that this PR introduces very little overhead.

With the default point-get workload (unlimited rate), the baseline has a throughput of 73784.24 and this PR has a throughput of 73256.47, only about a 0.7% regression (typically smaller than the measurement error).
[screenshot: 20260409-172543]

With a fixed-rate point-get workload (TiDB CPU usage around 75%), the profile diff is as follows:
[screenshot: 20260409-170337]

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Comment thread on internal/client/client_batch.go (outdated)
return false
}
requestID := entry.requestID.Load()
return requestID > 0 && requestID < stream.maxRespReqID.Load()
Contributor

Should it be requestID <= stream.maxRespReqID.Load()?

Contributor Author

Yes. Since the request whose ID equals maxRespReqID must have firstRespAfterSendStartNS set, < here won't lead to that request being considered unconfirmed; however, to make the function's semantics more accurate, it would be better to use <=.
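The boundary being discussed can be shown with a tiny helper; isRetired and its argument names are illustrative only, not the PR's code:

```go
package main

import "fmt"

// isRetired reports whether a request can be considered confirmed-received by
// TiKV, given that responses up to and including maxRespReqID have come back.
// With <, the request whose ID equals maxRespReqID would be excluded, which is
// harmless here (that request carries its own first-response timestamp) but
// does not match the intended semantics; <= does.
func isRetired(requestID, maxRespReqID uint64) bool {
	return requestID > 0 && requestID <= maxRespReqID
}

func main() {
	fmt.Println(isRetired(7, 7)) // true with <=; the < version would say false
	fmt.Println(isRetired(8, 7)) // false: no response at or beyond this ID yet
	fmt.Println(isRetired(0, 7)) // false: ID 0 means "never assigned"
}
```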

zyguan added 2 commits April 15, 2026 06:37
…ient-metrics

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
@zyguan
Contributor Author

zyguan commented Apr 15, 2026

/retest

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@zyguan
Contributor Author

zyguan commented Apr 15, 2026

/retest

@ti-chi-bot ti-chi-bot bot added the lgtm label Apr 15, 2026
@ti-chi-bot

ti-chi-bot bot commented Apr 15, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk, ekexium

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 15, 2026
@ti-chi-bot

ti-chi-bot bot commented Apr 15, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-03 09:40:48.299763179 +0000 UTC m=+517253.505123235: ☑️ agreed by cfzjywxk.
  • 2026-04-15 11:51:59.156919504 +0000 UTC m=+1561924.362279551: ☑️ agreed by ekexium.

@ti-chi-bot ti-chi-bot bot merged commit 0eed1ff into tikv:master Apr 15, 2026
18 of 19 checks passed

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
