Skip to content

txnkv: batch async resolve lock logs in LockResolver#1945

Open
lcwangchao wants to merge 2 commits intotikv:masterfrom
lcwangchao:resolve_lock_log_key
Open

txnkv: batch async resolve lock logs in LockResolver#1945
lcwangchao wants to merge 2 commits intotikv:masterfrom
lcwangchao:resolve_lock_log_key

Conversation

@lcwangchao
Copy link
Copy Markdown
Contributor

@lcwangchao lcwangchao commented Apr 15, 2026

This PR follows up #1901 by improving how resolve-lock related information is logged in LockResolver.

Instead of emitting one log line for every resolveLock / resolvePessimisticLock call, this change introduces an asynchronous batched logger inside LockResolver. The resolve path only records structured log entries and sends them to a background goroutine, which reduces log spam while still preserving per-lock details.

What changed

  • add a background log collector to LockResolver
  • collect structured log entries for both resolveLock and resolvePessimisticLock
  • emit a single aggregated txn resolve locks log for a batch of resolved locks
  • keep lock-specific fields in each entry, including:
    • lock info
    • resolve type
    • action
    • txn start TS
    • commit TS / forUpdateTS
    • region ID
    • region-level resolve flag
    • whether the lock was already resolved by CheckTxnStatus
  • flush logs when the batch reaches 20 entries or after the flush interval
  • avoid blocking the foreground resolve path when the log channel is full; instead, count dropped entries and report them in a warning log on the next flush

Tests

  • add unit tests for batched resolve-lock logging
  • add unit tests for dropped-log reporting when the async log channel is full
  • update integration tests to close temporary LockResolver instances explicitly, so background logging goroutines are cleaned up correctly

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced resource cleanup for lock resolver instances to prevent leaks
  • Improvements

    • Implemented asynchronous batched logging for lock resolution operations with buffering and overflow protection
    • Added structured logging with detailed metadata (transaction timestamps, region IDs) for lock resolution events

@ti-chi-bot ti-chi-bot Bot added the dco-signoff: yes Indicates the PR's author has signed the dco. label Apr 15, 2026
Signed-off-by: Chao Wang <cclcwangchao@hotmail.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 15, 2026

Warning

Rate limit exceeded

@lcwangchao has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 34 minutes and 13 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 34 minutes and 13 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f37c6b89-3f83-4c08-aa18-7f2279790cc2

📥 Commits

Reviewing files that changed from the base of the PR and between 0eed1ff and 5974570.

📒 Files selected for processing (5)
  • integration_tests/2pc_test.go
  • integration_tests/lock_test.go
  • integration_tests/shared_lock_test.go
  • txnkv/txnlock/lock_resolver.go
  • txnkv/txnlock/lock_resolver_test.go
📝 Walkthrough

Walkthrough

This PR adds explicit LockResolver cleanup in integration tests and refactors lock-resolve logging to an asynchronous, buffered, batched pipeline within the lock resolver implementation, including tests for batched logging and missed-entry reporting.

Changes

Cohort / File(s) Summary
Integration tests: resolver cleanup
integration_tests/2pc_test.go, integration_tests/lock_test.go, integration_tests/shared_lock_test.go
Replace inline one-off s.store.NewLockResolver() calls with local variables and ensure Close() is invoked (typically via defer or per-iteration cleanup) to guarantee resolver resource cleanup.
Lock resolver implementation
txnkv/txnlock/lock_resolver.go
Introduce buffered async batched logging for resolve operations: new types for log entries with zap marshaling, async log channel and runner, recording helpers, and use asyncCtx/asyncCancel. Replace many synchronous per-operation logs with enqueued batched logging and adjust backoff contexts to use the new async context.
Lock resolver tests
txnkv/txnlock/lock_resolver_test.go
Add tests verifying batched "txn resolve locks" log emission and missed-entry reporting when log channel is full; add test helper to override log batch config; ensure resolver Close is called in cache test via t.Cleanup.

Sequence Diagram(s)

sequenceDiagram
    participant Caller as Test / Operation
    participant LockResolver as LockResolver
    participant AsyncRunner as BackgroundLogLoop
    participant Logger as Zap Logger

    Caller->>LockResolver: resolveLock / resolvePessimisticLock
    LockResolver->>LockResolver: recordResolveLock(...) (enqueue)
    Note right of LockResolver: non-blocking, drop-on-full
    AsyncRunner-->>LockResolver: periodically flush batch
    AsyncRunner->>Logger: emit single "txn resolve locks" Info with batched entries
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • logs: resolve lock #1901: Modifies txnkv/txnlock/lock_resolver.go with related resolve-lock logging changes; closely related to logging behavior changes in this PR.

Suggested labels

lgtm, size/M

Suggested reviewers

  • MyonKeminta
  • cfzjywxk

Poem

🐰 I queued the logs, then hopped away,
Batched and tidy, they wait to say —
“Resolve done, counts all in a row,”
Close the resolver, watch resources go. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: introducing batched asynchronous resolve lock logging in LockResolver, which is the primary focus of the entire changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 15, 2026
@lcwangchao lcwangchao force-pushed the resolve_lock_log_key branch from 6d57ac7 to 6a67186 Compare April 15, 2026 07:12
@ti-chi-bot ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 15, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Apr 15, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-15 07:13:43.742805559 +0000 UTC m=+1545228.948165616: ☑️ agreed by ekexium.

@ti-chi-bot ti-chi-bot Bot added the approved label Apr 15, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Apr 16, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ekexium

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 16, 2026
@lcwangchao lcwangchao force-pushed the resolve_lock_log_key branch from 473a410 to 79d1b55 Compare April 16, 2026 07:21
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@txnkv/txnlock/lock_resolver.go`:
- Around line 249-357: The batched resolver logs lose request-scoped context
because resolveLockLogEntry only stores lock/status data; update
resolveLockLogEntry to include request-scoped fields (e.g., a
map[string]interface{} or a dedicated struct like requestFields with
caller/session/source IDs), populate those fields when creating entries in
recordResolveLock and recordResolvePessimisticLock (capture them from the
current request context where these functions are called, e.g., the context used
previously with logutil.Logger(bo.GetCtx())), extend
resolveLockLogEntry.MarshalLogObject to emit those request fields, and finally
ensure collectResolveLockLogsLoop flush uses zap.Array/zap.Object to include
those preserved request fields when logging the batched entries so
caller/session/request-source correlation is retained.
- Around line 1499-1503: The primary-fast-path branch after getTxnStatusFromLock
(the bytes.Equal(l.Key, l.Primary) case) incorrectly records the resolution as a
region resolve by calling lr.recordResolvePessimisticLock(...,
pessimisticRegionResolve, true) even though no pessimistic rollback RPC is sent;
change the call in that branch to pass regionResolve=false (i.e. use false
instead of pessimisticRegionResolve) so the recordResolvePessimisticLock call
accurately reflects that CheckTxnStatus resolved the lock locally; locate this
in the block surrounding metrics.LockResolverCountWithResolveLocks.Inc(),
bytes.Equal(l.Key, l.Primary) and lr.recordResolvePessimisticLock calls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b4290cea-8cce-4803-9a09-e584dde28be3

📥 Commits

Reviewing files that changed from the base of the PR and between 1adc54c and 473a410.

📒 Files selected for processing (5)
  • integration_tests/2pc_test.go
  • integration_tests/lock_test.go
  • integration_tests/shared_lock_test.go
  • txnkv/txnlock/lock_resolver.go
  • txnkv/txnlock/lock_resolver_test.go

Comment thread txnkv/txnlock/lock_resolver.go
Comment thread txnkv/txnlock/lock_resolver.go
@lcwangchao lcwangchao changed the title txnkv: add key entry in some logs in resolveLock txnkv: batch async resolve lock logs in LockResolver Apr 16, 2026
@lcwangchao
Copy link
Copy Markdown
Contributor Author

/retest

@lcwangchao
Copy link
Copy Markdown
Contributor Author

@coderabbitai review it again

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 16, 2026

@lcwangchao: Sure, I'll review the PR again now!


🧠 Learnings used
Learnt from: lcwangchao
Repo: tikv/client-go PR: 1945
File: txnkv/txnlock/lock_resolver.go:1499-1503
Timestamp: 2026-04-16T08:09:59.421Z
Learning: In `txnkv/txnlock/lock_resolver.go`, the primary fast-path of `resolvePessimisticLock` (the `bytes.Equal(l.Key, l.Primary)` branch) intentionally passes `pessimisticRegionResolve` (not `false`) as the `regionResolve` argument to `recordResolvePessimisticLock`. This mirrors the pre-refactor per-lock log which also emitted `regionResolve=pessimisticRegionResolve` on this path. The field's meaning may be refined in a future PR, but should not be changed as part of logging refactors that aim to preserve existing log semantics.
✅ Actions performed

Full review triggered.

@lcwangchao
Copy link
Copy Markdown
Contributor Author

/retest

Signed-off-by: Chao Wang <cclcwangchao@hotmail.com>
@lcwangchao lcwangchao force-pushed the resolve_lock_log_key branch from 79d1b55 to 5974570 Compare April 16, 2026 08:27
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
txnkv/txnlock/lock_resolver_test.go (1)

178-189: Test helper modifies package-level state.

overrideResolveLockLogConfig mutates global variables, which could cause interference if tests using different config values run in parallel. Currently this is fine since these tests don't use t.Parallel(), but consider adding a comment noting this constraint for future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@txnkv/txnlock/lock_resolver_test.go` around lines 178 - 189, The helper
overrideResolveLockLogConfig mutates package-level variables
(resolveLockLogBatchSize, resolveLockLogFlushInterval,
resolveLockLogChannelSize) which can interfere with tests if run in parallel;
add a clear comment directly above the overrideResolveLockLogConfig function
stating that it modifies global state and that callers/tests must not use
t.Parallel() (or must coordinate) and mark the helper with t.Helper() in callers
if appropriate so future maintainers understand the constraint.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@txnkv/txnlock/lock_resolver_test.go`:
- Around line 178-189: The helper overrideResolveLockLogConfig mutates
package-level variables (resolveLockLogBatchSize, resolveLockLogFlushInterval,
resolveLockLogChannelSize) which can interfere with tests if run in parallel;
add a clear comment directly above the overrideResolveLockLogConfig function
stating that it modifies global state and that callers/tests must not use
t.Parallel() (or must coordinate) and mark the helper with t.Helper() in callers
if appropriate so future maintainers understand the constraint.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 214584bf-64e9-471c-ae9e-f34864b0afc3

📥 Commits

Reviewing files that changed from the base of the PR and between 473a410 and 5974570.

📒 Files selected for processing (5)
  • integration_tests/2pc_test.go
  • integration_tests/lock_test.go
  • integration_tests/shared_lock_test.go
  • txnkv/txnlock/lock_resolver.go
  • txnkv/txnlock/lock_resolver_test.go
✅ Files skipped from review due to trivial changes (2)
  • integration_tests/shared_lock_test.go
  • integration_tests/2pc_test.go

@lcwangchao
Copy link
Copy Markdown
Contributor Author

/retest

1 similar comment
@lcwangchao
Copy link
Copy Markdown
Contributor Author

/retest

@lcwangchao
Copy link
Copy Markdown
Contributor Author

@coderabbitai review this PR again

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 17, 2026

@lcwangchao: Sure, I'll review the PR again now!


🧠 Learnings used
Learnt from: lcwangchao
Repo: tikv/client-go PR: 1945
File: txnkv/txnlock/lock_resolver.go:1499-1503
Timestamp: 2026-04-16T08:09:59.421Z
Learning: In `txnkv/txnlock/lock_resolver.go`, the primary fast-path of `resolvePessimisticLock` (the `bytes.Equal(l.Key, l.Primary)` branch) intentionally passes `pessimisticRegionResolve` (not `false`) as the `regionResolve` argument to `recordResolvePessimisticLock`. This mirrors the pre-refactor per-lock log which also emitted `regionResolve=pessimisticRegionResolve` on this path. The field's meaning may be refined in a future PR, but should not be changed as part of logging refactors that aim to preserve existing log semantics.

Learnt from: lcwangchao
Repo: tikv/client-go PR: 1940
File: txnkv/txnlock/lock_resolver_test.go:131-138
Timestamp: 2026-04-17T05:34:37.571Z
Learning: In `txnkv/txnlock/lock_resolver_test.go` `TestTryAsyncResolve`, the `time.Sleep` + `checkMetricVal` block immediately after `waitSemaphoreSizeWithCheck(5)` is intentional: it asserts that rejected `tryAsyncResolve` calls (when semaphore is exhausted) do NOT mutate the running-tasks gauge, verifying a stability invariant distinct from the `Eventually`-based check above it. Do not flag this as redundant.
✅ Actions performed

Full review triggered.

@lcwangchao
Copy link
Copy Markdown
Contributor Author

/hold

@ti-chi-bot ti-chi-bot Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-1-more-lgtm Indicates a PR needs 1 more LGTM. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants