Skip to content

gc: Implement GC state cache #10677

Draft
MyonKeminta wants to merge 18 commits into
tikv:masterfrom
MyonKeminta:m/gc-state-cache
Draft

gc: Implement GC state cache #10677
MyonKeminta wants to merge 18 commits into
tikv:masterfrom
MyonKeminta:m/gc-state-cache

Conversation

@MyonKeminta
Copy link
Copy Markdown
Contributor

@MyonKeminta MyonKeminta commented May 14, 2026

What problem does this PR solve?

Issue Number: Ref #10659 , Close #10607

What is changed and how does it work?

This PR implements GC state cache. It now caches GC safe points and txn safe points of all keyspaces in memory, and can be used by GetGCState and GetAllKeyspacesGCStates.
This is a formal version of #10615 , and includes tests from #10617 with proper adaptions

Requires:

Check List

Tests

  • Unit test
  • Integration test

Code changes

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

Release note

None.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added configurable options to exclude GC barrier data from API responses for improved performance when barrier details aren't required.
    • Introduced in-memory caching for GC state to enhance query performance on the leader node.
  • Improvements

    • Enhanced stability of GC state retrieval during PD leadership transitions.
  • Tests

    • Added comprehensive test coverage for GC state behavior across leadership transitions and cache scenarios.

Review Change Stack

MyonKeminta and others added 7 commits May 12, 2026 16:00
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
ref tikv#10607

gc,server: add GC state cache test coverage

Add focused unit and integration coverage for the GC state cache behavior
introduced by GetGCState caching.

The added tests cover cache hit/miss behavior, cache update and invalidation
paths, and leader transition semantics for both cached and uncached requests.

Signed-off-by: Wenxuan Zhang <wenxuangm@gmail.com>
@ti-chi-bot ti-chi-bot Bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. labels May 14, 2026
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented May 14, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zhouqiang-cl for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@MyonKeminta MyonKeminta marked this pull request as draft May 14, 2026 09:52
@ti-chi-bot ti-chi-bot Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 14, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 14, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c7f96133-5b56-49a1-a079-1909ca7c3442

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR extends the GC state API with configurable options to exclude GC barrier data, refactors data models to enforce access via error-returning accessors, and implements an in-memory cache in GCStateManager with leadership-aware fast/slow read paths and safe-point advancement integration.

Changes

GC Barriers Exclusion & GCStateManager Caching

Layer / File(s) Summary
Client API Contract & Data Model
client/clients/gc/client.go
Introduces GCStatesAPIOptions with functional option helpers (ExcludeGCBarriers, ExcludeGlobalGCBarriers) and refactors GCState / ClusterGCStates to store barrier data internally with boolean flags; adds constructors and error-returning accessor methods.
Client RPC Implementation with Options
client/gc_client.go
Updates GetGCState and GetAllKeyspacesGCStates to accept variadic options; propagates exclude flags to PD requests and conditionally constructs result objects with or without barriers.
GCStateManager Cache Foundation
pkg/gc/gc_state_manager.go
Introduces sharded gcStateCache holding cached per-keyspace TxnSafePoint/GCSafePoint, leadership counter, and ordered singleflight instances; includes cache operations and bulk cloning.
Cached Read Paths with Failpoint Instrumentation
pkg/gc/gc_state_manager.go
Implements fast/slow path getGCStateImpl/getGCStateImplSlow with cache hits (leader only, excluding barriers) and storage fallback; conditionally loads GC barriers only when excludeGCBarriers is false.
GCStateManager Public API Updates
pkg/gc/gc_state_manager.go
Adds excludeGCBarriers boolean to GetGCState and GetAllKeyspacesGCStates; routes through cache when beneficial and uses separate ordered singleflights per exclude-barriers variant.
Safe Point Advancement with Cache Management
pkg/gc/gc_state_manager.go
Integrates cache updates into GC and txn safe point advancement: removes on transaction errors, stores updated values on success.
Cluster Lifecycle Integration & Leadership Wiring
server/cluster/cluster.go
Extends Server interface with GetGCStateManager(), wires manager into cluster start/stop lifecycle via leadership callbacks.
Service RPC Handlers with Barrier Filtering
server/gc_service.go
Updates handlers to accept and forward exclude flags to manager; conditionally loads/returns global barriers; injects failpoint for testing.
Client Unit Tests for Accessors
client/clients/gc/client_test.go
Validates accessor methods report barrier presence correctly and return appropriate errors/data.
GCStateManager Cache & Concurrency Tests
pkg/gc/gc_state_manager_test.go
Validates cache miss/hit behavior, failure isolation, warm cache preservation across mutations, and downgrade compatibility.
Leadership Transition & Failpoint Tests
tests/server/gc/gc_test.go
Adds five tests exercising cache fast-path, slow-path, old-leader rejection, NOT_BOOTSTRAPPED errors, and leadership recovery scenarios.
Integration Tests: Client & Service API Validation
tests/integrations/client/client_test.go, tests/server/api/service_gc_safepoint_test.go
Updates test helpers to validate barrier exclusion defaults and explicit inclusion; asserts accessor behavior and barrier content/TTL.
Cache Access Metrics
pkg/gc/metrics.go
Registers Prometheus CounterVec for cache access totals (hit, slow_hit, miss).
Service Integration Call Sites
server/api/service_gc_safepoint.go
Updates existing service calls to pass the new excludeGCBarriers parameter.
Dependency Updates
go.mod, client/go.mod, tests/integrations/go.mod, tools/go.mod
Adds replace directives to use github.com/MyonKeminta/kvproto fork with barrier-exclusion support.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

  • tikv/pd#10659: Implements the "exclude GC barriers" API design and adds GCStateManager caching with fast-path logic, addressing the first two enhancement subtasks for blocking/timeout optimization.

Suggested labels

size/XXL, type/development, lgtm, approved

Suggested reviewers

  • rleungx
  • ystaticy
  • okJiang

Poem

🐰 The barriers fade when whispers say "exclude,"
A cache stands guard where leaders trod,
Safe points advance through sharded abodes,
Failpoints dance when crowns are swapped,
Accessors guard what once lay bare.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 32.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'gc: Implement GC state cache' follows the pkg: description format, is concise, and clearly summarizes the main change—implementing an in-memory cache for GC state data.
Description check ✅ Passed The PR description includes required Issue Number references (Ref #10659, Close #10607), explains changes via commit message, and documents dependencies. The checklist specifies unit and integration tests, API/data changes, and potential side effects.
Linked Issues check ✅ Passed The PR implements all core coding objectives from #10607: introduces context-aware synchronization via singleflight mechanisms [gc_state_manager.go], uses read-friendly cache lookups instead of exclusive locks [GetGCState, GetAllKeyspacesGCStates], shares concurrent request results via cache [gc_state_manager.go:cache logic], and adds configurable excludeGCBarriers support to enable incremental improvements [client.go options API].
Out of Scope Changes check ✅ Passed All code changes directly support the GC state cache implementation and linked issue objectives: client API options [client.go], cache integration [gc_state_manager.go], metric instrumentation [metrics.go], server/service integration [gc_service.go, cluster.go], and comprehensive test coverage [all test files]. No unrelated changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 14, 2026
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented May 14, 2026

@MyonKeminta: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-integration-realcluster-test f7eba22 link true /test pull-integration-realcluster-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
client/gc_client.go (1)

331-343: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip barrier conversion on the excluded fast path.

pbToGCState still allocates and converts every GcBarrierInfo before checking excludeGCBarriers. Any response that still carries barriers will pay the full decode cost and then discard it, which defeats the default fast path for these APIs.

♻️ Proposed fix
 func pbToGCState(pb *pdpb.GCState, reqStartTime time.Time, excludeGCBarriers bool) gc.GCState {
 	keyspaceID := constants.NullKeyspaceID
 	if pb.KeyspaceScope != nil {
 		keyspaceID = pb.KeyspaceScope.KeyspaceId
 	}
+	if excludeGCBarriers {
+		return gc.NewGCStateWithoutGCBarriers(keyspaceID, pb.GetTxnSafePoint(), pb.GetGcSafePoint())
+	}
 	gcBarriers := make([]*gc.GCBarrierInfo, 0, len(pb.GetGcBarriers()))
 	for _, b := range pb.GetGcBarriers() {
 		gcBarriers = append(gcBarriers, pbToGCBarrierInfo(b, reqStartTime))
 	}
-	if !excludeGCBarriers {
-		return gc.NewGCStateWithGCBarriers(keyspaceID, pb.GetTxnSafePoint(), pb.GetGcSafePoint(), gcBarriers)
-	}
-	return gc.NewGCStateWithoutGCBarriers(keyspaceID, pb.GetTxnSafePoint(), pb.GetGcSafePoint())
+	return gc.NewGCStateWithGCBarriers(keyspaceID, pb.GetTxnSafePoint(), pb.GetGcSafePoint(), gcBarriers)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/gc_client.go` around lines 331 - 343, pbToGCState currently converts
all pb.GetGcBarriers() into gc.GCBarrierInfo objects even when excludeGCBarriers
is true; change the control flow to first check excludeGCBarriers and only
allocate/iterate pb.GetGcBarriers() and call pbToGCBarrierInfo when
excludeGCBarriers is false. Keep computing keyspaceID and the txn/gc safe points
as before, but call gc.NewGCStateWithoutGCBarriers immediately when
excludeGCBarriers is true to avoid unnecessary allocations; otherwise build
gcBarriers and call gc.NewGCStateWithGCBarriers.
tests/integrations/client/client_test.go (1)

2726-2782: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid asserting barrier order in these integration checks.

These new expectations rely on globalGCBarriers[1] / gcBarriers[0], but the client contract here doesn't guarantee slice ordering. The existing checkGCBarrier helpers in this file already search by ID, which is the safer pattern; otherwise these assertions will get flaky if the server changes iteration order without changing semantics.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integrations/client/client_test.go` around lines 2726 - 2782, The test
is asserting specific slice indices (globalGCBarriers[1], gcBarriers[0]) but
ordering isn't guaranteed; update the checks after GetAllKeyspacesGCStates /
SetGlobalGCBarrier / SetGCBarrier to find the barrier by BarrierID instead of
indexing: use the existing checkGCBarrier helper (or iterate
state.GetGlobalGCBarriers()/state.GetGCBarriers() to locate BarrierID
"b2","b3","b4") and then assert BarrierTS and TTL on that found entry; change
assertions that reference globalGCBarriers[1] and gcBarriers[0] to use the
helper/found-item approach so the test no longer depends on slice order.
tests/server/api/service_gc_safepoint_test.go (1)

124-146: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make the post-delete assertions order-insensitive.

Both re.Equal(list.ServiceGCSafepoints[1:3], leftSsps) and the final HTTP response equality hard-code safepoint order, but this test is really about deletion visibility. Comparing by ServiceID/contents (or sorting first) will make it robust against harmless ordering changes in the manager or API layer.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/server/api/service_gc_safepoint_test.go` around lines 124 - 146, The
assertions currently assume a stable order for safepoints (in GetGCState usage
and the HTTP response); change them to be order-insensitive by either sorting
the slices by ServiceID (e.g., sort.Slice on leftSsps and
list.ServiceGCSafepoints before comparing) or by comparing as maps keyed by
ServiceID/contents (build maps from leftSsps and
listRespAfterDelete.ServiceGCSafepoints and assert equality of entries),
updating references to GetGCState, leftSsps, ServiceGCSafepoints,
expectedAfterDelete and listRespAfterDelete accordingly so the test verifies
deletion visibility regardless of ordering.
pkg/gc/gc_state_manager.go (1)

463-470: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Check LoadGCSafePoint failures before refreshing the cache.

err1 is never checked after LoadGCSafePoint, so this transaction can still succeed with gcSafePoint left at zero and then write that fabricated value into gcStateCache. A later GetGCState(..., true) can therefore observe a regressed GCSafePoint even though storage never contained it.

Suggested fix
 		oldTxnSafePoint, err1 = m.gcMetaStorage.LoadTxnSafePoint(keyspaceID)
 		if err1 != nil {
 			return err1
 		}
 		gcSafePoint, err1 = m.gcMetaStorage.LoadGCSafePoint(keyspaceID)
+		if err1 != nil {
+			return err1
+		}
 
 		if target < oldTxnSafePoint {
 			return errs.ErrDecreasingTxnSafePoint.GenWithStackByArgs(oldTxnSafePoint, target)
 		}

Also applies to: 563-567

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/gc/gc_state_manager.go` around lines 463 - 470, The transaction callback
in m.gcMetaStorage.RunInGCStateTransaction is not checking the error after
calling m.gcMetaStorage.LoadGCSafePoint, which can leave gcSafePoint at zero and
then write a bad value into gcStateCache; modify the callback used by
RunInGCStateTransaction so that after calling LoadGCSafePoint you check err1 and
return it on non-nil (just like you do after LoadTxnSafePoint) before proceeding
to refresh gcStateCache or any writes; apply the same fix to the similar block
around lines handling LoadGCSafePoint at the second location (the block
referencing oldTxnSafePoint/gcSafePoint at 563-567) so both paths return on
LoadGCSafePoint errors rather than continuing to update gcStateCache with an
uninitialized value.
🧹 Nitpick comments (1)
pkg/gc/gc_state_manager_test.go (1)

2430-2503: ⚡ Quick win

Add the same sharing test for excludeGCBarriers=true.

This regression only exercises GetAllKeyspacesGCStates(..., false), but the true branch goes through a different OrderedSingleFlight and different context plumbing. Mirroring this test for the exclude-barriers path would pin that branch's sharing/cancellation behavior too.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/gc/gc_state_manager_test.go` around lines 2430 - 2503, Add a parallel
test that mirrors TestGetAllKeyspacesGCStatesConcurrentCallSharingResult but
calls s.manager.GetAllKeyspacesGCStates(context.Background(), true) to exercise
the excludeGCBarriers=true path; reuse the same failpoints
("github.com/tikv/pd/pkg/gc/onGetAllKeyspacesGCStatesStart" and "...Finish"),
atomic executionCount, channel/result struct, the same concurrency pattern, and
the same AdvanceTxnSafePoint(constant.NullKeyspaceID, 100, ...) and assertions
(first caller sees old TxnSafePoint 0, later callers see 100, and
executionCount.Load() equals 2) so the OrderedSingleFlight/cancellation behavior
for the exclude-barriers branch is validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@client/clients/gc/client.go`:
- Around line 314-361: Add GoDoc comments for the exported symbols
(GCState.HasGCBarriers, GCState.GetGCBarriers, ClusterGCStates,
NewClusterGCStatesWithoutGlobalGCBarriers,
NewClusterGCStatesWithGlobalGCBarriers, ClusterGCStates.HasGlobalGCBarriers,
ClusterGCStates.GetGlobalGCBarriers) that start with the symbol name and briefly
describe what they return/represent; avoid stuttering in the type comment (e.g.,
"ClusterGCStates represents the GC state for all keyspaces."). Also update the
error text in GetGCBarriers and GetGlobalGCBarriers to instruct callers to use
the option-based API (mention gc.ExcludeGCBarriers(false) and
gc.ExcludeGlobalGCBarriers(false) respectively) instead of the old exclude...
parameter wording so runtime guidance and generated docs are consistent.

In `@pkg/gc/gc_state_manager.go`:
- Around line 775-785: The cached-read path checks m.nodeIsLeader() then returns
cachedGCState, which can race with OnNodeBecomesFollower clearing the cache; to
fix, after loading cachedGCState (gcStateCache.Load) re-check leadership (call
m.nodeIsLeader() again) before returning and only return the cached state if the
second check is true, otherwise fall through so the cache isn't served
post-stepdown; apply the same double-check pattern to the other cached-read site
referenced (lines ~931-935) so cache reads are synchronized with
OnNodeBecomesFollower's invalidation.
- Around line 931-943: The callback passed to
m.allKeyspacesGCStatesExcludeGCBarriersSingleFlight.Do incorrectly uses the
outer ctx when calling m.iterateAllKeyspacesGCStates, so cancellation from the
first caller can cancel the shared execution; change the
iterateAllKeyspacesGCStates invocation (and any other uses of ctx within that Do
callback) to use the provided execCtx instead, keeping the rest of the logic the
same (e.g., preserve the cachedGCStates population, the m.nodeIsLeader() check
and gcStateCache.CloneAllAsGCStates(), and the excludeGCBarriers predicate) so
the singleflight execution respects the joining caller's context.

---

Outside diff comments:
In `@client/gc_client.go`:
- Around line 331-343: pbToGCState currently converts all pb.GetGcBarriers()
into gc.GCBarrierInfo objects even when excludeGCBarriers is true; change the
control flow to first check excludeGCBarriers and only allocate/iterate
pb.GetGcBarriers() and call pbToGCBarrierInfo when excludeGCBarriers is false.
Keep computing keyspaceID and the txn/gc safe points as before, but call
gc.NewGCStateWithoutGCBarriers immediately when excludeGCBarriers is true to
avoid unnecessary allocations; otherwise build gcBarriers and call
gc.NewGCStateWithGCBarriers.

In `@pkg/gc/gc_state_manager.go`:
- Around line 463-470: The transaction callback in
m.gcMetaStorage.RunInGCStateTransaction is not checking the error after calling
m.gcMetaStorage.LoadGCSafePoint, which can leave gcSafePoint at zero and then
write a bad value into gcStateCache; modify the callback used by
RunInGCStateTransaction so that after calling LoadGCSafePoint you check err1 and
return it on non-nil (just like you do after LoadTxnSafePoint) before proceeding
to refresh gcStateCache or any writes; apply the same fix to the similar block
around lines handling LoadGCSafePoint at the second location (the block
referencing oldTxnSafePoint/gcSafePoint at 563-567) so both paths return on
LoadGCSafePoint errors rather than continuing to update gcStateCache with an
uninitialized value.

In `@tests/integrations/client/client_test.go`:
- Around line 2726-2782: The test is asserting specific slice indices
(globalGCBarriers[1], gcBarriers[0]) but ordering isn't guaranteed; update the
checks after GetAllKeyspacesGCStates / SetGlobalGCBarrier / SetGCBarrier to find
the barrier by BarrierID instead of indexing: use the existing checkGCBarrier
helper (or iterate state.GetGlobalGCBarriers()/state.GetGCBarriers() to locate
BarrierID "b2","b3","b4") and then assert BarrierTS and TTL on that found entry;
change assertions that reference globalGCBarriers[1] and gcBarriers[0] to use
the helper/found-item approach so the test no longer depends on slice order.

In `@tests/server/api/service_gc_safepoint_test.go`:
- Around line 124-146: The assertions currently assume a stable order for
safepoints (in GetGCState usage and the HTTP response); change them to be
order-insensitive by either sorting the slices by ServiceID (e.g., sort.Slice on
leftSsps and list.ServiceGCSafepoints before comparing) or by comparing as maps
keyed by ServiceID/contents (build maps from leftSsps and
listRespAfterDelete.ServiceGCSafepoints and assert equality of entries),
updating references to GetGCState, leftSsps, ServiceGCSafepoints,
expectedAfterDelete and listRespAfterDelete accordingly so the test verifies
deletion visibility regardless of ordering.

---

Nitpick comments:
In `@pkg/gc/gc_state_manager_test.go`:
- Around line 2430-2503: Add a parallel test that mirrors
TestGetAllKeyspacesGCStatesConcurrentCallSharingResult but calls
s.manager.GetAllKeyspacesGCStates(context.Background(), true) to exercise the
excludeGCBarriers=true path; reuse the same failpoints
("github.com/tikv/pd/pkg/gc/onGetAllKeyspacesGCStatesStart" and "...Finish"),
atomic executionCount, channel/result struct, the same concurrency pattern, and
the same AdvanceTxnSafePoint(constant.NullKeyspaceID, 100, ...) and assertions
(first caller sees old TxnSafePoint 0, later callers see 100, and
executionCount.Load() equals 2) so the OrderedSingleFlight/cancellation behavior
for the exclude-barriers branch is validated.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 51fa5214-de87-4050-8ed9-81a227a97ff4

📥 Commits

Reviewing files that changed from the base of the PR and between 9036aea and f7eba22.

⛔ Files ignored due to path filters (4)
  • client/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
  • tests/integrations/go.sum is excluded by !**/*.sum
  • tools/go.sum is excluded by !**/*.sum
📒 Files selected for processing (16)
  • client/clients/gc/client.go
  • client/clients/gc/client_test.go
  • client/gc_client.go
  • client/go.mod
  • go.mod
  • pkg/gc/gc_state_manager.go
  • pkg/gc/gc_state_manager_test.go
  • pkg/gc/metrics.go
  • server/api/service_gc_safepoint.go
  • server/cluster/cluster.go
  • server/gc_service.go
  • tests/integrations/client/client_test.go
  • tests/integrations/go.mod
  • tests/server/api/service_gc_safepoint_test.go
  • tests/server/gc/gc_test.go
  • tools/go.mod

Comment on lines +314 to +361
func (s GCState) HasGCBarriers() bool {
return s.hasGCBarriers
}

func (s GCState) GetGCBarriers() ([]*GCBarrierInfo, error) {
if !s.HasGCBarriers() {
return nil, errors.New("trying to get GC barriers from GCState that doesn't provide GC barriers info. " +
"to retrieve GC barriers, pass false to excludeGCBarriers parameter to GC APIs")
}
return s.gcBarriers, nil
}

// ClusterGCStates represents the information of the GC state for all keyspaces.
type ClusterGCStates struct {
// Maps from keyspace id to GC state of that keyspace.
GCStates map[uint32]GCState
// All existing global GC barriers.
GlobalGCBarriers []*GlobalGCBarrierInfo
GCStates map[uint32]GCState
hasGlobalGCBarriers bool
globalGCBarriers []*GlobalGCBarrierInfo
}

// NewClusterGCStatesWithoutGlobalGCBarriers creates a ClusterGCStates instance without global GC barriers info.
func NewClusterGCStatesWithoutGlobalGCBarriers(gcStates map[uint32]GCState) ClusterGCStates {
return ClusterGCStates{
GCStates: gcStates,
hasGlobalGCBarriers: false,
globalGCBarriers: nil,
}
}

// NewClusterGCStatesWithGlobalGCBarriers creates a ClusterGCStates instance with global GC barriers info.
func NewClusterGCStatesWithGlobalGCBarriers(gcStates map[uint32]GCState, globalGCBarriers []*GlobalGCBarrierInfo) ClusterGCStates {
return ClusterGCStates{
GCStates: gcStates,
hasGlobalGCBarriers: true,
globalGCBarriers: globalGCBarriers,
}
}

func (s ClusterGCStates) HasGlobalGCBarriers() bool {
return s.hasGlobalGCBarriers
}

func (s ClusterGCStates) GetGlobalGCBarriers() ([]*GlobalGCBarrierInfo, error) {
if !s.HasGlobalGCBarriers() {
return nil, errors.New("trying to get global GC barriers from ClusterGCStates that doesn't provide global GC barriers info. " +
"to retrieve global GC barriers, pass false to excludeGlobalGCBarriers parameter to GC APIs")
}
return s.globalGCBarriers, nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make the new accessor API self-describing.

These exported accessors are now part of the public client contract, but they don't have GoDoc, and the error text still tells callers to pass nonexistent exclude... parameters instead of using gc.ExcludeGCBarriers(false) / gc.ExcludeGlobalGCBarriers(false). That leaves both generated docs and runtime guidance out of sync with the option-based API.

As per coding guidelines, "Exported identifiers need GoDoc starting with the name; avoid stutter (pd.PDServer -> Server)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/clients/gc/client.go` around lines 314 - 361, Add GoDoc comments for
the exported symbols (GCState.HasGCBarriers, GCState.GetGCBarriers,
ClusterGCStates, NewClusterGCStatesWithoutGlobalGCBarriers,
NewClusterGCStatesWithGlobalGCBarriers, ClusterGCStates.HasGlobalGCBarriers,
ClusterGCStates.GetGlobalGCBarriers) that start with the symbol name and briefly
describe what they return/represent; avoid stuttering in the type comment (e.g.,
"ClusterGCStates represents the GC state for all keyspaces."). Also update the
error text in GetGCBarriers and GetGlobalGCBarriers to instruct callers to use
the option-based API (mention gc.ExcludeGCBarriers(false) and
gc.ExcludeGlobalGCBarriers(false) respectively) instead of the old exclude...
parameter wording so runtime guidance and generated docs are consistent.

Comment on lines +775 to +785
if excludeGCBarriers && m.nodeIsLeader() {
if cachedGCState, ok := m.gcStateCache.Load(keyspaceID); ok {
failpoint.InjectCall("getGCStateCacheAccess", "hit")
gcStateCacheAccessHitCounter.Inc()
return GCState{
KeyspaceID: keyspaceID,
IsKeyspaceLevel: keyspaceID != constant.NullKeyspaceID,
TxnSafePoint: cachedGCState.TxnSafePoint,
GCSafePoint: cachedGCState.GCSafePoint,
}, nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Synchronize the leadership check with cached reads.

Both cached-read entry points gate on a lock-free nodeIsLeader() check and then read cached state. OnNodeBecomesFollower decrements nodeLeadership before it clears the shards, so a request can observe true, race with stepdown, and still return warmed cache from the old leader before invalidation finishes.

Also applies to: 931-935

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/gc/gc_state_manager.go` around lines 775 - 785, The cached-read path
checks m.nodeIsLeader() then returns cachedGCState, which can race with
OnNodeBecomesFollower clearing the cache; to fix, after loading cachedGCState
(gcStateCache.Load) re-check leadership (call m.nodeIsLeader() again) before
returning and only return the cached state if the second check is true,
otherwise fall through so the cache isn't served post-stepdown; apply the same
double-check pattern to the other cached-read site referenced (lines ~931-935)
so cache reads are synchronized with OnNodeBecomesFollower's invalidation.

Comment on lines +931 to +943
return m.allKeyspacesGCStatesExcludeGCBarriersSingleFlight.Do(ctx, func(execCtx context.Context) (map[uint32]GCState, error) {
cachedGCStates := make(map[uint32]GCState)
if m.nodeIsLeader() {
cachedGCStates = m.gcStateCache.CloneAllAsGCStates()
}
err := m.iterateAllKeyspacesGCStates(ctx, excludeGCBarriers, func(keyspaceID uint32) bool {
_, ok := cachedGCStates[keyspaceID]
return !ok
}, func(gcState GCState) {
cachedGCStates[gcState.KeyspaceID] = gcState
})
failpoint.Inject("onGetAllKeyspacesGCStatesFinish", func() {})
return result, err
return cachedGCStates, err
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use execCtx in the exclude-barriers singleflight branch.

This callback is running under OrderedSingleFlight, but it iterates with the outer ctx instead of the provided execCtx. If the first caller has a short deadline and a later caller joins with a longer one, canceling the first context aborts the shared execution for everyone, which defeats the cancellation-sharing behavior this path is trying to provide.

Suggested fix
-		err := m.iterateAllKeyspacesGCStates(ctx, excludeGCBarriers, func(keyspaceID uint32) bool {
+		err := m.iterateAllKeyspacesGCStates(execCtx, excludeGCBarriers, func(keyspaceID uint32) bool {
 			_, ok := cachedGCStates[keyspaceID]
 			return !ok
 		}, func(gcState GCState) {
 			cachedGCStates[gcState.KeyspaceID] = gcState
 		})
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return m.allKeyspacesGCStatesExcludeGCBarriersSingleFlight.Do(ctx, func(execCtx context.Context) (map[uint32]GCState, error) {
cachedGCStates := make(map[uint32]GCState)
if m.nodeIsLeader() {
cachedGCStates = m.gcStateCache.CloneAllAsGCStates()
}
err := m.iterateAllKeyspacesGCStates(ctx, excludeGCBarriers, func(keyspaceID uint32) bool {
_, ok := cachedGCStates[keyspaceID]
return !ok
}, func(gcState GCState) {
cachedGCStates[gcState.KeyspaceID] = gcState
})
failpoint.Inject("onGetAllKeyspacesGCStatesFinish", func() {})
return result, err
return cachedGCStates, err
return m.allKeyspacesGCStatesExcludeGCBarriersSingleFlight.Do(ctx, func(execCtx context.Context) (map[uint32]GCState, error) {
cachedGCStates := make(map[uint32]GCState)
if m.nodeIsLeader() {
cachedGCStates = m.gcStateCache.CloneAllAsGCStates()
}
err := m.iterateAllKeyspacesGCStates(execCtx, excludeGCBarriers, func(keyspaceID uint32) bool {
_, ok := cachedGCStates[keyspaceID]
return !ok
}, func(gcState GCState) {
cachedGCStates[gcState.KeyspaceID] = gcState
})
failpoint.Inject("onGetAllKeyspacesGCStatesFinish", func() {})
return cachedGCStates, err
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/gc/gc_state_manager.go` around lines 931 - 943, The callback passed to
m.allKeyspacesGCStatesExcludeGCBarriersSingleFlight.Do incorrectly uses the
outer ctx when calling m.iterateAllKeyspacesGCStates, so cancellation from the
first caller can cancel the shared execution; change the
iterateAllKeyspacesGCStates invocation (and any other uses of ctx within that Do
callback) to use the provided execCtx instead, keeping the rest of the logic the
same (e.g., preserve the cachedGCStates population, the m.nodeIsLeader() check
and gcStateCache.CloneAllAsGCStates(), and the excludeGCBarriers predicate) so
the singleflight execution respects the joining caller's context.

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GCStateManager calls may be blocked and unable to cancel or timeout when there's IO issue

2 participants