Skip to content

[ENH] Wire up garbage collector to do 3-phase GC. #4987

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jun 30, 2025

Conversation

rescrv
Copy link
Contributor

@rescrv rescrv commented Jun 28, 2025

Description of changes

Wire up the garbage collector to do 3-phase GC.

Test plan

CI

Documentation Changes

N/A

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

Copy link
Contributor

propel-code-bot bot commented Jun 28, 2025

Implement 3-Phase Garbage Collection for WAL Logs and Wire Up End-to-End GC Path

This PR introduces a full 3-phase garbage collection (GC) workflow across the WAL-backed log service and garbage collector in Chroma, extending the API, service implementations, and GC operator logic accordingly. The changes add a new GarbageCollectPhase2 RPC to the logservice API and wire all the GC phases (compute, truncate manifest, delete files) into both the log service and the garbage collector, updating configuration, orchestration, and operator layers to support this across distributed services. It also updates documentation and charts to expose the new API and workflow.

Key Changes:
• Introduces and wires up a new GarbageCollectPhase2 gRPC endpoint in the LogService proto and Rust/Go implementations, allowing phase 2 (manifest truncation) to be performed by the log service.
• Refactors the garbage collector's log deletion operators to use explicit three-phase GC: phase 1 (compute garbage), phase 2 (truncate manifest via RPC call to log service), and phase 3 (physical deletion of log files).
• Extends the log abstraction (Log types in Rust) and their trait implementations to support phase 2 and phase 3 of GC as atomic RPC operations.
• Updates several Kubernetes YAMLs (values/Chart/config/config.yaml) to route and expose the new phase2 GC endpoint, including Helm version bump.
• Documents the rationale for three-phase GC and the responsibilities of each phase in wal3/README.md and code comments, clarifying the architectural division of labor.
• Makes breaking config changes for the garbage collector, requiring explicit log service configuration, and propagates log client configuration throughout the stack.
• Updates test and proptest harness to account for the log API changes.
• Minor: NOP implementations of new endpoint in Go (for go log compatibility), proto/plumbing changes, and CI/test/plumbing fixes.

Affected Areas:
• Distributed garbage collection (Rust GC service, Rust log service, wal3)
• Proto API (logservice.proto, generated Rust/Go code, service interfaces)
• Cluster/k8s configuration (Helm charts, values.yaml, service configs)
• Rust log, log-service, garbage_collector, wal3, and associated glue/operators
• Documentation (wal3/README.md, comment updates)

This summary was automatically generated by @propel-code-bot

@rescrv rescrv requested a review from sanketkedia June 29, 2025 17:49
@rescrv rescrv force-pushed the rescrv/three-phase-gc branch from ba3fe76 to 9bbea49 Compare June 29, 2025 17:51
@rescrv rescrv force-pushed the rescrv/three-phase-gc-service branch from d5d740b to a1786d3 Compare June 29, 2025 17:51
Base automatically changed from rescrv/three-phase-gc to main June 30, 2025 16:06
@rescrv rescrv force-pushed the rescrv/three-phase-gc-service branch from a1786d3 to 5475ab1 Compare June 30, 2025 16:08
@@ -163,4 +173,6 @@ service LogService {
rpc InspectLogState(InspectLogStateRequest) returns (InspectLogStateResponse) {}
// This endpoint should route to the rust log service.
rpc ScrubLog(ScrubLogRequest) returns (ScrubLogResponse) {}
// This endpoint should route to the rust log service.
rpc GarbageCollectPhase2(GarbageCollectPhase2Request) returns (GarbageCollectPhase2Response) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we rename this so that the name explains what this service does in more details?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you name it? It's phase2 of the process known as garbage collect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from my understanding it's truncating the manifest? so maybe TruncateManifest

@@ -183,6 +183,10 @@ func (s *logServer) ScrubLog(ctx context.Context, req *logservicepb.ScrubLogRequ
return
}

func (s *logServer) GarbageCollectPhase2(ctx context.Context, req *logservicepb.GarbageCollectPhase2Request) (res *logservicepb.GarbageCollectPhase2Response, err error) {
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this return nil or an actual response?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an empty response. I'm copying other empty responses here. Should we set the struct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it's implicitly converting nil to empty response in this case. We can set the struct for clarity or leave it as is for consistency, whichever you prefer

match writer.garbage_collect(&GarbageCollectionOptions::default(), Some(*minimum_log_offset_to_keep)).await {
Ok(()) => Ok(()),
match writer.garbage_collect_phase1(&GarbageCollectionOptions::default(), Some(*minimum_log_offset_to_keep)).await {
Ok(true) => {},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when is the writer expected to return true v/s false. And why do we skip phase2 and 3 in case it returns false?

Copy link
Contributor

@sanketkedia sanketkedia Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is phase1 not done by the log service? It can compute and write out the garbage and send a path to the garbage file as a response of the rpc?

tracing::error!("Unable to garbage collect log for collection [{collection_id}]: {err}");
return Err(DeleteUnusedLogsError::Gc(err));
};
if let Err(err) = writer.garbage_collect_phase3(&GarbageCollectionOptions::default()).await {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also would prefer to name these phases better instead of just phase1, 2 and 3

return Err(DeleteUnusedLogsError::Wal3(err));
}
};
if let Err(err) = logs.garbage_collect_phase2(*collection_id).await {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add comments on why phase3 is done by the GC service i.e. the argument of keeping a single service that deletes all the files

@rescrv rescrv force-pushed the rescrv/three-phase-gc-service branch from c8d817a to 19793f0 Compare June 30, 2025 20:43
Copy link

github-actions bot commented Jun 30, 2025

✅ The Helm chart's version was changed. Your changes to the chart will be published upon merge to main.

@rescrv rescrv merged commit 76de464 into main Jun 30, 2025
57 checks passed
@rescrv rescrv deleted the rescrv/three-phase-gc-service branch June 30, 2025 22:53
rescrv added a commit that referenced this pull request Jul 1, 2025
## Description of changes

Wire up the garbage collector to do 3-phase GC.

## Test plan

CI

## Documentation Changes

N/A
rescrv added a commit that referenced this pull request Jul 1, 2025
## Description of changes

Included PRs in order:

- **[TST]  Test for #4972 (#4983)**
- **[CLN] Fix dedup in get_collections_with_new_data. (#4974)**
- **[ENH] Implement three-phase garbage collection for WAL3 (#4984)**
- **[CLN] Remove err(Display) from wal3. (#4992)**
- **[ENH] Wire up garbage collector to do 3-phase GC. (#4987)**
- **[ENH]  Do not materialize all fragments to delete. (#5004)**

## Test plan

CI through main and then CI on the hotfix branch.

## Documentation Changes

N/A
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants