Description
Running swamp data gc reports expired data entries but never actually reclaims the disk space they occupy. Expired data accumulates on disk permanently because there is no code path that hard-deletes all versions of an expired data entry.
Steps to Reproduce
- Create model data with a short duration lifetime (e.g., 1m)
- Write several versions to the data
- Wait for the lifetime to expire
- Run swamp data gc
- Observe that the data is reported as expired, but version directories remain on disk
- Run swamp data gc again; the same data is reported as expired again
Root Cause
The GC operates in two phases inside deleteExpiredData() in data_lifecycle_service.ts:
Phase 1 — Lifecycle expiration (soft delete): Calls removeLatestMarker() to delete the latest pointer file. No version directories are removed — zero disk space is freed.
Phase 2 — Version GC: Calls collectGarbage() on all models to trim excess versions based on the garbageCollection policy (e.g., keep N versions or versions within a duration).
There are two interacting problems:
Problem 1: No hard-delete path for expired data
collectGarbage() always protects the latest version (unified_data_repository.ts:1364: if (version !== Math.max(...versions))). At least one version always survives. There is no code path that removes all versions of an expired entry.
For example, consider expired data that has 3 versions and garbageCollection: 5:
- Phase 1 removes the latest marker (no disk space freed)
- Phase 2 sees 3 versions, the policy is keep 5, and 3 <= 5, so there is nothing to remove
- All 3 version directories remain on disk indefinitely
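The example above can be reproduced with a small in-memory sketch. This is not swamp's code: versionsToRemove is a hypothetical helper that assumes a keep-N policy plus the "never remove the latest version" guard described above.

```typescript
// Hypothetical model of a keep-N version policy; not swamp's actual implementation.
function versionsToRemove(versions: number[], keep: number): number[] {
  if (versions.length === 0) return [];
  const latest = Math.max(...versions);
  // Newest first: everything past the first `keep` entries is a removal candidate.
  const newestFirst = [...versions].sort((a, b) => b - a);
  // Mirror the guard from the issue: the latest version is never removed.
  return newestFirst.slice(keep).filter((v) => v !== latest);
}

// Expired entry with 3 versions and a keep-5 policy: nothing is removable.
const removable = versionsToRemove([1, 2, 3], 5);
console.log(removable.length); // 0

// Even with keep = 0, the latest version (3) survives.
const removableWithKeepZero = versionsToRemove([1, 2, 3], 0);
console.log(removableWithKeepZero); // [ 2, 1 ]
```

Under these assumptions, no policy value can empty the entry, which is why expired data needs a separate hard-delete path.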
Problem 2: Version GC undoes the soft-delete
When collectGarbage() does delete an old version, it calls delete(type, modelId, dataName, version) in unified_data_repository.ts. The delete() method (lines 860-863) automatically recreates the latest marker pointing to the highest remaining version:
const versions = await this.listVersions(type, modelId, dataName);
if (versions.length > 0) {
  const newLatest = Math.max(...versions);
  await this.updateLatestMarker(type, modelId, dataName, newLatest);
}
This undoes the soft-delete from Phase 1. Additionally, getLatestVersion() has a fallback (lines 1509-1511) that scans version directories when the latest file is missing, so the data remains accessible even between the two phases.
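The interaction can be illustrated with a minimal in-memory stand-in for a single data entry. FakeEntry and deleteVersion are hypothetical; the method bodies are assumptions that follow the behavior described in this issue, not the real repository code.

```typescript
// Hypothetical in-memory stand-in for one data entry; not the real repository class.
class FakeEntry {
  versions: number[];
  latestMarker: number | null;

  constructor(versions: number[]) {
    this.versions = [...versions];
    this.latestMarker = versions.length > 0 ? Math.max(...versions) : null;
  }

  // Phase 1: soft delete only drops the marker; versions stay in place.
  removeLatestMarker(): void {
    this.latestMarker = null;
  }

  // Phase 2 helper: deleting one version recreates the marker if any versions remain.
  deleteVersion(version: number): void {
    this.versions = this.versions.filter((v) => v !== version);
    if (this.versions.length > 0) {
      this.latestMarker = Math.max(...this.versions);
    }
  }

  // Assumed fallback: scan versions when the marker is missing.
  getLatestVersion(): number | null {
    if (this.latestMarker !== null) return this.latestMarker;
    return this.versions.length > 0 ? Math.max(...this.versions) : null;
  }
}

const entry = new FakeEntry([1, 2, 3]);
entry.removeLatestMarker();
const visibleBetweenPhases = entry.getLatestVersion(); // 3, via the fallback scan
entry.deleteVersion(1);
const markerAfterVersionGc = entry.latestMarker; // 3, the marker is back
console.log(visibleBetweenPhases, markerAfterVersionGc);
```

In this model the soft delete has no lasting effect: the entry is readable before version GC runs and has its marker restored afterwards.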
Expected Behavior
swamp data gc should fully remove expired data entries from disk (all version directories), freeing the associated disk space. Running GC a second time should not find the same expired data again.
Summary of Fix
The fix involves changes to the deleteExpiredData method in data_lifecycle_service.ts and potentially the collectGarbage or delete methods in unified_data_repository.ts:
- For expired data, all versions should be hard-deleted (not just the latest marker removed). This could be done by calling delete(type, modelId, dataName) without a version (which already removes the entire data directory) instead of removeLatestMarker().
- If the two-phase approach is preserved, collectGarbage needs to be aware of soft-deleted data and skip recreating the latest marker for entries that have been expired. Alternatively, version GC should run before lifecycle expiration so the marker removal happens last.
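A rough sketch of the first option, again using an in-memory model rather than the real service: gcOnce, Store, and the isExpired callback are hypothetical names, and the Map stands in for the on-disk version directories.

```typescript
// Hypothetical store: entry name -> version numbers (stand-in for version directories).
type Store = Map<string, number[]>;

// One GC pass: hard-delete expired entries, then trim the rest to `keep` versions.
function gcOnce(
  store: Store,
  isExpired: (name: string) => boolean,
  keep: number,
): string[] {
  const reportedExpired: string[] = [];
  // Copy the keys so that entries can be deleted while iterating.
  for (const name of [...store.keys()]) {
    if (isExpired(name)) {
      reportedExpired.push(name);
      store.delete(name); // removes every version, not just a marker
      continue;
    }
    const newestFirst = [...(store.get(name) ?? [])].sort((a, b) => b - a);
    // Live entries always keep at least their latest version.
    store.set(name, newestFirst.slice(0, Math.max(keep, 1)));
  }
  return reportedExpired;
}

const store: Store = new Map([
  ["expired-entry", [1, 2, 3]],
  ["live-entry", [1, 2, 3, 4, 5, 6, 7]],
]);
const expiredNames = new Set(["expired-entry"]);

const firstRun = gcOnce(store, (name) => expiredNames.has(name), 5);
const secondRun = gcOnce(store, (name) => expiredNames.has(name), 5);
console.log(firstRun, secondRun, store.get("live-entry"));
```

Because the expired entry is removed in full on the first pass, the second pass has nothing left to report, which matches the expected behavior described above.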
Environment
- swamp version: 20260315.010849.0-sha.5a9897f6
- Platform: macOS Darwin 25.3.0