
Data garbage collection never fully removes expired data from disk #720

@stack72

Description

Running swamp data gc reports expired data entries but never actually reclaims the disk space they occupy. Expired data accumulates on disk permanently because there is no code path that hard-deletes all versions of an expired data entry.

Steps to Reproduce

  1. Create model data with a short duration lifetime (e.g., 1m)
  2. Write several versions to the data
  3. Wait for the lifetime to expire
  4. Run swamp data gc
  5. Observe that the data is reported as expired, but version directories remain on disk
  6. Run swamp data gc again — the same data is reported as expired again

Root Cause

The GC operates in two phases inside deleteExpiredData() in data_lifecycle_service.ts:

Phase 1 — Lifecycle expiration (soft delete): Calls removeLatestMarker() to delete the latest pointer file. No version directories are removed — zero disk space is freed.

Phase 2 — Version GC: Calls collectGarbage() on all models to trim excess versions based on the garbageCollection policy (e.g., keep N versions or versions within a duration).

There are two interacting problems:

Problem 1: No hard-delete path for expired data

collectGarbage() always protects the latest version: the check at unified_data_repository.ts:1364 (if (version !== Math.max(...versions))) skips the highest-numbered version, so at least one version always survives. There is no code path that removes all versions of an expired entry.

For example, consider expired data that has garbageCollection: 5 and 3 versions:

  • Phase 1 removes the latest marker (no disk space freed)
  • Phase 2 sees 3 versions against a keep-5 policy; since 3 <= 5, nothing is removed
  • All 3 version directories remain on disk indefinitely
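
The example above can be traced with a small in-memory model. This is only a sketch of the behavior described in this issue, not the actual swamp code: the Entry shape, runGc, and the simplified removeLatestMarker and collectGarbage are hypothetical stand-ins.

```typescript
// Hypothetical in-memory stand-in for one data entry on disk.
interface Entry {
  versions: number[]; // version directories present
  latest: number | null; // contents of the latest marker, null if missing
  expired: boolean; // true once the lifetime has elapsed
  keepVersions: number; // garbageCollection: N
}

// Phase 1 as described in the issue: only the latest marker is removed.
function removeLatestMarker(entry: Entry): void {
  entry.latest = null;
}

// Phase 2 as described in the issue: trim down to keepVersions,
// never deleting the highest version.
function collectGarbage(entry: Entry): number[] {
  const sorted = [...entry.versions].sort((a, b) => a - b);
  const excess = Math.max(0, sorted.length - entry.keepVersions);
  const removed: number[] = [];
  for (const version of sorted.slice(0, excess)) {
    if (version !== Math.max(...entry.versions)) {
      removed.push(version);
    }
  }
  entry.versions = entry.versions.filter((v) => removed.indexOf(v) === -1);
  return removed;
}

// One GC run: report expiry, soft-delete, then trim versions.
function runGc(entry: Entry): { reportedExpired: boolean; removed: number[] } {
  const reportedExpired = entry.expired;
  if (entry.expired) {
    removeLatestMarker(entry);
  }
  return { reportedExpired, removed: collectGarbage(entry) };
}

const example: Entry = { versions: [1, 2, 3], latest: 3, expired: true, keepVersions: 5 };
const firstRun = runGc(example);
const secondRun = runGc(example);
console.log(firstRun, secondRun, example.versions);
```

In this model both runs report the entry as expired, neither removes anything, and all three versions are still present afterwards.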

Problem 2: Version GC undoes the soft-delete

When collectGarbage() does delete an old version, it calls delete(type, modelId, dataName, version) in unified_data_repository.ts. After removing that version, the delete() method (lines 860-863) automatically recreates the latest marker, pointing it at the highest remaining version:

const versions = await this.listVersions(type, modelId, dataName);
if (versions.length > 0) {
  const newLatest = Math.max(...versions);
  await this.updateLatestMarker(type, modelId, dataName, newLatest);
}

This undoes the soft-delete from Phase 1. Additionally, getLatestVersion() has a fallback (lines 1509-1511) that scans version directories when the latest file is missing, so the data remains accessible even between the two phases.
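
Both effects can be reproduced in a toy repository. The InMemoryRepo class below is a hypothetical, synchronous stand-in for the repository in unified_data_repository.ts; it only assumes the marker-recreation and fallback behavior described above.

```typescript
// Hypothetical in-memory stand-in for a single data entry in the repository.
class InMemoryRepo {
  private versions = new Set<number>();
  private latestMarker: number | null = null;

  write(version: number): void {
    this.versions.add(version);
    this.latestMarker = version;
  }

  // Phase 1 soft delete: drop only the latest marker.
  removeLatestMarker(): void {
    this.latestMarker = null;
  }

  hasLatestMarker(): boolean {
    return this.latestMarker !== null;
  }

  listVersions(): number[] {
    return Array.from(this.versions).sort((a, b) => a - b);
  }

  // Mirrors the quoted snippet: after deleting one version,
  // the marker is re-pointed at the highest remaining version.
  deleteVersion(version: number): void {
    this.versions.delete(version);
    const remaining = this.listVersions();
    if (remaining.length > 0) {
      this.latestMarker = Math.max(...remaining);
    }
  }

  // Mirrors the described fallback: scan versions when the marker is missing.
  getLatestVersion(): number | null {
    if (this.latestMarker !== null) {
      return this.latestMarker;
    }
    const remaining = this.listVersions();
    return remaining.length > 0 ? Math.max(...remaining) : null;
  }
}

const repo = new InMemoryRepo();
[1, 2, 3].forEach((v) => repo.write(v));

repo.removeLatestMarker(); // Phase 1
const visibleBetweenPhases = repo.getLatestVersion(); // 3, via the fallback scan

repo.deleteVersion(1); // Phase 2 trims an old version
const markerRestored = repo.hasLatestMarker(); // true, the soft delete is undone
console.log(visibleBetweenPhases, markerRestored);
```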

Expected Behavior

swamp data gc should fully remove expired data entries from disk (all version directories), freeing the associated disk space. Running GC a second time should not find the same expired data again.

Summary of Fix

The fix involves changes to the deleteExpiredData method in data_lifecycle_service.ts and potentially the collectGarbage or delete methods in unified_data_repository.ts:

  • For expired data, all versions should be hard-deleted (not just the latest marker removed). This could be done by calling delete(type, modelId, dataName) without a version (which already removes the entire data directory) instead of removeLatestMarker().
  • If the two-phase approach is preserved, collectGarbage needs to be aware of soft-deleted data and skip recreating the latest marker for entries that have been expired. Alternatively, version GC should run before lifecycle expiration so the marker removal happens last.
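
A minimal sketch of the first option, assuming a simplified in-memory store: expired entries are removed outright, so a second GC run finds nothing. Store, StoredEntry, expiresAt, and deleteEntry are hypothetical names for illustration; deleteEntry stands in for calling delete(type, modelId, dataName) without a version.

```typescript
// Hypothetical shape of one stored data entry.
interface StoredEntry {
  versions: number[];
  latest: number | null;
  expiresAt: number; // epoch milliseconds
}

// Hypothetical store keyed by data name.
type Store = Map<string, StoredEntry>;

// Stand-in for delete(type, modelId, dataName) without a version:
// removes the whole entry, including every version.
function deleteEntry(store: Store, name: string): void {
  store.delete(name);
}

// Hard-delete every entry whose lifetime has elapsed and return their names.
function deleteExpiredData(store: Store, now: number): string[] {
  const expired: string[] = [];
  store.forEach((entry, name) => {
    if (entry.expiresAt <= now) {
      expired.push(name);
    }
  });
  for (const name of expired) {
    deleteEntry(store, name);
  }
  return expired;
}

const store: Store = new Map<string, StoredEntry>();
store.set("report", { versions: [1, 2, 3], latest: 3, expiresAt: 1_000 });
store.set("config", { versions: [1], latest: 1, expiresAt: 10_000 });

const firstPass = deleteExpiredData(store, 5_000);
const secondPass = deleteExpiredData(store, 5_000);
console.log(firstPass, secondPass, store.has("config"));
```

Because the whole entry is gone after the first pass, there is no latest marker left for version GC to recreate, which sidesteps Problem 2 for expired data.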

Environment

  • swamp version: 20260315.010849.0-sha.5a9897f6
  • Platform: macOS Darwin 25.3.0

Metadata

Labels

beta (Issues required to close out before public beta), bug (Something isn't working)
