
fix(backend): add filesystem-first scan to GC to handle orphaned disk resources #973

Merged
brendan-kellam merged 3 commits into main from
brendan/filesystem-first-gc-scan
Mar 2, 2026

Conversation


@brendan-kellam brendan-kellam commented Mar 2, 2026

Summary

  • Adds cleanupOrphanedDiskResources() to RepoIndexManager which walks the repos/ and index/ directories on disk and removes any entries with no corresponding Repo record in the database
  • Runs once at startup (not on every polling tick), since orphaned resource recovery is a startup concern — ongoing cleanup is handled by the normal GC job queue
  • Uses a single batched findMany query for both repo directories and index shards instead of per-entry findUnique calls (avoids N+1)
  • Handles desyncs caused by DB resets or cascade deletes that bypass the normal cleanup job flow (e.g. resetting the dev database while .sourcebot/ persists on disk)
  • Adds getRepoIdFromPath() to @sourcebot/shared (inverse of getRepoPath()) and getRepoIdFromShardFileName() to backend utils (inverse of getShardPrefix())
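The batched lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in `repoIndexManager.ts`: `DiskEntry`, `collectRepoIds`, and `selectOrphans` are hypothetical names, the file names are made up, and the set of existing ids stands in for the result of a single `findMany` query filtered by `id: { in: [...] }`. Skipping entries whose names cannot be parsed is also an assumption of this sketch.

```typescript
// Hypothetical shape: a disk entry (repo directory or index shard) paired with
// the repo id parsed from its name, or undefined if the name is unrecognized.
interface DiskEntry {
    name: string;
    repoId: number | undefined;
}

// Step 1: collect and deduplicate every repo id seen on disk, so that the
// database can be asked about all of them in one query.
function collectRepoIds(entries: DiskEntry[]): number[] {
    const ids = entries
        .map((e) => e.repoId)
        .filter((id): id is number => id !== undefined);
    return Array.from(new Set(ids));
}

// Step 2: given the ids the database confirmed, return the names of entries
// whose repo id has no record. Unparseable names are left alone here.
function selectOrphans(entries: DiskEntry[], existingIds: ReadonlySet<number>): string[] {
    return entries
        .filter((e) => e.repoId !== undefined && !existingIds.has(e.repoId))
        .map((e) => e.name);
}

// Usage with made-up entries: two repo directories, one shard, one stray file.
const entries: DiskEntry[] = [
    { name: "1", repoId: 1 },
    { name: "2", repoId: 2 },
    { name: "1_2_v16.00000.zoekt", repoId: 2 },
    { name: "notes.txt", repoId: undefined },
];
const idsToQuery = collectRepoIds(entries);
// Pretend the single batched query reported that only repo 1 still exists.
const existing = new Set<number>([1]);
const orphans = selectOrphans(entries, existing);
```

Splitting the work this way keeps the number of queries constant regardless of how many directories and shards are on disk, since both kinds of entry share one id set.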

Test plan

  • Reset the database while .sourcebot/repos/ and .sourcebot/index/ contain data, restart the backend, and verify orphaned directories and shards are removed on startup
  • Verify normal indexing and cleanup flows are unaffected

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • System now automatically removes orphaned repository directories and index shards during startup to maintain filesystem integrity.


coderabbitai bot commented Mar 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between feef580 and d942550.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • packages/backend/src/index.ts
  • packages/backend/src/repoIndexManager.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • CHANGELOG.md

Walkthrough

Adds a filesystem-first garbage collection step at scheduler startup: new utilities extract repo IDs from disk paths, RepoIndexManager scans repo and index cache directories for orphaned entries (no DB record) and deletes them; startup now awaits this cleanup before scheduling.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Changelog | CHANGELOG.md | Added Unreleased Fixed entry documenting filesystem-first GC scan that removes orphaned repo dirs and index shards. |
| Repo index manager | packages/backend/src/repoIndexManager.ts | Added private cleanupOrphanedDiskResources() to scan REPOS_CACHE_DIR and INDEX_CACHE_DIR for disk entries without DB records and delete them; startScheduler() is now async and awaits this cleanup. New imports for repo-id utilities and REPOS_CACHE_DIR. |
| Startup caller | packages/backend/src/index.ts | Changed top-level call to await RepoIndexManager.startScheduler() to ensure startup waits for orphan cleanup. |
| Backend utils | packages/backend/src/utils.ts | Added `getRepoIdFromShardFileName(fileName: string): number …` |
| Shared utils & exports | packages/shared/src/utils.ts, packages/shared/src/index.server.ts | Added `getRepoIdFromPath(repoPath: string): number …` |

Sequence Diagram(s)

sequenceDiagram
    participant Scheduler
    participant RepoIndexManager
    participant FileSystem as File System
    participant Database

    Scheduler->>RepoIndexManager: startScheduler()
    activate RepoIndexManager
    RepoIndexManager->>RepoIndexManager: cleanupOrphanedDiskResources()
    Note over RepoIndexManager: Scan REPOS_CACHE_DIR
    RepoIndexManager->>FileSystem: readdir(REPOS_CACHE_DIR)
    FileSystem-->>RepoIndexManager: [repo_dirs]
    loop each repo_dir
        RepoIndexManager->>RepoIndexManager: getRepoIdFromPath(dir)
        RepoIndexManager->>Database: repoExists(repoId)?
        alt not found
            RepoIndexManager->>FileSystem: delete repo directory
        end
    end
    Note over RepoIndexManager: Scan INDEX_CACHE_DIR
    RepoIndexManager->>FileSystem: readdir(INDEX_CACHE_DIR)
    FileSystem-->>RepoIndexManager: [shard_files]
    loop each shard_file
        RepoIndexManager->>RepoIndexManager: getRepoIdFromShardFileName(file)
        RepoIndexManager->>Database: repoExists(repoId)?
        alt not found
            RepoIndexManager->>FileSystem: delete shard file
        end
    end
    deactivate RepoIndexManager

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • msukkari
🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: adding a filesystem-first GC scan to handle orphaned disk resources, which is the core objective of the PR. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


… resources

Adds cleanupOrphanedDiskResources() which walks the repos/ and index/
directories on disk and removes any entries with no corresponding Repo
record in the database. This handles desyncs caused by DB resets or
cascade deletes that bypass the normal cleanup job flow.

Also adds getRepoIdFromPath() to @sourcebot/shared and
getRepoIdFromShardFileName() to the backend utils as inverse helpers
to the existing getRepoPath() and getShardPrefix() functions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
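
The inverse relationship between a shard-prefix helper and a shard-file-name parser can be illustrated as below. This is a sketch under stated assumptions: the `orgId_repoId` prefix format, the `_v16.00000.zoekt` suffix, and both function names are hypothetical and may not match what `getShardPrefix()` and `getRepoIdFromShardFileName()` actually do.

```typescript
// Hypothetical forward helper. The real getShardPrefix() may use a different
// format; `${orgId}_${repoId}` is assumed only for illustration.
function shardPrefixSketch(orgId: number, repoId: number): string {
    return `${orgId}_${repoId}`;
}

// Inverse under the same assumption: recover the repo id from a file name
// that starts with "<orgId>_<repoId>_". Returns undefined for anything else.
function repoIdFromShardFileNameSketch(fileName: string): number | undefined {
    const match = /^\d+_(\d+)_/.exec(fileName);
    return match ? Number(match[1]) : undefined;
}

// Round trip: build a made-up shard file name, then parse the id back out.
const shardFile = `${shardPrefixSketch(1, 42)}_v16.00000.zoekt`;
const recoveredRepoId = repoIdFromShardFileNameSketch(shardFile);

// A file that does not follow the assumed naming scheme yields undefined,
// so a caller can skip it rather than guess at an id.
const unrelated = repoIdFromShardFileNameSketch("README.md");
```

Returning `undefined` for unrecognized names means a cleanup pass can ignore files it does not understand instead of deleting them.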
@brendan-kellam brendan-kellam force-pushed the brendan/filesystem-first-gc-scan branch from 0423e5c to feef580 Compare March 2, 2026 19:48

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/backend/src/repoIndexManager.ts`:
- Around line 657-675: The orphan shard cleanup currently does a per-entry DB
lookup (this.db.repo.findUnique) causing N+1 queries; instead, collect all
repoIds by mapping entries through getRepoIdFromShardFileName, deduplicate them,
fetch existing repos once via this.db.repo.findMany({ where: { id: { in: [...] }
} }), build a Set of existing ids, and then iterate entries removing files whose
repoId is missing from that Set; apply the same batching approach used for repo
directories under INDEX_CACHE_DIR to avoid per-entry DB calls.

In `@packages/shared/src/utils.ts`:
- Around line 90-93: The getRepoIdFromPath function uses parseInt which accepts
trailing characters; change it to validate the basename is strictly numeric
before parsing (e.g., test path.basename(repoPath) against a /^\d+$/ check) and
only then convert to a number (or use Number after validation) so names like
"123_tmp" return undefined; update getRepoIdFromPath to perform this strict
check and return undefined for non-pure-numeric basenames.
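
The difference between the lenient and strict parsing described in the second finding can be shown with a small comparison. This is an illustrative sketch rather than the code in `packages/shared/src/utils.ts`: both function names are hypothetical, and they take an already-extracted basename (what `path.basename(repoPath)` would return) to keep the example free of imports.

```typescript
// Lenient: parseInt stops at the first non-digit character, so a basename
// such as "123_tmp" is accepted and parsed as 123.
function parseRepoIdLenient(baseName: string): number | undefined {
    const id = parseInt(baseName, 10);
    return Number.isNaN(id) ? undefined : id;
}

// Strict: only a basename made up entirely of digits counts as a repo id.
function parseRepoIdStrict(baseName: string): number | undefined {
    return /^\d+$/.test(baseName) ? Number(baseName) : undefined;
}

// The two agree on a purely numeric directory name...
const lenientPlain = parseRepoIdLenient("123");
const strictPlain = parseRepoIdStrict("123");

// ...but differ on a name with trailing characters.
const lenientTmp = parseRepoIdLenient("123_tmp");
const strictTmp = parseRepoIdStrict("123_tmp");
```

With the lenient version, a leftover directory like `123_tmp` would be treated as belonging to repo 123; the strict version reports it as unrecognized.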

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4004933 and feef580.

📒 Files selected for processing (5)
  • CHANGELOG.md
  • packages/backend/src/repoIndexManager.ts
  • packages/backend/src/utils.ts
  • packages/shared/src/index.server.ts
  • packages/shared/src/utils.ts

@brendan-kellam brendan-kellam merged commit fde1280 into main Mar 2, 2026
8 checks passed
@brendan-kellam brendan-kellam deleted the brendan/filesystem-first-gc-scan branch March 2, 2026 20:55