Commit 44abfd1

Indexing: Remove PathResolver LRU cache
The `PathResolver` (50K-entry LRU cache for path→ID resolution) was redundant: `enrich_entries_with_index` already resolves paths uncached on every page fetch via `store::resolve_path`. The cache also had a latent staleness bug: invalidation methods were `#[cfg(test)]` only, so deleted/renamed paths returned stale IDs until LRU eviction.

- Delete `path_resolver.rs` (435 lines) and remove `lru` crate dependency
- `get_dir_stats`/`get_dir_stats_batch` now use `store::resolve_path` directly, signatures changed from `&mut self` to `&self`
- Module-level wrappers no longer need `&mut` on the `INDEXING` mutex guard
- Update CLAUDE.md: remove all `PathResolver` references
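
A minimal sketch of the two effects described in the commit message: why a cache forces `&mut self` on lookups, and how a missing invalidation call yields stale IDs. `Store`, `CachedResolver`, and `stale_after_delete` are hypothetical stand-ins, not the project's types; a plain `HashMap` replaces both the LRU and the SQLite store, and paths are looked up whole rather than component by component.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the SQLite-backed store: full path -> entry ID.
struct Store {
    ids: HashMap<String, i64>,
}

impl Store {
    /// Uncached lookup. It only reads, so `&self` is enough.
    fn resolve_path(&self, path: &str) -> Option<i64> {
        self.ids.get(path).copied()
    }
}

/// Hypothetical cache in front of the store. A miss inserts into the cache
/// (and a real LRU also reorders entries on a hit), so lookups need `&mut self`.
struct CachedResolver {
    cache: HashMap<String, i64>,
}

impl CachedResolver {
    fn resolve(&mut self, store: &Store, path: &str) -> Option<i64> {
        if let Some(id) = self.cache.get(path) {
            return Some(*id); // served without consulting the store
        }
        let id = store.resolve_path(path)?;
        self.cache.insert(path.to_string(), id);
        Some(id)
    }
}

/// Returns (cached result, uncached result) for a path that was deleted
/// from the store without any cache invalidation.
fn stale_after_delete() -> (Option<i64>, Option<i64>) {
    let mut store = Store { ids: HashMap::new() };
    store.ids.insert("/a/b".to_string(), 42);

    let mut cached = CachedResolver { cache: HashMap::new() };
    let _ = cached.resolve(&store, "/a/b"); // warm the cache

    store.ids.remove("/a/b"); // delete; nothing tells the cache

    (cached.resolve(&store, "/a/b"), store.resolve_path("/a/b"))
}
```

Calling `stale_after_delete()` returns `(Some(42), None)`: the cached resolver still reports the deleted entry, while the uncached lookup reflects the store.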
1 parent 8f87a4f commit 44abfd1

8 files changed

Lines changed: 122 additions & 601 deletions

File tree

Cargo.lock

Lines changed: 0 additions & 21 deletions
Some generated files are not rendered by default.

apps/desktop/src-tauri/Cargo.toml

Lines changed: 0 additions & 2 deletions
@@ -72,8 +72,6 @@ walkdir = "2"
 # Drive indexing: parallel directory walker, SQLite store, and macOS FSEvents watcher
 jwalk = "0.8"
 rusqlite = { version = "0.32", features = ["bundled", "collation"] }
-# Drive indexing: LRU cache for path→entry_id resolution
-lru = "0.16"
 # For chunked copy metadata preservation (network filesystems)
 xattr = "1"
 filetime = "0.2"

apps/desktop/src-tauri/src/indexing/CLAUDE.md

Lines changed: 7 additions & 9 deletions
@@ -8,12 +8,11 @@ Full design: `docs/specs/drive-indexing/plan.md`
 
 ### Module structure
 
-- **mod.rs** -- Public API (`init()`, `start_indexing()`, `stop_indexing()`, `clear_index()`), `IndexPhase` state machine, `IndexManager` (coordinates all subsystems, owns `PathResolver` for LRU-cached path→ID mapping), `DebugStats` (shared atomic counters for the debug window).
+- **mod.rs** -- Public API (`init()`, `start_indexing()`, `stop_indexing()`, `clear_index()`), `IndexPhase` state machine, `IndexManager` (coordinates all subsystems), `DebugStats` (shared atomic counters for the debug window).
 - **enrichment.rs** -- `ReadPool` (lock-free thread-local read connections for enrichment and verification), `enrich_entries_with_index()` (called every `get_file_range`). Integer-keyed fast path: resolve parent dir once → batch-fetch child dir stats by ID → match by name. Falls back to individual path resolution for edge cases.
 - **event_loop.rs** -- `run_live_event_loop` (real-time FSEvents/inotify processing after scan completes), `run_replay_event_loop` (cold-start journal replay with two-phase approach), `run_background_verification` (post-replay bidirectional readdir diff), `merge_fs_events` (deduplication with flag priority), `process_live_batch`. All bounded-buffer constants live here.
 - **events.rs** -- Tauri event payload structs (`IndexScanStartedEvent`, `IndexScanProgressEvent`, `IndexScanCompleteEvent`, `IndexDirUpdatedEvent`, `IndexReplayProgressEvent`), `RescanReason` enum, `emit_rescan_notification()`, IPC response types (`IndexStatusResponse`, `IndexDebugStatusResponse`).
 - **store.rs** -- SQLite schema v2 (integer-keyed entries, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. Schema version check: mismatch triggers drop+rebuild. Both path-keyed (backward compat) and integer-keyed APIs.
-- **path_resolver.rs** -- `PathResolver`: resolves filesystem paths to integer entry IDs via component-by-component walk with full-path LRU cache (50K entries). Case-aware `CacheKey` on macOS (NFD + case fold). Prefix-based invalidation for deletes/renames.
 - **memory_watchdog.rs** -- Background task monitoring resident memory via `mach_task_info` (macOS). Warns at 8 GB, stops indexing at 16 GB, emits `index-memory-warning` event to frontend. No-op stub on non-macOS. Started from `start_indexing()`.
 - **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` initializes a zero-valued `dir_stats` row when inserting a NEW directory, so enrichment always has a row (subsequent `PropagateDeltaById` calls update it incrementally). Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + an `entries_inserted` counter), cleared on `TruncateData`. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle).
 - **scanner.rs** -- jwalk-based parallel directory walker. `scan_volume()` for full scan, `scan_subtree()` for targeted subtree rescans (used by post-replay background verification). Uses `ScanContext` (from store.rs) to assign integer IDs and parent IDs during the walk: maintains a `HashMap<PathBuf, i64>` mapping directory paths to assigned IDs. The scan root is mapped to `ROOT_ID` (1). Sends `InsertEntriesV2(Vec<EntryRow>)` batches to the writer. Platform-specific exclusion filters (macOS system paths, Linux virtual filesystems). Physical sizes (`st_blocks * 512`).
@@ -82,7 +81,7 @@ Three tables:
 WAL mode, 16 MB page cache, `auto_vacuum = INCREMENTAL` (free pages reclaimed via `PRAGMA incremental_vacuum` after truncation). Custom `platform_case` collation registered on every connection: case-insensitive + NFD normalization on macOS, binary on Linux. **Opening the DB with the sqlite3 CLI will fail** on queries touching the name column (the collation isn't registered).
 
 History of changes:
-- **Schema v3**: Bumped from v2 to force DB rebuild after fixing orphan entry bug. Scanner, writer, aggregator, reconciler, enrichment, and IPC commands all fully migrated to integer keys. `IndexManager` owns a `PathResolver` for LRU-cached path→ID resolution in IPC commands (`get_dir_stats`, `get_dir_stats_batch`). Enrichment uses integer-keyed fast path: resolve parent once → batch child dir stats by ID. Reconciler sends integer-keyed messages exclusively. Old path-keyed `WriteMessage` variants and backward-compat shims (`ScannedEntry`, `DirStats`) still exist for post-replay verification — cleanup in milestone 6.
+- **Schema v3**: Bumped from v2 to force DB rebuild after fixing orphan entry bug. Scanner, writer, aggregator, reconciler, enrichment, and IPC commands all fully migrated to integer keys. Enrichment uses integer-keyed fast path: resolve parent once → batch child dir stats by ID. Reconciler sends integer-keyed messages exclusively. Old path-keyed `WriteMessage` variants and backward-compat shims (`ScannedEntry`, `DirStats`) still exist for post-replay verification — cleanup in milestone 6.
 - **Schema v4**: Bumped from v3 to enable `auto_vacuum = INCREMENTAL` (requires DB rebuild since the pragma must be set before table creation).
 
 ## How to test
## How to test
@@ -98,8 +97,7 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
 - Scanner: full scan with temp dir trees, exclusion filtering, cancellation
 - Firmlinks: path normalization, edge cases
 - Writer: message processing, priority handling
-- Path resolver: cache hit/miss, prefix invalidation, case-insensitive lookups (macOS)
-- mod.rs: end-to-end integration (scan → aggregate → enrich → watcher update → re-enrich), enrichment fast path, fallback, root-level enrichment, PathResolver for dir stats
+- mod.rs: end-to-end integration (scan → aggregate → enrich → watcher update → re-enrich), enrichment fast path, fallback, root-level enrichment
 
 ## Key decisions
 
@@ -109,7 +107,7 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
 
 **Enrichment uses integer-keyed batch lookup**: Instead of N individual `resolve_path()` calls (one per directory in the listing), `enrich_entries_with_index` resolves the parent directory once, queries `list_child_dir_ids_and_names(parent_id)` for all child dir IDs, then `get_dir_stats_batch_by_ids()`. Two indexed queries total instead of N. Falls back to individual path resolution for edge cases (for example, mixed-parent entries).
 
-**IPC boundary stays path-based**: Frontend sends filesystem paths, backend resolves path→ID internally via `PathResolver`. No frontend changes needed. `IndexManager.get_dir_stats()` and `get_dir_stats_batch()` use the `PathResolver`'s LRU cache for efficient resolution.
+**IPC boundary stays path-based**: Frontend sends filesystem paths, backend resolves path→ID internally via `store::resolve_path()`. No frontend changes needed.
 
 **Physical sizes (`st_blocks * 512`)**: More meaningful for disk usage than logical size. May overcount ~10-20% for APFS clones (shared blocks). Volume usage bar uses `statfs()` for true totals.
 
@@ -155,15 +153,15 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
 
 **Scan cancellation leaves partial data**: By design. `scan_completed_at` not set in meta, so next startup detects incomplete scan and runs fresh. No cleanup needed.
 
-**`ReadPool` replaces `INDEXING` lock for all read-only DB access**: Enrichment (`enrich_entries_with_index` in `enrichment.rs`), verification Phase 1 (`verify_affected_dirs` in `event_loop.rs`), and background verification dir-stat reads all use `get_read_pool()` + `pool.with_conn()` — thread-local SQLite connections with no lock contention. The `INDEXING` mutex now guards only lifecycle transitions and IPC commands that need `PathResolver`. `with_conn` uses `thread_local!` storage, so callers must not have `.await` points between obtaining the pool and completing the closure (async task migration would break thread affinity).
+**`ReadPool` replaces `INDEXING` lock for all read-only DB access**: Enrichment (`enrich_entries_with_index` in `enrichment.rs`), verification Phase 1 (`verify_affected_dirs` in `event_loop.rs`), and background verification dir-stat reads all use `get_read_pool()` + `pool.with_conn()` — thread-local SQLite connections with no lock contention. The `INDEXING` mutex now guards only lifecycle transitions and IPC commands needing the `IndexManager`'s read connection. `with_conn` uses `thread_local!` storage, so callers must not have `.await` points between obtaining the pool and completing the closure (async task migration would break thread affinity).
 
 **Progress events use `tauri::async_runtime::spawn`**: Not `tokio::spawn`, because indexing can start from Tauri's synchronous `setup()` hook where no Tokio runtime context exists.
 
-**`platform_case` collation must be registered on every connection**: The custom collation is not persisted in the DB file. Both `IndexStore::open()` and `open_write_connection()` register it. Forgetting to register before querying causes `no such collation sequence: platform_case` errors. On macOS it uses NFD normalization + case folding (matching APFS). On Linux it's binary (zero overhead). The `PathResolver`'s `CacheKey` uses the same normalization via `store::normalize_for_comparison()`.
+**`platform_case` collation must be registered on every connection**: The custom collation is not persisted in the DB file. Both `IndexStore::open()` and `open_write_connection()` register it. Forgetting to register before querying causes `no such collation sequence: platform_case` errors. On macOS it uses NFD normalization + case folding (matching APFS). On Linux it's binary (zero overhead).
 
 **Backward-compat shims resolve paths via component walk**: Old path-keyed functions (`get_entry`, `delete_entry`, `upsert_entry`, etc.) internally call `resolve_path()` which walks the tree component-by-component. This means parent directories MUST exist before inserting children. The aggregator's path-keyed `propagate_delta` and `compute_subtree_aggregates` also resolve paths internally. The reconciler no longer uses these shims -- it sends integer-keyed messages directly (milestone 4). Enrichment no longer uses the path-keyed `get_dir_stats_batch` -- it uses integer-keyed batch lookups via `list_child_dir_ids_and_names` + `get_dir_stats_batch_by_ids` (milestone 5). Remaining users of path-keyed shims: `verify_affected_dirs` (post-replay verification). Cleanup in milestone 6.
 
-**Reconciler holds a read connection**: `process_fs_event`, `replay`, and `process_live_event` all require a `&Connection` parameter for path-to-ID resolution. Callers (event loops in `event_loop.rs`) open a read connection via `IndexStore::open_write_connection(writer.db_path())` at loop start and pass it through. This is a WAL-mode connection so it doesn't block the writer. The `IndexManager` also owns a `PathResolver` with LRU cache, used by IPC commands (`get_dir_stats`, `get_dir_stats_batch`) for cached resolution. The event loops don't use the `PathResolver` yet because they run in separate async tasks -- could be migrated in a future optimization pass.
+**Reconciler holds a read connection**: `process_fs_event`, `replay`, and `process_live_event` all require a `&Connection` parameter for path-to-ID resolution. Callers (event loops in `event_loop.rs`) open a read connection via `IndexStore::open_write_connection(writer.db_path())` at loop start and pass it through. This is a WAL-mode connection so it doesn't block the writer.
 
 **ScanContext maps scan root to ROOT_ID**: Both `scan_volume` and `scan_subtree` create a `ScanContext` that maps the scan root directory to `ROOT_ID` (1). This means all top-level entries under any scan root get `parent_id = ROOT_ID` in the DB. For subtree scans, the root is resolved to its existing entry ID (not ROOT_ID), and `DeleteDescendantsById` is sent before the scan starts. The `ScanContext` opens a temporary read connection to the DB to fetch `next_id` via `get_next_id()`.
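
The ID assignment that the CLAUDE.md notes attribute to `ScanContext` (a `HashMap<PathBuf, i64>` of directory paths, with the scan root mapped to `ROOT_ID`) might look roughly like the sketch below. This is an illustration of the full-scan case only, not the project's code: the field names and the `assign` method are hypothetical, `next_id` is passed in rather than read from the DB, and the walker is assumed to yield each directory before its children.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// ID of the scan root, per the notes above.
const ROOT_ID: i64 = 1;

/// Hypothetical, simplified scan context: directory path -> assigned ID.
struct ScanContext {
    dir_ids: HashMap<PathBuf, i64>,
    next_id: i64,
}

impl ScanContext {
    /// The notes say `next_id` is fetched from the DB; here the caller supplies it.
    fn new(scan_root: &Path, next_id: i64) -> Self {
        let mut dir_ids = HashMap::new();
        dir_ids.insert(scan_root.to_path_buf(), ROOT_ID);
        Self { dir_ids, next_id }
    }

    /// Assigns a fresh ID to `path` and returns `(id, parent_id)`.
    /// Returns `None` (consuming no ID) if the parent directory is unknown.
    fn assign(&mut self, path: &Path, is_dir: bool) -> Option<(i64, i64)> {
        let parent_id = *self.dir_ids.get(path.parent()?)?;
        let id = self.next_id;
        self.next_id += 1;
        if is_dir {
            // Only directories can be parents, so only they are remembered.
            self.dir_ids.insert(path.to_path_buf(), id);
        }
        Some((id, parent_id))
    }
}
```

With a context created for `/vol` and `next_id = 2`, assigning `/vol/docs` as a directory yields `(2, ROOT_ID)`, and a file inside it then gets `(3, 2)`.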
