- Track seen inodes during scan via `HashSet<u64>`
- Files with `nlink > 1` whose inode was already counted are recorded with no size (`None`), so they contribute zero bytes to totals
- `nlink == 1` fast path skips the HashSet entirely (99%+ of files)
- Fixes overcounting that caused on-disk sizes to exceed actual disk capacity
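
The dedup rule described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `HardlinkDedup`, `Sizes`, and `sizes_for` are hypothetical names, and the inode, link count, logical length, and block count are passed in as plain integers (on Unix they would typically come from `std::os::unix::fs::MetadataExt`).

```rust
use std::collections::HashSet;

/// Sizes recorded for one scanned file. `None` means "do not count these bytes".
#[derive(Debug, PartialEq, Eq)]
pub struct Sizes {
    pub logical: Option<u64>,
    pub physical: Option<u64>,
}

/// Hypothetical helper: remembers which hardlinked inodes were already counted
/// during a single scan.
#[derive(Default)]
pub struct HardlinkDedup {
    seen: HashSet<u64>,
}

impl HardlinkDedup {
    /// Returns the sizes to record for a file with the given inode and link count.
    pub fn sizes_for(&mut self, ino: u64, nlink: u64, logical: u64, blocks: u64) -> Sizes {
        // Fast path: single-link files never touch the set.
        // For multi-link files, `insert` returns false when the inode is already present.
        if nlink > 1 && !self.seen.insert(ino) {
            return Sizes { logical: None, physical: None };
        }
        Sizes {
            logical: Some(logical),
            // Physical size follows the `st_blocks * 512` convention used in the doc.
            physical: Some(blocks * 512),
        }
    }

    /// Number of unique hardlinked inodes tracked so far.
    pub fn tracked(&self) -> usize {
        self.seen.len()
    }
}
```

Inode numbers are only unique within one filesystem, so a sketch like this assumes a scan does not cross device boundaries; keying the set on `(dev, ino)` would be one way to relax that assumption.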
File: `apps/desktop/src-tauri/src/indexing/CLAUDE.md` (+3 -1: 3 additions & 1 deletion)
@@ -15,7 +15,7 @@ Full design: `docs/specs/drive-indexing/plan.md`
- **store.rs** -- SQLite schema v7 (integer-keyed entries with `name_folded` column on macOS, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. `resolve_component` uses the composite index directly: on macOS queries by `(parent_id, name_folded)`, on Linux/Windows by `(parent_id, name)`. Schema version check: mismatch triggers drop+rebuild. v7 adds dual sizes (logical + physical). Both path-keyed (backward compat) and integer-keyed APIs.
- **memory_watchdog.rs** -- Background task monitoring resident memory via `mach_task_info` (macOS). Warns at 8 GB, stops indexing at 16 GB, emits `index-memory-warning` event to frontend. No-op stub on non-macOS. Started from `start_indexing()`.
- **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). `WRITER_GENERATION: AtomicU64` (initialized to 1) bumped on every mutation (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `TruncateData`) for search index staleness detection. Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` auto-propagates deltas on both insert and update: on insert, propagates the full size (+file_count or +dir_count); on update, reads the old entry first and propagates only the size difference. This means callers never need a separate `PropagateDeltaById` for upserted entries. For new directories, also initializes a zero-valued `dir_stats` row so enrichment always has a row. Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + an `entries_inserted` counter), cleared on `TruncateData`. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). 
Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle). No index drop/recreate dance — the composite indexes (`idx_parent_name_folded` on macOS, `idx_parent_name` on Linux) use binary collation and stay present during scans.
-
- **scanner.rs** -- jwalk-based parallel directory walker. `scan_volume()` for full scan, `scan_subtree()` for targeted subtree rescans (used by post-replay background verification). Uses `ScanContext` (from store.rs) to assign integer IDs and parent IDs during the walk: maintains a `HashMap<PathBuf, i64>` mapping directory paths to assigned IDs. The scan root is mapped to `ROOT_ID` (1). Sends `InsertEntriesV2(Vec<EntryRow>)` batches to the writer. Platform-specific exclusion filters via `should_exclude` (`pub(super)`), the single exclusion gate for all code paths (scanner, reconciler, event_loop verification, per-navigation verifier). `default_exclusions()` is `#[cfg(test)]` only. Physical sizes (`st_blocks * 512`).
+
- **scanner.rs** -- jwalk-based parallel directory walker. `scan_volume()` for full scan, `scan_subtree()` for targeted subtree rescans (used by post-replay background verification). Uses `ScanContext` (from store.rs) to assign integer IDs and parent IDs during the walk: maintains a `HashMap<PathBuf, i64>` mapping directory paths to assigned IDs. The scan root is mapped to `ROOT_ID` (1). Sends `InsertEntriesV2(Vec<EntryRow>)` batches to the writer. Platform-specific exclusion filters via `should_exclude` (`pub(super)`), the single exclusion gate for all code paths (scanner, reconciler, event_loop verification, per-navigation verifier). `default_exclusions()` is `#[cfg(test)]` only. Physical sizes (`st_blocks * 512`). Hardlink inode dedup: files with `nlink > 1` are tracked in a `HashSet<u64>` by inode; only the first link's size is counted, subsequent links get `size = None`. Files with `nlink == 1` (vast majority) skip the set entirely.
- **aggregator.rs** -- Dir stats computation. Bottom-up after full scan (O(N) single pass), per-subtree after subtree rescans, incremental delta propagation up ancestor chain for watcher events. Two entry points for full aggregation: `compute_all_aggregates_reported` (loads maps from SQL) and `compute_all_aggregates_with_maps` (accepts pre-built maps from the writer). Both accept an `on_progress: &mut dyn FnMut(AggregationProgress)` callback and delegate to `compute_and_write()` for the shared topological sort + bottom-up computation + batch write. Progress is reported at phase transitions and every ~1% during compute/write loops. `AggregationPhase` enum: `SavingEntries` (flushing writer channel), `LoadingDirectories`, `Sorting`, `Computing`, `Writing`. (The former `RebuildingIndex` phase was removed when the composite `idx_parent_name` index with `platform_case` collation was replaced — now uses binary-collation composite indexes that don't need rebuilding.) `backfill_missing_dir_stats` is a catch-up pass that finds directories without `dir_stats` rows and computes their stats bottom-up; triggered after reconciler replay and cold-start replay via `BackfillMissingDirStats` writer message.
- **watcher.rs** -- Drive-level filesystem watcher. macOS: FSEvents via `cmdr-fsevent-stream` with event IDs and `sinceWhen` replay. Linux: `notify` crate (inotify backend) with recursive watching and synthetic event counter. Other platforms: stub. `supports_event_replay()` lets callers branch on whether journal replay is available.
- **reconciler.rs** -- Buffers FSEvents during scan (capped at 500K events; overflow sets `buffer_overflow` flag forcing full rescan), replays after scan completes using event IDs to skip stale events. Processes live events for file creates/removes/modifies using integer-keyed write messages (`UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`). Resolves filesystem paths to entry IDs via `store::resolve_path()` using a read connection passed by callers. Key functions (`process_fs_event`, `emit_dir_updated`) are `pub(super)` so `mod.rs` can call them directly during cold-start replay. `reconcile_subtree()` handles MustScanSubDirs by diffing filesystem vs DB directory-by-directory instead of delete-then-reinsert, making it safe to interrupt at any point.
@@ -127,6 +127,8 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
**Dual sizes (logical + physical)**: Both `meta.len()` (logical) and `st_blocks * 512` (physical) are stored. Logical size is displayed by default (mapped to `recursive_size` at the IPC boundary). Physical size is stored in DB but not yet exposed to the frontend. Physical sizes may overcount ~10-20% for APFS clones (shared blocks). Volume usage bar uses `statfs()` for true totals.
+
**Hardlink inode dedup at scan time**: Files with `nlink > 1` are tracked by inode in a `HashSet<u64>` local to `run_scan`. The second+ link for the same inode gets `logical_size = None, physical_size = None`, so aggregation counts each inode's bytes exactly once. The `nlink > 1` check is a fast path: single-link files (the vast majority) skip the HashSet entirely, so there's no overhead for typical workloads. Memory cost is ~8 bytes per unique hardlinked inode. The set lives for one scan and is dropped with the scan's stack frame. Only applies to the scanner (full/subtree scans); the reconciler handles individual live events where cross-event dedup isn't applicable.
+
**MustScanSubDirs uses reconciliation, not delete-then-reinsert**: `reconcile_subtree()` diffs the filesystem against the DB directory-by-directory, only inserting/deleting/updating entries that changed. This is safe to interrupt at any point (no bulk delete phase that could leave the DB empty). For brand-new directories discovered during reconciliation, a `flush_blocking()` + re-resolve cycle ensures their IDs are available before recursing into them. `scanner::scan_subtree` (which uses destructive `DeleteDescendantsById`) is used by post-replay background verification for newly discovered directories.
**In-memory accumulation eliminates aggregation SQL queries**: During a full scan, the writer thread accumulates two HashMaps in `AccumulatorMaps` as `InsertEntriesV2` batches arrive: `direct_stats` (parent_id -> file size/count/dir count) and `child_dirs` (parent_id -> child dir IDs). When `ComputeAllAggregates` fires, these maps are passed to `compute_all_aggregates_with_maps()`, skipping the two expensive full-table-scan SQL queries (`bulk_get_children_stats_by_id` and `bulk_get_child_dir_ids`) that previously dominated aggregation time (~70%). Maps are cleared on `TruncateData` and after aggregation completes. Falls back to SQL queries if maps are empty.
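
The in-memory accumulation described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the writer's actual code: the field names `direct_stats` and `child_dirs` come from the text, while `DirStats`, `record`, and `aggregate` are hypothetical, and whether size-less hardlink duplicates still count toward `file_count` is an assumption made here.

```rust
use std::collections::HashMap;

/// Per-directory totals (hypothetical shape).
#[derive(Default, Clone, Copy, Debug, PartialEq, Eq)]
pub struct DirStats {
    pub size: u64,
    pub file_count: u64,
    pub dir_count: u64,
}

/// Maps built up while entries are inserted, keyed by parent directory ID.
#[derive(Default)]
pub struct AccumulatorMaps {
    pub direct_stats: HashMap<i64, DirStats>,
    pub child_dirs: HashMap<i64, Vec<i64>>,
}

impl AccumulatorMaps {
    /// Record one inserted entry under its parent.
    pub fn record(&mut self, id: i64, parent_id: i64, is_dir: bool, size: Option<u64>) {
        let stats = self.direct_stats.entry(parent_id).or_default();
        if is_dir {
            stats.dir_count += 1;
            self.child_dirs.entry(parent_id).or_default().push(id);
        } else {
            // Assumption: an entry with `size = None` (for example a deduplicated
            // hardlink) is still counted as a file but adds zero bytes.
            stats.file_count += 1;
            stats.size += size.unwrap_or(0);
        }
    }

    /// Bottom-up recursive totals for every directory reachable from `root`.
    pub fn aggregate(&self, root: i64) -> HashMap<i64, DirStats> {
        // Pre-order walk: every directory is listed before its descendants.
        let mut order = Vec::new();
        let mut stack = vec![root];
        while let Some(dir) = stack.pop() {
            order.push(dir);
            if let Some(kids) = self.child_dirs.get(&dir) {
                stack.extend(kids.iter().copied());
            }
        }
        // Reversed pre-order visits all descendants before their ancestor.
        let mut totals: HashMap<i64, DirStats> = HashMap::new();
        for &dir in order.iter().rev() {
            let mut total = self.direct_stats.get(&dir).copied().unwrap_or_default();
            for kid in self.child_dirs.get(&dir).into_iter().flatten() {
                let child = totals[kid];
                total.size += child.size;
                total.file_count += child.file_count;
                total.dir_count += child.dir_count;
            }
            totals.insert(dir, total);
        }
        totals
    }
}
```

Because both maps are filled as batches arrive, a pass like `aggregate` needs no database reads, which is the saving the paragraph above attributes to skipping the two bulk SQL queries.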