- Hardlink dedup: add `inode` column (schema v8) so the writer can enforce "at most one entry per inode has sizes" at upsert time. Previously, the reconciler/verifier overwrote the scanner's NULL-size dedup, inflating dirs like `target/debug/` by ~2.5x.
- Cloud-only files: change `blocks=0` fallback from `meta.len()` to `0`. File Provider files (Google Drive, iCloud) have `st_blocks=0` but non-zero `len()`; the old fallback treated their full cloud size as local disk usage (~551 GB phantom data).
- Smart size mode: when physical=0 and logical>0 (cloud/dataless files), show logical instead of `min(logical, 0)=0`.
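The two size rules above can be sketched as pure functions. This is a minimal illustration, not the project's code: the names `physical_size`, `smart_size`, and `BLOCK_UNIT` are hypothetical, and the `min(logical, physical)` rule for the normal case is inferred from the commit message.

```rust
/// Bytes per `st_blocks` unit. On Unix this is 512 regardless of the
/// filesystem's own block size. (Hypothetical constant name.)
const BLOCK_UNIT: u64 = 512;

/// Physical (on-disk) size derived from the `st_blocks` count.
/// With `st_blocks == 0` this returns 0 instead of falling back to the
/// logical length, so cloud-only files no longer count as local usage.
pub fn physical_size(st_blocks: u64) -> u64 {
    st_blocks.saturating_mul(BLOCK_UNIT)
}

/// Size to display in "smart" mode (hypothetical function name).
/// Normally the smaller of the two sizes; for cloud/dataless files
/// (physical == 0, logical > 0) the logical size is shown instead of 0.
pub fn smart_size(logical: u64, physical: u64) -> u64 {
    if physical == 0 && logical > 0 {
        logical
    } else {
        logical.min(physical)
    }
}
```

Under these assumptions, a dataless 1 GB file contributes nothing to physical totals, yet smart mode still displays its logical size rather than zero.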
apps/desktop/src-tauri/src/indexing/CLAUDE.md (5 additions & 4 deletions)
@@ -12,10 +12,10 @@ Full design: `docs/specs/drive-indexing/plan.md`
-**enrichment.rs** -- `ReadPool` (lock-free thread-local read connections for enrichment and verification), `enrich_entries_with_index()` (called when entries are stored in the listing cache — streaming, watcher update, re-sort — NOT on `get_file_range`; index freshness is handled by `index-dir-updated` → `refreshIndexSizes` → `getDirStatsBatch`). Integer-keyed fast path: resolve parent dir once → batch-fetch child dir stats by ID → match by name. Falls back to individual path resolution for edge cases.
-**event_loop.rs** -- `run_live_event_loop` (real-time FSEvents/inotify processing after scan completes), `run_replay_event_loop` (cold-start journal replay with two-phase approach), `run_background_verification` (post-replay bidirectional readdir diff), `merge_fs_events` (deduplication with flag priority), `process_live_batch`. All bounded-buffer constants live here.
-**events.rs** -- Tauri event payload structs (`IndexScanStartedEvent`, `IndexScanProgressEvent`, `IndexScanCompleteEvent`, `IndexDirUpdatedEvent`, `IndexReplayProgressEvent`, `IndexReplayCompleteEvent`), `RescanReason` enum, `emit_rescan_notification()`, IPC response types (`IndexStatusResponse`, `IndexDebugStatusResponse`). Also: `ActivityPhase` enum (Replaying/Scanning/Aggregating/Reconciling/Live/Idle) and `PhaseRecord` for the phase timeline system tracked in `DebugStats`.
-
-**store.rs** -- SQLite schema v7 (integer-keyed entries with `name_folded` column on macOS, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. `resolve_component` uses the composite index directly: on macOS queries by `(parent_id, name_folded)`, on Linux/Windows by `(parent_id, name)`. Schema version check: mismatch triggers drop+rebuild. v7 adds dual sizes (logical + physical). Both path-keyed (backward compat) and integer-keyed APIs.
+
-**store.rs** -- SQLite schema v8 (integer-keyed entries with `name_folded` column on macOS, `inode` column for hardlink dedup, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. `resolve_component` uses the composite index directly: on macOS queries by `(parent_id, name_folded)`, on Linux/Windows by `(parent_id, name)`. Schema version check: mismatch triggers drop+rebuild. v7 added dual sizes (logical + physical). v8 adds `inode INTEGER` column and `idx_inode` index for hardlink dedup at write time. `has_sized_entry_for_inode()` checks if another entry with the same inode already has non-NULL sizes. Both path-keyed (backward compat) and integer-keyed APIs.
-**memory_watchdog.rs** -- Background task monitoring resident memory via `mach_task_info` (macOS). Warns at 8 GB, stops indexing at 16 GB, emits `index-memory-warning` event to frontend. No-op stub on non-macOS. Started from `start_indexing()`.
- **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). `WRITER_GENERATION: AtomicU64` (initialized to 1) bumped on every mutation (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `TruncateData`) for search index staleness detection. Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` auto-propagates deltas on both insert and update: on insert, propagates the full size (+file_count or +dir_count); on update, reads the old entry first and propagates only the size difference. This means callers never need a separate `PropagateDeltaById` for upserted entries. For new directories, also initializes a zero-valued `dir_stats` row so enrichment always has a row. Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + an `entries_inserted` counter), cleared on `TruncateData`. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). 
Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle). No index drop/recreate dance — the composite indexes (`idx_parent_name_folded` on macOS, `idx_parent_name` on Linux) use binary collation and stay present during scans.
-
-**scanner.rs** -- jwalk-based parallel directory walker. `scan_volume()` for full scan, `scan_subtree()` for targeted subtree rescans (used by post-replay background verification). Uses `ScanContext` (from store.rs) to assign integer IDs and parent IDs during the walk: maintains a `HashMap<PathBuf, i64>` mapping directory paths to assigned IDs. The scan root is mapped to `ROOT_ID` (1). Sends `InsertEntriesV2(Vec<EntryRow>)` batches to the writer. Platform-specific exclusion filters via `should_exclude` (`pub(super)`) — the single exclusion gate for all code paths (scanner, reconciler, event_loop verification, per-navigation verifier). `default_exclusions()` is `#[cfg(test)]` only. Physical sizes (`st_blocks * 512`). Hardlink inode dedup: files with `nlink > 1` are tracked in a `HashSet<u64>` by inode; only the first link's size is counted, subsequent links get `size = None`. Files with `nlink == 1` (vast majority) skip the set entirely.
+
- **scanner.rs** -- jwalk-based parallel directory walker. `scan_volume()` for full scan, `scan_subtree()` for targeted subtree rescans (used by post-replay background verification). Uses `ScanContext` (from store.rs) to assign integer IDs and parent IDs during the walk: maintains a `HashMap<PathBuf, i64>` mapping directory paths to assigned IDs. The scan root is mapped to `ROOT_ID` (1). Sends `InsertEntriesV2(Vec<EntryRow>)` batches to the writer. Platform-specific exclusion filters via `should_exclude` (`pub(super)`) — the single exclusion gate for all code paths (scanner, reconciler, event_loop verification, per-navigation verifier). `default_exclusions()` is `#[cfg(test)]` only. Physical sizes (`st_blocks * 512`). Hardlink inode dedup: files with `nlink > 1` are tracked in a `HashSet<u64>` by inode; only the first link's size is counted, subsequent links get `size = None`. Files with `nlink == 1` (vast majority) skip the set entirely. All files store `inode` in `EntryRow.inode` (from `MetadataExt::ino()` on Unix, `None` on non-Unix). Directories and symlinks get `inode: None`.
- **aggregator.rs** -- Dir stats computation. Bottom-up after full scan (O(N) single pass), per-subtree after subtree rescans, incremental delta propagation up ancestor chain for watcher events. Two entry points for full aggregation: `compute_all_aggregates_reported` (loads maps from SQL) and `compute_all_aggregates_with_maps` (accepts pre-built maps from the writer). Both accept an `on_progress: &mut dyn FnMut(AggregationProgress)` callback and delegate to `compute_and_write()` for the shared topological sort + bottom-up computation + batch write. Progress is reported at phase transitions and every ~1% during compute/write loops. `AggregationPhase` enum: `SavingEntries` (flushing writer channel), `LoadingDirectories`, `Sorting`, `Computing`, `Writing`. (The former `RebuildingIndex` phase was removed when the composite `idx_parent_name` index with `platform_case` collation was replaced — now uses binary-collation composite indexes that don't need rebuilding.) `backfill_missing_dir_stats` is a catch-up pass that finds directories without `dir_stats` rows and computes their stats bottom-up; triggered after reconciler replay and cold-start replay via `BackfillMissingDirStats` writer message.
-**watcher.rs** -- Drive-level filesystem watcher. macOS: FSEvents via `cmdr-fsevent-stream` with event IDs and `sinceWhen` replay. Linux: `notify` crate (inotify backend) with recursive watching and synthetic event counter. Other platforms: stub. `supports_event_replay()` lets callers branch on whether journal replay is available.
-**reconciler.rs** -- Buffers FSEvents during scan (capped at 500K events; overflow sets `buffer_overflow` flag forcing full rescan), replays after scan completes using event IDs to skip stale events. Processes live events for file creates/removes/modifies using integer-keyed write messages (`UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`). Resolves filesystem paths to entry IDs via `store::resolve_path()` using a read connection passed by callers. Key functions (`process_fs_event`, `emit_dir_updated`) are `pub(super)` so `mod.rs` can call them directly during cold-start replay. `reconcile_subtree()` handles MustScanSubDirs by diffing filesystem vs DB directory-by-directory instead of delete-then-reinsert, making it safe to interrupt at any point.
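The scan-time hardlink dedup described in the scanner.rs entry above can be sketched as follows. This is a simplified illustration under stated assumptions: `FileMeta` and `HardlinkDedup` are hypothetical types standing in for the real metadata and scanner state, and only a single size value is modeled.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for the per-file metadata the
/// scanner reads (inode number, link count, logical length).
pub struct FileMeta {
    pub inode: u64,
    pub nlink: u64,
    pub len: u64,
}

/// Tracks inodes of multiply-linked files already counted in this scan.
#[derive(Default)]
pub struct HardlinkDedup {
    seen: HashSet<u64>,
}

impl HardlinkDedup {
    /// Returns the size to record for this file:
    /// - `Some(len)` when `nlink <= 1` (the set is not touched at all),
    /// - `Some(len)` for the first link seen for an inode,
    /// - `None` for every later link to the same inode.
    pub fn size_for(&mut self, meta: &FileMeta) -> Option<u64> {
        if meta.nlink <= 1 {
            return Some(meta.len);
        }
        // `insert` returns true only if the inode was not yet present.
        if self.seen.insert(meta.inode) {
            Some(meta.len)
        } else {
            None
        }
    }
}
```

Skipping the set for `nlink == 1` keeps the common case allocation-free, which matters when most of several million files have a single link.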
@@ -76,14 +76,14 @@ All writes go through a dedicated `std::thread` via a bounded `sync_channel` (20
Reads happen on separate WAL connections (any thread). A `ReadPool` provides thread-local read connections for enrichment and verification without contending on the `INDEXING` state-machine mutex.
-
-`entries` (id INTEGER PK, parent_id, name COLLATE platform_case, [name_folded on macOS], is_directory, is_symlink, logical_size, physical_size, modified_at). Root sentinel: id=1, parent_id=0, name="".
+
-`entries` (id INTEGER PK, parent_id, name COLLATE platform_case, [name_folded on macOS], is_directory, is_symlink, logical_size, physical_size, modified_at, inode). Root sentinel: id=1, parent_id=0, name="".
-**macOS**: has a `name_folded TEXT NOT NULL` column storing `normalize_for_comparison(name)` (NFD + case fold). Index: `idx_parent_name_folded ON entries (parent_id, name_folded)`.
-**Linux/Windows**: no `name_folded` column. Index: `idx_parent_name ON entries (parent_id, name)`.
- The old `idx_parent(parent_id)` from v5 is removed; the composite indexes replace it.
@@ -98,6 +98,7 @@ History of changes:
-**Schema v5**: Replaced composite `UNIQUE INDEX idx_parent_name(parent_id, name)` with simple `INDEX idx_parent(parent_id)`. The composite index with `platform_case` collation was extremely slow to build (~25 min for 5.1M entries). A simple integer index needs no drop/recreate dance during scans.
-**Schema v6**: Added `name_folded` column (macOS only) storing pre-computed `normalize_for_comparison(name)`. Replaced `idx_parent` with platform-conditional composite indexes: `idx_parent_name_folded(parent_id, name_folded)` on macOS, `idx_parent_name(parent_id, name)` on Linux/Windows. `resolve_component` now queries the index directly instead of fetching all children and matching in Rust.
-**Schema v7**: Dual sizes. `entries.size` renamed to `entries.logical_size`, added `entries.physical_size`. `dir_stats.recursive_size` renamed to `dir_stats.recursive_logical_size`, added `dir_stats.recursive_physical_size`. Logical size = `meta.len()`, physical size = `st_blocks * 512` on Unix (both = `meta.len()` on non-Unix). The IPC boundary (`DirStats` struct) still exposes `recursive_size` mapped from `recursive_logical_size` to avoid frontend churn. `AccumulatorMaps.direct_stats` changed to 4-tuple `(logical_size_sum, physical_size_sum, file_count, dir_count)`.
+
-**Schema v8**: Added `inode INTEGER` column to `entries` (after `modified_at`) for hardlink dedup. Added `idx_inode ON entries (inode)` index. `EntryRow` gains `inode: Option<u64>`. The scanner populates inode from `MetadataExt::ino()` for all files on Unix; dirs/symlinks get `None`. `has_sized_entry_for_inode()` enables the writer to check at upsert time whether another entry for the same inode already has non-NULL sizes, preventing overcounting when reconciler/verifier events overwrite the scanner's NULL-size dedup.
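The write-time rule in the v8 entry above ("at most one entry per inode has sizes") can be modeled in memory as below. This is a sketch only: the real check runs against SQLite via `has_sized_entry_for_inode()`, whose actual signature is not shown in this diff. The `Entry` and `Table` types, the `exclude_id` parameter, and the `upsert` behavior are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical, reduced row: only the columns relevant to the dedup rule.
#[derive(Clone, Debug, PartialEq)]
pub struct Entry {
    pub inode: Option<u64>,
    pub logical_size: Option<u64>,
    pub physical_size: Option<u64>,
}

/// In-memory stand-in for the `entries` table, keyed by entry id.
#[derive(Default)]
pub struct Table {
    rows: HashMap<i64, Entry>,
}

impl Table {
    /// True if some entry other than `exclude_id` shares `inode` and
    /// already carries a non-NULL size. (Assumed signature.)
    pub fn has_sized_entry_for_inode(&self, inode: u64, exclude_id: i64) -> bool {
        self.rows.iter().any(|(id, e)| {
            *id != exclude_id
                && e.inode == Some(inode)
                && (e.logical_size.is_some() || e.physical_size.is_some())
        })
    }

    /// Upsert that clears the sizes when another link to the same inode
    /// is already sized, so a later event cannot reintroduce the overcount.
    pub fn upsert(&mut self, id: i64, mut entry: Entry) {
        if let Some(inode) = entry.inode {
            if self.has_sized_entry_for_inode(inode, id) {
                entry.logical_size = None;
                entry.physical_size = None;
            }
        }
        self.rows.insert(id, entry);
    }

    /// Sum of all non-NULL logical sizes.
    pub fn total_logical(&self) -> u64 {
        self.rows.values().filter_map(|e| e.logical_size).sum()
    }
}
```

Excluding the entry's own id from the check lets an update to the one sized link keep its sizes, while any other link to the same inode is stored with NULL sizes.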