Commit fe5eff7

Indexing: Fix size overcounting

- Hardlink dedup: add `inode` column (schema v8) so the writer can enforce "at most one entry per inode has sizes" at upsert time. Previously, the reconciler/verifier overwrote the scanner's NULL-size dedup, inflating dirs like `target/debug/` by ~2.5x.
- Cloud-only files: change `blocks=0` fallback from `meta.len()` to `0`. File Provider files (Google Drive, iCloud) have `st_blocks=0` but non-zero `len()`; the old fallback treated their full cloud size as local disk usage (~551 GB phantom data).
- Smart size mode: when physical=0 and logical>0 (cloud/dataless files), show logical instead of `min(logical, 0)=0`.
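
The two size rules in the commit message can be sketched in a few lines. This is a minimal illustration only: the function names `physical_size_from_blocks` and `smart_size` are hypothetical and are not taken from the commit, and the rule for the non-zero case (`min(logical, physical)`) is inferred from the message's `min(logical, 0)=0` remark.

```rust
/// Physical (on-disk) size derived from a Unix `st_blocks` count.
/// `st_blocks` is counted in 512-byte units. Zero blocks means nothing is
/// stored locally, so the result is 0 rather than a fallback to `meta.len()`.
/// (Hypothetical helper name, for illustration only.)
fn physical_size_from_blocks(blocks: u64) -> u64 {
    blocks.saturating_mul(512)
}

/// "Smart" display size, as described in the commit message:
/// normally the smaller of the two sizes, but for cloud/dataless files
/// (physical == 0, logical > 0) the logical size is shown instead of 0.
/// (Hypothetical helper name, for illustration only.)
fn smart_size(logical: u64, physical: u64) -> u64 {
    if physical == 0 && logical > 0 {
        logical
    } else {
        logical.min(physical)
    }
}
```

With these definitions, a cloud-only file with `st_blocks = 0` contributes 0 bytes to physical totals while still displaying its logical size in smart mode.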
1 parent b4f740a commit fe5eff7

13 files changed

Lines changed: 1045 additions & 92 deletions

apps/desktop/src-tauri/src/indexing/CLAUDE.md

Lines changed: 5 additions & 4 deletions
@@ -12,10 +12,10 @@ Full design: `docs/specs/drive-indexing/plan.md`
 - **enrichment.rs** -- `ReadPool` (lock-free thread-local read connections for enrichment and verification), `enrich_entries_with_index()` (called when entries are stored in the listing cache — streaming, watcher update, re-sort — NOT on `get_file_range`; index freshness is handled by `index-dir-updated` → `refreshIndexSizes` → `getDirStatsBatch`). Integer-keyed fast path: resolve parent dir once → batch-fetch child dir stats by ID → match by name. Falls back to individual path resolution for edge cases.
 - **event_loop.rs** -- `run_live_event_loop` (real-time FSEvents/inotify processing after scan completes), `run_replay_event_loop` (cold-start journal replay with two-phase approach), `run_background_verification` (post-replay bidirectional readdir diff), `merge_fs_events` (deduplication with flag priority), `process_live_batch`. All bounded-buffer constants live here.
 - **events.rs** -- Tauri event payload structs (`IndexScanStartedEvent`, `IndexScanProgressEvent`, `IndexScanCompleteEvent`, `IndexDirUpdatedEvent`, `IndexReplayProgressEvent`, `IndexReplayCompleteEvent`), `RescanReason` enum, `emit_rescan_notification()`, IPC response types (`IndexStatusResponse`, `IndexDebugStatusResponse`). Also: `ActivityPhase` enum (Replaying/Scanning/Aggregating/Reconciling/Live/Idle) and `PhaseRecord` for the phase timeline system tracked in `DebugStats`.
-- **store.rs** -- SQLite schema v7 (integer-keyed entries with `name_folded` column on macOS, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. `resolve_component` uses the composite index directly: on macOS queries by `(parent_id, name_folded)`, on Linux/Windows by `(parent_id, name)`. Schema version check: mismatch triggers drop+rebuild. v7 adds dual sizes (logical + physical). Both path-keyed (backward compat) and integer-keyed APIs.
+- **store.rs** -- SQLite schema v8 (integer-keyed entries with `name_folded` column on macOS, `inode` column for hardlink dedup, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. `resolve_component` uses the composite index directly: on macOS queries by `(parent_id, name_folded)`, on Linux/Windows by `(parent_id, name)`. Schema version check: mismatch triggers drop+rebuild. v7 added dual sizes (logical + physical). v8 adds `inode INTEGER` column and `idx_inode` index for hardlink dedup at write time. `has_sized_entry_for_inode()` checks if another entry with the same inode already has non-NULL sizes. Both path-keyed (backward compat) and integer-keyed APIs.
 - **memory_watchdog.rs** -- Background task monitoring resident memory via `mach_task_info` (macOS). Warns at 8 GB, stops indexing at 16 GB, emits `index-memory-warning` event to frontend. No-op stub on non-macOS. Started from `start_indexing()`.
 - **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). `WRITER_GENERATION: AtomicU64` (initialized to 1) bumped on every mutation (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `TruncateData`) for search index staleness detection. Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` auto-propagates deltas on both insert and update: on insert, propagates the full size (+file_count or +dir_count); on update, reads the old entry first and propagates only the size difference. This means callers never need a separate `PropagateDeltaById` for upserted entries. For new directories, also initializes a zero-valued `dir_stats` row so enrichment always has a row. Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + an `entries_inserted` counter), cleared on `TruncateData`. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle). No index drop/recreate dance — the composite indexes (`idx_parent_name_folded` on macOS, `idx_parent_name` on Linux) use binary collation and stay present during scans.
-- **scanner.rs** -- jwalk-based parallel directory walker. `scan_volume()` for full scan, `scan_subtree()` for targeted subtree rescans (used by post-replay background verification). Uses `ScanContext` (from store.rs) to assign integer IDs and parent IDs during the walk: maintains a `HashMap<PathBuf, i64>` mapping directory paths to assigned IDs. The scan root is mapped to `ROOT_ID` (1). Sends `InsertEntriesV2(Vec<EntryRow>)` batches to the writer. Platform-specific exclusion filters via `should_exclude` (`pub(super)`) — the single exclusion gate for all code paths (scanner, reconciler, event_loop verification, per-navigation verifier). `default_exclusions()` is `#[cfg(test)]` only. Physical sizes (`st_blocks * 512`). Hardlink inode dedup: files with `nlink > 1` are tracked in a `HashSet<u64>` by inode; only the first link's size is counted, subsequent links get `size = None`. Files with `nlink == 1` (vast majority) skip the set entirely.
+- **scanner.rs** -- jwalk-based parallel directory walker. `scan_volume()` for full scan, `scan_subtree()` for targeted subtree rescans (used by post-replay background verification). Uses `ScanContext` (from store.rs) to assign integer IDs and parent IDs during the walk: maintains a `HashMap<PathBuf, i64>` mapping directory paths to assigned IDs. The scan root is mapped to `ROOT_ID` (1). Sends `InsertEntriesV2(Vec<EntryRow>)` batches to the writer. Platform-specific exclusion filters via `should_exclude` (`pub(super)`) — the single exclusion gate for all code paths (scanner, reconciler, event_loop verification, per-navigation verifier). `default_exclusions()` is `#[cfg(test)]` only. Physical sizes (`st_blocks * 512`). Hardlink inode dedup: files with `nlink > 1` are tracked in a `HashSet<u64>` by inode; only the first link's size is counted, subsequent links get `size = None`. Files with `nlink == 1` (vast majority) skip the set entirely. All files store `inode` in `EntryRow.inode` (from `MetadataExt::ino()` on Unix, `None` on non-Unix). Directories and symlinks get `inode: None`.
 - **aggregator.rs** -- Dir stats computation. Bottom-up after full scan (O(N) single pass), per-subtree after subtree rescans, incremental delta propagation up ancestor chain for watcher events. Two entry points for full aggregation: `compute_all_aggregates_reported` (loads maps from SQL) and `compute_all_aggregates_with_maps` (accepts pre-built maps from the writer). Both accept an `on_progress: &mut dyn FnMut(AggregationProgress)` callback and delegate to `compute_and_write()` for the shared topological sort + bottom-up computation + batch write. Progress is reported at phase transitions and every ~1% during compute/write loops. `AggregationPhase` enum: `SavingEntries` (flushing writer channel), `LoadingDirectories`, `Sorting`, `Computing`, `Writing`. (The former `RebuildingIndex` phase was removed when the composite `idx_parent_name` index with `platform_case` collation was replaced — now uses binary-collation composite indexes that don't need rebuilding.) `backfill_missing_dir_stats` is a catch-up pass that finds directories without `dir_stats` rows and computes their stats bottom-up; triggered after reconciler replay and cold-start replay via `BackfillMissingDirStats` writer message.
 - **watcher.rs** -- Drive-level filesystem watcher. macOS: FSEvents via `cmdr-fsevent-stream` with event IDs and `sinceWhen` replay. Linux: `notify` crate (inotify backend) with recursive watching and synthetic event counter. Other platforms: stub. `supports_event_replay()` lets callers branch on whether journal replay is available.
 - **reconciler.rs** -- Buffers FSEvents during scan (capped at 500K events; overflow sets `buffer_overflow` flag forcing full rescan), replays after scan completes using event IDs to skip stale events. Processes live events for file creates/removes/modifies using integer-keyed write messages (`UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`). Resolves filesystem paths to entry IDs via `store::resolve_path()` using a read connection passed by callers. Key functions (`process_fs_event`, `emit_dir_updated`) are `pub(super)` so `mod.rs` can call them directly during cold-start replay. `reconcile_subtree()` handles MustScanSubDirs by diffing filesystem vs DB directory-by-directory instead of delete-then-reinsert, making it safe to interrupt at any point.
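
The scanner-side hardlink dedup described in the **scanner.rs** entry above can be sketched as follows. This is an illustrative reduction, not the project's code: the function name `dedup_hardlink_size` and its signature are assumptions, and the real scanner tracks more state than a single size.

```rust
use std::collections::HashSet;

/// Returns the size to record for a file, or `None` when the file is an
/// additional hardlink to an inode whose size was already counted.
/// (Hypothetical helper; name and signature are assumptions.)
fn dedup_hardlink_size(
    seen: &mut HashSet<u64>,
    inode: u64,
    nlink: u64,
    size: u64,
) -> Option<u64> {
    // Fast path: a file with a single link can never be double counted,
    // so the set is not touched at all.
    if nlink <= 1 {
        return Some(size);
    }
    // `HashSet::insert` returns true only the first time a value is added,
    // so only the first link seen for this inode keeps its size.
    if seen.insert(inode) {
        Some(size)
    } else {
        None
    }
}
```

The set only ever holds inodes with more than one link, which keeps its memory footprint small on typical volumes.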
@@ -76,14 +76,14 @@ All writes go through a dedicated `std::thread` via a bounded `sync_channel` (20

 Reads happen on separate WAL connections (any thread). A `ReadPool` provides thread-local read connections for enrichment and verification without contending on the `INDEXING` state-machine mutex.

-### SQLite schema (v7: integer-keyed, platform-conditional composite index)
+### SQLite schema (v8: integer-keyed, platform-conditional composite index, inode for hardlink dedup)

 One DB per volume. **Dev and prod use separate directories** (see AGENTS.md § Debugging):
 - **Prod**: `~/Library/Application Support/com.veszelovszki.cmdr/index-{volume_id}.db`
 - **Dev**: `~/Library/Application Support/com.veszelovszki.cmdr-dev/index-{volume_id}.db`

 Three tables:
-- `entries` (id INTEGER PK, parent_id, name COLLATE platform_case, [name_folded on macOS], is_directory, is_symlink, logical_size, physical_size, modified_at). Root sentinel: id=1, parent_id=0, name="".
+- `entries` (id INTEGER PK, parent_id, name COLLATE platform_case, [name_folded on macOS], is_directory, is_symlink, logical_size, physical_size, modified_at, inode). Root sentinel: id=1, parent_id=0, name="".
 - **macOS**: has a `name_folded TEXT NOT NULL` column storing `normalize_for_comparison(name)` (NFD + case fold). Index: `idx_parent_name_folded ON entries (parent_id, name_folded)`.
 - **Linux/Windows**: no `name_folded` column. Index: `idx_parent_name ON entries (parent_id, name)`.
 - The old `idx_parent(parent_id)` from v5 is removed; the composite indexes replace it.
@@ -98,6 +98,7 @@ History of changes:
 - **Schema v5**: Replaced composite `UNIQUE INDEX idx_parent_name(parent_id, name)` with simple `INDEX idx_parent(parent_id)`. The composite index with `platform_case` collation was extremely slow to build (~25 min for 5.1M entries). A simple integer index needs no drop/recreate dance during scans.
 - **Schema v6**: Added `name_folded` column (macOS only) storing pre-computed `normalize_for_comparison(name)`. Replaced `idx_parent` with platform-conditional composite indexes: `idx_parent_name_folded(parent_id, name_folded)` on macOS, `idx_parent_name(parent_id, name)` on Linux/Windows. `resolve_component` now queries the index directly instead of fetching all children and matching in Rust.
 - **Schema v7**: Dual sizes. `entries.size` renamed to `entries.logical_size`, added `entries.physical_size`. `dir_stats.recursive_size` renamed to `dir_stats.recursive_logical_size`, added `dir_stats.recursive_physical_size`. Logical size = `meta.len()`, physical size = `st_blocks * 512` on Unix (both = `meta.len()` on non-Unix). The IPC boundary (`DirStats` struct) still exposes `recursive_size` mapped from `recursive_logical_size` to avoid frontend churn. `AccumulatorMaps.direct_stats` changed to 4-tuple `(logical_size_sum, physical_size_sum, file_count, dir_count)`.
+- **Schema v8**: Added `inode INTEGER` column to `entries` (after `modified_at`) for hardlink dedup. Added `idx_inode ON entries (inode)` index. `EntryRow` gains `inode: Option<u64>`. The scanner populates inode from `MetadataExt::ino()` for all files on Unix; dirs/symlinks get `None`. `has_sized_entry_for_inode()` enables the writer to check at upsert time whether another entry for the same inode already has non-NULL sizes, preventing overcounting when reconciler/verifier events overwrite the scanner's NULL-size dedup.

 ## How to test

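The write-time invariant from the Schema v8 entry ("at most one entry per inode has sizes") can be modelled in memory as below. This is a sketch under stated assumptions: the real `has_sized_entry_for_inode()` is a SQLite query whose signature is not shown in this diff, so the `Entry` struct, the `exclude_id` parameter, and the `upsert` helper here are all hypothetical stand-ins.

```rust
/// Simplified stand-in for a row of the `entries` table (assumption).
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    id: i64,
    inode: Option<u64>,
    logical_size: Option<u64>,
    physical_size: Option<u64>,
}

/// In-memory analogue of the store's inode check: is there another entry
/// (different id) with the same inode that already carries sizes?
fn has_sized_entry_for_inode(entries: &[Entry], inode: u64, exclude_id: i64) -> bool {
    entries
        .iter()
        .any(|e| e.id != exclude_id && e.inode == Some(inode) && e.logical_size.is_some())
}

/// Upsert that keeps the invariant: if another link to the same inode is
/// already sized, the incoming entry is stored with NULL (None) sizes.
fn upsert(entries: &mut Vec<Entry>, mut incoming: Entry) {
    if let Some(inode) = incoming.inode {
        if has_sized_entry_for_inode(entries, inode, incoming.id) {
            incoming.logical_size = None;
            incoming.physical_size = None;
        }
    }
    match entries.iter_mut().find(|e| e.id == incoming.id) {
        Some(existing) => *existing = incoming, // update in place
        None => entries.push(incoming),         // first time this id is seen
    }
}
```

Because the check runs inside the upsert, a later reconciler or verifier event that re-reports a hardlinked file with its full size cannot reintroduce the double count.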
apps/desktop/src-tauri/src/indexing/aggregator.rs

Lines changed: 2 additions & 0 deletions
@@ -643,6 +643,7 @@ mod tests {
             logical_size: None,
             physical_size: None,
             modified_at: None,
+            inode: None,
         }
     }

@@ -656,6 +657,7 @@ mod tests {
             logical_size: Some(size),
             physical_size: Some(size),
             modified_at: None,
+            inode: None,
         }
     }

apps/desktop/src-tauri/src/indexing/event_loop.rs

Lines changed: 10 additions & 0 deletions
@@ -997,6 +997,14 @@ fn verify_affected_dirs(affected_paths: &HashSet<String>, writer: &IndexWriter)
                 reconciler::entry_size_and_mtime(&metadata)
             };

+            #[cfg(unix)]
+            let (inode, nlink) = {
+                use std::os::unix::fs::MetadataExt;
+                (Some(metadata.ino()), Some(metadata.nlink()))
+            };
+            #[cfg(not(unix))]
+            let (inode, nlink) = (None, None);
+
             let _ = writer.send(WriteMessage::UpsertEntryV2 {
                 parent_id: *parent_id,
                 name,
@@ -1005,6 +1013,8 @@ fn verify_affected_dirs(affected_paths: &HashSet<String>, writer: &IndexWriter)
                 logical_size,
                 physical_size,
                 modified_at,
+                inode,
+                nlink,
             });

             // UpsertEntryV2 auto-propagates deltas in the writer.
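
The auto-propagation mentioned in the comment above (full size on insert, only the difference on update, applied up the `parent_id` chain) might look roughly like this. It is a simplified in-memory sketch: the `DirStats` fields, the `parents` map (which omits the root so the walk terminates there), and both function names are assumptions rather than the writer's actual API.

```rust
use std::collections::HashMap;

/// Reduced stand-in for a `dir_stats` row (assumption: one size, one count).
#[derive(Default, Debug, Clone, Copy, PartialEq)]
struct DirStats {
    recursive_size: i64,
    file_count: i64,
}

/// Applies a delta to `start_dir` and every ancestor. `parents` maps a
/// directory id to its parent id and has no entry for the root, which is
/// where the walk stops. Assumes the map describes a tree (no cycles).
fn propagate_delta(
    parents: &HashMap<i64, i64>,
    stats: &mut HashMap<i64, DirStats>,
    start_dir: i64,
    size_delta: i64,
    file_delta: i64,
) {
    let mut current = Some(start_dir);
    while let Some(id) = current {
        let row = stats.entry(id).or_default();
        row.recursive_size += size_delta;
        row.file_count += file_delta;
        current = parents.get(&id).copied();
    }
}

/// Upsert of a file under `parent_id`: an insert (`old_size == None`)
/// propagates the full size and +1 file; an update propagates only the
/// size difference and leaves the file count unchanged.
fn upsert_file(
    parents: &HashMap<i64, i64>,
    stats: &mut HashMap<i64, DirStats>,
    parent_id: i64,
    old_size: Option<i64>,
    new_size: i64,
) {
    match old_size {
        None => propagate_delta(parents, stats, parent_id, new_size, 1),
        Some(old) => propagate_delta(parents, stats, parent_id, new_size - old, 0),
    }
}
```

Folding propagation into the upsert is what lets callers such as `verify_affected_dirs` send a single message without a follow-up `PropagateDeltaById`.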

apps/desktop/src-tauri/src/indexing/mod.rs

Lines changed: 22 additions & 1 deletion
@@ -1403,6 +1403,7 @@ mod tests {
                 logical_size: None,
                 physical_size: None,
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 3,
@@ -1413,6 +1414,7 @@ mod tests {
                 logical_size: None,
                 physical_size: None,
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 4,
@@ -1423,6 +1425,7 @@ mod tests {
                 logical_size: Some(100),
                 physical_size: Some(100),
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 5,
@@ -1433,6 +1436,7 @@ mod tests {
                 logical_size: Some(200),
                 physical_size: Some(200),
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 6,
@@ -1443,6 +1447,7 @@ mod tests {
                 logical_size: None,
                 physical_size: None,
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 7,
@@ -1453,6 +1458,7 @@ mod tests {
                 logical_size: Some(300),
                 physical_size: Some(300),
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 8,
@@ -1463,6 +1469,7 @@ mod tests {
                 logical_size: Some(50),
                 physical_size: Some(50),
                 modified_at: None,
+                inode: None,
             },
         ];
         IndexStore::insert_entries_v2_batch(&conn, &entries).expect("insert entries");
@@ -1538,6 +1545,7 @@ mod tests {
                 logical_size: None,
                 physical_size: None,
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 3,
@@ -1548,6 +1556,7 @@ mod tests {
                 logical_size: Some(500),
                 physical_size: Some(500),
                 modified_at: None,
+                inode: None,
             },
         ];
         IndexStore::insert_entries_v2_batch(&conn, &entries).expect("insert");
@@ -1596,6 +1605,7 @@ mod tests {
                 logical_size: None,
                 physical_size: None,
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 3,
@@ -1606,6 +1616,7 @@ mod tests {
                 logical_size: None,
                 physical_size: None,
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 4,
@@ -1616,6 +1627,7 @@ mod tests {
                 logical_size: Some(10),
                 physical_size: Some(10),
                 modified_at: None,
+                inode: None,
             },
         ];
         IndexStore::insert_entries_v2_batch(&conn, &entries).expect("insert");
@@ -1644,6 +1656,7 @@ mod tests {
                 logical_size: None,
                 physical_size: None,
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 3,
@@ -1654,6 +1667,7 @@ mod tests {
                 logical_size: None,
                 physical_size: None,
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 4,
@@ -1664,6 +1678,7 @@ mod tests {
                 logical_size: Some(1000),
                 physical_size: Some(1000),
                 modified_at: None,
+                inode: None,
             },
         ];
         IndexStore::insert_entries_v2_batch(&conn, &entries).expect("insert");
@@ -1684,7 +1699,7 @@ mod tests {
         assert_eq!(listing[0].recursive_dir_count, Some(0));

         // Phase 3: Simulate a watcher event (new file added via reconciler)
-        IndexStore::insert_entry_v2(&conn, 3, "notes.txt", false, false, Some(500), Some(500), None)
+        IndexStore::insert_entry_v2(&conn, 3, "notes.txt", false, false, Some(500), Some(500), None, None)
             .expect("insert new file");

         // Simulate delta propagation (as the writer would do)
@@ -1736,6 +1751,7 @@ mod tests {
                 logical_size: None,
                 physical_size: None,
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 3,
@@ -1746,6 +1762,7 @@ mod tests {
                 logical_size: Some(5000),
                 physical_size: Some(5000),
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 4,
@@ -1756,6 +1773,7 @@ mod tests {
                 logical_size: None,
                 physical_size: None,
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 5,
@@ -1766,6 +1784,7 @@ mod tests {
                 logical_size: None,
                 physical_size: None,
                 modified_at: None,
+                inode: None,
             },
         ];
         IndexStore::insert_entries_v2_batch(&conn, &entries).expect("insert");
@@ -1806,6 +1825,7 @@ mod tests {
                 logical_size: None,
                 physical_size: None,
                 modified_at: None,
+                inode: None,
             },
             EntryRow {
                 id: 3,
@@ -1816,6 +1836,7 @@ mod tests {
                 logical_size: Some(42),
                 physical_size: Some(42),
                 modified_at: None,
+                inode: None,
             },
         ];
         IndexStore::insert_entries_v2_batch(&conn, &entries).expect("insert");
