Indexing: Deduplicate hardlinked file sizes

vdavid · vdavid · commit 1d588f841ced · 2026-03-20T22:48:25.000+01:00
- Track seen inodes during scan via `HashSet&lt;u64&gt;`
- Files with `nlink &gt; 1` that were already counted get their sizes zeroed out
- `nlink == 1` fast path skips the HashSet entirely (99%+ of files)
- Fixes overcounting that caused on-disk sizes to exceed actual disk capacity
diff --git a/apps/desktop/src-tauri/src/indexing/CLAUDE.md b/apps/desktop/src-tauri/src/indexing/CLAUDE.md
@@ -15,7 +15,7 @@ Full design: `docs/specs/drive-indexing/plan.md`
 - **store.rs** -- SQLite schema v7 (integer-keyed entries with `name_folded` column on macOS, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. `resolve_component` uses the composite index directly: on macOS queries by `(parent_id, name_folded)`, on Linux/Windows by `(parent_id, name)`. Schema version check: mismatch triggers drop+rebuild. v7 adds dual sizes (logical + physical). Both path-keyed (backward compat) and integer-keyed APIs.
 - **memory_watchdog.rs** -- Background task monitoring resident memory via `mach_task_info` (macOS). Warns at 8 GB, stops indexing at 16 GB, emits `index-memory-warning` event to frontend. No-op stub on non-macOS. Started from `start_indexing()`.
 - **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). `WRITER_GENERATION: AtomicU64` (initialized to 1) bumped on every mutation (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `TruncateData`) for search index staleness detection. Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` auto-propagates deltas on both insert and update: on insert, propagates the full size (+file_count or +dir_count); on update, reads the old entry first and propagates only the size difference. This means callers never need a separate `PropagateDeltaById` for upserted entries. For new directories, also initializes a zero-valued `dir_stats` row so enrichment always has a row. Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + an `entries_inserted` counter), cleared on `TruncateData`. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle). No index drop/recreate dance — the composite indexes (`idx_parent_name_folded` on macOS, `idx_parent_name` on Linux) use binary collation and stay present during scans.
-- **scanner.rs** -- jwalk-based parallel directory walker. `scan_volume()` for full scan, `scan_subtree()` for targeted subtree rescans (used by post-replay background verification). Uses `ScanContext` (from store.rs) to assign integer IDs and parent IDs during the walk: maintains a `HashMap<PathBuf, i64>` mapping directory paths to assigned IDs. The scan root is mapped to `ROOT_ID` (1). Sends `InsertEntriesV2(Vec<EntryRow>)` batches to the writer. Platform-specific exclusion filters via `should_exclude` (`pub(super)`) — the single exclusion gate for all code paths (scanner, reconciler, event_loop verification, per-navigation verifier). `default_exclusions()` is `#[cfg(test)]` only. Physical sizes (`st_blocks * 512`).
+- **scanner.rs** -- jwalk-based parallel directory walker. `scan_volume()` for full scan, `scan_subtree()` for targeted subtree rescans (used by post-replay background verification). Uses `ScanContext` (from store.rs) to assign integer IDs and parent IDs during the walk: maintains a `HashMap<PathBuf, i64>` mapping directory paths to assigned IDs. The scan root is mapped to `ROOT_ID` (1). Sends `InsertEntriesV2(Vec<EntryRow>)` batches to the writer. Platform-specific exclusion filters via `should_exclude` (`pub(super)`) — the single exclusion gate for all code paths (scanner, reconciler, event_loop verification, per-navigation verifier). `default_exclusions()` is `#[cfg(test)]` only. Physical sizes (`st_blocks * 512`). Hardlink inode dedup: files with `nlink > 1` are tracked in a `HashSet<u64>` by inode; only the first link's size is counted, subsequent links get `size = None`. Files with `nlink == 1` (vast majority) skip the set entirely.
 - **aggregator.rs** -- Dir stats computation. Bottom-up after full scan (O(N) single pass), per-subtree after subtree rescans, incremental delta propagation up ancestor chain for watcher events. Two entry points for full aggregation: `compute_all_aggregates_reported` (loads maps from SQL) and `compute_all_aggregates_with_maps` (accepts pre-built maps from the writer). Both accept an `on_progress: &mut dyn FnMut(AggregationProgress)` callback and delegate to `compute_and_write()` for the shared topological sort + bottom-up computation + batch write. Progress is reported at phase transitions and every ~1% during compute/write loops. `AggregationPhase` enum: `SavingEntries` (flushing writer channel), `LoadingDirectories`, `Sorting`, `Computing`, `Writing`. (The former `RebuildingIndex` phase was removed when the composite `idx_parent_name` index with `platform_case` collation was replaced — now uses binary-collation composite indexes that don't need rebuilding.) `backfill_missing_dir_stats` is a catch-up pass that finds directories without `dir_stats` rows and computes their stats bottom-up; triggered after reconciler replay and cold-start replay via `BackfillMissingDirStats` writer message.
 - **watcher.rs** -- Drive-level filesystem watcher. macOS: FSEvents via `cmdr-fsevent-stream` with event IDs and `sinceWhen` replay. Linux: `notify` crate (inotify backend) with recursive watching and synthetic event counter. Other platforms: stub. `supports_event_replay()` lets callers branch on whether journal replay is available.
 - **reconciler.rs** -- Buffers FSEvents during scan (capped at 500K events; overflow sets `buffer_overflow` flag forcing full rescan), replays after scan completes using event IDs to skip stale events. Processes live events for file creates/removes/modifies using integer-keyed write messages (`UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`). Resolves filesystem paths to entry IDs via `store::resolve_path()` using a read connection passed by callers. Key functions (`process_fs_event`, `emit_dir_updated`) are `pub(super)` so `mod.rs` can call them directly during cold-start replay. `reconcile_subtree()` handles MustScanSubDirs by diffing filesystem vs DB directory-by-directory instead of delete-then-reinsert, making it safe to interrupt at any point.
@@ -127,6 +127,8 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
 
 **Dual sizes (logical + physical)**: Both `meta.len()` (logical) and `st_blocks * 512` (physical) are stored. Logical size is displayed by default (mapped to `recursive_size` at the IPC boundary). Physical size is stored in DB but not yet exposed to the frontend. Physical sizes may overcount ~10-20% for APFS clones (shared blocks). Volume usage bar uses `statfs()` for true totals.
 
+**Hardlink inode dedup at scan time**: Files with `nlink > 1` are tracked by inode in a `HashSet<u64>` local to `run_scan`. The second+ link for the same inode gets `logical_size = None, physical_size = None`, so aggregation counts each inode's bytes exactly once. The `nlink > 1` check is a fast path: single-link files (the vast majority) skip the HashSet entirely, so there's no overhead for typical workloads. Memory cost is ~8 bytes per unique hardlinked inode. The set lives for one scan and is dropped with the scan's stack frame. Only applies to the scanner (full/subtree scans); the reconciler handles individual live events where cross-event dedup isn't applicable.
+
 **MustScanSubDirs uses reconciliation, not delete-then-reinsert**: `reconcile_subtree()` diffs the filesystem against the DB directory-by-directory, only inserting/deleting/updating entries that changed. This is safe to interrupt at any point (no bulk delete phase that could leave the DB empty). For brand-new directories discovered during reconciliation, a `flush_blocking()` + re-resolve cycle ensures their IDs are available before recursing into them. `scanner::scan_subtree` (which uses destructive `DeleteDescendantsById`) is used by post-replay background verification for newly discovered directories.
 
 **In-memory accumulation eliminates aggregation SQL queries**: During a full scan, the writer thread accumulates two HashMaps in `AccumulatorMaps` as `InsertEntriesV2` batches arrive: `direct_stats` (parent_id -> file size/count/dir count) and `child_dirs` (parent_id -> child dir IDs). When `ComputeAllAggregates` fires, these maps are passed to `compute_all_aggregates_with_maps()`, skipping the two expensive full-table-scan SQL queries (`bulk_get_children_stats_by_id` and `bulk_get_child_dir_ids`) that previously dominated aggregation time (~70%). Maps are cleared on `TruncateData` and after aggregation completes. Falls back to SQL queries if maps are empty.
diff --git a/apps/desktop/src-tauri/src/indexing/scanner.rs b/apps/desktop/src-tauri/src/indexing/scanner.rs
@@ -7,6 +7,7 @@
 //! Scan exclusions (macOS system directories, virtual filesystems) are filtered via
 //! jwalk's `process_read_dir` callback so excluded subtrees are never descended into.
 
+use std::collections::HashSet;
 use std::path::{Path, PathBuf};
 use std::sync::Arc;
 use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
@@ -257,6 +258,9 @@ fn run_scan(
     let mut batch: Vec<EntryRow> = Vec::with_capacity(batch_size);
     let mut total_entries: u64 = 0;
     let mut total_dirs: u64 = 0;
+    // Tracks inodes with nlink > 1 so each hardlinked file's size is counted only once.
+    // Files with nlink == 1 (the vast majority) skip the set entirely.
+    let mut seen_inodes: HashSet<u64> = HashSet::new();
 
     // Initialize the scan context: seed root mapping and get next_id from DB.
     // Volume-root scans need a write connection (to create the root sentinel).
@@ -332,7 +336,13 @@ fn run_scan(
         let (logical_size, physical_size, modified_at) = if is_dir || is_symlink {
             (None, None, entry_modified_at(&path))
         } else {
-            entry_size_and_mtime(&path)
+            let (ls, ps, mtime, ino, nlink) = entry_size_and_mtime(&path);
+            // Deduplicate hardlinks: if nlink > 1, only count each inode's size once
+            if nlink > 1 && !seen_inodes.insert(ino) {
+                (None, None, mtime)
+            } else {
+                (ls, ps, mtime)
+            }
         };
 
         // Look up parent_id from the scan context
@@ -453,9 +463,9 @@ pub(super) fn should_exclude(path_str: &str) -> bool {
     false
 }
 
-/// Get logical size, physical size (st_blocks * 512), and modified time for a file.
+/// Get logical size, physical size (st_blocks * 512), modified time, inode, and nlink for a file.
 #[cfg(unix)]
-fn entry_size_and_mtime(path: &Path) -> (Option<u64>, Option<u64>, Option<u64>) {
+fn entry_size_and_mtime(path: &Path) -> (Option<u64>, Option<u64>, Option<u64>, u64, u64) {
     use std::os::unix::fs::MetadataExt;
     match std::fs::symlink_metadata(path) {
         Ok(meta) => {
@@ -464,14 +474,14 @@ fn entry_size_and_mtime(path: &Path) -> (Option<u64>, Option<u64>, Option<u64>)
             let physical_size = if blocks > 0 { blocks * 512 } else { meta.len() };
             let mtime = meta.mtime();
             let mtime_u64 = if mtime >= 0 { Some(mtime as u64) } else { None };
-            (Some(logical_size), Some(physical_size), mtime_u64)
+            (Some(logical_size), Some(physical_size), mtime_u64, meta.ino(), meta.nlink())
         }
-        Err(_) => (None, None, None),
+        Err(_) => (None, None, None, 0, 1),
     }
 }
 
 #[cfg(not(unix))]
-fn entry_size_and_mtime(path: &Path) -> (Option<u64>, Option<u64>, Option<u64>) {
+fn entry_size_and_mtime(path: &Path) -> (Option<u64>, Option<u64>, Option<u64>, u64, u64) {
     match std::fs::symlink_metadata(path) {
         Ok(meta) => {
             let size = meta.len();
@@ -480,9 +490,9 @@ fn entry_size_and_mtime(path: &Path) -> (Option<u64>, Option<u64>, Option<u64>)
                 .ok()
                 .and_then(|t| t.duration_since(std::time::UNIX_EPOCH).ok())
                 .map(|d| d.as_secs());
-            (Some(size), Some(size), mtime)
+            (Some(size), Some(size), mtime, 0, 1)
         }
-        Err(_) => (None, None, None),
+        Err(_) => (None, None, None, 0, 1),
     }
 }