Commit 1d666a7

Indexing: Store both logical and physical sizes

- Schema v6 → v7: `entries` table now has `logical_size` and `physical_size` columns (was single `size`), `dir_stats` has `recursive_logical_size` and `recursive_physical_size`
- Scanner collects both `meta.len()` (logical) and `st_blocks * 512` (physical) per file
- All propagation paths updated: full aggregation, subtree, delta, backfill, live events, verification
- IPC boundary (`DirStats`) keeps `recursive_size` field name, mapped from `recursive_logical_size` (zero frontend changes)
- Logical size is now the default display, fixing overcounting from hardlinks/APFS clones that affected physical-only sizes

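The per-file collection described in the second bullet can be sketched as follows. This is a minimal illustration, not the scanner's actual code: `FileSizes` and `sizes_from_metadata` are hypothetical names, and the non-Unix fallback follows the commit's note that both sizes equal `meta.len()` there.

```rust
use std::fs::Metadata;

/// Logical and physical size of one file, in bytes (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FileSizes {
    pub logical: u64,
    pub physical: u64,
}

/// Unix: logical size is the byte length, physical size is the allocated
/// block count times 512 (`st_blocks` is always counted in 512-byte units).
#[cfg(unix)]
pub fn sizes_from_metadata(meta: &Metadata) -> FileSizes {
    use std::os::unix::fs::MetadataExt;
    FileSizes {
        logical: meta.len(),
        physical: meta.blocks() * 512,
    }
}

/// Non-Unix: no block count is available here, so both sizes fall back
/// to the byte length.
#[cfg(not(unix))]
pub fn sizes_from_metadata(meta: &Metadata) -> FileSizes {
    FileSizes {
        logical: meta.len(),
        physical: meta.len(),
    }
}
```

A caller would pass the result of `std::fs::metadata(path)` (or `symlink_metadata`, if symlinks should not be followed) and store both fields on the entry row.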
1 parent 538ec5a commit 1d666a7

16 files changed

Lines changed: 767 additions & 432 deletions

apps/desktop/src-tauri/src/commands/search.rs

Lines changed: 8 additions & 5 deletions
@@ -154,10 +154,10 @@ pub async fn search_files(mut query: SearchQuery) -> Result<SearchResult, String
     };
 
     // Resolve include paths to entry IDs via SQLite (microseconds, not 20s)
-    if query.include_paths.as_ref().is_some_and(|p| !p.is_empty()) {
-        if let Some(pool) = get_read_pool() {
-            search::resolve_include_paths(&mut query, &pool);
-        }
+    if query.include_paths.as_ref().is_some_and(|p| !p.is_empty())
+        && let Some(pool) = get_read_pool()
+    {
+        search::resolve_include_paths(&mut query, &pool);
     }
 
     // Run search on a blocking thread (rayon parallel scan)
@@ -758,7 +758,10 @@ pub(crate) async fn call_ai_translate(
             r
         }
         Err(e) => {
-            log::warn!("AI search: chat_completion ({pass_label}) failed after {:.1}s for query={natural_query:?}: {e}", t0.elapsed().as_secs_f64());
+            log::warn!(
+                "AI search: chat_completion ({pass_label}) failed after {:.1}s for query={natural_query:?}: {e}",
+                t0.elapsed().as_secs_f64()
+            );
             return Err(format!("{e}"));
         }
     };

apps/desktop/src-tauri/src/file_system/write_operations/scan.rs

Lines changed: 5 additions & 1 deletion
@@ -296,7 +296,11 @@ impl SourceItemTracker {
         let source_path = top_level_source_path(file_info);
         let count = self.processed.entry(source_path.clone()).or_insert(0);
         *count += 1;
-        if self.totals.get(&source_path) == Some(count) { Some(source_path) } else { None }
+        if self.totals.get(&source_path) == Some(count) {
+            Some(source_path)
+        } else {
+            None
+        }
     }
 }

apps/desktop/src-tauri/src/indexing/CLAUDE.md

Lines changed: 6 additions & 5 deletions
@@ -12,7 +12,7 @@ Full design: `docs/specs/drive-indexing/plan.md`
 - **enrichment.rs** -- `ReadPool` (lock-free thread-local read connections for enrichment and verification), `enrich_entries_with_index()` (called when entries are stored in the listing cache — streaming, watcher update, re-sort — NOT on `get_file_range`; index freshness is handled by `index-dir-updated``refreshIndexSizes``getDirStatsBatch`). Integer-keyed fast path: resolve parent dir once → batch-fetch child dir stats by ID → match by name. Falls back to individual path resolution for edge cases.
 - **event_loop.rs** -- `run_live_event_loop` (real-time FSEvents/inotify processing after scan completes), `run_replay_event_loop` (cold-start journal replay with two-phase approach), `run_background_verification` (post-replay bidirectional readdir diff), `merge_fs_events` (deduplication with flag priority), `process_live_batch`. All bounded-buffer constants live here.
 - **events.rs** -- Tauri event payload structs (`IndexScanStartedEvent`, `IndexScanProgressEvent`, `IndexScanCompleteEvent`, `IndexDirUpdatedEvent`, `IndexReplayProgressEvent`, `IndexReplayCompleteEvent`), `RescanReason` enum, `emit_rescan_notification()`, IPC response types (`IndexStatusResponse`, `IndexDebugStatusResponse`). Also: `ActivityPhase` enum (Replaying/Scanning/Aggregating/Reconciling/Live/Idle) and `PhaseRecord` for the phase timeline system tracked in `DebugStats`.
-- **store.rs** -- SQLite schema v6 (integer-keyed entries with `name_folded` column on macOS, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. `resolve_component` uses the composite index directly: on macOS queries by `(parent_id, name_folded)`, on Linux/Windows by `(parent_id, name)`. Schema version check: mismatch triggers drop+rebuild. Both path-keyed (backward compat) and integer-keyed APIs.
+- **store.rs** -- SQLite schema v7 (integer-keyed entries with `name_folded` column on macOS, dir_stats by entry_id, meta), platform_case collation, read queries, DB open/migrate. `resolve_component` uses the composite index directly: on macOS queries by `(parent_id, name_folded)`, on Linux/Windows by `(parent_id, name)`. Schema version check: mismatch triggers drop+rebuild. v7 adds dual sizes (logical + physical). Both path-keyed (backward compat) and integer-keyed APIs.
 - **memory_watchdog.rs** -- Background task monitoring resident memory via `mach_task_info` (macOS). Warns at 8 GB, stops indexing at 16 GB, emits `index-memory-warning` event to frontend. No-op stub on non-macOS. Started from `start_indexing()`.
 - **writer.rs** -- Single writer thread, owns the write connection, processes `WriteMessage` channel (bounded `sync_channel`, 20K capacity, backpressure via blocking). `WRITER_GENERATION: AtomicU64` (initialized to 1) bumped on every mutation (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `TruncateData`) for search index staleness detection. Priority: `UpdateDirStats` before `InsertEntries`. `Flush` variant + async `flush()` method let callers wait for all prior writes to commit. Has both integer-keyed variants (`InsertEntriesV2`, `UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`) and path-keyed backward-compat variants. The integer-keyed delete/subtree-delete handlers auto-propagate negative deltas via the `parent_id` chain (same pattern as the path-keyed variants). `propagate_delta_by_id` walks the parent chain using `get_parent_id` lookups. `UpsertEntryV2` initializes a zero-valued `dir_stats` row when inserting a NEW directory, so enrichment always has a row (subsequent `PropagateDeltaById` calls update it incrementally). Maintains `AccumulatorMaps` during `InsertEntriesV2` processing (two HashMaps: direct children stats and child dir relationships + an `entries_inserted` counter), cleared on `TruncateData`. On `ComputeAllAggregates`, passes accumulated maps to `aggregator::compute_all_aggregates_with_maps()` to skip expensive full-table-scan SQL queries. Accepts an optional `AppHandle` at spawn time to emit `index-aggregation-progress` events during aggregation (phase, current, total). Also emits `saving_entries` phase progress during `InsertEntriesV2` processing when the expected total is set via `set_expected_total_entries()` (an `Arc<AtomicU64>` shared between the writer thread and the `IndexWriter` handle). No index drop/recreate dance — the composite indexes (`idx_parent_name_folded` on macOS, `idx_parent_name` on Linux) use binary collation and stay present during scans.
 - **scanner.rs** -- jwalk-based parallel directory walker. `scan_volume()` for full scan, `scan_subtree()` for targeted subtree rescans (used by post-replay background verification). Uses `ScanContext` (from store.rs) to assign integer IDs and parent IDs during the walk: maintains a `HashMap<PathBuf, i64>` mapping directory paths to assigned IDs. The scan root is mapped to `ROOT_ID` (1). Sends `InsertEntriesV2(Vec<EntryRow>)` batches to the writer. Platform-specific exclusion filters via `should_exclude` (`pub(super)`) — the single exclusion gate for all code paths (scanner, reconciler, event_loop verification, per-navigation verifier). `default_exclusions()` is `#[cfg(test)]` only. Physical sizes (`st_blocks * 512`).
@@ -76,18 +76,18 @@ All writes go through a dedicated `std::thread` via a bounded `sync_channel` (20
 
 Reads happen on separate WAL connections (any thread). A `ReadPool` provides thread-local read connections for enrichment and verification without contending on the `INDEXING` state-machine mutex.
 
-### SQLite schema (v6: integer-keyed, platform-conditional composite index)
+### SQLite schema (v7: integer-keyed, platform-conditional composite index)
 
 One DB per volume. **Dev and prod use separate directories** (see AGENTS.md § Debugging):
 - **Prod**: `~/Library/Application Support/com.veszelovszki.cmdr/index-{volume_id}.db`
 - **Dev**: `~/Library/Application Support/com.veszelovszki.cmdr-dev/index-{volume_id}.db`
 
 Three tables:
-- `entries` (id INTEGER PK, parent_id, name COLLATE platform_case, [name_folded on macOS], is_directory, is_symlink, size, modified_at). Root sentinel: id=1, parent_id=0, name="".
+- `entries` (id INTEGER PK, parent_id, name COLLATE platform_case, [name_folded on macOS], is_directory, is_symlink, logical_size, physical_size, modified_at). Root sentinel: id=1, parent_id=0, name="".
   - **macOS**: has a `name_folded TEXT NOT NULL` column storing `normalize_for_comparison(name)` (NFD + case fold). Index: `idx_parent_name_folded ON entries (parent_id, name_folded)`.
   - **Linux/Windows**: no `name_folded` column. Index: `idx_parent_name ON entries (parent_id, name)`.
   - The old `idx_parent(parent_id)` from v5 is removed; the composite indexes replace it.
-- `dir_stats` (entry_id INTEGER PK, recursive_size, recursive_file_count, recursive_dir_count)
+- `dir_stats` (entry_id INTEGER PK, recursive_logical_size, recursive_physical_size, recursive_file_count, recursive_dir_count)
 - `meta` (key TEXT PK, value TEXT) WITHOUT ROWID
 
 WAL mode, 16 MB page cache, `auto_vacuum = INCREMENTAL` (free pages reclaimed via `PRAGMA incremental_vacuum` after truncation). Custom `platform_case` collation registered on every connection: case-insensitive + NFD normalization on macOS, binary on Linux. **Opening the DB with the sqlite3 CLI will fail** on queries touching the name column (the collation isn't registered).
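To illustrate how the v7 `dir_stats` columns relate to `entries`, here is a rough in-memory sketch that derives both recursive sizes by walking each entry's `parent_id` chain. It is not the project's aggregator: the `Entry` and `DirStatsRow` types and the `aggregate` function are hypothetical, the integer widths are guesses, and it assumes that only files (not directories) contribute to the size sums and that the entries form a well-formed tree.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory mirror of the relevant `entries` columns.
#[derive(Debug, Clone)]
pub struct Entry {
    pub id: i64,
    pub parent_id: i64,
    pub is_directory: bool,
    pub logical_size: u64,
    pub physical_size: u64,
}

/// Hypothetical in-memory mirror of a `dir_stats` row (keyed by entry_id).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct DirStatsRow {
    pub recursive_logical_size: u64,
    pub recursive_physical_size: u64,
    pub recursive_file_count: u64,
    pub recursive_dir_count: u64,
}

/// Computes recursive stats for every directory by adding each entry to
/// all of its ancestors. The walk stops at parent_id = 0, the parent of
/// the root sentinel (id = 1).
pub fn aggregate(entries: &[Entry]) -> HashMap<i64, DirStatsRow> {
    let parent_of: HashMap<i64, i64> =
        entries.iter().map(|e| (e.id, e.parent_id)).collect();

    // Every directory gets a row, even if it is empty.
    let mut stats: HashMap<i64, DirStatsRow> = entries
        .iter()
        .filter(|e| e.is_directory)
        .map(|e| (e.id, DirStatsRow::default()))
        .collect();

    for e in entries {
        let mut ancestor = e.parent_id;
        while ancestor != 0 {
            let row = stats.entry(ancestor).or_default();
            if e.is_directory {
                row.recursive_dir_count += 1;
            } else {
                row.recursive_file_count += 1;
                row.recursive_logical_size += e.logical_size;
                row.recursive_physical_size += e.physical_size;
            }
            ancestor = match parent_of.get(&ancestor) {
                Some(&p) => p,
                None => break, // dangling parent: stop rather than loop
            };
        }
    }
    stats
}
```

Keeping the two sums side by side means a sparse or cloned file can report a small physical size and a large logical size (or the reverse) without either total distorting the other.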
@@ -97,6 +97,7 @@ History of changes:
 - **Schema v4**: Bumped from v3 to enable `auto_vacuum = INCREMENTAL` (requires DB rebuild since the pragma must be set before table creation).
 - **Schema v5**: Replaced composite `UNIQUE INDEX idx_parent_name(parent_id, name)` with simple `INDEX idx_parent(parent_id)`. The composite index with `platform_case` collation was extremely slow to build (~25 min for 5.1M entries). A simple integer index needs no drop/recreate dance during scans.
 - **Schema v6**: Added `name_folded` column (macOS only) storing pre-computed `normalize_for_comparison(name)`. Replaced `idx_parent` with platform-conditional composite indexes: `idx_parent_name_folded(parent_id, name_folded)` on macOS, `idx_parent_name(parent_id, name)` on Linux/Windows. `resolve_component` now queries the index directly instead of fetching all children and matching in Rust.
+- **Schema v7**: Dual sizes. `entries.size` renamed to `entries.logical_size`, added `entries.physical_size`. `dir_stats.recursive_size` renamed to `dir_stats.recursive_logical_size`, added `dir_stats.recursive_physical_size`. Logical size = `meta.len()`, physical size = `st_blocks * 512` on Unix (both = `meta.len()` on non-Unix). The IPC boundary (`DirStats` struct) still exposes `recursive_size` mapped from `recursive_logical_size` to avoid frontend churn. `AccumulatorMaps.direct_stats` changed to 4-tuple `(logical_size_sum, physical_size_sum, file_count, dir_count)`.
 
 ## How to test

@@ -124,7 +125,7 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
 
 **IPC boundary stays path-based**: Frontend sends filesystem paths, backend resolves path→ID internally via `store::resolve_path()`. No frontend changes needed. IPC dir stats queries (`get_dir_stats`, `get_dir_stats_batch`) use `ReadPool` for lock-free reads, same as enrichment.
 
-**Physical sizes (`st_blocks * 512`)**: More meaningful for disk usage than logical size. May overcount ~10-20% for APFS clones (shared blocks). Volume usage bar uses `statfs()` for true totals.
+**Dual sizes (logical + physical)**: Both `meta.len()` (logical) and `st_blocks * 512` (physical) are stored. Logical size is displayed by default (mapped to `recursive_size` at the IPC boundary). Physical size is stored in DB but not yet exposed to the frontend. Physical sizes may overcount ~10-20% for APFS clones (shared blocks). Volume usage bar uses `statfs()` for true totals.
 
 **MustScanSubDirs uses reconciliation, not delete-then-reinsert**: `reconcile_subtree()` diffs the filesystem against the DB directory-by-directory, only inserting/deleting/updating entries that changed. This is safe to interrupt at any point (no bulk delete phase that could leave the DB empty). For brand-new directories discovered during reconciliation, a `flush_blocking()` + re-resolve cycle ensures their IDs are available before recursing into them. `scanner::scan_subtree` (which uses destructive `DeleteDescendantsById`) is used by post-replay background verification for newly discovered directories.
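The IPC mapping described in the dual-sizes note might look roughly like the sketch below. Only the field names `recursive_size`, `recursive_logical_size`, `recursive_physical_size`, `recursive_file_count`, and `recursive_dir_count` come from the commit; the `DirStatsRow` type, the integer widths, and the `From` impl are assumptions made for illustration.

```rust
/// Hypothetical storage-side row with the schema v7 `dir_stats` columns.
pub struct DirStatsRow {
    pub recursive_logical_size: u64,
    pub recursive_physical_size: u64,
    pub recursive_file_count: u64,
    pub recursive_dir_count: u64,
}

/// IPC-facing shape: keeps the pre-v7 `recursive_size` field name so the
/// frontend needs no changes. The exact field types are an assumption.
#[derive(Debug, PartialEq, Eq)]
pub struct DirStats {
    pub recursive_size: u64,
    pub recursive_file_count: u64,
    pub recursive_dir_count: u64,
}

impl From<&DirStatsRow> for DirStats {
    fn from(row: &DirStatsRow) -> Self {
        DirStats {
            // Logical size is what gets displayed; the physical size stays
            // in the DB and is intentionally not forwarded here.
            recursive_size: row.recursive_logical_size,
            recursive_file_count: row.recursive_file_count,
            recursive_dir_count: row.recursive_dir_count,
        }
    }
}
```

Doing the rename at a single conversion point keeps the storage layer free to carry both sizes while the wire format stays stable.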
