Indexing: detect stale index, notify user, rescan

vdavid · vdavid · commit b590a54e6b9e · 2026-03-10T16:22:57.000+01:00
- Add `RescanReason` enum (7 variants) and `index-rescan-notification` Tauri event emitted from every code path that falls back to a full rescan
- Pre-check in `resume_or_scan()` compares stored `last_event_id` with `FSEventsGetCurrentEventId()` before starting the FSEvents stream — prevents the 1024-capacity `try_send` channel in `cmdr-fsevent-stream` from being overwhelmed with millions of replayed events
- Truncate `entries` + `dir_stats` via new `TruncateData` writer message before rescanning a stale DB: a full scan that runs `INSERT OR REPLACE` into a populated table with the `platform_case` collation takes ~30 min, vs ~2.5 min into an empty table
- Add `flush_blocking()` to `IndexWriter` for sync contexts
- Add `did_buffer_overflow()` accessor to `EventReconciler`
- Frontend: listen for `index-rescan-notification`, show info toast with reason-specific user-friendly message (8s timeout, deduped by `id: 'index-rescan'`)
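
For reviewers, the pre-check decision can be sketched in isolation as below. This is a minimal illustration, not the code in `resume_or_scan()`: `StartupAction` and `decide_startup` are hypothetical names, and the threshold value is assumed from the "event gap > 1M" wording in `CLAUDE.md`. The sketch uses `saturating_sub` so the comparison cannot overflow; otherwise it follows the same condition as the patch (`current_id > 0` and gap above the threshold).

```rust
/// Assumed value; the real constant is `JOURNAL_GAP_THRESHOLD` in the indexing module.
const JOURNAL_GAP_THRESHOLD: u64 = 1_000_000;

/// Hypothetical type for illustration: what startup does with an existing index.
#[derive(Debug, PartialEq)]
enum StartupAction {
    /// Gap is small enough: replay the FSEvents journal from the stored ID.
    Replay { since: u64 },
    /// Gap is too large: notify (StaleIndex), truncate, then run a full scan.
    FullScan { gap: u64 },
}

/// Hypothetical helper: compare the stored event ID with the current system event ID.
fn decide_startup(last_event_id: u64, current_id: u64) -> StartupAction {
    // saturating_sub yields 0 when current_id <= last_event_id, so no underflow.
    let gap = current_id.saturating_sub(last_event_id);
    // current_id == 0 means the system ID is unknown; fall through to replay,
    // where the in-loop JournalGap check still applies.
    if current_id > 0 && gap > JOURNAL_GAP_THRESHOLD {
        StartupAction::FullScan { gap }
    } else {
        StartupAction::Replay { since: last_event_id }
    }
}
```

A gap of exactly the threshold still replays; only a strictly larger gap triggers the truncate-and-rescan path.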
diff --git a/apps/desktop/src-tauri/src/indexing/CLAUDE.md b/apps/desktop/src-tauri/src/indexing/CLAUDE.md
@@ -32,8 +32,11 @@ App startup
   |-- init(): register IndexManagerState in Tauri
   |-- start_indexing(): create IndexManager, open SQLite, spawn writer thread
   |-- resume_or_scan():
-  |   |-- macOS: Has existing index + last_event_id? -> sinceWhen replay (FSEvents journal)
+  |   |-- macOS: Has existing index + last_event_id?
+  |   |   |-- Pre-check: event gap > 1M? -> emit index-rescan-notification (StaleIndex), truncate entries+dir_stats, full scan
+  |   |   |-- Otherwise -> sinceWhen replay (FSEvents journal)
   |   |-- Linux: Always full rescan (no event journal; existing DB used for instant enrichment)
+  |   |-- Incomplete previous scan (has data but no scan_completed_at)? -> notify + fresh scan
   |   |-- Otherwise -> fresh full scan
   |
 Full scan:
@@ -118,8 +121,12 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
 
 **APFS firmlinks**: Scan from `/` only, skip `/System/Volumes/Data`. Normalize all paths via firmlink prefix map so DB lookups work regardless of how the user navigated to a path.
 
+**Rescan notification system (`RescanReason` enum)**: Every code path that falls back to a full rescan emits an `index-rescan-notification` event carrying a `RescanReason` variant and a human-readable `details` string (logged at INFO level, not shown to the user directly). The frontend maps each reason to a user-friendly toast message. Seven reasons: `StaleIndex` (event ID gap found by the pre-check), `JournalGap` (gap found on the first event inside the replay loop), `ReplayOverflow` (>1M events), `TooManySubdirRescans` (>1K MustScanSubDirs), `WatcherStartFailed` (DriveWatcher failed to start for replay), `ReconcilerBufferOverflow` (>500K buffered events during scan), `IncompletePreviousScan` (has data but no `scan_completed_at`). The pre-check in `resume_or_scan()` catches stale indexes before starting the FSEvents stream, preventing the `cmdr-fsevent-stream` channel (1024 capacity, `try_send`) from being overwhelmed by millions of replayed events.
+
 ## Gotchas
 
+**INSERT OR REPLACE on a populated DB is catastrophically slow**: The `platform_case` collation (NFD + case fold on macOS) runs for every B-tree comparison during unique index lookups. On an empty DB a full scan takes ~2.5 min; on a populated DB with 5.5M entries the same scan takes ~30 min because each `INSERT OR REPLACE` triggers ~20 collation calls to traverse the B-tree. The `StaleIndex` path truncates `entries` and `dir_stats` via `TruncateData` + `flush_blocking()` before starting the scan to avoid this. Never do a full rescan into a populated DB without clearing first.
+
 **Cold-start replay enters live mode immediately after flush**: The `run_replay_event_loop` doesn't emit `index-dir-updated` during Phase 1 (replay). It collects affected paths, flushes the writer (ensuring all writes are committed), emits a single batched notification, re-enables micro-scans, and enters live mode right away (~100ms from startup). Post-replay verification (`verify_affected_dirs`) runs in a background task (`run_background_verification`) concurrently with live events. This is safe because the writer serializes all writes. Any corrections found by verification are emitted as a separate `index-dir-updated` batch.
 
 **Live events are deduplicated and batched with a 1s window**: Both `run_live_event_loop` and the Phase 3 live loop in `run_replay_event_loop` collect incoming events into a `HashMap<String, FsChangeEvent>` keyed by normalized path. On each 1s flush tick, only the deduplicated set is processed through `process_live_event`. `merge_fs_events` keeps the most significant flags when events collide: `must_scan_sub_dirs` always wins, then `removed`, then `created`, then `modified`. `UpdateLastEventId` is sent once per batch (in `process_live_batch`) instead of per-event, reducing writer channel pressure during event storms.
diff --git a/apps/desktop/src-tauri/src/indexing/mod.rs b/apps/desktop/src-tauri/src/indexing/mod.rs
@@ -247,6 +247,50 @@ pub struct IndexReplayProgressEvent {
     pub estimated_total: Option<u64>,
 }
 
+/// Why a full rescan was triggered instead of incremental replay.
+/// Sent to the frontend as `index-rescan-notification` so the UI can show
+/// a transparent, user-friendly toast.
+#[derive(Debug, Clone, Serialize, Deserialize)]
+#[serde(rename_all = "snake_case")]
+pub enum RescanReason {
+    /// Event ID gap too large — app hasn't run for a long time.
+    StaleIndex,
+    /// FSEvents journal unavailable (gap detected during replay).
+    JournalGap,
+    /// Replay processed too many events (safety limit exceeded).
+    ReplayOverflow,
+    /// Too many MustScanSubDirs events during replay.
+    TooManySubdirRescans,
+    /// DriveWatcher failed to start for replay.
+    WatcherStartFailed,
+    /// Reconciler event buffer overflowed during scan.
+    ReconcilerBufferOverflow,
+    /// Previous scan didn't complete (app crashed or was force-quit).
+    IncompletePreviousScan,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+#[serde(rename_all = "camelCase")]
+pub struct IndexRescanNotificationEvent {
+    pub volume_id: String,
+    pub reason: RescanReason,
+    /// Human-readable details for logs (not shown to user directly).
+    pub details: String,
+}
+
+/// Emit an `index-rescan-notification` event and log the reason at INFO level.
+fn emit_rescan_notification(app: &AppHandle, volume_id: &str, reason: RescanReason, details: String) {
+    log::info!("Index rescan triggered ({reason:?}): {details}");
+    let _ = app.emit(
+        "index-rescan-notification",
+        IndexRescanNotificationEvent {
+            volume_id: volume_id.to_string(),
+            reason,
+            details,
+        },
+    );
+}
+
 // ── Response types ───────────────────────────────────────────────────
 
 #[derive(Debug, Clone, Serialize, Deserialize)]
@@ -370,6 +414,36 @@ impl IndexManager {
             if let Some(ref last_event_id_str) = status.last_event_id {
                 let last_event_id: u64 = last_event_id_str.parse().unwrap_or(0);
                 if last_event_id > 0 {
+                    // Pre-check: compare stored event ID with current system event ID.
+                    // If the gap is too large, skip replay entirely — the cmdr-fsevent-stream
+                    // channel (1024 capacity, try_send) would silently drop most events,
+                    // and replaying millions of events is slower than a fresh scan anyway.
+                    let current_id = watcher::current_event_id();
+                    if current_id > 0 && current_id > last_event_id + JOURNAL_GAP_THRESHOLD {
+                        let gap = current_id - last_event_id;
+                        emit_rescan_notification(
+                            &self.app,
+                            &self.volume_id,
+                            RescanReason::StaleIndex,
+                            format!(
+                                "Stored last_event_id={last_event_id}, current system \
+                                 event_id={current_id}, gap={gap} \
+                                 (threshold={JOURNAL_GAP_THRESHOLD}). \
+                                 The app likely hasn't run for a long time."
+                            ),
+                        );
+                        // Truncate entries + dir_stats before scanning. INSERT OR REPLACE on a
+                        // populated DB with the `platform_case` collation is extremely slow
+                        // (30 min vs 2.5 min on empty). The stale data is useless anyway.
+                        if let Err(e) = self.writer.send(WriteMessage::TruncateData) {
+                            log::warn!("Failed to send TruncateData: {e}");
+                        }
+                        if let Err(e) = self.writer.flush_blocking() {
+                            log::warn!("Failed to flush after TruncateData: {e}");
+                        }
+                        return self.start_scan();
+                    }
+
                     log::debug!(
                         "Existing index found (scan_completed_at={}, last_event_id={last_event_id}), \
                          attempting sinceWhen replay",
@@ -381,8 +455,17 @@ impl IndexManager {
             log::debug!("Existing index found but no last_event_id, starting fresh scan");
         } else if status.scan_completed_at.is_some() {
             log::debug!("Existing index found, starting rescan (no event replay on this platform)");
+        } else if status.last_event_id.is_some() {
+            emit_rescan_notification(
+                &self.app,
+                &self.volume_id,
+                RescanReason::IncompletePreviousScan,
+                "Index DB exists but scan_completed_at is not set. Previous scan likely didn't \
+                 finish."
+                    .to_string(),
+            );
         } else {
-            log::debug!("No existing index (scan_completed_at not set), starting fresh scan");
+            log::debug!("No existing index, starting fresh scan");
         }
 
         self.start_scan()
@@ -403,7 +486,12 @@ impl IndexManager {
                 log::debug!("DriveWatcher started for replay (sinceWhen={since_event_id}, current={current_id})");
             }
             Err(e) => {
-                log::warn!("Failed to start DriveWatcher for replay: {e}, falling back to full scan");
+                emit_rescan_notification(
+                    &self.app,
+                    &self.volume_id,
+                    RescanReason::WatcherStartFailed,
+                    format!("DriveWatcher failed to start for replay: {e}"),
+                );
                 return self.start_scan();
             }
         }
@@ -610,6 +698,18 @@ impl IndexManager {
                     }
                     log::debug!("Reconciler: buffered {buffered_count} events during scan");
 
+                    if reconciler.did_buffer_overflow() {
+                        emit_rescan_notification(
+                            &app,
+                            &volume_id,
+                            RescanReason::ReconcilerBufferOverflow,
+                            "The filesystem watcher buffered over 500,000 events during the \
+                             scan, exceeding the reconciler's capacity. A lot of filesystem \
+                             activity was happening during the scan."
+                                .to_string(),
+                        );
+                    }
+
                     // Flush the writer to ensure all scan batches are committed
                     // before opening the read connection. Without this, the WAL
                     // snapshot may not include the latest InsertEntriesV2 batches,
@@ -1134,11 +1234,17 @@ async fn run_replay_event_loop(
         if !first_event_checked {
             first_event_checked = true;
             if event.event_id > since_event_id + JOURNAL_GAP_THRESHOLD {
-                log::warn!(
-                    "Journal gap detected: stored last_event_id={since_event_id}, \
-                     first received event_id={}, gap={}",
-                    event.event_id,
-                    event.event_id - since_event_id,
+                emit_rescan_notification(
+                    &app,
+                    &volume_id,
+                    RescanReason::JournalGap,
+                    format!(
+                        "Stored last_event_id={since_event_id}, first received event_id={}, \
+                         gap={} (threshold={JOURNAL_GAP_THRESHOLD}). FSEvents journal may \
+                         have been purged.",
+                        event.event_id,
+                        event.event_id - since_event_id,
+                    ),
                 );
                 // Re-enable micro-scans before falling back to full scan
                 micro_scans.set_replay_active(false);
@@ -1212,9 +1318,15 @@ async fn run_replay_event_loop(
         // fall back to a full scan. Handles the FDA-toggle scenario where
         // the app suddenly sees millions of previously hidden paths.
         if event_count >= REPLAY_EVENT_COUNT_LIMIT {
-            log::warn!(
-                "Replay: event count ({event_count}) exceeded safety limit \
-                 ({REPLAY_EVENT_COUNT_LIMIT}). Aborting replay and falling back to full scan."
+            emit_rescan_notification(
+                &app,
+                &volume_id,
+                RescanReason::ReplayOverflow,
+                format!(
+                    "Replay processed {event_count} events, exceeding the safety limit of \
+                     {REPLAY_EVENT_COUNT_LIMIT}. This can happen when Full Disk Access was \
+                     toggled."
+                ),
             );
             micro_scans.set_replay_active(false);
             if let Some(tx) = fallback_tx.take() {
@@ -1321,7 +1433,15 @@ async fn run_replay_event_loop(
     // Queue any MustScanSubDirs rescans that were deferred during replay.
     // If pending_rescans overflowed, trigger a full rescan via fallback.
     if pending_rescans_overflow {
-        log::warn!("Replay: pending rescans overflowed, triggering full rescan");
+        emit_rescan_notification(
+            &app,
+            &volume_id,
+            RescanReason::TooManySubdirRescans,
+            format!(
+                "Replay accumulated more than {MAX_PENDING_RESCANS} directories needing full \
+                 rescans. This typically means a major filesystem reorganization happened."
+            ),
+        );
         if let Some(tx) = fallback_tx.take() {
             let _ = tx.send(());
         }
diff --git a/apps/desktop/src-tauri/src/indexing/reconciler.rs b/apps/desktop/src-tauri/src/indexing/reconciler.rs
@@ -280,6 +280,11 @@ impl EventReconciler {
         });
     }
 
+    /// Whether the reconciler's event buffer overflowed during the scan.
+    pub(super) fn did_buffer_overflow(&self) -> bool {
+        self.buffer_overflow
+    }
+
     /// Number of buffered events (for diagnostics).
     #[cfg(test)]
     pub fn buffer_len(&self) -> usize {
diff --git a/apps/desktop/src-tauri/src/indexing/writer.rs b/apps/desktop/src-tauri/src/indexing/writer.rs
@@ -63,6 +63,10 @@ pub enum WriteMessage {
     /// Flush: confirms all prior messages have been committed.
     /// The writer responds through the channel after processing this message.
     Flush(oneshot::Sender<()>),
+    /// Truncate `entries` and `dir_stats` tables, preserving `meta`.
+    /// Used before a full rescan on a stale DB to avoid slow `INSERT OR REPLACE`
+    /// on a populated table with the expensive `platform_case` collation.
+    TruncateData,
     /// Begin an explicit SQLite transaction.
     /// All subsequent writes are batched until `CommitTransaction`.
     /// Dramatically reduces fsync overhead for bulk operations (replay).
@@ -138,6 +142,19 @@ impl IndexWriter {
         })
     }
 
+    /// Send a `Flush` and block until all prior messages have been committed.
+    /// Safe to call from synchronous code (no async runtime needed).
+    pub fn flush_blocking(&self) -> Result<(), IndexStoreError> {
+        let (tx, rx) = oneshot::channel();
+        self.send(WriteMessage::Flush(tx))?;
+        rx.blocking_recv().map_err(|_| {
+            IndexStoreError::Io(std::io::Error::new(
+                std::io::ErrorKind::BrokenPipe,
+                "Writer thread dropped flush reply",
+            ))
+        })
+    }
+
     /// Send a `Shutdown` message and wait for the writer thread to finish.
     ///
     /// Joins the thread to ensure all buffered writes are flushed.
@@ -388,6 +405,20 @@ fn process_message(conn: &rusqlite::Connection, msg: WriteMessage, stats: &Write
         } => {
             propagate_delta_by_id(conn, entry_id, size_delta, file_count_delta, dir_count_delta);
         }
+        WriteMessage::TruncateData => {
+            let t = Instant::now();
+            match conn.execute_batch(
+                "DELETE FROM dir_stats; DELETE FROM entries; INSERT OR IGNORE INTO entries (id, parent_id, name, is_directory, is_symlink) VALUES (1, 0, '', 1, 0);",
+            ) {
+                Ok(()) => {
+                    log::info!(
+                        "Writer: truncated entries + dir_stats ({}ms)",
+                        t.elapsed().as_millis(),
+                    );
+                }
+                Err(e) => log::warn!("Writer: truncate failed: {e}"),
+            }
+        }
         WriteMessage::ComputeAllAggregates => {
             let t = Instant::now();
             match aggregator::compute_all_aggregates(conn) {
diff --git a/apps/desktop/src/lib/indexing/CLAUDE.md b/apps/desktop/src/lib/indexing/CLAUDE.md
@@ -35,13 +35,14 @@ cancelNavPriority(path: string): Promise<void>
 
 ## Scan state (`index-state.svelte.ts`)
 
-Module-level `$state` variables (`scanning`, `entriesScanned`, `dirsFound`) react to three Tauri events:
+Module-level `$state` variables (`scanning`, `entriesScanned`, `dirsFound`) react to four Tauri events:
 
-| Event                 | Payload                                             | Effect                               |
-| --------------------- | --------------------------------------------------- | ------------------------------------ |
-| `index-scan-started`  | `{ volumeId }`                                      | `scanning = true`, counters reset    |
-| `index-scan-progress` | `{ volumeId, entriesScanned, dirsFound }`           | Update counters                      |
-| `index-scan-complete` | `{ volumeId, totalEntries, totalDirs, durationMs }` | `scanning = false`, set final counts |
+| Event                       | Payload                                             | Effect                                       |
+| --------------------------- | --------------------------------------------------- | -------------------------------------------- |
+| `index-scan-started`        | `{ volumeId }`                                      | `scanning = true`, counters reset            |
+| `index-scan-progress`       | `{ volumeId, entriesScanned, dirsFound }`           | Update counters                              |
+| `index-scan-complete`       | `{ volumeId, totalEntries, totalDirs, durationMs }` | `scanning = false`, set final counts         |
+| `index-rescan-notification` | `{ volumeId, reason, details }`                     | Show info toast with reason-specific message |
 
 **Startup race condition**: The Rust indexer starts in Tauri's `setup()` hook before the frontend registers listeners.
 `initIndexState` uses a "listen first, then query" pattern: registers event listeners, then calls `get_index_status` IPC
@@ -105,4 +106,5 @@ No unit or integration tests exist for this module yet. Manual testing via the R
 
 - `@tauri-apps/api/core` — `invoke`
 - `$lib/tauri-commands` — `listen`, `UnlistenFn`
+- `$lib/ui/toast` — `addToast` (rescan notification toasts)
 - `$lib/file-explorer/selection/selection-info-utils` — `formatNumber` (overlay only)
diff --git a/apps/desktop/src/lib/indexing/index-state.svelte.ts b/apps/desktop/src/lib/indexing/index-state.svelte.ts