Skip to content

Commit bb64ad3

Browse files
committed
Drive indexing: stop the NAS first-scan stalling on snapshot dirs, and cut scan log volume ~90%
A reported NAS first-scan stuck near 50% was the recursive size scan grinding through `@Recently-Snapshot` — a Synology Btrfs snapshot tree holding full point-in-time copies of the whole share (44 TB of hardlinked data on a 10 TB volume). The scanner re-walked the entire filesystem once per snapshot over a single serialized SMB connection, so it could run for hours. The same walk firehosed the always-Debug log file (a fresh 50 MB rotation every few minutes, so a user's error report held only ~25 min). Fix the scan: - The fresh AND reconcile network scanners (`volume_scanner.rs`) no longer recurse into NAS snapshot/system pseudo-dirs (`@eaDir`, `@Recently-Snapshot`, `@Recycle`, `#recycle`, `#snapshot`, `.snapshot`, `$RECYCLE.BIN`, `System Volume Information`, …) via `system_dirs::is_recursion_excluded_dir` (case-insensitive). The dir's own row is still indexed, so it stays listed and navigable (you can walk into `@Recycle` to restore a file); only its subtree is left un-walked, so it shows an honest unknown size (`—`/`≥`) instead of a misleading roll-up. `@eaDir` (Synology thumbnail sidecar) is matched at ANY depth, not just the share root, since it lives inside every media folder. - Summing snapshots was both ruinous (a full tree traversal per snapshot) AND wrong (the bytes are deduped, not real consumed space), so excluding them makes the number more correct, not just faster. - Scope is the network scanner only (the home of these dirs); the local jwalk scanner has its own `should_exclude`. Slim the logging (keeps DEBUG error-report bundles useful, just drops the plumbing flood): - The reconciler's `parent path not in DB, skipping event` line (~13% of normal log volume, almost all harmless build-output churn) now routes through the existing skip-aggregator pattern at TRACE + a 5 s DEBUG summary. Generalized `reconciler.rs::unknown_path_skips` into a reusable `skip_aggregator` with two instances (`UNKNOWN_PATH_SKIPS`, `STALE_PARENT_SKIPS`). - `SmbVolume::list_directory` per-listing lines (~9%) drop to TRACE. - A throttled (~1/s) `volume_scanner: scanning|reconciling… N dirs, M entries, current: <path>` DEBUG heartbeat replaces the per-listing line as the scan signal, so an error report still shows the scan is alive, where it is, and how far along — without the per-directory flood. - Fixed a latent UTF-8 panic in the skip-sample truncation: `&path[..80]` could slice mid-codepoint on accented NAS paths (e.g. "Külkeres síelés"); now truncates on a char boundary. Tests: `skips_recursion_into_nas_system_dirs` (verified RED without the guard) plus `system_dirs` unit tests; full check suite green. Not included: the `smb2` crate's per-frame DEBUG plumbing (signing/recv/execute, ~64% of the firehose) lives in a separate repo and needs its own release; demoting it to TRACE there is the right follow-up, much less urgent now that the snapshot walk no longer generates millions of SMB ops.
1 parent 61c66a0 commit bb64ad3

6 files changed

Lines changed: 340 additions & 60 deletions

File tree

apps/desktop/src-tauri/src/file_system/volume/backends/smb.rs

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ use super::{
1111
};
1212
use crate::file_system::listing::FileEntry;
1313
use crate::file_system::listing::caching::try_get_watched_listing;
14-
use log::{debug, info, warn};
14+
use log::{debug, info, trace, warn};
1515
use smb2::client::tree::Tree;
1616
use smb2::{ClientConfig, SmbClient};
1717
use std::future::Future;
@@ -446,7 +446,11 @@ impl SmbVolume {
446446
let smb_path = self.to_smb_path(path);
447447
let display_path = self.to_display_path(&smb_path);
448448

449-
debug!(
449+
// TRACE, not DEBUG: this fires per listing for both the live pane and the index
450+
// scan, and was ~9% of normal file-log volume. The scan's own progress signal is
451+
// the throttled `volume_scanner: scanning…` DEBUG heartbeat. Bump back with
452+
// `RUST_LOG=cmdr_lib::file_system::volume::backends::smb=trace` when chasing a listing bug.
453+
trace!(
450454
"SmbVolume::list_directory: share={}, input={:?}, smb_path={:?}",
451455
self.share_name, path, smb_path
452456
);
@@ -465,7 +469,7 @@ impl SmbVolume {
465469
.map(|e| directory_entry_to_file_entry(e, &display_path))
466470
.collect();
467471

468-
debug!(
472+
trace!(
469473
"SmbVolume::list_directory: completed in {:?}, {} entries",
470474
start.elapsed(),
471475
entries.len()

apps/desktop/src-tauri/src/indexing/DETAILS.md

Lines changed: 23 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -149,6 +149,10 @@ Three disciplines for network round trips (plan rabbit hole #3), all in `list_on
149149

150150
A sub-directory that fails to list (permission, transient) is skipped and the walk continues (like jwalk skipping errored entries); failing to list the ROOT is fatal (nothing to index) so the caller discards.
151151

152+
#### NAS snapshot/system dirs aren't recursed (`system_dirs.rs`)
153+
154+
The BFS does NOT descend into NAS snapshot/system pseudo-directories (`@eaDir`, `@Recently-Snapshot`, `@Recycle`, `#recycle`, `#snapshot`, `.snapshot`, `$RECYCLE.BIN`, `System Volume Information`, …; matched case-insensitively by `system_dirs::is_recursion_excluded_dir`). Both the fresh scan and the reconcile walk apply it: the dir's own row is still indexed (so it stays listed and navigable — a user can walk into `@Recycle` to restore a file), but its subtree is never walked, so it rolls up as honestly-unknown (`—`/`≥`) rather than a misleading total. **Decision/Why:** these dirs are hardlinked, huge, and re-walking them costs a full filesystem traversal *per snapshot* over serialized SMB — a real first-scan stalled near 50% grinding `@Recently-Snapshot`, which alone reported 44 TB on a 10 TB volume. Summing them is both ruinous and wrong (the bytes are deduped, not real consumed space). The names are reserved vendor conventions (`@`/`#`/`$` prefixes) that don't collide with user folders, so a name match is safe. **Guardrail:** don't remove the exclusion to "fill in" the missing sizes — that re-triggers the stall. Scope is the network scanner only (the home of these dirs); the local jwalk scanner has its own `should_exclude`. `FileEntry` carries no DOS hidden/system attribute today; if one is plumbed through, "hidden + system" would generalize this without the hardcoded list.
155+
152156
### Direct-smb2 gating (`smb_index.rs`, plan rabbit hole #13)
153157

154158
Indexing an SMB share requires Cmdr's own smb2 (`direct`) session, NOT the macOS `os_mount`: `CHANGE_NOTIFY` watching runs over smb2 anyway, and smb2 parallelizes listing far better than per-`readdir` round trips through the kernel mount. An `os_mount` share is registered as a `LocalPosixVolume` on an `smbfs` mount (its `smb_connection_state()` is `None`; the FE enriches it to `OsMount`); a direct one is an `SmbVolume` returning `Some(Direct)`. `ensure_direct_smb` therefore: `Direct` → index now; `Disconnected` → refuse (reconnect first); `os_mount` → trigger/await `upgrade_to_smb_volume_inner`, then re-check. Every refusal is a TYPED `SmbIndexGateReason` (`NotRegistered` / `NotAnSmbVolume` / `UpgradeFailed` / `CredentialsNeeded` / `Disconnected`) that crosses IPC as a snake_case tag — never a message substring (`.claude/rules/no-string-matching.md`). FDA-independent: SMB paths aren't TCC-protected, so `start_indexing_for_smb` never routes through `should_auto_start_indexing` (rabbit hole #12).
@@ -499,16 +503,25 @@ The clamps differ by error band (constants live in `eta.ts`): tier 1 caps at 0.9
499503

500504
## Log levels
501505

502-
Two high-volume per-event lines are at TRACE (off by default; file chain captures
503-
Debug+ only): `Writer: UpsertEntryV2 inserted "X"` and
504-
`Reconciler: removal for unknown path, skipping: X`. Together they were ~90% of normal
505-
log volume; the writer line is fully redundant with the existing `Writer: +N msgs (...)`
506-
aggregate, and the reconciler skip is mostly harmless build-output churn.
507-
508-
The reconciler emits a DEBUG aggregate every ~5s: `Reconciler: skipped N removals
509-
for unknown paths in Xs [total], sample: <path>`. Error reports retain the
510-
existence-of-drift signal. Live in `reconciler.rs::unknown_path_skips`. To get the
511-
per-event detail back: `RUST_LOG=cmdr_lib::indexing::reconciler=trace,cmdr_lib::indexing::writer=trace,debug`.
506+
High-volume per-event lines are at TRACE (off by default; file chain captures Debug+
507+
only) so the always-Debug file target doesn't drown in plumbing: `Writer: UpsertEntryV2
508+
inserted "X"`, the two reconciler skip lines (`removal for unknown path` and `parent path
509+
not in DB, skipping event`), and `SmbVolume::list_directory: …` (per listing, fires for
510+
both the live pane and the index scan). On a stuck NAS first-scan these dominated the log
511+
— a single rotated 50 MB file held only ~25 min, so a user's error report lost the
512+
relevant window. The writer line is redundant with the `Writer: +N msgs (...)` aggregate;
513+
the reconciler skips are mostly harmless build-output churn; `list_directory` is replaced
514+
as the scan signal by the heartbeat below.
515+
516+
The reconciler emits a DEBUG aggregate every ~5s per skip class via
517+
`reconciler.rs::skip_aggregator` (two instances, `UNKNOWN_PATH_SKIPS` and
518+
`STALE_PARENT_SKIPS`): `Reconciler: skipped N <removals|events> <reason> in Xs [total],
519+
sample: <path>`. The `volume_scanner` emits a throttled (~1/s) DEBUG heartbeat
520+
`volume_scanner: scanning|reconciling… N dirs, M entries, current: <path>`, so an error
521+
report still shows the scan is alive, where it is, and how far along — the signal a
522+
triager needs without the per-directory `list_directory` flood. Error reports retain the
523+
existence-of-drift and scan-progress signals. To get the per-event detail back:
524+
`RUST_LOG=cmdr_lib::indexing::reconciler=trace,cmdr_lib::indexing::writer=trace,cmdr_lib::file_system::volume::backends::smb=trace,debug`.
512525

513526
The writer's SQLite busy retry logger (`stall_probe::sqlite_busy` in `writer/mod.rs::spawn`) logs
514527
per-attempt at DEBUG: brief contention is routine (WAL checkpoints, long-lived readers). It

apps/desktop/src-tauri/src/indexing/mod.rs

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,7 @@ pub(crate) mod scanner;
3939
#[cfg(any(target_os = "macos", target_os = "linux"))]
4040
mod smb_index;
4141
mod smb_watch;
42+
mod system_dirs;
4243
mod verifier;
4344
pub(crate) mod volume_scanner;
4445
pub(crate) mod watcher;

apps/desktop/src-tauri/src/indexing/reconciler.rs

Lines changed: 78 additions & 45 deletions
Original file line numberDiff line numberDiff line change
@@ -43,15 +43,17 @@ use crate::pluralize::pluralize;
4343
/// events is always safe.
4444
const MAX_BUFFER_CAPACITY: usize = 500_000;
4545

46-
/// Aggregator for "removal for unknown path, skipping" events. The per-event line is
47-
/// at TRACE (off by default; file chain captures Debug+ only); this module emits a
48-
/// single DEBUG summary every ~5 s, so error report bundles still carry the
49-
/// existence-of-drift signal without the per-event noise.
46+
/// Aggregator for high-volume reconciler skip events. Each per-event line is at TRACE
47+
/// (off by default; file chain captures Debug+ only); this aggregator emits a single
48+
/// DEBUG summary every ~5 s, so error report bundles still carry the existence-of-drift
49+
/// signal without the per-event noise. Two instances cover the two skip classes that
50+
/// dominate normal log volume: [`UNKNOWN_PATH_SKIPS`] (removal for a path not in the DB)
51+
/// and [`STALE_PARENT_SKIPS`] (create/modify whose parent dir isn't in the DB).
5052
///
5153
/// Most skips are harmless build-output churn, but a sustained rate (or a sample path
5254
/// in an unexpected tree) can flag real reconciler/index drift. Triagers: if you need
5355
/// the per-event detail, run with `RUST_LOG=cmdr_lib::indexing::reconciler=trace,debug`.
54-
mod unknown_path_skips {
56+
mod skip_aggregator {
5557
use std::sync::Mutex;
5658
use std::time::Instant;
5759

@@ -69,47 +71,74 @@ mod unknown_path_skips {
6971
sample: Option<String>,
7072
}
7173

72-
static STATE: Mutex<Option<State>> = Mutex::new(None);
74+
/// One skip category: its own rolling state plus the words for the summary line
75+
/// (`"skipped {unit}s {reason} in …"`).
76+
pub(super) struct SkipAggregator {
77+
state: Mutex<Option<State>>,
78+
/// Counted noun, e.g. `"removal"` → "skipped 3 removals".
79+
unit: &'static str,
80+
/// Reason phrase, e.g. `"for unknown paths"`.
81+
reason: &'static str,
82+
}
7383

74-
/// Increment the counter and emit a summary if the flush interval has elapsed.
75-
/// Called on every skipped removal (cheap: one mutex acquisition, one branch).
76-
pub(super) fn record(path: &str) {
77-
let mut guard = match STATE.lock() {
78-
Ok(g) => g,
79-
Err(p) => p.into_inner(),
80-
};
81-
let state = guard.get_or_insert_with(|| State {
82-
count_since_last_flush: 0,
83-
total: 0,
84-
last_flush: Instant::now(),
85-
sample: None,
86-
});
87-
state.count_since_last_flush += 1;
88-
state.total += 1;
89-
if state.sample.is_none() {
90-
let s = if path.len() > SAMPLE_LEN {
91-
format!("{}…", &path[..SAMPLE_LEN])
92-
} else {
93-
path.to_string()
94-
};
95-
state.sample = Some(s);
84+
impl SkipAggregator {
85+
const fn new(unit: &'static str, reason: &'static str) -> Self {
86+
Self {
87+
state: Mutex::new(None),
88+
unit,
89+
reason,
90+
}
9691
}
97-
if state.last_flush.elapsed().as_secs() >= FLUSH_INTERVAL_SECS {
98-
let count = state.count_since_last_flush;
99-
let total = state.total;
100-
let sample = state.sample.clone().unwrap_or_default();
101-
let secs = state.last_flush.elapsed().as_secs_f64();
102-
state.count_since_last_flush = 0;
103-
state.last_flush = Instant::now();
104-
state.sample = None;
105-
// Drop the lock before logging so the message format won't reenter under it.
106-
drop(guard);
107-
log::debug!(
108-
"Reconciler: skipped {} for unknown paths in {secs:.1}s [{total} total], sample: {sample}",
109-
pluralize(count, "removal")
110-
);
92+
93+
/// Increment the counter and emit a summary if the flush interval has elapsed.
94+
/// Called on every skip (cheap: one mutex acquisition, one branch).
95+
pub(super) fn record(&self, path: &str) {
96+
let mut guard = match self.state.lock() {
97+
Ok(g) => g,
98+
Err(p) => p.into_inner(),
99+
};
100+
let state = guard.get_or_insert_with(|| State {
101+
count_since_last_flush: 0,
102+
total: 0,
103+
last_flush: Instant::now(),
104+
sample: None,
105+
});
106+
state.count_since_last_flush += 1;
107+
state.total += 1;
108+
if state.sample.is_none() {
109+
// Truncate on a CHAR boundary, not a byte index: NAS paths carry accented
110+
// names (e.g. "Külkeres síelés"), and `&path[..SAMPLE_LEN]` would panic
111+
// when byte SAMPLE_LEN lands mid-codepoint.
112+
let s = if path.chars().count() > SAMPLE_LEN {
113+
format!("{}…", path.chars().take(SAMPLE_LEN).collect::<String>())
114+
} else {
115+
path.to_string()
116+
};
117+
state.sample = Some(s);
118+
}
119+
if state.last_flush.elapsed().as_secs() >= FLUSH_INTERVAL_SECS {
120+
let count = state.count_since_last_flush;
121+
let total = state.total;
122+
let sample = state.sample.clone().unwrap_or_default();
123+
let secs = state.last_flush.elapsed().as_secs_f64();
124+
state.count_since_last_flush = 0;
125+
state.last_flush = Instant::now();
126+
state.sample = None;
127+
let (unit, reason) = (self.unit, self.reason);
128+
// Drop the lock before logging so the message format won't reenter under it.
129+
drop(guard);
130+
log::debug!(
131+
"Reconciler: skipped {} {reason} in {secs:.1}s [{total} total], sample: {sample}",
132+
pluralize(count, unit)
133+
);
134+
}
111135
}
112136
}
137+
138+
/// Removals for a path that isn't in the DB (mostly harmless build-output churn).
139+
pub(super) static UNKNOWN_PATH_SKIPS: SkipAggregator = SkipAggregator::new("removal", "for unknown paths");
140+
/// Create/modify events whose parent dir isn't in the DB (stale intermediate dir).
141+
pub(super) static STALE_PARENT_SKIPS: SkipAggregator = SkipAggregator::new("event", "for missing parents");
113142
}
114143

115144
/// Buffers FSEvents during the initial scan and replays them after the scan completes.
@@ -921,7 +950,7 @@ fn handle_removal(
921950
// genuinely harmless. The aggregate at DEBUG (below) gives the existence-of-
922951
// drift signal without flooding the file.
923952
log::trace!("Reconciler: removal for unknown path, skipping: {normalized}");
924-
unknown_path_skips::record(normalized);
953+
skip_aggregator::UNKNOWN_PATH_SKIPS.record(normalized);
925954
return Some(affected);
926955
}
927956
Err(e) => {
@@ -985,8 +1014,12 @@ fn handle_creation_or_modification(
9851014
let parent_id = match store::resolve_path(conn, parent_path) {
9861015
Ok(Some(id)) => id,
9871016
Ok(None) => {
988-
// Parent not in DB -- stale event (intermediate directory missing), skip
989-
log::debug!("Reconciler: parent path not in DB, skipping event for {normalized} (parent: {parent_path})");
1017+
// Parent not in DB -- stale event (intermediate directory missing), skip.
1018+
// Per-event at TRACE; a DEBUG aggregate every ~5 s keeps the drift signal in
1019+
// error reports without the per-event flood (this was ~13% of normal log
1020+
// volume, almost all harmless build-output churn).
1021+
log::trace!("Reconciler: parent path not in DB, skipping event for {normalized} (parent: {parent_path})");
1022+
skip_aggregator::STALE_PARENT_SKIPS.record(normalized);
9901023
return Some(affected.clone());
9911024
}
9921025
Err(e) => {
Lines changed: 105 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,105 @@
1+
//! NAS system/snapshot directories the recursive size scan must not descend into.
2+
//!
3+
//! Synology, QNAP, NetApp, and Windows SMB shares expose reserved pseudo-directories
4+
//! that are catastrophic to recursively size:
5+
//!
6+
//! - **Snapshot trees** (`@Recently-Snapshot`, `#snapshot`, `.snapshot`) hold full
7+
//! point-in-time copies of the whole share. Their bytes are hardlinked/deduped, so
8+
//! summing them is both ruinously expensive (the scanner re-walks the entire
9+
//! filesystem once per snapshot) AND wrong (the total isn't real consumed space).
10+
//! One real report: a NAS first-scan stalled near 50% grinding through
11+
//! `@Recently-Snapshot`, which alone reported 44 TB on a 10 TB volume.
12+
//! - **Thumbnail/metadata sidecars** (`@eaDir`, `.@__thumb`) live inside *every* media
13+
//! folder, so a position-based ("only at share root") skip would miss them — they
14+
//! have to be matched at any depth.
15+
//! - **Recycle bins** (`@Recycle`, `#recycle`, `$RECYCLE.BIN`) and other system dirs
16+
//! are large and never what a size roll-up wants.
17+
//!
18+
//! These names are reserved vendor conventions (the `@` / `#` / `$` prefixes and
19+
//! `System Volume Information` don't collide with real user folders), so a name match
20+
//! is safe. We only SKIP RECURSION: the directory's own row is still indexed and stays
21+
//! listed and navigable (a user can walk into `@Recycle` to restore a file); we just
22+
//! don't auto-walk its subtree to compute a recursive size. Its size shows as unknown
23+
//! (`—`/`≥`), the honest state, rather than `0 B`.
24+
//!
25+
//! Scope: applied by the `Volume`-trait network scanner (`volume_scanner.rs`) only,
26+
//! which walks SMB/MTP shares — the home of these dirs. The local jwalk scanner has its
27+
//! own `should_exclude`. `FileEntry` carries no DOS hidden/system attribute, so matching
28+
//! the canonical names is the available signal; if attributes are plumbed through later,
29+
//! "hidden + system" would generalize this without a hardcoded list.
30+
31+
/// Canonical names of NAS system/snapshot directories, matched case-insensitively
32+
/// (NAS shares are typically case-insensitive). Extend as new vendor conventions
33+
/// surface; keep it to reserved, non-user-collidable names.
34+
const EXCLUDED_DIR_NAMES: &[&str] = &[
35+
// Synology
36+
"@eaDir",
37+
"@Recently-Snapshot",
38+
"@Recycle",
39+
"@sharesnap",
40+
"@sharebin",
41+
"@tmp",
42+
// QNAP
43+
"#recycle",
44+
"#snapshot",
45+
".@__thumb",
46+
// NetApp / generic
47+
".snapshot",
48+
".snapshots",
49+
// Windows / SMB
50+
"$RECYCLE.BIN",
51+
"System Volume Information",
52+
];
53+
54+
/// Whether the recursive size scan should NOT descend into a directory with this name.
55+
///
56+
/// `name` is a single path component (the directory's own name, not a path). Matched
57+
/// case-insensitively against [`EXCLUDED_DIR_NAMES`].
58+
pub(crate) fn is_recursion_excluded_dir(name: &str) -> bool {
59+
EXCLUDED_DIR_NAMES
60+
.iter()
61+
.any(|excluded| name.eq_ignore_ascii_case(excluded))
62+
}
63+
64+
#[cfg(test)]
65+
mod tests {
66+
use super::*;
67+
68+
#[test]
69+
fn excludes_known_nas_system_dirs() {
70+
for name in [
71+
"@eaDir",
72+
"@Recently-Snapshot",
73+
"@Recycle",
74+
"#recycle",
75+
"#snapshot",
76+
".snapshot",
77+
"$RECYCLE.BIN",
78+
"System Volume Information",
79+
] {
80+
assert!(is_recursion_excluded_dir(name), "{name} should be excluded");
81+
}
82+
}
83+
84+
#[test]
85+
fn matches_case_insensitively() {
86+
assert!(is_recursion_excluded_dir("@eadir"));
87+
assert!(is_recursion_excluded_dir("@RECENTLY-SNAPSHOT"));
88+
assert!(is_recursion_excluded_dir("system volume information"));
89+
}
90+
91+
#[test]
92+
fn keeps_ordinary_dirs() {
93+
for name in [
94+
"photos",
95+
"Dori-Dropbox",
96+
"videos",
97+
"eaDir",
98+
"recycle",
99+
"snapshot",
100+
"@myfiles",
101+
] {
102+
assert!(!is_recursion_excluded_dir(name), "{name} should NOT be excluded");
103+
}
104+
}
105+
}

0 commit comments

Comments
 (0)