
Commit a003f00

Drive indexing: scan SMB/MTP volumes with bounded concurrency (~N× faster walks)
The Volume-trait scanner listed directories strictly one at a time: pop a dir, await its full open+query+close round trips, then the next. Directory listing is latency-bound (each dir is a few LAN round trips over an otherwise-idle link), so a real NAS first-scan crawled at ~28 dirs/sec (575k entries in 17 min and still going) purely from serialization, not the NAS or the link.

Both the fresh scan (`scan_volume_via_trait`) and the reconcile walk (`reconcile_volume_via_trait`) now keep up to `SCAN_CONCURRENCY` (32) `list_directory` round trips in flight via a `FuturesUnordered` pump. SMB2 multiplexes many in-flight requests over one session (credits + per-message IDs; `SmbVolume` already supports concurrent use), so overlapping the idle-link latency is a near-linear speedup until credits saturate: minutes-long scans drop to seconds.

Only the network I/O overlaps; result processing stays serial on the walk task, so the data-integrity guarantees hold unchanged:

- `ScanContext` id allocation (fresh) and the DB read connection + diff (reconcile) stay single-owner, with no locking, and the "a dir's id is registered before its children are listed" invariant still holds (a child is enqueued only after its parent's result is processed).
- Cancel drops the in-flight set (smb2/MTP tolerate a dropped request waiter); a typed terminal disconnect stops topping up and runs the partial-preserving finish; the consecutive-failure backstop still trips on a real disconnect (failures pile up with no successes to reset the counter), now spanning up to `SCAN_CONCURRENCY` in-flight failures.
- The reconcile path resolves new-dir ids at a WAVE boundary (queue AND in-flight both drained) instead of per BFS level.

Tests: new `walk_lists_directories_concurrently` proves max-in-flight > 1, capped at `SCAN_CONCURRENCY` (a serial revert would record 1). The disconnect/backstop tests now assert a bounded stop (no full-queue churn) instead of an exact serial call count; the reconcile-correctness suite still proves the concurrent reconcile yields an index identical to a from-scratch scan. Full check green.
1 parent 6c33dfb commit a003f00

2 files changed

Lines changed: 245 additions & 89 deletions


apps/desktop/src-tauri/src/indexing/DETAILS.md

Lines changed: 4 additions & 0 deletions
@@ -149,6 +149,10 @@ Three disciplines for network round trips (plan rabbit hole #3), all in `list_on

A sub-directory that fails to list (permission, transient) is skipped and the walk continues (like jwalk skipping errored entries); failing to list the ROOT is fatal (nothing to index) so the caller discards.

#### Bounded-concurrency walk (`SCAN_CONCURRENCY`)
Both the fresh scan and the reconcile walk keep up to `SCAN_CONCURRENCY` (32) `list_directory` round trips in flight at once via a `FuturesUnordered` pump, instead of listing one directory at a time. Directory listing is latency-bound: each dir costs an open+query+close sequence of round trips over an otherwise-idle link, so overlapping them is a near-linear speedup until the server's SMB credits saturate (one real first scan ran at only ~28 dirs/s when serial, with the link idle between round trips). **Only the network I/O overlaps**: results are processed serially on the walk task, so `ScanContext` id allocation (fresh) and the DB read connection + diff (reconcile) stay single-owner with no locking, and the "a dir's id is registered before its children are listed" invariant still holds, because a child is enqueued only after its parent's result is processed. **Decision/Why concurrency is safe for the data-integrity guarantees:** cancel drops the in-flight set (the smb2/MTP backends tolerate a dropped request waiter); a typed terminal disconnect stops topping up and runs the partial-preserving finish; the consecutive-failure backstop still trips on a real disconnect (failures pile up with no successes to reset the counter), though "consecutive" now spans up to `SCAN_CONCURRENCY` in-flight failures. The reconcile path's new-dir id resolution flushes at a WAVE boundary (queue AND in-flight both drained) rather than per BFS level. Pinned by `walk_lists_directories_concurrently` (proves max-in-flight > 1, capped at `SCAN_CONCURRENCY`) plus the disconnect/backstop tests (bounded stop, no full-queue churn) and the reconcile-correctness suite (identical index vs a from-scratch scan).
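The shape of the pump can be illustrated with a minimal, std-only sketch. It is not the project's code: scoped threads plus an `mpsc` channel stand in for the `FuturesUnordered` pump so the example needs no external crates, `SCAN_CONCURRENCY` is lowered to 4, and `Tree`, `sample_tree`, `walk`, `WalkResult`, and this toy `list_directory` are hypothetical names invented for the illustration. What it tries to show is the split described above: listings overlap, while results are consumed one at a time by a single owner that allocates ids and enqueues children.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Illustrative cap; the real constant is 32.
pub const SCAN_CONCURRENCY: usize = 4;

/// Hypothetical in-memory "volume": directory path -> child directory paths.
pub type Tree = HashMap<&'static str, Vec<&'static str>>;

pub struct WalkResult {
    pub ids: HashMap<&'static str, u32>,
    pub max_in_flight: usize,
}

/// Stand-in for a network listing: sleeps to simulate round-trip latency.
fn list_directory(tree: &Tree, dir: &str) -> Vec<&'static str> {
    thread::sleep(Duration::from_millis(5));
    tree.get(dir).cloned().unwrap_or_default()
}

pub fn walk(tree: &Tree, root: &'static str) -> WalkResult {
    let (tx, rx) = mpsc::channel::<(&'static str, Vec<&'static str>)>();
    let mut queue = VecDeque::from([root]);
    // The root's id is registered before the root is listed.
    let mut ids = HashMap::from([(root, 0u32)]);
    let mut next_id = 1u32;
    let mut in_flight = 0usize;
    let mut max_in_flight = 0usize;

    thread::scope(|s| loop {
        // Top up: start listings until the cap is reached or the queue is empty.
        while in_flight < SCAN_CONCURRENCY {
            let Some(dir) = queue.pop_front() else { break };
            let tx = tx.clone();
            s.spawn(move || {
                let _ = tx.send((dir, list_directory(tree, dir)));
            });
            in_flight += 1;
        }
        max_in_flight = max_in_flight.max(in_flight);
        if in_flight == 0 {
            break; // queue and in-flight set are both drained
        }
        // Serial section: exactly one result is processed at a time, on this thread.
        let (parent, children) = rx.recv().expect("a sender is still alive");
        in_flight -= 1;
        assert!(ids.contains_key(parent), "parent id must exist before its children");
        for child in children {
            ids.insert(child, next_id); // single-owner id allocation, no locking
            next_id += 1;
            queue.push_back(child); // enqueued only after the parent was processed
        }
    });

    WalkResult { ids, max_in_flight }
}

/// Small fixture: a root with six children, one of which has two children.
pub fn sample_tree() -> Tree {
    HashMap::from([
        ("/", vec!["/a", "/b", "/c", "/d", "/e", "/f"]),
        ("/a", vec!["/a/x", "/a/y"]),
    ])
}
```

Calling `walk(&sample_tree(), "/")` indexes all nine directories, and the recorded `max_in_flight` sits above 1 and at or below the cap, which is the same property the `walk_lists_directories_concurrently` test is described as pinning. The exit condition (nothing queued and nothing in flight) is also the "wave boundary" the reconcile path waits for.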
#### NAS snapshot/system dirs aren't recursed (`system_dirs.rs`)

The BFS does NOT descend into NAS snapshot/system pseudo-directories (`@eaDir`, `@Recently-Snapshot`, `@Recycle`, `#recycle`, `#snapshot`, `.snapshot`, `$RECYCLE.BIN`, `System Volume Information`, …; matched case-insensitively by `system_dirs::is_recursion_excluded_dir`). Both the fresh scan and the reconcile walk apply it: the dir's own row is still indexed (so it stays listed and navigable — a user can walk into `@Recycle` to restore a file), but its subtree is never walked, so it rolls up as honestly-unknown (`—`/`≥`) rather than a misleading total. **Decision/Why:** these dirs are hardlinked, huge, and re-walking them costs a full filesystem traversal *per snapshot* over serialized SMB — a real first-scan stalled near 50% grinding `@Recently-Snapshot`, which alone reported 44 TB on a 10 TB volume. Summing them is both ruinous and wrong (the bytes are deduped, not real consumed space). The names are reserved vendor conventions (`@`/`#`/`$` prefixes) that don't collide with user folders, so a name match is safe. **Guardrail:** don't remove the exclusion to "fill in" the missing sizes — that re-triggers the stall. Scope is the network scanner only (the home of these dirs); the local jwalk scanner has its own `should_exclude`. `FileEntry` carries no DOS hidden/system attribute today; if one is plumbed through, "hidden + system" would generalize this without the hardcoded list.
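As a rough illustration of the name match described above, the sketch below shows one way such a check could look. Only the function name `is_recursion_excluded_dir` comes from the text; the body, the `EXCLUDED_DIR_NAMES` constant, the `WalkAction` enum, and `plan_for_dir` are hypothetical. The list holds only the names quoted above (the real list is longer), and ASCII case-insensitive comparison is an assumption about what "case-insensitively" means here.

```rust
/// Names quoted in the text above; the real list is longer (elided there with "…").
const EXCLUDED_DIR_NAMES: &[&str] = &[
    "@eaDir",
    "@Recently-Snapshot",
    "@Recycle",
    "#recycle",
    "#snapshot",
    ".snapshot",
    "$RECYCLE.BIN",
    "System Volume Information",
];

/// Hypothetical body: true when `name` matches a reserved name, ignoring ASCII case.
pub fn is_recursion_excluded_dir(name: &str) -> bool {
    EXCLUDED_DIR_NAMES
        .iter()
        .any(|excluded| excluded.eq_ignore_ascii_case(name))
}

/// Hypothetical decision type: every directory gets a row, but not every subtree is walked.
#[derive(Debug, PartialEq, Eq)]
pub enum WalkAction {
    IndexAndRecurse,
    IndexOnly,
}

/// The directory's own row is always indexed; only the descent is skipped.
pub fn plan_for_dir(name: &str) -> WalkAction {
    if is_recursion_excluded_dir(name) {
        WalkAction::IndexOnly
    } else {
        WalkAction::IndexAndRecurse
    }
}
```

Keeping "index the row" separate from "recurse into it" mirrors the behavior described above: `@Recycle` stays listed and navigable while its subtree is never walked.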
