fix(src): hoist web-ingest regexes and document unsafe mach FFI #65

Merged
saagpatel merged 1 commit into master from codex/fix/rust-panic-cleanup on Apr 22, 2026

@saagpatel
Owner

Summary

  • Hoisted 8 production HTML-scrubbing regex .unwrap() calls in src-tauri/src/kb/ingest/web.rs to module-level once_cell::sync::Lazy<Regex> constants with descriptive .expect("<name> regex must compile") messages. Compiles once at first use instead of on every call; any future invalid-literal edit now names the broken pattern.
  • Added a multi-paragraph // SAFETY: comment before the Mach FFI unsafe block at src-tauri/src/diagnostics.rs:653 documenting the mach_task_self/task_info contract, MaybeUninit::zeroed alignment/bit-pattern validity, and the assume_init ordering guarantee.

Audit note: the original audit also flagged .expect()/.unwrap() sites in llm.rs, kb/ingest/youtube.rs, and the SSRF resolver in web.rs. On inspection every one of those is inside #[cfg(test)] — panicking on setup failure is idiomatic unit-test behavior. No production panic surface in those files.

Test plan

  • cargo check --all-targets
  • cargo test --lib — 311/312 pass (1 pre-existing #[ignore]'d model-download test)
  • 7 web-ingest regex tests still pass after Lazy hoist

🤖 Generated with Claude Code

Two small correctness improvements prompted by the Rust audit. The
audit's broader claim — that production llm.rs, kb/ingest/youtube.rs,
and parts of kb/ingest/web.rs panicked on failure — turned out to be
wrong on closer reading: those .expect()/.unwrap() sites all live
inside #[cfg(test)] modules, which is the idiomatic Rust way to fail
a test. What did need work was narrower:

kb/ingest/web.rs
  Eight regex_lite::Regex::new(...).unwrap() calls in production HTML
  scrubbing helpers (extract_headings, html_to_text) recompiled the
  same static pattern on every invocation and panicked with the
  generic "called `Result::unwrap()` on an `Err` value" message if a
  future edit introduced an invalid literal. Hoists all eight to module-level
  once_cell::sync::Lazy<Regex> constants — HEADING_RE, SCRIPT_RE,
  STYLE_RE, COMMENT_RE, BLOCK_ELEMENT_RE, HTML_TAG_RE, WHITESPACE_RE,
  NEWLINE_COLLAPSE_RE — each initialized with .expect("<name> regex
  must compile"). Compilation happens once at first use and never
  recompiles; the panic message now names which literal is broken.
  once_cell is already a dep used across the crate.
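
A minimal sketch of the hoisting pattern described above, not the code from this PR. To stay free of external crates it makes two substitutions, both assumptions: `std::sync::LazyLock` (Rust 1.80+) stands in for `once_cell::sync::Lazy`, and a hypothetical `compile_pattern` function stands in for `regex_lite::Regex::new`. The names `SCRIPT_TAG`, `has_script`, and `compile_count` are illustrative only.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how many times the stand-in "compile" step has run.
static COMPILE_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for `regex_lite::Regex::new`: validates a pattern
/// literal and returns its "compiled" (here: lowercased) form, or an error.
fn compile_pattern(pattern: &str) -> Result<String, String> {
    COMPILE_COUNT.fetch_add(1, Ordering::SeqCst);
    if pattern.is_empty() {
        return Err("empty pattern".to_string());
    }
    Ok(pattern.to_lowercase())
}

/// Module-level lazy static: the closure runs on first dereference only.
/// If the literal is ever edited into something invalid, the panic message
/// names this specific pattern instead of a generic unwrap failure.
static SCRIPT_TAG: LazyLock<String> =
    LazyLock::new(|| compile_pattern("<SCRIPT").expect("SCRIPT_TAG pattern must compile"));

/// Hypothetical helper that reuses the shared pattern on every call.
fn has_script(html: &str) -> bool {
    html.to_lowercase().contains(SCRIPT_TAG.as_str())
}

/// Exposes the counter so callers can observe the one-time initialization.
fn compile_count() -> usize {
    COMPILE_COUNT.load(Ordering::SeqCst)
}
```

Calling `has_script` repeatedly leaves `compile_count()` at 1, which is the property the hoist is after. With `once_cell` and `regex_lite`, the same shape would presumably be a `static` of type `Lazy<Regex>` initialized from `Regex::new(...).expect(...)`.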

diagnostics.rs
  The macOS-only get_process_memory_bytes() contains an unsafe block
  that calls mach_task_self() and task_info(). The block was
  undocumented. Adds a SAFETY comment explaining the Mach FFI
  contract: task-self port validity, flavor/out-struct match, zeroed
  MaybeUninit alignment/bit-pattern validity, and the assume_init
  ordering on the success path.
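
The shape of such a SAFETY comment can be sketched without the real Mach bindings. Everything below is hypothetical: `FakeTaskInfo` and `fake_task_info` are stand-ins with the same out-pointer and in/out count shape as an FFI info call, so that the sketch compiles on any platform. It is not the code in diagnostics.rs and does not call `mach_task_self()` or `task_info()`.

```rust
use std::mem::{size_of, MaybeUninit};

/// Hypothetical plain-old-data struct standing in for a Mach info struct.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
struct FakeTaskInfo {
    resident_size: u64,
    virtual_size: u64,
}

const FAKE_SUCCESS: i32 = 0;

/// Hypothetical stand-in for an FFI call that fills an out-struct.
///
/// # Safety
/// `out` must be valid for writes of one `FakeTaskInfo`, and `count` must be
/// valid for reads and writes of one `u32`.
unsafe fn fake_task_info(out: *mut FakeTaskInfo, count: *mut u32) -> i32 {
    let expected = (size_of::<FakeTaskInfo>() / size_of::<u32>()) as u32;
    unsafe {
        if *count < expected {
            return 1; // buffer too small: nothing is written
        }
        out.write(FakeTaskInfo { resident_size: 4096, virtual_size: 8192 });
        *count = expected;
    }
    FAKE_SUCCESS
}

/// Reads the stand-in info struct, returning `None` if the call fails.
fn read_task_info() -> Option<FakeTaskInfo> {
    let mut info = MaybeUninit::<FakeTaskInfo>::zeroed();
    let mut count = (size_of::<FakeTaskInfo>() / size_of::<u32>()) as u32;

    // SAFETY:
    // - `info.as_mut_ptr()` points to aligned, writable storage for exactly
    //   one `FakeTaskInfo`, which is the struct the callee expects.
    // - `count` is a live local holding the struct size in `u32` units.
    let status = unsafe { fake_task_info(info.as_mut_ptr(), &mut count) };
    if status != FAKE_SUCCESS {
        return None;
    }

    // SAFETY: `FakeTaskInfo` contains only integers, so the zeroed bit
    // pattern is already a valid value, and the callee has overwritten it.
    // `assume_init` is reached only after the success check above.
    Some(unsafe { info.assume_init() })
}
```

Each bullet in the comment maps to one obligation of the call, so a later change to the struct or the call has a checklist to recheck.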

No behavior change. cargo check --all-targets clean; cargo test --lib
passes 311/312 (one pre-existing #[ignore]'d model-download test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saagpatel merged commit e3355ef into master on Apr 22, 2026
25 of 26 checks passed
saagpatel deleted the codex/fix/rust-panic-cleanup branch on May 31, 2026 09:17