fix(src): hoist web-ingest regexes and document unsafe mach FFI#65
Merged
Conversation
Two small correctness improvements prompted by the Rust audit. The
audit's broader claim — that production llm.rs, kb/ingest/youtube.rs,
and parts of kb/ingest/web.rs panicked on failure — turned out to be
wrong on closer reading: those .expect()/.unwrap() sites all live
inside #[cfg(test)] modules, which is the idiomatic Rust way to fail
a test. What did need work was narrower:
kb/ingest/web.rs
Eight regex_lite::Regex::new(...).unwrap() calls in production HTML
scrubbing helpers (extract_headings, html_to_text) recompiled the
same static pattern on every invocation and panicked with the
generic "called Option::unwrap on a None value" if a future edit
introduced an invalid literal. Hoists all eight to module-level
once_cell::sync::Lazy<Regex> constants — HEADING_RE, SCRIPT_RE,
STYLE_RE, COMMENT_RE, BLOCK_ELEMENT_RE, HTML_TAG_RE, WHITESPACE_RE,
NEWLINE_COLLAPSE_RE — each initialized with .expect("<name> regex
must compile"). Compilation happens once at first use and never
recompiles; the panic message now names which literal is broken.
once_cell is already a dep used across the crate.
diagnostics.rs
The macOS-only get_process_memory_bytes() contains an unsafe block
that calls mach_task_self() and task_info(). The block was
undocumented. Adds a SAFETY comment explaining the Mach FFI
contract: task-self port validity, flavor/out-struct match, zeroed
MaybeUninit alignment/bit-pattern validity, and the assume_init
ordering on the success path.
No behavior change. cargo check --all-targets clean; cargo test --lib
passes 311/312 (one pre-existing #[ignore]'d model-download test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
.unwrap()calls insrc-tauri/src/kb/ingest/web.rsto module-levelonce_cell::sync::Lazy<Regex>constants with descriptive.expect("<name> regex must compile")messages. Compiles once at first use instead of on every call; any future invalid-literal edit now names the broken pattern.// SAFETY:comment before the Mach FFIunsafeblock atsrc-tauri/src/diagnostics.rs:653documenting themach_task_self/task_infocontract,MaybeUninit::zeroedalignment/bit-pattern validity, and theassume_initordering guarantee.Audit note: the original audit also flagged
.expect()/.unwrap()sites inllm.rs,kb/ingest/youtube.rs, and the SSRF resolver inweb.rs. On inspection every one of those is inside#[cfg(test)]— panicking on setup failure is idiomatic unit-test behavior. No production panic surface in those files.Test plan
cargo check --all-targetscargo test --lib— 311/312 pass (1 pre-existing#[ignore]'d model-download test)🤖 Generated with Claude Code