edgeparse runs everywhere & sandboxes securely — WASM/RISC-V integration test infrastructure
Greg's AI coding buddy here 🖖. I've been living in the terminal, building Dockerfiles, and arguing with RISC-V proxy kernels. Here's what we cooked up.
Why this matters
edgeparse's killer combo (15 MB binary, 0.023 s/doc, zero GPU/JVM dependencies) makes it a natural fit for two underserved use cases, with a cross-architecture testing benefit on top:
🔒 Zero-trust PDF parsing (security sandboxing)
WASM runtimes (wasmtime, wasmer, wasmedge) and RISC-V sandboxes (libriscv, CKB-VM) enforce deny-by-default isolation: no filesystem access beyond the directories the host explicitly grants, no network access unless the host enables it, and no shared memory with the host. Running edgeparse inside a WASM or RISC-V sandbox means you can parse hostile/untrusted PDFs without risking the host process.
This matters for:
- Email security gateways processing unknown attachments
- Document ingestion pipelines handling user-uploaded PDFs
- Any environment where a malformed PDF could be an attack vector
The 3.1 MB .wasm binary (or 4.0 MB RISC-V ELF) runs the full 20-stage pipeline — identical output, sandboxed execution.
Why containers alone aren't enough for Agentic AI pipelines: Many teams containerize their document processing pipelines and consider the security problem solved. But a container protects the host — not the workflow inside it. If your agentic pipeline parses untrusted PDFs, generates new documents, and feeds LLM outputs all within the same container, a malicious PDF doesn't need to escape — it just needs to corrupt the data flowing through the pipeline. Process-level sandboxing via WASM/RISC-V VMs provides defense in depth: each PDF parse runs in its own isolated memory space, so even a successful exploit can't touch other documents, access the network, or tamper with the pipeline logic.
🌍 Universal plugin embedding
WASM and RISC-V are the emerging universal plugin/sandbox formats. Any software that uses wasmtime/wasmer/wasmedge (WASM) or libriscv/CKB-VM (RISC-V) as a plugin runtime can embed edgeparse as a PDF extraction capability — no separate process, no IPC, no language bindings needed.
- Serverless functions (Fermyon Spin, Cloudflare Workers, Fastly Compute)
- Desktop apps with WASM plugin architectures
- Game engines and creative tools using RISC-V sandboxes (libriscv in Godot, RVScript)
- Blockchain smart contracts with PDF processing (CKB-VM on Nervos)
🐛 Cross-architecture bug detection
Different targets expose different classes of bugs:
- WASM's bounds-checked linear memory traps on accesses outside the module's memory, which a native x86 build may silently tolerate
- RISC-V's alignment rules can surface misaligned reads that x86's forgiving memory subsystem hides
- Running the same tests on 6+ runtimes with different JIT compilers and interpreters works as cheap differential testing
More targets = more confidence in correctness.
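One way to turn the cross-runtime idea into a check is to compare the output each runtime produces for the same input. The sketch below is only an illustration: outputs_agree is a hypothetical helper name, not part of the PR's scripts, and it assumes every runtime has already written its result to a separate file.

```shell
# Hypothetical helper (not part of the PR): succeed only if every candidate
# file is byte-identical to the reference file.
# Usage: outputs_agree reference.json other1.json [other2.json ...]
outputs_agree() {
    ref="$1"
    shift
    for candidate in "$@"; do
        # cmp -s prints nothing and exits non-zero if the files differ
        if ! cmp -s "$ref" "$candidate"; then
            echo "MISMATCH: $candidate differs from $ref" >&2
            return 1
        fi
    done
    return 0
}
```

A byte-for-byte comparison is the strictest choice; if outputs ever embed timestamps or absolute paths, a normalizing step would be needed first.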
What we built
A complete Docker-based integration test infrastructure that builds edgeparse for WASM (wasm32-wasip1) and RISC-V (riscv64gc), then runs it across multiple runtimes — all containerized for reproducibility.
Test matrix
| Runtime | Type | Tests | Status | Notes |
| --- | --- | --- | --- | --- |
| Wasmtime 43.0 | WASM | 8/8 ✅ | Production | Bytecode Alliance reference |
| Wasmer 7.x | WASM | 8/8 ✅ | Production | WASIX superpowers |
| WasmEdge 0.14.1 | WASM | 8/8 ✅ | Production | CNCF, cloud-native |
| WAMR/iwasm | WASM | 8/8 ✅ | Production | Embedded champion (~100 KB) |
| WASIX (Wasmer) | WASM | 8/8 ✅ | Production | POSIX compat on Wasmer |
| RISC-V QEMU | RISC-V | 8/8 ✅ | Production | User-mode emulation |
| Spike + pk | RISC-V | 1/8 ⚠️ | WIP | Official ISA ref sim, limited pk syscalls |
| libriscv | RISC-V | 2/8 ⚠️ | WIP | Fastest sandbox, glibc TLS issue |
| RVVM | RISC-V | 0/8 ❌ | Incompatible | System emulator only (no userland) |
| CKB-VM | RISC-V | N/A ❌ | Upstream broken | ckb-debugger compile error |
What each test validates
- --help: CLI bootstraps correctly on this runtime
- --version: version string (semver) is readable
- PDF → JSON: full 20-stage pipeline, structured output
- PDF → Markdown: text extraction + formatting
- PDF → Text: plain text + content sanity check (verifies extracted text matches source PDF)
- PDF → HTML: markup generation
- Error handling: non-existent file returns proper error code
Tests 3–6 (the four PDF conversions) parse an actual PDF (tests/fixtures/sample.pdf) through the complete extraction pipeline.
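The checks described above can be sketched as three small shell predicates. This is not the contents of run-tests.sh; the helper names (is_semver, contains_text, expect_failure) are hypothetical, and the sketch assumes the version line contains a plain MAJOR.MINOR.PATCH token.

```shell
# Hypothetical helpers, not taken from run-tests.sh.

# is_semver: succeed if the string contains a MAJOR.MINOR.PATCH token,
# optionally prefixed with "v" and followed by a pre-release/build suffix.
is_semver() {
    printf '%s\n' "$1" |
        grep -Eq '(^|[[:space:]])v?[0-9]+\.[0-9]+\.[0-9]+([-+][0-9A-Za-z.+-]+)?([[:space:]]|$)'
}

# contains_text: succeed if FILE contains PHRASE as a fixed string.
# Usage: contains_text FILE PHRASE
contains_text() {
    grep -Fq -- "$2" "$1"
}

# expect_failure: succeed only when the wrapped command exits non-zero,
# as the error-handling test expects for a non-existent input file.
expect_failure() {
    if "$@" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}
```

Keeping each check as a predicate that returns an exit status means the same helpers work unchanged regardless of which runtime produced the output.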
Implementation
CLI changes (minimal, backward-compatible):
- rayon made optional behind a native feature flag (default: on)
- New convert_file() dispatcher: native → convert(), WASI → convert_bytes()
- Build: cargo build --target wasm32-wasip1 --no-default-features
Test infrastructure:
- tests/wasm-runtimes/wasm-test.sh: management script (build|test|status|run|log|rmi|clean)
- tests/wasm-runtimes/run-tests.sh: test runner (executes inside containers)
- One Dockerfile per runtime, all sharing a common Ubuntu 24.04 base
- Docker images prefixed with edgeparse-* (configurable via EDGEPARSE_PREFIX)
- Makefile targets: make wasi-build, make wasi-test, make wasi-status, make wasi-clean
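To illustrate the naming and subcommand conventions listed above, here is a minimal sketch. It is not the actual wasm-test.sh: the function names are hypothetical, and it assumes the prefix defaults to edgeparse and is joined to the runtime name with a hyphen.

```shell
# Hypothetical sketch, not the real wasm-test.sh.

# image_name: build the Docker image name for a runtime.
# Assumption: default prefix "edgeparse", joined to the runtime with "-".
image_name() {
    printf '%s-%s\n' "${EDGEPARSE_PREFIX:-edgeparse}" "$1"
}

# is_subcommand: accept only the subcommands the management script lists.
is_subcommand() {
    case "$1" in
        build|test|status|run|log|rmi|clean) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, with EDGEPARSE_PREFIX=acme set, image_name wasmer would yield acme-wasmer under these assumptions, which keeps images from several forks apart on one machine.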
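Checklist
- CLI changes (including the convert_file() dispatcher): included in all PRs below
- Makefile targets (wasi-build, wasi-test, wasi-status, wasi-clean): included in all PRs below
- WIP branches for targets that are not green yet: i19-target-spike-not-green-WIP, i19-target-libriscv-not-green-WIP, i19-target-rvvm-not-green-WIP, i19-target-ckb-vm-not-green-WIP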
Foundation for the community
Merging this infrastructure gives every developer interested in sandboxed PDF parsing a working starting point — whether they're embedding edgeparse as a WASM plugin in their app, running it inside a RISC-V sandbox, or just want to validate it on a different architecture. The Dockerfiles, test scripts, and build configurations serve as living documentation and ready-to-fork examples.
Sandboxed CLI usage — parse untrusted PDFs safely from your terminal
Anyone scripting edgeparse (or asking an LLM to use it) can run the sandboxed binary directly from the command line via any WASM/RISC-V runtime. No code changes needed — just use the runtime as an isolation wrapper:
# Parse an untrusted PDF in a wasmtime sandbox; the guest can only see the preopened /data directory
wasmtime run --dir /data edgeparse.wasm -f markdown -o /data/output /data/untrusted.pdf
# Same thing with wasmer (WASIX mode)
wasmer run --volume /data:/data edgeparse.wasm -- -f json -o /data/output /data/untrusted.pdf
# Or run the RISC-V build under QEMU user-mode emulation (guest syscalls are forwarded to the host kernel, so pair it with an outer sandbox for untrusted input)
qemu-riscv64 ./edgeparse-riscv64 -f text -o /data/output /data/untrusted.pdf
The PRs themselves serve as practical examples of how to build and run edgeparse on each target — check the Dockerfiles and run-tests.sh for the exact CLI invocations per runtime.
We're happy to provide additional PRs with examples, documentation, or helper scripts for compiling edgeparse.wasm / edgeparse-riscv64 and running them under any of the tested engines — just let us know what would be most useful.
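As a starting point for such a helper script, the three invocations above can be folded into one function that prints the command line for a chosen runtime. This is a sketch under assumptions: sandbox_cmd is a hypothetical name, the flags are copied from the examples above rather than verified per runtime version, and the function only prints the command instead of running it.

```shell
# Hypothetical helper: print (do not run) the command for one runtime.
# Usage: sandbox_cmd RUNTIME FORMAT DATA_DIR PDF_NAME
# Limitation: arguments are not quoted, so paths with spaces are unsupported.
sandbox_cmd() {
    runtime="$1"; format="$2"; dir="$3"; pdf="$4"
    case "$runtime" in
        wasmtime)
            echo "wasmtime run --dir $dir edgeparse.wasm -f $format -o $dir/output $dir/$pdf"
            ;;
        wasmer)
            echo "wasmer run --volume $dir:$dir edgeparse.wasm -- -f $format -o $dir/output $dir/$pdf"
            ;;
        qemu)
            echo "qemu-riscv64 ./edgeparse-riscv64 -f $format -o $dir/output $dir/$pdf"
            ;;
        *)
            echo "unknown runtime: $runtime" >&2
            return 1
            ;;
    esac
}
```

Printing the command first makes it easy to review exactly which directory is exposed to the guest before anything touches an untrusted file.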
Why this is not theoretical — the CVE landscape
Document processing libraries used in AI pipelines have a track record of critical vulnerabilities:
| CVE | Tool | CVSS | Impact |
| --- | --- | --- | --- |
| CVE-2025-64712 | unstructured.io | 9.8 | Path traversal → arbitrary file write/RCE via attachment filenames. Directly affects LangChain UnstructuredLoader and LlamaIndex UnstructuredReader. |
| CVE-2025-66516 | Apache Tika | 9.8 | XXE injection via PDF/XFA → reads local files, SSRF to cloud metadata endpoints. Affects LangChain4j, ElasticSearch, Solr. |
| CVE-2025-68664 | LangChain Core | 9.3 | Serialization injection in dumps()/dumpd() → secret exfiltration, RCE. 12 distinct vulnerable flows. |
| CVE-2024-4367 | PDF.js | High | FontMatrix code injection → arbitrary JS execution in any PDF.js-based viewer, including Electron-based AI UIs. |
| CVE-2025-1753 | LlamaIndex CLI | Critical | OS command injection via unsanitized --files argument passed to os.system(). |
| CVE-2023-33733 | ReportLab | Critical | eval() sandbox bypass in HTML→PDF conversion → RCE with public exploit. |
Every one of these could trigger inside a "secure" container. WASM/RISC-V sandboxing provides the missing inner isolation layer.
Future vision
Once the foundation is merged, these targets become straightforward additions:
- Raspberry Pi (aarch64): native ARM64 binary, Docker cross-compilation
- s390x: IBM mainframe architecture (QEMU emulation)
- wasm32-wasip2: WASI Preview 2 with Component Model
- Native WASIX build: cargo wasix for full POSIX in WASM (threads, sockets)
- Browser WASM: the existing edgeparse-wasm crate already covers this, but integration tests would unify the matrix
How to try it
# Clone, build, and test everything
git clone https://github.com/VariousForks/edgeparse-by-raphaelmansuy.git
cd edgeparse-by-raphaelmansuy
./tests/wasm-runtimes/wasm-test.sh build all
./tests/wasm-runtimes/wasm-test.sh test all
# Or test a single runtime
./tests/wasm-runtimes/wasm-test.sh test wasmtime
# Interactive debugging inside a runtime container
./tests/wasm-runtimes/wasm-test.sh run wasmer
PRs ready for the green targets. Each PR includes its Dockerfile + test validation.
Greg's AI coding buddy, reporting from the terminal trenches 🖖
🤖 Generated with Claude Code