edgeparse sandboxes securely & runs everywhere — WASM/RISC-V integration test infrastructure #19

@gwpl

edgeparse runs everywhere & sandboxes securely — WASM/RISC-V integration test infrastructure

Greg's AI coding buddy here 🖖 — been living in terminal, compiling Dockerfiles, and arguing with RISC-V proxy kernels. Here's what we cooked up.

Why this matters

edgeparse's killer combo — 15 MB binary, 0.023 s/doc, zero GPU/JVM dependencies — makes it a natural fit for two underserved use cases:

🔒 Zero-trust PDF parsing (security sandboxing)

WASM runtimes (wasmtime, wasmer, wasmedge) and RISC-V sandboxes (libriscv, CKB-VM) enforce deny-by-default isolation: the guest sees only the directories the host explicitly grants, has no network access unless granted, and shares no memory with the host. Running edgeparse inside a WASM or RISC-V sandbox means you can parse hostile/untrusted PDFs without risking the host process.

This matters for:

  • Email security gateways processing unknown attachments
  • Document ingestion pipelines handling user-uploaded PDFs
  • Any environment where a malformed PDF could be a vector

The 3.1 MB .wasm binary (or 4.0 MB RISC-V ELF) runs the full 20-stage pipeline — identical output, sandboxed execution.

Why containers alone aren't enough for Agentic AI pipelines: Many teams containerize their document processing pipelines and consider the security problem solved. But a container protects the host — not the workflow inside it. If your agentic pipeline parses untrusted PDFs, generates new documents, and feeds LLM outputs all within the same container, a malicious PDF doesn't need to escape — it just needs to corrupt the data flowing through the pipeline. Process-level sandboxing via WASM/RISC-V VMs provides defense in depth: each PDF parse runs in its own isolated memory space, so even a successful exploit can't touch other documents, access the network, or tamper with the pipeline logic.

🌍 Universal plugin embedding

WASM and RISC-V are the emerging universal plugin/sandbox formats. Any software that uses wasmtime/wasmer/wasmedge (WASM) or libriscv/CKB-VM (RISC-V) as a plugin runtime can embed edgeparse as a PDF extraction capability — no separate process, no IPC, no language bindings needed.

  • Serverless functions (Fermyon Spin, Cloudflare Workers, Fastly Compute)
  • Desktop apps with WASM plugin architectures
  • Game engines and creative tools using RISC-V sandboxes (libriscv in Godot, RVScript)
  • Blockchain smart contracts with PDF processing (CKB-VM on Nervos)

🐛 Cross-architecture bug detection

Different targets expose different classes of bugs:

  • WASM's bounds-checked linear memory traps any access outside the sandbox's memory, surfacing wild-pointer bugs that a native x86 build can silently tolerate
  • RISC-V's alignment rules can surface misaligned reads hidden by x86's forgiving memory subsystem
  • Running on 6+ runtimes with different JIT compilers and interpreters is essentially free differential testing

More targets = more confidence in correctness.
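
One way to put the multi-runtime idea to work is a differential check: parse the same PDF under several runtimes and confirm the outputs are byte-identical. The sketch below is a hypothetical helper, not part of the PRs; the function name and the idea of collecting one output file per runtime are assumptions for illustration.

```shell
# outputs_agree FILE [FILE...]
# Each FILE is assumed to be the output one runtime produced for the same PDF.
# Returns 0 when every file is byte-identical to the first one.
outputs_agree() {
    reference=$1
    shift
    for candidate in "$@"; do
        # cmp -s is silent and only reports through its exit status
        if ! cmp -s "$reference" "$candidate"; then
            echo "MISMATCH: $candidate differs from $reference" >&2
            return 1
        fi
    done
    echo "all outputs identical"
}
```

A mismatch points at either a runtime-specific bug or nondeterminism in the pipeline, both worth knowing about.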

What we built

A complete Docker-based integration test infrastructure that builds edgeparse for WASM (wasm32-wasip1) and RISC-V (riscv64gc), then runs it across multiple runtimes — all containerized for reproducibility.

Test matrix

| Runtime | Type | Tests | Status | Notes |
|---|---|---|---|---|
| Wasmtime 43.0 | WASM | 8/8 | ✅ Production | Bytecode Alliance reference |
| Wasmer 7.x | WASM | 8/8 | ✅ Production | WASIX superpowers |
| WasmEdge 0.14.1 | WASM | 8/8 | ✅ Production | CNCF, cloud-native |
| WAMR/iwasm | WASM | 8/8 | ✅ Production | Embedded champion (~100 KB) |
| WASIX (Wasmer) | WASM | 8/8 | ✅ Production | POSIX compat on Wasmer |
| RISC-V QEMU | RISC-V | 8/8 | ✅ Production | User-mode emulation |
| Spike + pk | RISC-V | 1/8 | ⚠️ WIP | Official ISA ref sim, limited pk syscalls |
| libriscv | RISC-V | 2/8 | ⚠️ WIP | Fastest sandbox, glibc TLS issue |
| RVVM | RISC-V | 0/8 | ❌ Incompatible | System emulator only (no userland) |
| CKB-VM | RISC-V | N/A | ❌ Upstream broken | ckb-debugger compile error |

What each test validates

  1. --help — CLI bootstraps correctly on this runtime
  2. --version — version string (semver) is readable
  3. PDF → JSON — full 20-stage pipeline, structured output
  4. PDF → Markdown — text extraction + formatting
  5. PDF → Text — plain text + content sanity check (verifies extracted text matches source PDF)
  6. PDF → HTML — markup generation
  7. Error handling — non-existent file returns proper error code

Tests 3–6 parse an actual PDF (tests/fixtures/sample.pdf) through the complete extraction pipeline.
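
The checks above boil down to "run a command, look at the exit status". Here is a minimal sketch of that pattern. It is not the actual run-tests.sh; `run_case`, `PASS`, and `FAIL` are hypothetical names, and the example invocations in the comments reuse the wasmtime command line shown elsewhere in this issue.

```shell
PASS=0
FAIL=0

# run_case NAME WANT COMMAND...
# WANT is "ok" (exit status 0) or "fail" (any non-zero exit status).
run_case() {
    name=$1
    want=$2
    shift 2
    status=0
    # Output is discarded; only the exit status is checked here
    "$@" >/dev/null 2>&1 || status=$?
    if { [ "$want" = ok ] && [ "$status" -eq 0 ]; } ||
       { [ "$want" = fail ] && [ "$status" -ne 0 ]; }; then
        PASS=$((PASS + 1))
        echo "ok   $name"
    else
        FAIL=$((FAIL + 1))
        echo "FAIL $name (exit $status)"
    fi
}

# Example calls (assumed paths, adjust to the runtime under test):
#   run_case "help"         ok   wasmtime run edgeparse.wasm --help
#   run_case "missing file" fail wasmtime run --dir /data edgeparse.wasm /data/nope.pdf
```

A real runner would also inspect the generated files (for instance the content sanity check in test 5), which this sketch leaves out.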

Implementation

CLI changes (minimal, backward-compatible):

  • rayon made optional behind a native feature flag (default: on)
  • New convert_file() dispatcher: native → convert(), WASI → convert_bytes()
  • Build: cargo build --target wasm32-wasip1 --no-default-features
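
For anyone reproducing the build outside Docker, the steps might look roughly like the sketch below. The `run` and `build_target` helpers and the `DRY_RUN` variable are hypothetical, and the full RISC-V target triple is an assumption (the issue only names riscv64gc). `rustup target add` and `cargo build --target` are the standard toolchain commands.

```shell
# Print commands instead of executing them when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# build_target TARGET [EXTRA_CARGO_ARGS...]
# Installs the Rust target, then builds for it.
build_target() {
    target=$1
    shift
    run rustup target add "$target" &&
    run cargo build --target "$target" "$@"
}

# WASI build: disabling default features drops the rayon-backed `native` feature.
#   build_target wasm32-wasip1 --no-default-features
# RISC-V build (assumed triple; cross-linking also needs a RISC-V linker installed):
#   build_target riscv64gc-unknown-linux-gnu
```

Running with `DRY_RUN=1` prints the commands without touching the toolchain, which is handy for checking what a CI job would do.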

Test infrastructure:

  • tests/wasm-runtimes/wasm-test.sh — management script (build|test|status|run|log|rmi|clean)
  • tests/wasm-runtimes/run-tests.sh — test runner (executes inside containers)
  • One Dockerfile per runtime, all sharing a common Ubuntu 24.04 base
  • Docker images prefixed with edgeparse-* (configurable via EDGEPARSE_PREFIX)
  • Makefile targets: make wasi-build, make wasi-test, make wasi-status, make wasi-clean
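
To give a feel for how a management script like this can hang together, here is a rough sketch of the image naming and subcommand dispatch. It is not the real wasm-test.sh: the `Dockerfile.<runtime>` naming, the in-container `./run-tests.sh` path, and the dash between prefix and runtime name are all assumptions, and the function only prints the docker command it would run.

```shell
# Image name = prefix + runtime; prefix defaults to "edgeparse".
image_name() {
    echo "${EDGEPARSE_PREFIX:-edgeparse}-$1"
}

# manage ACTION RUNTIME
# Prints the docker command for a subset of the subcommands.
manage() {
    action=$1
    runtime=$2
    image=$(image_name "$runtime")
    case $action in
        build) echo "docker build -t $image -f Dockerfile.$runtime ." ;;
        test)  echo "docker run --rm $image ./run-tests.sh" ;;
        run)   echo "docker run --rm -it $image sh" ;;
        rmi)   echo "docker rmi $image" ;;
        *)     echo "usage: manage build|test|run|rmi RUNTIME" >&2
               return 2 ;;
    esac
}
```

Setting `EDGEPARSE_PREFIX=acme` before calling `manage build wasmer` would yield an image named `acme-wasmer`, so forks can keep their images separate.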

Checklist

  • Foundation: CLI WASI target support (conditional rayon, convert_file() dispatcher) — included in all PRs below
  • Foundation: Test infrastructure (management script, test runner, base Dockerfile) — included in all PRs below
  • Foundation: Makefile integration (wasi-build, wasi-test, wasi-status, wasi-clean) — included in all PRs below
  • Runtime: Wasmtime (8/8 ✅) → PR #20
  • Runtime: Wasmer (8/8 ✅) → PR #21
  • Runtime: WasmEdge (8/8 ✅) → PR #22
  • Runtime: WAMR/iwasm (8/8 ✅) → PR #23
  • Runtime: WASIX on Wasmer (8/8 ✅) → PR #24
  • Runtime: RISC-V QEMU (8/8 ✅) → PR #25
  • WIP: Spike + pk (needs better syscall proxying) → branch i19-target-spike-not-green-WIP
  • WIP: libriscv (needs musl static linking) → branch i19-target-libriscv-not-green-WIP
  • WIP: RVVM (system emulator, no userland mode) → branch i19-target-rvvm-not-green-WIP
  • WIP: CKB-VM (upstream compile error) → branch i19-target-ckb-vm-not-green-WIP
  • CI/CD: GitHub Actions workflow for WASM runtime tests

Foundation for the community

Merging this infrastructure gives every developer interested in sandboxed PDF parsing a working starting point — whether they're embedding edgeparse as a WASM plugin in their app, running it inside a RISC-V sandbox, or just want to validate it on a different architecture. The Dockerfiles, test scripts, and build configurations serve as living documentation and ready-to-fork examples.

Sandboxed CLI usage — parse untrusted PDFs safely from your terminal

Anyone scripting edgeparse (or asking an LLM to use it) can run the sandboxed binary directly from the command line via any WASM/RISC-V runtime. No code changes needed — just use the runtime as an isolation wrapper:

# Parse an untrusted PDF in a wasmtime sandbox — no filesystem/network escape
wasmtime run --dir /data edgeparse.wasm -f markdown -o /data/output /data/untrusted.pdf

# Same thing with wasmer (WASIX mode)
wasmer run --volume /data:/data edgeparse.wasm -- -f json -o /data/output /data/untrusted.pdf

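# Or run the RISC-V build under QEMU user-mode emulation (syscalls are forwarded to the host, so add a container or seccomp layer for isolation)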
qemu-riscv64 ./edgeparse-riscv64 -f text -o /data/output /data/untrusted.pdf
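
The wasmtime invocation above grants the guest all of /data. A tighter variant, sketched below, copies the untrusted PDF into a fresh scratch directory and exposes only that directory to the guest. `parse_untrusted` and the `WASMTIME` and `EDGEPARSE_WASM` override variables are hypothetical; the edgeparse flags are the ones shown above, and `--dir HOST::GUEST` is wasmtime's host-to-guest directory mapping syntax.

```shell
# parse_untrusted FORMAT PDF
# Prints the host path where the guest wrote its output.
parse_untrusted() {
    format=$1
    pdf=$2
    scratch=$(mktemp -d) || return 1
    cp -- "$pdf" "$scratch/input.pdf" || return 1
    # Only the scratch directory is visible to the guest, mounted as /data
    ${WASMTIME:-wasmtime} run --dir "${scratch}::/data" \
        "${EDGEPARSE_WASM:-edgeparse.wasm}" \
        -f "$format" -o /data/output /data/input.pdf || return 1
    echo "$scratch/output"
}

# Example (assumes wasmtime and edgeparse.wasm are available):
#   parse_untrusted markdown ./untrusted.pdf
```

Because every call gets its own scratch directory, one hostile document cannot read or overwrite files belonging to another parse.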

The PRs themselves serve as practical examples of how to build and run edgeparse on each target — check the Dockerfiles and run-tests.sh for the exact CLI invocations per runtime.

We're happy to provide additional PRs with examples, documentation, or helper scripts for compiling edgeparse.wasm / edgeparse-riscv64 and running them under any of the tested engines — just let us know what would be most useful.

Why this is not theoretical — the CVE landscape

Document processing libraries used in AI pipelines have a track record of critical vulnerabilities:

| CVE | Tool | CVSS | Impact |
|---|---|---|---|
| CVE-2025-64712 | unstructured.io | 9.8 | Path traversal → arbitrary file write/RCE via attachment filenames. Directly affects LangChain UnstructuredLoader and LlamaIndex UnstructuredReader. |
| CVE-2025-66516 | Apache Tika | 9.8 | XXE injection via PDF/XFA → reads local files, SSRF to cloud metadata endpoints. Affects LangChain4j, ElasticSearch, Solr. |
| CVE-2025-68664 | LangChain Core | 9.3 | Serialization injection in dumps()/dumpd() → secret exfiltration, RCE. 12 distinct vulnerable flows. |
| CVE-2024-4367 | PDF.js | High | FontMatrix code injection → arbitrary JS execution in any PDF viewer, including Electron-based AI UIs. |
| CVE-2025-1753 | LlamaIndex CLI | Critical | OS command injection via unsanitized --files argument passed to os.system(). |
| CVE-2023-33733 | ReportLab | Critical | eval() sandbox bypass in HTML→PDF conversion → RCE with public exploit. |

Every one of these could trigger inside a "secure" container. WASM/RISC-V sandboxing provides the missing inner isolation layer.

Future vision

Once the foundation is merged, these targets become straightforward additions:

  • Raspberry Pi (aarch64) — native ARM64 binary, Docker cross-compilation
  • s390x — IBM mainframe architecture (QEMU emulation)
  • wasm32-wasip2 — WASI Preview 2 with Component Model
  • Native WASIX build (cargo wasix) for full POSIX in WASM (threads, sockets)
  • Browser WASM — existing edgeparse-wasm crate already covers this, but integration tests would unify the matrix

How to try it

# Clone, build, and test everything
git clone https://github.com/VariousForks/edgeparse-by-raphaelmansuy.git
cd edgeparse-by-raphaelmansuy
./tests/wasm-runtimes/wasm-test.sh build all
./tests/wasm-runtimes/wasm-test.sh test all

# Or test a single runtime
./tests/wasm-runtimes/wasm-test.sh test wasmtime

# Interactive debugging inside a runtime container
./tests/wasm-runtimes/wasm-test.sh run wasmer

PRs ready for the green targets. Each PR includes its Dockerfile + test validation.

Greg's AI coding buddy, reporting from the terminal trenches 🖖

🤖 Generated with Claude Code
