Fast OOXML extraction without rendering.
oxdoc is a Rust workspace for extracting plain text, CSV, and metadata from Office Open XML containers such as .docx, .xlsx, and .pptx. It is built for automation: shell pipelines, CI jobs, serverless functions, ingestion systems, and embedding through a stable Rust API.
oxdoc is not a document renderer. It does not preserve layout, calculate pagination, generate PDFs, or implement the full OOXML specification. It reads the ZIP-based Office container, targets the XML parts that matter, and emits useful structured output.
oxdoc has a stable 1.x CLI and Rust API contract for:
- DOCX text extraction from the supported document parts.
- DOCX logical text semantics for paragraph breaks, table cell and row separation, tabs, line breaks, and deleted revision text handling.
- PPTX text extraction from slide text boxes and speaker notes.
- XLSX worksheet-to-CSV extraction.
- XLSX sheet listing and sheet selection by name or index, with explicit opt-in for hidden and very hidden sheets.
- Metadata extraction from core, app, and custom document properties.
- A reusable oxdoc-core crate and a CLI-facing oxdoc-cli crate.
- Typed errors plus recoverable warnings.
- Stdin, batch extraction, file output, release binaries, and checksum-verified installs.
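The DOCX logical text semantics listed above can be pictured as flattening a stream of parser events into plain text. The sketch below is only an illustration: the `TextEvent` enum, the `flatten` function, and the separator policy (tabs between cells, newlines after paragraphs and rows, deleted revision text dropped unless requested) are assumptions for this example, not oxdoc-core's actual types or guaranteed behavior.

```rust
/// Hypothetical events a streaming DOCX parser might emit.
/// These names are illustrative and are not part of oxdoc-core.
#[derive(Debug, Clone, PartialEq)]
pub enum TextEvent {
    Text(String),        // contents of a text run
    DeletedText(String), // text inside a tracked deletion
    Tab,
    LineBreak,
    ParagraphEnd,
    CellEnd,
    RowEnd,
}

/// Flatten events into plain text.
/// Assumed policy: deleted text is dropped unless `keep_deleted` is set,
/// cells are separated by tabs, and paragraphs and rows end with a newline.
/// Assumes every table cell contains at least one paragraph.
pub fn flatten(events: &[TextEvent], keep_deleted: bool) -> String {
    let mut out = String::new();
    for event in events {
        match event {
            TextEvent::Text(s) => out.push_str(s),
            TextEvent::DeletedText(s) => {
                if keep_deleted {
                    out.push_str(s);
                }
            }
            TextEvent::Tab => out.push('\t'),
            TextEvent::LineBreak | TextEvent::ParagraphEnd => out.push('\n'),
            TextEvent::CellEnd => {
                // Replace the cell's final paragraph break with a cell separator.
                if out.ends_with('\n') {
                    out.pop();
                }
                out.push('\t');
            }
            TextEvent::RowEnd => {
                // Replace the separator after the last cell with a row break.
                if out.ends_with('\t') {
                    out.pop();
                }
                out.push('\n');
            }
        }
    }
    out
}
```

Under these assumptions, a one-row table with cells "a" and "b" flattens to a single tab-separated line, which keeps the output friendly to line-oriented shell tools.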
The complete documentation site is built with Docsify and lives in docs/. The public site URL is spereyra-dev.github.io/oxdoc.
Serve it locally:
npx docsify-cli@4 serve docs --port 3000
Then open:
http://localhost:3000
The docs GitHub Actions workflow validates that the Docsify site serves correctly, checks internal Markdown links, and verifies the published schema copies stay in sync. Publishing it through GitHub Pages requires enabling Pages for the repository in GitHub settings.
Key documentation pages:
- Getting Started
- Installation
- CLI Reference
- Library API
- Python Integration
- Architecture
- OOXML Model
- Errors and Warnings
- Performance and Memory
- Peak Memory Baselines
- Competitive Workbench
- Testing Strategy
- Roadmap
- Security
Install the latest macOS/Linux release:
curl -fsSL https://raw.githubusercontent.com/spereyra-dev/oxdoc/main/install.sh | sh
Install with Cargo:
cargo install oxdoc-cli
Or build the workspace from source:
cargo build --workspace
Run the CLI through Cargo:
cargo run -p oxdoc-cli -- --help
Install the CLI locally from source:
cargo install --path crates/oxdoc-cli
oxdoc keeps parser performance visible through three reproducible workflows:
| Workflow | Command | What it tells you |
|---|---|---|
| Library throughput | cargo bench -p oxdoc-core | In-process parser throughput for DOCX text and XLSX CSV paths. |
| Peak memory | make memory-baselines | Release CLI peak RSS on synthetic DOCX, PPTX, and XLSX workloads. |
| Competitive workbench | make competitor-workbench | Full CLI wall time and peak RSS against optional local extraction tools. |
cargo bench -p oxdoc-core
make memory-baselines
make competitor-workbench
- Criterion throughput benches cover DOCX text extraction, dense XLSX CSV extraction, and shared-string-heavy XLSX extraction.
- Peak-memory baselines measure release CLI RSS on synthetic DOCX, PPTX, and XLSX workloads.
- The competitive workbench compares oxdoc with optional local tools such as Apache Tika, xlsx2csv, and Mammoth when they are installed.
Comparable tools covered by the workbench:
| Tool | Compared cases | Notes |
|---|---|---|
| Apache Tika | DOCX and PPTX text extraction | Broad document extraction framework; useful as the general-purpose baseline. |
| xlsx2csv | XLSX dense, sparse, and shared-string CSV extraction | Closest direct XLSX-to-CSV comparison. |
| Mammoth | DOCX extraction | DOCX-focused converter; output shape differs from plain-text extraction. |
See Performance and Memory, Peak Memory Baselines, and Competitive Workbench.
For the runtime and usage exit code contract, see docs/cli.md#exit-codes.
oxdoc extract text contrato.docx
For presentations:
oxdoc extract text deck.pptx
Plain text is written to stdout:
Este es el texto plano extraido velozmente...
For integrations:
oxdoc extract text contrato.docx --format json
For streaming batch ingestion:
oxdoc extract text *.docx --format jsonl
For source-aware extraction:
oxdoc extract text contrato.docx --format structured-json
Output shape:
{
"file": "contrato.docx",
"text": "Este es el texto plano extraido velozmente..."
}
oxdoc extract csv data.xlsx --sheet "Ventas Q1" --delimiter ","
Select by visible sheet order instead:
oxdoc extract csv data.xlsx --sheet-index 2
Inventory hidden sheets and explicitly extract by workbook order when needed:
oxdoc extract csv data.xlsx --list-sheets --include-hidden
oxdoc extract csv data.xlsx --sheet-index 1 --include-hidden
Export every visible sheet to separate CSV files:
oxdoc extract csv data.xlsx --all-sheets --output-dir exported-sheets
Preserve worksheet XML values by default, or opt into deterministic Excel-style formatting for dates, percentages, currency, and decimals:
oxdoc extract csv data.xlsx --value-mode formatted
Output:
id,nombre,monto
1,Cliente A,5000
Notes:
- --sheet selects a visible workbook sheet name.
- --sheet-index selects a visible workbook sheet by 1-based workbook order.
- --include-hidden changes sheet listing and selection to include visible, hidden, and veryHidden sheets in workbook order.
- --sheet and --sheet-index are mutually exclusive.
- Hidden and very hidden sheets are skipped unless --include-hidden is present.
- Duplicate sheet names in the selected visibility scope are rejected; use --sheet-index to disambiguate malformed workbooks.
- If no selector is provided, the first visible workbook sheet is used.
- --delimiter must be a single-byte character.
- CSV fields are quoted when needed.
- Sparse cells are padded with empty CSV fields.
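The selection rules in the notes above can be modeled in a few lines. This is a minimal sketch under stated assumptions, not oxdoc's implementation: `Sheet`, `Visibility`, `Selector`, `SelectError`, and `select_sheet` are hypothetical names, mutual exclusivity of the two selectors is modeled by the enum, and duplicate names are rejected only when the requested name is the ambiguous one.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Visibility {
    Visible,
    Hidden,
    VeryHidden,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Sheet {
    pub name: String,
    pub visibility: Visibility,
}

/// One selector at most; an enum makes "name and index together" unrepresentable.
#[derive(Debug, Clone, PartialEq)]
pub enum Selector {
    First,        // no selector given
    Name(String), // select by sheet name
    Index(usize), // select by 1-based position within the scope
}

#[derive(Debug, PartialEq)]
pub enum SelectError {
    NotFound,
    DuplicateName(String),
    IndexOutOfRange(usize),
}

pub fn select_sheet<'a>(
    sheets: &'a [Sheet],
    selector: &Selector,
    include_hidden: bool,
) -> Result<&'a Sheet, SelectError> {
    // Scope: visible sheets only, or every sheet in workbook order.
    let scope: Vec<&Sheet> = sheets
        .iter()
        .filter(|s| include_hidden || s.visibility == Visibility::Visible)
        .collect();
    match selector {
        // With no selector, the first visible sheet is used.
        Selector::First => sheets
            .iter()
            .find(|s| s.visibility == Visibility::Visible)
            .ok_or(SelectError::NotFound),
        Selector::Index(i) => i
            .checked_sub(1) // 1-based to 0-based; index 0 is rejected
            .and_then(|zero_based| scope.get(zero_based).copied())
            .ok_or(SelectError::IndexOutOfRange(*i)),
        Selector::Name(name) => {
            let mut matches = scope.iter().filter(|s| &s.name == name);
            let first = matches.next().copied().ok_or(SelectError::NotFound)?;
            if matches.next().is_some() {
                return Err(SelectError::DuplicateName(name.clone()));
            }
            Ok(first)
        }
    }
}
```

In this model the same index can resolve to different sheets depending on the scope, which is why listing sheets with the same visibility setting before selecting by index is a sensible habit.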
oxdoc info report.docx --format json
Output shape:
{
"file": "report.docx",
"author": "Usuario Falso",
"created_at": "2024-03-12T10:00:00Z",
"application": "LibreOffice",
"has_macros": false,
"word_count": 1542
}
Text output is also available:
oxdoc info report.docx --format text
oxdoc audit report.docx --format json
Audit output reports factual document intake signals such as macros, custom properties, suspicious relationship targets, hidden XLSX sheets, and recoverable parser warnings. It does not render, mutate, or assign risk scores to documents.
.
├── crates
│ ├── oxdoc-core
│ └── oxdoc-cli
├── docs
├── .github
└── tests
oxdoc-core is the reusable library. It owns:
- ZIP-backed OOXML package access.
- Relationship discovery.
- Streaming XML parser state machines.
- Data models such as Extraction<T>, DocumentInfo, XlsxCsvOptions, and XlsxValueMode.
- Error and warning contracts.
oxdoc-cli is the command-line application. It owns:
- Argument parsing with clap.
- Routing commands to oxdoc-core.
- Writing extraction data to stdout.
- Writing recoverable warnings and hard errors to stderr.
- Formatting JSON with serde_json.
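The stdout/stderr split in the list above can be sketched with generic writers, which also makes the routing easy to test. Everything here is hypothetical: `Warning`, `Outcome`, `emit`, the message format, and the exit code value `1` are assumptions for illustration; the real exit code contract is documented in docs/cli.md#exit-codes.

```rust
use std::io::{self, Write};

/// Hypothetical recoverable warning tied to an OOXML part path.
pub struct Warning {
    pub part: String,
    pub message: String,
}

/// Hypothetical successful extraction: data plus any recoverable warnings.
pub struct Outcome {
    pub text: String,
    pub warnings: Vec<Warning>,
}

/// Route data to `out` and diagnostics to `err`, returning a process exit code.
/// Assumption: 0 on success (even with warnings), 1 on a hard error.
pub fn emit<O: Write, E: Write>(
    result: Result<Outcome, String>,
    out: &mut O,
    err: &mut E,
) -> io::Result<i32> {
    match result {
        Ok(outcome) => {
            // Warnings never pollute the data stream.
            for w in &outcome.warnings {
                writeln!(err, "warning: {}: {}", w.part, w.message)?;
            }
            out.write_all(outcome.text.as_bytes())?;
            Ok(0)
        }
        Err(message) => {
            // Hard errors produce no data output at all.
            writeln!(err, "error: {}", message)?;
            Ok(1)
        }
    }
}
```

A real binary would pass `std::io::stdout()` and `std::io::stderr()` as the writers, so that piping stdout into another tool never mixes diagnostics into the extracted data.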
- Tolerant input, strict output.
- Extract useful data; do not render documents.
- Prefer streaming XML parsing over DOM parsing.
- Keep memory bounded for large files.
- Keep warnings recoverable and explicit.
- Keep the library independent from terminal concerns.
- Treat documentation, tests, and fixtures as part of the product.
| Area | Supported now |
|---|---|
| DOCX text | Main document text from <w:t>, paragraph breaks, tabs, and line breaks. |
| PPTX text | Slide text boxes and linked speaker notes in presentation order. |
| XLSX CSV | Workbook relationship lookup, sheet name/index selection, hidden-sheet opt-in, shared strings, inline strings, sparse cells, booleans, errors, cached formula values, CSV escaping. |
| Metadata | Core/app properties plus basic macro detection. |
| Audit | Factual signals for macros, custom properties, suspicious relationships, hidden XLSX sheets, and recoverable parser warnings. |
| Output | Plain text, CSV, JSON metadata, JSON text extraction. |
| Errors | Typed library errors, CLI non-zero hard failures. |
| Warnings | Recoverable parser warnings with OOXML part paths. |
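The XLSX CSV row in the table above mentions sparse cells and CSV escaping. As a rough illustration of what those two behaviors mean, the sketch below pads missing columns with empty fields and quotes fields only when needed. The function names `csv_field` and `csv_row` are hypothetical, the quoting rule follows common CSV practice (RFC 4180 style), and the delimiter is assumed to be a single ASCII byte; none of this is taken from oxdoc-core's source.

```rust
/// Quote a field only when it contains the delimiter, a quote, or a line break.
/// Embedded quotes are doubled, as in RFC 4180.
pub fn csv_field(value: &str, delimiter: u8) -> String {
    let delim = delimiter as char; // assumes an ASCII delimiter
    let needs_quotes = value.contains(delim)
        || value.contains('"')
        || value.contains('\n')
        || value.contains('\r');
    if needs_quotes {
        format!("\"{}\"", value.replace('"', "\"\""))
    } else {
        value.to_string()
    }
}

/// Build one CSV line from sparse cells given as (zero-based column, value).
/// Columns with no cell become empty fields so every row has `width` fields.
pub fn csv_row(cells: &[(usize, &str)], width: usize, delimiter: u8) -> String {
    let mut fields = vec![String::new(); width];
    for &(col, value) in cells {
        if col < width {
            fields[col] = csv_field(value, delimiter);
        }
    }
    fields.join(&(delimiter as char).to_string())
}
```

For example, a row with cells only in columns A and C, rendered at width 3 with a comma delimiter, yields `1,,5000`.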
- DOCX section-aware ordering and some advanced revision/comment semantics need hardening.
- XLSX date and number format interpretation needs hardening.
- PPTX extraction does not render slides, synthesize bullets, or preserve visual layout.
See ROADMAP.md and docs/roadmap.md.
Run all standard checks:
make all
Equivalent commands:
cargo fmt --all -- --check
cargo check --workspace --all-features --all-targets
cargo clippy --workspace --all-features --all-targets -- -D warnings
cargo test --workspace --all-features --all-targets
cargo test --doc --workspace --all-features
cargo llvm-cov --workspace --all-features --all-targets --fail-under-lines 95 --summary-only
cargo build --workspace
Coverage is gated at 95% line coverage. Install the optional local tool with:
make install-tools
Static Linux build:
rustup target add x86_64-unknown-linux-musl
make musl
Serve documentation:
make docs
The test suite combines parser unit tests, fixture corpus tests, and versioned snapshots. Current coverage includes:
- Hand-authored OOXML package trees.
- Application-generated .docx, .xlsx, and .pptx fixtures with provenance.
- Snapshot tests for core parser APIs and CLI output.
- Fuzz targets for XML parser paths.
- Performance benchmarks for throughput and peak memory.
- Optional competitive workbench runs against local extraction CLIs.
See docs/testing.md.
oxdoc parses untrusted Office files. Malformed ZIP/XML input should return errors or warnings, not panic.
Required OOXML ZIP parts are guarded before parsing: encrypted parts are rejected, oversized parts fail predictably, zip-bomb-like compression ratios are blocked, and relationship targets must stay inside the package root.
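Two of the guards described above, keeping relationship targets inside the package root and blocking zip-bomb-like compression ratios, can be sketched with the standard library alone. This is an illustration under assumptions, not oxdoc's code: `resolve_target` and `part_size_ok` are hypothetical names, the limits are caller-supplied rather than oxdoc's real thresholds, and external targets such as URLs are out of scope.

```rust
/// Resolve a relationship target against the directory of the source part.
/// Returns `None` when the target would climb above the package root.
pub fn resolve_target(base_dir: &str, target: &str) -> Option<String> {
    let mut segments: Vec<&str> = Vec::new();
    // Targets starting with '/' are resolved from the package root.
    if !target.starts_with('/') {
        segments.extend(base_dir.split('/').filter(|s| !s.is_empty()));
    }
    for segment in target.split('/') {
        match segment {
            "" | "." => {}
            ".." => {
                // Popping an empty stack means the path escapes the root.
                segments.pop()?;
            }
            other => segments.push(other),
        }
    }
    Some(segments.join("/"))
}

/// Accept a ZIP entry only if its declared size and compression ratio are sane.
/// `max_ratio` and `max_size` are hypothetical, caller-chosen limits.
pub fn part_size_ok(compressed: u64, uncompressed: u64, max_ratio: u64, max_size: u64) -> bool {
    if uncompressed > max_size {
        return false;
    }
    // saturating_mul avoids overflow for very large compressed sizes.
    uncompressed <= compressed.saturating_mul(max_ratio)
}
```

Checking declared sizes before decompression is only a first line of defense; a robust reader would also cap the bytes actually read, since ZIP headers can understate the real size.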
Do not open public issues for security vulnerabilities. Use GitHub private vulnerability reporting or contact the maintainers privately through GitHub.
See SECURITY.md and docs/security.md.
Contributions are welcome. Good first contributions include:
- Focused parser fixes.
- New unit tests for malformed XML.
- Safe-to-redistribute fixtures with provenance notes.
- Documentation improvements.
- Performance and memory measurement work.
Start with CONTRIBUTING.md, CODE_OF_CONDUCT.md, and docs/contributing.md.
oxdoc is currently maintainer-led. Larger design changes should start as GitHub issues before implementation.
See GOVERNANCE.md.
oxdoc is licensed under the MIT License. See LICENSE.