A fast, memory-safe library for text extraction from Office documents. Rust core with first-class bindings for Python, Go, C#/.NET, Node.js (native and WASM), and a stable C FFI. Handles DOCX, XLSX, PPTX, DOC, XLS, and PPT. Up to 100× faster than python-docx, openpyxl, python-pptx, and xlrd. Beats python-calamine on XLSX. 100% pass rate on valid Office files — zero failures on legitimate Word/Excel/PowerPoint documents. MIT/Apache-2.0 dual-licensed.
Scope of "fastest". Benchmarks compare Office Oxide against other native / embeddable libraries (no JVM runtime required): python-docx, openpyxl, python-pptx, python-calamine, xlrd, markitdown, catdoc, antiword, xls2csv, calamine (Rust), dotext, docx-rs. Apache POI and Apache Tika are out of scope for this comparison because they require a JVM and are targeted at a different deployment shape. POI/Tika numbers may be added in a future release.
| Language | Package | Directory | Docs |
|---|---|---|---|
| Rust | office_oxide on crates.io | src/ | lib.rs |
| Python | office-oxide on PyPI | python/ | python/ |
| Go | github.com/yfedoseev/office_oxide/go | go/ | go/README.md |
| C# / .NET | OfficeOxide on NuGet | csharp/ | csharp/OfficeOxide/README.md |
| Node.js (native) | office-oxide on npm | js/ | js/README.md |
| Node.js / browser (WASM) | office-oxide-wasm on npm | wasm-pkg/ | wasm-pkg/README.md |
| C / other | header-only via FFI | include/office_oxide_c/ | office_oxide.h |
| CLI | office-oxide binary | crates/office_oxide_cli/ | |
| MCP server | office-oxide-mcp binary | crates/office_oxide_mcp/ | |
Ready-to-run demos (extract, replace, read_xlsx) exist for every binding under examples/. Deeper language-specific guides live in docs/: Rust · Python · Go · C# · JavaScript (native) · WASM · C FFI.
from office_oxide import Document, extract_text, to_markdown
# One-liner text extraction
text = extract_text("report.docx")
markdown = to_markdown("data.xlsx")
# Context-managed document; accepts str or pathlib.Path
with Document.open("slides.pptx") as doc:
    print(doc.format)  # "pptx"
    print(doc.plain_text())
    print(doc.to_markdown())

pip install office-oxide

use office_oxide::Document;
let doc = Document::open("report.docx")?;
let text = doc.plain_text();
let markdown = doc.to_markdown();
let ir = doc.to_ir(); // Format-agnostic intermediate representation

[dependencies]
office_oxide = "0.1.1"

Browser + bundlers:
import { WasmDocument } from "office-oxide-wasm";
const doc = new WasmDocument(buffer, "docx");
console.log(doc.plainText());
console.log(doc.toMarkdown());
doc.free();

Node.js native (koffi + C FFI, no node-gyp):
import { Document } from "office-oxide";
using doc = Document.open("report.docx");
console.log(doc.format); // "docx"
console.log(doc.plainText());
console.log(doc.toMarkdown());
console.log(doc.toIr());

npm install office-oxide # native addon
npm install office-oxide-wasm # portable WASM

import oo "github.com/yfedoseev/office_oxide/go"
doc, _ := oo.Open("report.docx")
defer doc.Close()
text, _ := doc.PlainText()
md, _ := doc.ToMarkdown()

using OfficeOxide;
using var doc = Document.Open("report.docx");
Console.WriteLine(doc.Format); // "docx"
Console.WriteLine(doc.PlainText());
Console.WriteLine(doc.ToMarkdown());

Include include/office_oxide_c/office_oxide.h and link against liboffice_oxide. See examples/c/extract.c for a working sample.
- Fast — 8-100× faster than python-docx, openpyxl, python-pptx, xlrd; beats calamine on XLSX
- Reliable — 100% pass rate on valid Office files, tested against 6,062 real-world documents. Zero failures on legitimate Word 97+ / Excel 97+ / PowerPoint 97+ files
- Complete — 6 formats: DOCX, XLSX, PPTX + legacy DOC, XLS, PPT
- Multi-platform — Rust, Python, Go, JavaScript/TypeScript, C#/.NET, WASM, CLI, and MCP server — one library, all platforms
- Permissive — MIT / Apache-2.0, no AGPL or GPL restrictions
Benchmarked on 6,062 files from 11 independent public test suites. Single-thread, release build with LTO, warm disk cache (steady-state), median of three runs on an idle system. Full methodology in BENCHMARKS.md.

DOCX

| Library | Language | Mean | p99 | Pass Rate | License |
|---|---|---|---|---|---|
| office_oxide | Rust | 0.8ms | 3.9ms | 98.9% | MIT |
| python-docx | Python | 11.8ms | 98ms | 95.1% | MIT |

XLSX

| Library | Language | Mean | p99 | Pass Rate | License |
|---|---|---|---|---|---|
| office_oxide | Rust | 5.0ms | 40ms | 97.8% | MIT |
| python-calamine | Rust/Python | 13.9ms | 183ms | 96.6% | MIT |
| openpyxl | Python | 94.5ms | 698ms | 96.2% | MIT |

PPTX

| Library | Language | Mean | p99 | Pass Rate | License |
|---|---|---|---|---|---|
| office_oxide | Rust | 0.7ms | 3.9ms | 98.4% | MIT |
| python-pptx | Python | 32.5ms | 174ms | 86.7% | MIT |

Legacy formats (DOC, XLS, PPT)

| Library | Format | Mean | p99 | Pass Rate | License |
|---|---|---|---|---|---|
| office_oxide | DOC (246) | 0.3ms | 3.4ms | 94.7% | MIT |
| catdoc | DOC | 4.3ms | 41ms | 90.2% | GPL-2.0 |
| antiword | DOC | 4.5ms | 66ms | 76.8% | GPL-2.0 |
| office_oxide | XLS (494) | 2.8ms | 75ms | 99.2% | MIT |
| xls2csv (catdoc) | XLS | 6.9ms | 58ms | 84.0% | GPL-2.0 |
| python-calamine | XLS | 9.0ms | 96ms | 90.7% | MIT |
| xlrd | XLS | 36.6ms | 503ms | 93.1% | BSD-3 |
| office_oxide | PPT (176) | 0.7ms | 6.6ms | 100% | MIT |
| catppt (catdoc) | PPT | 2.8ms | 8ms | 77.8% | GPL-2.0 |
On .xls, xls2csv has a tighter p99 (58ms vs 75ms) because it emits truncated/lossy output on complex sheets. office_oxide is 2.4× faster on the mean (2.8ms vs 6.9ms) and passes 15 percentage points more of the corpus (99.2% vs 84.0%). No other Rust or Python library supports .doc, .xls, and .ppt text extraction without a JVM (Apache Tika) or external binaries.
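For readers who want to reproduce the Mean, p99, and Pass Rate columns on their own corpus, the aggregation could look roughly like the sketch below. It is not the project's benchmark harness (see BENCHMARKS.md for that). The function names `median_of_runs` and `summarize` are hypothetical, and two things are assumptions: that the median of three runs is taken per file, and that p99 uses the nearest-rank method.

```python
import statistics


def median_of_runs(runs):
    """Collapse repeated runs into one timing per file.

    runs: list of runs, each a list of per-file timings in the same file order.
    Assumption: the median is taken per file across runs.
    """
    return [statistics.median(samples) for samples in zip(*runs)]


def summarize(timings_ms, failures=0):
    """Return mean, nearest-rank p99, and pass rate for one library.

    timings_ms: one duration in milliseconds per successfully parsed file.
    failures: number of files that could not be parsed.
    """
    if not timings_ms:
        raise ValueError("need at least one timing")
    ordered = sorted(timings_ms)
    n = len(ordered)
    # Nearest-rank percentile: ceil(0.99 * n), computed with integers
    # to avoid floating-point rounding surprises.
    rank = (99 * n + 99) // 100
    return {
        "mean_ms": statistics.fmean(ordered),
        "p99_ms": ordered[rank - 1],
        "pass_rate": 100.0 * n / (n + failures),
    }
```

A p99 that is tighter than a competitor's only means something alongside the pass rate, which is why the sketch reports both from the same input.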
| Source | Files | License |
|---|---|---|
| LibreOffice Core | 2,185 | MPL-2.0 |
| Apache POI | 1,298 | Apache-2.0 |
| Open XML SDK | 707 | MIT |
| ClosedXML | 371 | MIT |
| Pandoc | 224 | GPL-2.0 |
| python-docx + python-pptx | 111 | MIT |
| Apache Tika | 108 | Apache-2.0 |
| calamine | 28 | MIT |
| openpreserve | 20 | CC0 |
| oletools | 17 | BSD-2 |
| LibreOffice (legacy) | 12 | MPL-2.0 |
| Total | 6,062 | |
100% pass rate on all valid Office files — the 97 non-passing files in the corpus are all invalid inputs:
| Category | Count | Notes |
|---|---|---|
| Truncated / corrupted ZIP | 43 | Missing EOCD, invalid Central Directory, fuzzer-corrupted archives |
| Encrypted / password-protected | 19 | CFBF-encrypted containers — no password supplied |
| XML bomb / security fixture | 17 | Billion-laughs entities, CVE exploit inputs, fuzzer corpus |
| Wrong format / mislabeled | 16 | WordPerfect, IBM DisplayWrite, pre-OLE2 Excel 3/4, ODT stored as .docx, XLSB stored as .xls |
| Truncated binary | 2 | File ends mid-CFB-sector |
Zero failures on legitimate Word 97+ / Excel 97+ / PowerPoint 97+ files. Zero panics, zero timeouts, zero false negatives on valid documents. Full breakdown in BENCHMARKS.md.
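Several of the invalid-input categories above can be spotted before parsing by looking at the container itself. The sketch below is a rough pre-check in plain Python, not how office_oxide classifies files. The `triage` function and its category labels are hypothetical. It relies only on two well-known facts: ZIP archives begin with `PK\x03\x04`, and OLE2 compound files begin with the 8-byte signature `D0 CF 11 E0 A1 B1 1A E1`.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # local file header of a non-empty ZIP archive
CFB_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file header


def triage(data: bytes) -> str:
    """Coarse container check; the labels are illustrative only."""
    if data.startswith(CFB_MAGIC):
        # Legacy .doc/.xls/.ppt, or an encrypted OOXML file wrapped in CFB.
        return "cfb"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = archive.namelist()
        except zipfile.BadZipFile:
            # Typically a missing or damaged end-of-central-directory record.
            return "corrupt-zip"
        # OOXML packages carry a [Content_Types].xml part; other ZIP-based
        # formats (ODT, for example) do not.
        return "ooxml" if "[Content_Types].xml" in names else "wrong-format"
    return "wrong-format"
```

A check like this cannot tell an encrypted CFB container from a legacy document, so it is only a first filter in front of a real parser.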
| Format | Extension | Read | Write | Edit | Convert | Text | Markdown | HTML | IR |
|---|---|---|---|---|---|---|---|---|---|
| Word (OOXML) | .docx | Yes | Yes | Yes | — | Yes | Yes | Yes | Yes |
| Excel (OOXML) | .xlsx | Yes | Yes | Yes | — | Yes | Yes | Yes | Yes |
| PowerPoint (OOXML) | .pptx | Yes | Yes | Yes | — | Yes | Yes | Yes | Yes |
| Word (Legacy) | .doc | Yes | — | — | → .docx | Yes | Yes | Yes | Yes |
| Excel (Legacy) | .xls | Yes | — | — | → .xlsx | Yes | Yes | Yes | Yes |
| PowerPoint (Legacy) | .ppt | Yes | — | — | → .pptx | Yes | Yes | Yes | Yes |
Legacy formats can be converted to modern OOXML with save_as():
doc = Document.open("report.doc")
doc.save_as("report.docx") # Converts DOC → DOCX

from office_oxide import Document, extract_text, to_markdown, to_html
# Quick extraction
text = extract_text("report.docx")
markdown = to_markdown("spreadsheet.xlsx")
html = to_html("slides.pptx")
# Document object
doc = Document.open("presentation.pptx")
text = doc.plain_text()
md = doc.to_markdown()
html = doc.to_html()
ir = doc.to_ir() # Structured JSON intermediate representation
fmt = doc.format_name() # "pptx"
# From bytes
with open("file.docx", "rb") as f:
    doc = Document.from_bytes(f.read(), "docx")

All 6 formats supported. Works with str, pathlib.Path, or raw bytes.
use office_oxide::{Document, DocumentFormat};
// Open from path (format auto-detected from extension)
let doc = Document::open("report.docx")?;
// Open from reader with explicit format
let file = std::fs::File::open("data.xlsx")?;
let doc = Document::from_reader(file, DocumentFormat::Xlsx)?;
// Extract content
let text = doc.plain_text();
let markdown = doc.to_markdown();
let html = doc.to_html();
let ir = doc.to_ir(); // Format-agnostic DocumentIR
// Access format-specific types
if let Some(docx) = doc.as_docx() {
    println!("Paragraphs: {}", docx.body.elements.len());
}
// Create documents from IR
use office_oxide::create::create_from_ir;
create_from_ir(&ir, DocumentFormat::Docx, "output.docx")?;

Each format is available as a sub-module for direct access:
use office_oxide::docx::DocxDocument;
use office_oxide::xlsx::XlsxDocument;
use office_oxide::pptx::PptxDocument;
use office_oxide::doc::DocDocument;
use office_oxide::xls::XlsDocument;
use office_oxide::ppt::PptDocument;

pip install office-oxide

Wheels available for Linux, macOS, and Windows. Python 3.8–3.14.
[dependencies]
office_oxide = "0.1.1"

npm install office-oxide-wasm # portable WASM (browser + Node.js)
npm install office-oxide # native addon via koffi (Node.js only, no node-gyp)

go get github.com/yfedoseev/office_oxide/go

See go/README.md for setup details.

dotnet add package OfficeOxide

See csharp/OfficeOxide/README.md for setup details.
cargo install office_oxide_cli # Cargo
cargo binstall office_oxide_cli # Pre-built binary

cargo install office_oxide_mcp

office-oxide provides fast Office document processing from your terminal:
office-oxide text report.docx # Extract plain text
office-oxide markdown data.xlsx # Convert to Markdown
office-oxide html slides.pptx # Convert to HTML
office-oxide ir document.docx # Dump IR as JSON
office-oxide info report.docx # Show format and metadata

All six formats supported (docx, xlsx, pptx, doc, xls, ppt). Use --help for all options.
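When the CLI is called from a script instead of a shell, a thin wrapper keeps the subcommand names in one place. The sketch below uses only the five subcommands listed above. `build_command` and `run_cli` are hypothetical helper names. It is an assumption that the binary writes its result to stdout and exits non-zero on failure, so check `--help` before relying on that.

```python
import subprocess

# Subcommands as listed in this README.
SUBCOMMANDS = ("text", "markdown", "html", "ir", "info")


def build_command(subcommand, path, binary="office-oxide"):
    """Build the argument list for one CLI invocation."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    return [binary, subcommand, str(path)]


def run_cli(subcommand, path):
    """Run the CLI and return its stdout as text.

    Assumption: output goes to stdout and failures give a non-zero exit
    code, in which case subprocess raises CalledProcessError.
    """
    result = subprocess.run(
        build_command(subcommand, path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.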
office-oxide-mcp lets AI assistants (Claude, Cursor, etc.) read Office documents locally via the Model Context Protocol.
Add to your MCP client configuration:
{
  "mcpServers": {
    "office-oxide": { "command": "office-oxide-mcp" }
  }
}

The server exposes extract and info tools. All processing runs locally; no files leave your machine.
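If the MCP client already has a configuration with other servers, the entry needs to be merged into it without overwriting them. A minimal sketch follows, assuming the client stores its servers under the `mcpServers` key as in the snippet above. `add_server` is a hypothetical helper, and the location of the config file depends on the client, so it is left out.

```python
import json


def add_server(config, name="office-oxide", command="office-oxide-mcp"):
    """Return a copy of an MCP client config with the server entry added."""
    # Round-trip through JSON as a simple deep copy of JSON-compatible data,
    # so the caller's dictionary is not modified.
    merged = json.loads(json.dumps(config))
    merged.setdefault("mcpServers", {})[name] = {"command": command}
    return merged
```

For example, `json.dumps(add_server({}), indent=2)` produces the same structure as the configuration shown above.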
git clone https://github.com/yfedoseev/office_oxide
cd office_oxide
cargo build --release
cargo test
# Python bindings
maturin develop --features python
# Shared library for Go, JS/TS (koffi), and C# bindings
cargo build --release --lib
# Output: target/release/liboffice_oxide.{so,dylib} or office_oxide.dll

- Documentation Site — Guides and examples
- Getting Started (Rust) — Rust guide
- Getting Started (Python) — Python guide
- Getting Started (Go) — Go guide
- Getting Started (JavaScript / TypeScript) — Node.js native guide
- Getting Started (WASM) — Browser and Node.js WASM guide
- Getting Started (C# / .NET) — .NET guide
- Getting Started (C FFI) — C FFI guide
- API Docs (Rust) — Full Rust API reference
- Architecture — System design and module structure
- RAG / LLM pipelines — Extract clean text or Markdown from Office documents for retrieval-augmented generation
- Document processing at scale — Parse thousands of documents in seconds
- Data extraction — Pull structured data from spreadsheets, tables, and presentations
- Format conversion — Convert between formats via the intermediate representation
- python-docx / openpyxl alternative — Up to 100× faster, supports all 6 formats in one library
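The RAG and batch-processing use cases usually come down to three steps: filter files by extension, extract each one, and group the text into chunks. The sketch below shows that shape with the extractor passed in as a function, so it runs without the library installed. In practice `office_oxide.to_markdown` or `office_oxide.extract_text` from the Python examples above could be passed as `extract`. The helper names, the chunk size, and the use of a thread pool are all assumptions. Whether threads give real parallelism depends on whether the binding releases the GIL, which this README does not state.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# The six extensions this README lists as supported.
SUPPORTED = {".docx", ".xlsx", ".pptx", ".doc", ".xls", ".ppt"}


def is_supported(path):
    """True if the file extension is one of the six supported formats."""
    return Path(path).suffix.lower() in SUPPORTED


def extract_all(paths, extract, workers=4):
    """Apply `extract` to every path; map each path to its text or its exception."""
    def safe(path):
        try:
            return extract(str(path))
        except Exception as exc:  # keep going; one bad file should not stop the batch
            return exc

    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(safe, paths)))


def chunk(text, max_chars=1000):
    """Group paragraphs into chunks of roughly max_chars.

    Paragraphs are split on blank lines; a single paragraph longer than
    max_chars is kept whole rather than cut mid-sentence.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Returning exceptions as values keeps the batch running past invalid inputs such as truncated or password-protected files, and lets the caller decide how to report them.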
I needed a library that could read all six Office formats at once — not six separate packages — and I needed it without pulling in a JVM, a Python runtime, or a GPL-licensed dependency. Nothing existed that combined speed, correctness, and a permissive license across the full DOCX / XLSX / PPTX / DOC / XLS / PPT surface, so I wrote it in Rust and wrapped it for every language I use day-to-day. The same binary powers Python via PyO3, Node.js via koffi, Go via cgo, C# via P/Invoke, and the browser via WASM — one fix lands everywhere.
If it saves you a dependency, a license audit, or a weekend, consider leaving a star. If something's broken or missing, open an issue — I read all of them.
— Yury
Dual-licensed under MIT or Apache-2.0 at your option. No AGPL, no GPL, no copyleft restrictions. Use freely in commercial and open-source projects.
We welcome contributions! See CONTRIBUTING.md for guidelines.
cargo build && cargo test && cargo fmt && cargo clippy -- -D warnings

@software{office_oxide,
  title = {Office Oxide: Fast Office Document Processing for Rust, Python, Go, JavaScript, and C#},
  author = {Yury Fedoseev},
  year = {2026},
  url = {https://github.com/yfedoseev/office_oxide}
}

Rust + Python + Go + JS/TS + C# + WASM + CLI + MCP | MIT/Apache-2.0 | 100% pass rate on valid Office files (6,062-file corpus) | Up to 100× faster than alternatives | 6 formats