Agent that converts office and web documents into structured markdown for knowledge bases.
any2md turns .xlsx / .docx / .pptx / .pdf / .html / .md into clean, agent-rendered markdown — preserving reading order, tables, charts, diagrams, and image captions. It is a general-purpose library: it runs on any machine with Python 3.11+, and because every dependency is pure-Python (no subprocess, no system binaries), it also runs unmodified on Databricks serverless compute.
Knowledge bases ingest messy office documents. A naive pdf→txt or pptx→html dump loses tables, scrambles reading order, drops images, and produces markdown nobody wants to read. any2md does the structural work as a deterministic pre-pass, then uses a small reasoning LLM (any OpenAI-compatible endpoint, Anthropic, Databricks Foundation Models, or Google) to render the markdown — with a quality-reviewer subagent that checks completeness before returning.
source file (.xlsx/.docx/.pptx/.pdf/.html/.md)
↓ extract pure-Python parsers per format
raw IR + image bytes
↓ clean strip redundancy, normalize bbox to px @ 96 DPI
cleaned IR
↓ caption async vision LLM, parallel per image
captions: dict[image_id → text]
↓ render deepagents main agent + quality-reviewer subagent
virtual FS at /workspace/ (CompositeBackend)
markdown
Top-level entry: convert(source, options) -> ConvertResult (src/any2md/convert.py).
| Format | Extractor | Status |
|---|---|---|
.xlsx / .xlsm |
extractors/xlsx.py | Done |
.pptx / .pptm |
extractors/pptx.py | Done |
.pdf |
extractors/pdf.py | Done |
.html / .htm |
extractors/html.py | Done |
.docx / .docm |
— | Pending |
.md / .markdown |
— | Pending |
py -m pip install -e ".[agent,openai]"Optional extras:
| Extra | When to install |
|---|---|
agent |
Required for the LLM-rendering layer (deepagents + langchain-core + python-dotenv) |
openai |
OpenAI-compatible endpoints — LiteLLM proxies, vLLM, DashScope, Qwen |
databricks |
ChatDatabricks via databricks_langchain (auto-picks workspace credentials) |
anthropic / google |
Direct provider clients |
all |
Everything above |
dev |
pytest, ruff, mypy |
-
Copy the env template and fill in your model credentials:
Copy-Item .env.example .envAt minimum set
LITELLM_API_KEY(or swaptext_model/vision_modelfor another provider — see 04-api-surface.md). -
Run the end-to-end smoke test:
py scripts/run_agent.py samples/test.html # in-memory only py scripts/run_agent.py samples/test.html --debug # also dump to ./workspace/
With
--debugthe agent's virtual workspace is mirrored to disk for inspection atworkspace/<stem>/(orworkspace/<stem>/chunk_NN/for multi-chunk runs):workspace/test/ ir.json # cleaned IR fed to the agent captions.json # pre-computed image captions final.md # returned markdown -
From a script (sync API, safe in or out of a running loop):
from any2md import convert, ConvertOptions from any2md.llm import litellm_text_model llm = litellm_text_model() # reads LITELLM_* env vars result = convert( "samples/test.html", options=ConvertOptions(text_model=llm, vision_model=llm), ) print(result.markdown)
-
From a notebook / async handler (preferred — no thread hop):
from any2md import aconvert, ConvertOptions from any2md.llm import litellm_text_model result = await aconvert( "samples/test.html", options=ConvertOptions(text_model=litellm_text_model()), )
All knobs live on ConvertOptions (src/any2md/options.py):
| Knob | Default | Effect |
|---|---|---|
text_model |
Databricks Claude Sonnet 4.6 | Main rendering LLM |
vision_model |
Databricks Claude Haiku 4.5 | Image captioner |
enable_image_captions |
True |
Skip pre-pass to use only alt_text |
enable_quality_review |
True |
Skip reviewer subagent to halve LLM calls |
caption_language |
"auto" |
"vi" / "en" to force |
max_chunk_tokens |
30_000 |
Budget gate via tiktoken.cl100k_base |
max_concurrent_image_captions |
4 |
asyncio.Semaphore cap |
image_caption_requests_per_minute |
60 |
Token-bucket rate limit |
return_ir |
False |
Include cleaned IR in ConvertResult |
write_to |
None |
Path to also write final markdown |
debug_workspace |
None |
Opt-in: mirror agent virtual state to disk (relative paths resolve under tempfile.gettempdir(); absolute paths honored) |
ModelConfig supports 5 providers (databricks, openai, anthropic, google, custom) plus a pre-built BaseChatModel instance. Any OpenAI-compatible endpoint (LiteLLM, vLLM, DashScope, Ollama, Qwen) works via provider="openai" with a custom base_url.
The library has no Databricks runtime dependency — installing the same wheel that runs locally works on serverless and classic clusters alike. Specifics:
- Compute: Serverless ✓ (recommended). Classic ✓. No init script, no
apt-get, noLibreOffice/Tesseract/Pandoc— all dependencies are PyPI manylinux wheels. - Storage: Unity Catalog Volumes (
/Volumes/<cat>/<sch>/<vol>/file.pptx). DBFS is deprecated and unsupported. - Auth: On the cluster, workspace credentials are auto-picked. Locally, set
DATABRICKS_HOST+DATABRICKS_TOKEN. - Models: Built-in support for Foundation Model API endpoints (
databricks-claude-sonnet-4-6,databricks-claude-haiku-4-5,databricks-claude-opus-4-7) viadatabricks_langchain.ChatDatabricks. Other providers work just as well —convert()doesn't care where the model lives.
Constraints documented in 02-platform.md. For step-by-step notebook usage — install, model choice, UC path conventions, troubleshooting — see docs/databricks.md.
If LANGSMITH_TRACING=true is set in .env, every agent run produces a full trace tree (main agent steps → tool calls → subagent invocations → token counts → latency) at https://smith.langchain.com under the project name configured in LANGSMITH_PROJECT.
| Path | Purpose |
|---|---|
| src/any2md/ | Library code |
| src/any2md/extractors/ | Per-format pure-Python parsers |
| src/any2md/ir_clean.py | Post-extraction cleanup pass |
| src/any2md/ir_chunk.py | Token-budgeted chunking |
| src/any2md/agent/ | Caption pre-pass, main agent, subagents |
| src/any2md/convert.py | convert() API |
| scripts/ | Pipeline driver scripts (dump_ir, clean_ir, chunk_ir, run_agent) |
| samples/ | Source fixtures |
| docs/superpowers/specs/ | Design spec (numbered 01-11) |
| AGENTS.md | Conventions for any agent (human or AI) editing this repo |
| STATE.md | Living status snapshot — updated on every milestone |
See STATE.md for the current pipeline state, completed items, and pending work. Full design rationale and decision log live in docs/superpowers/specs/2026-05-21-databricks-converter/.
TBD.