DocVert

DocVert is an LLM-powered CLI tool and agent that converts DOCX and PDF documents into clean, semantic Markdown. It prioritizes preserving document structure (headings, lists, and visual elements) and writes key document metadata to sidecar JSON files.

Features

  • Robust DOCX Parsing: Uses python-docx as the primary parser, with heuristic heading detection, and falls back to mammoth for difficult layouts.
  • Advanced PDF Parsing: Uses docling as the primary engine for high-fidelity PDF extraction, and falls back to unstructured for edge cases.
  • Rich Output Format:
    • Generates clean, semantic .md files.
    • Produces a sidecar .json file containing extraction confidence scores, metadata, and parsing warnings.
    • Automatically extracts and saves images referenced in the source documents.
  • Batch Processing & Caching: Processes large directories of files efficiently, with built-in caching to avoid redundant parsing.
  • Provider-Agnostic LLM Refinement: Uses litellm under the hood, natively supporting OpenAI, Vertex AI, Anthropic, Bedrock, and local models via Ollama.
  • Air-Gapped / Offline Deployment: Pre-built Docker images via GitHub Releases for secure, offline environments.
  • Developer Ready:
    • 100% test coverage.
    • Robust type hints powered by pydantic.
    • Built-in CLI using modern Python tooling (uv).
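
The primary-then-fallback parsing and sidecar output described above can be pictured with a small sketch. This is not DocVert's actual code: the names `convert_with_fallback`, `write_outputs`, and `Sidecar`, as well as the sidecar fields shown, are hypothetical, and the parsers are passed in as plain callables so that the example does not depend on python-docx, mammoth, docling, or unstructured.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable, List, Optional, Sequence, Tuple

# A parser takes a source path and returns Markdown text, raising on failure.
Parser = Callable[[Path], str]


@dataclass
class Sidecar:
    """Hypothetical sidecar record; DocVert's real schema may differ."""
    source: str
    engine: Optional[str] = None
    warnings: List[str] = field(default_factory=list)


def convert_with_fallback(
    source: Path, parsers: Sequence[Tuple[str, Parser]]
) -> Tuple[str, Sidecar]:
    """Try each (name, parser) pair in order and return the first success."""
    sidecar = Sidecar(source=source.name)
    for name, parse in parsers:
        try:
            markdown = parse(source)
        except Exception as exc:
            # Record the failure as a warning and move on to the next engine.
            sidecar.warnings.append(f"{name} failed: {exc}")
            continue
        sidecar.engine = name
        return markdown, sidecar
    raise RuntimeError(f"all parsers failed for {source}: {sidecar.warnings}")


def write_outputs(markdown: str, sidecar: Sidecar, out_dir: Path) -> Tuple[Path, Path]:
    """Write <stem>.md and <stem>.json next to each other in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(sidecar.source).stem
    md_path = out_dir / f"{stem}.md"
    json_path = out_dir / f"{stem}.json"
    md_path.write_text(markdown, encoding="utf-8")
    json_path.write_text(json.dumps(asdict(sidecar), indent=2), encoding="utf-8")
    return md_path, json_path
```

In this sketch a failed primary engine is not fatal: the failure is kept as a warning in the sidecar, so a reader of the JSON file can tell which engine produced the Markdown and why earlier ones were skipped.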

Quick Start (Docker — Recommended)

# Build from source
git clone https://github.com/seonghobae/docvert.git
cd docvert
docker build -t docvert:offline .

# Convert a file
docker run --rm -v "$(pwd)":/data \
    docvert:offline convert /data/input.pdf --output-dir /data/out

Or install natively — see the full Installation Guide.
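
For scripting, the `docker run` invocation above can be assembled from Python. The sketch below is an assumption-laden convenience wrapper, not part of DocVert: `build_convert_command` and `convert_all_pdfs` are hypothetical helper names, and the loop uses only the single-file `convert` invocation shown in this README, because the flags for DocVert's own batch mode are not documented here.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(
    host_dir: Path, input_name: str, image: str = "docvert:offline"
) -> List[str]:
    """Build the argv for converting one file mounted under /data."""
    return [
        "docker", "run", "--rm",
        # Mount the host directory at /data, as in the Quick Start example.
        "-v", f"{host_dir.resolve()}:/data",
        image, "convert", f"/data/{input_name}",
        "--output-dir", "/data/out",
    ]


def convert_all_pdfs(host_dir: Path) -> None:
    """Run one container per PDF in host_dir; raises if a conversion fails."""
    for pdf in sorted(host_dir.glob("*.pdf")):
        subprocess.run(build_convert_command(host_dir, pdf.name), check=True)
```

Passing the command as a list avoids shell quoting problems when the host path contains spaces.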

Documentation

Full documentation is available at seonghobae.github.io/docvert

GitHub Releases (Offline Bundles)

Pre-built Docker images for air-gapped environments:

github.com/seonghobae/docvert/releases

License

MIT License
