This project provides a MarkItDown plugin for converting PDF files to Markdown with an emphasis on extracting all content, including text, tables, and images—even from challenging or scanned documents. The plugin is designed for correctness over speed, ensuring that as much information as possible is preserved in the Markdown output.
- Text Extraction: Uses PyMuPDF and pymupdf4llm to extract text, tables, and images from PDFs when possible.
- Fallback to LLM-based OCR: If text extraction is incomplete (e.g., for scanned pages, complex tables, or images), the plugin automatically falls back to an LLM-based OCR service. This ensures that even non-searchable or image-based content is converted to Markdown.
- Enhanced Table Detection: Incorporates a `TableDetectionInterface`, allowing for sophisticated table detection (e.g., using an LLM via `LLMBasedTableDetector`). If tables are detected on a page initially deemed readable by heuristics, that page is re-routed for full OCR to ensure accurate table extraction.
- Table and Image Handling: Tables and images are extracted and, if necessary, described or transcribed using the LLM OCR backend.
- Correctness First: The plugin prioritizes extracting everything from the PDF, even if this means slower processing due to LLM calls for OCR or table/image understanding.
- Text Extraction: For each page, the plugin first attempts to extract text, tables, and images using PyMuPDF and pymupdf4llm.
- Heuristics and Table Detection for OCR: A page is sent for OCR if:
  a. It's detected as non-readable by standard heuristics (e.g., scanned, mostly images, simulated text).
  b. Standard heuristics pass, but a configured `TableDetectionInterface` (like the default `LLMBasedTableDetector` if an LLM is set up) detects the presence of tables on the page. This helps catch subtle text-based tables that might otherwise be missed.
- LLM OCR Fallback: The plugin uses a configurable LLM client (such as OpenAI's GPT-4o) to perform OCR and generate Markdown for images, tables, and non-readable content.
- Combining Results: All extracted and OCR'd content is merged into a single Markdown output, with clear page and section boundaries.
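The per-page routing described above can be sketched as follows. This is an illustrative toy, not the plugin's real API: `is_page_readable`, the dict-based "page", and the callable services are hypothetical stand-ins for its internals.

```python
# Sketch of the per-page routing described above. All helpers here are
# hypothetical stand-ins for the plugin's internals, not its real API.

def is_page_readable(page):
    # Stand-in heuristic: treat a page as readable if it carries a text layer.
    return bool(page.get("text"))

def convert_page(page, table_detector=None, ocr_service=None):
    readable = is_page_readable(page)
    if readable and table_detector is not None:
        # Even a readable page is re-routed to OCR when tables are detected,
        # since text-layer extraction often mangles table structure.
        readable = not table_detector(page)
    if readable:
        return page["text"]  # fast path: PyMuPDF / pymupdf4llm extraction
    if ocr_service is None:
        raise RuntimeError("Page needs OCR but no OCR service is configured")
    return ocr_service(page)  # slow path: LLM-based OCR -> Markdown

# Toy run: a scanned page (empty text layer) falls back to OCR.
scanned = {"text": ""}
print(convert_page(scanned, ocr_service=lambda p: "OCR'd markdown"))  # → OCR'd markdown
```

The key design point is that OCR is a fallback, triggered either by the readability heuristics or by a positive table-detection result on an otherwise readable page.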
- Python 3.11+
- PyMuPDF (`pymupdf`)
- pymupdf4llm
- Pillow
- tqdm (for progress bars)
- markitdown
- An LLM client (e.g., OpenAI Python SDK) for OCR fallback
All dependencies are listed in `pyproject.toml`.
- Clone the repository:

  ```shell
  git clone <this-repo-url>
  cd pdf-to-markdown
  ```

- Install dependencies (using Poetry or pip):

  Using Poetry (recommended):

  ```shell
  poetry install
  ```

  Using pip: if you are not using Poetry, you can install the package and its dependencies directly from the project directory (which reads `pyproject.toml`):

  ```shell
  pip install .
  ```

  (If this package were published on PyPI, you would typically use `pip install pdf-to-markdown`.)
If you plan to use the synchronous API functions (like `pdf_to_markdown_sync`, or `LLMBasedTableDetector.detect_tables_on_page` when it internally runs an async client) in an environment that already has an active asyncio event loop (such as Jupyter notebooks or certain GUI frameworks), you might encounter `RuntimeError`s related to nested event loops.

To enable smoother operation in these environments, you can install an optional dependency when installing this package:

```shell
pip install .[async_sync_env]
```

Or, if installing from PyPI (once published):

```shell
pip install pdf-to-markdown[async_sync_env]
```

This installs `nest_asyncio`, which helps the library manage nested event loops gracefully. The library will attempt to use `nest_asyncio` if it's available, falling back to standard asyncio behavior otherwise (which may lead to the aforementioned `RuntimeError`s in nested loop scenarios).
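The fallback pattern can be approximated like this. A minimal sketch only: `run_sync` and `demo` are hypothetical names, not part of the package's API.

```python
import asyncio

def run_sync(coro):
    """Run a coroutine to completion; tolerate an already-running loop if nest_asyncio is present."""
    try:
        import nest_asyncio  # optional extra: pip install .[async_sync_env]
        nest_asyncio.apply()  # patches asyncio so nested event loops are allowed
    except ImportError:
        pass  # plain asyncio: fine in scripts, may raise RuntimeError inside a live loop
    return asyncio.run(coro)

async def demo():
    return "ok"

print(run_sync(demo()))  # prints "ok"
```

In a plain script (no running loop) this works either way; inside a Jupyter cell, the `nest_asyncio.apply()` branch is what prevents the nested-loop `RuntimeError`.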
When selecting an LLM for OCR, consider factors like accuracy, speed, and cost. According to benchmark results from the Omni OCR Benchmark (as of the time of writing), Google's `gemini-2.0-flash` (or newer flash versions) is often highlighted as a strong contender for its balance of good performance, speed, and cost-effectiveness for OCR tasks. However, always refer to the latest benchmark data and your specific project requirements when making a decision.
This plugin is designed to be used as a MarkItDown converter. After installing this package, you can use it programmatically:
```python
from markitdown import MarkItDown

# Optionally, provide your LLM client and model for OCR fallback and default table detection
llm_client = ...  # e.g., sync or async OpenAI client (note: for Gemini usage see example below)
llm_model = "gpt-4.1-mini"

mid = MarkItDown(
    enable_plugins=True,  # Mandatory
    llm_client=llm_client,  # Mandatory
    llm_model=llm_model,  # Mandatory
)

path_to_pdf = "your-path-to.pdf"
result = mid.convert(path_to_pdf)
print(result.markdown)
```
- You can provide your own OCR service (`custom_pdf_ocr_service`) or use the built-in LLM-based OCR (requires an OpenAI-compatible client and model).
- Similarly, you can provide your own table detection service (`table_detection_service`) or use the built-in `LLMBasedTableDetector` (which also uses the configured LLM client and model if provided).
- The plugin will automatically decide when to use OCR based on page content heuristics and table detection results.
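As a sketch of what a custom OCR service might look like — assuming `OCRInterface` expects a method that takes a page image and returns Markdown; check the package's interface definition for the exact method name and signature:

```python
class EchoOCRService:
    """Toy OCR service: a stand-in for the OCRInterface contract (method name
    and signature are assumptions, not the package's real API).

    A real implementation would send the page image to an OCR backend
    (Tesseract, a vision LLM, a cloud API, ...) and return its Markdown.
    """

    def ocr_page(self, image_bytes: bytes) -> str:
        # Hypothetical signature: accept raw page-image bytes, return Markdown.
        return f"<!-- OCR placeholder: {len(image_bytes)} bytes of page image -->"

svc = EchoOCRService()
print(svc.ocr_page(b"\x89PNG..."))  # → <!-- OCR placeholder: 7 bytes of page image -->
```

Any object satisfying the interface can be passed as `custom_pdf_ocr_service`, which lets you swap in a cheaper or offline OCR backend without touching the rest of the pipeline.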
- Performance: Because the plugin prioritizes correctness and completeness, processing may be slower than pure text extractors, especially for scanned or image-heavy PDFs.
- LLM Costs: Using LLM-based OCR may incur API costs depending on your provider and the number of pages/images processed.
You can customize the behavior of the plugin by providing the following optional arguments when constructing the `MarkItDown` instance or when calling `convert`:
- `custom_pdf_ocr_service`: An instance of a custom OCR service implementing the `OCRInterface`. If provided, this will be used for all OCR tasks instead of the built-in LLM-based OCR.
- `table_detection_service`: An instance of a custom table detection service implementing the `TableDetectionInterface`. If provided, this will be used for table detection.
- `llm_client`: An OpenAI-compatible client instance (or similar). If provided (along with `llm_model`), it will be used for:
  - The default `LLMBasedOCRService` (if `custom_pdf_ocr_service` is not given).
  - The default `LLMBasedTableDetector` (if `table_detection_service` is not given).
- `llm_model`: The model name to use with the LLM client (e.g., `"gpt-4.1-mini"`).
- `show_progress`: (bool, default `False`) If `True`, shows progress bars during processing (requires `tqdm`).
- `pages`: A list of page indices to process. If not provided, all pages are processed.
- `force_ocr`: (bool, default `False`) If `True`, forces OCR on all pages, even if text extraction is possible.
- `show_progress`: (bool, default `False`) If `True`, shows progress bars for this conversion (requires `tqdm`).
```python
from markitdown import MarkItDown
from openai import OpenAI

llm_client = OpenAI(api_key="sk-...")
llm_model = "gpt-4.1-mini"

mid = MarkItDown(
    enable_plugins=True,  # mandatory
    llm_client=llm_client,  # mandatory
    llm_model=llm_model,  # mandatory
    show_progress=True,  # optional: default=False
    custom_pdf_ocr_service=None,  # optional: your own OCR service implementing OCRInterface
    table_detection_service=None,  # optional: your own table detection service implementing TableDetectionInterface
)

pdf_path = "path to your.pdf"
result = mid.convert(
    pdf_path,
    show_progress=True,  # defaults to the constructor value
    force_ocr=True,  # defaults to False
    pages=[0, 1, 2],  # defaults to None = all pages
)
print(result.markdown)
```
Notes:
- If you provide both `custom_pdf_ocr_service` and `llm_client`/`llm_model`, the custom `custom_pdf_ocr_service` takes precedence for OCR.
- If you provide both `table_detection_service` and `llm_client`/`llm_model`, the custom `table_detection_service` takes precedence for table detection.
- If an `llm_client` and `llm_model` are provided:
  - And no `custom_pdf_ocr_service` is given, `LLMBasedOCRService` will be used.
  - And no `table_detection_service` is given, `LLMBasedTableDetector` will be used.
- If neither OCR configuration (custom or LLM-based) is provided, the plugin will not be able to perform OCR and will return an error for non-readable pages.
- If no table detection configuration is provided (custom or LLM-based), table detection will rely solely on `pymupdf4llm`'s capabilities (`strategy='lines_strict'`).
- `show_progress` can be set globally (in the constructor) or per conversion.
- `pages` allows you to process only a subset of pages.
- `force_ocr` is useful for scanned PDFs or when you want to ensure all content is processed via OCR.
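A custom table detection service can be sketched in the same spirit. This is hypothetical: it assumes `TableDetectionInterface` requires a per-page boolean method such as `detect_tables_on_page` — verify the exact contract in the source before relying on it.

```python
class KeywordTableDetector:
    """Toy detector: flags a page as containing a table if its text looks tabular.

    The method name and the text-based input are assumptions for illustration.
    A real implementation (like the built-in LLMBasedTableDetector) would
    typically render the page to an image and ask a vision LLM.
    """

    def detect_tables_on_page(self, page_text: str) -> bool:
        # Crude heuristic: several pipe characters suggest Markdown-like table rows.
        return page_text.count("|") >= 4

det = KeywordTableDetector()
print(det.detect_tables_on_page("| a | b |\n| 1 | 2 |"))  # → True
```

Because a positive detection re-routes the page to full OCR, a cheap heuristic like this trades some precision for avoiding an LLM call on every page.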
Besides its use as a MarkItDown plugin, `pdf-to-markdown` can also be used as a standalone Python library to convert PDF content directly.

Here's how you can use the `pdf_to_markdown` function with the `LLMBasedOCRService` and `LLMBasedTableDetector`:
```python
import asyncio
import os  # For accessing environment variables

from pdf_to_markdown import (
    pdf_to_markdown,
    pdf_to_markdown_sync,
    LLMBasedOCRService,
)
from pdf_to_markdown.table_detection_services import LLMBasedTableDetector

# --- Option 1: Setup LLM Client (Example using OpenAI) ---
# Make sure to install it: pip install openai
from openai import OpenAI, AsyncOpenAI

# It's recommended to set your API key as an environment variable for security.
# e.g., export OPENAI_API_KEY='your_api_key_here'
# If the OPENAI_API_KEY environment variable is set, the client will use it automatically.
# Otherwise, you can pass it directly: OpenAI(api_key="YOUR_API_KEY")

# Synchronous OpenAI Client
sync_openai_client = OpenAI()
# Asynchronous OpenAI Client
async_openai_client = AsyncOpenAI()
# print("OpenAI clients initialized. Ensure your OPENAI_API_KEY is set or passed directly.")

openai_model = "gpt-4.1-mini"  # Or your preferred OpenAI model, e.g., "gpt-4-turbo"

# --- Option 2: Setup LLM Client (Example using Google Gemini via OpenAI-compatible API) ---
# This requires a service that provides an OpenAI-compatible endpoint for Gemini models.
# Replace placeholders with your actual endpoint URL, API key, and Gemini model name.
gemini_api_key = os.environ.get("GEMINI_API_KEY")  # Or your actual key
gemini_base_url = "https://generativelanguage.googleapis.com/v1beta/openai/"

# You can use either the synchronous or asynchronous OpenAI client, configured for the Gemini endpoint.
# Synchronous client for Gemini:
# sync_gemini_client = OpenAI(
#     api_key=gemini_api_key,
#     base_url=gemini_base_url,
# )
# Asynchronous client for Gemini:
# async_gemini_client = AsyncOpenAI(
#     api_key=gemini_api_key,
#     base_url=gemini_base_url,
# )

gemini_model_name = "gemini-2.0-flash-latest"  # Example Gemini model name; replace with the actual model string for your endpoint

# To use Gemini, you would then set:
# llm_client_to_use = async_gemini_client  # or sync_gemini_client
# llm_model_to_use = gemini_model_name
# print(f"Gemini client configured for model: {gemini_model_name} at {gemini_base_url}")

llm_client_to_use = async_openai_client  # Choose which client to use (e.g., async_openai_client or sync_openai_client)
# For best performance with the library's async core, AsyncOpenAI is generally preferred.
llm_model_to_use = openai_model  # Choose the corresponding model

# --- Initialize OCR Service ---
ocr_service = LLMBasedOCRService(
    llm_client=llm_client_to_use,
    llm_model=llm_model_to_use,
    show_progress=True,
)

# --- Initialize Table Detection Service ---
table_detection_svc = LLMBasedTableDetector(
    llm_client=llm_client_to_use,
    llm_model=llm_model_to_use,
    # You can customize prompt, retries etc. via kwargs if needed
)

# --- Path to your PDF ---
pdf_file_path = "path/to/your/example.pdf"  # Replace with your PDF file path

# --- Asynchronous Usage Example ---
async def run_async_conversion():
    print("--- Running Asynchronous Conversion ---")
    markdown_results = await pdf_to_markdown(
        pdf_source=pdf_file_path,
        image_ocr_service=ocr_service,
        table_detection_service=table_detection_svc,  # Pass the table detection service
        show_progress=True,
    )
    for page_num, md_content in sorted(markdown_results.items()):
        print(f"\n--- Page (Async) {page_num + 1} ---\n{md_content}")

# --- Synchronous Usage Example ---
def run_sync_conversion():
    print("\n--- Running Synchronous Conversion ---")
    markdown_results = pdf_to_markdown_sync(
        pdf_source=pdf_file_path,
        image_ocr_service=ocr_service,
        table_detection_service=table_detection_svc,  # Pass the table detection service
        show_progress=True,
    )
    for page_num, md_content in sorted(markdown_results.items()):
        print(f"\n--- Page (Sync) {page_num + 1} ---\n{md_content}")

if __name__ == "__main__":
    # To run the async version:
    # asyncio.run(run_async_conversion())
    # To run the sync version:
    run_sync_conversion()
```
Explanation:
- Import necessary modules: `asyncio`, `pdf_to_markdown`, `pdf_to_markdown_sync`, `LLMBasedOCRService`, `LLMBasedTableDetector` (from `pdf_to_markdown.table_detection_services`), and your chosen LLM client library.
- Set up LLM Client: (As before)
- Initialize `LLMBasedOCRService`: (As before)
- Initialize `LLMBasedTableDetector`: Instantiate the table detector, also using the LLM client and model. This service is then passed to the conversion functions via the `table_detection_service` parameter. If you don't provide a `table_detection_service` but an LLM client/model is configured, the converter will try to use `LLMBasedTableDetector` by default.
- Call Conversion Functions:
  - `pdf_to_markdown` (async): Pass `table_detection_service=table_detection_svc`.
  - `pdf_to_markdown_sync` (sync): Pass `table_detection_service=table_detection_svc`.
- Process Results: (As before)
This setup allows you to leverage the PDF parsing and OCR capabilities of the package in any Python application, choosing between asynchronous or synchronous execution and different LLM backends based on your needs.
MIT License