A Python library for extracting text and visual elements (tables, figures) from PDF documents using OpenAI's vision models. Perfect for RAG systems, document analysis, and automated data extraction.
- 📄 PDF Text Extraction: Extract clean text content from PDF documents
- 🖼️ PDF to Images: Convert PDF pages to high-quality images
- 🔍 Visual Element Detection: Use OpenAI's GPT-4o-mini to detect and extract tables and figures
- 📊 Structured Data Extraction: Extract tables in structured list format with all numerical data
- 📈 CSV Export: Convert extracted tables to CSV format using an LLM
- 📝 Multiple Export Formats: Export to JSON, Markdown, and CSV
- 🚀 Simple API: Easy-to-use command line interface and Python API
- 💰 Cost-Effective: Uses GPT-4o-mini for optimal performance and cost
git clone https://github.com/yourusername/pdf-visual-extraction.git
cd pdf-visual-extraction
pip install -e .
pip install -r requirements.txt
# Basic extraction
pdf-visual-extract --pdf_path document.pdf --output_dir ./output
# With markdown export
pdf-visual-extract --pdf_path document.pdf --output_dir ./output --export_md
# With CSV export
pdf-visual-extract --pdf_path document.pdf --output_dir ./output --export_csv
# With text cleaning (remove redundant table/figure content)
pdf-visual-extract --pdf_path document.pdf --output_dir ./output --clean_text
# Full pipeline with all exports
pdf-visual-extract --pdf_path document.pdf --output_dir ./output --export_md --export_csv --clean_text --max_pages 5
# Batch processing multiple PDFs
python examples/batch_processing_example.py
from pdf_visual_extraction import (
    extract_text_from_pdf,
    convert_pdf_to_images,
    process_images_openai,
    inject_tables_into_text,
    convert_tables_to_csv
)
# Extract text
text_data = extract_text_from_pdf("document.pdf", "output/text.json")
# Convert to images
convert_pdf_to_images("document.pdf", "output/images/")
# Detect visual elements
visual_data = process_images_openai("output/images/", "document", "output/visual.json")
# Inject tables into text
final_data = inject_tables_into_text(text_data, visual_data)
# Convert tables to CSV
csv_results = convert_tables_to_csv(final_data, "output", "document")
For processing multiple PDFs efficiently:
# Process all PDFs in a directory with parallel processing
python batch_process_all_pdfs.py --data_dir data --output_dir batch_output --parallel_jobs 4 --max_workers 3
from pdf_visual_extraction import run_pdf_visual_extraction
# Process multiple PDFs
pdf_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
for pdf_path in pdf_files:
    success = run_pdf_visual_extraction(
        pdf_path=pdf_path,
        output_dir="batch_output",
        export_md=True,
        export_csv=True,
        max_workers=3
    )
- Parallel processing for multiple PDFs (see the sketch after this list)
- CSV export for all extracted tables
- Progress tracking and error handling
- Comprehensive summaries with statistics
- Skip existing files option
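As a reference point, here is a minimal sketch of a parallel batch run driving run_pdf_visual_extraction directly, with simple progress printing and per-file error handling. The thread pool, progress counter, and input directory ("data", matching the CLI example above) are illustrative choices, not the library's built-in batch implementation:
import concurrent.futures
from pathlib import Path
from pdf_visual_extraction import run_pdf_visual_extraction

def process_one(pdf_path):
    # Return (pdf_path, success_flag); exceptions are caught so one bad PDF
    # does not abort the whole batch.
    try:
        ok = run_pdf_visual_extraction(
            pdf_path=str(pdf_path),
            output_dir="batch_output",
            export_md=True,
            export_csv=True,
            max_workers=3,
        )
        return pdf_path, ok
    except Exception as exc:
        print(f"Failed on {pdf_path}: {exc}")
        return pdf_path, False

pdf_files = sorted(Path("data").glob("*.pdf"))
# Process a few PDFs at a time; size the pool to match your API rate limits.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for done, (pdf_path, ok) in enumerate(pool.map(process_one, pdf_files), start=1):
        print(f"[{done}/{len(pdf_files)}] {pdf_path}: {'ok' if ok else 'failed'}")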
Set your OpenAI API key:
export OPENAI_API_KEY="your-api-key-here"
Or set it from Python before using the library:
import os
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
The library can convert extracted tables to CSV format using an LLM:
- LLM-powered conversion using GPT-4o-mini
- Clean CSV formatting with proper headers
- Data type preservation (numbers, text, percentages)
- Hierarchical structure handling
- Descriptive filenames with page numbers and table descriptions
from pdf_visual_extraction import convert_tables_to_csv
# Convert all tables to CSV
csv_results = convert_tables_to_csv(final_data, "output", "document")
print(f"Converted {csv_results['converted_tables']} tables to CSV")
output/
├── csv_exports/
│   ├── document_page_001_Financial_Results.csv
│   ├── document_page_002_Revenue_Breakdown.csv
│   └── document_csv_conversion_summary.json
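The exported files are plain CSV, so any CSV tooling can pick them up. For example, a quick look with pandas, using the example filename from the tree above:
import pandas as pd

# Load one exported table; filenames follow the
# <pdf>_page_<nnn>_<description>.csv pattern shown above.
df = pd.read_csv("output/csv_exports/document_page_001_Financial_Results.csv")
print(df.head())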
The library can intelligently clean extracted text by removing redundant table and figure content that has already been parsed by the VLM:
- LLM-powered cleaning using GPT-4o-mini
- Intelligent content removal of redundant table/figure data
- Preserves narrative content and document structure
- Maintains references to tables and figures
- Reduces duplication between text and structured data
from pdf_visual_extraction import clean_text_in_data
from openai import OpenAI
# Clean text to remove redundant content
client = OpenAI()
cleaned_data = clean_text_in_data(data, client)
- Cleaner text output without redundant table data
- Better RAG performance with focused content
- Reduced file sizes by eliminating duplication
- Improved readability of extracted text
The library generates structured JSON output with the following format:
{
  "pdf_name": "document-name",
  "total_pages": 10,
  "pages": [
    {
      "page_number": 0,
      "text": "Page text content...",
      "tables": [
        {
          "description": "Table description",
          "structured_data": "### Table: Title\n- **Category**:\n - Item: value",
          "raw_text": "Raw table text",
          "confidence": 0.95,
          "bbox": [x1, y1, x2, y2]
        }
      ]
    }
  ]
}
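Downstream code (for example, a RAG indexer) can walk this structure directly. The snippet below shows one possible chunking scheme over the schema above; the input path is a placeholder for the with-tables JSON produced by the pipeline:
import json

# Placeholder path; point this at the *_with_tables.json produced by the pipeline
with open("output/document-name_with_tables.json") as f:
    doc = json.load(f)

chunks = []
for page in doc["pages"]:
    # Keep narrative text and parsed tables as separate chunks
    if page["text"]:
        chunks.append({"page": page["page_number"], "kind": "text", "content": page["text"]})
    for table in page.get("tables", []):
        chunks.append({
            "page": page["page_number"],
            "kind": "table",
            "content": f"{table['description']}\n{table['structured_data']}",
        })

print(f"Built {len(chunks)} chunks from {doc['total_pages']} pages")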
pdf-visual-extract \
--pdf_path "earnings-report.pdf" \
--output_dir "./analysis" \
--export_md \
--export_pdf \
--max_pages 10
pdf-visual-extract \
--pdf_path "research-paper.pdf" \
--output_dir "./research" \
--export_md \
--max_pages 20
output/
├── document-name/
│   ├── text_extraction/
│   │   ├── document-name_text.json
│   │   └── document-name_with_tables.json
│   ├── images/
│   │   ├── page_000.png
│   │   └── page_001.png
│   ├── visual_detection/
│   │   └── document-name_tables_figures.json
│   ├── exports/
│   │   ├── document-name_report.md
│   │   └── document-name_report.pdf
│   └── document-name_pipeline_summary.md
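To pick up results programmatically after a run, the paths mirror the tree above. A small hypothetical helper like the following (not part of the library) gathers the key artifacts:
from pathlib import Path

def collect_outputs(output_dir, pdf_name):
    # Paths mirror the output layout documented above
    base = Path(output_dir) / pdf_name
    return {
        "text_json": base / "text_extraction" / f"{pdf_name}_text.json",
        "with_tables_json": base / "text_extraction" / f"{pdf_name}_with_tables.json",
        "visual_json": base / "visual_detection" / f"{pdf_name}_tables_figures.json",
        "markdown_report": base / "exports" / f"{pdf_name}_report.md",
    }

paths = collect_outputs("output", "document-name")
for label, path in paths.items():
    print(f"{label}: {path} (exists: {path.exists()})")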
- extract_text_from_pdf(pdf_path, output_path): Extract text from PDF
- convert_pdf_to_images(pdf_path, output_dir): Convert PDF to images
- process_images_openai(images_dir, pdf_name, output_file, max_pages): Detect visual elements
- inject_tables_into_text(text_data, visual_data): Inject tables into text data
- convert_json_to_markdown(json_data): Convert JSON to Markdown
- convert_markdown_to_pdf(markdown_file, pdf_file): Convert Markdown to PDF
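The two export helpers can be chained onto the merged JSON. A minimal sketch, assuming convert_json_to_markdown returns the Markdown report as a string (if it writes the file itself, drop the explicit write) and using a placeholder input path:
import json
from pdf_visual_extraction import convert_json_to_markdown, convert_markdown_to_pdf

# Placeholder path to the merged extraction output from the pipeline above
with open("output/document-name_with_tables.json") as f:
    data = json.load(f)

# Assumption: the helper returns the Markdown report as a string
markdown_report = convert_json_to_markdown(data)
with open("output/document-name_report.md", "w") as f:
    f.write(markdown_report)

# Render the Markdown report to PDF
convert_markdown_to_pdf("output/document-name_report.md", "output/document-name_report.pdf")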
- Python 3.8+
- OpenAI API key
- poppler-utils (for PDF to image conversion)
Ubuntu/Debian:
sudo apt-get install poppler-utils
macOS:
brew install poppler
Windows: Download poppler binaries and add to PATH.
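Before a first run, a quick sanity check of the two external prerequisites can save a debugging round trip. This is a convenience snippet, not part of the library; pdftoppm ships with poppler-utils:
import os
import shutil

# The OpenAI key must be set and poppler's pdftoppm must be on PATH
problems = []
if not os.environ.get("OPENAI_API_KEY"):
    problems.append("OPENAI_API_KEY is not set")
if shutil.which("pdftoppm") is None:
    problems.append("poppler-utils (pdftoppm) not found on PATH")
print("Prerequisites look good" if not problems else "\n".join(problems))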
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for providing the vision language models
- The open-source community for the underlying PDF processing libraries
For issues and questions:
- Open an issue on GitHub
- Check the documentation
- Review the examples
Made with ❤️ for the AI and document processing community