PDF Visual Extraction Library

A Python library for extracting text and visual elements (tables, figures) from PDF documents using OpenAI's vision models. It is intended for RAG systems, document analysis, and automated data extraction.

Features

📄 PDF Text Extraction: Extract clean text content from PDF documents
🖼️ PDF to Images: Convert PDF pages to high-quality images
🔍 Visual Element Detection: Use OpenAI's GPT-4o-mini to detect and extract tables and figures
📊 Structured Data Extraction: Extract tables in structured list format with all numerical data
📈 CSV Export: Convert extracted tables to CSV format using an LLM
📝 Multiple Export Formats: Export to JSON, Markdown, and CSV
🚀 Simple API: Easy-to-use command line interface and Python API
💰 Cost-Effective: Uses GPT-4o-mini, one of OpenAI's lower-cost vision-capable models

Installation

From Source

git clone https://github.com/tatevik-t/pdf_visual_extraction.git
cd pdf_visual_extraction
pip install -e .

Dependencies

pip install -r requirements.txt

Quick Start

Command Line Usage

# Basic extraction
pdf-visual-extract --pdf_path document.pdf --output_dir ./output

# With markdown export
pdf-visual-extract --pdf_path document.pdf --output_dir ./output --export_md

# With CSV export
pdf-visual-extract --pdf_path document.pdf --output_dir ./output --export_csv

# With text cleaning (remove redundant table/figure content)
pdf-visual-extract --pdf_path document.pdf --output_dir ./output --clean_text

# Full pipeline with all exports
pdf-visual-extract --pdf_path document.pdf --output_dir ./output --export_md --export_csv --clean_text --max_pages 5

# Batch processing multiple PDFs
python examples/batch_processing_example.py

Python API Usage

from pdf_visual_extraction import (
    extract_text_from_pdf,
    convert_pdf_to_images,
    process_images_openai,
    inject_tables_into_text,
    convert_tables_to_csv
)

# Extract text
text_data = extract_text_from_pdf("document.pdf", "output/text.json")

# Convert to images
convert_pdf_to_images("document.pdf", "output/images/")

# Detect visual elements
visual_data = process_images_openai("output/images/", "document", "output/visual.json")

# Inject tables into text
final_data = inject_tables_into_text(text_data, visual_data)

# Convert tables to CSV
csv_results = convert_tables_to_csv(final_data, "output", "document")

Batch Processing

For processing multiple PDFs efficiently:

Command Line:

# Process all PDFs in a directory with parallel processing
python batch_process_all_pdfs.py --data_dir data --output_dir batch_output --parallel_jobs 4 --max_workers 3

Python API:

from pdf_visual_extraction import run_pdf_visual_extraction

# Process multiple PDFs
pdf_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
for pdf_path in pdf_files:
    success = run_pdf_visual_extraction(
        pdf_path=pdf_path,
        output_dir="batch_output",
        export_md=True,
        export_csv=True,
        max_workers=3
    )

Features:

Parallel processing for multiple PDFs
CSV export for all extracted tables
Progress tracking and error handling
Comprehensive summaries with statistics
Skip existing files option
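
The parallel batch flow above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in the batch script: find_pdfs and run_batch are hypothetical helper names, and process_one stands in for whatever per-file call you use (for example run_pdf_visual_extraction).

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_pdfs(data_dir):
    """Return all .pdf files directly inside data_dir, sorted by name."""
    return sorted(p for p in Path(data_dir).iterdir() if p.suffix.lower() == ".pdf")


def run_batch(pdf_paths, process_one, parallel_jobs=4):
    """Run process_one(path) for each PDF and collect one success flag per file.

    An exception raised for one PDF is recorded as a failure for that file,
    so a single bad document does not stop the rest of the batch.
    """
    def safe(path):
        try:
            return str(path), bool(process_one(path))
        except Exception:
            return str(path), False

    with ThreadPoolExecutor(max_workers=parallel_jobs) as pool:
        return dict(pool.map(safe, pdf_paths))


# Hypothetical usage with the library call shown in the Python API example:
# from pdf_visual_extraction import run_pdf_visual_extraction
# results = run_batch(
#     find_pdfs("data"),
#     lambda p: run_pdf_visual_extraction(
#         pdf_path=str(p), output_dir="batch_output",
#         export_md=True, export_csv=True, max_workers=3,
#     ),
# )
```

Threads are a reasonable fit here because most of the time per document is presumably spent waiting on API responses rather than on local computation.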

Configuration

Environment Variables

Set your OpenAI API key:

export OPENAI_API_KEY="your-api-key-here"

API Key in Code

import os
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
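
Whichever way the key is set, it can help to fail early with a clear message when it is missing. The following is a minimal sketch; require_api_key is a hypothetical helper and not part of the library.

```python
import os


def require_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the extraction."
        )
    return key
```

Calling require_api_key() at the top of a script surfaces a missing key before any PDF is converted or any request is sent.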

CSV Export

The library can convert extracted tables to CSV format using an LLM:

Features:

LLM-powered conversion using GPT-4o-mini
Clean CSV formatting with proper headers
Data type preservation (numbers, text, percentages)
Hierarchical structure handling
Descriptive filenames with page numbers and table descriptions

Usage:

from pdf_visual_extraction import convert_tables_to_csv

# Convert all tables to CSV
csv_results = convert_tables_to_csv(final_data, "output", "document")
print(f"Converted {csv_results['converted_tables']} tables to CSV")

Output Structure:

output/
├── csv_exports/
│   ├── document_page_001_Financial_Results.csv
│   ├── document_page_002_Revenue_Breakdown.csv
│   └── document_csv_conversion_summary.json
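
Once the export has run, the CSV files can be read back with the standard csv module. This sketch assumes the csv_exports layout shown above and that the first row of each file is a header row; load_csv_exports is a hypothetical helper, not a library function.

```python
import csv
from pathlib import Path


def load_csv_exports(output_dir):
    """Map each CSV filename under output_dir/csv_exports to its rows (list of dicts)."""
    tables = {}
    for csv_path in sorted(Path(output_dir, "csv_exports").glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            # DictReader uses the first row as column names; all values are strings.
            tables[csv_path.name] = list(csv.DictReader(handle))
    return tables
```

The JSON conversion summary in the same directory is skipped because only files ending in .csv are matched.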

Text Cleaning

The library can clean extracted text by removing table and figure content that the vision language model (VLM) has already parsed into structured data:

Features:

LLM-powered cleaning using GPT-4o-mini
Intelligent content removal of redundant table/figure data
Preserves narrative content and document structure
Maintains references to tables and figures
Reduces duplication between text and structured data

Usage:

from pdf_visual_extraction import clean_text_in_data
from openai import OpenAI

# Clean text to remove redundant content
client = OpenAI()
cleaned_data = clean_text_in_data(data, client)

Benefits:

Cleaner text output without redundant table data
Better RAG performance with focused content
Reduced file sizes by eliminating duplication
Improved readability of extracted text

Output Format

The library generates structured JSON output with the following format:

{
  "pdf_name": "document-name",
  "total_pages": 10,
  "pages": [
    {
      "page_number": 0,
      "text": "Page text content...",
      "tables": [
        {
          "description": "Table description",
          "structured_data": "### Table: Title\n- **Category**:\n  - Item: value",
          "raw_text": "Raw table text",
          "confidence": 0.95,
          "bbox": [x1, y1, x2, y2]
        }
      ]
    }
  ]
}
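
A short sketch of how this structure might be consumed: walk the pages and keep only tables at or above a confidence threshold. The field names come from the format above; iter_tables, the threshold, and the sample values are illustrative assumptions.

```python
def iter_tables(document, min_confidence=0.0):
    """Yield (page_number, table) for every table at or above min_confidence."""
    for page in document.get("pages", []):
        for table in page.get("tables", []):
            if table.get("confidence", 0.0) >= min_confidence:
                yield page["page_number"], table


# Made-up sample data that follows the documented format.
sample = {
    "pdf_name": "document-name",
    "total_pages": 1,
    "pages": [
        {
            "page_number": 0,
            "text": "Page text content...",
            "tables": [
                {
                    "description": "Revenue by quarter",
                    "structured_data": "### Table: Revenue\n- **Q1**:\n  - Total: 100",
                    "raw_text": "Q1 100",
                    "confidence": 0.95,
                    "bbox": [10, 20, 300, 400],
                },
                {
                    "description": "Partially visible table",
                    "structured_data": "",
                    "raw_text": "",
                    "confidence": 0.6,
                    "bbox": [10, 420, 300, 500],
                },
            ],
        }
    ],
}

for page_number, table in iter_tables(sample, min_confidence=0.9):
    print(page_number, table["description"])
```

In a real run the dictionary would come from json.load on one of the generated JSON files instead of the inline sample.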

Examples

Financial Document Analysis

pdf-visual-extract \
  --pdf_path "earnings-report.pdf" \
  --output_dir "./analysis" \
  --export_md \
  --export_pdf \
  --max_pages 10

Research Paper Processing

pdf-visual-extract \
  --pdf_path "research-paper.pdf" \
  --output_dir "./research" \
  --export_md \
  --max_pages 20

File Structure

output/
├── document-name/
│   ├── text_extraction/
│   │   ├── document-name_text.json
│   │   └── document-name_with_tables.json
│   ├── images/
│   │   ├── page_000.png
│   │   └── page_001.png
│   ├── visual_detection/
│   │   └── document-name_tables_figures.json
│   ├── exports/
│   │   ├── document-name_report.md
│   │   └── document-name_report.pdf
│   └── document-name_pipeline_summary.md
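
The layout above can be turned into paths with pathlib, which is handy for checking whether a run produced what you expect. This is a sketch that assumes the directory and file names shown in the tree; expected_outputs and missing_outputs are hypothetical helpers.

```python
from pathlib import Path


def expected_outputs(output_dir, pdf_name):
    """Build the output paths shown in the tree above for one document."""
    root = Path(output_dir) / pdf_name
    return {
        "text": root / "text_extraction" / f"{pdf_name}_text.json",
        "text_with_tables": root / "text_extraction" / f"{pdf_name}_with_tables.json",
        "images_dir": root / "images",
        "visual": root / "visual_detection" / f"{pdf_name}_tables_figures.json",
        "markdown_report": root / "exports" / f"{pdf_name}_report.md",
        "pdf_report": root / "exports" / f"{pdf_name}_report.pdf",
        "summary": root / f"{pdf_name}_pipeline_summary.md",
    }


def missing_outputs(output_dir, pdf_name, keys=("text", "visual", "summary")):
    """Return the keys whose expected path does not exist on disk."""
    paths = expected_outputs(output_dir, pdf_name)
    return [key for key in keys if not paths[key].exists()]
```

The report files only appear when the matching export options are used, so the default keys cover just the files the tree suggests are always written.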

API Reference

Core Functions

extract_text_from_pdf(pdf_path, output_path): Extract text from PDF
convert_pdf_to_images(pdf_path, output_dir): Convert PDF to images
process_images_openai(images_dir, pdf_name, output_file, max_pages): Detect visual elements
inject_tables_into_text(text_data, visual_data): Inject tables into text data
convert_json_to_markdown(json_data): Convert JSON to Markdown
convert_markdown_to_pdf(markdown_file, pdf_file): Convert Markdown to PDF

Requirements

Python 3.8+
OpenAI API key
poppler-utils (for PDF to image conversion)

System Dependencies

Ubuntu/Debian:

sudo apt-get install poppler-utils

macOS:

brew install poppler

Windows: Download the poppler binaries and add the directory that contains them to PATH.
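
To confirm that poppler is reachable before converting pages, a script can look for one of its command-line tools on PATH. This sketch checks for pdftoppm, which poppler-utils installs; that the library's image conversion relies on this particular binary is an assumption, and poppler_available is a hypothetical helper.

```python
import shutil


def poppler_available(which=shutil.which):
    """Return True if the pdftoppm binary (shipped with poppler) is on PATH."""
    return which("pdftoppm") is not None


if not poppler_available():
    print("poppler not found on PATH; install it as described above.")
```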

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

OpenAI for providing the vision language models
The open-source community for the underlying PDF processing libraries

Support

For issues and questions:

Open an issue on GitHub
Check the documentation
Review the examples

Made with ❤️ for the AI and document processing community
