PDF Visual Extraction Library

A Python library for extracting text and visual elements (tables, figures) from PDF documents using OpenAI's vision models. Well suited to RAG systems, document analysis, and automated data extraction.

Features

  • 📄 PDF Text Extraction: Extract clean text content from PDF documents
  • 🖼️ PDF to Images: Convert PDF pages to high-quality images
  • 🔍 Visual Element Detection: Use OpenAI's GPT-4o-mini to detect and extract tables and figures
  • 📊 Structured Data Extraction: Extract tables in structured list format with all numerical data
  • 📈 CSV Export: Convert extracted tables to CSV format using an LLM
  • 📝 Multiple Export Formats: Export to JSON, Markdown, and CSV
  • 🚀 Simple API: Easy-to-use command line interface and Python API
  • 💰 Cost-Effective: Uses GPT-4o-mini, a smaller and lower-priced model, to keep API costs down

Installation

From Source

git clone https://github.com/tatevik-t/pdf_visual_extraction.git
cd pdf_visual_extraction
pip install -e .

Dependencies

pip install -r requirements.txt

Quick Start

Command Line Usage

# Basic extraction
pdf-visual-extract --pdf_path document.pdf --output_dir ./output

# With markdown export
pdf-visual-extract --pdf_path document.pdf --output_dir ./output --export_md

# With CSV export
pdf-visual-extract --pdf_path document.pdf --output_dir ./output --export_csv

# With text cleaning (remove redundant table/figure content)
pdf-visual-extract --pdf_path document.pdf --output_dir ./output --clean_text

# Full pipeline with all exports
pdf-visual-extract --pdf_path document.pdf --output_dir ./output --export_md --export_csv --clean_text --max_pages 5

# Batch processing multiple PDFs
python examples/batch_processing_example.py

Python API Usage

from pdf_visual_extraction import (
    extract_text_from_pdf,
    convert_pdf_to_images,
    process_images_openai,
    inject_tables_into_text,
    convert_tables_to_csv
)

# Extract text
text_data = extract_text_from_pdf("document.pdf", "output/text.json")

# Convert to images
convert_pdf_to_images("document.pdf", "output/images/")

# Detect visual elements
visual_data = process_images_openai("output/images/", "document", "output/visual.json")

# Inject tables into text
final_data = inject_tables_into_text(text_data, visual_data)

# Convert tables to CSV
csv_results = convert_tables_to_csv(final_data, "output", "document")

Batch Processing

For processing multiple PDFs efficiently:

Command Line:

# Process all PDFs in a directory with parallel processing
python batch_process_all_pdfs.py --data_dir data --output_dir batch_output --parallel_jobs 4 --max_workers 3

Python API:

from pdf_visual_extraction import run_pdf_visual_extraction

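# Process multiple PDFs
pdf_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
for pdf_path in pdf_files:
    success = run_pdf_visual_extraction(
        pdf_path=pdf_path,
        output_dir="batch_output",
        export_md=True,
        export_csv=True,
        max_workers=3
    )
    # Report failures (assumes a falsy return value signals failure)
    if not success:
        print(f"Extraction failed for {pdf_path}")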

Features:

  • Parallel processing for multiple PDFs
  • CSV export for all extracted tables
  • Progress tracking and error handling
  • Comprehensive summaries with statistics
  • Skip existing files option
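
The parallel and skip-existing behaviour listed above can be sketched with the standard library alone. This is a minimal illustration, not the code in batch_process_all_pdfs.py: `process_in_parallel` is a hypothetical helper, the worker is any callable you pass in, and the skip check assumes each PDF's results live in a folder named after the PDF inside the output directory.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def process_in_parallel(pdf_paths, output_dir, worker, parallel_jobs=4, skip_existing=True):
    """Run worker(pdf_path, output_dir) for each PDF and collect a status per file.

    Hypothetical helper for illustration. `worker` is any callable that
    returns a truthy value on success.
    """
    output_dir = Path(output_dir)
    results = {}
    pending = []
    for pdf_path in map(Path, pdf_paths):
        # Assumption: results for a PDF live in output_dir/<pdf stem>/.
        if skip_existing and (output_dir / pdf_path.stem).exists():
            results[pdf_path.name] = "skipped"
        else:
            pending.append(pdf_path)

    with ThreadPoolExecutor(max_workers=parallel_jobs) as pool:
        futures = {pool.submit(worker, str(p), str(output_dir)): p for p in pending}
        for future in as_completed(futures):
            name = futures[future].name
            try:
                results[name] = "ok" if future.result() else "failed"
            except Exception as exc:  # keep the batch going if one PDF fails
                results[name] = f"error: {exc}"
    return results
```

To drive the library with it, one option is to pass a small wrapper such as `lambda path, out: run_pdf_visual_extraction(pdf_path=path, output_dir=out, export_csv=True)` as the worker.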

Configuration

Environment Variables

Set your OpenAI API key:

export OPENAI_API_KEY="your-api-key-here"

API Key in Code

import os
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
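
Since the detection, CSV, and cleaning steps all call the OpenAI API, it can help to fail early when the key is missing. A minimal sketch using only the standard library; `require_api_key` is a hypothetical helper and not part of this library:

```python
import os


def require_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    Hypothetical helper for illustration; not part of pdf_visual_extraction.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before running the extraction."
        )
    return key
```

Calling `require_api_key()` at the start of a script reads from the real environment; the optional `env` argument only exists so the check can be exercised with a plain dictionary.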

CSV Export

The library can convert extracted tables to CSV format using an LLM:

Features:

  • LLM-powered conversion using GPT-4o-mini
  • Clean CSV formatting with proper headers
  • Data type preservation (numbers, text, percentages)
  • Hierarchical structure handling
  • Descriptive filenames with page numbers and table descriptions

Usage:

from pdf_visual_extraction import convert_tables_to_csv

# Convert all tables to CSV
csv_results = convert_tables_to_csv(final_data, "output", "document")
print(f"Converted {csv_results['converted_tables']} tables to CSV")

Output Structure:

output/
├── csv_exports/
│   ├── document_page_001_Financial_Results.csv
│   ├── document_page_002_Revenue_Breakdown.csv
│   └── document_csv_conversion_summary.json
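
Because the filenames carry the page number and table description, the export folder can be indexed without opening the files. A minimal sketch, assuming the naming pattern inferred from the example names above (`<pdf_name>_page_<3-digit page>_<description>.csv`); `index_csv_exports` is a hypothetical helper:

```python
import re
from pathlib import Path

# Assumed pattern, inferred from the example filenames above:
# <pdf_name>_page_<3-digit page>_<description>.csv
CSV_NAME = re.compile(r"^(?P<pdf>.+)_page_(?P<page>\d{3})_(?P<desc>.+)\.csv$")


def index_csv_exports(csv_dir):
    """Return a sorted list of (page, description, path) for exported tables."""
    entries = []
    for path in Path(csv_dir).glob("*.csv"):
        match = CSV_NAME.match(path.name)
        if match:
            description = match["desc"].replace("_", " ")
            entries.append((int(match["page"]), description, path))
    return sorted(entries)
```

For the tree above, `index_csv_exports("output/csv_exports")` would list the two CSV files in page order and ignore the JSON summary.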

Text Cleaning

The library can clean extracted text by removing redundant table and figure content that the vision language model (VLM) has already parsed:

Features:

  • LLM-powered cleaning using GPT-4o-mini
  • Intelligent content removal of redundant table/figure data
  • Preserves narrative content and document structure
  • Maintains references to tables and figures
  • Reduces duplication between text and structured data

Usage:

from pdf_visual_extraction import clean_text_in_data
from openai import OpenAI

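# Clean text to remove redundant content.
# `data` is the extraction result, e.g. the output of inject_tables_into_text.
client = OpenAI()  # reads OPENAI_API_KEY from the environment
cleaned_data = clean_text_in_data(data, client)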

Benefits:

  • Cleaner text output without redundant table data
  • Better RAG performance with focused content
  • Reduced file sizes by eliminating duplication
  • Improved readability of extracted text

Output Format

The library generates structured JSON output with the following format:

{
  "pdf_name": "document-name",
  "total_pages": 10,
  "pages": [
    {
      "page_number": 0,
      "text": "Page text content...",
      "tables": [
        {
          "description": "Table description",
          "structured_data": "### Table: Title\n- **Category**:\n  - Item: value",
          "raw_text": "Raw table text",
          "confidence": 0.95,
          "bbox": [x1, y1, x2, y2]
        }
      ]
    }
  ]
}
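
To show how this structure can be consumed downstream, here is a minimal sketch that walks the pages and keeps tables above a confidence threshold. `list_tables` is a hypothetical helper, and the sample values are made up to match the format above:

```python
import json


def list_tables(data, min_confidence=0.0):
    """Return (page_number, description, confidence) for each detected table."""
    rows = []
    for page in data.get("pages", []):
        for table in page.get("tables", []):
            confidence = table.get("confidence", 0.0)
            if confidence >= min_confidence:
                rows.append((page["page_number"], table.get("description", ""), confidence))
    return rows


# Small in-memory document following the format above (values are made up).
sample = json.loads("""
{
  "pdf_name": "document-name",
  "total_pages": 1,
  "pages": [
    {
      "page_number": 0,
      "text": "Page text content...",
      "tables": [
        {"description": "Table description", "confidence": 0.95},
        {"description": "Low-confidence table", "confidence": 0.40}
      ]
    }
  ]
}
""")

print(list_tables(sample, min_confidence=0.9))
# [(0, 'Table description', 0.95)]
```

With a real run, the dictionary would come from `json.load` on one of the generated JSON files instead of the inline sample.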

Examples

Financial Document Analysis

pdf-visual-extract \
  --pdf_path "earnings-report.pdf" \
  --output_dir "./analysis" \
  --export_md \
  --export_pdf \
  --max_pages 10

Research Paper Processing

pdf-visual-extract \
  --pdf_path "research-paper.pdf" \
  --output_dir "./research" \
  --export_md \
  --max_pages 20

File Structure

output/
├── document-name/
│   ├── text_extraction/
│   │   ├── document-name_text.json
│   │   └── document-name_with_tables.json
│   ├── images/
│   │   ├── page_000.png
│   │   └── page_001.png
│   ├── visual_detection/
│   │   └── document-name_tables_figures.json
│   ├── exports/
│   │   ├── document-name_report.md
│   │   └── document-name_report.pdf
│   └── document-name_pipeline_summary.md
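
The layout above can be turned into a quick completeness check after a run. A minimal sketch with `pathlib`, assuming the folder and file names exactly as shown in the tree; `expected_outputs` and `missing_outputs` are hypothetical helpers:

```python
from pathlib import Path


def expected_outputs(output_dir, pdf_name):
    """Map a short label to the path where each artifact is expected."""
    root = Path(output_dir) / pdf_name
    return {
        "text": root / "text_extraction" / f"{pdf_name}_text.json",
        "with_tables": root / "text_extraction" / f"{pdf_name}_with_tables.json",
        "images_dir": root / "images",
        "visual": root / "visual_detection" / f"{pdf_name}_tables_figures.json",
        "markdown": root / "exports" / f"{pdf_name}_report.md",
        "pdf": root / "exports" / f"{pdf_name}_report.pdf",
        "summary": root / f"{pdf_name}_pipeline_summary.md",
    }


def missing_outputs(output_dir, pdf_name):
    """Return the labels whose expected path does not exist yet."""
    paths = expected_outputs(output_dir, pdf_name)
    return [label for label, path in paths.items() if not path.exists()]
```

Some entries, such as the Markdown and PDF reports, are only produced when the matching export option is enabled, so a non-empty result is not necessarily an error.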

API Reference

Core Functions

  • extract_text_from_pdf(pdf_path, output_path): Extract text from PDF
  • convert_pdf_to_images(pdf_path, output_dir): Convert PDF to images
  • process_images_openai(images_dir, pdf_name, output_file, max_pages): Detect visual elements
  • inject_tables_into_text(text_data, visual_data): Inject tables into text data
  • convert_json_to_markdown(json_data): Convert JSON to Markdown
  • convert_markdown_to_pdf(markdown_file, pdf_file): Convert Markdown to PDF

Requirements

  • Python 3.8+
  • OpenAI API key
  • poppler-utils (for PDF to image conversion)

System Dependencies

Ubuntu/Debian:

sudo apt-get install poppler-utils

macOS:

brew install poppler

Windows: Download the Poppler binaries and add the directory containing the executables to PATH.
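
To confirm the system dependency is visible before running a conversion, a quick PATH check can help. A minimal sketch; it assumes the conversion step needs Poppler's `pdftoppm` command (one of the tools shipped in poppler-utils), and `find_missing_tools` is a hypothetical helper:

```python
import shutil


def find_missing_tools(tools=("pdftoppm",)):
    """Return the names of command-line tools that are not found on PATH.

    Assumption: the PDF-to-image step relies on Poppler's pdftoppm.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
```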

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • OpenAI for providing the vision language models
  • The open-source community for the underlying PDF processing libraries

Support

For issues and questions:

  • Open an issue on GitHub
  • Check the documentation
  • Review the examples

Made with ❤️ for the AI and document processing community
