A comprehensive tool for extracting structured data from various file formats using marker and LLM services. This tool supports PDFs, images, PowerPoint presentations, Word documents, Excel files, HTML, and EPUB files.
- Multi-format Support: Process PDF, images (PNG, JPG, TIFF, etc.), PPT/PPTX, DOC/DOCX, XLS/XLSX, HTML, and EPUB files
- Flexible LLM Integration: Support for OpenAI, Claude, Gemini, Vertex AI, and Ollama services
- Custom Extraction Schemas: Define custom Pydantic schemas for specific extraction needs
- Batch Processing: Efficient processing of multiple files with progress tracking
- Multiple Output Formats: Save results as CSV or JSON
- GPU Support: Optional GPU acceleration for faster processing
- Caching: Built-in caching to reduce API costs and processing time
- Clone this repository:

  ```bash
  git clone <repository-url>
  cd structural-data-extraction-tool
  ```

- Install dependencies:

  ```bash
  pip install -e .
  ```
Note: For non-PDF files (PPTX, DOCX, XLSX, HTML, and EPUB), you need the full marker installation, which is included in this project's dependencies; the base marker installation supports only PDFs and images.
- Set up your API keys (choose one or more):

  ```bash
  # OpenAI
  export OPENAI_API_KEY="your-openai-api-key"

  # Claude
  export CLAUDE_API_KEY="your-claude-api-key"

  # Gemini
  export GEMINI_API_KEY="your-gemini-api-key"
  ```
- PDFs: Academic papers, reports, documents
- Images: PNG, JPG, JPEG, TIFF, BMP, GIF
- Presentations: PPT, PPTX
- Documents: DOC, DOCX
- Spreadsheets: XLS, XLSX
- Web: HTML, HTM
- E-books: EPUB
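If you want to pre-filter a directory before running extraction, a minimal sketch might look like the following (the `SUPPORTED_EXTENSIONS` set and `collect_supported_files` helper are illustrative, not part of the tool's API):

```python
from pathlib import Path

# Illustrative extension set mirroring the formats listed above.
SUPPORTED_EXTENSIONS = {
    ".pdf",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif",
    ".ppt", ".pptx",
    ".doc", ".docx",
    ".xls", ".xlsx",
    ".html", ".htm",
    ".epub",
}

def collect_supported_files(input_dir: str) -> list:
    """Return all files under input_dir with a supported extension."""
    return [
        path
        for path in Path(input_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    ]
```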
Basic usage from Python:

```python
from structural_extractor import StructuralDataExtractor
import os

# Initialize the extractor
extractor = StructuralDataExtractor(
    llm_config={"openai_api_key": os.getenv("OPENAI_API_KEY")}
)

# Extract from a single file
result = extractor.extract_from_file("document.pdf")

# Extract from a directory
results = extractor.extract_from_directory("input_folder")

# Save to CSV
extractor.save_to_csv(results, "output.csv")
```
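Results can also be written as JSON (one of the output formats listed above). A minimal sketch, assuming the returned results are JSON-serializable dictionaries and not relying on any built-in JSON helper:

```python
import json

# Sketch: persist the extraction results from above as JSON instead of CSV.
# Assumes `results` is the list returned by extract_from_directory.
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2, default=str)
```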
From the command line:

```bash
# Extract academic papers
python structural_extractor.py --input ./papers --schema academic --output results.csv

# Extract general content
python structural_extractor.py --input ./documents --schema general --output content.csv

# Show file statistics
python structural_extractor.py --input ./documents --stats
```
The module also provides convenience functions for common workflows:

```python
import os

from structural_extractor import extract_academic_papers, extract_general_content

# Extract academic papers with the automatic academic schema
csv_file = extract_academic_papers(
    input_dir="papers",
    output_csv="academic_results.csv",
    api_key=os.getenv("OPENAI_API_KEY")
)

# Extract general content from various formats
csv_file = extract_general_content(
    input_dir="documents",
    output_csv="general_results.csv",
    llm_service="marker.services.claude.ClaudeService",
    api_key=os.getenv("CLAUDE_API_KEY")
)
```
Define custom schemas for specific extraction needs:
```python
import os
from typing import List, Optional

from pydantic import BaseModel, Field

from structural_extractor import StructuralDataExtractor

class PresentationMetadata(BaseModel):
    title: Optional[str] = Field(None, description="Presentation title")
    presenter: Optional[str] = Field(None, description="Presenter name")
    topics: List[str] = Field(default_factory=list, description="Main topics")
    slide_count: Optional[int] = Field(None, description="Number of slides")

# Use the custom schema
extractor = StructuralDataExtractor(
    extraction_schema=PresentationMetadata,
    llm_config={"openai_api_key": os.getenv("OPENAI_API_KEY")}
)
```
Configuration examples for each supported LLM service, plus the available advanced options:

```python
# OpenAI
extractor = StructuralDataExtractor(
    llm_service="marker.services.openai.OpenAIService",
    llm_config={
        "openai_api_key": "your-key",
        "openai_model": "gpt-4",
        "openai_base_url": "https://api.openai.com/v1"  # optional
    }
)

# Claude
extractor = StructuralDataExtractor(
    llm_service="marker.services.claude.ClaudeService",
    llm_config={
        "claude_api_key": "your-key",
        "claude_model_name": "claude-3-sonnet-20240229"
    }
)

# Gemini
extractor = StructuralDataExtractor(
    llm_service="marker.services.gemini.GeminiService",
    llm_config={"gemini_api_key": "your-key"}
)

# Vertex AI
extractor = StructuralDataExtractor(
    llm_service="marker.services.vertex.GoogleVertexService",
    llm_config={"vertex_project_id": "your-project-id"}
)

# Ollama
extractor = StructuralDataExtractor(
    llm_service="marker.services.ollama.OllamaService",
    llm_config={
        "ollama_base_url": "http://localhost:11434",
        "ollama_model": "llama2"
    }
)

# Advanced options
extractor = StructuralDataExtractor(
    llm_service="marker.services.openai.OpenAIService",
    llm_config={"openai_api_key": "your-key"},
    extraction_schema=CustomSchema,
    output_dir="./results",
    use_gpu=True,         # enable GPU acceleration
    max_pages=10,         # limit the number of pages processed
    cache_dir="./.cache"  # cache directory
)
```
The tool flattens nested data structures for CSV compatibility:
- Lists are converted to semicolon-separated strings
- Dictionaries are converted to JSON strings
- Metadata includes source file information
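As a rough illustration of these flattening rules (a minimal sketch, not the tool's actual implementation):

```python
import json

def flatten_for_csv(record: dict) -> dict:
    """Sketch of the CSV flattening rules described above:
    lists become semicolon-separated strings, dicts become JSON strings."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, list):
            flat[key] = "; ".join(str(item) for item in value)
        elif isinstance(value, dict):
            flat[key] = json.dumps(value)
        else:
            flat[key] = value
    return flat

# Example: {"topics": ["NLP", "OCR"], "authors": {"first": "Ada"}}
# becomes  {"topics": "NLP; OCR", "authors": '{"first": "Ada"}'}
```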
JSON output preserves the full data structure, including nested objects and arrays.
See `example_usage.py` for comprehensive examples, including:
- Basic usage patterns
- Custom schema definitions
- Multiple LLM service configurations
- Batch processing workflows
- Financial document extraction
- Presentation metadata extraction
Performance tips:

- GPU Usage: Enable GPU for faster processing of large documents
- Batch Size: Adjust batch size based on memory constraints
- Page Limits: Set `max_pages` for documents where the metadata appears in the early pages
- Caching: Results are automatically cached to avoid duplicate API calls (see the sketch below)
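The kind of content-keyed, on-disk cache this describes could be sketched as follows (illustrative only; the tool's built-in cache may work differently):

```python
import hashlib
import json
from pathlib import Path

def cached_extract(extractor, file_path: str, cache_dir: str = "./.cache") -> dict:
    """Sketch of result caching keyed on file contents to avoid duplicate API calls."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)

    digest = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    cache_file = cache / f"{digest}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text())  # cache hit: no API call made

    result = extractor.extract_from_file(file_path)  # cache miss: run extraction
    cache_file.write_text(json.dumps(result, default=str))
    return result
```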
The tool includes robust error handling:
- Unsupported files are skipped with warnings
- API errors are logged and processing continues
- Detailed logging for troubleshooting
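As an illustration of this skip-and-continue behavior, the calling pattern looks roughly like this (a sketch of the pattern, not the tool's internal code):

```python
import logging

logger = logging.getLogger("structural_extractor")

def extract_tolerantly(extractor, file_paths):
    """Sketch: skip files that fail (unsupported format, API error, ...)
    with a logged warning and keep processing the rest."""
    results = []
    for file_path in file_paths:
        try:
            results.append(extractor.extract_from_file(file_path))
        except Exception as exc:
            logger.warning("Skipping %s: %s", file_path, exc)
    return results
```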
Full command-line reference:

```bash
python structural_extractor.py [OPTIONS]
```

Options:

- `--input, -i`: Input directory (required)
- `--output, -o`: Output CSV file
- `--llm-service`: LLM service to use
- `--api-key`: API key for the LLM service
- `--schema`: Extraction schema (`academic` or `general`)
- `--stats`: Show file statistics only
Environment variables:

- `OPENAI_API_KEY`: OpenAI API key
- `CLAUDE_API_KEY`: Anthropic Claude API key
- `GEMINI_API_KEY`: Google Gemini API key
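One way to wire these variables to a service at runtime is a small selection helper like the one below (a hypothetical helper; the service paths and config keys match the configuration examples earlier in this README):

```python
import os

from structural_extractor import StructuralDataExtractor

def llm_settings_from_env():
    """Pick an LLM service and config based on which API key is set.
    Hypothetical helper; adjust the precedence to your needs."""
    if os.getenv("OPENAI_API_KEY"):
        return ("marker.services.openai.OpenAIService",
                {"openai_api_key": os.environ["OPENAI_API_KEY"]})
    if os.getenv("CLAUDE_API_KEY"):
        return ("marker.services.claude.ClaudeService",
                {"claude_api_key": os.environ["CLAUDE_API_KEY"]})
    if os.getenv("GEMINI_API_KEY"):
        return ("marker.services.gemini.GeminiService",
                {"gemini_api_key": os.environ["GEMINI_API_KEY"]})
    raise RuntimeError("No supported LLM API key found in the environment")

service, config = llm_settings_from_env()
extractor = StructuralDataExtractor(llm_service=service, llm_config=config)
```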
Core dependencies:

- `marker-pdf`: Document processing and conversion
- `pydantic`: Schema definition and validation
- `openai`: OpenAI API client
- `tqdm`: Progress tracking
- `pandas`: Data manipulation (optional)
License: [Add your license information here]
Contributions welcome! Please read the contributing guidelines and submit pull requests for any improvements.