A CLI tool for processing PDF documents and generating structured JSON and CSV output using the BookWyrm API. Example uses include enriching product data from brochures and exporting it to CSV files, and processing invoices to automate data entry.
The CLI tool extracts text from PDFs, uses AI to identify product information, and outputs data conforming to the OpenAI commerce feed specification.
Using the API server, you can supply your own Pydantic models with each request so the output data matches your required format.
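A minimal sketch of what that looks like, assuming Pydantic v2; the model and its fields here are hypothetical, and the generated JSON schema is what you send as the `json_schema` form field shown in the curl example below:

```python
from pydantic import BaseModel, Field

# Hypothetical model describing the fields you want extracted
class InvoiceSummary(BaseModel):
    vendor: str = Field(description="Name of the invoicing company")
    total: float = Field(description="Invoice total as a numeric value")
    due_date: str | None = Field(default=None, description="Payment due date")

# Pydantic emits the JSON schema to send with the request
print(InvoiceSummary.model_json_schema())
```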
- PDF Text Extraction: Parallel processing of multiple PDF documents
- AI-Powered Analysis: Uses BookWyrm API for intelligent product data extraction
- Structured Output: Generates CSV files following OpenAI commerce feed specification
- Validation: Built-in validation for generated CSV files
- Batch Processing: Configurable parallel processing for efficiency
- Rich CLI: User-friendly command-line interface with progress indicators
This project uses uv for dependency management. Make sure you have uv installed, then:
```bash
uv sync
```
Set your BookWyrm API key as an environment variable:
```bash
export BOOKWYRM_API_KEY=your_api_key_here
```
Alternatively, you can pass the API key directly using the `--api-key` option.
Process all PDF files in a directory and generate a product feed CSV:
```bash
uv run pdf-server process ./docs --output products.csv
```
Options:
- `--output, -o`: Output CSV file path (default: `product_feed.csv`)
- `--api-key`: BookWyrm API key (can also use the `BOOKWYRM_API_KEY` env var)
- `--max-tokens`: Maximum tokens for summarization (default: 2000)
- `--batch-size`: Number of PDFs to process in parallel (default: 5)
```bash
# Process PDFs with custom settings
uv run pdf-server process ./documents \
  --output my_products.csv \
  --max-tokens 3000 \
  --batch-size 3
```
Validate a generated CSV file against the OpenAI commerce feed specification:
```bash
uv run pdf-server validate products.csv
```
This will check for:
- Required fields presence
- Field length constraints
- Data type validation
- Business rule compliance
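As a rough illustration of these checks, here is a simplified sketch assuming pandas; the required-field set and length limits shown are illustrative (they correspond to the output fields listed further below), and the authoritative rules live in the validator itself:

```python
import pandas as pd

# Illustrative subsets; the real validator defines the full rule set
REQUIRED = ["id", "title", "description", "link", "price", "availability"]
MAX_LEN = {"title": 150, "description": 5000}

def validate_feed(path: str) -> list[str]:
    """Return a list of human-readable validation errors."""
    df = pd.read_csv(path)
    errors = [f"missing required column: {c}" for c in REQUIRED if c not in df.columns]
    for col, limit in MAX_LEN.items():
        if col in df.columns:
            too_long = df[df[col].astype(str).str.len() > limit]
            errors += [f"row {i}: {col} exceeds {limit} chars" for i in too_long.index]
    return errors
```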
Start the server for API-based processing:
```bash
uv run pdf-server serve --host 0.0.0.0 --port 8000
```
Alternatively, activate the virtual environment first:
```bash
source .venv/bin/activate
pdf-server serve --host 0.0.0.0 --port 8000
```
Use curl to process PDF brochures and extract product information:
```bash
curl -X POST "http://localhost:8000/process" \
  -H "Authorization: Bearer $BOOKWYRM_API_KEY" \
  -F "files=@docs/WAN28259GB.pdf" \
  -F "files=@docs/WGB256A2GB.pdf" \
  -F "schema_name=ProductSummary" \
  -F 'json_schema={"type":"object","properties":{"title":{"type":"string","description":"Product name or title as mentioned in the document"},"price":{"type":"number","description":"Product price as a numeric value (without currency symbol)"},"dimensions":{"type":"string","description":"Product dimensions with units (e.g. 845x598x590 mm)"}},"required":["title"]}'
```
This example:
- Uploads multiple PDF product brochures (`WAN28259GB.pdf`, `WGB256A2GB.pdf`)
- Uses a custom schema named "ProductSummary"
- Extracts title, price, and dimensions from each document
- Returns streaming JSON responses with extracted product data
The API returns a streaming response where each line contains JSON data for a processed PDF.
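A minimal sketch of consuming that stream from Python, assuming the `requests` library; the endpoint, headers, and form fields mirror the curl example above, and the truncated schema here is illustrative:

```python
import json
import os

import requests

# Illustrative schema; see the curl example for a fuller one
schema = {"type": "object",
          "properties": {"title": {"type": "string"}},
          "required": ["title"]}

with requests.post(
    "http://localhost:8000/process",
    headers={"Authorization": f"Bearer {os.environ['BOOKWYRM_API_KEY']}"},
    files=[("files", open("docs/WAN28259GB.pdf", "rb")),
           ("files", open("docs/WGB256A2GB.pdf", "rb"))],
    data={"schema_name": "ProductSummary", "json_schema": json.dumps(schema)},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:  # skip keep-alive blank lines
            print(json.loads(line))
```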
Check if the server is running:
```bash
curl http://localhost:8000/health
```
The generated CSV includes fields from the OpenAI commerce feed specification:
- `enable_search`: Controls ChatGPT search visibility
- `enable_checkout`: Enables direct purchase in ChatGPT
- `id`: Unique product identifier
- `title`: Product name (max 150 chars)
- `description`: Product description (max 5000 chars)
- `link`: Product detail page URL
- `price`: Price with currency code
- `availability`: Stock status
- `inventory_quantity`: Available quantity
- `seller_name`: Merchant name
- `seller_url`: Merchant page URL
- `return_policy`: Return policy URL
- `return_window`: Return period in days
- `gtin`: Universal product identifier
- `brand`: Product brand
- `product_category`: Category hierarchy
- `material`: Primary materials
- `weight`: Product weight
- `image_link`: Main product image
- `color`: Product color
- `size`: Product size
- And many more...
- PDF Processing: Extracts text from all PDF files in the specified directory
- AI Analysis: Uses BookWyrm API to analyze text and extract structured product information
- Data Mapping: Maps extracted data to OpenAI commerce feed schema
- CSV Generation: Outputs validated CSV file with all required fields
- Validation: Optionally validates output against specification
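Condensed into code, the flow looks roughly like this; the BookWyrm call is stubbed out, so treat it as a sketch of the stages rather than the tool's actual internals:

```python
from pathlib import Path

import pandas as pd
from pypdf import PdfReader

def extract_text(pdf_path: Path) -> str:
    # Step 1 (PDF Processing): pull text from every page
    return "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)

def build_feed(directory: str, output_csv: str) -> None:
    rows = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        text = extract_text(pdf)
        # Steps 2-3 (AI Analysis, Data Mapping) call the BookWyrm API in the
        # real tool; a placeholder row stands in for that here.
        rows.append({"id": pdf.stem, "title": pdf.stem, "description": text[:5000]})
    # Step 4 (CSV Generation)
    pd.DataFrame(rows).to_csv(output_csv, index=False)
```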
This tool integrates with the BookWyrm API for intelligent text analysis and product data extraction. The API is used to:
- Analyze PDF content for product information
- Extract structured data like titles, descriptions, prices
- Identify product categories and specifications
- Generate marketing-friendly descriptions
The tool includes comprehensive error handling:
- PDF Processing Errors: Continues processing other files if individual PDFs fail
- API Errors: Graceful fallback with basic product data structure
- Validation Errors: Clear error messages with specific field issues
- File System Errors: Proper error reporting for missing files/directories
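The per-file pattern behind the first two bullets looks roughly like this (a sketch, not the actual implementation):

```python
from pathlib import Path

from pypdf import PdfReader

def safe_extract(pdf_path: Path) -> dict:
    """Extract text, falling back to a basic record if the PDF is unreadable."""
    try:
        text = "\n".join(p.extract_text() or "" for p in PdfReader(pdf_path).pages)
        return {"id": pdf_path.stem, "text": text}
    except Exception as exc:
        # Log the failure and keep going so one bad PDF cannot stop the batch
        print(f"warning: {pdf_path.name}: {exc}; using fallback record")
        return {"id": pdf_path.stem, "text": ""}
```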
```
src/product_enrichment/
├── __init__.py                  # Package initialization
├── main.py                      # CLI application entry point
├── models.py                    # Pydantic models for data validation
├── pdf_processor.py             # PDF text extraction logic
└── product_feed_generator.py    # BookWyrm API integration and CSV generation
```
To add new fields to the product feed:
- Update the `ProductFeedItem` model in `models.py`
- Modify the extraction prompt in `product_feed_generator.py`
- Update the field mapping in `_convert_to_product_feed_item()`
- Add validation rules if needed
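For example, adding a hypothetical `warranty` field to `models.py` might look like this; the existing fields shown are a subset, and the new field's name and constraints are illustrative:

```python
from pydantic import BaseModel, Field

class ProductFeedItem(BaseModel):
    id: str
    title: str = Field(max_length=150)
    description: str = Field(max_length=5000)
    # New field; also update the extraction prompt and the mapping in
    # _convert_to_product_feed_item() in product_feed_generator.py
    warranty: str | None = Field(default=None, description="Warranty period, e.g. '2 years'")
```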
Run the CLI to test functionality:
```bash
# Test help output
uv run pdf-server --help

# Test with sample PDFs
mkdir -p test_docs
# Add some PDF files to test_docs/
uv run pdf-server process test_docs --output test_output.csv

# Validate the output
uv run pdf-server validate test_output.csv
```
- Python 3.12+
- BookWyrm API key
- PDF files for processing
- `bookwyrm`: API client for text analysis
- `typer`: CLI framework
- `rich`: Terminal formatting and progress bars
- `pydantic`: Data validation and serialization
- `pypdf`: PDF text extraction
- `pandas`: CSV generation and manipulation
[Add your license information here]
[Add contribution guidelines here]