A CLI tool for processing PDF documents and generating structured JSON and CSV output using the BookWyrm API. Example uses include enriching product data from brochures and exporting it to CSV files, and processing invoices to automate data entry.
The CLI tool extracts text from PDFs, uses AI to identify product information, and outputs data conforming to the OpenAI commerce feed specification.
Using the API server, you can supply your own Pydantic models with each request so the output data matches your required format.
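A minimal sketch of what that looks like, assuming Pydantic v2; the model and its fields here are hypothetical, and the generated JSON schema is what you send as the `json_schema` form field shown in the curl example below:

```python
from pydantic import BaseModel, Field

# Hypothetical model describing the fields you want extracted
class InvoiceSummary(BaseModel):
    vendor: str = Field(description="Name of the invoicing company")
    total: float = Field(description="Invoice total as a numeric value")
    due_date: str | None = Field(default=None, description="Payment due date")

# Pydantic emits the JSON schema to send with the request
print(InvoiceSummary.model_json_schema())
```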
- PDF Text Extraction: Parallel processing of multiple PDF documents
- AI-Powered Analysis: Uses BookWyrm API for intelligent product data extraction
- Structured Output: Generates CSV files following OpenAI commerce feed specification
- Validation: Built-in validation for generated CSV files
- Batch Processing: Configurable parallel processing for efficiency
- Rich CLI: User-friendly command-line interface with progress indicators
This project uses uv for dependency management. Make sure you have uv installed, then:
```bash
uv sync
```
Set your BookWyrm API key as an environment variable:
```bash
export BOOKWYRM_API_KEY=your_api_key_here
```
Alternatively, you can pass the API key directly using the `--api-key` option.
Process all PDF files in a directory and generate a product feed CSV:
```bash
uv run pdf-server process ./docs --output products.csv
```
Options:
- `--output, -o`: Output CSV file path (default: `product_feed.csv`)
- `--api-key`: BookWyrm API key (can also use the `BOOKWYRM_API_KEY` env var)
- `--max-tokens`: Maximum tokens for summarization (default: 2000)
- `--batch-size`: Number of PDFs to process in parallel (default: 5)
```bash
# Process PDFs with custom settings
uv run pdf-server process ./documents \
  --output my_products.csv \
  --max-tokens 3000 \
  --batch-size 3
```
Validate a generated CSV file against the OpenAI commerce feed specification:
```bash
uv run pdf-server validate products.csv
```
This will check for:
- Required fields presence
- Field length constraints
- Data type validation
- Business rule compliance
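As a rough illustration of these checks, here is a simplified sketch assuming pandas; the required-field set and length limits shown are illustrative (they correspond to the output fields listed further below), and the authoritative rules live in the validator itself:

```python
import pandas as pd

# Illustrative subsets; the real validator defines the full rule set
REQUIRED = ["id", "title", "description", "link", "price", "availability"]
MAX_LEN = {"title": 150, "description": 5000}

def validate_feed(path: str) -> list[str]:
    """Return a list of human-readable validation errors."""
    df = pd.read_csv(path)
    errors = [f"missing required column: {c}" for c in REQUIRED if c not in df.columns]
    for col, limit in MAX_LEN.items():
        if col in df.columns:
            too_long = df[df[col].astype(str).str.len() > limit]
            errors += [f"row {i}: {col} exceeds {limit} chars" for i in too_long.index]
    return errors
```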
Start the server for API-based processing:
```bash
uv run pdf-server serve --host 0.0.0.0 --port 8000
```
Alternatively, activate the virtual environment first:
```bash
source .venv/bin/activate
pdf-server serve --host 0.0.0.0 --port 8000
```
Use curl to process PDF brochures and extract product information:
```bash
curl -X POST "http://localhost:8000/process" \
  -H "Authorization: Bearer $BOOKWYRM_API_KEY" \
  -F "files=@docs/WAN28259GB.pdf" \
  -F "files=@docs/WGB256A2GB.pdf" \
  -F "schema_name=ProductSummary" \
  -F 'json_schema={"type":"object","properties":{"title":{"type":"string","description":"Product name or title as mentioned in the document"},"price":{"type":"number","description":"Product price as a numeric value (without currency symbol)"},"dimensions":{"type":"string","description":"Product dimensions with units (e.g. 845x598x590 mm)"}},"required":["title"]}'
```
This example:
- Uploads multiple PDF product brochures (`WAN28259GB.pdf`, `WGB256A2GB.pdf`)
- Uses a custom schema named "ProductSummary"
- Extracts title, price, and dimensions from each document
- Returns streaming JSON responses with extracted product data
The API returns a streaming response where each line contains JSON data for a processed PDF.
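A minimal sketch of consuming that stream from Python, assuming the `requests` library; the endpoint, headers, and form fields mirror the curl example above, and the truncated schema here is illustrative:

```python
import json
import os

import requests

# Illustrative schema; see the curl example for a fuller one
schema = {"type": "object",
          "properties": {"title": {"type": "string"}},
          "required": ["title"]}

with requests.post(
    "http://localhost:8000/process",
    headers={"Authorization": f"Bearer {os.environ['BOOKWYRM_API_KEY']}"},
    files=[("files", open("docs/WAN28259GB.pdf", "rb")),
           ("files", open("docs/WGB256A2GB.pdf", "rb"))],
    data={"schema_name": "ProductSummary", "json_schema": json.dumps(schema)},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:  # skip keep-alive blank lines
            print(json.loads(line))
```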
Check if the server is running:
```bash
curl http://localhost:8000/health
```
The generated CSV includes fields from the OpenAI commerce feed specification:
- `enable_search`: Controls ChatGPT search visibility
- `enable_checkout`: Enables direct purchase in ChatGPT
- `id`: Unique product identifier
- `title`: Product name (max 150 chars)
- `description`: Product description (max 5000 chars)
- `link`: Product detail page URL
- `price`: Price with currency code
- `availability`: Stock status
- `inventory_quantity`: Available quantity
- `seller_name`: Merchant name
- `seller_url`: Merchant page URL
- `return_policy`: Return policy URL
- `return_window`: Return period in days
- `gtin`: Universal product identifier
- `brand`: Product brand
- `product_category`: Category hierarchy
- `material`: Primary materials
- `weight`: Product weight
- `image_link`: Main product image
- `color`: Product color
- `size`: Product size
- And many more...
- PDF Processing: Extracts text from all PDF files in the specified directory
- AI Analysis: Uses BookWyrm API to analyze text and extract structured product information
- Data Mapping: Maps extracted data to OpenAI commerce feed schema
- CSV Generation: Outputs validated CSV file with all required fields
- Validation: Optionally validates output against specification
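Condensed into code, the flow looks roughly like this; the BookWyrm call is stubbed out, so treat it as a sketch of the stages rather than the tool's actual internals:

```python
from pathlib import Path

import pandas as pd
from pypdf import PdfReader

def extract_text(pdf_path: Path) -> str:
    # Step 1 (PDF Processing): pull text from every page
    return "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)

def build_feed(directory: str, output_csv: str) -> None:
    rows = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        text = extract_text(pdf)
        # Steps 2-3 (AI Analysis, Data Mapping) call the BookWyrm API in the
        # real tool; a placeholder row stands in for that here.
        rows.append({"id": pdf.stem, "title": pdf.stem, "description": text[:5000]})
    # Step 4 (CSV Generation)
    pd.DataFrame(rows).to_csv(output_csv, index=False)
```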
This tool integrates with the BookWyrm API for intelligent text analysis and product data extraction. The API is used to:
- Analyze PDF content for product information
- Extract structured data like titles, descriptions, prices
- Identify product categories and specifications
- Generate marketing-friendly descriptions
The tool includes comprehensive error handling:
- PDF Processing Errors: Continues processing other files if individual PDFs fail
- API Errors: Graceful fallback with basic product data structure
- Validation Errors: Clear error messages with specific field issues
- File System Errors: Proper error reporting for missing files/directories
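The per-file pattern behind the first two bullets looks roughly like this (a sketch, not the actual implementation):

```python
from pathlib import Path

from pypdf import PdfReader

def safe_extract(pdf_path: Path) -> dict:
    """Extract text, falling back to a basic record if the PDF is unreadable."""
    try:
        text = "\n".join(p.extract_text() or "" for p in PdfReader(pdf_path).pages)
        return {"id": pdf_path.stem, "text": text}
    except Exception as exc:
        # Log the failure and keep going so one bad PDF cannot stop the batch
        print(f"warning: {pdf_path.name}: {exc}; using fallback record")
        return {"id": pdf_path.stem, "text": ""}
```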
```
src/product_enrichment/
├── __init__.py                  # Package initialization
├── main.py                      # CLI application entry point
├── models.py                    # Pydantic models for data validation
├── pdf_processor.py             # PDF text extraction logic
└── product_feed_generator.py    # BookWyrm API integration and CSV generation
```
To add new fields to the product feed:
- Update the `ProductFeedItem` model in `models.py`
- Modify the extraction prompt in `product_feed_generator.py`
- Update the field mapping in `_convert_to_product_feed_item()`
- Add validation rules if needed
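For example, adding a hypothetical `warranty` field to `models.py` might look like this; the existing fields shown are a subset, and the new field's name and constraints are illustrative:

```python
from pydantic import BaseModel, Field

class ProductFeedItem(BaseModel):
    id: str
    title: str = Field(max_length=150)
    description: str = Field(max_length=5000)
    # New field; also update the extraction prompt and the mapping in
    # _convert_to_product_feed_item() in product_feed_generator.py
    warranty: str | None = Field(default=None, description="Warranty period, e.g. '2 years'")
```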
Run the CLI to test functionality:
```bash
# Test help output
uv run pdf-server --help

# Test with sample PDFs
mkdir -p test_docs
# Add some PDF files to test_docs/
uv run pdf-server process test_docs --output test_output.csv

# Validate the output
uv run pdf-server validate test_output.csv
```
- Python 3.12+
- BookWyrm API key
- PDF files for processing
- `bookwyrm`: API client for text analysis
- `typer`: CLI framework
- `rich`: Terminal formatting and progress bars
- `pydantic`: Data validation and serialization
- `pypdf`: PDF text extraction
- `pandas`: CSV generation and manipulation
[Add your license information here]
[Add contribution guidelines here]