# 📓 The GenAI Revolution Cookbook

**Title:** Structured Data Extraction: How to Build a LangChain Pipeline

**Description:** Build an end-to-end LangChain + OpenAI pipeline for structured data extraction from long documents—schema validation, batching, retries, and production-ready code.

**📖 Read the full article:** [Structured Data Extraction: How to Build a LangChain Pipeline](https://blog.thegenairevolution.com/article/structured-data-extraction-how-to-build-a-langchain-pipeline)

---

*This Jupyter notebook contains executable code examples. Run the cells below to try the code yourself!*



Structured data extraction from invoices is one of those tasks that sounds simple until you actually try to build it. JSON formatting breaks, token limits hit you in the face, fields mysteriously disappear, and before you know it, you're burning through API credits like there's no tomorrow. I've been down this road more times than I care to admit, and this tutorial walks you through building a production\-ready invoice extraction pipeline that actually handles these challenges using LangChain, OpenAI's structured output, and Pydantic validation.

By the time we're done, you'll have a complete system that extracts invoice fields reliably at scale, tracks where every piece of data came from, and gives you clear cost estimates so there are no surprises.

## Why This Approach Actually Works

Here's the thing about invoice extraction \- it's never just one LLM call and you're done. Real invoices come in every format imaginable, from pristine digital documents to barely\-legible scans that look like they went through a fax machine from 1987\. A robust pipeline needs to:

* **Enforce schema compliance** – Pydantic models validate your output structure and types, catching malformed JSON before it breaks something downstream.
* **Handle long documents** – Chunking with overlap makes sure you don't lose that one line item that happens to fall right at a chunk boundary. Aggregation brings it all back together.
* **Retry transient failures** – Rate limits and timeouts are just part of life with APIs. Exponential backoff with tenacity keeps things running when the inevitable hiccup occurs.
* **Scale with async batching** – Concurrent requests with semaphore\-based rate limiting let you maximize throughput without getting yourself rate\-limited into oblivion.
* **Cache results** – Content\-based caching skips redundant extractions. Why pay twice for the same invoice?
* **Track cost and usage** – Token counting and cost estimation help you figure out if you're using the right model or if your prompts are getting out of hand.

This design balances accuracy, speed, and cost \- the three things that actually matter when you're processing hundreds or thousands of invoices in production.

## Why LangChain with Structured Output?

LangChain's `with_structured_output` wraps OpenAI's JSON schema enforcement with Pydantic validation, which gives you:

* **Type safety** – Pydantic models catch type mismatches and missing fields at runtime, not three weeks later when accounting notices something's wrong.
* **Schema evolution** – Need to add a new field? Just update your Pydantic model. No need to rewrite all your prompt logic.
* **Ergonomics** – Way cleaner than manually parsing JSON and validating every single field yourself.

Now, there's a trade\-off here. LangChain adds another dependency to your stack. If you're trying to keep things minimal or need maximum control over every API call, OpenAI's SDK with `response_format={"type": "json_schema", "json_schema": {...}}` is lighter. But honestly, for this tutorial, LangChain's validation and retry integrations more than justify the extra dependency.

Quick note on scope: This pipeline assumes you're starting with text. Most invoices in the real world are PDFs or scans, so you'll need OCR or PDF\-to\-text preprocessing (think pdfminer, pypdf, unstructured, or AWS Textract). That's a whole other can of worms that's out of scope here, but you'll definitely need it in production.

## How It Works (The Big Picture)

The pipeline follows this flow:

1. **Input text** → Split into overlapping chunks (if it's too long)
2. **Per\-chunk extraction** → Extract fields with strict schema enforcement and retries
3. **Aggregation** → Merge partial results, deduplicate line items, track where everything came from
4. **Post\-processing** → Normalize dates and currency, validate totals
5. **Batch processing** → Async extraction with concurrency control
6. **Caching** → Store results keyed by content hash to skip redundant calls
7. **Metrics** → Count tokens and estimate cost per run

Each component is modular, so you can use just the core extraction logic if that's all you need, or scale up to batches with caching and metrics when you're ready.

## Setup \& Installation

First things first, let's get our dependencies installed (this works in Colab too):

In [None]:
!pip install -U langchain langchain-openai langchain-text-splitters "pydantic==2.*" tenacity tiktoken python-dotenv openai

Set your OpenAI API key. If you're working locally, use a `.env` file. For Colab, you can set it directly or use Colab secrets:

In [None]:
import os

# Option 1: Set directly (not recommended for shared notebooks)
# os.environ["OPENAI_API_KEY"] = "sk-..."

# Option 2: Use Colab userdata (recommended for Colab)
try:
    from google.colab import userdata
    os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
except ImportError:
    # Option 3: Load from .env for local development
    from dotenv import load_dotenv
    load_dotenv()

## Define the Schema

We use Pydantic to define what an invoice looks like. Each field gets a description to guide the LLM, and optional fields allow for partial extraction when the invoice is incomplete or weird.

In [None]:
from typing import List, Optional
from pydantic import BaseModel, Field

class LineItem(BaseModel):
    """
    Represents a single line item in an invoice.
    """
    description: str = Field(..., description="Item description as written")
    quantity: Optional[float] = Field(None, description="Numeric quantity if available")
    unit_price: Optional[float] = Field(None, description="Unit price, numeric")
    amount: Optional[float] = Field(None, description="Line total amount, numeric")

class InvoiceEntities(BaseModel):
    """
    Represents the extracted entities from an invoice.
    """
    invoice_number: Optional[str] = Field(None, description="Invoice identifier")
    vendor_name: Optional[str] = Field(None, description="Company or person issuing the invoice")
    vendor_address: Optional[str] = Field(None, description="Address block if present")
    bill_to_name: Optional[str] = Field(None, description="Recipient name")
    bill_to_address: Optional[str] = Field(None, description="Recipient address")
    invoice_date: Optional[str] = Field(None, description="Date string as found")
    due_date: Optional[str] = Field(None, description="Due date string as found")
    currency: Optional[str] = Field(None, description="Currency code/symbol if inferable")
    subtotal: Optional[float] = Field(None, description="Subtotal before tax/fees")
    tax: Optional[float] = Field(None, description="Total tax amount")
    total: Optional[float] = Field(None, description="Grand total")
    line_items: List[LineItem] = Field(default_factory=list, description="List of individual line items")
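
To see what the schema buys you, here is a minimal sketch that validates hand\-written payloads against a trimmed\-down stand\-in for the models above. `MiniInvoice` and `MiniLineItem` are hypothetical names used only for this illustration, and the block assumes Pydantic v2, as pinned in the install cell.

```python
from typing import List, Optional
from pydantic import BaseModel, Field, ValidationError

class MiniLineItem(BaseModel):
    description: str
    amount: Optional[float] = None

class MiniInvoice(BaseModel):
    invoice_number: Optional[str] = None
    total: Optional[float] = None
    line_items: List[MiniLineItem] = Field(default_factory=list)

# A well-formed payload: missing optional fields simply default to None
good = MiniInvoice.model_validate(
    {"invoice_number": "INV-1028", "line_items": [{"description": "Pens", "amount": 25}]}
)
print(good.total)                 # None
print(good.line_items[0].amount)  # 25.0

# A malformed payload: non-numeric total and a line item without a description
bad_locations = []
try:
    MiniInvoice.model_validate({"total": "seventy-six", "line_items": [{"amount": 25}]})
except ValidationError as e:
    # Each error records where in the payload the problem was found
    bad_locations = [err["loc"] for err in e.errors()]
print(bad_locations)
```

The error locations point at the exact field, including the index of the offending line item, which is what makes validation failures easy to debug.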

## Core Extraction with Structured Output

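Here's where we configure the LLM to return structured output matching our schema. Temperature is set to 0 because we want the most repeatable results possible, not creative interpretations of invoice data. That reduces run\-to\-run variation, though it doesn't strictly guarantee identical output.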

In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",  # Use "gpt-4o" for higher accuracy if needed
    temperature=0,        # Minimizes sampling randomness for repeatable output
    api_key=os.getenv("OPENAI_API_KEY"),
)

structured_llm = llm.with_structured_output(InvoiceEntities)

SYSTEM_PROMPT = (
    "You are an information extraction engine. "
    "Extract invoice fields precisely as a JSON object matching the schema. "
    "Do not include any text that is not part of the JSON."
)

def extract_invoice(text: str) -> InvoiceEntities:
    """
    Extracts invoice fields from text using the structured LLM.
    """
    return structured_llm.invoke([
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text}
    ])

Let's test it with a sample invoice:

In [None]:
sample_doc = """
INVOICE #INV-1028
Vendor: ACME Supplies Co.
Address: 123 Market Street, Springfield
Bill To: Northwind Traders
Date: 2024-08-12
Due: 2024-09-11
Items:
- Paper Reams (QTY 10) @ $4.50 each = $45.00
- Pens (QTY 20) @ $1.25 each = $25.00
Subtotal: $70.00
Tax: $6.30
Total Due: $76.30
"""

result = extract_invoice(sample_doc)
print(result.model_dump())

## Handling Long Documents with Chunking

Token limits are real, and they'll bite you when you least expect it. We chunk with overlap to preserve context between chunks using LangChain's `RecursiveCharacterTextSplitter`. Note that `chunk_size` and `chunk_overlap` are measured in characters by default, not tokens. If you're curious about how position bias can cause models to miss details in lengthy prompts, check out our analysis on placing critical info in long prompts \- it's eye\-opening.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_document(text: str, chunk_size: int = 2000, chunk_overlap: int = 200):
    """
    Splits a document into overlapping chunks for LLM processing.
    Sizes are in characters (the splitter's default length function).
    Tracks source offsets for provenance.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", " ", ""],
    )
    chunks = splitter.split_text(text)

    # Recover each chunk's character offsets in the original text
    offsets = []
    start = 0
    for chunk in chunks:
        idx = text.find(chunk, start)
        if idx == -1:
            # Chunk not found verbatim; approximate with the current search position
            offsets.append((start, start + len(chunk)))
        else:
            offsets.append((idx, idx + len(chunk)))
            # Chunks overlap, so the next one can begin before this one ends:
            # back the search position up by the overlap instead of jumping to the end
            start = max(idx + 1, idx + len(chunk) - chunk_overlap)

    return [{"id": i, "text": c, "start": s, "end": e} for i, (c, (s, e)) in enumerate(zip(chunks, offsets))]
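
To make the overlap idea concrete, here is a dependency\-free sketch of fixed\-size chunking with overlap. It is a simplified stand\-in, not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators), and `sliding_chunks` is a hypothetical helper. The point is that a line cut by one chunk boundary still appears whole in a neighboring chunk.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int):
    """Return (start, end, chunk) tuples using a fixed-size window with overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        chunks.append((start, end, text[start:end]))
        if end == len(text):
            break  # The window reached the end of the document
    return chunks

item = "ITEM: Pens $25.00"
doc = "A" * 30 + item + "B" * 30  # 77 characters, item sits at offsets 30-47

# With overlap, the middle chunk contains the whole line item
chunks = sliding_chunks(doc, chunk_size=40, chunk_overlap=20)
for start, end, chunk in chunks:
    print(start, end, item in chunk)
# 0 40 False
# 20 60 True
# 40 77 False

# Without overlap, the boundary at offset 40 cuts the item in two
no_overlap = sliding_chunks(doc, chunk_size=40, chunk_overlap=0)
print(any(item in c for _, _, c in no_overlap))  # False
```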

## Aggregating Partial Results

Now we extract from each chunk and merge the results. For scalar fields like invoice number or date, we take the first non\-null value. For line items, we deduplicate by description and amount \- because sometimes the same item shows up in multiple chunks.

In [None]:
from collections import defaultdict
from pydantic import ValidationError

def extract_invoice_chunked(text: str) -> dict:
    """
    Extracts invoice data from each chunk and aggregates results.
    Deduplicates line items and tracks provenance.
    """
    chunks = split_document(text)
    partials = []
    
    for ch in chunks:
        try:
            r = extract_invoice(ch["text"]).model_dump()
            r["_chunk"] = {"id": ch["id"], "start": ch["start"], "end": ch["end"]}
            partials.append(r)
        except ValidationError as e:
            # Log error for this chunk; will add retries/logging later
            partials.append({"_chunk": {"id": ch["id"], "start": ch["start"], "end": ch["end"]}, "_error": str(e)})

    merged = {
        "invoice_number": None,
        "vendor_name": None,
        "vendor_address": None,
        "bill_to_name": None,
        "bill_to_address": None,
        "invoice_date": None,
        "due_date": None,
        "currency": None,
        "subtotal": None,
        "tax": None,
        "total": None,
        "line_items": [],
        "_provenance": [],  # Track which chunk contributed
    }

    # For scalar fields, prefer the first non-null value found
    scalar_fields = [k for k in merged.keys() if k not in ("line_items", "_provenance")]
    for p in partials:
        if "_error" in p:
            continue
        merged["_provenance"].append(p["_chunk"])
        for f in scalar_fields:
            if merged[f] is None and p.get(f) not in (None, "", []):
                merged[f] = p.get(f)

    # Merge and deduplicate line_items by (description, amount)
    seen = set()
    for p in partials:
        if "_error" in p:
            continue
        for li in p.get("line_items", []):
            key = (li.get("description", "").strip().lower(), li.get("amount"))
            if key not in seen and li.get("description"):
                merged["line_items"].append(li)
                seen.add(key)

    return merged
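
The merge rules are easier to see on toy data. The sketch below reimplements them in a small standalone function (`merge_partials` is a hypothetical name, and the two partial results are invented for illustration): scalars take the first non\-null value, and line items are deduplicated on `(description, amount)`.

```python
def merge_partials(partials, scalar_fields):
    """Merge per-chunk dicts: first non-null scalar wins, line items are deduplicated."""
    merged = {f: None for f in scalar_fields}
    merged["line_items"] = []
    seen = set()
    for p in partials:
        for f in scalar_fields:
            if merged[f] is None and p.get(f) not in (None, ""):
                merged[f] = p[f]
        for li in p.get("line_items", []):
            # Normalize the description so "Pens" and "pens " count as the same item
            key = (li["description"].strip().lower(), li.get("amount"))
            if key not in seen:
                seen.add(key)
                merged["line_items"].append(li)
    return merged

# Two invented partial results from overlapping chunks
chunk_0 = {"invoice_number": "INV-1028", "total": None,
           "line_items": [{"description": "Paper Reams", "amount": 45.0},
                          {"description": "Pens", "amount": 25.0}]}
chunk_1 = {"invoice_number": None, "total": 76.3,
           "line_items": [{"description": "pens ", "amount": 25.0}]}  # repeat from the overlap

merged = merge_partials([chunk_0, chunk_1], ["invoice_number", "total"])
print(merged["invoice_number"], merged["total"], len(merged["line_items"]))
# INV-1028 76.3 2
```

One caveat of this key: two genuinely distinct line items with the same description and amount collapse into one, so consider adding quantity to the key if that matters for your invoices.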

## Strict Schema Enforcement with Fallback

Sometimes the first extraction fails validation. Maybe the model returns malformed JSON or gets the types wrong. We add a fallback prompt that basically says "hey, try again, but actually follow the schema this time."

In [None]:
STRICT_SYSTEM_PROMPT = (
    "You are an extraction engine. Output MUST be valid JSON only, strictly matching the schema. "
    "Do not include comments or extra keys. Numeric fields must be numbers, not strings."
)

def extract_invoice_strict(text: str) -> InvoiceEntities:
    """
    Extracts invoice fields with strict schema enforcement and fallback prompt.
    """
    try:
        # First attempt with standard prompt
        return structured_llm.invoke([
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text}
        ])
    except Exception:
        # Second attempt with stricter prompt for schema compliance
        return llm.with_structured_output(InvoiceEntities).invoke([
            {"role": "system", "content": STRICT_SYSTEM_PROMPT},
            {"role": "user", "content": text}
        ])

## Adding Retries with Exponential Backoff

Transient errors happen. Rate limits, timeouts, the occasional cosmic ray flipping a bit somewhere. We use tenacity to retry with exponential backoff, because hammering the API immediately after a rate limit error is not a winning strategy.

In [None]:
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from langchain_core.exceptions import OutputParserException
from openai import APIConnectionError, APITimeoutError, InternalServerError, RateLimitError

# Retry only errors that are plausibly transient; auth or bad-request errors should fail fast
TRANSIENT_ERRORS = (
    RateLimitError,
    APITimeoutError,
    APIConnectionError,
    InternalServerError,
    OutputParserException,
)

@retry(
    reraise=True,
    stop=stop_after_attempt(4),  # Maximum 4 attempts
    wait=wait_exponential(multiplier=1, min=1, max=8),  # Exponential backoff, capped at 8s
    retry=retry_if_exception_type(TRANSIENT_ERRORS),
)
def extract_invoice_with_retry(text: str) -> InvoiceEntities:
    """
    Extracts invoice fields with retry logic for transient errors.
    """
    return extract_invoice_strict(text)

Note: Only transient errors are worth retrying. The tuple above assumes OpenAI SDK exceptions (like RateLimitError or APITimeoutError) propagate through LangChain unchanged, which is the usual behavior with `langchain-openai`; adjust it if your setup wraps them. Keep in mind that `ChatOpenAI` also has its own `max_retries` setting, so the total number of attempts can be higher than 4.
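
If you want to see the mechanics without tenacity, here is a plain\-Python sketch of retry with exponential backoff. It illustrates the idea rather than tenacity's implementation; `retry_with_backoff`, `TransientError`, and `flaky` are hypothetical names, and the sleep function is injectable so the example runs instantly.

```python
import time

class TransientError(Exception):
    """Stand-in for a rate limit or timeout."""

def retry_with_backoff(fn, max_attempts=4, base=1.0, cap=8.0, sleep=time.sleep):
    """Call fn(); on TransientError wait base * 2**(attempt - 1), capped, then retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            sleep(min(cap, base * 2 ** (attempt - 1)))

# Simulate a call that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return "ok"

waits = []
result = retry_with_backoff(flaky, sleep=waits.append)  # Record waits instead of sleeping
print(result, waits)  # ok [1.0, 2.0]
```

Production setups often add random jitter to the wait so that many workers don't retry in lockstep; tenacity offers `wait_random_exponential` for that.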

## Logging Failures for Debugging

When extraction fails, you want to know why. We log failures with document ID, input hash, and error details. Save a sample of the failed input too \- trust me, you'll thank yourself later when debugging.

In [None]:
import logging
import hashlib
from pathlib import Path

logging.basicConfig(level=logging.INFO)
FAIL_DIR = Path("failures")
FAIL_DIR.mkdir(exist_ok=True)

def log_failure(doc_id: str, text: str, error: Exception):
    """
    Logs extraction failure details and saves a sample of the failed input.
    """
    h = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    logging.error(f"[extract-fail] doc={doc_id} hash={h} error={type(error).__name__}: {error}")
    (FAIL_DIR / f"{doc_id}_{h}.txt").write_text(text[:4000], encoding="utf-8")

## Async Batch Processing with Concurrency Control

Processing one invoice at a time is fine for testing, but in production you need concurrency. We use asyncio with a semaphore to cap concurrent requests, and we catch and log failures per document so one bad invoice doesn't take down the whole batch.

In [None]:
import time
import asyncio
from typing import Tuple

class Extractor:
    """
    Async batch extractor for invoice documents.
    Combines structured output, client-level retries, and error logging.
    """
    def __init__(self, model: str = "gpt-4o-mini", concurrency: int = 8):
        # max_retries lets the underlying OpenAI client retry transient errors
        self.llm = ChatOpenAI(model=model, temperature=0, max_retries=3, api_key=os.getenv("OPENAI_API_KEY"))
        # include_raw=True also returns the raw AIMessage, which carries token usage
        self.structured_llm = self.llm.with_structured_output(InvoiceEntities, include_raw=True)
        self.sem = asyncio.Semaphore(concurrency)  # Limit concurrent requests

    async def extract_one(self, doc_id: str, text: str) -> Tuple[str, dict, dict]:
        """
        Extracts a single document asynchronously with error handling.
        """
        async with self.sem:
            t0 = time.perf_counter()
            try:
                out = await self.structured_llm.ainvoke([
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": text}
                ])
                # With include_raw=True, parsing errors are returned instead of raised
                if out.get("parsing_error") is not None:
                    raise out["parsing_error"]
                elapsed = time.perf_counter() - t0
                usage = getattr(out["raw"], "usage_metadata", None) or {}
                meta = {"token_usage": {
                    "input_tokens": usage.get("input_tokens", 0),
                    "output_tokens": usage.get("output_tokens", 0),
                }}
                return doc_id, out["parsed"].model_dump(), {"ok": True, "elapsed": elapsed, "meta": meta}
            except Exception as e:
                log_failure(doc_id, text, e)
                elapsed = time.perf_counter() - t0
                return doc_id, {}, {"ok": False, "elapsed": elapsed, "error": str(e)}

async def process_batch(docs: list, model="gpt-4o-mini", concurrency=8):
    """
    Processes a batch of documents asynchronously.
    """
    ex = Extractor(model=model, concurrency=concurrency)
    tasks = [ex.extract_one(doc_id, text) for doc_id, text in docs]
    return await asyncio.gather(*tasks)
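
The semaphore pattern is easy to verify without calling any API. In this sketch a fake extractor (`fake_extract`, hypothetical) just sleeps briefly, and a counter records the highest number of tasks that were ever in flight at once.

```python
import asyncio

async def run_limited(n_tasks: int, concurrency: int):
    """Run n_tasks fake extractions; return (peak tasks in flight, results)."""
    sem = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def fake_extract(i: int) -> int:
        nonlocal in_flight, peak
        async with sem:                  # At most `concurrency` tasks pass this point
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)    # Stand-in for the network call
            in_flight -= 1
            return i

    # gather returns results in the same order as the inputs
    results = await asyncio.gather(*(fake_extract(i) for i in range(n_tasks)))
    return peak, results

# In a script:        peak, results = asyncio.run(run_limited(20, 4))
# In a notebook cell: peak, results = await run_limited(20, 4)
```

With 20 tasks and a concurrency of 4, the peak never exceeds 4, and the results still come back in input order, which is why `process_batch` can pair them with document IDs safely.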

## Caching Results to Save Cost

Why pay to extract the same invoice twice? We cache results on disk keyed by a content hash. And if you want to get really fancy with caching \- including semantic caching with embeddings to further reduce LLM spend \- we've got a walkthrough on implementing semantic cache with Redis Vector that goes deep on this.

In [None]:
import json
import hashlib
from pathlib import Path

CACHE_DIR = Path("cache")
CACHE_DIR.mkdir(exist_ok=True)

def cache_key(text: str, model: str) -> str:
    """
    Generates a cache key based on model and text content.
    """
    h = hashlib.sha256((model + "||" + text).encode("utf-8")).hexdigest()
    return h

def get_cached(key: str):
    """
    Retrieves cached extraction result if available.
    """
    p = CACHE_DIR / f"{key}.json"
    return json.loads(p.read_text()) if p.exists() else None

def set_cached(key: str, data: dict):
    """
    Stores extraction result in cache, including provenance and metadata.
    """
    p = CACHE_DIR / f"{key}.json"
    p.write_text(json.dumps(data), encoding="utf-8")

async def process_batch_with_cache(docs: list, model="gpt-4o-mini", concurrency=8):
    """
    Processes a batch of documents with disk cache.
    """
    ex = Extractor(model=model, concurrency=concurrency)
    results = []
    
    for doc_id, text in docs:
        key = cache_key(text, model)
        cached = get_cached(key)
        if cached:
            results.append((doc_id, cached, {"ok": True, "elapsed": 0.0, "cached": True, "meta": cached.get("_meta", {})}))
            continue
        results.append((doc_id, None, None))

    async_tasks = []
    for (doc_id, text), (rid, cached, meta) in zip(docs, results):
        if cached is None:
            async_tasks.append(ex.extract_one(doc_id, text))
    
    fresh = await asyncio.gather(*async_tasks)

    # Stitch back, set cache
    out = []
    fresh_iter = iter(fresh)
    for (doc_id, text), (rid, cached, meta) in zip(docs, results):
        if cached is None:
            did, data, info = next(fresh_iter)
            if info["ok"]:
                data["_meta"] = info.get("meta", {})
                set_cached(cache_key(text, model), data)
            out.append((did, data, info))
        else:
            out.append((doc_id, cached, meta))
    
    return out
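
A quick way to convince yourself the cache key behaves as intended: the same text and model always map to the same key, and changing either one produces a different key. The sketch below is standalone; `demo_key` is a hypothetical helper that mirrors `cache_key`, and the round trip uses a temporary directory rather than the real cache folder.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def demo_key(text: str, model: str) -> str:
    """Content hash over model + text, mirroring cache_key above."""
    return hashlib.sha256((model + "||" + text).encode("utf-8")).hexdigest()

invoice = "INVOICE #INV-1028\nTotal Due: $76.30"
k1 = demo_key(invoice, "gpt-4o-mini")
k2 = demo_key(invoice, "gpt-4o-mini")
k3 = demo_key(invoice, "gpt-4o")             # Different model, different key
k4 = demo_key(invoice + " ", "gpt-4o-mini")  # Even one extra space changes the key

print(k1 == k2, k1 == k3, k1 == k4)  # True False False

# Round trip through a throwaway cache directory
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / f"{k1}.json"
    path.write_text(json.dumps({"total": 76.3}), encoding="utf-8")
    restored = json.loads(path.read_text(encoding="utf-8"))

print(restored)  # {'total': 76.3}
```

Because the key is an exact content hash, trivial whitespace differences cause cache misses. And since the key covers only the model and the text, it's worth clearing the cache whenever you change the prompt or the schema.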

## Monitoring Token Usage and Cost

Count tokens using tiktoken and estimate cost based on model pricing. Always double\-check current pricing at OpenAI's pricing page because these things change.

In [None]:
import tiktoken

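def count_tokens(text: str, model_encoding: str = "o200k_base"):
    """
    Counts the number of tokens in a text string.
    gpt-4o and gpt-4o-mini use the o200k_base encoding; older models such as
    gpt-4-turbo and gpt-3.5-turbo use cl100k_base.
    """
    enc = tiktoken.get_encoding(model_encoding)
    return len(enc.encode(text))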

# Example price map (USD per 1K tokens); a snapshot that may be outdated, so always verify with the OpenAI pricing page
PRICE_PER_1K = {
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "gpt-4o": {"input": 0.005, "output": 0.015},
}

def estimate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """
    Estimates the cost of a request based on token usage and model.
    """
    p = PRICE_PER_1K.get(model)
    if not p:
        return 0.0
    return (input_tokens / 1000.0) * p["input"] + (output_tokens / 1000.0) * p["output"]
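
As a worked example of the arithmetic, suppose a batch of 1,000 invoices averages 400 input tokens and 250 output tokens each (hypothetical numbers chosen for illustration), priced with the example gpt\-4o\-mini rates from the map above. The block re\-declares those rates so it runs on its own.

```python
# Example rates in USD per 1K tokens, copied from the price map above (verify before relying on them)
RATES = {"input": 0.00015, "output": 0.0006}

n_docs = 1_000
avg_input_tokens = 400    # Hypothetical average per invoice
avg_output_tokens = 250   # Hypothetical average per invoice

input_cost = n_docs * avg_input_tokens / 1000 * RATES["input"]
output_cost = n_docs * avg_output_tokens / 1000 * RATES["output"]
total_cost = input_cost + output_cost

print(f"input ${input_cost:.2f} + output ${output_cost:.2f} = ${total_cost:.2f}")
# input $0.06 + output $0.15 = $0.21
```

At these example rates an output token costs four times as much as an input token, so in this scenario the extracted JSON accounts for most of the bill even though it is shorter than the invoice text.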

## Post\-Processing: Normalize and Validate

This is where we clean up the extracted data. Normalize currency symbols, parse dates consistently, and validate that invoice totals actually add up. This catches a surprising number of extraction errors.

In [None]:
import re
from datetime import datetime

def normalize_currency(s: str | None) -> str | None:
    """
    Normalizes currency symbols to standard codes.
    """
    if not s:
        return None
    s2 = s.strip().upper()
    if s2 in {"$", "USD"}:
        return "USD"
    if s2 in {"€", "EUR"}:
        return "EUR"
    if s2 in {"£", "GBP"}:
        return "GBP"
    return s2

def normalize_date(s: str | None) -> str | None:
    """
    Normalizes date strings to YYYY-MM-DD format.
    """
    if not s:
        return None
    for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y", "%m/%d/%Y", "%Y-%m-%dT%H:%M:%S"):
        try:
            return datetime.strptime(s.strip()[:19], fmt).strftime("%Y-%m-%d")
        except Exception:
            continue
    return s  # Return as-is if unknown

def postprocess_invoice(inv: dict) -> dict:
    """
    Postprocesses invoice dict: normalizes currency, dates, and checks totals.
    """
    inv["currency"] = normalize_currency(inv.get("currency"))
    inv["invoice_date"] = normalize_date(inv.get("invoice_date"))
    inv["due_date"] = normalize_date(inv.get("due_date"))
    
    # Sanity check totals
    try:
        subtotal = float(inv["subtotal"]) if inv.get("subtotal") is not None else None
        tax = float(inv["tax"]) if inv.get("tax") is not None else None
        total = float(inv["total"]) if inv.get("total") is not None else None
        if subtotal is not None and tax is not None and total is not None:
            if abs((subtotal + tax) - total) > 0.05:
                inv["_warning"] = "Totals may not add up"
    except Exception:
        inv["_warning"] = "Totals not numeric"
    
    return inv
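
The same sanity check can be extended to line items. The sketch below uses a hypothetical `check_line_items` helper, which is not part of the pipeline above, to compare the sum of line amounts with the stated subtotal using the same 0.05 tolerance. It runs on the numbers from the sample invoice.

```python
from typing import Optional

def check_line_items(inv: dict, tolerance: float = 0.05) -> Optional[str]:
    """Return a warning string if line item amounts do not sum to the subtotal."""
    subtotal = inv.get("subtotal")
    amounts = [li.get("amount") for li in inv.get("line_items", [])]
    if subtotal is None or not amounts or any(a is None for a in amounts):
        return None  # Not enough data to check
    line_sum = sum(amounts)
    if abs(line_sum - subtotal) > tolerance:
        return f"Line items sum to {line_sum:.2f}, but subtotal is {subtotal:.2f}"
    return None

sample = {
    "subtotal": 70.0,
    "line_items": [
        {"description": "Paper Reams", "amount": 45.0},
        {"description": "Pens", "amount": 25.0},
    ],
}
print(check_line_items(sample))  # None

# Simulate a line item lost at a chunk boundary
sample["line_items"].pop()
print(check_line_items(sample))  # Line items sum to 45.00, but subtotal is 70.00
```

A check like this is a cheap way to notice when chunking or deduplication has dropped a line item.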

## Run and Validate

Let's generate some synthetic invoices for testing and run the full pipeline with caching and metrics.

In [None]:
import random

def synthetic_invoice(i: int) -> str:
    """
    Generates a synthetic invoice for testing.
    """
    base = f"""
INVOICE #INV-{1000+i}
Vendor: Vendor {i} LLC
Address: {i} Main Street, Metropolis
Bill To: Customer {i%7}
Date: 2024-09-{(i%28)+1:02d}
Due: 2024-10-{(i%28)+1:02d}
Items:
- Widget A (QTY {1+i%5}) @ ${3.5 + (i%3)*0.75:.2f} each = ${((1+i%5)*(3.5 + (i%3)*0.75)):.2f}
- Widget B (QTY {2+i%3}) @ ${1.25 + (i%4)*0.5:.2f} each = ${((2+i%3)*(1.25 + (i%4)*0.5)):.2f}
Subtotal: $100.00
Tax: $8.25
Total Due: $108.25
"""
    # Inject noise for some invoices
    if i % 9 == 0:
        base += "\nNote: Prices include a 5% eco-fee."
    if i % 11 == 0:
        base = base.replace("Total Due", "Grand Total")
    return base

async def run_demo(n_docs=60, model="gpt-4o-mini", concurrency=8):
    """
    Runs a demo batch extraction and prints metrics.
    """
    docs = [(f"doc-{i}", synthetic_invoice(i)) for i in range(n_docs)]
    t0 = time.perf_counter()
    results = await process_batch_with_cache(docs, model=model, concurrency=concurrency)
    elapsed = time.perf_counter() - t0

    ok = sum(1 for _, _, info in results if info.get("ok"))
    fail = n_docs - ok
    total_input_tokens = 0
    total_output_tokens = 0
    est_cost = 0.0

    # Prefer usage from metadata if available; otherwise estimate input tokens only
    for (doc_id, data, info), (_, text) in zip(results, docs):
        if info.get("cached"):
            continue  # Cache hits made no API call, so they cost nothing this run
        meta = info.get("meta", {})
        usage = meta.get("token_usage", {}) if isinstance(meta, dict) else {}
        in_tok = usage.get("input_tokens", 0) or count_tokens(text)
        out_tok = usage.get("output_tokens", 0)
        total_input_tokens += in_tok
        total_output_tokens += out_tok
        est_cost += estimate_cost(in_tok, out_tok, model)

    print(f"Processed: {n_docs} docs")
    print(f"Success: {ok}, Failures: {fail}")
    print(f"Elapsed: {elapsed:.2f}s, Avg per doc: {elapsed/n_docs:.2f}s")
    print(f"Tokens ~ input: {total_input_tokens}, output: {total_output_tokens}")
    print(f"Estimated cost: ${est_cost:.4f}")
    print("⚠️ Verify current pricing at https://openai.com/pricing and adjust PRICE_PER_1K accordingly.")
    
    return results

Run the demo in a notebook cell:

In [None]:
# For notebooks/Colab, use await directly or nest_asyncio if needed
results = await run_demo(n_docs=60, model="gpt-4o-mini", concurrency=8)

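And inspect a single result. Warnings come from `postprocess_invoice`, and `_provenance` is only populated by `extract_invoice_chunked`, so it stays empty for documents that went through the batch path:

In [None]:
doc_id, data, info = results[0]
if info.get("ok"):
    data = postprocess_invoice(data)  # Normalizes fields and adds _warning if totals look off
print(f"Document: {doc_id}")
print(f"Extracted: {data}")
print(f"Provenance: {data.get('_provenance', [])}")
print(f"Warnings: {data.get('_warning', 'None')}")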

## Model Selection: Accuracy vs. Cost

Picking the right model is all about balancing accuracy and cost:

* **gpt\-4 class** (like gpt\-4o, gpt\-4\-turbo): Best accuracy for complex schemas, noisy text, and multi\-entity extraction. Check the OpenAI Models docs for the latest.
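* **Smaller models** (like gpt\-4o\-mini, the default in this tutorial): Cheaper and faster, but less reliable for nuanced fields and messy layouts. Older models such as gpt\-3\.5\-turbo may not support JSON\-schema structured output at all, so check before relying on them.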

If you're not sure how to evaluate which model fits your needs, our guide on how to choose an AI model for your app breaks down performance, context length, and pricing considerations in detail.

Here are my recommended defaults after way too much experimentation:

* **Model**: gpt\-4o\-mini
* **Temperature**: 0
* **Max concurrency**: 8
* **Chunk size**: 2000 characters
* **Chunk overlap**: 200 characters
* **Retry attempts**: 4 with exponential backoff

If you're seeing frequent validation errors or missing fields, bite the bullet and upgrade to gpt\-4o. The extra cost is usually worth it.

## Conclusion

And there you have it \- a production\-ready invoice extraction pipeline that handles schema validation, long documents, retries, async batching, caching, and cost tracking. The system is modular, so use just the core extraction logic if that's all you need, or scale to thousands of documents with the batch processor when you're ready for the big leagues.

Where to go from here:

* **Add OCR/PDF preprocessing** – Integrate pdfminer, pypdf, or AWS Textract to handle the PDFs and scans that make up most of what you'll see in the real world.
* **Improve aggregation** – Use semantic similarity (embeddings) to merge duplicate line items more intelligently.
* **Deploy as an API** – Wrap the batch processor in FastAPI or Flask and you've got yourself a production service.
* **Monitor in production** – Add structured logging, alerting, and dashboards to track extraction accuracy and cost over time. Because what gets measured gets managed.