# 📓 The GenAI Revolution Cookbook

**Title:** How to Use GPT-4o for High-Quality PDF Transcription with Python

**Description:** Build a reliable, layout-preserving PDF transcription pipeline in Python with PyMuPDF and GPT-4o: step-by-step code, hybrid image+text, clean Markdown outputs.

**📖 Read the full article:** [How to Use GPT-4o for High-Quality PDF Transcription with Python](https://blog.thegenairevolution.com/article/how-to-use-gpt-4o-for-high-quality-pdf-transcription-with-python)

---

*This Jupyter notebook contains executable code examples. Run the cells below to try the code yourself.*



## What You'll Build

A Python pipeline that converts a PDF into clean, structured Markdown by combining PyMuPDF for rendering and text extraction with GPT-4o's vision capabilities. The goal is to preserve headings, tables, lists, and multi-column reading order rather than emit the flattened output typical of plain OCR.

**Prerequisites:** Python 3.10+, an OpenAI API key, and a Colab or local environment. Expect around $0.01–0.05 per page at 200 DPI depending on content complexity. Runtime is roughly 5–10 seconds per page.

**Scope note:** This pipeline works best with digital-native PDFs. Scanned PDFs with no embedded text will rely entirely on vision, which may reduce accuracy for dense or low-quality scans.

## Why This Approach Works

Pure OCR tools like Tesseract recover characters but miss layout and semantics: tables break, headings flatten, and multi-column text scrambles. Text-only extraction loses visual cues like borders and column flow.

This hybrid approach feeds GPT-4o both the embedded text (for accuracy) and the page image (for layout), letting the model reconstruct structure faithfully. The result: Markdown that mirrors the original document's hierarchy and reading order. If you've struggled with invisible characters or tokenization quirks breaking your pipeline, see our guide on tokenization pitfalls and invisible characters in prompts and RAG for normalization strategies.

For readers interested in extracting structured data directly from documents (e.g., invoices, forms), check out our walkthrough on building a structured data extraction pipeline with LLMs.

**Trade-offs:** GPT-4o costs more than OCR but delivers higher fidelity. At 200 DPI, a 10-page document costs around $0.10–0.50. For large batches, consider caching (covered in Add Caching to Reduce Costs) and parallel processing (covered in Next Steps).
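
As a rough sketch of the arithmetic behind these estimates, the helper below multiplies a page count by the per-page range quoted in the prerequisites. The function name `estimate_cost` and its default rates are illustrative assumptions based on this article's figures, not on OpenAI's price list; actual cost depends on image size and output length.

```python
def estimate_cost(
    num_pages: int,
    low_per_page: float = 0.01,   # assumed lower bound, USD per page
    high_per_page: float = 0.05,  # assumed upper bound, USD per page
) -> tuple[float, float]:
    """Return a (low, high) USD estimate for transcribing num_pages pages."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return round(num_pages * low_per_page, 2), round(num_pages * high_per_page, 2)

low, high = estimate_cost(10)
print(f"10 pages: ${low:.2f} to ${high:.2f}")
```

Running the same helper on your real page counts before a large batch gives a quick sanity check on budget.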

## How It Works

1. **Render pages to images** – PyMuPDF converts each page to PNG at 200 DPI, preserving visual layout.
2. **Extract embedded text** – PyMuPDF pulls any text layer from the PDF for accuracy.
3. **Transcribe with GPT-4o** – Send both image and text to GPT-4o with a structured prompt. The model outputs Markdown.
4. **Assemble final document** – Concatenate per-page Markdown into one file.

## Setup

Install dependencies in a single cell:

In [None]:
!pip install --quiet pymupdf pillow "openai>=1.40.0,<2"

Set your OpenAI API key. For Colab, use this cell:

In [None]:
import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

For local environments, export the key in your shell before starting Jupyter (this is a shell command, not a Python cell):

In [None]:
export OPENAI_API_KEY='your-key-here'

## Configuration

Centralize settings in a configuration cell for easy tuning:

In [None]:
CONFIG = {
    "dpi": 200,
    "model": "gpt-4o",
    "temperature": 0.0,
    "max_retries": 5,
    "initial_backoff": 2.0,
}

## Step 1: Render Pages to Images

PyMuPDF renders each page as a PNG at the configured DPI (200 by default). This resolution balances quality and payload size; higher DPI increases cost and latency without much fidelity gain for most documents.

In [None]:
import logging
from pathlib import Path
from typing import List

import fitz  # PyMuPDF
from PIL import Image

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s: %(message)s")

def ensure_dir(path: Path) -> None:
    """Create directory if it doesn't exist."""
    path.mkdir(parents=True, exist_ok=True)

def prepare_output_dirs(pdf_path: Path):
    """Set up output directories for images, text, and cache."""
    pdf_stem = pdf_path.stem
    base_dir = Path("output") / pdf_stem
    images_dir = base_dir / "images"
    txt_dir = base_dir / "txt"
    cache_dir = base_dir / ".cache"
    ensure_dir(images_dir)
    ensure_dir(txt_dir)
    ensure_dir(cache_dir)
    return base_dir, images_dir, txt_dir, cache_dir

def convert_pages_to_images(pdf_path: Path, images_dir: Path, dpi: int = 200) -> List[Path]:
    """Render each PDF page as a PNG at the specified DPI."""
    doc = fitz.open(pdf_path)
    images = []
    scale = dpi / 72  # PDF default is 72 DPI
    matrix = fitz.Matrix(scale, scale)
    
    for page_index in range(len(doc)):
        page = doc[page_index]
        # Render without alpha channel to reduce payload size
        pix = page.get_pixmap(matrix=matrix, alpha=False)
        
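        # Guard against blank or failed renders. Skipping a page here leaves the
        # image list shorter than the text list, which process_pdf reports as an error.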
        if pix.width == 0 or pix.height == 0:
            logging.warning(f"Page {page_index+1}: Rendering failed or empty page, skipping.")
            continue
        
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        out_path = images_dir / f"page_{page_index + 1:03d}.png"
        img.save(out_path, format="PNG", optimize=True)
        images.append(out_path)
        logging.info(f"Saved image: {out_path} ({out_path.stat().st_size // 1024} KB)")
    
    doc.close()
    return images
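
To get a feel for how DPI affects payload size, the sketch below computes the approximate pixel dimensions a page renders to, using the same `dpi / 72` scale factor as `convert_pages_to_images`. The helper name `rendered_size` and the US Letter page size (612 × 792 points) are illustrative assumptions; the actual pixmap may differ by a pixel because of rounding.

```python
def rendered_size(width_pt: float, height_pt: float, dpi: int = 200) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / 72  # PDF user space is 72 points per inch
    return round(width_pt * scale), round(height_pt * scale)

# US Letter is 612 x 792 points (8.5 x 11 inches)
for dpi in (100, 200, 300):
    w, h = rendered_size(612, 792, dpi)
    print(f"{dpi} DPI: {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

Pixel count grows with the square of the DPI, so moving from 200 to 300 DPI more than doubles the image area.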

## Step 2: Extract Embedded Text

Extract any text layer from the PDF. This gives GPT-4o accurate character data to anchor its transcription.

In [None]:
def extract_page_texts(pdf_path: Path, txt_dir: Path) -> List[Path]:
    """Extract embedded text from each page and save as .txt files."""
    doc = fitz.open(pdf_path)
    txt_files = []
    
    for page_index in range(len(doc)):
        page = doc[page_index]
        text = page.get_text("text") or ""
        text = text.replace("\r\n", "\n").strip()
        
        # Pages without a text layer are saved as empty files; the
        # "[No embedded text]" placeholder is added later, when the prompt is built.
        
        out_path = txt_dir / f"page_{page_index + 1:03d}.txt"
        out_path.write_text(text, encoding="utf-8")
        txt_files.append(out_path)
        logging.info(f"Saved text: {out_path}")
    
    doc.close()
    return txt_files
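
Because scanned pages come back with little or no embedded text, it can help to flag them before spending API calls. The following is a minimal sketch, assuming a hypothetical `has_usable_text` helper and an arbitrary 20-character threshold; tune the threshold against your own documents.

```python
def has_usable_text(text: str, min_chars: int = 20) -> bool:
    """Heuristic: True if the text has at least min_chars non-whitespace characters."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars

def summarize_text_layer(page_texts: list[str], min_chars: int = 20) -> dict:
    """Count pages with and without a usable text layer."""
    with_text = sum(has_usable_text(t, min_chars) for t in page_texts)
    return {
        "pages": len(page_texts),
        "with_text": with_text,
        "image_only": len(page_texts) - with_text,
    }

# Three sample pages: one with real text, one empty, one with stray characters
sample = ["Quarterly report\nRevenue grew 12% year over year.", "", "  \n 3 "]
print(summarize_text_layer(sample))
```

In the pipeline, `page_texts` could be built from the saved files with `[p.read_text(encoding="utf-8") for p in txt_files]`.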

## Step 3: Transcribe with GPT-4o

Send both the page image and extracted text to GPT-4o. The model uses the image for layout and the text for accuracy.

In [None]:
import base64
import time
from openai import OpenAI

# Initialize client with environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def encode_image_to_data_url(image_path: Path) -> str:
    """Encode PNG as base64 data URL for API input."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/png;base64,{b64}"

SYSTEM_PROMPT = (
    "You are a meticulous document transcription engine. "
    "Transcribe each page into clean, well-structured Markdown. "
    "Preserve headings, lists, tables, and reading order. "
    "Use the extracted text for accuracy but follow the visual layout from the image. "
    "Do not hallucinate content. If content is illegible, mark it clearly. "
    "If extracted text is empty or indicates no embedded text, rely solely on the image."
)

USER_TEMPLATE = (
    "Use both the page image and the extracted text below. "
    "Reconstruct the document faithfully into Markdown.\n\n"
    "Extracted text:\n\n{page_text}"
)

def transcribe_page_with_gpt4o(
    client: OpenAI,
    image_path: Path,
    text_path: Path,
    model: str = "gpt-4o",
    temperature: float = 0.0,
    max_retries: int = 5,
    initial_backoff: float = 2.0,
) -> str:
    """Send multimodal request to GPT-4o for Markdown transcription with retry logic."""
    data_url = encode_image_to_data_url(image_path)
    page_text = text_path.read_text(encoding="utf-8")
    user_text = USER_TEMPLATE.format(page_text=page_text if page_text else "[No embedded text]")
    
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model=model,
                temperature=temperature,
                messages=[
                    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": user_text},
                            {"type": "image_url", "image_url": {"url": data_url}},
                        ],
                    },
                ],
            )
            content = resp.choices[0].message.content
            # message.content is optional in the API response, so guard before strip()
            return (content or "").strip()
        except Exception as e:
            if attempt == max_retries - 1:
                logging.error(
                    f"GPT-4o request failed (attempt {attempt+1}/{max_retries}): "
                    f"{type(e).__name__} - {e}. Max retries reached."
                )
                raise
            wait = initial_backoff * (2 ** attempt)
            logging.warning(
                f"GPT-4o request failed (attempt {attempt+1}/{max_retries}): {type(e).__name__} - {e}. "
                f"Retrying in {wait:.1f}s."
            )
            time.sleep(wait)
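
The retry loop waits `initial_backoff * 2 ** attempt` seconds between attempts. As an illustration, this sketch lists the waits for a given configuration; the helper `backoff_schedule` is hypothetical and exists only to make the arithmetic visible. The final attempt re-raises instead of sleeping, so there is one fewer wait than attempts.

```python
def backoff_schedule(max_retries: int = 5, initial_backoff: float = 2.0) -> list[float]:
    """Seconds slept after each failed attempt, excluding the final attempt (which raises)."""
    return [initial_backoff * (2 ** attempt) for attempt in range(max_retries - 1)]

waits = backoff_schedule()
print(waits, "worst-case total:", sum(waits), "seconds")
```

With the defaults in `CONFIG`, a page that fails every attempt spends 30 seconds sleeping on top of request time. Adding random jitter to each wait is a common refinement when many workers retry at once.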

## Step 4: Assemble the Final Document

Loop through all pages, transcribe each, and concatenate into one Markdown file.

In [None]:
def process_pdf(pdf_path_str: str, dpi: int = 200) -> Path:
    """End-to-end pipeline: render, extract, transcribe, assemble."""
    pdf_path = Path(pdf_path_str).expanduser().resolve()
    base_dir, images_dir, txt_dir, _ = prepare_output_dirs(pdf_path)
    
    # Render and extract
    image_files = convert_pages_to_images(pdf_path, images_dir, dpi=dpi)
    text_files = extract_page_texts(pdf_path, txt_dir)
    
    if len(image_files) != len(text_files):
        raise RuntimeError("Mismatch between number of images and text files.")
    
    # Transcribe each page
    page_markdowns = []
    for idx, (img, txt) in enumerate(zip(image_files, text_files), start=1):
        logging.info(f"Transcribing page {idx}/{len(image_files)}: {img.name}")
        try:
            md = transcribe_page_with_gpt4o(
                client, img, txt,
                model=CONFIG["model"],
                temperature=CONFIG["temperature"],
                max_retries=CONFIG["max_retries"],
                initial_backoff=CONFIG["initial_backoff"],
            )
        except Exception as e:
            logging.error(f"Transcription failed for page {idx}: {e}")
            md = "[Transcription failed for this page.]"
        page_markdowns.append(f"---\n\n## Page {idx}\n\n{md}\n")
    
    # Write final transcript
    transcript_path = base_dir / "transcript.md"
    transcript_path.write_text("\n".join(page_markdowns), encoding="utf-8")
    logging.info(f"Wrote transcript: {transcript_path}")
    return transcript_path
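
Chat models sometimes wrap an entire response in a Markdown code fence, which would then appear literally in the assembled transcript. The sketch below is one way to unwrap such a response before appending it. `strip_outer_fence` is a hypothetical helper, not part of the pipeline above, and it deliberately handles only a single fence that encloses the whole response.

```python
FENCE = "`" * 3  # three backticks, built programmatically for readability

def strip_outer_fence(md: str) -> str:
    """Remove one code fence that wraps the whole response, if present."""
    stripped = md.strip()
    lines = stripped.splitlines()
    if len(lines) < 2 or not lines[0].startswith(FENCE) or lines[-1].strip() != FENCE:
        return stripped
    inner = lines[1:-1]
    # Another fence line inside means the outer pair is not a simple wrapper
    # (for example, two separate code blocks), so leave the text unchanged.
    if any(line.lstrip().startswith(FENCE) for line in inner):
        return stripped
    return "\n".join(inner).strip()

wrapped = FENCE + "markdown\n# Title\n\nSome text.\n" + FENCE
print(strip_outer_fence(wrapped))
```

A page that legitimately consists of one code listing would also be unwrapped, so treat this as a heuristic and spot-check its effect on your own documents.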

## Run and Validate

Upload a sample PDF or use a local file. For a local run, skip the upload cell and set `pdf_path` to the file's path. For Colab, upload with:

In [None]:
from google.colab import files
uploaded = files.upload()
pdf_path = list(uploaded.keys())[0]

Run the pipeline:

In [None]:
transcript = process_pdf(pdf_path, dpi=CONFIG["dpi"])
print(f"Transcript saved to: {transcript}")

Inspect outputs:

In [None]:
from itertools import islice

def list_dir(p: Path, limit: int = 10) -> None:
    """List up to `limit` files in a directory."""
    files_list = sorted(p.glob("*"))
    for f in islice(files_list, 0, limit):
        print(f.name)
    if len(files_list) > limit:
        print(f"... and {len(files_list) - limit} more")

base = Path("output") / Path(pdf_path).stem
print("Images:")
list_dir(base / "images")
print("\nText files:")
list_dir(base / "txt")

Display a rendered page:

In [None]:
from IPython.display import display

img_path = base / "images" / "page_001.png"
img = Image.open(img_path)
display(img)

Print extracted text sample:

In [None]:
txt_path = base / "txt" / "page_001.txt"
print(txt_path.read_text(encoding="utf-8")[:1000])

Print the first 80 lines of the final Markdown:

In [None]:
transcript_md = (base / "transcript.md").read_text(encoding="utf-8")
print("\n".join(transcript_md.splitlines()[:80]))

Add lightweight validation:

In [None]:
import re
# Ignore the "## Page N" headings the pipeline adds; look for headings from the model
assert re.search(r"^#{1,6} (?!Page \d+$)", transcript_md, flags=re.MULTILINE), "No headings found in transcribed content"
logging.info("Validation passed: headings detected.")
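
A heading check says little about fidelity. As a further hedge against omissions, one could measure how many words from the embedded text layer reappear in the transcription. This is a rough sketch: `word_recall` is a hypothetical helper, and any pass/fail threshold you apply to its result (0.8 might be a starting point) is an assumption to tune.

```python
import re

def _words(s: str) -> set[str]:
    """Lowercased set of word tokens in s."""
    return set(re.findall(r"\w+", s.lower()))

def word_recall(extracted_text: str, transcript_md: str) -> float:
    """Fraction of distinct words in the extracted text that also appear in the transcript."""
    source_words = _words(extracted_text)
    if not source_words:
        return 1.0  # nothing to compare against (for example, a scanned page)
    return len(source_words & _words(transcript_md)) / len(source_words)

extracted = "Annual Report 2023\nRevenue increased by 12 percent."
markdown = "# Annual Report 2023\n\nRevenue increased by **12 percent**."
print(f"recall = {word_recall(extracted, markdown):.2f}")
```

A low score on a page with a healthy text layer suggests dropped content and is worth a manual look; a high score does not prove the layout is right.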

## Add Caching to Reduce Costs

Cache transcriptions to avoid redundant API calls during re-runs or iterative testing.

In [None]:
import hashlib
import json

def page_cache_key(image_path: Path, text_path: Path, model: str, prompt_hash: str) -> str:
    """Generate cache key from image, text, model, and prompt."""
    h = hashlib.sha256()
    h.update(image_path.read_bytes())
    h.update(text_path.read_text(encoding="utf-8").encode("utf-8"))
    h.update(model.encode("utf-8"))
    h.update(prompt_hash.encode("utf-8"))
    return h.hexdigest()

def transcribe_with_cache(
    client: OpenAI,
    image_path: Path,
    text_path: Path,
    cache_dir: Path,
    model: str = "gpt-4o",
    prompt_hash: str = "",
) -> str:
    """Transcribe with caching to avoid redundant API calls."""
    key = page_cache_key(image_path, text_path, model, prompt_hash)
    cache_file = cache_dir / f"{key}.json"
    
    if cache_file.exists():
        logging.info(f"Cache hit for {image_path.name}")
        return json.loads(cache_file.read_text(encoding="utf-8"))["markdown"]
    
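    md = transcribe_page_with_gpt4o(
        client, image_path, text_path,
        model=model,
        temperature=CONFIG["temperature"],
        max_retries=CONFIG["max_retries"],
        initial_backoff=CONFIG["initial_backoff"],
    )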
    cache_file.write_text(json.dumps({"markdown": md}), encoding="utf-8")
    logging.info(f"Cache saved for {image_path.name}")
    return md
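
To see why the prompt hash is part of the key, the self-contained sketch below derives keys the same way as `page_cache_key`, but from in-memory values rather than files. The function `demo_cache_key` and the sample inputs are illustrative only.

```python
import hashlib

def demo_cache_key(image_bytes: bytes, page_text: str, model: str, prompt_hash: str) -> str:
    """SHA-256 over image bytes, page text, model name, and prompt hash, in that order."""
    h = hashlib.sha256()
    for part in (
        image_bytes,
        page_text.encode("utf-8"),
        model.encode("utf-8"),
        prompt_hash.encode("utf-8"),
    ):
        h.update(part)
    return h.hexdigest()

prompt_v1 = hashlib.sha256(b"Transcribe this page.").hexdigest()
prompt_v2 = hashlib.sha256(b"Transcribe this page into Markdown.").hexdigest()

key_a = demo_cache_key(b"fake-png-bytes", "Hello", "gpt-4o", prompt_v1)
key_b = demo_cache_key(b"fake-png-bytes", "Hello", "gpt-4o", prompt_v1)
key_c = demo_cache_key(b"fake-png-bytes", "Hello", "gpt-4o", prompt_v2)

print("same inputs, same key:", key_a == key_b)
print("changed prompt, same key:", key_a == key_c)
```

Parameters that are not hashed, such as `temperature`, do not invalidate the cache; add them to the key if you vary them between runs.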

Update the pipeline to use caching:

In [None]:
def process_pdf_with_cache(pdf_path_str: str, dpi: int = 200) -> Path:
    """Pipeline with per-page caching."""
    pdf_path = Path(pdf_path_str).expanduser().resolve()
    base_dir, images_dir, txt_dir, cache_dir = prepare_output_dirs(pdf_path)
    
    # Hash prompts to invalidate cache if prompts change
    prompt_hash = hashlib.sha256((SYSTEM_PROMPT + USER_TEMPLATE).encode("utf-8")).hexdigest()
    
    image_files = convert_pages_to_images(pdf_path, images_dir, dpi=dpi)
    text_files = extract_page_texts(pdf_path, txt_dir)
    
    if len(image_files) != len(text_files):
        raise RuntimeError("Mismatch between number of images and text files.")
    
    page_markdowns = []
    for idx, (img, txt) in enumerate(zip(image_files, text_files), start=1):
        logging.info(f"Transcribing page {idx}/{len(image_files)}: {img.name}")
        try:
            md = transcribe_with_cache(client, img, txt, cache_dir, model=CONFIG["model"], prompt_hash=prompt_hash)
        except Exception as e:
            logging.error(f"Transcription failed for page {idx}: {e}")
            md = "[Transcription failed for this page.]"
        page_markdowns.append(f"---\n\n## Page {idx}\n\n{md}\n")
    
    transcript_path = base_dir / "transcript.md"
    transcript_path.write_text("\n".join(page_markdowns), encoding="utf-8")
    logging.info(f"Wrote transcript: {transcript_path}")
    return transcript_path

Run with caching:

In [None]:
transcript = process_pdf_with_cache(pdf_path, dpi=CONFIG["dpi"])
print(f"Transcript saved to: {transcript}")

## What You Get

A per-page Markdown transcription assembled into one document. Headings map to original styles, lists are preserved, tables stay coherent, and multi-column content reads in the right order. The output is ready for indexing, search, or publishing.

For more on how LLMs handle context and why managing memory is crucial for large documents, see our article on context rot and why LLMs "forget" as their memory grows.

## Next Steps

* **Parallel processing:** Use `concurrent.futures.ThreadPoolExecutor` to transcribe multiple pages simultaneously and reduce total runtime.
* **Payload optimization:** Downscale images above a byte-size threshold to stay under API limits and reduce costs.
* **Observability:** Add structured logging with error types, status codes, and per-page cost estimates for production monitoring.
* **Deployment:** Wrap the pipeline in a FastAPI endpoint or Streamlit app for team use.
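
As a starting point for the parallel-processing item, here is a minimal sketch using `concurrent.futures.ThreadPoolExecutor`. The `fake_transcribe` function stands in for the real API call so the example runs offline; `transcribe_pages_parallel` and the `max_workers=4` default are assumptions, and a real deployment would need to respect your account's rate limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def fake_transcribe(page_number: int) -> str:
    """Stand-in for a per-page API call; sleeps briefly to simulate latency."""
    time.sleep(0.05)
    return f"## Page {page_number}\n\n(transcribed content)"

def transcribe_pages_parallel(
    page_numbers: list[int],
    transcribe: Callable[[int], str],
    max_workers: int = 4,
) -> list[str]:
    """Run transcribe over pages concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        return list(pool.map(transcribe, page_numbers))

results = transcribe_pages_parallel(list(range(1, 9)), fake_transcribe)
print(len(results), "pages;", results[0].splitlines()[0])
```

Threads suit this workload because each call spends most of its time waiting on the network. `pool.map` re-raises the first exception when its result is consumed, so per-page error handling belongs inside the `transcribe` callable.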