# **Vision OCR Benchmark — Sarvam Vision vs pytesseract on Indic Documents**

This notebook benchmarks Sarvam Vision Document Intelligence against pytesseract on five
synthetic Indic documents, measuring word accuracy and processing time for each engine.

### **Pipeline**
1. **Generate:** Create 5 synthetic test documents (Hindi bill, Tamil prescription,
   English form, mixed-script invoice, handwritten-style note) using Pillow, each with
   known ground-truth text.
2. **Extract:** Run Sarvam Vision Document Intelligence on each document.
3. **Baseline:** Run pytesseract on the same documents.
4. **Score:** Compute word-level accuracy for both engines against ground truth.
5. **Report:** Export results as an Excel file and matplotlib bar charts.

In [None]:
# sarvamai is pinned exactly; the other packages only set minimum versions.
!pip install -Uqq sarvamai==0.1.24 "pytesseract>=0.3.10" "Pillow>=11.0.0" "openpyxl>=3.1.0" "matplotlib>=3.7.0" "python-dotenv>=1.0.0"

### **1. Setup & API Key**

Obtain your API key from the [Sarvam AI Dashboard](https://dashboard.sarvam.ai).
Create a `.env` file in this directory with `SARVAM_API_KEY=your_key_here`, or set the
environment variable directly.

pytesseract requires Tesseract OCR to be installed on your system:
- macOS: `brew install tesseract tesseract-lang`
- Ubuntu/Debian: `sudo apt-get install tesseract-ocr tesseract-ocr-hin tesseract-ocr-tam`
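
Before running the benchmark, it can help to confirm that the Tesseract binary and the Hindi and Tamil language packs are present. The following is a minimal sketch, assuming the `tesseract` CLI is on `PATH` and that `tesseract --list-langs` prints one header line followed by one language code per line. `parse_list_langs` and `installed_tesseract_langs` are hypothetical helper names, not part of pytesseract.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Parse `tesseract --list-langs` output: skip the header line, keep the codes."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    return set(lines[1:])


def installed_tesseract_langs() -> set[str]:
    """Return installed language codes, or an empty set if the binary is missing."""
    if shutil.which("tesseract") is None:
        return set()
    proc = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Assumption: fall back to stderr in case the list is not written to stdout.
    return parse_list_langs(proc.stdout or proc.stderr)


required = {"eng", "hin", "tam"}
missing = required - installed_tesseract_langs()
print("Missing language packs:", sorted(missing) or "none")
```

If any code is reported missing, the install commands above should add it.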

In [None]:
from __future__ import annotations

import io
import os
import re
import time
import zipfile
import tempfile
import shutil
from dataclasses import dataclass, field
from pathlib import Path

import pytesseract
import matplotlib.pyplot as plt
import openpyxl
from openpyxl.styles import Font, PatternFill, Alignment
from PIL import Image, ImageDraw, ImageFont
from dotenv import load_dotenv
from sarvamai import SarvamAI

load_dotenv()

SARVAM_API_KEY = os.environ.get("SARVAM_API_KEY", "")
if not SARVAM_API_KEY or SARVAM_API_KEY == "YOUR_SARVAM_API_KEY":
    raise RuntimeError(
        "SARVAM_API_KEY is not set. Add it to your .env file or set the environment variable."
    )

client = SarvamAI(api_subscription_key=SARVAM_API_KEY)

print("Client initialised.")

### **2. Step 1 — GENERATE: Synthetic Test Documents**

`generate_test_documents` creates five PNG images in `sample_data/` using Pillow, each
representing a different Indic document type. Each document comes with a ground-truth
word list used for accuracy scoring.

Indic script rendering requires a font that covers the script. The function searches
common system paths for Noto fonts and falls back to the default PIL font if none is
found. The default font does not cover Devanagari or Tamil, so a document rendered with
it will not contain legible Indic text, and its accuracy scores for both engines should
be disregarded. Install the Noto fonts before running the benchmark.

**Document types:**
- `hindi_bill.png` — Hindi electricity bill (Devanagari)
- `tamil_prescription.png` — Tamil doctor's prescription (Tamil script)
- `english_form.png` — English government application form (Latin)
- `mixed_invoice.png` — Mixed-script invoice (Hindi headers, English amounts)
- `handwritten_note.png` — Handwritten-style note (Latin, small irregular font)
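
Each generator keeps its rendered `lines` and its `ground_truth` list side by side, so the two can drift apart when a document is edited. The following is a rough sketch of a consistency check, assuming ground-truth tokens are the whitespace-separated words of the rendered lines with surrounding punctuation trimmed. `words_from_lines` is a hypothetical helper, and its trimming rule only approximates how the lists in this notebook were written.

```python
import string

# Characters trimmed from both ends of each token. "%" is kept so that "9%"
# survives; the em dash (U+2014) is added because some documents use it as a
# separator.
_STRIP_CHARS = string.punctuation.replace("%", "") + "\u2014"


def words_from_lines(lines: list[str]) -> list[str]:
    """Split rendered lines into words, trimming surrounding punctuation."""
    words = []
    for line in lines:
        for token in line.split():
            cleaned = token.strip(_STRIP_CHARS)
            if cleaned:  # drop tokens that were pure punctuation
                words.append(cleaned)
    return words


lines = ["Invoice No: SGT-2025-0042", "CGST (9%): Rs 99"]
ground_truth = ["Invoice", "No", "SGT-2025-0042", "CGST", "9%", "Rs", "99"]
assert words_from_lines(lines) == ground_truth
```

Running a check like this inside each generator would flag a mismatch at generation time rather than as an unexplained drop in accuracy.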

In [None]:
_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

SAMPLE_DATA_DIR = Path("sample_data")
SAMPLE_DATA_DIR.mkdir(parents=True, exist_ok=True)


@dataclass
class TestDocument:
    name: str
    path: Path
    ground_truth_words: list[str]
    script: str


def _find_font(
    candidates: list[str], size: int
) -> ImageFont.FreeTypeFont | ImageFont.ImageFont:
    """Try each candidate font path; return default PIL font if none found."""
    for candidate in candidates:
        p = Path(candidate)
        if p.exists():
            try:
                return ImageFont.truetype(str(p), size)
            except Exception:
                continue
    return ImageFont.load_default()


_NOTO_DEVANAGARI = [
    "/usr/share/fonts/truetype/noto/NotoSansDevanagari-Regular.ttf",
    "/System/Library/Fonts/Supplemental/NotoSansDevanagari-Regular.ttf",
    "/Library/Fonts/NotoSansDevanagari-Regular.ttf",
    str(Path.home() / ".fonts/NotoSansDevanagari-Regular.ttf"),
]
_NOTO_TAMIL = [
    "/usr/share/fonts/truetype/noto/NotoSansTamil-Regular.ttf",
    "/System/Library/Fonts/Supplemental/NotoSansTamil-Regular.ttf",
    "/Library/Fonts/NotoSansTamil-Regular.ttf",
    str(Path.home() / ".fonts/NotoSansTamil-Regular.ttf"),
]
_LATIN = [
    "/System/Library/Fonts/Helvetica.ttc",
    "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",
    "/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf",
]


def _draw_lines(
    draw: ImageDraw.ImageDraw,
    lines: list[str],
    font: ImageFont.FreeTypeFont | ImageFont.ImageFont,
    x: int = 40,
    y_start: int = 40,
    line_gap: int = 34,
    fill: str = "black",
) -> None:
    for i, line in enumerate(lines):
        draw.text((x, y_start + i * line_gap), line, font=font, fill=fill)


def _make_image(
    width: int = 640, height: int = 800
) -> tuple[Image.Image, ImageDraw.ImageDraw]:
    img = Image.new("RGB", (width, height), color="white")
    return img, ImageDraw.Draw(img)


def _generate_hindi_bill(path: Path) -> list[str]:
    img, draw = _make_image()
    font = _find_font(_NOTO_DEVANAGARI, 22)

    lines = [
        "\u0909\u0924\u094d\u0924\u0930 \u092a\u094d\u0930\u0926\u0947\u0936 \u092a\u093e\u0935\u0930 \u0915\u0949\u0930\u094d\u092a\u094b\u0930\u0947\u0936\u0928 \u0932\u093f\u092e\u093f\u091f\u0947\u0921",
        "\u092c\u093f\u091c\u0932\u0940 \u092c\u093f\u0932 \u2014 \u091c\u0928\u0935\u0930\u0940 2025",
        "",
        "\u0909\u092a\u092d\u094b\u0915\u094d\u0924\u093e \u0938\u0902\u0916\u094d\u092f\u093e: UP-2024-88741",
        "\u0928\u093e\u092e: \u0930\u093e\u092e \u092a\u094d\u0930\u0938\u093e\u0926 \u0936\u0930\u094d\u092e\u093e",
        "\u092a\u0924\u093e: 14 \u0917\u093e\u0902\u0927\u0940 \u0928\u0917\u0930, \u0932\u0916\u0928\u090a",
        "",
        "\u092a\u093f\u091b\u0932\u093e \u092e\u0940\u091f\u0930 \u0930\u0940\u0921\u093f\u0902\u0917: 4520 \u092f\u0942\u0928\u093f\u091f",
        "\u0935\u0930\u094d\u0924\u092e\u093e\u0928 \u092e\u0940\u091f\u0930 \u0930\u0940\u0921\u093f\u0902\u0917: 4780 \u092f\u0942\u0928\u093f\u091f",
        "\u0916\u092a\u0924: 260 \u092f\u0942\u0928\u093f\u091f",
        "",
        "\u092c\u093f\u091c\u0932\u0940 \u0936\u0941\u0932\u094d\u0915: \u0930\u0941 1820",
        "\u092b\u093f\u0915\u094d\u0938\u094d\u0921 \u091a\u093e\u0930\u094d\u091c: \u0930\u0941 120",
        "\u091c\u0940\u090f\u0938\u091f\u0940 (18%): \u0930\u0941 351",
        "\u0915\u0941\u0932 \u0926\u0947\u092f \u0930\u093e\u0936\u093f: \u0930\u0941 2291",
        "",
        "\u092d\u0941\u0917\u0924\u093e\u0928 \u0915\u0940 \u0905\u0902\u0924\u093f\u092e \u0924\u093f\u0925\u093f: 15 \u092b\u0930\u0935\u0930\u0940 2025",
    ]
    _draw_lines(draw, lines, font)
    img.save(path)

    ground_truth = [
        "\u0909\u0924\u094d\u0924\u0930", "\u092a\u094d\u0930\u0926\u0947\u0936", "\u092a\u093e\u0935\u0930",
        "\u0915\u0949\u0930\u094d\u092a\u094b\u0930\u0947\u0936\u0928", "\u0932\u093f\u092e\u093f\u091f\u0947\u0921",
        "\u092c\u093f\u091c\u0932\u0940", "\u092c\u093f\u0932", "\u091c\u0928\u0935\u0930\u0940", "2025",
        "\u0909\u092a\u092d\u094b\u0915\u094d\u0924\u093e", "\u0938\u0902\u0916\u094d\u092f\u093e", "UP-2024-88741",
        "\u0928\u093e\u092e", "\u0930\u093e\u092e", "\u092a\u094d\u0930\u0938\u093e\u0926", "\u0936\u0930\u094d\u092e\u093e",
        "\u092a\u0924\u093e", "14", "\u0917\u093e\u0902\u0927\u0940", "\u0928\u0917\u0930", "\u0932\u0916\u0928\u090a",
        "\u092a\u093f\u091b\u0932\u093e", "\u092e\u0940\u091f\u0930", "\u0930\u0940\u0921\u093f\u0902\u0917", "4520", "\u092f\u0942\u0928\u093f\u091f",
        "\u0935\u0930\u094d\u0924\u092e\u093e\u0928", "\u092e\u0940\u091f\u0930", "\u0930\u0940\u0921\u093f\u0902\u0917", "4780", "\u092f\u0942\u0928\u093f\u091f",
        "\u0916\u092a\u0924", "260", "\u092f\u0942\u0928\u093f\u091f",
        "\u092c\u093f\u091c\u0932\u0940", "\u0936\u0941\u0932\u094d\u0915", "\u0930\u0941", "1820",
        "\u092b\u093f\u0915\u094d\u0938\u094d\u0921", "\u091a\u093e\u0930\u094d\u091c", "\u0930\u0941", "120",
        "\u091c\u0940\u090f\u0938\u091f\u0940", "18%", "\u0930\u0941", "351",
        "\u0915\u0941\u0932", "\u0926\u0947\u092f", "\u0930\u093e\u0936\u093f", "\u0930\u0941", "2291",
        "\u092d\u0941\u0917\u0924\u093e\u0928", "\u0915\u0940", "\u0905\u0902\u0924\u093f\u092e", "\u0924\u093f\u0925\u093f", "15", "\u092b\u0930\u0935\u0930\u0940", "2025",
    ]
    return ground_truth


def _generate_tamil_prescription(path: Path) -> list[str]:
    img, draw = _make_image()
    font = _find_font(_NOTO_TAMIL, 22)

    lines = [
        "\u0b9f\u0bbe\u0b95\u0bcd\u0b9f\u0bb0\u0bcd \u0bb8\u0bcd\u0bb0\u0bc0\u0ba8\u0bbf\u0bb5\u0bbe\u0bb8\u0bcd \u0b95\u0bbf\u0bb3\u0bbf\u0ba9\u0bbf\u0b95\u0bcd",
        "44 \u0b85\u0ba3\u0bcd\u0ba3\u0bbe \u0b9a\u0bbe\u0bb2\u0bc8, \u0b9a\u0bc6\u0ba9\u0bcd\u0ba9\u0bc8 - 600002",
        "",
        "\u0ba8\u0bcb\u0baf\u0bbe\u0bb3\u0bbf: \u0bae\u0bc1\u0ba4\u0bcd\u0ba4\u0bc1\u0bb2\u0b9f\u0bcd\u0b9a\u0bc1\u0bae\u0bbf",
        "\u0ba4\u0bc7\u0ba4\u0bbf: 10-01-2025",
        "",
        "\u0bae\u0bb0\u0bc1\u0ba8\u0bcd\u0ba4\u0bc1:",
        "1. \u0baa\u0bbe\u0bb0\u0bbe\u0b9a\u0bbf\u0b9f\u0bcd\u0b9f\u0bbe\u0bae\u0bbe\u0bb2\u0bcd 500 \u0bae\u0bbf.\u0b95\u0bbf \u2014 \u0ba4\u0bbf\u0ba9\u0bae\u0bcd 3 \u0bae\u0bc1\u0bb1\u0bc8",
        "2. \u0b85\u0bae\u0bcb\u0b95\u0bcd\u0b9a\u0bbf\u0b9a\u0bbf\u0bb2\u0bbf\u0ba9\u0bcd 250 \u0bae\u0bbf.\u0b95\u0bbf \u2014 \u0ba4\u0bbf\u0ba9\u0bae\u0bcd 2 \u0bae\u0bc1\u0bb1\u0bc8",
        "3. \u0bb5\u0bc8\u0b9f\u0bcd\u0b9f\u0bae\u0bbf\u0ba9\u0bcd \u0b9a\u0bbf \u2014 \u0ba4\u0bbf\u0ba9\u0bae\u0bcd \u0b92\u0bb0\u0bc1\u0bae\u0bc1\u0bb1\u0bc8",
        "",
        "\u0bae\u0bb1\u0bc1\u0baa\u0bb0\u0bbf\u0b9a\u0bc0\u0bb2\u0ba9\u0bc8: 3 \u0ba8\u0bbe\u0b9f\u0bcd\u0b95\u0bb3\u0bbf\u0bb2\u0bcd",
    ]
    _draw_lines(draw, lines, font)
    img.save(path)

    ground_truth = [
        "\u0b9f\u0bbe\u0b95\u0bcd\u0b9f\u0bb0\u0bcd", "\u0bb8\u0bcd\u0bb0\u0bc0\u0ba8\u0bbf\u0bb5\u0bbe\u0bb8\u0bcd", "\u0b95\u0bbf\u0bb3\u0bbf\u0ba9\u0bbf\u0b95\u0bcd",
        "44", "\u0b85\u0ba3\u0bcd\u0ba3\u0bbe", "\u0b9a\u0bbe\u0bb2\u0bc8", "\u0b9a\u0bc6\u0ba9\u0bcd\u0ba9\u0bc8", "600002",
        "\u0ba8\u0bcb\u0baf\u0bbe\u0bb3\u0bbf", "\u0bae\u0bc1\u0ba4\u0bcd\u0ba4\u0bc1\u0bb2\u0b9f\u0bcd\u0b9a\u0bc1\u0bae\u0bbf",
        "\u0ba4\u0bc7\u0ba4\u0bbf", "10-01-2025",
        "\u0bae\u0bb0\u0bc1\u0ba8\u0bcd\u0ba4\u0bc1",
        "1", "\u0baa\u0bbe\u0bb0\u0bbe\u0b9a\u0bbf\u0b9f\u0bcd\u0b9f\u0bbe\u0bae\u0bbe\u0bb2\u0bcd", "500", "\u0bae\u0bbf.\u0b95\u0bbf", "\u0ba4\u0bbf\u0ba9\u0bae\u0bcd", "3", "\u0bae\u0bc1\u0bb1\u0bc8",
        "2", "\u0b85\u0bae\u0bcb\u0b95\u0bcd\u0b9a\u0bbf\u0b9a\u0bbf\u0bb2\u0bbf\u0ba9\u0bcd", "250", "\u0bae\u0bbf.\u0b95\u0bbf", "\u0ba4\u0bbf\u0ba9\u0bae\u0bcd", "2", "\u0bae\u0bc1\u0bb1\u0bc8",
        "3", "\u0bb5\u0bc8\u0b9f\u0bcd\u0b9f\u0bae\u0bbf\u0ba9\u0bcd", "\u0b9a\u0bbf", "\u0ba4\u0bbf\u0ba9\u0bae\u0bcd", "\u0b92\u0bb0\u0bc1\u0bae\u0bc1\u0bb1\u0bc8",
        "\u0bae\u0bb1\u0bc1\u0baa\u0bb0\u0bbf\u0b9a\u0bc0\u0bb2\u0ba9\u0bc8", "3", "\u0ba8\u0bbe\u0b9f\u0bcd\u0b95\u0bb3\u0bbf\u0bb2\u0bcd",
    ]
    return ground_truth


def _generate_english_form(path: Path) -> list[str]:
    img, draw = _make_image()
    font = _find_font(_LATIN, 20)

    lines = [
        "GOVERNMENT OF INDIA",
        "Application Form -- Passport Renewal",
        "",
        "Full Name:  Arun Kumar Verma",
        "Date of Birth:  12-05-1985",
        "Place of Birth:  Bengaluru, Karnataka",
        "Nationality:  Indian",
        "",
        "Current Passport No.:  P1234567",
        "Issued at:  Bengaluru",
        "Date of Issue:  15-03-2015",
        "Date of Expiry:  14-03-2025",
        "",
        "Address:  No. 7, MG Road, Bengaluru - 560001",
        "Mobile:  9876543210",
        "Email:  arun.verma@email.com",
        "",
        "Signature:  _______________",
        "Date:  10-01-2025",
    ]
    _draw_lines(draw, lines, font)
    img.save(path)

    ground_truth = [
        "GOVERNMENT", "OF", "INDIA",
        "Application", "Form", "Passport", "Renewal",
        "Full", "Name", "Arun", "Kumar", "Verma",
        "Date", "of", "Birth", "12-05-1985",
        "Place", "of", "Birth", "Bengaluru", "Karnataka",
        "Nationality", "Indian",
        "Current", "Passport", "No", "P1234567",
        "Issued", "at", "Bengaluru",
        "Date", "of", "Issue", "15-03-2015",
        "Date", "of", "Expiry", "14-03-2025",
        "Address", "No", "7", "MG", "Road", "Bengaluru", "560001",
        "Mobile", "9876543210",
        "Email", "arun.verma@email.com",
        "Signature",
        "Date", "10-01-2025",
    ]
    return ground_truth


def _generate_mixed_invoice(path: Path) -> list[str]:
    img, draw = _make_image()
    font_hi = _find_font(_NOTO_DEVANAGARI, 20)
    font_en = _find_font(_LATIN, 20)

    rows: list[tuple[str, ImageFont.FreeTypeFont | ImageFont.ImageFont]] = [
        ("\u0936\u094d\u0930\u0940 \u0917\u0923\u0947\u0936 \u091f\u094d\u0930\u0947\u0921\u0930\u094d\u0938", font_hi),
        ("\u091c\u0940\u090f\u0938\u091f\u0940 \u0907\u0928\u0935\u0949\u0907\u0938", font_hi),
        ("", font_en),
        ("Invoice No: SGT-2025-0042", font_en),
        ("Date: 08-01-2025", font_en),
        ("", font_en),
        ("\u0935\u0938\u094d\u0924\u0941             \u092e\u093e\u0924\u094d\u0930\u093e    \u092e\u0942\u0932\u094d\u092f", font_hi),
        ("", font_en),
        ("Basmati Rice 5kg      2    Rs 480", font_en),
        ("Toor Dal 1kg          3    Rs 360", font_en),
        ("Sunflower Oil 1L      2    Rs 260", font_en),
        ("", font_en),
        ("\u0909\u092a-\u0915\u0941\u0932: Rs 1100", font_hi),
        ("CGST (9%): Rs 99", font_en),
        ("SGST (9%): Rs 99", font_en),
        ("\u0915\u0941\u0932: Rs 1298", font_hi),
    ]
    y = 40
    for text, fnt in rows:
        draw.text((40, y), text, font=fnt, fill="black")
        y += 34
    img.save(path)

    ground_truth = [
        "\u0936\u094d\u0930\u0940", "\u0917\u0923\u0947\u0936", "\u091f\u094d\u0930\u0947\u0921\u0930\u094d\u0938",
        "\u091c\u0940\u090f\u0938\u091f\u0940", "\u0907\u0928\u0935\u0949\u0907\u0938",
        "Invoice", "No", "SGT-2025-0042",
        "Date", "08-01-2025",
        "\u0935\u0938\u094d\u0924\u0941", "\u092e\u093e\u0924\u094d\u0930\u093e", "\u092e\u0942\u0932\u094d\u092f",
        "Basmati", "Rice", "5kg", "2", "Rs", "480",
        "Toor", "Dal", "1kg", "3", "Rs", "360",
        "Sunflower", "Oil", "1L", "2", "Rs", "260",
        "\u0909\u092a-\u0915\u0941\u0932", "Rs", "1100",
        "CGST", "9%", "Rs", "99",
        "SGST", "9%", "Rs", "99",
        "\u0915\u0941\u0932", "Rs", "1298",
    ]
    return ground_truth


def _generate_handwritten_note(path: Path) -> list[str]:
    img, draw = _make_image(640, 480)
    font = _find_font(_LATIN, 16)

    lines = [
        "Meeting Notes - Budget Review",
        "Date: 12 January 2025",
        "",
        "Attendees: Priya, Rahul, Sunita, Mohan",
        "",
        "Action Items:",
        "1. Priya to send Q4 report by Friday",
        "2. Rahul to confirm vendor quotes",
        "3. Sunita to schedule follow-up meeting",
        "4. Mohan to review revised budget",
        "",
        "Next meeting: 20 January 2025 at 10am",
    ]
    _draw_lines(draw, lines, font, y_start=30, line_gap=30)
    img.save(path)

    ground_truth = [
        "Meeting", "Notes", "Budget", "Review",
        "Date", "12", "January", "2025",
        "Attendees", "Priya", "Rahul", "Sunita", "Mohan",
        "Action", "Items",
        "1", "Priya", "to", "send", "Q4", "report", "by", "Friday",
        "2", "Rahul", "to", "confirm", "vendor", "quotes",
        "3", "Sunita", "to", "schedule", "follow-up", "meeting",
        "4", "Mohan", "to", "review", "revised", "budget",
        "Next", "meeting", "20", "January", "2025", "at", "10am",
    ]
    return ground_truth


def generate_test_documents() -> list[TestDocument]:
    """Generate 5 synthetic test documents with known ground-truth word lists."""
    docs: list[TestDocument] = []
    specs = [
        ("hindi_bill",         "Devanagari",         _generate_hindi_bill),
        ("tamil_prescription", "Tamil",              _generate_tamil_prescription),
        ("english_form",       "Latin",              _generate_english_form),
        ("mixed_invoice",      "Devanagari + Latin", _generate_mixed_invoice),
        ("handwritten_note",   "Latin",              _generate_handwritten_note),
    ]
    for name, script, generator_fn in specs:
        path = SAMPLE_DATA_DIR / f"{name}.png"
        try:
            gt_words = generator_fn(path)
            docs.append(
                TestDocument(name=name, path=path, ground_truth_words=gt_words, script=script)
            )
            print(f"Generated: {path.name} ({script})")
        except Exception as exc:
            print(f"Warning: could not generate {name}: {exc}")
    return docs

### **3. Step 2 — EXTRACT: Sarvam Vision OCR**

`run_sarvam_ocr` wraps the Sarvam Vision Document Intelligence async job workflow:
create -> upload -> start -> poll -> download (ZIP) -> unzip -> return text.

PNG images are wrapped in a ZIP before upload, as the API only accepts `.pdf` or `.zip`.
Processing time is wall-clock time from job creation until the result has been
downloaded; unzipping the result is not included.
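
The wrap and unzip steps can be tried in memory without calling the API. The following is a minimal sketch using only the standard library. `zip_single_file` and `read_text_members` are hypothetical helpers, and the assumption that the result archive holds `.md` or `.txt` members is taken from the code in this notebook rather than verified against the API.

```python
import io
import zipfile


def zip_single_file(name: str, payload: bytes) -> bytes:
    """Return the bytes of a ZIP archive holding one file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, payload)
    return buffer.getvalue()


def read_text_members(archive: bytes, suffixes: tuple[str, ...] = (".md", ".txt")) -> str:
    """Concatenate every text member of a ZIP archive, in archive order."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        names = [n for n in zf.namelist() if n.endswith(suffixes)]
        return "\n".join(zf.read(n).decode("utf-8", errors="replace") for n in names)


# Stand-in for a downloaded result archive; the member name is made up.
fake_result = zip_single_file("page_1.md", "बिजली बिल".encode("utf-8"))
print(read_text_members(fake_result))
```

An archive with no matching members yields an empty string, which the scoring step would report as 0% accuracy.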

In [None]:
def _wrap_image_in_zip(image_path: Path) -> tuple[str, str]:
    """Wrap a PNG/JPG in a temporary ZIP. Returns (zip_path, tmp_dir)."""
    tmp_dir = tempfile.mkdtemp()
    zip_path = str(Path(tmp_dir) / f"{image_path.stem}.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.write(image_path, image_path.name)
    return zip_path, tmp_dir


def run_sarvam_ocr(image_path: Path, poll_interval: float = 3.0) -> tuple[str, float]:
    """Run Sarvam Vision Document Intelligence on an image file.

    Returns:
        (extracted_text, elapsed_seconds)
    """
    upload_path = str(image_path)
    tmp_dir: str | None = None

    if image_path.suffix.lower() in _IMAGE_EXTENSIONS:
        upload_path, tmp_dir = _wrap_image_in_zip(image_path)

    start = time.perf_counter()
    try:
        with open(upload_path, "rb") as fh:
            create_resp = client.documents.create(file=fh)
        job_id = create_resp.request_id

        client.documents.start(request_id=job_id)

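        # Poll until the job finishes. Give up after 5 minutes (an arbitrary
        # limit) so that a stuck job cannot hang the notebook.
        deadline = start + 300.0
        while True:
            status_resp = client.documents.status(request_id=job_id)
            state = str(getattr(status_resp, "state", status_resp)).lower()
            if "completed" in state:
                break
            if "failed" in state:
                raise RuntimeError(f"Sarvam job failed: {state}")
            if time.perf_counter() > deadline:
                raise TimeoutError(f"Sarvam job {job_id} still '{state}' after 300 s")
            time.sleep(poll_interval)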

        result_resp = client.documents.get(request_id=job_id)
        elapsed = time.perf_counter() - start

        if hasattr(result_resp, "read"):
            raw = result_resp.read()
            with zipfile.ZipFile(io.BytesIO(raw)) as zf:
                text_files = [n for n in zf.namelist() if n.endswith(".md") or n.endswith(".txt")]
                extracted = "\n".join(
                    zf.read(n).decode("utf-8", errors="replace") for n in text_files
                )
        else:
            extracted = str(result_resp)

        return extracted, elapsed
    finally:
        if tmp_dir:
            shutil.rmtree(tmp_dir, ignore_errors=True)

### **4. Step 3 — BASELINE: pytesseract OCR**

`run_tesseract_ocr` runs Tesseract via the pytesseract Python wrapper.

Language codes passed to Tesseract:
- Devanagari documents: `hin`
- Tamil documents: `tam`
- Latin documents: `eng`
- Mixed Devanagari + Latin documents: `hin+eng`

If a language pack is missing, `run_tesseract_ocr` prints a warning and retries with `eng`.
Processing time is measured wall-clock.
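
The notebook falls back to `eng` only after Tesseract reports an error. An alternative is to pick the language string up front. The following is a minimal sketch, assuming the set of installed codes is already known (for example from `pytesseract.get_languages()`). `resolve_tess_lang` is a hypothetical helper and is not used by the code in this notebook.

```python
def resolve_tess_lang(requested: str, installed: set[str], fallback: str = "eng") -> str:
    """Keep only the installed parts of a `+`-joined Tesseract language string."""
    available = [code for code in requested.split("+") if code in installed]
    return "+".join(available) if available else fallback


print(resolve_tess_lang("hin+eng", {"eng", "hin"}))  # hin+eng
print(resolve_tess_lang("hin+eng", {"eng"}))         # eng
print(resolve_tess_lang("tam", {"eng"}))             # eng
```

Resolving the language before the call keeps the partial case (`hin+eng` with only `eng` installed) explicit instead of relying on an error message match.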

In [None]:
_SCRIPT_TO_TESS_LANG: dict[str, str] = {
    "Devanagari":         "hin",
    "Tamil":              "tam",
    "Latin":              "eng",
    "Devanagari + Latin": "hin+eng",
}


def run_tesseract_ocr(image_path: Path, script: str) -> tuple[str, float]:
    """Run pytesseract on an image file.

    Returns:
        (extracted_text, elapsed_seconds)
    """
    lang = _SCRIPT_TO_TESS_LANG.get(script, "eng")

    # Open via a context manager so the file handle is closed after OCR.
    with Image.open(image_path) as img:
        start = time.perf_counter()
        try:
            text = pytesseract.image_to_string(img, lang=lang)
        except pytesseract.TesseractError as exc:
            if "Failed loading language" in str(exc):
                print(f"  Warning: language pack '{lang}' not found, falling back to 'eng'.")
                text = pytesseract.image_to_string(img, lang="eng")
            else:
                raise
        elapsed = time.perf_counter() - start
    return text, elapsed

### **5. Step 4 — SCORE: Word Accuracy**

`compute_word_accuracy` computes word-level recall: the fraction of ground-truth words
that appear in the OCR output.

- Text is lowercased and punctuation is stripped before comparison.
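- Duplicate words in ground truth are counted separately, but each is matched against the
  set of unique OCR words, so word order and repeat counts in the OCR output are ignored.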
- Score range: 0.0 (no words matched) to 1.0 (all ground-truth words found).
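
A small worked example may make the metric concrete. The following is a self-contained sketch of the same recall idea on Latin text only. `word_recall` is a hypothetical stand-in with a simplified punctuation rule, not the notebook's `compute_word_accuracy`.

```python
import re


def word_recall(ocr_text: str, ground_truth_words: list[str]) -> float:
    """Fraction of ground-truth words found among the unique OCR words."""
    # Replace everything except word characters, whitespace, and hyphens.
    cleaned = re.sub(r"[^\w\s-]", " ", ocr_text.lower())
    predicted = set(cleaned.split())
    hits = sum(1 for word in ground_truth_words if word.lower() in predicted)
    return hits / len(ground_truth_words) if ground_truth_words else 0.0


ground_truth = ["Date", "of", "Birth", "Date", "of", "Issue"]
ocr_output = "Date of Birth: 12-05-1985\nDate of lssue"  # "Issue" misread as "lssue"
print(f"{word_recall(ocr_output, ground_truth):.3f}")  # 0.833
```

Five of the six ground-truth words are found, so the score is 5/6. Both occurrences of `Date` and `of` count, while the extra OCR token `12-05-1985` carries no penalty because the metric measures recall only.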

In [None]:
_PUNCT_RE = re.compile(r"[^\w\s\u0900-\u097f\u0b80-\u0bff-]")


def _tokenise(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return _PUNCT_RE.sub(" ", text.lower()).split()


def _normalise_words(text: str) -> set[str]:
    """Return the set of unique normalised words in the OCR output."""
    return set(_tokenise(text))


def compute_word_accuracy(ocr_text: str, ground_truth_words: list[str]) -> float:
    """Compute word-level recall: matched ground-truth tokens / all ground-truth tokens.

    Ground-truth words are tokenised the same way as the OCR text, so an entry
    such as "18%" is compared as "18" rather than never matching.

    Args:
        ocr_text: Raw text returned by the OCR engine.
        ground_truth_words: Known correct words for this document.

    Returns:
        Accuracy score in [0.0, 1.0].
    """
    gt_tokens = [tok for word in ground_truth_words for tok in _tokenise(word)]
    if not gt_tokens:
        return 0.0
    predicted_set = _normalise_words(ocr_text)
    matched = sum(1 for tok in gt_tokens if tok in predicted_set)
    return matched / len(gt_tokens)

### **6. End-to-End Benchmark Pipeline**

`run_benchmark` ties all steps together. For each test document it:
1. Runs Sarvam Vision OCR and records text + time.
2. Runs pytesseract and records text + time.
3. Scores both against the ground truth.
4. Collects results into a list of `BenchmarkResult` dataclass instances.
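
Both engine wrappers time their work with `time.perf_counter()`. The shared pattern can be sketched as a small helper. `timed` is a hypothetical name, and the notebook's own functions inline this logic rather than calling such a helper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def timed(fn: Callable[..., T], *args, **kwargs) -> tuple[T, float]:
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Any callable works; str.upper stands in for an OCR call here.
text, seconds = timed(str.upper, "meeting notes")
print(text, f"{seconds:.6f}s")
```

`perf_counter` is a monotonic clock, so the difference is not affected by system clock adjustments during a long-running job.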

In [None]:
@dataclass
class BenchmarkResult:
    doc_name: str
    script: str
    sarvam_accuracy: float
    sarvam_time_s: float
    tesseract_accuracy: float
    tesseract_time_s: float
    sarvam_text: str = field(repr=False, default="")
    tesseract_text: str = field(repr=False, default="")


def run_benchmark(documents: list[TestDocument]) -> list[BenchmarkResult]:
    """Run both OCR engines on every document and return scored results."""
    results: list[BenchmarkResult] = []

    for doc in documents:
        print(f"\nBenchmarking: {doc.name} ({doc.script})")

        print("  Running Sarvam Vision...")
        try:
            sarvam_text, sarvam_time = run_sarvam_ocr(doc.path)
            sarvam_acc = compute_word_accuracy(sarvam_text, doc.ground_truth_words)
        except Exception as exc:
            print(f"  Sarvam error: {exc}")
            sarvam_text, sarvam_time, sarvam_acc = "", 0.0, 0.0

        print("  Running pytesseract...")
        try:
            tess_text, tess_time = run_tesseract_ocr(doc.path, doc.script)
            tess_acc = compute_word_accuracy(tess_text, doc.ground_truth_words)
        except Exception as exc:
            print(f"  Tesseract error: {exc}")
            tess_text, tess_time, tess_acc = "", 0.0, 0.0

        result = BenchmarkResult(
            doc_name=doc.name,
            script=doc.script,
            sarvam_accuracy=round(sarvam_acc, 4),
            sarvam_time_s=round(sarvam_time, 2),
            tesseract_accuracy=round(tess_acc, 4),
            tesseract_time_s=round(tess_time, 2),
            sarvam_text=sarvam_text,
            tesseract_text=tess_text,
        )
        results.append(result)
        print(
            f"  Sarvam  accuracy: {sarvam_acc:.1%}  time: {sarvam_time:.1f}s"
            f" | Tesseract accuracy: {tess_acc:.1%}  time: {tess_time:.2f}s"
        )

    return results

### **7. Demo — Run the Benchmark**

This cell generates all 5 synthetic documents and runs the full benchmark pipeline.
Sarvam Vision jobs are async and typically take 10-30 seconds per document.
Total runtime is approximately 2-5 minutes depending on network latency.

In [None]:
documents = generate_test_documents()
results = run_benchmark(documents)

### **8. Results — Export to Excel and Charts**

`export_results` writes three output files to `outputs/`:
- `benchmark_results.xlsx` — tabular results with per-document scores
- `accuracy_comparison.png` — grouped bar chart of word accuracy by document
- `latency_comparison.png` — grouped bar chart of processing time by document
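
Alongside the per-document rows, a per-engine average can be useful for a quick summary. The following is a minimal sketch with placeholder rows. `mean_accuracy` is a hypothetical helper, and the numbers are made up to exercise the code; they are not benchmark results.

```python
from statistics import mean

# Placeholder rows: (document, first engine accuracy, second engine accuracy).
# These values are illustrative only and were not measured.
rows = [
    ("doc_a", 1.0, 0.5),
    ("doc_b", 0.5, 0.5),
]


def mean_accuracy(rows: list[tuple[str, float, float]]) -> tuple[float, float]:
    """Return the mean accuracy of each engine across all documents."""
    first = mean(r[1] for r in rows)
    second = mean(r[2] for r in rows)
    return first, second


first_avg, second_avg = mean_accuracy(rows)
print(f"first engine: {first_avg:.2f}, second engine: {second_avg:.2f}")
```

With only five documents, an average like this is a rough indicator at best, and the per-document scores remain the primary result.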

In [None]:
OUTPUTS_DIR = Path("outputs")
OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)


def _export_excel(results: list[BenchmarkResult], path: Path) -> None:
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.title = "Benchmark Results"

    headers = [
        "Document", "Script",
        "Sarvam Accuracy", "Sarvam Time (s)",
        "Tesseract Accuracy", "Tesseract Time (s)",
    ]
    header_fill = PatternFill(start_color="1F4E79", end_color="1F4E79", fill_type="solid")
    header_font = Font(color="FFFFFF", bold=True)

    for col_idx, header in enumerate(headers, start=1):
        cell = ws.cell(row=1, column=col_idx, value=header)
        cell.fill = header_fill
        cell.font = header_font
        cell.alignment = Alignment(horizontal="center")

    for row_idx, r in enumerate(results, start=2):
        ws.cell(row=row_idx, column=1, value=r.doc_name)
        ws.cell(row=row_idx, column=2, value=r.script)
        ws.cell(row=row_idx, column=3, value=r.sarvam_accuracy)
        ws.cell(row=row_idx, column=4, value=r.sarvam_time_s)
        ws.cell(row=row_idx, column=5, value=r.tesseract_accuracy)
        ws.cell(row=row_idx, column=6, value=r.tesseract_time_s)

    for col in ws.columns:
        max_len = max(len(str(c.value or "")) for c in col)
        ws.column_dimensions[col[0].column_letter].width = max_len + 4

    wb.save(path)
    print(f"Excel saved: {path}")


def _plot_accuracy(results: list[BenchmarkResult], path: Path) -> None:
    names = [r.doc_name.replace("_", "\n") for r in results]
    sarvam_scores = [r.sarvam_accuracy for r in results]
    tess_scores = [r.tesseract_accuracy for r in results]
    x = range(len(names))
    width = 0.35

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar([i - width / 2 for i in x], sarvam_scores, width, label="Sarvam Vision")
    ax.bar([i + width / 2 for i in x], tess_scores, width, label="pytesseract")
    ax.set_xlabel("Document")
    ax.set_ylabel("Word Accuracy")
    ax.set_title("Word Accuracy Comparison: Sarvam Vision vs pytesseract")
    ax.set_xticks(list(x))
    ax.set_xticklabels(names, ha="center")
    ax.set_ylim(0, 1.1)
    ax.legend()
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)
    print(f"Chart saved: {path}")


def _plot_latency(results: list[BenchmarkResult], path: Path) -> None:
    names = [r.doc_name.replace("_", "\n") for r in results]
    sarvam_times = [r.sarvam_time_s for r in results]
    tess_times = [r.tesseract_time_s for r in results]
    x = range(len(names))
    width = 0.35

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar([i - width / 2 for i in x], sarvam_times, width, label="Sarvam Vision")
    ax.bar([i + width / 2 for i in x], tess_times, width, label="pytesseract")
    ax.set_xlabel("Document")
    ax.set_ylabel("Processing Time (s)")
    ax.set_title("Processing Time Comparison: Sarvam Vision vs pytesseract")
    ax.set_xticks(list(x))
    ax.set_xticklabels(names, ha="center")
    ax.legend()
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)
    print(f"Chart saved: {path}")


def export_results(results: list[BenchmarkResult], output_dir: Path = OUTPUTS_DIR) -> None:
    """Write benchmark results to Excel and two matplotlib bar charts."""
    # Create the target directory in case a custom output_dir is passed.
    output_dir.mkdir(parents=True, exist_ok=True)
    _export_excel(results, output_dir / "benchmark_results.xlsx")
    _plot_accuracy(results, output_dir / "accuracy_comparison.png")
    _plot_latency(results, output_dir / "latency_comparison.png")
    print(f"Results written to {output_dir}/")


if results:
    export_results(results)
else:
    print("No results to export.")

### **9. Error Reference**

| Error | Cause | Solution |
| :--- | :--- | :--- |
| `RuntimeError: SARVAM_API_KEY is not set` | Missing or placeholder API key | Add `SARVAM_API_KEY=...` to `.env` |
| `invalid_api_key_error` (403) | Invalid API key | Verify key at [dashboard.sarvam.ai](https://dashboard.sarvam.ai) |
| `insufficient_quota_error` (429) | Quota exceeded | Check usage limits on dashboard |
| `TesseractNotFoundError` | Tesseract binary not installed | Run `brew install tesseract tesseract-lang` (macOS) or `apt-get install tesseract-ocr` |
| `Failed loading language 'hin'` | Hindi language pack missing | Run `brew install tesseract-lang` or `apt-get install tesseract-ocr-hin` |
| `Failed loading language 'tam'` | Tamil language pack missing | Run `brew install tesseract-lang` or `apt-get install tesseract-ocr-tam` |
| Sarvam job state `Failed` | Unsupported file format or server error | Confirm file is `.zip` or `.pdf`; retry |
| `OSError: cannot open resource` | Font file missing or unreadable | Install the `fonts-noto` package; otherwise `_find_font` falls back to the default PIL font |