Module 4 — Image-to-Text with Tesseract OCR
Objectives
- Install/verify Tesseract OCR.
- Extract text from images (JPG/PNG) and scanned PDFs.
- Clean and normalize OCR output for better downstream performance.
- Chunk and add OCR outputs to a FAISS vector store (extend Module 3).

Dependencies
- Native: Tesseract OCR binary (required)
  - macOS: brew install tesseract
  - Ubuntu/Debian: sudo apt-get install tesseract-ocr
  - Windows: Install from https://github.com/tesseract-ocr/tesseract and note the install path.
- Python: pytesseract, pillow, pymupdf (for scanned PDFs), sentence-transformers, faiss-cpu, tqdm; google-generativeai only if you use the Gemini backend
If needed, install Python libs:
%pip install -q pillow pytesseract pymupdf sentence-transformers faiss-cpu tqdm google-generativeai

In [4]:
import os
import sys
import json
import time
import math
from pathlib import Path
from typing import Any, Dict, List, Tuple, Optional
import numpy as np
from PIL import Image, ImageOps, ImageFilter
import pytesseract
from tqdm import tqdm
import fitz  # PyMuPDF
import faiss  # installed by the faiss-cpu package; there is no importable "faiss_cpu" module
print(f"Python: {sys.version.split()[0]} | FAISS ok | PyMuPDF ok")

Python: 3.12.9 | FAISS ok | PyMuPDF ok


Tesseract setup and validation
- If on Windows, set pytesseract.pytesseract.tesseract_cmd to the installed path, e.g.:
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
- We validate availability by fetching the version.

In [5]:
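# Use the standard Windows install path only on Windows; elsewhere rely on PATH unless TESSERACT_CMD is set.
_DEFAULT_TESSERACT_CMD = r"C:\Program Files\Tesseract-OCR\tesseract.exe" if os.name == "nt" else ""
TESSERACT_CMD = os.getenv("TESSERACT_CMD", _DEFAULT_TESSERACT_CMD)
if TESSERACT_CMD:
    pytesseract.pytesseract.tesseract_cmd = TESSERACT_CMD
try:
    print("Tesseract version:", pytesseract.get_tesseract_version())
except Exception:
    print("ERROR: Tesseract is not found. Install it and/or set TESSERACT_CMD.")
    raise  # re-raise with the original traceback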

Tesseract version: 5.5.0.20241111
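
As a minimal sketch of the lookup order described above, the helper below prefers an explicit TESSERACT_CMD value and otherwise falls back to whatever tesseract executable is on PATH. The name find_tesseract is hypothetical and is not part of pytesseract.

```python
import os
import shutil
from typing import Optional

def find_tesseract(env_var: str = "TESSERACT_CMD") -> Optional[str]:
    """Return a candidate path to the tesseract binary, or None if nothing is found.

    Hypothetical helper for illustration; it does not check that the path is valid.
    """
    explicit = os.getenv(env_var, "").strip()
    if explicit:
        return explicit  # user-provided path wins
    return shutil.which("tesseract")  # None when tesseract is not on PATH

candidate = find_tesseract()
print("Tesseract candidate:", candidate or "not found; install it or set TESSERACT_CMD")
```

If a candidate is found, it can be assigned to pytesseract.pytesseract.tesseract_cmd before calling get_tesseract_version().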


Configuration
- INPUT_IMG_DIR: folder of JPG/PNG images to OCR
- INPUT_PDF_DIR: folder of scanned PDFs (or PDFs you want to force OCR)
- OUTPUT_DIR: where JSONL of OCR chunks will be written
- OCR settings: language(s), OEM/PSM, DPI for PDF rasterization
- Chunking: max_chars and overlap; matches downstream RAG usage
- Index: path to existing FAISS index (from Module 3) or create new

In [6]:
INPUT_IMG_DIR = Path("./../data/images")
INPUT_PDF_DIR = Path("./../data/scanned_pdfs")
OUTPUT_DIR = Path("./../data/ocr_outputs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
OCR_LANG = "eng"
OCR_OEM = 3
OCR_PSM = 3
PDF_DPI = 300
MAX_CHARS = 1200
OVERLAP = 200
INDEX_DIR = Path("./../data/indexes/my_corpus")
INDEX_DIR.mkdir(parents=True, exist_ok=True)
EMBEDDING_BACKEND = "gemini"  # "minilm" or "gemini"
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY", "")
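
To make the PDF_DPI setting concrete: PDF page sizes are expressed in points (72 per inch), so rasterizing at a given DPI scales each dimension by dpi/72. The sketch below only does that arithmetic; raster_size and rgb_bytes are hypothetical helper names, and the pixel size PyMuPDF actually produces may differ by a pixel due to rounding.

```python
from typing import Tuple

def raster_size(width_pt: float, height_pt: float, dpi: int = 300) -> Tuple[int, int]:
    """Approximate pixel size of a page rendered at `dpi` (72 points per inch)."""
    zoom = dpi / 72.0
    return round(width_pt * zoom), round(height_pt * zoom)

def rgb_bytes(size: Tuple[int, int]) -> int:
    """Uncompressed RGB footprint: 3 bytes per pixel."""
    return size[0] * size[1] * 3

letter_300 = raster_size(612, 792, dpi=300)   # US Letter is 612 x 792 points
letter_150 = raster_size(612, 792, dpi=150)
print(letter_300, rgb_bytes(letter_300))      # (2550, 3300) 25245000
print(letter_150, rgb_bytes(letter_150))      # (1275, 1650) 6311250
```

Doubling the DPI quadruples the pixel count, which is why higher settings cost noticeably more time and memory per page.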

Embedding backends (reuse Module 3 approach)
- MiniLM (local, no API key needed)
- Gemini (requires GOOGLE_API_KEY; this is the backend selected by EMBEDDING_BACKEND in the configuration above)

In [7]:
class MiniLMBackend:
    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        from sentence_transformers import SentenceTransformer
        import torch
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = SentenceTransformer(model_name, device=device)
        self.dim = self.model.get_sentence_embedding_dimension()
        self.model_name = model_name
    def encode(self, texts: List[str], batch_size: int = 64) -> np.ndarray:
        return self.model.encode(texts, batch_size=batch_size, convert_to_numpy=True, normalize_embeddings=False)

class GeminiBackend:
    def __init__(self, model_name: str = "text-embedding-004", api_key: Optional[str] = None):
        assert api_key, "GOOGLE_API_KEY required for Gemini embeddings."
        import google.generativeai as genai
        genai.configure(api_key=api_key)
        self.genai = genai
        self.dim = 768
        self.model_name = model_name
    def encode(self, texts: List[str], batch_size: int = 16) -> np.ndarray:
        vecs: List[List[float]] = []
        for i in tqdm(range(0, len(texts), batch_size), desc="Gemini embedding"):
            for t in texts[i:i+batch_size]:
                resp = self.genai.embed_content(model=self.model_name, content=t[:4000])
                vecs.append(resp["embedding"])
        return np.asarray(vecs, dtype=np.float32)

def get_backend(name: str):
    if name.lower() == "minilm":
        be = MiniLMBackend()
        print(f"Using MiniLM backend: {be.model_name}, dim={be.dim}")
        return be
    elif name.lower() == "gemini":
        be = GeminiBackend(api_key=GOOGLE_API_KEY)
        print(f"Using Gemini backend: {be.model_name}, dim={be.dim}")
        return be
    else:
        raise ValueError("EMBEDDING_BACKEND must be 'minilm' or 'gemini'.")

backend = get_backend(EMBEDDING_BACKEND)

Using Gemini backend: text-embedding-004, dim=768


Image preprocessing helpers
- Convert to grayscale
- Optional upscaling
- Contrast/threshold (simple binary threshold)
- Orientation correction using Tesseract OSD (if available); OSD reports rotation in 90° steps, so small skew angles are not corrected

In [8]:
import re
import numpy as np
def to_grayscale(img: Image.Image) -> Image.Image:
    return ImageOps.grayscale(img)
def upscale(img: Image.Image, factor: float = 1.5) -> Image.Image:
    if factor == 1.0:
        return img
    w, h = img.size
    return img.resize((int(w*factor), int(h*factor)), Image.LANCZOS)
def binarize(img: Image.Image, thresh: int = 180) -> Image.Image:
    return img.point(lambda p: 255 if p > thresh else 0)
def deskew(img: Image.Image) -> Image.Image:
    try:
        osd = pytesseract.image_to_osd(img)
        m = re.search(r"Rotate: (\d+)", osd)
        if m:
            angle = int(m.group(1))
            if angle != 0:
                return img.rotate(-angle, expand=True, fillcolor="white")
    except Exception:
        pass
    return img
def preprocess_image(img: Image.Image, upscale_factor: float = 1.5, use_binarize: bool = True, do_deskew: bool = True) -> Image.Image:
    x = to_grayscale(img)
    x = upscale(x, factor=upscale_factor)
    if do_deskew:
        x = deskew(x)
    if use_binarize:
        x = binarize(x, thresh=180)
    x = x.filter(ImageFilter.MedianFilter(size=3))
    return x
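
The binarize helper above uses a fixed threshold of 180. As an alternative technique (not what this notebook does), Otsu's method picks the threshold that maximizes the between-class variance of a grayscale histogram. The sketch below is a plain-Python version operating on a 256-bin histogram; otsu_threshold is a hypothetical name, and the two-peak histogram is made up for illustration.

```python
from typing import Sequence

def otsu_threshold(hist: Sequence[int]) -> int:
    """Return the gray level t that best separates pixels <= t from pixels > t."""
    total = sum(hist)
    if total == 0:
        raise ValueError("histogram is empty")
    sum_all = sum(level * count for level, count in enumerate(hist))
    w_dark = 0       # number of pixels at or below the candidate threshold
    sum_dark = 0     # sum of their gray levels
    best_t, best_var = 0, -1.0
    for t, count in enumerate(hist):
        w_dark += count
        sum_dark += t * count
        w_light = total - w_dark
        if w_dark == 0:
            continue            # no dark pixels yet
        if w_light == 0:
            break               # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t

# Synthetic histogram: dark ink around level 50, light paper around level 200.
hist = [0] * 256
hist[50] = 100
hist[200] = 100
t = otsu_threshold(hist)
print(t)  # a level between the two peaks
```

With Pillow, otsu_threshold(gray.histogram()) on a mode "L" image could replace the fixed thresh=180 passed to binarize, since binarize keeps pixels with p > thresh as white.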

OCR functions
- image_to_text: OCR a PIL image to text and average confidence
- pdf_page_to_image: rasterize a PDF page at the configured DPI
- ocr_image_path: end-to-end for an image file
- ocr_pdf_path: end-to-end for a PDF (all pages)

In [9]:
def image_to_text(img: Image.Image, lang: str = OCR_LANG, oem: int = OCR_OEM, psm: int = OCR_PSM) -> Tuple[str, float]:
    config = f"--oem {oem} --psm {psm}"
    txt = pytesseract.image_to_string(img, lang=lang, config=config)
    try:
        data = pytesseract.image_to_data(img, lang=lang, config=config, output_type=pytesseract.Output.DICT)
        # conf may arrive as int, float, or numeric str depending on versions; -1 marks rows that are not words
        confs = [float(c) for c in data.get("conf", []) if float(c) >= 0]
        avg_conf = sum(confs) / len(confs) if confs else 0.0
    except Exception:
        avg_conf = 0.0
    return txt, avg_conf

def pdf_page_to_image(pdf: fitz.Document, page_index: int, dpi: int = PDF_DPI) -> Image.Image:
    page = pdf.load_page(page_index)
    mat = fitz.Matrix(dpi/72, dpi/72)
    pix = page.get_pixmap(matrix=mat, alpha=False)
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    return img

def ocr_image_path(path: Path) -> Dict[str, Any]:
    img = Image.open(path)
    proc = preprocess_image(img)
    text, conf = image_to_text(proc)
    return {
        "source_path": str(path),
        "doc_id": path.stem,
        "page": None,
        "text": text,
        "avg_conf": conf,
        "lang": OCR_LANG,
        "psm": OCR_PSM,
        "oem": OCR_OEM,
        "dpi": None,
        "kind": "image"
    }

def ocr_pdf_path(path: Path) -> List[Dict[str, Any]]:
    out: List[Dict[str, Any]] = []
    with fitz.open(str(path)) as doc:
        for pno in tqdm(range(doc.page_count), desc=f"OCR {path.name}"):
            img = pdf_page_to_image(doc, pno, dpi=PDF_DPI)
            proc = preprocess_image(img)
            text, conf = image_to_text(proc)
            out.append({
                "source_path": str(path),
                "doc_id": path.stem,
                "page": pno + 1,
                "text": text,
                "avg_conf": conf,
                "lang": OCR_LANG,
                "psm": OCR_PSM,
                "oem": OCR_OEM,
                "dpi": PDF_DPI,
                "kind": "pdf_page"
            })
    return out
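
The avg_conf value returned by image_to_text is the mean of Tesseract's per-word confidences, with the -1 entries (rows that are not words) excluded. The sketch below isolates that calculation on made-up values and adds a review flag; mean_confidence, needs_review, and the 60.0 cutoff are assumptions for illustration, not values from this module.

```python
from typing import Iterable, Union

def mean_confidence(conf_values: Iterable[Union[int, float, str]]) -> float:
    """Average the non-negative confidences, tolerating ints, floats, and numeric strings."""
    scores = []
    for c in conf_values:
        try:
            v = float(c)
        except (TypeError, ValueError):
            continue  # skip anything that is not numeric
        if v >= 0:    # -1 marks layout rows without a recognized word
            scores.append(v)
    return sum(scores) / len(scores) if scores else 0.0

def needs_review(avg_conf: float, min_conf: float = 60.0) -> bool:
    """Flag a page for reprocessing; the 60.0 cutoff is an arbitrary example value."""
    return avg_conf < min_conf

sample_conf = ["-1", -1, 96, "88.5", 91.0]  # made-up values
avg = mean_confidence(sample_conf)
print(round(avg, 2), needs_review(avg))  # 91.83 False
```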

Cleaning OCR output
- Normalize whitespace and line breaks
- Fix common hyphenation across line breaks (word-)
- Remove obvious junk lines (very short, all punctuation)

In [10]:
def clean_ocr_text(s: str) -> str:
    if not s:
        return ""
    s = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", s)
    s = re.sub(r"[ \t]*\n[ \t]*", "\n", s)  # trim spaces/tabs on both sides of every line break
    s = re.sub(r"\n{3,}", "\n\n", s)
    lines = s.splitlines()
    keep = []
    for ln in lines:
        ln2 = ln.strip()
        if not ln2:
            keep.append("")
            continue
        alnum = sum(ch.isalnum() for ch in ln2)
        if alnum < max(3, int(0.2 * len(ln2))):
            continue
        keep.append(ln2)
    s = "\n".join(keep)
    s = re.sub(r"[ \t]{2,}", " ", s)
    return s.strip()
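
A small worked example may help when tuning these heuristics. The sketch below reuses the same hyphenation regex and junk-line rule as clean_ocr_text on a made-up sample; dehyphenate and is_junk_line are hypothetical names introduced only for this illustration.

```python
import re

def dehyphenate(s: str) -> str:
    """Join words split as 'oxy-' + newline + 'genated' (same regex as clean_ocr_text)."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", s)

def is_junk_line(line: str) -> bool:
    """Same rule as clean_ocr_text: too few alphanumeric characters for the line length."""
    stripped = line.strip()
    alnum = sum(ch.isalnum() for ch in stripped)
    return bool(stripped) and alnum < max(3, int(0.2 * len(stripped)))

sample = "The heart pumps oxy-\ngenated blood.\n~~ | ~~\nOK\nNext line"
fixed = dehyphenate(sample)
kept = [ln for ln in fixed.splitlines() if not is_junk_line(ln)]
print(kept)  # ['The heart pumps oxygenated blood.', 'Next line']
```

Note two side effects visible here: the short but meaningful line "OK" is dropped because it has fewer than three alphanumeric characters, and the hyphenation rule would also join genuine compounds such as "well-" followed by "known" on the next line.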

Chunking (simple, overlap-aware)
- Split cleaned text to chunks with overlap
- Record char spans for traceability

In [11]:
def chunk_text(text: str, max_chars: int = MAX_CHARS, overlap: int = OVERLAP) -> List[Dict[str, Any]]:
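    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars, otherwise the loop never advances")
    text = text.strip()
    if not text:
        return []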
    chunks = []
    i = 0
    while i < len(text):
        end = min(len(text), i + max_chars)
        t = text[i:end]
        chunks.append({"text": t, "char_start": i, "char_end": i + len(t)})
        if end >= len(text):
            break
        i = max(0, end - overlap)
    return chunks
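
To see how MAX_CHARS and OVERLAP interact, the sketch below computes only the character spans, using the same stepping rule as chunk_text (each chunk starts overlap characters before the previous one ended). chunk_spans is a hypothetical helper for illustration.

```python
from typing import List, Tuple

def chunk_spans(n_chars: int, max_chars: int = 1200, overlap: int = 200) -> List[Tuple[int, int]]:
    """Return (char_start, char_end) pairs for a text of n_chars characters."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    spans = []
    start = 0
    while start < n_chars:
        end = min(n_chars, start + max_chars)
        spans.append((start, end))
        if end >= n_chars:
            break
        start = end - overlap  # effective stride is max_chars - overlap
    return spans

print(chunk_spans(3000))  # [(0, 1200), (1000, 2200), (2000, 3000)]
```

With the defaults, the effective stride is 1000 characters, so a 3000-character page yields three chunks and each boundary region appears in two of them.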

OCR runners
- Process all images in INPUT_IMG_DIR (.jpg, .jpeg, .png, .tiff, .bmp)
- Process all PDFs in INPUT_PDF_DIR
- Produce chunk records and write a JSONL

In [12]:
def ocr_images_in_dir(img_dir: Path) -> List[Dict[str, Any]]:
    records: List[Dict[str, Any]] = []
    if not img_dir.exists():
        return records
    files = [p for p in img_dir.glob("*") if p.suffix.lower() in {".jpg", ".jpeg", ".png", ".tiff", ".bmp"}]
    for p in tqdm(files, desc="OCR images"):
        rec = ocr_image_path(p)
        clean = clean_ocr_text(rec["text"])
        chs = chunk_text(clean)
        for idx, ch in enumerate(chs):
            records.append({
                "id": f"{rec['doc_id']}img{idx}",
                "text": ch["text"],
                "metadata": {
                    "doc_id": rec["doc_id"],
                    "source_path": rec["source_path"],
                    "page": None,
                    "char_start": ch["char_start"],
                    "char_end": ch["char_end"],
                    "ocr_avg_conf": rec["avg_conf"],
                    "ocr_lang": rec["lang"],
                    "ocr_psm": rec["psm"],
                    "ocr_oem": rec["oem"],
                    "dpi": rec["dpi"],
                    "kind": rec["kind"]
                }
            })
    return records

def ocr_pdfs_in_dir(pdf_dir: Path) -> List[Dict[str, Any]]:
    records: List[Dict[str, Any]] = []
    if not pdf_dir.exists():
        return records
    files = [p for p in pdf_dir.glob("*.pdf")]
    for p in files:
        page_recs = ocr_pdf_path(p)
        for pr in page_recs:
            clean = clean_ocr_text(pr["text"])
            chs = chunk_text(clean)
            for idx, ch in enumerate(chs):
                records.append({
                    # Separators keep IDs unique: without them, page 1 chunk 12 and page 11 chunk 2 would collide.
                    "id": f"{pr['doc_id']}_p{pr['page']}_{idx}",
                    "text": ch["text"],
                    "metadata": {
                        "doc_id": pr["doc_id"],
                        "source_path": pr["source_path"],
                        "page": pr["page"],
                        "char_start": ch["char_start"],
                        "char_end": ch["char_end"],
                        "ocr_avg_conf": pr["avg_conf"],
                        "ocr_lang": pr["lang"],
                        "ocr_psm": pr["psm"],
                        "ocr_oem": pr["oem"],
                        "dpi": pr["dpi"],
                        "kind": pr["kind"]
                    }
                })
    return records

def save_jsonl(records: List[Dict[str, Any]], out_path: Path):
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
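
For merging or inspecting chunk files later, the counterpart to save_jsonl is a reader that parses one JSON object per non-empty line. The sketch below is a hypothetical load_jsonl with a round trip through a temporary file; the demo record is made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List

def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read one JSON object per non-empty line."""
    records: List[Dict[str, Any]] = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records

demo = [{"id": "demo_img_0", "text": "café ®", "metadata": {"page": None}}]
with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "ocr_chunks.jsonl"
    # Write in the same format as save_jsonl: UTF-8, ensure_ascii=False, one record per line.
    with p.open("w", encoding="utf-8") as f:
        for r in demo:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    restored = load_jsonl(p)
print(restored == demo)  # True
```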

Run OCR and write chunks
- Produces ocr_chunks.jsonl in OUTPUT_DIR
- You can merge these with Module 2 chunks if desired

In [13]:
ocr_records = []
ocr_records += ocr_images_in_dir(INPUT_IMG_DIR)
ocr_records += ocr_pdfs_in_dir(INPUT_PDF_DIR)
out_ocr = OUTPUT_DIR / "ocr_chunks.jsonl"
save_jsonl(ocr_records, out_ocr)
print(f"OCR chunks written: {len(ocr_records)} -> {out_ocr}")
for r in ocr_records[:3]:
    print(r["id"], "| conf:", r["metadata"]["ocr_avg_conf"], "| page:", r["metadata"]["page"])
    print(r["text"][:200].replace("\n"," "), "...\n")

OCR images: 100%|██████████| 3/3 [00:06<00:00,  2.26s/it]

OCR chunks written: 3 -> ..\data\ocr_outputs\ocr_chunks.jsonl
factsimg0 | conf: 94.96551724137932 | page: None
indiatoday®  Heart has its own electrical impulse so, it can continue to beat even when separated from the body, as long has it has a supply of oxygen ...

spandanimg0 | conf: 84.87755102040816 | page: None
4 -spandap’ How to Keep Heart Healthy A Comprehensive Guide for Every Age les! Eat heart-friendly, balanced diet kK Exercise regularly for stronger heart ar Manage stress with mindfulness techniques @ ...

spandan_factsimg0 | conf: 93.02272727272727 | page: None
O04   15 Facts About the Heart That You Didn't Know  Can beat outside body 08 Heart cells stop dividing early in life Laughter benefits O09 Heart can enlarge due to exercise or disease Represents love ...






Vector store integration (FAISS)
- Load existing FAISS index from Module 3 (if present) and append OCR chunks, or create a new index.
- We store the FAISS index (index.faiss), the L2-normalized embeddings (vectors.npy), the chunk records (docstore.jsonl), and manifest.json.

In [14]:
def l2_normalize(vecs: np.ndarray) -> np.ndarray:
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return vecs / norms

def build_faiss_index(vectors: np.ndarray) -> faiss.Index:
    dim = vectors.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(vectors)
    return index

def load_index(in_dir: Path) -> Tuple[faiss.Index, np.ndarray, List[Dict[str, Any]], Dict[str, Any]]:
    index = faiss.read_index(str(in_dir / "index.faiss"))
    vectors = np.load(in_dir / "vectors.npy")
    docstore = []
    with (in_dir / "docstore.jsonl").open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                docstore.append(json.loads(line))
    manifest = json.loads((in_dir / "manifest.json").read_text(encoding="utf-8"))
    return index, vectors, docstore, manifest

def save_index(index: faiss.Index, vectors: np.ndarray, docstore: List[Dict[str, Any]], backend_name: str, dim: int, out_dir: Path):
    out_dir.mkdir(parents=True, exist_ok=True)
    faiss.write_index(index, str(out_dir / "index.faiss"))
    np.save(out_dir / "vectors.npy", vectors)
    with (out_dir / "docstore.jsonl").open("w", encoding="utf-8") as f:
        for d in docstore:
            f.write(json.dumps(d, ensure_ascii=False) + "\n")
    manifest = {"backend": backend_name, "dim": dim, "count": int(vectors.shape[0]), "created_at": time.time()}
    with (out_dir / "manifest.json").open("w", encoding="utf-8") as f:
        json.dump(manifest, f, ensure_ascii=False, indent=2)

def prepare_texts(records: List[Dict[str, Any]], max_len: int = 4000) -> List[str]:
    out = []
    for r in records:
        t = (r["text"] or "").strip()
        out.append(t[:max_len] if max_len and len(t) > max_len else t)
    return out
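
The reason for L2-normalizing before adding to IndexFlatIP is that the inner product of two unit-length vectors equals their cosine similarity. The sketch below checks this in plain Python on two made-up 2-D vectors; the helper names are hypothetical.

```python
import math
from typing import List

def l2_normalize_vec(v: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
ip_normalized = dot(l2_normalize_vec(a), l2_normalize_vec(b))
print(round(ip_normalized, 6), round(cosine(a, b), 6))  # 0.96 0.96
```

This is why both the stored vectors and the query vector need the same normalization: skipping it on either side turns the score back into a raw inner product.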

In [15]:
texts = prepare_texts(ocr_records)
if not texts:
    print("No OCR texts to embed. Skipping index update.")
else:
    X = backend.encode(texts, batch_size=64).astype(np.float32)
    X = l2_normalize(X)
    new_docs = [{"id": r["id"], "text": r["text"], "metadata": r["metadata"]} for r in ocr_records]
    index_path = INDEX_DIR / "index.faiss"
    if index_path.exists():
        index, vectors, docstore, manifest = load_index(INDEX_DIR)
        if vectors.shape[1] != X.shape[1]:
            raise ValueError(f"Dim mismatch: existing {vectors.shape[1]} vs new {X.shape[1]}. Use the same embedding backend.")
        index.add(X)
        vectors = np.concatenate([vectors, X], axis=0)
        docstore.extend(new_docs)
        save_index(index, vectors, docstore, backend_name=EMBEDDING_BACKEND, dim=vectors.shape[1], out_dir=INDEX_DIR)
        print(f"Index updated. Total vectors: {vectors.shape[0]}")
    else:
        index = build_faiss_index(X)
        vectors = X
        docstore = new_docs
        save_index(index, vectors, docstore, backend_name=EMBEDDING_BACKEND, dim=vectors.shape[1], out_dir=INDEX_DIR)
        print(f"New index created at {INDEX_DIR} with {vectors.shape[0]} vectors.")

Gemini embedding: 100%|██████████| 1/1 [00:02<00:00,  2.62s/it]

Index updated. Total vectors: 19
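
The update cell above appends every OCR chunk each time it runs, so re-running it adds the same chunks again. One possible guard, sketched below under the assumption that chunk IDs are stable across runs, is to keep only candidates whose ID is not already in the docstore. new_doc_positions is a hypothetical helper and the records are made up.

```python
from typing import Any, Dict, List

def new_doc_positions(existing: List[Dict[str, Any]], candidates: List[Dict[str, Any]]) -> List[int]:
    """Return positions in `candidates` whose id is not already present (or repeated earlier)."""
    seen = {d["id"] for d in existing}
    keep: List[int] = []
    for i, d in enumerate(candidates):
        if d["id"] not in seen:
            seen.add(d["id"])
            keep.append(i)
    return keep

existing_docs = [{"id": "facts_img_0"}, {"id": "report_p1_0"}]
candidate_docs = [{"id": "facts_img_0"}, {"id": "scan_p1_0"}, {"id": "scan_p1_0"}, {"id": "scan_p2_0"}]
keep = new_doc_positions(existing_docs, candidate_docs)
print(keep)  # [1, 3]
```

The returned positions could be used to filter both the texts and the records before encoding, so that only new chunks are embedded and added to the index.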





Quick search to validate OCR content is indexed
- Simple cosine search using FAISS IndexFlatIP and normalized vectors

In [18]:
def embed_query(q: str, backend) -> np.ndarray:
    v = backend.encode([q], batch_size=1).astype(np.float32)
    return l2_normalize(v)

def search(index: faiss.Index, vectors: np.ndarray, docstore: List[Dict[str, Any]], query: str, top_k: int = 5) -> List[Dict[str, Any]]:
    qv = embed_query(query, backend)
    D, I = index.search(qv, top_k)
    res = []
    for score, idx in zip(D[0], I[0]):
        if idx < 0:  # FAISS pads results with -1 when fewer than top_k vectors are available
            continue
        d = docstore[idx]
        res.append({"score": float(score), "id": d["id"], "text": d["text"], "metadata": d["metadata"]})
    return res

if (INDEX_DIR / "index.faiss").exists():
    idx, vecs, store, mani = load_index(INDEX_DIR)
    for q in ["electrical impulse", "heart cells"]:
        print(f"\nQuery: {q}")
        results = search(idx, vecs, store, q, top_k=3)
        for r in results:
            m = r["metadata"]
            loc = f"{m.get('doc_id')} p.{m.get('page')}" if m.get("page") else m.get("doc_id")
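            # Uncomment to also print score and OCR metadata; ocr_avg_conf may be missing on non-OCR chunks, so default it:
            # print(f"  score={r['score']:.3f} | {loc} | kind={m.get('kind')} | conf={m.get('ocr_avg_conf') or 0.0:.1f}")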
            print("   ", r["text"][:200].replace("\n"," "), "...")
else:
    print("Index not found; skipped query demo.")


Query: electrical impulse


Gemini embedding: 100%|██████████| 1/1 [00:00<00:00,  1.84it/s]


    indiatoday®  Heart has its own electrical impulse so, it can continue to beat even when separated from the body, as long has it has a supply of oxygen ...
    O04   15 Facts About the Heart That You Didn't Know  Can beat outside body 08 Heart cells stop dividing early in life Laughter benefits O09 Heart can enlarge due to exercise or disease Represents love ...
    olic (lower number) is the pressure against the walls of the arteries during  the relaxation phase (the heart is at rest between beats).  •  High blood pressure is called the “silent killer” because t ...

Query: heart cells


Gemini embedding: 100%|██████████| 1/1 [00:01<00:00,  1.19s/it]

    O04   15 Facts About the Heart That You Didn't Know  Can beat outside body 08 Heart cells stop dividing early in life Laughter benefits O09 Heart can enlarge due to exercise or disease Represents love ...
    indiatoday®  Heart has its own electrical impulse so, it can continue to beat even when separated from the body, as long has it has a supply of oxygen ...
    Heart Health  In the United States, heart disease is the leading cause of death for both men and women. People with  pre-diabetes and/or the metabolic syndrome are at higher risk for developing heart  ...





Mini Task
1) Put a scanned PDF in data/scanned_pdfs or images in data/images.
2) Run OCR cells to produce ocr_chunks.jsonl.
3) Run embedding + FAISS update cells to add OCR chunks to your existing index.
4) Run the quick search to verify OCR text is retrievable (with metadata like page and avg confidence).
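
For step 4, a summary line per hit makes the metadata check easier. The sketch below formats one search result and tolerates chunks that have no page or no OCR confidence (for example, chunks that did not come from OCR). format_hit is a hypothetical helper and the sample hits are illustrative values.

```python
from typing import Any, Dict

def format_hit(hit: Dict[str, Any]) -> str:
    """Build a one-line summary of a search result, with fallbacks for missing metadata."""
    m = hit.get("metadata") or {}
    doc_id = m.get("doc_id", "?")
    page = m.get("page")
    loc = f"{doc_id} p.{page}" if page else str(doc_id)
    conf = m.get("ocr_avg_conf")
    conf_txt = f"{conf:.1f}" if isinstance(conf, (int, float)) else "n/a"
    return f"score={hit['score']:.3f} | {loc} | kind={m.get('kind', 'n/a')} | conf={conf_txt}"

ocr_hit = {"score": 0.8123, "metadata": {"doc_id": "facts", "page": None, "kind": "image", "ocr_avg_conf": 94.9655}}
text_hit = {"score": 0.7001, "metadata": {"doc_id": "report", "page": 3}}
print(format_hit(ocr_hit))   # score=0.812 | facts | kind=image | conf=95.0
print(format_hit(text_hit))  # score=0.700 | report p.3 | kind=n/a | conf=n/a
```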

Notes and troubleshooting
- Multi-language OCR: install language packs (e.g., tesseract-ocr-deu) and set OCR_LANG="eng+deu".
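- PSM tuning: PSM 4 (single column of variable-size text) or 6 (single uniform block) can help for single-column docs; 11 or 12 (sparse text, 12 adds OSD) for sparse text. PSM 13 treats the image as a single raw text line.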
- DPI: 300 improves accuracy for low-res scans; higher DPI increases time and memory.
- Deskew: OSD only detects orientation in 90° steps and may fail on very noisy pages; consider OpenCV-based deskew for robust small-angle rotation correction.
- Junk removal: Adjust cleaning heuristics if you see important lines removed.
- Confidence: Average conf is approximate; use to flag low-quality OCR for reprocessing.
- Index mixing: Keep the same embedding backend/dimension across updates (don’t mix MiniLM and Gemini in one index).
- Pipeline: Load → Preprocess → OCR → Clean → Chunk → Embed → Index. This clean separation follows the MCP-style flow and makes it easy to swap steps later.
This sets you up for Module 5 (RAG pipeline) where we’ll wire retrieval to an LLM with prompts that cite page numbers and sources, and Module 6 (agent memory + multi-step reasoning).