# Module 1 — Understanding Document AI & MCP Flow
## Objectives
- Understand what **Document Intelligence** is and common real-world use cases. 🧠
- Learn the idea of **MCP-style workflows** for multi-step AI pipelines.
- See the role of tools you'll use later: **LangChain, Gemini API, FAISS, and Tesseract OCR**.
- Run a small, end-to-end “MCP Flow” skeleton you'll extend in Modules 2 and 3. 🚀

## What is Document Intelligence?
- The practice of automatically **parsing, understanding, and acting** on information in documents (PDFs, images, scans, HTML).
- Goes beyond raw text extraction to include **structure** (headings, tables, figures), **semantics** (entities, relations), and **reasoning** (summaries, answers, compliance checks).

### Real-world use cases
- **Contracts**: clause extraction, risk flags, change tracking, negotiation assistants.
- **Research papers**: literature mapping, citation graphing, key findings extraction.
- **Manuals/SOPs**: task-aware search, troubleshooting assistants, training content generation.
- **Invoices/receipts**: field extraction, validation, and accounting workflows.
- **Healthcare**: de-identification, clinical fact extraction, coding support (with strict compliance).

### Quality pillars
- **Accuracy** (OCR quality, parsing fidelity)
- **Traceability** (citations, page refs)
- **Latency/cost** (chunking, caching)
- **Safety/Compliance** (PII, PHI handling)
- **Maintainability** (clear, observable pipelines)
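
To make the traceability pillar concrete, here is a minimal sketch of carrying a source location alongside every extracted snippet. The `Citation` class and its field names are hypothetical and used only for illustration; later modules may structure this differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Where a snippet came from (all field names are illustrative)."""
    source_name: str
    page: int        # 1-based page number
    char_start: int  # offset into the page text
    char_end: int

    def label(self) -> str:
        return f"{self.source_name}, p. {self.page} [{self.char_start}:{self.char_end}]"


page_text = "2. Termination: Either party may terminate with 30 days' notice."
start = page_text.index("30 days")
cite = Citation("sample_contract.txt", page=1, char_start=start, char_end=start + len("30 days"))

print(cite.label())
print(page_text[cite.char_start:cite.char_end])
```

Storing offsets rather than copied text lets a reviewer jump back to the exact span that produced an answer.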

## MCP-style Workflows: Why and How

We'll use **MCP** here to mean **“Model-Context-Protocol-style”** workflows: modular, tool-driven pipelines that pass context through well-defined steps (load → parse → chunk → embed → index → retrieve → reason → output).

### Benefits
- **Composability**: Swap implementations per step without breaking the pipeline.
- **Observability**: Log and inspect each stage (inputs, outputs, metadata).
- **Persistence**: Save intermediate artifacts (chunks, embeddings, indexes).
- **Idempotence**: Re-run safely on the same inputs.
- **Reusability**: The same flow works for contracts or research papers with different adapters.
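
Persistence and idempotence are often combined by caching each step's output under a key derived from its input. The sketch below assumes an in-memory dictionary as a stand-in for a real artifact store; `content_key` and `cached_step` are hypothetical helper names, not part of any library.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

_cache: Dict[str, Any] = {}  # in-memory stand-in for a file or blob store


def content_key(step_name: str, payload: str, params: Dict[str, Any]) -> str:
    """Derive a stable key from the step name, its input, and its parameters."""
    blob = json.dumps({"step": step_name, "payload": payload, "params": params}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def cached_step(step_name: str, payload: str, params: Dict[str, Any],
                fn: Callable[[str], Any]) -> Any:
    """Run fn(payload) once per unique (step, payload, params) combination."""
    key = content_key(step_name, payload, params)
    if key not in _cache:
        _cache[key] = fn(payload)
    return _cache[key]


calls: List[str] = []


def split_lines(text: str) -> List[str]:
    calls.append(text)  # track how often the real work happens
    return text.splitlines()


first = cached_step("split", "a\nb", {"mode": "lines"}, split_lines)
second = cached_step("split", "a\nb", {"mode": "lines"}, split_lines)
print(first, second, len(calls))
```

Including the parameters in the key means a change to, say, chunk size invalidates the cached result instead of silently reusing it.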

### Orchestration options
- **LangChain/LangGraph**: chain and graph-based control, memory, tools, retrievers.
- **Custom orchestrators**: simple Python runners (we'll build a minimal one here).

### Where tools fit
- **Parsing**: PyMuPDF (PDF text + layout), Tesseract OCR (for scanned PDFs/images).
- **Embeddings/Index**: `sentence-transformers` or Gemini Embeddings → FAISS for similarity search.
- **Reasoning**: Gemini API or other LLMs via LangChain for Q&A, summaries, extraction.

## Environment setup
- This module has no hard dependencies on APIs. You can optionally set up Gemini for later modules.
- Tesseract is a native binary; we'll provide install tips but not use it here.

In [1]:
import sys
!{sys.executable} -m pip install -q python-dotenv




In [2]:
import os
import sys
import time
import json
import math
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Callable

print(f"Python {sys.version}")
print("Module 1 setup complete (no external API required).")

Python 3.12.9 (main, Mar 17 2025, 21:06:20) [MSC v.1943 64 bit (AMD64)]
Module 1 setup complete (no external API required).


### Optional: Configure API keys for later modules (not used in Module 1)
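- **Gemini**: set `GOOGLE_API_KEY` in your environment (for example, `export GOOGLE_API_KEY="<your-key>"` on macOS/Linux) or in a `.env` file, which the next cell loads with `python-dotenv`.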
- **Tesseract OCR** is a native dependency:
  - macOS (Homebrew): `brew install tesseract`
  - Ubuntu/Debian: `sudo apt-get install tesseract-ocr`
  - Windows: install from https://github.com/tesseract-ocr/tesseract

The next cell checks whether the key is visible to Python. It is safe to skip if you have not configured anything yet.

In [3]:
from dotenv import load_dotenv
load_dotenv()
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY", "")
if GOOGLE_API_KEY:
    print("✅ Gemini API key detected in environment (ready for later modules).")
else:
    print("⚠️ Gemini API key not set (OK for Module 1).")

✅ Gemini API key detected in environment (ready for later modules).
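
A similar pre-flight check can be done for the Tesseract binary. This is a minimal sketch that only looks for an executable named `tesseract` on `PATH` using the standard library; `find_tesseract` is a hypothetical helper name, and an installation in a non-standard location would not be detected.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the path to the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


path = find_tesseract()
if path:
    print(f"Tesseract found at {path}")
else:
    print("Tesseract not found on PATH (fine for Module 1).")
```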


## Tools Overview
- **LangChain**: Toolkit to build RAG, agents, and multi-step pipelines. Provides document loaders, text splitters, retrievers, vector store adapters, and LLM interfaces.
- **Gemini API**: Fast, multimodal LLMs and embeddings; great for reasoning and document tasks. We'll use it later for Q&A and summarization.
- **FAISS**: High-performance vector similarity search index. We'll integrate it for semantic search in Module 3.
- **Tesseract OCR**: Extracts text from scanned PDFs and images; often used alongside PyMuPDF to handle both digital and scanned pages.

## Minimal MCP-style pipeline skeleton

We'll model a document pipeline as a sequence of steps. Each step:
- Declares a name
- Reads/writes to a shared `context` object
- Is deterministic and idempotent (given the same inputs, it produces the same outputs), even though it updates `context` in place and appends to the event log

### Steps in this demo:
1) `load_source`: Simulates loading a document (string + metadata)
2) `parse_text`: “Parsing” stage (a no-op here; will be real in Module 2)
3) `chunk_text`: Splits into manageable segments (with overlap)
4) `embed_stub`: Creates toy embeddings (hash trick) for demo
5) `index_stub`: Builds a simple in-memory index and search (cosine similarity)

We'll log each step's inputs/outputs to show observability.

In [4]:
@dataclass
class PipelineContext:
    # Raw source
    source_name: Optional[str] = None
    source_bytes: Optional[bytes] = None
    source_text: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    # Parsed
    text: Optional[str] = None

    # Chunking
    chunks: List[Dict[str, Any]] = field(default_factory=list)  # {id, text, meta}

    # Embeddings
    vectors: Optional[List[List[float]]] = None  # aligned with chunks

    # Index/storage (stub)
    index: Optional[Dict[str, Any]] = None

    # Logs
    events: List[Dict[str, Any]] = field(default_factory=list)


def log_event(ctx: PipelineContext, step: str, detail: Dict[str, Any]):
    ctx.events.append({
        "ts": time.time(),
        "step": step,
        "detail": detail
    })


def run_step(ctx: PipelineContext, name: str, fn: Callable[[PipelineContext], PipelineContext]) -> PipelineContext:
    t0 = time.time()
    ctx = fn(ctx)
    elapsed = round((time.time() - t0) * 1000, 2)
    log_event(ctx, name, {"elapsed_ms": elapsed})
    print(f"[{name}] done in {elapsed} ms")
    return ctx
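
The runner above records timing only. As a hedged sketch of how error capture could be added, the variant below records a status and error message for every step and then re-raises. It uses a plain dictionary as context so that it stands alone; `run_step_safe` is a hypothetical name, not a replacement the later modules rely on.

```python
import time
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]


def run_step_safe(ctx: Context, name: str, fn: Callable[[Context], Context],
                  events: List[Dict[str, Any]]) -> Context:
    """Run one step, recording status, timing, and any error message."""
    status, error = "ok", None
    t0 = time.perf_counter()
    try:
        return fn(ctx)
    except Exception as exc:
        # Record the failure, then re-raise so the pipeline run stops.
        status, error = "error", f"{type(exc).__name__}: {exc}"
        raise
    finally:
        events.append({
            "step": name,
            "status": status,
            "error": error,
            "elapsed_ms": round((time.perf_counter() - t0) * 1000, 2),
        })


def broken(c: Context) -> Context:
    raise ValueError("no text to parse")


events: List[Dict[str, Any]] = []
ctx = run_step_safe({"text": "hello"}, "upper", lambda c: {**c, "text": c["text"].upper()}, events)
try:
    run_step_safe(ctx, "broken", broken, events)
except ValueError:
    pass

for e in events:
    print(e["step"], e["status"], e["error"])
```

Writing the event in `finally` guarantees one log entry per step whether it succeeds or fails.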

### Step 1 — `load_source`
- In real pipelines: read file from disk, cloud storage, or a doc store.
- Here: we inject a short sample text and basic metadata.

In [5]:
def load_source(ctx: PipelineContext) -> PipelineContext:
    sample = (
        "Master Service Agreement (MSA)\n"
        "1. Scope: This Agreement governs the services provided by Vendor X.\n"
        "2. Termination: Either party may terminate with 30 days' notice for convenience.\n"
        "3. Confidentiality: Both parties shall protect confidential information.\n"
        "4. Liability: Vendor X is not liable for indirect damages.\n"
    )
    ctx.source_name = "sample_contract.txt"
    ctx.source_text = sample
    ctx.metadata.update({
        "doc_type": "contract",
        "business_unit": "legal",
        "language": "en",
    })
    log_event(ctx, "load_source", {"source_name": ctx.source_name, "bytes": bool(ctx.source_bytes)})
    return ctx

### Step 2 — `parse_text`
- In Module 2, this will use real parsers (e.g., PyMuPDF for PDFs, OCR for images).
- Here: treat `source_text` as already parsed.

In [6]:
def parse_text(ctx: PipelineContext) -> PipelineContext:
    if ctx.source_text:
        ctx.text = ctx.source_text
    elif ctx.source_bytes:
        # Placeholder: decode if bytes; real implementation would inspect format
        ctx.text = ctx.source_bytes.decode("utf-8", errors="ignore")
    else:
        ctx.text = ""

    log_event(ctx, "parse_text", {"text_len": len(ctx.text or "")})
    return ctx

### Step 3 — `chunk_text`
- Splits long text into overlapping chunks to help later embeddings and retrieval.
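- We keep it simple here: split by lines, merge short lines up to `max_chars`, and carry the last `overlap` characters into the next chunk. The overlap is character-based, so a chunk can start mid-word (see `tect` at the start of chunk-3 in the run output).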

In [7]:
def chunk_text(ctx: PipelineContext, max_chars: int = 120, overlap: int = 30) -> PipelineContext:
    if not ctx.text:
        ctx.chunks = []
        log_event(ctx, "chunk_text", {"chunks": 0})
        return ctx

    lines = [ln.strip() for ln in ctx.text.splitlines() if ln.strip()]
    chunks = []
    buf = ""
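    for ln in lines:
        if len(buf) + 1 + len(ln) <= max_chars:
            buf = (buf + " " + ln).strip()
        else:
            if buf:  # avoid emitting an empty chunk when the first line is already too long
                chunks.append(buf)
            # Keep the last `overlap` characters as context for the next chunk.
            # Guard overlap == 0: buf[-0:] would return the whole buffer.
            tail = buf[-overlap:] if overlap > 0 else ""
            buf = (tail + " " + ln).strip()
    if buf:
        chunks.append(buf)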

    ctx.chunks = [
        {"id": f"chunk-{i}", "text": ch, "meta": {"order": i}} for i, ch in enumerate(chunks)
    ]
    log_event(ctx, "chunk_text", {"chunks": len(ctx.chunks), "max_chars": max_chars, "overlap": overlap})
    return ctx
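
Because the overlap in `chunk_text` is a raw character slice, it can begin in the middle of a word. One possible refinement, shown here as a standalone sketch rather than a change to the pipeline, is to trim the overlap to the next word boundary. `overlap_tail` is a hypothetical helper; it drops the leading partial word, so the returned overlap can be shorter than requested.

```python
def overlap_tail(text: str, overlap: int) -> str:
    """Return roughly the last `overlap` characters of text, starting at a word boundary."""
    if overlap <= 0 or not text:
        return ""
    if overlap >= len(text):
        return text
    tail = text[-overlap:]
    # If the cut landed inside a word, drop the leading partial word.
    if not text[-overlap - 1].isspace() and not tail[0].isspace():
        _, _, tail = tail.partition(" ")
    return tail.strip()


clause = "3. Confidentiality: Both parties shall protect confidential information."
print(repr(clause[-30:]))              # raw slice starts mid-word
print(repr(overlap_tail(clause, 30)))  # trimmed to whole words
```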

### Step 4 — `embed_stub`
- Real embeddings (Module 3) will use `sentence-transformers` or Gemini Embeddings.
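- For Module 1, we'll create a **“hashing trick”** embedding:
  - Split the lowercased text on whitespace (punctuation stays attached, so `termination:` and `termination` are different tokens).
  - Map each token to one of `dim` buckets using a stable hash (MD5).
  - Produce a fixed-length vector of bucket counts, L2-normalized.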
- This is **only for demonstration** and testing the pipeline shape.

In [8]:
def _hash_vector(text: str, dim: int = 64) -> List[float]:
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        vec[h % dim] += 1.0
    # L2 normalize
    norm = math.sqrt(sum(x*x for x in vec)) or 1.0
    return [x / norm for x in vec]

def embed_stub(ctx: PipelineContext, dim: int = 64) -> PipelineContext:
    vectors = []
    for ch in ctx.chunks:
        vectors.append(_hash_vector(ch["text"], dim=dim))
    ctx.vectors = vectors
    log_event(ctx, "embed_stub", {"dim": dim, "vectors": len(vectors)})
    return ctx
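
Since `_hash_vector` splits on whitespace only, a token such as `Termination:` hashes differently from `termination`. The sketch below shows one way to normalize tokens before hashing. It is a standalone variant with hypothetical names (`tokenize`, `hash_vector`), and the regular expression keeps only ASCII letters and digits, which is an assumption that suits this English sample but not multilingual text.

```python
import hashlib
import math
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and keep only runs of ASCII letters/digits, so 'Termination:' -> 'termination'."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hash_vector(text: str, dim: int = 64) -> List[float]:
    """Hashing-trick vector over normalized tokens, L2-normalized."""
    vec = [0.0] * dim
    for token in tokenize(text):
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


print(tokenize("2. Termination: Either party may terminate with 30 days' notice."))
print(hash_vector("Termination:") == hash_vector("termination"))
```

Hash collisions between unrelated tokens remain possible with any fixed `dim`; normalization only removes the punctuation mismatch.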

### Step 5 — `index_stub` and simple search
- In Module 3, we'll integrate **FAISS** for efficient vector search.
- Here: store vectors in memory and compute cosine similarity at query time.

In [9]:
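def _cosine(a: List[float], b: List[float]) -> float:
    # Plain dot product; this equals cosine similarity only because
    # _hash_vector returns L2-normalized vectors.
    return sum(x*y for x, y in zip(a, b))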

def index_stub(ctx: PipelineContext) -> PipelineContext:
    if not ctx.vectors or not ctx.chunks:
        ctx.index = None
        return ctx
    ctx.index = {
        "vectors": ctx.vectors,
        "chunks": ctx.chunks,
        "dim": len(ctx.vectors[0]) if ctx.vectors else 0
    }
    log_event(ctx, "index_stub", {"count": len(ctx.vectors)})
    return ctx

def search_stub(ctx: PipelineContext, query: str, top_k: int = 3) -> List[Dict[str, Any]]:
    if not ctx.index:
        return []
    qv = _hash_vector(query, dim=ctx.index["dim"])
    sims = [(_cosine(qv, v), i) for i, v in enumerate(ctx.index["vectors"])]
    sims.sort(reverse=True)
    results = []
    for score, i in sims[:top_k]:
        results.append({
            "score": float(score),
            "chunk": ctx.index["chunks"][i]
        })
    return results
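
The `_cosine` helper relies on its inputs already being unit length. For vectors that are not normalized, a general version divides by both norms. The sketch below is a standalone illustration; returning `0.0` for an all-zero vector is a convention chosen here, not a mathematical necessity.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b, without assuming unit-length inputs."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 2.0]))  # orthogonal
```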

## Run the pipeline

We'll execute all steps and then run a few queries against the stub index. The stub compares hashed tokens, so it measures lexical overlap rather than true semantic similarity.

In [10]:
ctx = PipelineContext()

for name, fn in [
    ("load_source", load_source),
    ("parse_text", parse_text),
    ("chunk_text", chunk_text),
    ("embed_stub", embed_stub),
    ("index_stub", index_stub),
]:
    ctx = run_step(ctx, name, fn)

print("\n---\n")
print(f"Chunks created: {len(ctx.chunks)}")
for ch in ctx.chunks:
    print(f"- {ch['id']}: {ch['text']}")

[load_source] done in 0.0 ms
[parse_text] done in 0.0 ms
[chunk_text] done in 0.0 ms
[embed_stub] done in 33.5 ms
[index_stub] done in 0.0 ms

---

Chunks created: 4
- chunk-0: Master Service Agreement (MSA) 1. Scope: This Agreement governs the services provided by Vendor X.
- chunk-1: services provided by Vendor X. 2. Termination: Either party may terminate with 30 days' notice for convenience.
- chunk-2: days' notice for convenience. 3. Confidentiality: Both parties shall protect confidential information.
- chunk-3: tect confidential information. 4. Liability: Vendor X is not liable for indirect damages.
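
The runner calls each step as `fn(ctx)`, so `chunk_text` ran with its defaults (`max_chars=120`, `overlap=30`). One common way to pass different parameters without changing the runner is `functools.partial`. The sketch below uses a toy step, `chunk_fixed`, which is hypothetical and much simpler than `chunk_text`, so that it runs on its own.

```python
from functools import partial
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]


def chunk_fixed(ctx: Context, size: int = 10) -> Context:
    """Toy step: cut ctx['text'] into fixed-size pieces (no overlap)."""
    text = ctx.get("text", "")
    return {**ctx, "chunks": [text[i:i + size] for i in range(0, len(text), size)]}


steps: List[Tuple[str, Callable[[Context], Context]]] = [
    ("chunk_default", chunk_fixed),
    ("chunk_small", partial(chunk_fixed, size=4)),  # bind the parameter up front
]

for name, fn in steps:
    out = fn({"text": "abcdefghijkl"})
    print(name, out["chunks"])
```

The same pattern would apply to the pipeline's own steps, for example `partial(chunk_text, max_chars=200)`.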


### Try a query (semantic-ish search on the stub index)
Example queries:
- `termination notice`
- `confidentiality information`
- `indirect damages`

In [11]:
queries = [
    "termination notice",
    "confidentiality information",
    "indirect damages"
]

for q in queries:
    results = search_stub(ctx, q, top_k=2)
    print(f"\nQuery: '{q}'")
    for r in results:
        print(f"  score={r['score']:.3f} | {r['chunk']['id']} | {r['chunk']['text']}")


Query: 'termination notice'
  score=0.548 | chunk-3 | tect confidential information. 4. Liability: Vendor X is not liable for indirect damages.
  score=0.189 | chunk-2 | days' notice for convenience. 3. Confidentiality: Both parties shall protect confidential information.

Query: 'confidentiality information'
  score=0.365 | chunk-3 | tect confidential information. 4. Liability: Vendor X is not liable for indirect damages.
  score=0.162 | chunk-1 | services provided by Vendor X. 2. Termination: Either party may terminate with 30 days' notice for convenience.

Query: 'indirect damages'
  score=0.183 | chunk-3 | tect confidential information. 4. Liability: Vendor X is not liable for indirect damages.
  score=0.162 | chunk-1 | services provided by Vendor X. 2. Termination: Either party may terminate with 30 days' notice for convenience.

Note that the rankings are noisy. `termination notice` ranks chunk-3 first even though the termination clause sits in chunk-1. Two properties of the stub explain this: punctuation stays attached to tokens (`termination:` in the text is not the same token as `termination` in the query), and with only 64 buckets unrelated tokens can collide and inflate a score. Real embeddings in Module 3 remove both effects.


## Observability: Inspect the event log

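The log holds two events per step: the detail event the step writes itself and the timing event added by `run_step`. A timing of `0.0` ms means the step finished within the resolution of `time.time()` on this machine; `time.perf_counter()` gives finer-grained durations. In production, you'd also capture inputs/outputs, trace IDs, and errors, and persist artifacts.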

In [12]:
print("--- Pipeline Event Log ---")
for e in ctx.events:
    print(f"{time.strftime('%H:%M:%S', time.localtime(e['ts']))} | {e['step']:<12} | {e['detail']}")

--- Pipeline Event Log ---
19:29:29 | load_source  | {'source_name': 'sample_contract.txt', 'bytes': False}
19:29:29 | load_source  | {'elapsed_ms': 0.0}
19:29:29 | parse_text   | {'text_len': 312}
19:29:29 | parse_text   | {'elapsed_ms': 0.0}
19:29:29 | chunk_text   | {'chunks': 4, 'max_chars': 120, 'overlap': 30}
19:29:29 | chunk_text   | {'elapsed_ms': 0.0}
19:29:29 | embed_stub   | {'dim': 64, 'vectors': 4}
19:29:29 | embed_stub   | {'elapsed_ms': 33.5}
19:29:29 | index_stub   | {'count': 4}
19:29:29 | index_stub   | {'elapsed_ms': 0.0}
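
To keep the event log beyond the notebook session, one simple option is JSON Lines, with one event per line. The sketch below is a minimal example using only the standard library; the helper names and the sample events (including their `ts` values) are made up for illustration and mirror the shape of `ctx.events`.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable, List


def write_events_jsonl(events: Iterable[Dict[str, Any]], path: str) -> int:
    """Write one JSON object per line; return the number of events written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for event in events:
            fh.write(json.dumps(event, sort_keys=True) + "\n")
            count += 1
    return count


def read_events_jsonl(path: str) -> List[Dict[str, Any]]:
    """Load events back, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Illustrative events with placeholder timestamps.
sample_events = [
    {"ts": 0.0, "step": "load_source", "detail": {"source_name": "sample_contract.txt"}},
    {"ts": 0.1, "step": "parse_text", "detail": {"text_len": 312}},
]

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "events.jsonl")
    written = write_events_jsonl(sample_events, log_path)
    restored = read_events_jsonl(log_path)

print(written, restored == sample_events)
```

This works as long as every value in `detail` is JSON-serializable, which holds for the events logged in this module.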


## Architecture Preview: How this grows in Modules 2 and 3
- **Module 2 (Parsing Large PDFs)**:
  - Replace `load_source`/`parse_text` with PyMuPDF-based PDF loader and text extractor.
  - Add page-level metadata (page number, section headers).
  - Robust chunking with hierarchy and overlaps; support multiple PDFs.
- **Module 3 (Embeddings & Vector Store)**:
  - Replace `embed_stub` with real embeddings (`all-MiniLM-L6-v2` or Gemini Embeddings).
  - Replace `index_stub`/`search_stub` with FAISS index and retrieval.
  - Persist vector DB locally for reuse.

### Orchestration Options:
- Keep the simple runner pattern and add caching/persistence.
- Move to LangChain/LangGraph for tool calls, memory, and branching logic (e.g., OCR fallback if page has no text).
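
The OCR-fallback branch mentioned above can be sketched without any orchestration framework. In this illustration the two extractors are passed in as plain callables and backed by dictionaries, standing in for a PDF text extractor and an OCR engine. The function name, the `min_chars` threshold, and the stand-in data are all assumptions made for the example.

```python
from typing import Callable, Dict


def extract_page_text(
    page_id: int,
    extract_digital: Callable[[int], str],
    extract_ocr: Callable[[int], str],
    min_chars: int = 20,
) -> Dict[str, object]:
    """Use the digital text layer when it has enough text; otherwise fall back to OCR."""
    text = extract_digital(page_id).strip()
    if len(text) >= min_chars:
        return {"page": page_id, "text": text, "method": "digital"}
    return {"page": page_id, "text": extract_ocr(page_id).strip(), "method": "ocr"}


# Stand-ins for real extractors: page 2 has no text layer (a scanned page).
digital_pages = {1: "1. Scope: This Agreement governs the services.", 2: ""}
ocr_pages = {2: "2. Termination: Either party may terminate."}

results = [
    extract_page_text(pid, lambda p: digital_pages.get(p, ""), lambda p: ocr_pages.get(p, ""))
    for pid in (1, 2)
]
for r in results:
    print(r["page"], r["method"], "|", r["text"])
```

Recording the `method` per page keeps the branch observable, which matters when OCR quality needs to be audited later.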

### Notes on Tesseract OCR:
- Use when PDFs are scanned images; run OCR per page and merge with PyMuPDF outputs.
- Validate results with confidence scores; consider language packs and layout-aware OCR for better accuracy.
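
As a rough sketch of confidence-based validation, the function below filters words by a per-word confidence score. It assumes a dictionary with parallel `text` and `conf` lists, which is modeled on the word-level output that OCR wrappers such as `pytesseract` can return. The sample data, the helper name, and the threshold of 60 are made up for illustration and would need tuning on real scans.

```python
from typing import Dict, List, Tuple


def filter_by_confidence(ocr: Dict[str, list], min_conf: float = 60.0) -> Tuple[str, List[str]]:
    """Split OCR words into accepted text and a list of low-confidence words."""
    kept: List[str] = []
    rejected: List[str] = []
    for word, conf in zip(ocr["text"], ocr["conf"]):
        word = str(word).strip()
        if not word:
            continue  # rows without text (layout-only entries) are skipped
        # float() tolerates confidences reported as int, float, or numeric string.
        (kept if float(conf) >= min_conf else rejected).append(word)
    return " ".join(kept), rejected


# Made-up sample: "rnay" is a typical misread of "may" with low confidence.
sample = {
    "text": ["", "Either", "party", "rnay", "terminate"],
    "conf": [-1, 96, 93, 41, 90],
}
text, low = filter_by_confidence(sample)
print(text)
print(low)
```

Rather than discarding low-confidence words outright, a pipeline could flag the page for re-OCR or manual review.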

## Appendix: Minimal LangChain-style conceptual mapping (no API calls)
- `Loader`: converts bytes/files → `Documents` with metadata
- `TextSplitter`: produces chunks with overlap
- `Embeddings`: transforms chunks → vectors
- `VectorStore`: builds index (e.g., FAISS) and provides retriever
- `Chain/Agent`: uses retriever + LLM to answer questions with citations

In Module 3 we'll replace the stub embedding and index with actual LangChain components.
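
The role split in the mapping above can be expressed in plain Python with `typing.Protocol`. The interfaces and toy implementations below are hypothetical and are not LangChain's actual classes or method names. They only illustrate why components behind a shared interface can be swapped without touching the code that wires them together.

```python
from typing import Dict, List, Protocol


class Splitter(Protocol):
    def split(self, text: str) -> List[str]: ...


class Embedder(Protocol):
    def embed(self, texts: List[str]) -> List[List[float]]: ...


class LineSplitter:
    """Toy splitter: one chunk per non-empty line."""
    def split(self, text: str) -> List[str]:
        return [ln.strip() for ln in text.splitlines() if ln.strip()]


class LengthEmbedder:
    """Toy embedder: a single feature per text (its length). Not useful for retrieval."""
    def embed(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(t))] for t in texts]


def build_index(text: str, splitter: Splitter, embedder: Embedder) -> Dict[str, list]:
    """Wire the roles together; any object with matching methods can be passed in."""
    chunks = splitter.split(text)
    return {"chunks": chunks, "vectors": embedder.embed(chunks)}


index = build_index("Scope\n\nTermination clause", LineSplitter(), LengthEmbedder())
print(index)
```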

## End of Module 1
You now have:
- A clear mental model of Document Intelligence and MCP-style flows. ✅
- A small, observable pipeline skeleton ready to be upgraded in Modules 2 and 3. ✅

### Next: Module 2 — Parsing Large PDFs Efficiently
- PyMuPDF extraction
- Smarter chunking with overlaps and hierarchy
- Multi-PDF support and metadata persistence