# **Aadhaar / PAN Form Filler**

This notebook demonstrates how to build an end-to-end pipeline for reading Indian
government-issued identity documents and auto-filling an HTML form using Sarvam AI.

### **Use Case**
Automate KYC / onboarding form-filling for banks, fintechs, and government portals.
1. **Extract:** Use **Sarvam Vision Document Intelligence** to OCR the Aadhaar or PAN card.
2. **Parse:** Use **Sarvam-M** to extract name, date of birth, gender, ID number, and address.
3. **Fill:** Populate `templates/sample_form_template.html` and save `filled_form.html`.

### **Supported Formats**
- Images: `.jpg`, `.jpeg`, `.png`
- Documents: `.pdf`
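
A quick suffix check before upload can catch unsupported files early. The sketch below is illustrative only: `SUPPORTED_SUFFIXES` and `is_supported` are hypothetical names, not part of the Sarvam SDK, and the check looks at the file extension rather than the file contents.

```python
from pathlib import Path

# Hypothetical helper; the set mirrors the formats listed above.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".pdf"}


def is_supported(file_path: str) -> bool:
    """Return True if the file extension is one of the supported formats."""
    return Path(file_path).suffix.lower() in SUPPORTED_SUFFIXES
```

The comparison is case-insensitive, so `card.PNG` passes while `card.tiff` does not.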

---

> **IMPORTANT — Privacy & Legal Disclaimer**
>
> This notebook is for **authorized form-filling workflows only** and is provided for
> **demo and educational purposes**.
>
> - **Do not** store, log, cache, or transmit extracted Aadhaar or PAN data beyond what
>   is strictly necessary to complete the form-filling action.
> - **Comply** with all applicable laws including the **DPDP Act 2023**, UIDAI circulars,
>   and RBI/SEBI KYC guidelines.
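> - The pipeline drops its reference to the parsed data as soon as the form is written. The saved `filled_form.html` still contains the extracted fields, so store and share it with the same care.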
> - The demo card is entirely synthetic and clearly labelled **SPECIMEN — NOT VALID**.

In [None]:
# Minimum versions; the quotes keep the shell from treating ">=" as an output redirect
!pip install -Uqq "sarvamai>=0.1.24" "python-dotenv>=1.0.0" "Pillow>=12.1.1"

### **1. Setup & API Key**

Obtain your API key from the [Sarvam AI Dashboard](https://dashboard.sarvam.ai).
Create a `.env` file in this directory with `SARVAM_API_KEY=your_key_here`, or set the
environment variable directly.
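
If neither a `.env` file nor an exported variable is convenient, one option is to prompt for the key without echoing it. This is a minimal sketch using only the standard library; `resolve_api_key` is a hypothetical helper, and the `env` and `prompt` parameters exist only so the function can be exercised without a terminal.

```python
import os
from getpass import getpass
from typing import Callable, Mapping


def resolve_api_key(
    env: Mapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the key from the environment, or prompt for it without echoing."""
    key = env.get("SARVAM_API_KEY", "").strip()
    if not key:
        # Fall back to an interactive prompt; the typed value is not displayed.
        key = prompt("Enter your Sarvam API key: ").strip()
    if not key:
        raise RuntimeError("No API key provided.")
    return key
```

Prompting keeps the key out of the notebook source and out of saved cell output.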

In [None]:
from __future__ import annotations

import os
import json
import re
import zipfile
import tempfile
from pathlib import Path

from dotenv import load_dotenv
from sarvamai import SarvamAI

load_dotenv()

SARVAM_API_KEY = os.environ.get("SARVAM_API_KEY", "")
if not SARVAM_API_KEY or SARVAM_API_KEY == "YOUR_SARVAM_API_KEY":
    raise RuntimeError(
        "SARVAM_API_KEY is not set. Add it to your .env file or set the environment variable."
    )

client = SarvamAI(api_subscription_key=SARVAM_API_KEY)

print("Client initialised.")

### **2. Step 1 — EXTRACT: Document Intelligence Helper**

`extract_id_text` sends the ID card file to Sarvam Vision Document Intelligence and returns
the extracted text as a Markdown string.

The API uses an async job workflow: create → upload → start → wait → download (ZIP) → unzip.

> **Note:** The API accepts `.pdf` or `.zip` only. PNG/JPG images are automatically wrapped
> in a ZIP before upload.
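
The ZIP handling at both ends of the workflow can be tried locally without an API key. The sketch below wraps a file in a flat ZIP and reads the first `.md` member back out. `wrap_in_zip` and `read_first_markdown` are hypothetical helper names, and the stand-in Markdown file only imitates what a result archive might contain.

```python
import tempfile
import zipfile
from pathlib import Path


def wrap_in_zip(file_path: str, zip_path: str) -> str:
    """Write a single file into a flat ZIP (no directory structure)."""
    src = Path(file_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(src, arcname=src.name)  # arcname drops the parent folders
    return zip_path


def read_first_markdown(zip_path: str) -> str:
    """Return the text of the first .md member in a ZIP archive."""
    with zipfile.ZipFile(zip_path) as zf:
        md_names = [n for n in zf.namelist() if n.endswith(".md")]
        if not md_names:
            raise RuntimeError(f"No .md file in archive: {zf.namelist()}")
        return zf.read(md_names[0]).decode("utf-8")


# Local round trip with a stand-in Markdown file; no API call is made.
with tempfile.TemporaryDirectory() as tmp_dir:
    md_file = Path(tmp_dir) / "nested" / "result.md"
    md_file.parent.mkdir()
    md_file.write_text("# Sample output", encoding="utf-8")
    archive = wrap_in_zip(str(md_file), str(Path(tmp_dir) / "upload.zip"))
    with zipfile.ZipFile(archive) as zf:
        member_names = zf.namelist()
    round_trip_text = read_first_markdown(archive)
```

Passing `arcname=src.name` is what keeps the archive flat: the member is stored as `result.md` rather than under its original folder path.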

In [None]:
_IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png'}


def extract_id_text(file_path: str) -> str:
    """Extract text from an Aadhaar or PAN card image or PDF using Sarvam Document Intelligence.

    Images (.jpg, .png) are automatically wrapped in a ZIP archive before upload,
    as the API only accepts PDF or ZIP files directly.
    """
    path = Path(file_path)
    upload_path = file_path
    tmp_zip: str | None = None

    if path.suffix.lower() in _IMAGE_EXTENSIONS:
        # Wrap image in a flat ZIP — required by the Document Intelligence API
        with tempfile.NamedTemporaryFile(suffix='.zip', delete=False) as tmp:
            tmp_zip = tmp.name
        with zipfile.ZipFile(tmp_zip, 'w', zipfile.ZIP_DEFLATED) as zf:
            zf.write(file_path, arcname=path.name)
        upload_path = tmp_zip

    try:
        job = client.document_intelligence.create_job(
            language="en-IN",
            output_format="md"
        )
        job.upload_file(upload_path)
        job.start()

        status = job.wait_until_complete()
        if status.job_state != "Completed":
            raise RuntimeError(
                f"Document Intelligence job ended with state: {status.job_state}. "
                f"Details: {status}"
            )

        with tempfile.NamedTemporaryFile(suffix='.zip', delete=False) as tmp:
            out_zip = tmp.name

        try:
            job.download_output(out_zip)
            with zipfile.ZipFile(out_zip, 'r') as zf:
                md_files = [f for f in zf.namelist() if f.endswith('.md')]
                if not md_files:
                    raise RuntimeError(
                        "No markdown output found in Document Intelligence result. "
                        f"ZIP contents: {zf.namelist()}"
                    )
                with zf.open(md_files[0]) as f:
                    return f.read().decode('utf-8')
        finally:
            os.unlink(out_zip)

    finally:
        if tmp_zip:
            os.unlink(tmp_zip)


print("extract_id_text defined.")

### **3. Step 2 — PARSE: Structured JSON Extraction**

`parse_id_card` sends the raw OCR markdown to **Sarvam-M** and returns the decoded JSON as a Python dict. Field values are not checked beyond JSON decoding.

A `confidence` score below **0.85** triggers a warning. Flag those documents for manual review
before acting on the extracted data.
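
Because the confidence score is self-reported by the model, a few format checks on the parsed fields can serve as a second line of defence. The sketch below is one possible approach, not part of the notebook's pipeline: `validate_parsed` is a hypothetical helper, and it checks only the shape of each value (12 digits for Aadhaar, five letters, four digits and one letter for PAN, a real calendar date for `dob`), not whether the ID was ever issued.

```python
import re
from datetime import datetime

# Format checks only: these confirm shape, not that the ID is genuine.
_AADHAAR_RE = re.compile(r"\d{12}")
_PAN_RE = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")


def validate_parsed(parsed: dict, min_confidence: float = 0.85) -> list[str]:
    """Return a list of problems found; an empty list means no issues."""
    problems: list[str] = []
    doc_type = parsed.get("document_type")
    # Ignore the spaces that usually separate Aadhaar digit groups.
    id_number = re.sub(r"\s+", "", str(parsed.get("id_number") or ""))

    if doc_type == "aadhaar":
        if not _AADHAAR_RE.fullmatch(id_number):
            problems.append("Aadhaar number is not 12 digits")
    elif doc_type == "pan":
        if not _PAN_RE.fullmatch(id_number.upper()):
            problems.append("PAN does not match AAAAA9999A")
    else:
        problems.append(f"unexpected document_type: {doc_type!r}")

    dob = parsed.get("dob")
    if dob is not None:
        try:
            datetime.strptime(dob, "%d-%m-%Y")
        except (TypeError, ValueError):
            problems.append("dob is not a valid DD-MM-YYYY date")

    confidence = parsed.get("confidence")
    if not isinstance(confidence, (int, float)) or confidence < min_confidence:
        problems.append("confidence missing or below threshold")

    return problems
```

A non-empty result could be used to route the document to manual review alongside the confidence warning.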

In [None]:
PARSE_SYSTEM_PROMPT = """You are a precise Indian identity document data extractor. Extract the following fields from the document text and return ONLY valid JSON with no other text, no markdown fences, no explanation.

Required JSON schema:
{
  "document_type": "aadhaar" or "pan",
  "name": "<string or null>",
  "dob": "<DD-MM-YYYY format or null>",
  "gender": "<Male | Female | Other | null>",
  "id_number": "<string or null — Aadhaar: 12 digits with spaces; PAN: 10 alphanumeric chars>",
  "address": "<string or null — Aadhaar only; always null for PAN>",
  "language_detected": "<primary language on the document, e.g. English, Hindi, Tamil>",
  "confidence": <float between 0.0 and 1.0>
}

Rules:
- Use null (not "null") for fields not present in the document
- For PAN cards, address is always null
- confidence reflects how completely all fields could be read (1.0 = perfect, 0.0 = unreadable)
- Return ONLY the JSON object"""


def parse_id_card(raw_text: str) -> dict:
    """Parse raw OCR text from an ID card into structured JSON using Sarvam-M."""
    response = client.chat.completions(
        messages=[
            {"role": "system", "content": PARSE_SYSTEM_PROMPT},
            {"role": "user", "content": f"Extract data from this identity document:\n\n{raw_text}"}
        ]
    )

    if not response or not response.choices:
        raise ValueError("Sarvam-M returned no response. Check your API quota.")

    content = response.choices[0].message.content
    if content is None:
        raise ValueError("Sarvam-M returned an empty message content.")

    raw_json = content.strip()
    # Strip markdown code fences if the model wraps output anyway
    raw_json = re.sub(r'^```(?:json)?\s*|\s*```$', '', raw_json, flags=re.DOTALL).strip()

    try:
        parsed = json.loads(raw_json)
    except json.JSONDecodeError:
        print(f"ERROR: Could not parse JSON from model response:\n{raw_json}")
        raise

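    # Treat a missing or non-numeric confidence as 0.0 so the document is flagged for review.
    try:
        confidence = float(parsed.get("confidence"))
    except (TypeError, ValueError):
        confidence = 0.0
    if confidence < 0.85: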
        print(
            f"WARNING: Low confidence ({confidence:.2f}) — review the extracted data manually "
            "before using it to fill any form."
        )

    return parsed


print("parse_id_card defined.")

### **4. Step 3 — FILL: HTML Form Population**

`fill_form` replaces `{{placeholder}}` tokens in the HTML template with extracted values
and writes the result to `filled_form.html`.

After the file is saved, the caller drops its reference to the parsed dict with `del`, so the
extracted personal data is not kept in a variable beyond the form-writing step. The saved HTML file still contains those values.
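
To make the placeholder mechanism concrete, here is a minimal sketch run against an inline template. The `TEMPLATE` string and the `render` helper are hypothetical; the real template lives in `templates/sample_form_template.html` and may be laid out differently. Unlike a chain of `str.replace` calls, this version substitutes every token in a single regex pass, HTML-escapes each value, and turns unknown tokens into empty strings.

```python
import re
from html import escape

# Hypothetical inline template, used only for illustration.
TEMPLATE = '<input name="name" value="{{name}}"><textarea>{{address}}</textarea>'


def render(template: str, values: dict) -> str:
    """Replace {{key}} tokens with HTML-escaped values in one pass.

    Missing keys and None values become empty strings.
    """
    def substitute(match: re.Match) -> str:
        value = values.get(match.group(1))
        return escape("" if value is None else str(value))

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


rendered = render(TEMPLATE, {"name": 'A "B" <C>', "address": None})
```

Escaping matters here because the values come from OCR output: a stray quote or angle bracket would otherwise break the `value="..."` attribute or inject markup into the form.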

In [None]:
def fill_form(
    parsed: dict,
    template_path: str = "templates/sample_form_template.html",
    output_path: str = "filled_form.html",
) -> str:
    """Fill an HTML form template with extracted ID card data and save the output.

    Placeholder tokens in the template (e.g. {{name}}) are replaced with values
    from the parsed dict. The caller should delete the parsed dict from memory
    immediately after this function returns to minimise the time PII is held in
    the Python process.
    """
    with open(template_path, encoding="utf-8") as f:
        html = f.read()

    doc_type = parsed.get("document_type") or ""
    placeholders = {
        "{{document_type}}":       doc_type.upper(),
        "{{document_type_lower}}": doc_type.lower(),
        "{{name}}":                parsed.get("name") or "",
        "{{dob}}":                 parsed.get("dob") or "",
        "{{gender}}":              parsed.get("gender") or "",
        "{{id_number}}":           parsed.get("id_number") or "",
        "{{address}}":             parsed.get("address") or "",
    }

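    # Escape values so OCR text containing quotes or angle brackets cannot break the markup.
    from html import escape  # local import; `html` already names the template string here

    for placeholder, value in placeholders.items():
        html = html.replace(placeholder, escape(str(value)))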

    Path(output_path).write_text(html, encoding="utf-8")
    print(f"Filled form saved to: {output_path}")
    return output_path


print("fill_form defined.")

### **5. End-to-End Pipeline**

`process_id_card` ties all three steps together. It extracts, parses, fills the form, then
drops its reference to the parsed data with `del`.
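
If a summary of the parsed data is printed or shown in notebook output, one option is to mask the ID number first so the full value never reaches saved output. The sketch below assumes a hypothetical `mask_id` helper that keeps the last four alphanumeric characters visible and preserves spacing.

```python
from typing import Optional


def mask_id(id_number: Optional[str], visible: int = 4) -> str:
    """Mask all but the last `visible` alphanumeric characters, keeping spaces."""
    if not id_number:
        return ""
    total = sum(ch.isalnum() for ch in id_number)
    to_mask = max(total - visible, 0)
    out = []
    for ch in id_number:
        if ch.isalnum() and to_mask > 0:
            out.append("X")
            to_mask -= 1
        else:
            out.append(ch)  # separators and the visible tail pass through unchanged
    return "".join(out)
```

For the synthetic demo number, `mask_id("1234 5678 9012")` yields `XXXX XXXX 9012`.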

In [None]:
def process_id_card(
    file_path: str,
    template_path: str = "templates/sample_form_template.html",
    output_path: str = "filled_form.html",
) -> str | None:
    """End-to-end pipeline: extract -> parse -> fill for a single ID card file.

    Args:
        file_path:     Path to the ID card image (.jpg, .png) or PDF (.pdf).
        template_path: Path to the HTML form template.
        output_path:   Path for the filled HTML output file.

    Returns:
        Path to the filled HTML file, or None if processing failed.
    """
    print(f"Processing: {file_path}")
    try:
        print("  Step 1/3 — Extracting text via Document Intelligence...")
        raw_text = extract_id_text(file_path)

        print("  Step 2/3 — Parsing structured data with Sarvam-M...")
        parsed_data = parse_id_card(raw_text)

        print("  Step 3/3 — Filling HTML form template...")
        result_path = fill_form(parsed_data, template_path, output_path)

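        # Summary line: show only the last four characters of the ID, and tolerate
        # null fields so a partial parse does not crash the summary.
        id_number = str(parsed_data.get("id_number") or "")
        masked_id = "X" * max(len(id_number) - 4, 0) + id_number[-4:]
        print(
            f"\n  Type: {(parsed_data.get('document_type') or 'unknown').upper()} | "
            f"Name: {parsed_data.get('name')} | "
            f"ID: {masked_id} | "
            f"Confidence: {float(parsed_data.get('confidence') or 0):.2f}"
        )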

        # Privacy: clear extracted PII from memory immediately after the form is written.
        # Do not log, store, or transmit this data further.
        del parsed_data
        return result_path

    except Exception as e:
        print(f"ERROR: Failed to process {file_path}: {e}")
        return None


print("process_id_card defined.")

### **6. Demo — Run the Pipeline**

The next cell generates a synthetic **SPECIMEN / NOT VALID** Aadhaar card using Pillow (no real card
required), then runs the full pipeline.

All data is fabricated:
- Aadhaar number: `1234 5678 9012` (not a real number)
- Name: Arjun Sharma | DOB: 01-01-1990 | Gender: Male
- Address: 42 MG Road, Bengaluru - 560001

In [None]:
from PIL import Image, ImageDraw, ImageFont


def _load_font(size: int) -> ImageFont.FreeTypeFont:
    """Load a TrueType font with cross-platform fallbacks."""
    candidates = [
        "/System/Library/Fonts/Helvetica.ttc",               # macOS
        "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",   # Linux (Debian/Ubuntu)
        "/usr/share/fonts/dejavu/DejaVuSans.ttf",            # Linux (Fedora/RHEL)
        "C:/Windows/Fonts/Arial.ttf",                        # Windows
    ]
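    for path in candidates:
        try:
            return ImageFont.truetype(path, size)
        except OSError:  # font file missing or unreadable
            continue
    # Pillow >= 10.1 accepts a size here and returns a scalable font when FreeType is available.
    return ImageFont.load_default(size)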


def _create_sample_aadhaar(output_path: str = "sample_data/sample_aadhaar.png") -> str:
    """Create a synthetic SPECIMEN Aadhaar card image for demo purposes.

    All data is entirely fabricated. The card is clearly labelled SPECIMEN and is
    not valid for any official purpose.
    """
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)

    W, H = 856, 540
    img  = Image.new("RGB", (W, H), color=(240, 244, 255))
    draw = ImageDraw.Draw(img)

    font_hdr   = _load_font(20)
    font_title = _load_font(14)
    font_body  = _load_font(17)
    font_num   = _load_font(26)
    font_small = _load_font(12)
    font_spec  = _load_font(64)

    # Header bar
    draw.rectangle([(0, 0), (W, 72)], fill=(0, 82, 165))
    draw.text((W // 2, 22), "Government of India",
              font=font_hdr,   fill="white",         anchor="mm")
    draw.text((W // 2, 50), "Unique Identification Authority of India",
              font=font_title, fill=(200, 220, 255), anchor="mm")
    draw.rectangle([(0, 72), (W, 82)], fill=(255, 140, 0))

    # Photo placeholder
    draw.rectangle([(30, 100), (190, 270)],
                   outline=(100, 100, 180), width=2, fill=(220, 225, 245))
    draw.text((110, 185), "Photo", font=font_small, fill=(100, 100, 180), anchor="mm")

    # Personal details
    x, y = 215, 105
    draw.text((x, y),       "Arjun Sharma",           font=font_body,  fill=(20, 20, 80))
    draw.text((x, y + 34),  "DOB: 01/01/1990",        font=font_small, fill=(60, 60, 120))
    draw.text((x, y + 58),  "Male",                   font=font_small, fill=(60, 60, 120))
    draw.text((x, y + 90),  "42 MG Road,",            font=font_small, fill=(60, 60, 120))
    draw.text((x, y + 112), "Bengaluru - 560001,",    font=font_small, fill=(60, 60, 120))
    draw.text((x, y + 134), "Karnataka",              font=font_small, fill=(60, 60, 120))

    # QR code placeholder
    draw.rectangle([(W - 175, 95), (W - 30, 255)],
                   outline=(80, 80, 80), width=2, fill=(245, 245, 245))
    draw.text((W - 103, 175), "QR", font=font_hdr, fill=(80, 80, 80), anchor="mm")

    # Aadhaar number (fake — 1234 5678 9012)
    draw.line([(30, 280), (W - 30, 280)], fill=(180, 190, 230), width=1)
    draw.text((W // 2, 310), "1234  5678  9012",
              font=font_num, fill=(0, 82, 165), anchor="mm")

    # Footer bar
    draw.rectangle([(0, H - 50), (W, H)], fill=(0, 82, 165))
    draw.text((W // 2, H - 25), "mAadhaar  |  aadhaar.gov.in  |  1947",
              font=font_small, fill="white", anchor="mm")

    # SPECIMEN / NOT VALID watermark via RGBA overlay
    overlay = Image.new("RGBA", (W, H), (0, 0, 0, 0))
    ov_draw = ImageDraw.Draw(overlay)
    ov_draw.text((W // 2, H // 2 - 20), "SPECIMEN",
                 font=font_spec, fill=(200, 20, 20, 100), anchor="mm")
    ov_draw.text((W // 2, H // 2 + 52), "NOT VALID",
                 font=_load_font(32), fill=(200, 20, 20, 100), anchor="mm")
    overlay = overlay.rotate(30, expand=False)

    img = Image.alpha_composite(img.convert("RGBA"), overlay).convert("RGB")
    img.save(output_path)
    print(f"Sample SPECIMEN Aadhaar card created: {output_path}")
    return output_path


# --- Run the demo ---
sample_path = _create_sample_aadhaar()
filled_path = process_id_card(
    file_path=sample_path,
    template_path="templates/sample_form_template.html",
    output_path="filled_form.html",
)

### **7. Results**

Preview the filled form and download the generated HTML file.

In [None]:
from IPython.display import FileLink, IFrame, display

if filled_path:
    print("Form filled successfully.")
    print("\nDownload the filled form:")
    display(FileLink(filled_path, result_html_prefix="Click to download: "))
    print("\nInline preview:")
    display(IFrame(src=filled_path, width="100%", height="500"))
else:
    print("Processing failed. Check the error messages above.")

### **8. Error Reference**

| Error Code | HTTP Status | Cause | Solution |
| :--- | :--- | :--- | :--- |
| `invalid_api_key_error` | 403 | Invalid API key | Verify your key at [dashboard.sarvam.ai](https://dashboard.sarvam.ai). |
| `insufficient_quota_error` | 429 | Quota exceeded | Check your usage limits. |
| `internal_server_error` | 500 | Server-side issue | Wait and retry the request. |
| Job state not `Completed` | — | Doc Intelligence failure | Check file format; the API accepts PDF or ZIP, and `extract_id_text` wraps images in a ZIP. |
| `JSONDecodeError` | — | Sarvam-M returned non-JSON | Usually transient; re-run the cell. |
| `RuntimeError: SARVAM_API_KEY is not set` | — | Missing API key | Add key to `.env` file. |

### **9. Conclusion & Resources**

This recipe shows how to chain **Sarvam Vision** and **Sarvam-M** into a privacy-conscious
KYC automation workflow: it reads Indian identity documents and populates a standardised
HTML form with a single pipeline call. The job in this notebook is created with
`language="en-IN"`; documents in other languages would need a different language setting.

* [Sarvam AI Docs](https://docs.sarvam.ai)
* [Document Intelligence API](https://docs.sarvam.ai/api-reference-docs/document-intelligence)
* [Sarvam-M Chat API](https://docs.sarvam.ai/api-reference-docs/chat)
* [DPDP Act 2023](https://www.meity.gov.in/data-protection-framework)
* [UIDAI](https://uidai.gov.in)

> **Reminder:** This notebook is for demo and educational purposes only.
> Always comply with DPDP Act 2023, UIDAI guidelines, and applicable KYC regulations
> before deploying any ID-data pipeline in production.

**Keep Building!**