# Atopic Eczema VLM Extraction — RAG Pipeline (Gemma 3 27B IT)

This notebook prototypes the full pipeline:
1) Load RAG cards (fields, policies, abbrev, ranges, meds lexicon)  
2) Candidate extraction (get page tokens from image)  
3) RAG context assembly  
4) Build prompts and call **google/gemma-3-27b-it** (Hugging Face)  
5) Validate + compute confidences  
6) Merge into a patient JSON

> **Note:** You must accept the Gemma 3 license on Hugging Face and set `HF_TOKEN` in your environment to pull weights.


## 0. Environment & Installs

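- Make sure you have a suitable GPU (the 27B model in bf16 realistically needs an A100 / H100-class card).
- Install a `transformers` release with Gemma 3 support, plus `accelerate`. The install cell below upgrades to the latest versions; pin them once the setup works.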
- Login to Hugging Face or set an access token.
- Accept the model license on its model card.

**References:**
- Hugging Face blog guide for Gemma 3 (inference & API usage).


In [1]:
!pip install -U transformers accelerate torch torchvision pillow


## 1. Imports — Local Modules

In [4]:
from pathlib import Path
from rag_store import RAGPaths, RAGStore, ContextAssembler, fields_for_section
from candidate_extractor import CandidateExtractor, make_gemma_runner


## 2. Load RAG Cards
Point to your cards directory (where we placed `field_cards.jsonl`, `policy/`, `abbr/`, `range/`, `lexicon/`).

In [5]:
CARDS_DIR = Path("/home/rijul/Gitlaboratory/Context_Engineering_LLM/cards")  # <-- update if different
store = RAGStore(RAGPaths.from_base(CARDS_DIR)).load()

print("Fields loaded:", len(store.fields_by_name))
print("Policies:", list(store.policy.keys()))
print("Abbr:", list(store.abbr.keys()))
print("Ranges:", list(store.ranges.keys()))
print("Lexicons:", list(store.lexicons.keys()))


Fields loaded: 128
Policies: ['policy/units:v1', 'policy/notation:v1', 'policy/date:v1']
Abbr: ['abbr/dermatology:core:v1']
Ranges: ['range/labs:v1', 'range/scorad:v1', 'range/anthro:v1']
Lexicons: ['lexicon/meds_observed:v1']


## 3. Candidate Extraction (VLM-assisted, no OCR)
Given a form page image, ask the VLM to list headings/labels/short snippets likely to be variable names or medications.
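A VLM reply is free text, so it helps to normalize it before using it as candidate tokens. The following is a minimal sketch, assuming the model is asked for a JSON array of strings and sometimes answers with a bulleted list instead; `parse_candidate_tokens` is a hypothetical helper and is not part of `candidate_extractor`.

```python
import json
import re


def parse_candidate_tokens(reply: str, max_tokens: int = 50) -> list[str]:
    """Turn a VLM reply into a clean, de-duplicated list of lowercase tokens."""
    items = None
    # Prefer a JSON array if the reply contains one.
    match = re.search(r"\[.*\]", reply, flags=re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, list):
                items = parsed
        except json.JSONDecodeError:
            items = None
    # Otherwise fall back to splitting on newlines and commas.
    if items is None:
        items = re.split(r"[\n,]+", reply)

    seen, tokens = set(), []
    for item in items:
        # Collapse whitespace, drop bullet/quote characters, lowercase.
        tok = re.sub(r"\s+", " ", str(item)).strip(" \t-*\"'").lower()
        if tok and tok not in seen:
            seen.add(tok)
            tokens.append(tok)
    return tokens[:max_tokens]


example = parse_candidate_tokens('["Symptoms", "Duration", "Tacroz 0.1% oint BD", "symptoms"]')
print(example)
```

Lowercasing and de-duplicating keeps the token list short, which matters because the tokens are later used to select context for the prompt.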

In [8]:
!pip install pdf2image

Collecting pdf2image
  Downloading pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Downloading pdf2image-1.17.0-py3-none-any.whl (11 kB)
Installing collected packages: pdf2image
Successfully installed pdf2image-1.17.0


In [9]:
# === 3. Candidate Extraction (VLM-assisted, no OCR) ===
# Convert PDFs -> images IN ORDER and extract candidate tokens page-by-page.

from pathlib import Path
from typing import Iterator, Tuple
import os

# ---- Configure your folders ----
PDF_INPUT_DIR = "/home/rijul/Academic/Atopic Eczema/cropped"   # source PDFs
IMG_OUTPUT_DIR = "/home/rijul/Academic/Atopic Eczema/images"   # rendered images go here
DPI = 200
MAX_PAGES = None  # set e.g. 10 while testing to avoid running all pages at once

# ---- Reuse (or define) the converter ----
try:
    # If a previous cell already defined pdfs_to_images_in_series, this name lookup succeeds
    pdfs_to_images_in_series
except NameError:
    # Lightweight definition here so this cell works standalone
    from pdf2image import convert_from_path, pdfinfo_from_path

    def pdfs_to_images_in_series(pdf_dir: str, out_dir: str, dpi: int = 200) -> Iterator[Tuple[str, int, str]]:
        """
        Convert PDFs to images in deterministic order (sorted by filename, pages ascending).
        Caches by filename: a page is rendered only if its PNG does not exist yet.
        Yields (pdf_file_path, page_index_1based, image_path).
        """
        pdf_dir = Path(pdf_dir)
        out_dir = Path(out_dir)
        out_dir.mkdir(parents=True, exist_ok=True)

        for pdf_file in sorted(pdf_dir.glob("*.pdf")):
            patient_id = pdf_file.stem
            # Read the page count without rendering, then render only the missing pages
            # (If you prefer always re-rendering, remove the cache check)
            n_pages = pdfinfo_from_path(str(pdf_file))["Pages"]
            for i in range(1, n_pages + 1):
                img_path = out_dir / f"{patient_id}_page{i}.png"
                if not img_path.exists():
                    page = convert_from_path(str(pdf_file), dpi=dpi, first_page=i, last_page=i)[0]
                    page.save(img_path, "PNG")
                yield str(pdf_file), i, str(img_path)

# ---- Run the candidate extraction, page-by-page ----
extractor = CandidateExtractor()  # uses stub unless you passed a Gemma runner

PAGE_BATCH = []  # will be used in later cells
count = 0

for pdf_path, page_idx, img_path in pdfs_to_images_in_series(PDF_INPUT_DIR, IMG_OUTPUT_DIR, dpi=DPI):
    tokens = extractor.extract_candidates(img_path)
    PAGE_BATCH.append({
        "pdf": pdf_path,
        "page": page_idx,
        "image": img_path,
        "tokens": tokens,
    })
    print(f"[OK] {os.path.basename(pdf_path)} :: page {page_idx} -> {os.path.basename(img_path)}")
    print("     tokens:", tokens[:12], ("..." if len(tokens) > 12 else ""))
    count += 1
    if MAX_PAGES and count >= MAX_PAGES:
        break

print(f"\nTotal pages prepared: {len(PAGE_BATCH)}")
print("Example entry:", PAGE_BATCH[0] if PAGE_BATCH else "No pages found")


[OK] 1050.pdf :: page 1 -> 1050_page1.png
     tokens: ['symptoms', 'duration', 'tacroz 0.1% oint bd', 'xyzal tab'] 
[OK] 1050.pdf :: page 2 -> 1050_page2.png
     tokens: ['symptoms', 'duration', 'tacroz 0.1% oint bd', 'xyzal tab'] 
[OK] 1050.pdf :: page 3 -> 1050_page3.png
     tokens: ['symptoms', 'duration', 'tacroz 0.1% oint bd', 'xyzal tab'] 
[OK] 1050.pdf :: page 4 -> 1050_page4.png
     tokens: ['symptoms', 'duration', 'tacroz 0.1% oint bd', 'xyzal tab'] 
[OK] 1050.pdf :: page 5 -> 1050_page5.png
     tokens: ['symptoms', 'duration', 'tacroz 0.1% oint bd', 'xyzal tab'] 
[OK] 1050.pdf :: page 6 -> 1050_page6.png
     tokens: ['symptoms', 'duration', 'tacroz 0.1% oint bd', 'xyzal tab'] 
[OK] 1619.pdf :: page 1 -> 1619_page1.png
     tokens: ['symptoms', 'duration', 'tacroz 0.1% oint bd', 'xyzal tab'] 
[OK] 1619.pdf :: page 2 -> 1619_page2.png
     tokens: ['symptoms', 'duration', 'tacroz 0.1% oint bd', 'xyzal tab'] 
[OK] 1619.pdf :: page 3 -> 1619_page3.png
     tokens: ['symptom

## 4. Assemble RAG Context
Select a section (e.g., `history`, `scorad`, `investigations`, `followups`) and build the compact context payload.
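To make the idea of a "compact" payload concrete, here is a toy sketch of token-driven selection: only lexicon entries that actually appear in the page tokens are kept. The dictionary layout and `select_lexicon_hits` are assumptions for illustration; the real `ContextAssembler` and card schema may work differently.

```python
def select_lexicon_hits(page_tokens, lexicon):
    """Return the lexicon entries whose name occurs as a word in any page token."""
    hits = {}
    for token in page_tokens:
        words = token.lower().split()
        for name, generic in lexicon.items():
            if name.lower() in words:
                hits[name] = generic
    return hits


# Toy lexicon (hypothetical structure): brand name -> generic name
lexicon = {
    "tacroz": "tacrolimus",
    "xyzal": "levocetirizine",
    "atarax": "hydroxyzine",
}
page_tokens = ["symptoms", "duration", "tacroz 0.1% oint bd", "xyzal tab"]

hits = select_lexicon_hits(page_tokens, lexicon)
print(hits)
```

Entries that never occur on the page (here `atarax`) stay out of the prompt, which keeps the context small.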

In [None]:
assembler = ContextAssembler(store)
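target_fields = fields_for_section("history")  # change to other sections as needed

# Use the candidate tokens of one page from Section 3 (here: the first page)
page_tokens = PAGE_BATCH[0]["tokens"]

ctx = assembler.build_context(target_fields, page_tokens=page_tokens)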
chunks = assembler.to_prompt_chunks(ctx)

for i, ch in enumerate(chunks, 1):
    print(f"=== Chunk {i} ===\n{ch[:600]}\n")


## 5. Load Gemma 3 27B IT (Hugging Face)

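Use `transformers` **pipeline** for simple VLM calls (image + text → text).  
Make sure you've accepted the model license on the model card: `google/gemma-3-27b-it`.  
The cells below download and load the smaller `google/gemma-3-4b-it` checkpoint for local testing; swap `model_id` for the 27B checkpoint once the pipeline works.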


In [10]:
!pip install bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.47.0-py3-none-manylinux_2_24_x86_64.whl.metadata (11 kB)
Downloading bitsandbytes-0.47.0-py3-none-manylinux_2_24_x86_64.whl (61.3 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.47.0


In [19]:
!pip install huggingface_hub
from huggingface_hub import login



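In [34]:
import os
from getpass import getpass

# Never hard-code an access token in a notebook; read it from the environment or prompt for it
if not os.getenv("HF_TOKEN"):
    os.environ["HF_TOKEN"] = getpass("Hugging Face token: ")

In [35]:
import os
# Report only whether the token is set; do not print the secret itself
print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))

HF_TOKEN set: True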


In [40]:
import os

# ✅ make sure token is actually present
os.environ["HF_TOKEN"] = os.getenv("HF_TOKEN") or "hf_your_token_here"

# 🚫 turn off Xet CAS (avoids those TLS EOF retries)
os.environ["HF_HUB_ENABLE_XET"] = "0"

# Optional: try the standard downloader (set to 0); you can try "1" later if you want rust/hf-transfer
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"

# Optional hardening
os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"
os.environ["HF_HUB_TIMEOUT"] = "120"

# If you have proxies set in your shell and you don't need them for HF:
os.environ.pop("HTTP_PROXY", None)
os.environ.pop("HTTPS_PROXY", None)
os.environ.pop("http_proxy", None)
os.environ.pop("https_proxy", None)


In [41]:
from huggingface_hub import snapshot_download
import os

model_id = "google/gemma-3-4b-it"
token = os.getenv("HF_TOKEN")

# Minimal allow_patterns; broaden if needed
allow = ["*.json", "*.bin", "*.safetensors", "*.model", "*processor*", "*tokenizer*"]

repo_dir = snapshot_download(
    repo_id=model_id,
    token=token,
    allow_patterns=allow,
    resume_download=True,
    max_workers=1,           # single-threaded -> fewer transient issues
)

print("Local model dir:", repo_dir)

Fetching 13 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [01:48<00:00,  8.35s/it]

Local model dir: /home/rijul/.cache/huggingface/hub/models--google--gemma-3-4b-it/snapshots/093f9f388b31de276ce2de164bdc2081324b9767





In [42]:
from pathlib import Path

repo_dir = Path("/home/rijul/.cache/huggingface/hub/models--google--gemma-3-4b-it/snapshots/093f9f388b31de276ce2de164bdc2081324b9767")
print("Exists:", repo_dir.exists())
print("\nSome files:")
for p in sorted(repo_dir.glob("*"))[:20]:
    print("-", p.name)

Exists: True

Some files:
- added_tokens.json
- chat_template.json
- config.json
- generation_config.json
- model-00001-of-00002.safetensors
- model-00002-of-00002.safetensors
- model.safetensors.index.json
- preprocessor_config.json
- processor_config.json
- special_tokens_map.json
- tokenizer.json
- tokenizer.model
- tokenizer_config.json


In [2]:
import torch
from transformers import (
    AutoTokenizer,
    AutoImageProcessor,
    AutoModelForImageTextToText,
    BitsAndBytesConfig,
)

repo_dir = "/home/rijul/.cache/huggingface/hub/models--google--gemma-3-4b-it/snapshots/093f9f388b31de276ce2de164bdc2081324b9767"
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

tokenizer = AutoTokenizer.from_pretrained(repo_dir, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(repo_dir, trust_remote_code=True)

# Pass quantization settings via BitsAndBytesConfig; the bare load_in_4bit argument is deprecated
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=dtype)

model = AutoModelForImageTextToText.from_pretrained(
    repo_dir,
    device_map="auto",
    quantization_config=quant_config,
    dtype=dtype,
    trust_remote_code=True,
)

print("✅ Gemma 3 4B IT loaded locally in 4-bit.")


  from .autonotebook import tqdm as notebook_tqdm
  return torch._C._cuda_getDeviceCount() > 0
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
The 8-bit optimizer is not available on your device, only available on CUDA for now.
Loading checkpoint shards: 100%|██████████████████| 2/2 [02:53<00:00, 86.73s/it]

✅ Gemma 3 4B IT loaded locally in 4-bit.





In [7]:
# Simple text-only inference
prompt = "Hello what are you capable of?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=1000)

print("Model output:\n", tokenizer.decode(outputs[0], skip_special_tokens=True))


Model output:
 Hello what are you capable of?

As a large language model, I can perform a variety of text-based tasks, for example:

*   **Answering your questions:** I can provide informative answers on a wide range of topics, drawing upon the knowledge I've been trained on.
*   **Generating creative content:** I can write stories, poems, articles, code, scripts, and more.
*   **Summarizing text:** I can condense lengthy texts into concise summaries.
*   **Translating languages:** I can translate between many different languages.
*   **Following instructions:** I can execute your commands and requests within the context of our conversation.
*   **Engaging in conversations:** I can chat with you on various topics and try to understand your intent.

How can I help you today? Do you have a question, or would you like me to try generating something?


In [None]:
import torch, os
from transformers import pipeline
from PIL import Image

repo_dir = "/home/rijul/.cache/huggingface/hub/models--google--gemma-3-4b-it/snapshots/093f9f388b31de276ce2de164bdc2081324b9767"

pipe_cpu = pipeline(
    task="image-text-to-text",
    model=repo_dir,
    device_map={"": "cpu"},       # <- force CPU
    torch_dtype=torch.float32,    # bfloat16 on CPU can be flaky
    trust_remote_code=True,
    model_kwargs={"low_cpu_mem_usage": True},  # no quantization on the CPU path
)

if pipe_cpu.tokenizer.pad_token_id is None:
    pipe_cpu.tokenizer.pad_token = pipe_cpu.tokenizer.eos_token
    pipe_cpu.generation_config.pad_token_id = pipe_cpu.tokenizer.pad_token_id

img = Image.open("/home/rijul/Academic/Atopic Eczema/images/716_page1.png").convert("RGB")
img.thumbnail((768, 768))

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": img},
        {"type": "text",  "text": "Return the top 6 field labels you can read as a JSON array of strings."}
    ]
}]

out = pipe_cpu(
    text=messages,
    max_new_tokens=64,
    do_sample=False,   # greedy decoding; temperature/top_p do not apply in this mode
    num_beams=1,
    max_time=45,       # stop generation after roughly 45 seconds
)

print(out[0]["generated_text"][-1]["content"])



## 6. Build the Field Extractor Prompt and Call the Model

We send:
- A **short system instruction**
- The **RAG context chunks**
- The **user instruction** describing what to extract
- The **page image**

We ask for **strict JSON** back.
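Even when asked for strict JSON, instruction-tuned models often wrap the object in a Markdown code fence or add a sentence before it. A minimal sketch of a tolerant extraction step follows; `extract_json_object` is a hypothetical helper, not part of the local modules.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_object(reply: str):
    """Return the first JSON object found in a model reply, or None."""
    text = reply.strip()
    # If the reply contains a fenced block, look only inside it.
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1).strip()
    # Take the outermost {...} span and try to parse it.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


reply = (
    "Here is the extraction:\n"
    + FENCE + "json\n"
    + '{"duration": {"value": "6 months", "confidence": 0.8}}\n'
    + FENCE
)
parsed_reply = extract_json_object(reply)
print(parsed_reply)
```

Returning `None` instead of raising lets the caller decide whether to retry the page or record a parse failure.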


In [None]:
SYSTEM = (
    "You are a medical data extractor. "
    "Use the provided field cards, policies, abbreviations, ranges, and meds lexicon to extract values. "
    "If a value is missing or illegible, return null and set a low confidence. "
    "Return only JSON."
)

USER_INSTR = (
    "Extract the requested fields from this page. "
    "For each field, return {value, confidence (0..1), provenance: short description}. "
    "Field set is in the context."
)

def build_messages(chunks, system_text, user_text, image_path):
    # Build interleaved messages for the VLM pipeline
    content = [{"type": "text", "text": system_text}]
    # Append context chunks
    for ch in chunks:
        content.append({"type": "text", "text": ch})
    # Append user instruction + image
    content.append({"type": "text", "text": user_text})
    content.append({"type": "image", "image": image_path})
    return [{"role": "user", "content": content}]

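PAGE_IMAGE = PAGE_BATCH[0]["image"]  # page image path from Section 3 (here: the first page)
messages = build_messages(chunks, SYSTEM, USER_INSTR, PAGE_IMAGE)

# Example extraction call (uncomment to run with the actual model, e.g. pipe_cpu from Section 5)
# resp = pipe_cpu(text=messages, max_new_tokens=800)
# raw = resp[0]["generated_text"][-1]["content"]
# print(raw)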


## 7. Validate & Normalize
Apply ranges, unit/date normalization, and compute flags.

In [None]:
import json

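def validate_record(raw_json_text: str, store: RAGStore):
    try:
        data = json.loads(raw_json_text)
    except Exception as e:
        return {"ok": False, "error": f"JSON parse failed: {e}", "flags": [], "data": None}
    if not isinstance(data, dict):
        return {"ok": False, "error": "Expected a JSON object keyed by field name", "flags": [], "data": None}

    flags = []
    # Simple example: range check on the final SCORAD score
    scorad_range = store.ranges.get("range/scorad:v1", {}).get("ranges", {})
    entry = data.get("scorad_final")
    v = entry.get("value") if isinstance(entry, dict) else None
    if v is not None:
        lo, hi = scorad_range.get("scorad_total", [0, 103])
        try:
            in_range = lo <= float(v) <= hi
        except (TypeError, ValueError):
            # The model returned something that is not a number
            flags.append({"field": "scorad_final", "reason": f"Non-numeric value: {v!r}"})
        else:
            if not in_range:
                flags.append({"field": "scorad_final", "reason": f"Out of range [{lo},{hi}]"})

    return {"ok": True, "flags": flags, "data": data}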

# Example usage after a real model response:
# result = validate_record(raw, store)
# result
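
The validator above only checks the SCORAD range. The date normalization mentioned in this section could be sketched as follows, assuming day-first dates on the forms and ISO 8601 as the target format; the accepted formats here are an assumption and should follow whatever `policy/date:v1` actually specifies.

```python
from datetime import datetime

# Assumed input formats (day-first), tried in order; adjust to the date policy card.
DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d.%m.%Y", "%d/%m/%y", "%Y-%m-%d")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    if text is None:
        return None
    cleaned = str(text).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(normalize_date("05/03/2021"))
print(normalize_date("illegible"))
```

An unparseable date comes back as `None`, which can then be turned into a flag in the same way as an out-of-range score.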


## 8. (Optional) Multi-page Merge
If your PDF has multiple pages, run steps 3–7 per page and merge field-wise by confidence and page provenance.

In [None]:
def merge_records(records: list[dict]) -> dict:
    merged = {}
    for rec in records:
        for k, v in rec.items():
            if k not in merged:
                merged[k] = v
            else:
                # keep the value with higher confidence
                if v.get("confidence", 0) > merged[k].get("confidence", 0):
                    merged[k] = v
    return merged
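
`merge_records` keeps the higher-confidence entry but does not record which page it came from, and a confident `null` can displace a real value. A possible variant, shown as a sketch under the assumption that each page yields a `{field: {value, confidence, ...}}` dict, skips null values and stores the page number with the winning entry; `merge_pages` is a hypothetical name.

```python
def merge_pages(page_records):
    """Merge per-page field dicts, keeping the most confident non-null value.

    page_records: iterable of (page_number, {field: {"value": ..., "confidence": ...}}).
    """
    merged = {}
    for page, fields in page_records:
        for name, cand in fields.items():
            # Skip malformed entries and null values.
            if not isinstance(cand, dict) or cand.get("value") is None:
                continue
            conf = float(cand.get("confidence") or 0.0)
            best = merged.get(name)
            if best is None or conf > best["confidence"]:
                merged[name] = {**cand, "confidence": conf, "page": page}
    return merged


pages = [
    (1, {"duration": {"value": "6 months", "confidence": 0.8},
         "symptoms": {"value": None, "confidence": 0.95}}),
    (2, {"duration": {"value": "8 months", "confidence": 0.6},
         "symptoms": {"value": ["itching"], "confidence": 0.9}}),
]
patient = merge_pages(pages)
print(patient)
```

On ties the earlier page wins, because a later candidate must be strictly more confident to replace it.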


---

### Next
- Replace the **stub candidate extractor** with a real Gemma call (Section 5) to list headings/blocks.  
- Tune **Section selection** (batch fields in groups of 10–15 per prompt).  
- Expand the **validator** with tighter clinical rules & ontology checks.  
- Add a **resolver** step to re-query flagged fields with more focused crops.
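
The field batching mentioned in the list above could look like the sketch below. The batch size of 12 and the placeholder field names are assumptions; in the notebook the names would come from `fields_for_section(...)`.

```python
def batch_fields(field_names, batch_size=12):
    """Split a list of field names into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        field_names[i:i + batch_size]
        for i in range(0, len(field_names), batch_size)
    ]


# Placeholder names standing in for the 128 loaded fields
all_fields = [f"field_{i}" for i in range(128)]
batches = batch_fields(all_fields, batch_size=12)
print(len(batches), len(batches[-1]))
```

Each batch would then get its own context assembly and model call, so a single malformed reply affects only that batch.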


## 9. PDF → Image Conversion (in-order)

In [None]:

# Requires: pip install pdf2image pillow  AND  poppler-utils (apt)
from pathlib import Path
from typing import Iterator, Tuple
from pdf2image import convert_from_path

def pdfs_to_images_in_series(pdf_dir: str, out_dir: str, dpi: int = 200) -> Iterator[Tuple[str, int, str]]:
    """
    Convert all PDFs in pdf_dir to images in deterministic order.
    Yields (pdf_file_path, page_index_1based, image_path).
    """
    pdf_dir = Path(pdf_dir)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    for pdf_file in sorted(pdf_dir.glob("*.pdf")):
        patient_id = pdf_file.stem
        pages = convert_from_path(str(pdf_file), dpi=dpi)
        for i, page in enumerate(pages, start=1):
            out_path = out_dir / f"{patient_id}_page{i}.png"
            page.save(out_path, "PNG")
            yield str(pdf_file), i, str(out_path)
            
print("PDF→Image helper ready. Set your input/output paths below.")            


## 10. End-to-End Batch Loop (All PDFs → Extraction)

In [None]:

# Configure your folders
PDF_INPUT_DIR = "/home/rijul/Academic/Atopic Eczema/cropped"     # source PDFs
IMG_OUTPUT_DIR = "/home/rijul/Academic/Atopic Eczema/images"     # where rendered PNGs will go

# Choose which section to extract in this pass (you can run multiple passes for other sections)
SECTION = "history"  # options: "history", "scorad", "investigations", "followups"

# Instantiate helpers
extractor = CandidateExtractor()
assembler = ContextAssembler(store)
target_fields = fields_for_section(SECTION)

# Optional: real Gemma pipeline (uncomment if configured)
# from transformers import pipeline
# import torch, os
# dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
# pipe = pipeline(
#     task="image-text-to-text",
#     model="google/gemma-3-27b-it",
#     token=os.environ.get("HF_TOKEN"),
#     device_map="auto",
#     torch_dtype=dtype
# )

def build_messages(chunks, system_text, user_text, image_path):
    content = [{"type": "text", "text": system_text}]
    for ch in chunks: content.append({"type": "text", "text": ch})
    content.append({"type": "text", "text": user_text})
    content.append({"type": "image", "image": image_path})
    return [{"role": "user", "content": content}]

SYSTEM = ("You are a medical data extractor. Use the provided field cards, policies, abbreviations, ranges, "
          "and meds lexicon to extract values. If a value is missing or illegible, return null and set a low confidence. "
          "Return only JSON with keys matching canonical_name.")

USER_INSTR = ("Extract the requested fields from this page. For each field, return "
              "{value, confidence (0..1), provenance: short description}. Field set is in the context.")

results = []  # collect per-page outputs (replace with writing to disk if you prefer)

for pdf_path, page_idx, img_path in pdfs_to_images_in_series(PDF_INPUT_DIR, IMG_OUTPUT_DIR, dpi=200):
    # 1) Candidate tokens from the page
    page_tokens = extractor.extract_candidates(img_path)

    # 2) Assemble context for the chosen section
    ctx = assembler.build_context(target_fields, page_tokens=page_tokens)
    chunks = assembler.to_prompt_chunks(ctx)

    # 3) Build messages for the VLM
    messages = build_messages(chunks, SYSTEM, USER_INSTR, img_path)

    # 4) Call the model (stub shown; uncomment for real call)
    # resp = pipe(text=messages, max_new_tokens=800)
    # raw = resp[0]["generated_text"][-1]["content"]
    # For now, use a placeholder JSON string so the loop runs:
    raw = '{"duration": {"value": "6 months", "confidence": 0.8, "provenance": "upper right"}, "symptoms": {"value": ["itching"], "confidence": 0.9, "provenance": "middle"}}'

    # 5) Validate & store
    vr = validate_record(raw, store)
    results.append({
        "pdf": pdf_path,
        "page": page_idx,
        "image": img_path,
        "raw": raw,
        "validated": vr
    })

# Example summary print
print(f"Processed pages: {len(results)}")
if results:
    print("Sample record:", results[0]["pdf"], results[0]["page"], results[0]["validated"]["ok"])
else:
    print("Sample record: N/A")
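
For larger runs, the in-memory `results` list can be replaced by writing one JSON object per line to disk. A minimal sketch with the standard library follows; `write_jsonl`, `read_jsonl`, and the output filename are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, out_path):
    """Write one JSON object per line; returns the number of records written."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


sample = [
    {"pdf": "1050.pdf", "page": 1, "validated": {"ok": True, "flags": []}},
    {"pdf": "1050.pdf", "page": 2, "validated": {"ok": True, "flags": []}},
]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out" / "history_pages.jsonl"
    written = write_jsonl(sample, target)
    restored = read_jsonl(target)
print(written, restored == sample)
```

One line per page makes it easy to resume an interrupted run by checking which `(pdf, page)` pairs are already in the file.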
