
# Scalable Question Generation (MVP) - Inspired by Savaal
**Author:** (Your Name)  
**Run date:** 2025-09-17 00:57

This notebook implements a minimal yet scalable pipeline to generate conceptual multiple-choice questions from large PDFs or text files, drawing inspiration from the Savaal paper's concept-driven RAG approach.

Deliverables produced by this notebook:
- A single JSON file at `output/questions.json` with all generated questions + metadata.
- Clear, well-commented cells explaining design choices.
- Optional bonus: automatic quality scoring and difficulty tagging (Bloom's levels).

Note: You will need API access to an LLM and an embedding model. The code below uses Google Gemini through `google-generativeai`; set `GOOGLE_API_KEY` as an environment variable or in a local `.env` file. No secrets are stored in the notebook.



## Quickstart (Checklist)
1. Install deps (next cell).  
2. Set `GOOGLE_API_KEY` in your environment or a local `.env` file, then run the Config cell.  
3. Put your input documents into the `docs/` folder (PDF or .txt).  
4. Run all cells up to "Generate Questions".  
5. Inspect and optionally filter by quality.  
6. Find your final JSON at `output/questions.json`.  
7. Record a 3-minute demo walking through: dataflow diagram -> short run -> final JSON.


In [1]:

# 1) Install dependencies (run once)
# The %pip magic installs into the running kernel's environment, so this works in Colab and local Jupyter alike.
# Note: FAISS wheel name varies by platform; 'faiss-cpu' works for most.
%pip install -q pypdf tiktoken faiss-cpu numpy pandas python-dotenv tqdm rapidfuzz google-generativeai



Note: you may need to restart the kernel to use updated packages.



## Config
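Set API keys via environment variables or a `.env` file (not included in submission).  
The code uses Google Gemini for both chat and embeddings (`gemini-1.5-flash` and `text-embedding-004` by default). Override these with `GEMINI_MODEL` and `GEMINI_EMBEDDING`, or swap vendors by replacing the `chat` and `embed_texts` helpers.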


In [66]:

import os, pathlib, json, re, math, uuid, time, datetime
from typing import List, Dict, Any, Tuple
from tqdm import tqdm
import numpy as np

# PDF parsing + chunking
from pypdf import PdfReader
import tiktoken

# Retrieval
import faiss

# Optional bonus
from rapidfuzz import fuzz

# --- Google Gemini setup ---
import google.generativeai as genai

if os.path.exists(".env"):
    from dotenv import load_dotenv
    load_dotenv()

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY", "")
GEMINI_MODEL = os.getenv("GEMINI_MODEL", "gemini-1.5-flash")          # or gemini-1.5-pro
GEMINI_EMBEDDING = os.getenv("GEMINI_EMBEDDING", "text-embedding-004") # 768-dim

if not GOOGLE_API_KEY:
    print("⚠️ Set GOOGLE_API_KEY in your .env")
genai.configure(api_key=GOOGLE_API_KEY)



In [67]:
from pathlib import Path

ROOT = Path().resolve()
DOCS_DIR = ROOT / "docs"
OUTPUT_DIR = ROOT / "output"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("DOCS_DIR =", DOCS_DIR)
print("OUTPUT_DIR =", OUTPUT_DIR)


DOCS_DIR = /Users/vanshvirani/projects/savaalish-qg-mvp/docs
OUTPUT_DIR = /Users/vanshvirani/projects/savaalish-qg-mvp/output



## Design at a glance
- Parse PDFs -> chunk text with token-aware windows (overlap to preserve coherence).
- Map->Combine->Reduce: extract main ideas per chunk via LLM; lightly de-duplicate/merge.
- Retrieve top-k supporting passages per idea using FAISS over embeddings.
- Generate one MCQ per idea with grounded context -> shuffle choices to remove positional bias.
- Quality: LLM rubric + heuristics (groundedness, clarity, distractor plausibility).  
- Difficulty: Bloom's tags (Remember/Understand/Apply/Analyze/Evaluate/Create) collapsed to easy/medium/hard.
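
The overlapping-window idea in the first bullet can be illustrated without a tokenizer. The sketch below assumes the tokens are already a list of integers; `sliding_windows` is a hypothetical helper, not a function from this notebook, and its early `break` (which avoids a trailing window made only of overlap) is an addition of the sketch.

```python
from typing import List

def sliding_windows(tokens: List[int], max_tokens: int, overlap: int) -> List[List[int]]:
    """Split tokens into windows of at most max_tokens that share `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    stride = max_tokens - overlap  # how far each window advances
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the input
    return windows

tokens = list(range(10))
windows = sliding_windows(tokens, max_tokens=4, overlap=1)
print(windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each window starts `max_tokens - overlap` tokens after the previous one, so the last `overlap` tokens of a window reappear at the start of the next.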


In [68]:

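# Utility: token-aware chunking
def chunk_text(text: str, model_name: str = "gpt-4o-mini", max_tokens=800, overlap=120) -> List[str]:
    # model_name is kept for compatibility but unused: cl100k_base is an OpenAI
    # encoding, so counts only approximate Gemini's tokens (fine for sizing windows).
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    enc = tiktoken.get_encoding("cl100k_base")
    # disallowed_special=() encodes strings like "<|endoftext|>" as plain text instead of raising
    toks = enc.encode(text, disallowed_special=())
    chunks = []
    i = 0
    while i < len(toks):
        sub = toks[i : i + max_tokens]
        chunks.append(enc.decode(sub))
        i += max_tokens - overlap  # stride; the guard above keeps it positive
    return chunks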

# PDF/text loader (basic version; the Load & Chunk cell redefines it with error handling)
def load_docs(docs_dir: pathlib.Path) -> List[Tuple[str, str]]:
    docs = []
    for p in docs_dir.glob("**/*"):
        if p.suffix.lower() == ".pdf":
            reader = PdfReader(str(p))
            pages = [page.extract_text() or "" for page in reader.pages]
            docs.append((str(p), "\n".join(pages)))
        elif p.suffix.lower() == ".txt":
            docs.append((str(p), p.read_text(encoding="utf-8", errors="ignore")))
    if not docs:
        print(f"No documents found in {docs_dir}. Add PDFs or .txt files.")
    return docs

# Embeddings
def embed_texts(texts: List[str], batch: int = 64) -> np.ndarray:
    # One API call per text; `batch` only groups the loop and does not batch requests.
    vecs = []
    for i in range(0, len(texts), batch):
        for t in texts[i:i+batch]:
            # task_type is optional; "retrieval_document" is used for chunks and queries alike
            r = genai.embed_content(
                model=GEMINI_EMBEDDING,
                content=t,
                task_type="retrieval_document"
            )
            vecs.append(r["embedding"])
    return np.array(vecs, dtype="float32")

# Build FAISS index
def build_faiss_index(chunks: List[str]) -> Tuple[faiss.IndexFlatIP, np.ndarray]:
    embs = embed_texts(chunks)
    faiss.normalize_L2(embs)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)
    return index, embs

# LLM call helper
def chat(system: str, user: str, max_tokens=700, temperature=0.0, as_json=False) -> str:
    model = genai.GenerativeModel(GEMINI_MODEL, system_instruction=system)
    gen_cfg = {"temperature": float(temperature), "max_output_tokens": int(max_tokens)}
    if as_json:
        # Ask Gemini for a JSON response body (no surrounding prose)
        gen_cfg["response_mime_type"] = "application/json"
    resp = model.generate_content(user, generation_config=gen_cfg)
    try:
        text = resp.text
    except ValueError:
        # resp.text raises when the response has no text part (e.g. blocked or empty)
        text = ""
    return (text or "").strip()


In [69]:
import json, re

def parse_json_loose(s: str):
    """
    Accepts messy model output and returns a Python object.
    Handles code fences, leading/trailing prose, and single-quoted JSON.
    """
    if isinstance(s, (dict, list)):
        return s
    txt = (s or "").strip()

    # strip code fences like ```json ... ```
    txt = re.sub(r"^```(?:json)?\s*|\s*```$", "", txt, flags=re.S)

    # find the first JSON object/array block
    m = re.search(r"(\{.*\}|\[.*\])", txt, flags=re.S)
    if m:
        block = m.group(1)
        try:
            return json.loads(block)
        except Exception:
            # light fallback: single quotes → double quotes (only when safe)
            if '"' not in block and "'" in block:
                try:
                    return json.loads(block.replace("'", '"'))
                except Exception:
                    pass
    # final attempt: try plain loads anyway
    return json.loads(txt)



## Load & Chunk


In [70]:
from pathlib import Path
from pypdf.errors import PdfReadError

def load_docs(docs_dir: Path):
    docs = []
    for p in docs_dir.glob("**/*"):
        if p.is_dir():
            continue

        # PDFs
        if p.suffix.lower() == ".pdf":
            try:
                size = p.stat().st_size
                if size < 1024:
                    print(f"⚠️ Skipping {p.name}: too small/empty ({size} bytes).")
                    continue
                reader = PdfReader(str(p))
                pages = [page.extract_text() or "" for page in reader.pages]
                text = "\n".join(pages).strip()
                if not text:
                    print(f"⚠️ No extractable text in {p.name} (likely scanned/image-only). Use OCR or a .txt for now.")
                    continue
                docs.append((str(p), text))
            except Exception as e:
                print(f"⚠️ Skipping unreadable PDF {p.name}: {e}")
                continue

        # Plain text
        elif p.suffix.lower() == ".txt":
            try:
                text = p.read_text(encoding="utf-8", errors="ignore").strip()
                if not text:
                    print(f"⚠️ Skipping empty txt: {p.name}")
                    continue
                docs.append((str(p), text))
            except Exception as e:
                print(f"⚠️ Skipping txt {p.name}: {e}")
                continue

    print(f"Loaded {len(docs)} text-bearing document(s).")
    return docs

docs = load_docs(DOCS_DIR)
docs_names = [d[0] for d in docs]
print(f"Loaded {len(docs)} document(s).")
all_chunks, chunk_meta = [], []
for path, text in docs:
    chunks = chunk_text(text, max_tokens=900, overlap=150)
    for idx, ch in enumerate(chunks):
        all_chunks.append(ch)
        chunk_meta.append({"doc": path, "chunk_id": f"{path}#chunk{idx}"})
print(f"Total chunks: {len(all_chunks)}")



Loaded 1 text-bearing document(s).
Loaded 1 document(s).
Total chunks: 147



## Map -> Combine -> Reduce: Extract candidate ideas
We prompt the LLM to extract up to 3 conceptual ideas per chunk. Then we drop near-duplicates by fuzzy-matching idea titles (`token_set_ratio` >= 88), keeping the first occurrence.
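
The de-duplication step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's method: it swaps rapidfuzz's `token_set_ratio` for `difflib.SequenceMatcher` over sorted, lower-cased words, so the scores are not equivalent and the 0.88 threshold is only chosen to mirror the notebook's 88. `similarity` and `dedup_titles` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Dict, List

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] that ignores case and word order."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()

def dedup_titles(ideas: List[Dict[str, str]], threshold: float = 0.88) -> List[Dict[str, str]]:
    """Keep an idea only if its title is not too similar to any idea kept so far."""
    kept: List[Dict[str, str]] = []
    for idea in ideas:
        if all(similarity(idea["title"], k["title"]) < threshold for k in kept):
            kept.append(idea)
    return kept

ideas = [
    {"title": "Bias-variance trade-off"},
    {"title": "bias-variance trade-off"},
    {"title": "Trade-off bias-variance"},
    {"title": "Gradient descent convergence"},
]
unique = dedup_titles(ideas)
print([i["title"] for i in unique])
# ['Bias-variance trade-off', 'Gradient descent convergence']
```

Because the first occurrence wins, the order in which chunks are processed decides which phrasing of a repeated idea survives.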


In [71]:
import json, re
from rapidfuzz import fuzz

IDEA_SYSTEM = (
    "Extract conceptual ideas from academic/professional prose. "
    "Return ONLY a JSON array; each item has keys: title (str), summary (str)."
)
IDEA_USER_TMPL = (
    "Extract up to 3 non-trivial conceptual IDEAS (definitions, mechanisms, assumptions, trade-offs).\n"
    "TEXT:\n---\n{chunk}\n---\nReturn ONLY JSON."
)

def _normalize_idea_items(obj):
    """Coerce Gemini outputs to a list of {title, summary} dicts."""
    # If raw string, try to parse JSON or pull the first JSON-looking array
    if isinstance(obj, str):
        try:
            obj = json.loads(obj)
        except Exception:
            m = re.search(r"\[.*\]", obj, re.S)
            obj = json.loads(m.group(0)) if m else []

    # If dict, look for common container keys
    if isinstance(obj, dict):
        obj = (
            obj.get("ideas")
            or obj.get("items")
            or obj.get("results")
            or obj.get("data")
            or obj.get("concepts")
            or []
        )

    # Ensure list
    if not isinstance(obj, list):
        obj = [obj]

    normalized = []
    for it in obj:
        if isinstance(it, str):
            t = it.strip()
            if not t:
                continue
            normalized.append({"title": t[:80], "summary": t})
        elif isinstance(it, dict):
            title = (
                it.get("title")
                or it.get("idea")
                or it.get("concept")
                or it.get("name")
                or ""
            ).strip()
            summary = (
                it.get("summary")
                or it.get("desc")
                or it.get("explanation")
                or ""
            ).strip()
            if not title and summary:
                title = summary.split(".")[0][:80]
            if not summary and title:
                summary = title
            if title or summary:
                normalized.append({"title": title, "summary": summary})
    return normalized

def extract_ideas_per_chunk(chunks):
    ideas = []
    for ch in chunks:
        out = chat(
            IDEA_SYSTEM,
            IDEA_USER_TMPL.format(chunk=ch),
            max_tokens=400,
            as_json=True,          # ← important for clean JSON
        )
        ideas.extend(_normalize_idea_items(out))
    return ideas

def dedup_ideas(ideas, thresh=88):
    kept = []
    for idea in ideas:
        if not any(fuzz.token_set_ratio(idea["title"], k["title"]) >= thresh for k in kept):
            kept.append(idea)
    for i, it in enumerate(kept):
        it["id"] = f"idea_{i+1:04d}"
    return kept


ideas_raw = extract_ideas_per_chunk(all_chunks)
print("ideas_raw:", len(ideas_raw))
ideas = dedup_ideas(ideas_raw, 88)
print("ideas:", len(ideas))


ideas_raw: 441
ideas: 278



## Retrieval: Top-k supporting context per idea
We index chunks with FAISS and fetch the top-k most relevant passages to ground each question.
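
An inner-product search over L2-normalized vectors ranks passages by cosine similarity. The toy sketch below reproduces that ranking in pure Python on hand-made 3-dimensional vectors; `l2_normalize` and `top_k` are hypothetical helpers for illustration and stand in for the FAISS index, which does the same computation at scale.

```python
import math
from typing import List, Tuple

def l2_normalize(v: List[float]) -> List[float]:
    """Scale v to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def top_k(query: List[float], docs: List[List[float]], k: int) -> List[Tuple[int, float]]:
    """Return (doc_index, cosine_score) for the k best-matching docs, best first."""
    q = l2_normalize(query)
    scored = []
    for i, d in enumerate(docs):
        dn = l2_normalize(d)
        scored.append((i, sum(a * b for a, b in zip(q, dn))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

docs = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 3.0, 0.0]]
hits = top_k([1.0, 0.1, 0.0], docs, k=2)
print([i for i, _ in hits])  # [0, 2]
```

Normalization removes the effect of vector length, so the second document's larger magnitude does not help it outrank the others.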


In [72]:

index, _embs = build_faiss_index(all_chunks)

def retrieve_context_for_idea(idea_text: str, k=3) -> List[Dict[str, Any]]:
    qvec = embed_texts([idea_text])
    faiss.normalize_L2(qvec)
    scores, idxs = index.search(qvec, k)
    items = []
    for rank, (j, sc) in enumerate(zip(idxs[0], scores[0]), 1):
        if j < 0:
            # FAISS pads results with -1 when the index holds fewer than k vectors
            continue
        items.append({
            "rank": rank,
            "score": float(sc),
            "chunk": all_chunks[int(j)],
            "meta": chunk_meta[int(j)]
        })
    return items



## Question Generation
One MCQ per idea with grounded context. We also shuffle choices to avoid positional bias.
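
One way to shuffle without losing track of the answer is to follow the correct choice by position instead of by text, which stays correct even if two options happen to share the same wording. The sketch below is a hedged variant for illustration; `shuffle_keep_answer` is a hypothetical name, and passing a seeded `random.Random` is an assumption made here for reproducible runs.

```python
import random
import string
from typing import Any, Dict

def shuffle_keep_answer(item: Dict[str, Any], rng: random.Random) -> Dict[str, Any]:
    """Return a copy of item with shuffled choices and an updated correct_label."""
    choices = item["choices"]
    # Position of the correct choice before shuffling
    correct_idx = next(i for i, c in enumerate(choices) if c["label"] == item["correct_label"])
    order = list(range(len(choices)))
    rng.shuffle(order)  # order[pos] is the original index shown at new position pos
    labels = string.ascii_uppercase[:len(choices)]
    new_choices = [
        {"label": labels[pos], "text": choices[src]["text"]}
        for pos, src in enumerate(order)
    ]
    return {**item, "choices": new_choices, "correct_label": labels[order.index(correct_idx)]}

item = {
    "question": "Which step grounds each question in the source text?",
    "choices": [
        {"label": "A", "text": "Retrieval of supporting passages"},
        {"label": "B", "text": "Shuffling the answer choices"},
        {"label": "C", "text": "Tagging the Bloom level"},
        {"label": "D", "text": "Writing the output file"},
    ],
    "correct_label": "A",
}
shuffled = shuffle_keep_answer(item, random.Random(0))
answer = next(c["text"] for c in shuffled["choices"] if c["label"] == shuffled["correct_label"])
print(answer)  # Retrieval of supporting passages
```

The function returns a new dict and leaves the input untouched, which makes it safe to re-shuffle the same item with different seeds.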


In [73]:

QG_SYSTEM = "You write exam-quality multiple-choice questions that test conceptual understanding.\n- 1 question only, grounded in the given context.\n- 4 options (A-D), with exactly one correct.\n- No trivial recall (dates/numbers) unless core to the concept.\n- Avoid vague or ambiguous wording.\nReturn strict JSON: {\"question\": str, \"choices\": [{\"label\":\"A\",\"text\":...},...], \"correct_label\":\"A\"}"
QG_USER_TMPL = "Idea summary: {idea}\nUse these supporting snippets (may be partial) to craft 1 conceptual question:\n{contexts}\nReturn ONLY the specified JSON."

import random
def shuffle_choices(payload: Dict[str,Any]) -> Dict[str,Any]:
    choices = payload["choices"]
    # track which is correct BEFORE shuffle
    correct = payload["correct_label"]
    correct_text = next(c["text"] for c in choices if c["label"]==correct)
    # shuffle
    labels = ["A","B","C","D"]
    random.shuffle(choices)
    # reassign labels
    out_choices = []
    new_correct = None
    for lab, ch in zip(labels, choices):
        out_choices.append({"label": lab, "text": ch["text"]})
        if ch["text"] == correct_text:
            new_correct = lab
    payload["choices"] = out_choices
    payload["correct_label"] = new_correct
    return payload

def generate_question_for_idea(idea: Dict[str,str], k=3) -> Dict[str,Any]:
    ctx_items = retrieve_context_for_idea(idea["summary"], k=k)
    # Join passages with real newlines ("\\n" would insert a literal backslash-n into the prompt)
    ctx_str = "\n---\n".join([it["chunk"][:1200] for it in ctx_items])
    out = chat(QG_SYSTEM, QG_USER_TMPL.format(idea=idea["summary"], contexts=ctx_str), max_tokens=700, as_json=True)
    data = parse_json_loose(out)
    data = shuffle_choices(data)
    data["id"] = idea["id"]
    data["idea_summary"] = idea["summary"]
    data["source_citations"] = [f"{it['meta']['doc']}|{it['meta']['chunk_id']}" for it in ctx_items]
    return data



## Quality Control (Bonus)
We score each question on: clarity, groundedness, non-triviality, distractor quality -> average to an overall score.  
Threshold (default >= 0.7) filters out weaker items.
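
As a worked example of the averaging and threshold rule, the sketch below computes the unweighted mean of the four criteria and applies the 0.7 cutoff. The scores are made-up inputs, and `overall_score` and `passes` are hypothetical helpers rather than notebook functions.

```python
from statistics import mean
from typing import Dict

CRITERIA = ("clarity", "groundedness", "non_triviality", "distractor_quality")

def overall_score(scores: Dict[str, float]) -> float:
    """Unweighted mean of the four rubric criteria; a missing criterion counts as 0."""
    return round(mean(float(scores.get(c, 0.0)) for c in CRITERIA), 3)

def passes(scores: Dict[str, float], threshold: float = 0.7) -> bool:
    """True when the overall score meets the quality threshold."""
    return overall_score(scores) >= threshold

strong = {"clarity": 0.9, "groundedness": 0.8, "non_triviality": 0.7, "distractor_quality": 0.6}
weak = {"clarity": 0.9, "groundedness": 0.4, "non_triviality": 0.7, "distractor_quality": 0.6}
print(overall_score(strong), passes(strong))  # 0.75 True
print(overall_score(weak), passes(weak))      # 0.65 False
```

Because the mean is unweighted, a high clarity score can partly offset weak groundedness; a stricter variant could also require a minimum on groundedness alone.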


In [74]:

QC_SYSTEM = "You are grading a multiple-choice question with rubric 0.0-1.0.\nCriteria:\n- clarity: clear, unambiguous stem\n- groundedness: answer supported by provided context\n- non_triviality: requires understanding (not copy-paste recall)\n- distractor_quality: plausible but clearly incorrect\nReturn JSON: {\"clarity\":x,\"groundedness\":x,\"non_triviality\":x,\"distractor_quality\":x,\"notes\":str}"
QC_USER_TMPL = "Question:\n{q}\nChoices: {choices}\nCorrect: {correct}\nContext (evidence):\n{ctx}\n"

def score_question(item: Dict[str,Any]) -> Dict[str,Any]:
    # Grade against the retrieved passage text; citation strings alone carry no evidence.
    cite_to_chunk = {f"{m['doc']}|{m['chunk_id']}": txt for txt, m in zip(all_chunks, chunk_meta)}
    ctx = "\n---\n".join(cite_to_chunk.get(c, c)[:1200] for c in item["source_citations"])
    q = item["question"]
    ch = "; ".join([f"{c['label']}) {c['text']}" for c in item["choices"]])
    out = chat(QC_SYSTEM, QC_USER_TMPL.format(q=q, choices=ch, correct=item["correct_label"], ctx=ctx), max_tokens=400, as_json=True)
    keys = ["clarity","groundedness","non_triviality","distractor_quality"]
    try:
        scores = parse_json_loose(out)
        if not isinstance(scores, dict):
            raise ValueError("expected a JSON object")
    except Exception:
        # Neutral fallback so one unparseable grading response does not drop the question
        scores = {"clarity":0.6,"groundedness":0.6,"non_triviality":0.6,"distractor_quality":0.6,"notes":"parse-fallback"}
    raw = [float(scores.get(k, 0)) for k in keys]
    overall = float(np.mean(raw))
    item["quality"] = {"overall": round(overall,3), **{k: round(v,3) for k, v in zip(keys, raw)}, "notes": scores.get("notes","")}
    return item



## Difficulty Tagging (Bonus)
We map Bloom levels -> easy/medium/hard.


In [75]:

DIFF_SYSTEM = "You are a psychometrics expert. Classify the question's Bloom level (Remember, Understand, Apply, Analyze, Evaluate, Create).\nReturn JSON: {\"bloom\": \"Understand\"}"

def add_difficulty(item: Dict[str,Any]) -> Dict[str,Any]:
    text = item["question"] + " Choices: " + "; ".join([c["text"] for c in item["choices"]])
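    out = chat(DIFF_SYSTEM, text, max_tokens=100, as_json=True)
    try:
        # parse_json_loose tolerates code fences and stray prose that would break json.loads
        bloom = str(parse_json_loose(out).get("bloom", "Understand")).strip().capitalize()
    except Exception:
        bloom = "Understand"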
    mapping = {"Remember":"easy","Understand":"easy","Apply":"medium","Analyze":"medium","Evaluate":"hard","Create":"hard"}
    item["difficulty"] = mapping.get(bloom, "medium")
    return item



## Generate Questions
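Adjust `MAX_QUESTIONS` as needed. Idea extraction and chunk embedding run once per chunk, whatever the question budget. Each question then adds a fixed number of calls, and its prompt holds at most `K_CONTEXT` retrieved passages rather than the whole document, so per-question cost does not grow with document length. This is why the pipeline scales better on long documents than prompting with the full text.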
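
A rough call count makes the cost structure concrete. The sketch below assumes the structure used in this notebook: one idea-extraction call and one embedding call per chunk, then per question one query embedding plus three LLM calls (generation, quality scoring, difficulty tagging). `n_chunks` and `estimate_calls` are hypothetical helpers, and the result counts requests only, not tokens or dollars.

```python
import math
from typing import Dict

def n_chunks(n_tokens: int, max_tokens: int = 900, overlap: int = 150) -> int:
    """Number of windows produced when each window advances by max_tokens - overlap."""
    stride = max_tokens - overlap
    return math.ceil(n_tokens / stride) if n_tokens > 0 else 0

def estimate_calls(chunks: int, questions: int) -> Dict[str, int]:
    """Count API requests: fixed per-chunk work plus a constant amount per question."""
    return {
        "llm_calls": chunks + 3 * questions,     # idea extraction + (generate, score, tag)
        "embedding_calls": chunks + questions,   # chunk index + one query per idea
    }

print(n_chunks(1501))                              # 3
print(estimate_calls(chunks=147, questions=20))    # {'llm_calls': 207, 'embedding_calls': 167}
```

With 147 chunks and 20 questions, most requests come from the one-time per-chunk pass, so raising `MAX_QUESTIONS` adds comparatively little.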


In [76]:

MAX_QUESTIONS = int(os.environ.get("MAX_QUESTIONS", "20"))
K_CONTEXT = int(os.environ.get("K_CONTEXT", "3"))
QUALITY_THRESHOLD = float(os.environ.get("QUALITY_THRESHOLD", "0.70"))

results = []
for idea in tqdm(ideas[:MAX_QUESTIONS], desc="Generating Qs"):
    try:
        item = generate_question_for_idea(idea, k=K_CONTEXT)
        item = score_question(item)
        item = add_difficulty(item)
        results.append(item)
    except Exception as e:
        print(f"!! Skipping {idea.get('id')} due to parse error: {e}")
        continue


# Filter by quality
filtered = [r for r in results if r.get("quality",{}).get("overall",0) >= QUALITY_THRESHOLD]
print(f"Generated: {len(results)}  /  Kept after quality filter (>={QUALITY_THRESHOLD}): {len(filtered)}")


Generating Qs:  65%|██████▌   | 13/20 [00:44<00:20,  2.92s/it]

!! Skipping idea_0013 due to parse error: Expecting ',' delimiter: line 1 column 525 (char 524)


Generating Qs: 100%|██████████| 20/20 [01:05<00:00,  3.25s/it]

Generated: 19  /  Kept after quality filter (>=0.7): 19






## Save JSON


In [77]:
out = {
    "run_metadata": {
        "created_at": datetime.datetime.now().isoformat(),
        "docs": docs_names,
        "model": GEMINI_MODEL,
        "embedding_model": GEMINI_EMBEDDING,
        "k_context": K_CONTEXT,
        "cost_estimate_usd": None
    },
    "questions": filtered
}
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)
with open(OUTPUT_DIR / "questions.json", "w", encoding="utf-8") as f:
    json.dump(out, f, ensure_ascii=False, indent=2)
print("Saved ->", OUTPUT_DIR / "questions.json")


Saved -> /Users/vanshvirani/projects/savaalish-qg-mvp/output/questions.json



## Appendix: 30-second Architecture
```
docs/ (PDF, txt)
   |
   |-- parse & chunk (token-aware windows)
   |         |
   |         `-- map->combine->reduce: extract conceptual ideas (LLM)
   |                      |
   |-- embed chunks ------+--> FAISS index
   |                      |
   `-- per-idea retrieve top-k context
                          |
                      question generation (LLM)
                          |
               quality scoring & difficulty (LLM)
                          |
                   output/questions.json
```
