
# Scalable Question Generation (MVP)
**Author:** Vansh Virani  

This notebook implements a minimal yet scalable pipeline to generate conceptual multiple-choice questions from large PDFs or text files, drawing inspiration from the Savaal paper's concept-driven RAG approach.

Deliverables produced by this notebook:
- A single JSON file at `output/questions.json` with all generated questions + metadata.
- Clear, well-commented cells explaining design choices.
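
As a rough sketch of what the output file contains, the snippet below builds a tiny stand-in with the same top-level keys the pipeline writes (`run_metadata` and `questions`) and reads back the correct answer. All values are placeholders for illustration, and `answer_text` is a hypothetical helper that is not part of the notebook.

```python
import json

# Placeholder data mirroring the structure written to output/questions.json.
sample_output = {
    "run_metadata": {"model": "gemini-1.5-flash", "k_context": 2, "quality_threshold": 0.65},
    "questions": [
        {
            "id": "idea_0001",
            "question": "Placeholder question text?",
            "choices": [
                {"label": "A", "text": "first option"},
                {"label": "B", "text": "second option"},
                {"label": "C", "text": "third option"},
                {"label": "D", "text": "fourth option"},
            ],
            "correct_label": "C",
            "difficulty": "easy",
            "quality": {"overall": 0.8},
            "source_citations": ["docs/example.txt|docs/example.txt#chunk0"],
        }
    ],
}

# Round-trip through JSON the same way the notebook saves and reloads results.
loaded = json.loads(json.dumps(sample_output, ensure_ascii=False, indent=2))

def answer_text(question: dict) -> str:
    """Hypothetical helper: return the text of the choice marked correct."""
    return next(c["text"] for c in question["choices"] if c["label"] == question["correct_label"])

first = loaded["questions"][0]
print(first["id"], "->", answer_text(first))
```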

Note: You will need API access to your chosen LLM and embedding model (default prompts assume Gemini). No secrets are stored in the notebook.



## Quickstart (Checklist)
1. Install deps (next cell).  
2. Set `GOOGLE_API_KEY` in a `.env` file next to the notebook; the Config cell loads it.  
3. Put your input documents into the `docs/` folder (PDF or .txt).  
4. Run all cells up to "Generate Questions".  
5. Inspect and optionally filter by quality.  
6. Find your final JSON at `output/questions.json`. 


## 0) Install once

In [1]:
%pip install -q pypdf tiktoken faiss-cpu numpy pandas python-dotenv tqdm rapidfuzz google-generativeai

Note: you may need to restart the kernel to use updated packages.



## Config
- We use **Gemini** for both chat and embeddings via `google-generativeai`.
- API keys are read from `.env` (not committed).
- Adjust `MAX_QUESTIONS`, `K_CONTEXT`, and `QUALITY_THRESHOLD` for speed/cost vs. quality.
- Files go in `./docs/`; results in `./output/`.
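
The tunables above are plain environment variables with defaults. A minimal sketch of that pattern follows; `env_number` is a hypothetical helper for illustration only (the notebook itself calls `int(os.getenv(...))` and `float(os.getenv(...))` directly).

```python
import os

def env_number(name: str, default, cast=float):
    """Hypothetical helper: read a numeric setting, falling back to `default` if unset or blank."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        return cast(default)
    return cast(raw)

# Simulate one overridden setting and one left at its default.
os.environ["QUALITY_THRESHOLD"] = "0.7"
os.environ.pop("K_CONTEXT", None)

threshold = env_number("QUALITY_THRESHOLD", 0.65)
k_context = env_number("K_CONTEXT", 2, cast=int)
print(threshold, k_context)
```

Raising `K_CONTEXT` sends more retrieved text to the model per question, and raising `QUALITY_THRESHOLD` keeps fewer questions.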


In [83]:
import os, json, re, math, time, datetime, uuid, logging
from pathlib import Path
from typing import List, Dict, Any, Tuple
from tqdm import tqdm

import numpy as np
from pypdf import PdfReader
import tiktoken
import faiss
from rapidfuzz import fuzz

logging.getLogger("pypdf").setLevel(logging.ERROR)

if Path(".env").exists():
    from dotenv import load_dotenv
    load_dotenv()

import google.generativeai as genai

GOOGLE_API_KEY   = os.getenv("GOOGLE_API_KEY", "")
GEMINI_MODEL     = os.getenv("GEMINI_MODEL", "gemini-1.5-flash")
GEMINI_EMBEDDING = os.getenv("GEMINI_EMBEDDING", "text-embedding-004")

if not GOOGLE_API_KEY:
    print("Set GOOGLE_API_KEY in your .env before running generation.")
genai.configure(api_key=GOOGLE_API_KEY)

ROOT = Path().resolve()
DOCS_DIR = ROOT / "docs"
OUTPUT_DIR = ROOT / "output"
DOCS_DIR.mkdir(exist_ok=True, parents=True)
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)

print("Model:", GEMINI_MODEL, "| Embedding:", GEMINI_EMBEDDING)
print("DOCS_DIR:", DOCS_DIR)
print("OUTPUT_DIR:", OUTPUT_DIR)

MAX_QUESTIONS = int(os.getenv("MAX_QUESTIONS", "12"))
K_CONTEXT     = int(os.getenv("K_CONTEXT", "2"))
QUALITY_THRESHOLD = float(os.getenv("QUALITY_THRESHOLD", "0.65"))
CHUNK_SIZE    = int(os.getenv("CHUNK_SIZE", "900"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "150"))

print("MAX_QUESTIONS:", MAX_QUESTIONS, "| K_CONTEXT:", K_CONTEXT, "| QUALITY_THRESHOLD:", QUALITY_THRESHOLD)

Model: gemini-1.5-flash | Embedding: text-embedding-004
DOCS_DIR: /Users/vanshvirani/projects/savaalish-qg-mvp/docs
OUTPUT_DIR: /Users/vanshvirani/projects/savaalish-qg-mvp/output
MAX_QUESTIONS: 20 | K_CONTEXT: 3 | QUALITY_THRESHOLD: 0.7



## Utilities — chunking and document loading

- We **chunk** long text with token overlap (to avoid cutting concepts in half).  
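- The loader reads PDFs with `pypdf` and falls back to **PyMuPDF** when `pypdf` returns no text (helps with tricky layouts). PyMuPDF is not in the install cell above, so the fallback only runs if you install it separately.  
- Image‑only PDFs need OCR, which this MVP does not do; such files are skipped unless you add a `.txt` version.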
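
To make the overlap idea concrete, here is a small sketch of the sliding window on a plain list of integers standing in for token IDs. It deliberately avoids `tiktoken` so it runs anywhere; `sliding_windows` is a hypothetical name, not a function from this notebook.

```python
from typing import List

def sliding_windows(tokens: List[int], max_tokens: int, overlap: int) -> List[List[int]]:
    """Split `tokens` into windows of up to `max_tokens`, each sharing `overlap` tokens with the previous one."""
    step = max_tokens - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than max_tokens")
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the input
    return windows

tokens = list(range(10))  # stand-in for encoded text
windows = sliding_windows(tokens, max_tokens=4, overlap=1)
for w in windows:
    print(w)
```

Each window starts with the last token of the previous one, so a sentence that straddles a boundary appears in both chunks.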

In [84]:
from typing import List, Tuple

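def chunk_text(text: str, max_tokens=800, overlap=120) -> List[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    toks = enc.encode(text)
    # Guard against overlap >= max_tokens, which would stall or reverse the window.
    step = max(1, max_tokens - overlap)
    chunks = []
    i = 0
    while i < len(toks):
        sub = toks[i: i + max_tokens]
        chunks.append(enc.decode(sub))
        if i + max_tokens >= len(toks):
            break  # this window already reaches the end; skip an overlap-only tail chunk
        i += step
    return chunks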

def load_docs(docs_dir: Path) -> List[Tuple[str, str]]:
    docs = []
    for p in docs_dir.glob("**/*"):
        if p.is_dir():
            continue

        if p.suffix.lower() == ".pdf":
            text = ""
            try:
                if p.stat().st_size < 1024:
                    print(f" Skipping {p.name}: too small/empty ({p.stat().st_size} bytes).")
                    continue
                reader = PdfReader(str(p))
                pages = [page.extract_text() or "" for page in reader.pages]
                text = "\n".join(pages).strip()
            except Exception as e:
                print(f" pypdf issue on {p.name}: {e}")
                text = ""

            if not text:
                try:
                    import fitz
                    with fitz.open(str(p)) as doc:
                        text = "\n".join(page.get_text() for page in doc).strip()
                except Exception as e:
                    print(f" PyMuPDF fallback failed for {p.name}: {e}")
                    text = ""

            if text:
                docs.append((str(p), text))
            else:
                print(f" No extractable text in {p.name}; skip or add a .txt.")

        elif p.suffix.lower() == ".txt":
            try:
                t = p.read_text(encoding="utf-8", errors="ignore").strip()
                if t:
                    docs.append((str(p), t))
                else:
                    print(f" Skipping empty txt: {p.name}")
            except Exception as e:
                print(f" Skipping txt {p.name}: {e}")
    print(f"Loaded {len(docs)} text-bearing document(s).")
    return docs



## LLM helpers — Gemini wrappers and safe JSON parsing
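
Model replies are not always clean JSON: they may be wrapped in a Markdown code fence or surrounded by chatter. The sketch below shows the general idea of tolerant parsing using only the standard library. `extract_json` is a hypothetical, simplified stand-in and not the notebook's own parser.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable

def extract_json(raw: str):
    """Hypothetical helper: parse JSON from a reply that may be fenced or wrapped in prose."""
    text = raw.strip()
    # Remove a leading fence (optionally tagged json) and a trailing fence.
    text = re.sub(r"^" + FENCE + r"(?:json)?\s*", "", text)
    text = re.sub(r"\s*" + FENCE + r"$", "", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} or [...] span inside the text.
    match = re.search(r"(\{.*\}|\[.*\])", text, flags=re.S)
    if match is None:
        raise ValueError("no JSON object or array found")
    return json.loads(match.group(1))

fenced = FENCE + 'json\n{"label": "B"}\n' + FENCE
chatty = 'Sure, here it is: [{"title": "Euler"}] Hope that helps.'

print(extract_json(fenced))
print(extract_json(chatty))
```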


In [85]:

def chat(system: str, user: str, max_tokens=700, temperature=0.0, as_json=False) -> str:
    model = genai.GenerativeModel(GEMINI_MODEL, system_instruction=system)
    gen_cfg = {"temperature": float(temperature), "max_output_tokens": int(max_tokens)}
    if as_json:
        gen_cfg["response_mime_type"] = "application/json"
    resp = model.generate_content(user, generation_config=gen_cfg)
    return (resp.text or "").strip()

def embed_texts(texts: List[str], batch: int = 64) -> np.ndarray:
    vecs = []
    for i in range(0, len(texts), batch):
        for t in texts[i:i+batch]:
            r = genai.embed_content(model=GEMINI_EMBEDDING, content=t, task_type="retrieval_document")
            vecs.append(r["embedding"])
    return np.array(vecs, dtype="float32")

import json, re
def parse_json_loose(s: str):
    if isinstance(s, (dict, list)):
        return s
    txt = (s or "").strip()
    txt = re.sub(r"^```(?:json)?\s*|\s*```$", "", txt, flags=re.S)
    m = re.search(r"(\{.*\}|\[.*\])", txt, flags=re.S)
    if m:
        block = m.group(1)
        try:
            return json.loads(block)
        except Exception:
            if '"' not in block and "'" in block:
                try:
                    return json.loads(block.replace("'", '"'))
                except Exception:
                    pass
    return json.loads(txt)



## Retrieval index and idea extraction
We prompt the LLM for up to three conceptual ideas per chunk, then merge near-duplicates by fuzzy-matching idea titles (`rapidfuzz` `token_set_ratio` at or above 88), keeping the first occurrence.
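
The merge step is a greedy "keep the first, drop later look-alikes" pass. The sketch below illustrates it with the standard library's `difflib.SequenceMatcher` swapped in for `rapidfuzz`, so its scores will not match `token_set_ratio` exactly; `similarity` and `keep_first_of_near_duplicates` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Dict, List

def similarity(a: str, b: str) -> float:
    """Case-insensitive, word-order-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    def norm(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return 100.0 * SequenceMatcher(None, norm(a), norm(b)).ratio()

def keep_first_of_near_duplicates(ideas: List[Dict[str, str]], thresh: float = 88.0) -> List[Dict[str, str]]:
    """Keep an idea only if its title is below `thresh` similarity to every idea kept so far."""
    kept: List[Dict[str, str]] = []
    for idea in ideas:
        if all(similarity(idea["title"], k["title"]) < thresh for k in kept):
            kept.append(idea)
    return kept

ideas = [
    {"title": "Euler's formula", "summary": "Links exponentials and trigonometric functions."},
    {"title": "euler's formula", "summary": "Same idea, extracted from a later chunk."},
    {"title": "Polar form of complex numbers", "summary": "Magnitude and angle representation."},
]
unique = keep_first_of_near_duplicates(ideas)
print([u["title"] for u in unique])
```

The pass is quadratic in the number of ideas, which is acceptable at a few hundred ideas but worth revisiting for much larger inputs.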


In [86]:

def build_faiss_index(chunks: List[str]):
    embs = embed_texts(chunks)
    faiss.normalize_L2(embs)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)
    return index

def retrieve_context_for_idea(idea_text: str, index, all_chunks: List[str], chunk_meta: List[Dict[str,Any]], k=3):
    qvec = embed_texts([idea_text])
    faiss.normalize_L2(qvec)
    k = max(1, min(k, len(all_chunks)))  # never ask for more neighbors than indexed chunks
    scores, idxs = index.search(qvec, k)
    items = []
    for j, sc in zip(idxs[0], scores[0]):
        if j < 0:
            continue  # FAISS pads missing results with -1; indexing with it would return the last chunk
        items.append({"rank": len(items) + 1, "score": float(sc), "chunk": all_chunks[int(j)], "meta": chunk_meta[int(j)]})
    return items

IDEA_SYSTEM = (
    "Extract conceptual ideas from academic/professional prose. "
    "Return ONLY a JSON array; each item has keys: title (str), summary (str)."
)
IDEA_USER_TMPL = (
    "Extract up to 3 non-trivial conceptual IDEAS (definitions, mechanisms, assumptions, trade-offs).\n"
    "TEXT:\n---\n{chunk}\n---\nReturn ONLY JSON."
)

def _normalize_idea_items(obj):
    if isinstance(obj, str):
        try: obj = json.loads(obj)
        except Exception:
            m = re.search(r"\[.*\]", obj, re.S)
            obj = json.loads(m.group(0)) if m else []
    if isinstance(obj, dict):
        obj = obj.get("ideas") or obj.get("items") or obj.get("results") or obj.get("data") or obj.get("concepts") or []
    if not isinstance(obj, list):
        obj = [obj]
    out = []
    for it in obj:
        if isinstance(it, str):
            t = it.strip()
            if t: out.append({"title": t[:80], "summary": t})
        elif isinstance(it, dict):
            title = (it.get("title") or it.get("idea") or it.get("concept") or it.get("name") or "").strip()
            summary = (it.get("summary") or it.get("desc") or it.get("explanation") or "").strip()
            if not title and summary: title = summary.split(".")[0][:80]
            if not summary and title: summary = title
            if title or summary: out.append({"title": title, "summary": summary})
    return out

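def extract_ideas_per_chunk(chunks: List[str]) -> List[Dict[str,str]]:
    ideas = []
    for i, ch in enumerate(chunks):
        try:
            out = chat(IDEA_SYSTEM, IDEA_USER_TMPL.format(chunk=ch), max_tokens=400, as_json=True)
            ideas.extend(_normalize_idea_items(out))
        except Exception as e:
            # One blocked or malformed response should not abort the whole extraction pass.
            print(f"!! Skipping ideas for chunk {i}: {e}")
    return ideas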

def dedup_ideas(ideas: List[Dict[str,str]], thresh=88) -> List[Dict[str,str]]:
    kept = []
    for idea in ideas:
        if not any(fuzz.token_set_ratio(idea["title"], k["title"]) >= thresh for k in kept):
            kept.append(idea)
    for i, it in enumerate(kept):
        it["id"] = f"idea_{i+1:04d}"
    return kept



## Question generation, scoring, difficulty
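
Each generated question is graded on four rubric criteria, and the unweighted mean becomes the overall score used for filtering. A minimal sketch of that aggregation, with made-up scores and a hypothetical `overall_quality` helper:

```python
from statistics import mean
from typing import Dict

RUBRIC = ("clarity", "groundedness", "non_triviality", "distractor_quality")

def overall_quality(scores: Dict[str, float]) -> float:
    """Unweighted mean of the rubric scores; a missing criterion counts as 0."""
    return round(mean(float(scores.get(k, 0.0)) for k in RUBRIC), 3)

# Made-up grades for illustration only.
graded = [
    {"id": "idea_0001", "scores": {"clarity": 0.9, "groundedness": 0.8,
                                   "non_triviality": 0.7, "distractor_quality": 0.8}},
    {"id": "idea_0002", "scores": {"clarity": 0.6, "groundedness": 0.5,
                                   "non_triviality": 0.4}},  # one criterion missing
]

threshold = 0.7
kept = [g["id"] for g in graded if overall_quality(g["scores"]) >= threshold]
print(kept)
```

Because missing criteria count as zero, an incomplete grading response pulls the question below the threshold rather than letting it through.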



In [87]:

QG_SYSTEM = (
    "You write exam-quality multiple-choice questions that test conceptual understanding.\n"
    "- 1 question only, grounded in the given context.\n"
    "- 4 options (A-D), exactly one correct.\n"
    "- Avoid trivial recall and ambiguity.\n"
    "Return strict JSON: {\"question\": str, \"choices\": [{\"label\":\"A\",\"text\":...},...], \"correct_label\": \"A\"}"
)
QG_USER_TMPL = "Idea summary: {idea}\nUse these supporting snippets to craft 1 conceptual question (avoid trivia):\n{contexts}\nReturn ONLY JSON."

QC_SYSTEM = (
    "You are grading a multiple-choice question with rubric 0.0-1.0.\n"
    "Criteria: clarity, groundedness, non_triviality, distractor_quality.\n"
    "Return JSON: {\"clarity\":x,\"groundedness\":x,\"non_triviality\":x,\"distractor_quality\":x,\"notes\":str}"
)
QC_TMPL = "Question:\n{q}\nChoices:\n{choices}\nCorrect: {correct}\nContext (evidence):\n{ctx}\n"

DIFF_SYSTEM = "Classify the Bloom level (Remember/Understand/Apply/Analyze/Evaluate/Create). Return JSON: {\"bloom\":\"Understand\"}"

def shuffle_choices(payload: Dict[str,Any]) -> Dict[str,Any]:
    choices = payload["choices"]
    correct = payload["correct_label"]
    correct_text = next(c["text"] for c in choices if c["label"] == correct)
    import random
    random.shuffle(choices)
    labels = ["A","B","C","D"]
    out_choices, new_correct = [], None
    for lab, ch in zip(labels, choices):
        out_choices.append({"label": lab, "text": ch["text"]})
        if ch["text"] == correct_text:
            new_correct = lab
    payload["choices"] = out_choices
    payload["correct_label"] = new_correct
    return payload

def generate_question_for_idea(idea: Dict[str,Any], index, all_chunks, chunk_meta, k=2) -> Dict[str,Any]:
    ctx_items = retrieve_context_for_idea(idea["summary"], index, all_chunks, chunk_meta, k=k)
    ctx_str = "\n---\n".join([it["chunk"][:1200] for it in ctx_items])
    out = chat(QG_SYSTEM, QG_USER_TMPL.format(idea=idea["summary"], contexts=ctx_str), max_tokens=700, as_json=True)
    data = parse_json_loose(out)
    data = shuffle_choices(data)
    data["id"] = idea["id"]
    data["idea_summary"] = idea["summary"]
    data["source_citations"] = [f"{it['meta']['doc']}|{it['meta']['chunk_id']}" for it in ctx_items]
    return data

def score_question(item: Dict[str,Any]) -> Dict[str,Any]:
    q = item["question"]
    ch = "\n".join([f"{c['label']}) {c['text']}" for c in item["choices"]])
    # Grade against the cited chunk text rather than the citation IDs. Each "doc|chunk_id"
    # citation is resolved through the notebook-level all_chunks / chunk_meta lists.
    text_by_id = {m["chunk_id"]: c for m, c in zip(chunk_meta, all_chunks)}
    ctx = "\n---\n".join(text_by_id.get(c.split("|", 1)[-1], c)[:1200] for c in item["source_citations"])
    out = chat(QC_SYSTEM, QC_TMPL.format(q=q, choices=ch, correct=item["correct_label"], ctx=ctx), max_tokens=400, as_json=True)
    scores = parse_json_loose(out)
    keys = ["clarity", "groundedness", "non_triviality", "distractor_quality"]
    # Cast first so string-valued scores such as "0.8" do not break the mean.
    vals = {k: float(scores.get(k, 0)) for k in keys}
    item["quality"] = {
        "overall": round(sum(vals.values()) / len(keys), 3),
        **{k: round(v, 3) for k, v in vals.items()},
        "notes": scores.get("notes",""),
    }
    return item

def add_difficulty(item: Dict[str,Any]) -> Dict[str,Any]:
    text = item["question"] + " Choices: " + "; ".join([c["text"] for c in item["choices"]])
    out = chat(DIFF_SYSTEM, text, max_tokens=80, as_json=True)
    try:
        bloom = parse_json_loose(out).get("bloom","Understand")
    except Exception:
        bloom = "Understand"
    mapping = {"Remember":"easy","Understand":"easy","Apply":"medium","Analyze":"medium","Evaluate":"hard","Create":"hard"}
    item["difficulty"] = mapping.get(bloom, "medium")
    return item



## Accuracy proxy (judge) and coverage tools
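
To spread questions across the document instead of clustering them in the first few pages, ideas are bucketed by the position of their best-matching chunk and then picked round-robin. The sketch below shows that selection on toy data; `round_robin_by_position` is a hypothetical name and the positions are invented.

```python
import math
from typing import List, Tuple

def round_robin_by_position(items: List[Tuple[int, str]], num_pick: int, bins: int = 3) -> List[str]:
    """Bucket (position, name) pairs into `bins` equal-width ranges, then take one per bucket in turn."""
    if not items:
        return []
    span = max(pos for pos, _ in items) + 1
    bin_size = max(1, math.ceil(span / bins))
    buckets: List[List[str]] = [[] for _ in range(bins)]
    for pos, name in items:
        buckets[min(pos // bin_size, bins - 1)].append(name)
    picked: List[str] = []
    while len(picked) < num_pick and any(buckets):
        for bucket in buckets:
            if bucket and len(picked) < num_pick:
                picked.append(bucket.pop(0))
    return picked

# Three ideas near the start of the document, one in the middle, one near the end.
items = [(0, "a"), (1, "b"), (2, "c"), (5, "d"), (8, "e")]
picked = round_robin_by_position(items, num_pick=3, bins=3)
print(picked)
```

Taking the first three ideas in order would have selected only material from the start of the document; the round-robin pass takes one from each region first.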

In [88]:

JUDGE_SYSTEM = (
    "You are a strict exam proctor. Given a question, options, and evidence, "
    "select the single correct label A-D. If evidence is insufficient, return 'U'. "
    "Return ONLY JSON: {\"label\":\"A|B|C|D|U\",\"justification\":str}"
)
JUDGE_TMPL = """Question:
{q}

Choices:
{choices}

Evidence:
{ctx}
"""

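def judge_label(qitem):
    ch = "\n".join([f"{c['label']}) {c['text']}" for c in qitem["choices"]])
    # Show the judge the cited chunk text, not the citation IDs. Each "doc|chunk_id"
    # citation is resolved through the notebook-level all_chunks / chunk_meta lists.
    text_by_id = {m["chunk_id"]: c for m, c in zip(chunk_meta, all_chunks)}
    ctx = "\n---\n".join(text_by_id.get(c.split("|", 1)[-1], c)[:1200] for c in qitem["source_citations"])
    out = chat(JUDGE_SYSTEM, JUDGE_TMPL.format(q=qitem["question"], choices=ch, ctx=ctx), max_tokens=200, temperature=0.0, as_json=True)
    data = parse_json_loose(out)
    return (data.get("label") or "U").strip(), data.get("justification","" )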

def cited_chunk_ids(item):
    ids = []
    for c in item["source_citations"]:
        m = re.search(r"#chunk(\d+)", c)
        if m: ids.append(int(m.group(1)))
    return ids

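def coverage_report(results, all_chunks):
    # Count distinct citation strings. They include the document path, so the same
    # chunk number in two different documents is not collapsed into one.
    used = set()
    for it in results:
        used.update(it["source_citations"])
    ratio = (len(used) / len(all_chunks)) if all_chunks else 0.0
    print(f"Chunk coverage: {len(used)}/{len(all_chunks)} = {ratio:.1%}")
    return ratio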

def primary_chunk_for_idea(idea, index, all_chunks, chunk_meta, k=1):
    ctx = retrieve_context_for_idea(idea["summary"], index, all_chunks, chunk_meta, k=k)
    if not ctx: return None
    m = re.search(r"#chunk(\d+)", ctx[0]["meta"]["chunk_id"])
    return int(m.group(1)) if m else None

def stratified_select(ideas, index, all_chunks, chunk_meta, num_questions, bins=10):
    mapped = []
    for it in ideas:
        idx = primary_chunk_for_idea(it, index, all_chunks, chunk_meta, k=1)
        if idx is not None:
            mapped.append((idx, it))
    if not mapped:
        return ideas[:num_questions]

    max_idx = max(idx for idx,_ in mapped) + 1
    bin_size = max(1, math.ceil(max_idx / bins))
    buckets = [[] for _ in range(bins)]
    for idx, it in mapped:
        b = min(idx // bin_size, bins-1)
        buckets[b].append(it)

    out = []
    while len(out) < num_questions and any(buckets):
        for b in buckets:
            if b and len(out) < num_questions:
                out.append(b.pop(0))
    return out



## Run the pipeline (end‑to‑end)


In [89]:

docs = load_docs(DOCS_DIR)
docs_names = [d[0] for d in docs]
print(f"Loaded {len(docs)} document(s).") 

all_chunks, chunk_meta = [], []
for path, text in docs:
    chunks = chunk_text(text, max_tokens=CHUNK_SIZE, overlap=CHUNK_OVERLAP)
    for idx, ch in enumerate(chunks):
        all_chunks.append(ch)
        chunk_meta.append({"doc": path, "chunk_id": f"{path}#chunk{idx}"})
print(f"Total chunks: {len(all_chunks)}") 

if not all_chunks:
    raise SystemExit("No text to process. Add PDFs/.txt into ./docs and re-run.")

index = build_faiss_index(all_chunks)

print("Extracting ideas...")
ideas_raw = extract_ideas_per_chunk(all_chunks)
ideas = dedup_ideas(ideas_raw, thresh=88)
print("Ideas found:", len(ideas))

selected_ideas = stratified_select(
    ideas, index, all_chunks, chunk_meta, num_questions=MAX_QUESTIONS, bins=10
)
print("Selected ideas:", len(selected_ideas))

results = []
for idea in tqdm(selected_ideas, desc="Generating Qs"):
    try:
        item = generate_question_for_idea(idea, index, all_chunks, chunk_meta, k=K_CONTEXT)
        item = score_question(item)
        item = add_difficulty(item)
        results.append(item)
    except Exception as e:
        print(f"!! Skipping {idea.get('id')} due to error: {type(e).__name__}: {e}")
        continue

filtered = [r for r in results if r.get("quality",{}).get("overall",0) >= QUALITY_THRESHOLD]
print(f"Generated: {len(results)}  /  Kept after quality filter (>= {QUALITY_THRESHOLD}): {len(filtered)}") 

agree, unknown, total = 0, 0, 0
for it in filtered:
    try:
        lab, _ = judge_label(it)
        if lab == "U": unknown += 1
        else:
            total += 1
            if lab == it["correct_label"]: agree += 1
    except Exception:
        pass
if total:
    print(f"Judge agreement: {agree}/{total} = {agree/total:.2%}  (unknown={unknown})")
else:
    print("Judge skipped (no items or all unknown).") 

coverage_report(filtered, all_chunks)

out = {
    "run_metadata": {
        "created_at": datetime.datetime.now().isoformat(),
        "docs": docs_names,
        "model": GEMINI_MODEL,
        "embedding_model": GEMINI_EMBEDDING,
        "k_context": K_CONTEXT,
        "quality_threshold": QUALITY_THRESHOLD,
    },
    "questions": filtered,
}
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)
with open(OUTPUT_DIR / "questions.json", "w", encoding="utf-8") as f:
    json.dump(out, f, ensure_ascii=False, indent=2)
print("Saved ->", OUTPUT_DIR / "questions.json")


Loaded 1 text-bearing document(s).
Loaded 1 document(s).
Total chunks: 147
Extracting ideas...
Ideas found: 281
Selected ideas: 20


Generating Qs: 100%|██████████| 20/20 [01:07<00:00,  3.39s/it]


Generated: 20  /  Kept after quality filter (>= 0.7): 20
Judge agreement: 17/19 = 89.47%  (unknown=0)
Chunk coverage: 52/147 = 35.4%
Saved -> /Users/vanshvirani/projects/savaalish-qg-mvp/output/questions.json



## Peek at the result


In [90]:

import json
p = OUTPUT_DIR / "questions.json"
if p.exists():
    data = json.loads(p.read_text(encoding="utf-8"))
    print("Total questions:", len(data.get("questions", [])))
    if data.get("questions"):
        from pprint import pprint
        pprint(data["questions"][0])
else:
    print("questions.json not found (run the previous cell).")


Total questions: 20
{'choices': [{'label': 'A',
              'text': 'It primarily serves as a mathematical curiosity, with '
                      'limited practical applications in engineering or '
                      'physics.'},
             {'label': 'B',
              'text': 'It proves the existence of complex numbers, resolving a '
                      'long-standing mathematical debate about their '
                      'validity.'},
             {'label': 'C',
              'text': 'It facilitates conversions between Cartesian, polar, '
                      'and exponential forms, enabling easier calculations and '
                      'geometric interpretations of complex number '
                      'operations.'},
             {'label': 'D',
              'text': 'It simplifies complex number arithmetic by providing a '
                      'single, unified representation, eliminating the need '
                      'for conversions between forms.'}],
 'correct_