
# Scalable Question Generation (MVP) - Inspired by Savaal
**Author:** (Your Name)  
**Run date:** 2025-09-17 00:57

This notebook implements a minimal yet scalable pipeline to generate conceptual multiple-choice questions from large PDFs or text files, drawing inspiration from the Savaal paper's concept-driven RAG approach.

Deliverables produced by this notebook:
- A single JSON file at `output/questions.json` with all generated questions + metadata.
- Clear, well-commented cells explaining design choices.
- Optional bonus: automatic quality scoring and difficulty tagging (Bloom's levels).

Note: You will need API access to your chosen LLM and embedding model (default prompts assume OpenAI). No secrets are stored in the notebook; set environment variables locally.



## Quickstart (Checklist)
1. Install deps (next cell).  
2. Set environment variables (API key) in your shell or a `.env` file; the Config cell reads them.  
3. Put your input documents into the `docs/` folder (PDF or .txt).  
4. Run all cells up to "Generate Questions".  
5. Inspect and optionally filter by quality.  
6. Find your final JSON at `output/questions.json`.  
7. Record a 3-minute demo walking through: dataflow diagram -> short run -> final JSON.


In [None]:

# 1) Install dependencies (run once)
# Works as-is in both Colab and local Jupyter; %pip installs into the running kernel's environment.
# Note: FAISS wheel name varies by platform; 'faiss-cpu' works for most.
%pip install -q pypdf tiktoken faiss-cpu numpy pandas openai python-dotenv tqdm rapidfuzz



## Config
Set API keys via environment variables or `.env` file (not included in submission).  
Default uses OpenAI for both chat-completions and embeddings; feel free to swap vendors.


In [None]:

import os, pathlib, json, re, math, uuid, time, datetime
from dataclasses import dataclass
from typing import List, Dict, Any, Tuple
from tqdm import tqdm

# LLM + embeddings (default: OpenAI)
from openai import OpenAI
import numpy as np

# PDF parsing + chunking
from pypdf import PdfReader
import tiktoken

# Retrieval
import faiss

# Optional bonus
from rapidfuzz import fuzz

# Project paths
ROOT = pathlib.Path().resolve()
DOCS_DIR = ROOT / "docs"
OUTPUT_DIR = ROOT / "output"
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)

# Load secrets if present
if os.path.exists(".env"):
    from dotenv import load_dotenv
    load_dotenv()

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
OPENAI_MODEL = os.environ.get("OPENAI_MODEL", "gpt-4o-mini")
OPENAI_EMBEDDING = os.environ.get("OPENAI_EMBEDDING", "text-embedding-3-large")

if not OPENAI_API_KEY:
    print("Set OPENAI_API_KEY in your environment. Example: export OPENAI_API_KEY=sk-...")

client = OpenAI(api_key=OPENAI_API_KEY)



## Design at a glance
- Parse PDFs -> chunk text with token-aware windows (overlap to preserve coherence).
- Map->Combine->Reduce: extract main ideas per chunk via LLM; lightly de-duplicate/merge.
- Retrieve top-k supporting passages per idea using FAISS over embeddings.
- Generate one MCQ per idea with grounded context -> shuffle choices to remove positional bias.
- Quality: LLM rubric + heuristics (groundedness, clarity, distractor plausibility).  
- Difficulty: Bloom's tags (Remember/Understand/Apply/Analyze/Evaluate/Create) collapsed to easy/medium/hard.
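
To make the first bullet concrete, here is a minimal sketch of how overlapping windows advance over a token sequence. The helper name `window_spans` is hypothetical, and it works on token counts only (no tokenizer), so it illustrates the stride arithmetic rather than the notebook's actual chunker.

```python
from typing import List, Tuple

def window_spans(n_tokens: int, max_tokens: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets for overlapping windows."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    stride = max_tokens - overlap  # each window starts this many tokens after the previous one
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + max_tokens, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the window reached the end; a further window would be pure overlap
        start += stride
    return spans

print(window_spans(2000, 900, 150))  # [(0, 900), (750, 1650), (1500, 2000)]
```

With 900-token windows and a 150-token overlap, consecutive windows share 150 tokens, so a sentence that straddles a boundary appears whole in at least one chunk.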


In [None]:

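# Utility: token-aware chunking
def chunk_text(text: str, model_name: str = "gpt-4o-mini", max_tokens=800, overlap=120) -> List[str]:
    if not 0 <= overlap < max_tokens:
        # Otherwise the window would never advance and the loop would not terminate
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    # Prefer the model's own tokenizer; fall back to cl100k_base as an approximation
    try:
        enc = tiktoken.encoding_for_model(model_name)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")
    # disallowed_special=() treats strings like "<|endoftext|>" in a document as plain text
    toks = enc.encode(text, disallowed_special=())
    chunks = []
    i = 0
    while i < len(toks):
        sub = toks[i : i + max_tokens]
        chunks.append(enc.decode(sub))
        if i + max_tokens >= len(toks):
            break  # last window reached the end; skip a trailing chunk that is pure overlap
        i += max_tokens - overlap
    return chunks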

# PDF/text loader
def load_docs(docs_dir: pathlib.Path) -> List[Tuple[str, str]]:
    docs = []
    for p in docs_dir.glob("**/*"):
        if p.suffix.lower() == ".pdf":
            reader = PdfReader(str(p))
            pages = [page.extract_text() or "" for page in reader.pages]
            docs.append((str(p), "\n".join(pages)))
        elif p.suffix.lower() == ".txt":
            docs.append((str(p), p.read_text(encoding="utf-8", errors="ignore")))
    if not docs:
        print(f"No documents found in {docs_dir}. Add PDFs or .txt files.")
    return docs

# Embeddings
def embed_texts(texts: List[str], batch=64) -> np.ndarray:
    vecs = []
    for i in range(0, len(texts), batch):
        resp = client.embeddings.create(model=OPENAI_EMBEDDING, input=texts[i:i+batch])
        vecs.extend([d.embedding for d in resp.data])
    return np.array(vecs, dtype="float32")

# Build FAISS index
def build_faiss_index(chunks: List[str]) -> Tuple[faiss.IndexFlatIP, np.ndarray]:
    embs = embed_texts(chunks)
    faiss.normalize_L2(embs)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)
    return index, embs

# LLM call helper
def chat(system: str, user: str, max_tokens=700, temperature=0.0) -> str:
    resp = client.chat.completions.create(
        model=OPENAI_MODEL,
        messages=[{"role":"system","content":system},{"role":"user","content":user}],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    # message.content can be None (e.g., on a refusal), so guard before stripping
    return (resp.choices[0].message.content or "").strip()



## Load & Chunk


In [None]:

docs = load_docs(DOCS_DIR)
docs_names = [d[0] for d in docs]
print(f"Loaded {len(docs)} document(s).")
all_chunks, chunk_meta = [], []
for path, text in docs:
    chunks = chunk_text(text, max_tokens=900, overlap=150)
    for idx, ch in enumerate(chunks):
        all_chunks.append(ch)
        chunk_meta.append({"doc": path, "chunk_id": f"{path}#chunk{idx}"})
print(f"Total chunks: {len(all_chunks)}")



## Map -> Combine -> Reduce: Extract candidate ideas
We prompt the LLM to extract 1-3 conceptual ideas per chunk. Then we lightly merge near-duplicates.
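
The merge step is a greedy, first-wins filter: an idea is kept only if its title is not too similar to any title already kept. The sketch below shows that behavior on made-up ideas. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for a fuzzy-matching library, and the helper names `similarity` and `dedup_by_title` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 100] (stdlib stand-in for a fuzzy ratio)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dedup_by_title(ideas: List[Dict[str, str]], thresh: float = 85.0) -> List[Dict[str, str]]:
    """Keep the first idea of each near-duplicate group, preserving input order."""
    kept: List[Dict[str, str]] = []
    for idea in ideas:
        if all(similarity(idea["title"], k["title"]) < thresh for k in kept):
            kept.append(idea)
    return kept

# Illustrative data only
ideas = [
    {"title": "Bias-variance trade-off", "summary": "Error splits into bias and variance."},
    {"title": "Bias-Variance Trade-off", "summary": "Lower bias tends to raise variance."},
    {"title": "Gradient descent convergence", "summary": "Step size controls convergence."},
]
print([i["title"] for i in dedup_by_title(ideas)])
# ['Bias-variance trade-off', 'Gradient descent convergence']
```

Because the filter is order-dependent, the summary that survives is whichever variant appeared first; a stricter merge could instead combine the summaries of matched ideas.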


In [None]:

IDEA_SYSTEM = "You are extracting conceptual ideas from academic/professional prose. Return a compact JSON array; each item MUST have: {\"title\": str, \"summary\": str}."
IDEA_USER_TMPL = "Extract up to 3 key conceptual ideas (not mere facts) from the following text. Focus on definitions, mechanisms, assumptions, or trade-offs likely to be tested.\nText:\n---\n{chunk}\n---\nReturn ONLY a compact JSON array."

def extract_ideas_per_chunk(chunks: List[str]) -> List[Dict[str,str]]:
    ideas = []
    for ch in tqdm(chunks, desc="Extracting ideas"):
        out = chat(IDEA_SYSTEM, IDEA_USER_TMPL.format(chunk=ch), max_tokens=400)
        # Robust parse: try json; fallback simple pattern
        try:
            arr = json.loads(out)
            for it in arr:
                title = (it.get("title") or it.get("idea") or "").strip()
                summary = (it.get("summary") or it.get("desc") or "").strip()
                if title and summary:
                    ideas.append({"title": title, "summary": summary})
        except Exception:
            for m in re.findall(r'"title"\s*:\s*"([^"]+)"\s*,\s*"summary"\s*:\s*"([^"]+)"', out):
                ideas.append({"title": m[0], "summary": m[1]})
    return ideas

# Lightweight dedup based on title similarity
def dedup_ideas(ideas: List[Dict[str,str]], thresh=85) -> List[Dict[str,str]]:
    kept = []
    for idea in ideas:
        if not any(fuzz.token_set_ratio(idea["title"], k["title"]) >= thresh for k in kept):
            kept.append(idea)
    # assign IDs
    for i, it in enumerate(kept):
        it["id"] = f"idea_{i+1:04d}"
    return kept

ideas_raw = extract_ideas_per_chunk(all_chunks)
ideas = dedup_ideas(ideas_raw, thresh=88)
print(f"Ideas raw: {len(ideas_raw)}  ->  deduped: {len(ideas)}")



## Retrieval: Top-k supporting context per idea
We index chunks with FAISS and fetch the top-k most relevant passages to ground each question.
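
Inner-product search over L2-normalized vectors ranks by cosine similarity. The sketch below reproduces that ranking in plain Python on toy 2-D vectors, without FAISS or an embedding API; the vectors and the helper names `l2_normalize` and `top_k` are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple

def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def top_k(query: Sequence[float], corpus: Sequence[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Return up to k (index, cosine similarity) pairs, best match first."""
    q = l2_normalize(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, l2_normalize(vec))))
        for i, vec in enumerate(corpus)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[: min(k, len(scored))]  # never return more hits than vectors exist

corpus = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = top_k([2.0, 1.0], corpus, k=2)
print([i for i, _ in hits])  # [2, 0]
```

Vector length does not affect the ranking: `[3.0, 3.0]` wins because its direction is closest to the query, not because it is the longest vector.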


In [None]:

index, _embs = build_faiss_index(all_chunks)

def retrieve_context_for_idea(idea_text: str, k=3) -> List[Dict[str,Any]]:
    qvec = embed_texts([idea_text])
    faiss.normalize_L2(qvec)
    scores, idxs = index.search(qvec, k)
    items = []
    for rank, (j, sc) in enumerate(zip(idxs[0], scores[0]), 1):
        if j < 0:
            # FAISS pads results with -1 when fewer than k vectors are indexed;
            # without this check, all_chunks[-1] would silently return the last chunk
            continue
        items.append({
            "rank": rank,
            "score": float(sc),
            "chunk": all_chunks[int(j)],
            "meta": chunk_meta[int(j)]
        })
    return items



## Question Generation
One MCQ per idea with grounded context. We also shuffle choices to avoid positional bias.
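
LLM output does not always follow the requested schema, so it can help to check a payload before shuffling. The following is a minimal sketch of such a check for the `question` / `choices` / `correct_label` shape requested in the prompt; the helper name `mcq_problems` and the sample question are hypothetical.

```python
from typing import Any, Dict, List

def mcq_problems(payload: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the payload looks usable."""
    problems: List[str] = []
    question = payload.get("question")
    if not isinstance(question, str) or not question.strip():
        problems.append("question must be a non-empty string")
    choices = payload.get("choices")
    if (not isinstance(choices, list) or len(choices) != 4
            or not all(isinstance(c, dict) for c in choices)):
        problems.append("choices must be a list of 4 objects")
        return problems  # the remaining checks need well-formed choices
    labels = [str(c.get("label")) for c in choices]
    if sorted(labels) != ["A", "B", "C", "D"]:
        problems.append("labels must be A, B, C, D exactly once each")
    texts = [str(c.get("text", "")).strip().lower() for c in choices]
    if "" in texts or len(set(texts)) != 4:
        problems.append("choice texts must be non-empty and distinct")
    if payload.get("correct_label") not in labels:
        problems.append("correct_label must match one of the choice labels")
    return problems

good = {
    "question": "Why do overlapping chunks help retrieval?",
    "choices": [
        {"label": "A", "text": "They keep sentences that straddle a boundary intact"},
        {"label": "B", "text": "They reduce the number of embeddings"},
        {"label": "C", "text": "They remove the need for an index"},
        {"label": "D", "text": "They guarantee shorter prompts"},
    ],
    "correct_label": "A",
}
bad = dict(good, correct_label="E")
print(mcq_problems(good))  # []
print(mcq_problems(bad))   # ['correct_label must match one of the choice labels']
```

The distinct-text check matters here because the shuffle identifies the correct answer by its text after reordering; two choices with identical text would make the new correct label ambiguous.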


In [None]:

QG_SYSTEM = "You write exam-quality multiple-choice questions that test conceptual understanding.\n- 1 question only, grounded in the given context.\n- 4 options (A-D), with exactly one correct.\n- No trivial recall (dates/numbers) unless core to the concept.\n- Avoid vague or ambiguous wording.\nReturn strict JSON: {\"question\": str, \"choices\": [{\"label\":\"A\",\"text\":...},...], \"correct_label\":\"A\"}"
QG_USER_TMPL = "Idea summary: {idea}\nUse these supporting snippets (may be partial) to craft 1 conceptual question:\n{contexts}\nReturn ONLY the specified JSON."

import random
def shuffle_choices(payload: Dict[str,Any]) -> Dict[str,Any]:
    choices = payload["choices"]
    # track which is correct BEFORE shuffle
    correct = payload["correct_label"]
    correct_text = next(c["text"] for c in choices if c["label"]==correct)
    # shuffle
    labels = ["A","B","C","D"]
    random.shuffle(choices)
    # reassign labels
    out_choices = []
    new_correct = None
    for lab, ch in zip(labels, choices):
        out_choices.append({"label": lab, "text": ch["text"]})
        if ch["text"] == correct_text:
            new_correct = lab
    payload["choices"] = out_choices
    payload["correct_label"] = new_correct
    return payload

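def generate_question_for_idea(idea: Dict[str,str], k=3) -> Dict[str,Any]:
    ctx_items = retrieve_context_for_idea(idea["summary"], k=k)
    # Separate snippets with a real newline-delimited rule (not a literal backslash-n)
    ctx_str = "\n---\n".join([it["chunk"][:1200] for it in ctx_items])
    out = chat(QG_SYSTEM, QG_USER_TMPL.format(idea=idea["summary"], contexts=ctx_str), max_tokens=700)
    # Models sometimes wrap the JSON in a code fence or extra prose; keep only the outermost {...}
    start, end = out.find("{"), out.rfind("}")
    data = json.loads(out[start:end + 1] if 0 <= start < end else out)
    data = shuffle_choices(data)
    data["id"] = idea["id"]
    data["idea_summary"] = idea["summary"]
    data["source_citations"] = [f"{it['meta']['doc']}|{it['meta']['chunk_id']}" for it in ctx_items]
    return data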



## Quality Control (Bonus)
We score each question on four criteria (clarity, groundedness, non-triviality, distractor quality), each from 0.0 to 1.0, and average them into an overall score.  
Questions whose overall score falls below the threshold (default 0.7) are filtered out.
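
As a worked example of the averaging and threshold, here is a small sketch using made-up rubric scores. The names `CRITERIA`, `overall_score`, and `passes` are hypothetical; a criterion missing from the grader's output is counted as 0, which is one possible convention rather than the only reasonable one.

```python
from statistics import mean
from typing import Dict

CRITERIA = ("clarity", "groundedness", "non_triviality", "distractor_quality")

def overall_score(scores: Dict[str, float]) -> float:
    """Unweighted mean of the four rubric criteria; a missing criterion counts as 0."""
    return round(mean(float(scores.get(c, 0.0)) for c in CRITERIA), 3)

def passes(scores: Dict[str, float], threshold: float = 0.70) -> bool:
    """True when the overall score meets the quality threshold."""
    return overall_score(scores) >= threshold

# Illustrative scores only
strong = {"clarity": 0.9, "groundedness": 0.8, "non_triviality": 0.6, "distractor_quality": 0.7}
partial = {"clarity": 1.0, "groundedness": 1.0}

print(overall_score(strong), passes(strong))    # 0.75 True
print(overall_score(partial), passes(partial))  # 0.5 False
```

An unweighted mean lets a strong criterion compensate for a weak one; if groundedness should be non-negotiable, a per-criterion minimum could be applied in addition to the average.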


In [None]:

QC_SYSTEM = "You are grading a multiple-choice question with rubric 0.0-1.0.\nCriteria:\n- clarity: clear, unambiguous stem\n- groundedness: answer supported by provided context\n- non_triviality: requires understanding (not copy-paste recall)\n- distractor_quality: plausible but clearly incorrect\nReturn JSON: {\"clarity\":x,\"groundedness\":x,\"non_triviality\":x,\"distractor_quality\":x,\"notes\":str}"
QC_USER_TMPL = "Question:\n{q}\nChoices: {choices}\nCorrect: {correct}\nContext (evidence):\n{ctx}\n"

def score_question(item: Dict[str,Any]) -> Dict[str,Any]:
    # Grade against the cited chunk text; citation IDs alone give the grader no evidence for groundedness
    text_by_citation = {f"{m['doc']}|{m['chunk_id']}": txt for m, txt in zip(chunk_meta, all_chunks)}
    ctx = "\n---\n".join(text_by_citation.get(c, "")[:1200] for c in item["source_citations"])
    q = item["question"]
    ch = "; ".join([f"{c['label']}) {c['text']}" for c in item["choices"]])
    out = chat(QC_SYSTEM, QC_USER_TMPL.format(q=q, choices=ch, correct=item["correct_label"], ctx=ctx), max_tokens=400)
    try:
        scores = json.loads(out)
    except Exception:
        # Neutral fallback when the grader's reply is not valid JSON
        scores = {"clarity":0.6,"groundedness":0.6,"non_triviality":0.6,"distractor_quality":0.6,"notes":"parse-fallback"}
    overall = float(np.mean([scores.get("clarity",0), scores.get("groundedness",0),
                             scores.get("non_triviality",0), scores.get("distractor_quality",0)]))
    item["quality"] = {"overall": round(overall,3), **{k: round(float(scores.get(k,0)),3) for k in ["clarity","groundedness","non_triviality","distractor_quality"]}, "notes": scores.get("notes","")}
    return item



## Difficulty Tagging (Bonus)
We map Bloom levels -> easy/medium/hard.


In [None]:

DIFF_SYSTEM = "You are a psychometrics expert. Classify the question's Bloom level (Remember, Understand, Apply, Analyze, Evaluate, Create).\nReturn JSON: {\"bloom\": \"Understand\"}"

def add_difficulty(item: Dict[str,Any]) -> Dict[str,Any]:
    text = item["question"] + " Choices: " + "; ".join([c["text"] for c in item["choices"]])
    out = chat(DIFF_SYSTEM, text, max_tokens=100)
    try:
        bloom = json.loads(out).get("bloom","Understand")
    except Exception:
        bloom = "Understand"
    mapping = {"Remember":"easy","Understand":"easy","Apply":"medium","Analyze":"medium","Evaluate":"hard","Create":"hard"}
    item["difficulty"] = mapping.get(bloom, "medium")
    return item



## Generate Questions
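Adjust `MAX_QUESTIONS` as needed. Idea extraction and chunk embedding run once per chunk, and each question prompt carries only the top-k retrieved chunks, so per-question cost stays roughly constant as documents grow rather than scaling with the full document length as naive whole-document prompting does.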


In [None]:

MAX_QUESTIONS = int(os.environ.get("MAX_QUESTIONS", "20"))
K_CONTEXT = int(os.environ.get("K_CONTEXT", "3"))
QUALITY_THRESHOLD = float(os.environ.get("QUALITY_THRESHOLD", "0.70"))

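results = []
for idea in tqdm(ideas[:MAX_QUESTIONS], desc="Generating Qs"):
    try:
        item = generate_question_for_idea(idea, k=K_CONTEXT)
    except (json.JSONDecodeError, KeyError, StopIteration) as e:
        # Skip ideas whose LLM output is not a valid MCQ payload instead of aborting the run
        print(f"Skipping {idea['id']}: {type(e).__name__}")
        continue
    item = score_question(item)               # bonus
    item = add_difficulty(item)               # bonus
    results.append(item)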

# Filter by quality
filtered = [r for r in results if r.get("quality",{}).get("overall",0) >= QUALITY_THRESHOLD]
print(f"Generated: {len(results)}  /  Kept after quality filter (>={QUALITY_THRESHOLD}): {len(filtered)}")



## Save JSON


In [None]:

out = {
    "run_metadata": {
        "created_at": datetime.datetime.now().isoformat(),
        "docs": docs_names,
        "model": OPENAI_MODEL,
        "embedding_model": OPENAI_EMBEDDING,
        "k_context": K_CONTEXT,
        "cost_estimate_usd": None  # optionally compute using tiktoken counts * $/1K
    },
    "questions": filtered
}
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)
with open(OUTPUT_DIR / "questions.json", "w", encoding="utf-8") as f:
    json.dump(out, f, ensure_ascii=False, indent=2)
print("Saved ->", OUTPUT_DIR / "questions.json")
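
The `cost_estimate_usd` field above is left as `None`. One way to fill it, sketched below, is to multiply token counts by per-1K-token rates. The function name `estimate_cost_usd` is hypothetical, and the rates in the example are placeholders chosen for easy arithmetic, not real vendor prices; substitute the current rates for your models.

```python
def estimate_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    usd_per_1k_prompt: float,
    usd_per_1k_completion: float,
) -> float:
    """Estimate spend from token counts and per-1K-token rates."""
    cost = (
        prompt_tokens / 1000 * usd_per_1k_prompt
        + completion_tokens / 1000 * usd_per_1k_completion
    )
    return round(cost, 6)

# Placeholder rates, NOT real prices: 120 * 0.001 + 8 * 0.002 = 0.136
print(estimate_cost_usd(120_000, 8_000, 0.001, 0.002))  # 0.136
```

Token counts could come from the tokenizer used for chunking or from usage figures reported with each API response; either way, the result is an estimate rather than a billing figure.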



## Appendix: 30-second Architecture
```
docs/ (PDF, txt)
   |
   |-- parse & chunk (token-aware windows)
   |         |
   |         `-- map->combine->reduce: extract conceptual ideas (LLM)
   |                      |
   |-- embed chunks ------+--> FAISS index
   |                      |
   `-- per-idea retrieve top-k context
                          |
                      question generation (LLM)
                          |
               quality scoring & difficulty (LLM)
                          |
                   output/questions.json
```
