# Submission by: Shubham Singh 
# Dated: 09/19/2025
# MVP: High‑Quality MCQ Generator (Python + Gemini API)

This notebook builds a **minimum viable product (MVP)** that generates **high‑quality multiple‑choice questions** (MCQs) from large text documents.
It is inspired by ideas similar to those in *Savaal* (questions answerable from the context, unambiguous wording, near-miss distractors) but focuses on a **practical, scalable** implementation.

**Tech stack:** Python + Gemini API (with a safe offline fallback function so this notebook runs without network access).  
**Deliverables:** MCQs in DataFrame + JSONL/CSV exports, with difficulty, Bloom tags, rationales, and source spans.

---
## Core Capabilities
- Ingest **PDF/TXT** files (easily extensible to other formats).
- **Chunk** large documents with overlap to preserve context.
- Use **Gemini** to propose MCQs per chunk (or a local fallback if no API key/internet).
- Enforce **quality checks**:
  - Answer is found in the source span.
  - Exactly one correct option; plausible distractors.
  - Non-ambiguous wording; no “all of the above”/“none of the above”.
- Annotate **difficulty** (easy/medium/hard), **Bloom’s level**, and **rationale**.
- **Scalable design**: batchable, stateless prompting; streaming-friendly interface.
- **Reproducibility**: configurable seed + JSON config for generation parameters.
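
The later cells pass the seed, chunk sizes, and questions-per-chunk as plain function arguments. A minimal sketch of how those values could be bundled into one JSON-serialisable config follows. `GenConfig` and its field names are hypothetical and not defined anywhere in this notebook; the defaults are copied from the values used in later cells.

```python
import json
import random
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GenConfig:
    """Hypothetical bundle of generation parameters (not part of the notebook)."""
    seed: int = 11
    target_tokens: int = 250
    overlap_tokens: int = 50
    n_per_chunk: int = 2
    temperature: float = 0.7

    def to_json(self) -> str:
        # sort_keys gives a stable string that can be stored next to the exports.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "GenConfig":
        return cls(**json.loads(payload))


cfg = GenConfig()
restored = GenConfig.from_json(cfg.to_json())

# A private Random instance seeded from the config keeps option shuffling
# repeatable without touching the global `random` state.
rng_a, rng_b = random.Random(restored.seed), random.Random(restored.seed)
opts_a, opts_b = list("ABCD"), list("ABCD")
rng_a.shuffle(opts_a)
rng_b.shuffle(opts_b)
```

Saving the JSON string alongside the CSV/JSONL exports would record which parameters produced a given question set.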

## Bonus Feature
- **Item analysis**: quick stats (option distribution, difficulty mix, span coverage) and simple flags for potential improvements.
- **Exports**: JSONL/CSV and a compact HTML review report.



## 1) Setup

- If you have a Gemini API key, set it in an environment variable before running:
  ```bash
  export GOOGLE_API_KEY="enter_your_APIkey_here"
  ```
- If the key is **missing**, the notebook will **fall back** to a heuristic generator so you can still test the full pipeline.


In [8]:
import os, re, json, random, math, uuid, html 
from dataclasses import dataclass, asdict # defines simple classes for structured data and convert them to dictionaries
from typing import List, Dict, Any, Optional, Tuple # Provides type hints for function signatures and variables
from collections import Counter # A specialized dictionary for counting hashable objects
from datetime import datetime # Supplies classes to work with dates and times
from pathlib import Path # Offers an object-oriented way to handle filesystem paths
_GLOBAL_Q_SEEN = set() # normalized question stems already emitted; used to drop duplicate questions across chunks


import pandas as pd # for data analysis and manipulation

# Try basic PDF text extraction; if unavailable, degrade gracefully.
try:
    import PyPDF2
    HAS_PYPDF2 = True
except Exception:
    HAS_PYPDF2 = False

# Optional: Gemini (google-generativeai). If the package or the API key is missing, we fall back.
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY", "").strip()
HAS_GEMINI = False
try:
    import google.generativeai as genai
    if GOOGLE_API_KEY:
        genai.configure(api_key=GOOGLE_API_KEY)
        HAS_GEMINI = True
except Exception:
    HAS_GEMINI = False

print("PyPDF2 available:", HAS_PYPDF2)
print("Gemini configured:", HAS_GEMINI)


PyPDF2 available: True
Gemini configured: False



## 2) Load Assignment (for context)

We attempt to read the provided assignment PDF and print its first 800 characters to check what the extracted text looks like.


In [9]:
DOCUMENT_PATHS = [
    "/Users/jisusingh/Downloads/Assignment_Encando/Documents/notes.pdf"
]

def extract_pdf_text(path: str) -> str:
    """Best-effort PDF text extraction; returns "" when PyPDF2 is unavailable."""
    if not HAS_PYPDF2:
        return ""
    text = []
    with open(path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        for page in reader.pages:
            try:
                page_text = page.extract_text() or ""
            except Exception:
                # Skip pages that cannot be decoded instead of aborting the whole document.
                page_text = ""
            text.append(page_text)
    return "\n".join(text)

assignment_texts = {}
for p in DOCUMENT_PATHS:
    if Path(p).exists():
        assignment_texts[p] = extract_pdf_text(p)

print("Found assignment files:", list(assignment_texts.keys()))
for k,v in assignment_texts.items():
    print(f"--- {k} (first 800 chars) ---\n{v[:800]}\n...\n")


Found assignment files: ['/Users/jisusingh/Downloads/Assignment_Encando/Documents/notes.pdf']
--- /Users/jisusingh/Downloads/Assignment_Encando/Documents/notes.pdf (first 800 chars) ---
Contents
1 Mathematical Preliminaries 2
1.1 Trigonometric Identities ............................. 3
1.2 Magnitude and Angle Representation ...................... 3
1.3 Complex Numbers ................................ 4
1.3.1 History - ( Veritasium’s video ,KRN’s video ).............. 4
1.3.2 Cartesian Form - ( Video ,Python notebook ).............. 4
1.3.3 Magnitude and Phase ( Video )...................... 4
1.3.4 Euler’s Formula - ( Video )........................ 6
1.3.5 Polar form or Exponential form ( Video )................ 7
1.3.6 Conjugate - ( Video )........................... 9
1.3.7 Arithmetic with two complex numbers - ( Video )........... 9
1.3.8 Geometric interpretation of arithmetic operations - ( Video ,Python
notebook )................................. 10
1.3.9 More prope
...




## 3) Document Ingestion

Use `load_document()` to read a `.pdf` or `.txt`. For PDFs, text extraction is best‑effort. 


In [10]:

# Use the first configured document that exists on disk; None triggers the built-in sample text.
DEFAULT_DOC = next((p for p in DOCUMENT_PATHS if Path(p).exists()), None)

def load_document(path: Optional[str]) -> str:
    if path is None:
        # Fallback sample content
        return (
    "A Linear Time-Invariant (LTI) system is one where the principles of linearity and time invariance hold. "
    "For a continuous-time (CT) LTI system, the output y(t) is given by the convolution of the input x(t) with the "
    "impulse response h(t): y(t) = x(t) * h(t). Similarly, for a discrete-time (DT) LTI system, the output is "
    "y[n] = x[n] * h[n]. The Fourier Series represents periodic signals as sums of harmonically related complex "
    "exponentials, while the Fourier Transform extends this to aperiodic signals. A bandlimited signal has zero "
    "frequency content beyond a certain cutoff, and to avoid aliasing, it must be sampled at least at the Nyquist rate "
    "(twice its bandwidth). These concepts form the foundation for continuous-time and discrete-time signal analysis, "
    "including Fourier Series, Fourier Transform, and sampling theory."
)

    path = str(path)
    if path.lower().endswith(".txt"):
        return Path(path).read_text(encoding="utf-8", errors="ignore")
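    elif path.lower().endswith(".pdf"):
        # extract_pdf_text returns "" when PyPDF2 is unavailable; reading raw PDF
        # bytes as text in the branch below would only produce noise.
        return extract_pdf_text(path)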
    else:
        # Unknown extension; try reading as text
        try:
            return Path(path).read_text(encoding="utf-8", errors="ignore")
        except Exception:
            return ""

doc_text = load_document(DEFAULT_DOC)
print("Loaded document chars:", len(doc_text))
print(doc_text[:800])


Loaded document chars: 279982
Contents
1 Mathematical Preliminaries 2
1.1 Trigonometric Identities ............................. 3
1.2 Magnitude and Angle Representation ...................... 3
1.3 Complex Numbers ................................ 4
1.3.1 History - ( Veritasium’s video ,KRN’s video ).............. 4
1.3.2 Cartesian Form - ( Video ,Python notebook ).............. 4
1.3.3 Magnitude and Phase ( Video )...................... 4
1.3.4 Euler’s Formula - ( Video )........................ 6
1.3.5 Polar form or Exponential form ( Video )................ 7
1.3.6 Conjugate - ( Video )........................... 9
1.3.7 Arithmetic with two complex numbers - ( Video )........... 9
1.3.8 Geometric interpretation of arithmetic operations - ( Video ,Python
notebook )................................. 10
1.3.9 More prope



## 4) Chunking Strategy

To scale to large documents, we split text into **overlapping chunks**.  
Parameters:
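- `target_tokens`: chunk size in tokens, where a token is a word or a single punctuation character (a rough proxy for words)
- `overlap_tokens`: number of tokens from the end of one chunk repeated at the start of the next, to keep context continuity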


In [11]:

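def chunk_text(text: str, target_tokens: int = 300, overlap_tokens: int = 60) -> List[Dict[str, Any]]:
    if not 0 <= overlap_tokens < target_tokens:
        # Otherwise `start` would never advance and the loop below would not terminate.
        raise ValueError("overlap_tokens must satisfy 0 <= overlap_tokens < target_tokens")
    tokens = re.findall(r"\w+|\S", text)  # words, or single non-space symbols
    if not tokens:
        return []
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(len(tokens), start + target_tokens)
        chunk_tokens = tokens[start:end]
        # Re-join tokens: a space before each word token, none before punctuation.
        chunk_str = "".join([" " + t if re.match(r"\w+", t) else t for t in chunk_tokens]).strip()
        chunks.append({
            "id": str(uuid.uuid4()),
            "start_idx": start,
            "end_idx": end,
            "text": chunk_str
        })
        if end == len(tokens):
            break
        start = max(0, end - overlap_tokens)
    return chunks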

chunks = chunk_text(doc_text, target_tokens=250, overlap_tokens=50)
print(f"Made {len(chunks)} chunks.")
print(chunks[0]['text'][:400] + "\n...")


Made 462 chunks.
Contents 1 Mathematical Preliminaries 2 1. 1 Trigonometric Identities............................. 3 1. 2 Magnitude and Angle Representation...................... 3 1. 3 Complex Numbers................................ 4 1. 3. 1 History-( Veritasium’ s video, KRN’ s video).............. 4 1. 3. 2 Cartesian Form-( Video, Python notebook).............. 4 1. 3. 3 Magnitude and Phase( Video)...........
...
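
To make the overlap arithmetic concrete, here is a small sketch that computes only the `(start, end)` token indices of each window. `window_bounds` is a hypothetical helper written for this illustration; it mirrors the stepping logic of `chunk_text` but does not tokenize or re-join any text.

```python
from typing import List, Tuple


def window_bounds(n_tokens: int, target: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) token indices for overlapping windows.

    Hypothetical helper mirroring the stepping logic of `chunk_text`.
    """
    if not 0 <= overlap < target:
        raise ValueError("overlap must satisfy 0 <= overlap < target")
    bounds = []
    start = 0
    while start < n_tokens:
        end = min(n_tokens, start + target)
        bounds.append((start, end))
        if end == n_tokens:
            break
        start = end - overlap  # step back so neighbouring windows share tokens
    return bounds


toy = window_bounds(n_tokens=10, target=4, overlap=1)
print(toy)  # [(0, 4), (3, 7), (6, 10)]
```

Each window after the first advances by `target - overlap` tokens, so a document of `N` tokens (with `N > overlap`) yields `ceil((N - overlap) / (target - overlap))` windows.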



## 5) LLM Client (Gemini or Offline Fallback)

- If `GOOGLE_API_KEY` is set and `google-generativeai` is installed, we use **Gemini 1.5 Flash** (configurable).
- Otherwise, we use a **heuristic fallback** that creates simple fact-based MCQs from the chunk text.


In [12]:
@dataclass
class MCQ:
    question: str
    options: List[str]
    correct_index: int
    rationale: str
    difficulty: str
    bloom: str
    source_span: str
    chunk_id: str

def _fallback_generate_mcqs_from_chunk(text: str, n: int, chunk_id: str) -> List[MCQ]:
    """
    Signals & Systems fallback generator with strong anti-duplication:
    - global seen-question cache across chunks
    - trigram Jaccard + difflib similarity checks
    - per-topic caps + template rotation
    - varied stems, difficulty, and Bloom's
    """
    import re, random, difflib
    from collections import defaultdict

    # helpers
    def _norm(s: str) -> str:
        s = re.sub(r"\s+", " ", s.strip().lower())
        return re.sub(r"[^\w\s]", "", s)

    def _ngrams(s: str, n: int = 3):
        toks = _norm(s).split()
        if len(toks) < n: 
            return {tuple(toks)} if toks else set()
        return {tuple(toks[i:i+n]) for i in range(len(toks)-n+1)}

    def _jaccard_sim(a: str, b: str, n: int = 3) -> float:
        A, B = _ngrams(a, n), _ngrams(b, n)
        if not A or not B:
            return 0.0
        return len(A & B) / max(1, len(A | B))

    def _too_similar(q: str, chosen_qs: list, jac_thresh: float = 0.5, diff_thresh: float = 0.86) -> bool:
        # reject if either similarity metric is high
        for cq in chosen_qs:
            if _jaccard_sim(q, cq) > jac_thresh:
                return True
            if difflib.SequenceMatcher(a=_norm(q), b=_norm(cq)).ratio() > diff_thresh:
                return True
        return False

    def _any_seen(q: str) -> bool:
        key = _norm(q)
        return key in _GLOBAL_Q_SEEN

    def _mark_seen(q: str):
        _GLOBAL_Q_SEEN.add(_norm(q))

    def has(patterns, s):
        return any(re.search(p, s, re.IGNORECASE) for p in patterns)

    # topic templates
    T = {
        "LTI_CT": [
            "For a continuous-time LTI system, the output y(t) equals:",
            "In CT LTI analysis, which relation correctly gives y(t)?",
            "Which statement expresses the CT LTI input–output relation?"
        ],
        "LTI_DT": [
            "For a discrete-time LTI system, the output y[n] equals:",
            "In DT LTI analysis, which relation correctly gives y[n]?",
            "Which statement expresses the DT LTI input–output relation?"
        ],
        "CT_DT_SIG": [
            "Which notation correctly distinguishes CT and DT signals?",
            "Identify the correct mapping between x(t) and x[n].",
            "Which option correctly pairs signal types with notation?"
        ],
        "CTFS": [
            "In the CT Fourier Series, spectral lines occur at:",
            "For a CT periodic signal, where do CTFS components lie?",
            "Which statement is true about CTFS line locations?"
        ],
        "DTFS": [
            "For an N-periodic sequence, the DTFS indices X[k] are:",
            "Which statement about DTFS indexing is correct?",
            "How are DTFS coefficient indices defined for period N?"
        ],
        "CTFT": [
            "Which property holds for the CTFT of x(t−t0)?",
            "Which time-shift property applies to X(jω)?",
            "Under a CT time shift, how does the CTFT change?"
        ],
        "DTFT": [
            "Which fundamental property holds for X(e^{jω})?",
            "Which periodicity property applies to the DTFT?",
            "What is true about the frequency periodicity of the DTFT?"
        ],
        "BW": [
            "A bandlimited signal is one whose spectrum:",
            "Which description correctly characterizes a bandlimited signal?",
            "What condition on X(ω) defines bandlimitedness?"
        ],
        "PERIODIC_FT": [
            "The Fourier transform of a CT periodic signal is:",
            "Which spectrum type arises for CT periodic signals?",
            "How does the FT of a CT periodic signal appear?"
        ],
        "SAMPLING": [
            "To avoid aliasing for bandwidth B, the sampling frequency must be:",
            "Which sampling rate condition prevents aliasing?",
            "What is the Nyquist-rate requirement for bandwidth B?"
        ],
        "DT_PROC_CT": [
            "Before A/D conversion, which filter suppresses out-of-band content?",
            "Which block prevents aliasing prior to sampling?",
            "What pre-sampling filtering is required to limit aliasing?"
        ],
        "GENERIC": [
            "Which statement best aligns with the definitions and properties presented?",
            "Select the option consistent with the identities discussed.",
            "Which option follows from the stated properties without external facts?"
        ],
    }

    KEY = {
        "LTI_CT": (
            "The convolution of x(t) with h(t): y(t) = x(t) * h(t)",
            ["The pointwise product y(t)=x(t)·h(t)", "The correlation y(t)=x(t)∘h(t)", "The sum y(t)=x(t)+h(t)"]
        ),
        "LTI_DT": (
            "The convolution of x[n] with h[n]: y[n] = x[n] * h[n]",
            ["The pointwise product y[n]=x[n]·h[n]", "The correlation y[n]=x[n]∘h[n]", "The sum y[n]=x[n]+h[n]"]
        ),
        "CT_DT_SIG": (
            "x(t) is continuous-time; x[n] is discrete-time.",
            ["x[n] is continuous-time; x(t) is discrete-time.", "Both x(t) and x[n] are CT.", "Both x(t) and x[n] are DT."]
        ),
        "CTFS": (
            "Integer multiples of the fundamental frequency ω0.",
            ["Arbitrary non-integer multiples of ω0", "Only DC and Nyquist", "Every rational multiple including fractions"]
        ),
        "DTFS": (
            "k = 0,…,N−1 (periodic modulo N).",
            ["All integers with no periodicity", "Only k=0 and k=N/2", "k = 1,…,N with no wrap-around"]
        ),
        "CTFT": (
            "A time shift x(t−t0) introduces a phase factor e^{−jωt0} in X(jω).",
            ["A time shift scales |X(jω)| by |t0|", "A time shift leaves X(jω) unchanged", "A time shift shifts X(jω) along ω by t0"]
        ),
        "DTFT": (
            "X(e^{jω}) is 2π-periodic in ω.",
            ["X(e^{jω}) is aperiodic in ω", "X(e^{jω}) is π-periodic", "Periodic only for finite-length signals"]
        ),
        "BW": (
            "Is zero for |ω| greater than some finite cutoff.",
            ["Is nonzero at arbitrarily large |ω|", "Is nonzero only at DC", "Is constant for all ω"]
        ),
        "PERIODIC_FT": (
            "A line spectrum (Dirac impulses) at harmonics of ω0.",
            ["A continuous flat spectrum", "A single impulse only at ω=0", "Nonzero only between −ω0 and +ω0"]
        ),
        "SAMPLING": (
            "At least 2B (the Nyquist rate).",
            ["At most B", "Exactly B/2", "Any positive value suffices"]
        ),
        "DT_PROC_CT": (
            "An anti-aliasing low-pass filter.",
            ["A high-pass pre-emphasis filter", "A notch filter at DC", "A comb filter at the sampling rate"]
        ),
        "GENERIC": (
            "The option consistent with the definitions/identities described.",
            ["Requires external facts not given", "Contradicts the provided properties", "Numerically plausible but unsupported"]
        ),
    }

    topic_idx = defaultdict(int)
    def next_template(topic: str) -> str:
        idx = topic_idx[topic] % len(T[topic])
        topic_idx[topic] += 1
        return T[topic][idx]

    def detect_topics(s: str):
        tests = [
            ("LTI_CT", [r"\bLTI\b|linear time invariant", r"\bconvolution\b|h\(t\)", r"\bcontinuous[-\s]?time\b|\bCT\b|x\(t\)"]),
            ("LTI_DT", [r"\bLTI\b|linear time invariant", r"\bconvolution\b|h\[n\]", r"\bdiscrete[-\s]?time\b|\bDT\b|x\[\s*n\s*\]"]),
            ("CTFS", [r"\bCTFS\b|\bfourier series\b|\ba_k\b"]),
            ("DTFS", [r"\bDTFS\b|X\[\s*k\s*\]"]),
            ("CTFT", [r"\bCTFT\b|X\(j\w\)|\bcontinuous[-\s]?time fourier transform\b"]),
            ("DTFT", [r"\bDTFT\b|X\(e\^\{j\w\}\)|\bdiscrete[-\s]?time fourier transform\b"]),
            ("BW", [r"\bbandwidth\b|\bbandlimited\b"]),
            ("PERIODIC_FT", [r"\bperiodic\b.*\bfourier transform\b|\bline spectrum\b|\bDirac comb\b|\bimpulses?\b"]),
            ("SAMPLING", [r"\bsampling\b|\bNyquist\b|\balias"]),
            ("DT_PROC_CT", [r"\bA/D\b|\bADC\b|\bD/A\b|\bDAC\b|anti-?alias|reconstruction|zero[-\s]?order hold|\bZOH\b"]),
            ("CT_DT_SIG", [r"\bcontinuous[-\s]?time\b|\bCT\b|x\(t\)|\bdiscrete[-\s]?time\b|\bDT\b|x\[\s*n\s*\]"]),
        ]
        found = []
        for topic, pats in tests:
            if any(re.search(p, s, re.IGNORECASE) for p in pats):
                found.append(topic)
        if not found:
            found = ["GENERIC"]
        return found

    DIFF = ["Easy", "Medium", "Hard"]
    BLOOM = ["Understand", "Apply", "Analyze", "Evaluate"]

    # -------- prioritize relevant sentences --------
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = [s for s in sentences if s.strip()]
    def relevance(s):
        pats = [r"\bLTI\b", r"\bCTFS\b|\bDTFS\b", r"\bCTFT\b|\bDTFT\b", r"\bbandwidth\b|bandlimited", r"\bNyquist\b|sampling", r"Dirac|impulses?"]
        return sum(bool(re.search(p, s, re.IGNORECASE)) for p in pats)
    sentences = sorted(sentences, key=relevance, reverse=True)

    chosen: List[MCQ] = []
    chosen_qs: list[str] = []
    per_topic_count = defaultdict(int)
    per_topic_cap = max(1, (n + 1) // 2)
    prefix_seen = set()

    # Try to fill n items with strict uniqueness
    attempts = 0
    for s in sentences:
        if len(chosen) >= n:
            break
        topics = detect_topics(s)
        got_one_here = False

        for topic in topics:
            if per_topic_count[topic] >= per_topic_cap:
                continue

            tried_templates = 0
            while tried_templates < len(T[topic]):
                q = next_template(topic)
                correct, distractors = KEY[topic]

                # dedup against this chunk + global runs
                prefix = " ".join(_norm(q).split()[:6])
                if prefix in prefix_seen:
                    tried_templates += 1
                    continue
                if _too_similar(q, chosen_qs, jac_thresh=0.48, diff_thresh=0.85):
                    tried_templates += 1
                    continue
                if _any_seen(q):
                    tried_templates += 1
                    continue

                opts = [correct] + distractors
                random.shuffle(opts)
                item = MCQ(
                    question=q,
                    options=opts,
                    correct_index=opts.index(correct),
                    rationale=f"The sentence '{s.strip()}' provides the relevant context for this property.",
                    difficulty=DIFF[len(chosen) % len(DIFF)],
                    bloom=BLOOM[len(chosen) % len(BLOOM)],
                    source_span=s.strip(),
                    chunk_id=chunk_id
                )

                chosen.append(item)
                chosen_qs.append(q)
                per_topic_count[topic] += 1
                prefix_seen.add(prefix)
                _mark_seen(q)
                got_one_here = True
                break  # stop rotating templates for this topic

            if got_one_here or len(chosen) >= n:
                break

        attempts += 1
        if attempts > 5 * n:
            break  # safety

    # If still short, use GENERIC templates with uniqueness checks
    guard = 0
    while len(chosen) < n and guard < 50:
        guard += 1
        q = T["GENERIC"][topic_idx["GENERIC"] % len(T["GENERIC"])]
        topic_idx["GENERIC"] += 1
        prefix = " ".join(_norm(q).split()[:6])
        if prefix in prefix_seen:
            continue
        if _too_similar(q, chosen_qs, jac_thresh=0.45, diff_thresh=0.84):
            continue
        if _any_seen(q):
            continue

        correct, distractors = KEY["GENERIC"]
        opts = [correct] + distractors
        random.shuffle(opts)
        item = MCQ(
            question=q,
            options=opts,
            correct_index=opts.index(correct),
            rationale="Consistent with the properties stated in the passage.",
            difficulty=DIFF[len(chosen) % len(DIFF)],
            bloom=BLOOM[len(chosen) % len(BLOOM)],
            source_span="",
            chunk_id=chunk_id
        )
        chosen.append(item)
        chosen_qs.append(q)
        prefix_seen.add(prefix)
        _mark_seen(q)

    return chosen[:n]


GEMINI_MODEL_NAME = "models/gemini-1.5-flash"

def _gemini_generate_mcqs_from_chunk(text: str, n: int, chunk_id: str, seed: int = 7) -> List[MCQ]:
    system_prompt = (
    "You are a careful exam-item writer. From the provided passage, generate high-quality MCQs.\n"
    "Rules:\n"
    "• Every question MUST be logically inferable from the passage (no external facts).\n"
    "• Avoid ambiguity, avoid 'all/none of the above', and avoid double negatives.\n"
    "• Do NOT start all questions with 'What is'; vary phrasing logically (e.g., 'Which statement best explains…', 'Why does…', 'In the context of…').\n"
    "• Each item must have exactly 4 options: 1 correct answer and 3 plausible distractors (near-misses).\n"
    "• Difficulty must vary across items (mix of easy, medium, and hard) and be annotated.\n"
    "• Label Bloom’s taxonomy level appropriately (e.g., Remember, Understand, Apply, Analyze, Evaluate, Create).\n"
    "• Return strict JSON as an array of items with the fields:\n"
    "{question, options[4], correct_index, rationale, difficulty, bloom, source_span}."
)
    user_prompt = f"Passage:\n'''\n{text}\n'''\nGenerate {n} MCQs."

    try:
        # The Gemini API accepts only "user" and "model" roles in `contents`,
        # so the system prompt is passed via `system_instruction` instead.
        model = genai.GenerativeModel(GEMINI_MODEL_NAME, system_instruction=system_prompt)
        resp = model.generate_content(
            user_prompt,
            generation_config={
                "temperature": 0.7, "top_p": 0.95, "top_k": 40, "max_output_tokens": 1024,
                "seed": seed,  # assumes an SDK version whose GenerationConfig accepts `seed`
                "response_mime_type": "application/json",  # request raw JSON rather than fenced text
            },
        )
        # Parse as JSON
        data = json.loads(resp.text)
        items = []
        for it in data:
            items.append(MCQ(
                question=it["question"].strip(),
                options=[o.strip() for o in it["options"]],
                correct_index=int(it["correct_index"]),
                rationale=it.get("rationale","").strip(),
                difficulty=it.get("difficulty","medium").strip(),
                bloom=it.get("bloom","Understand").strip(),
                source_span=it.get("source_span","").strip(),
                chunk_id=chunk_id
            ))
        return items
    except Exception:
        # API error, or the model did not return strict JSON: degrade to the heuristic generator.
        return _fallback_generate_mcqs_from_chunk(text, n, chunk_id)

def generate_mcqs_for_chunks(chunks: List[Dict[str,Any]], n_per_chunk: int = 2, seed: int = 7) -> List[MCQ]:
    random.seed(seed)
    all_items = []
    for ch in chunks:
        if HAS_GEMINI:
            items = _gemini_generate_mcqs_from_chunk(ch["text"], n_per_chunk, ch["id"], seed=seed)
        else:
            items = _fallback_generate_mcqs_from_chunk(ch["text"], n_per_chunk, ch["id"])
        all_items.extend(items)
    return all_items
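
Models sometimes wrap their JSON in Markdown fences or add a sentence around it, in which case a bare `json.loads` fails and the chunk is handed to the heuristic fallback. A minimal sketch of a more tolerant parser is shown below. `extract_json_array` is a hypothetical helper, not part of this notebook or of the Gemini SDK, and it assumes the payload is a single JSON array.

```python
import json
import re
from typing import Any, List

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this block readable


def extract_json_array(raw: str) -> List[Any]:
    """Parse a JSON array from model output, tolerating Markdown fences and stray prose."""
    text = raw.strip()
    # Drop a leading fence (optionally tagged, e.g. "json") and a trailing fence.
    text = re.sub(r"^" + FENCE + r"[a-zA-Z]*\s*", "", text)
    text = re.sub(r"\s*" + FENCE + r"$", "", text)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Last resort: take the outermost [...] span and try once more.
        start, end = text.find("["), text.rfind("]")
        if start == -1 or end <= start:
            raise
        data = json.loads(text[start:end + 1])
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of MCQ objects")
    return data


fenced = FENCE + 'json\n[{"question": "Q?", "options": ["a", "b", "c", "d"], "correct_index": 1}]\n' + FENCE
parsed = extract_json_array(fenced)
```

A helper like this could be tried before falling back, so that a formatting quirk alone does not discard an otherwise usable response.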



## 6) Quality Checks & Normalization

We enforce simple **guardrails**:
- Exactly **4 options** and a single **correct_index** in `[0..3]`.
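- **No banned option phrases**: options such as “all of the above” or “none of the above” are rejected. Ambiguity in general is not detected automatically.
- **Answer within source** (advisory): the leading characters of the correct option are compared with the `source_span`, but a mismatch does not reject the item, because Gemini may paraphrase.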


In [13]:

FORBIDDEN_PHRASES = {
    "all of the above", "none of the above", "all the above", "none of these", "all of these"
}

def passes_quality(item: MCQ) -> Tuple[bool, List[str]]:
    errs = []
    if len(item.options) != 4:
        errs.append("Must have exactly 4 options.")
    if not (0 <= item.correct_index < len(item.options)):
        errs.append("correct_index out of range.")
    # forbid certain phrases
    lower_opts = [o.lower() for o in item.options]
    if any(any(fp in o for fp in FORBIDDEN_PHRASES) for o in lower_opts):
        errs.append("Forbidden options like 'all/none of the above'.")
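    # answer-in-source check (weak, advisory only); guarded so that an out-of-range
    # correct_index is reported above instead of raising IndexError here
    if 0 <= item.correct_index < len(item.options):
        correct = item.options[item.correct_index]
        if item.source_span and (correct[:20].lower() not in item.source_span.lower()):
            # very lenient containment check on the leading snippet
            pass  # sometimes Gemini paraphrases; accept the item rather than reject it
    return (len(errs) == 0, errs)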

def normalize_items(items: List[MCQ]) -> List[MCQ]:
    norm = []
    for it in items:
        ok, errs = passes_quality(it)
        if ok:
            norm.append(it)
    return norm
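
The containment test above looks only at the first 20 characters of the correct option and never rejects an item. One possible stricter variant scores how many of the answer's content words also occur in the source span. The helper name `answer_support` and the length cut-off for short words are assumptions made for this sketch, not part of the notebook.

```python
import re
from typing import Set


def _content_tokens(s: str) -> Set[str]:
    """Lowercased word tokens longer than two characters (drops 'a', 'of', 'is', ...)."""
    return {t for t in re.findall(r"\w+", s.lower()) if len(t) > 2}


def answer_support(correct_option: str, source_span: str) -> float:
    """Fraction of the answer's content tokens that also occur in the source span."""
    answer = _content_tokens(correct_option)
    if not answer:
        return 0.0
    return len(answer & _content_tokens(source_span)) / len(answer)


span = "To avoid aliasing, a bandlimited signal must be sampled at least at the Nyquist rate."
supported = answer_support("At least the Nyquist rate", span)

# A table-of-contents line used as a source span offers no support for the answer.
unsupported = answer_support("At least 2B (the Nyquist rate).", "3 Sampling.................................")
```

Here `supported` is 1.0 and `unsupported` is 0.0. Flagging items below some threshold (0.5 would be one arbitrary choice) is a way to surface questions whose `source_span` is a heading or contents line rather than a supporting sentence.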



## 7) Run Generation

Use a modest `n_per_chunk` to limit output size for the MVP. You can scale this up and/or parallelize by chunk.


In [14]:

items_raw = generate_mcqs_for_chunks(chunks, n_per_chunk=2, seed=11)
items = normalize_items(items_raw)

print(f"Generated {len(items_raw)} MCQs; {len(items)} passed basic quality checks.")


Generated 36 MCQs; 36 passed basic quality checks.
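
Since each chunk is processed independently, per-chunk calls can be issued concurrently. The sketch below shows one way to do that with a thread pool while keeping results in chunk order. `generate_parallel` and `fake_generator` are hypothetical names; the stand-in generator replaces the real per-chunk LLM call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def generate_parallel(
    chunks: List[Dict[str, str]],
    generate_one: Callable[[Dict[str, str]], List[str]],
    max_workers: int = 4,
) -> List[str]:
    """Run `generate_one` over chunks concurrently and flatten results in chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # `map` yields results in input order, regardless of completion order.
        per_chunk = list(pool.map(generate_one, chunks))
    return [item for items in per_chunk for item in items]


def fake_generator(chunk: Dict[str, str]) -> List[str]:
    """Stand-in for a per-chunk LLM call; returns two placeholder question stems."""
    return [f"{chunk['id']}-q1", f"{chunk['id']}-q2"]


demo_chunks = [{"id": f"c{i}", "text": "..."} for i in range(5)]
demo_items = generate_parallel(demo_chunks, fake_generator)
```

Threads suit network-bound API calls. The heuristic fallback in this notebook mutates `_GLOBAL_Q_SEEN` and relies on the global `random` state, so running it across threads would make its output order-dependent; it is better kept sequential.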



## 8) View & Export

Assemble a DataFrame with all annotations and export to **CSV** and **JSONL**.  
Also include the **bonus**: simple item analysis statistics.


In [15]:

def mcqs_to_records(items: List[MCQ]) -> List[Dict[str,Any]]:
    recs = []
    for i, it in enumerate(items, 1):
        recs.append({
            "id": f"Q{i:04d}",
            "question": it.question,
            "option_A": it.options[0] if len(it.options)>0 else "",
            "option_B": it.options[1] if len(it.options)>1 else "",
            "option_C": it.options[2] if len(it.options)>2 else "",
            "option_D": it.options[3] if len(it.options)>3 else "",
            "correct_index": it.correct_index,
            "correct_letter": ["A","B","C","D"][it.correct_index] if 0<=it.correct_index<4 else "",
            "rationale": it.rationale,
            "difficulty": it.difficulty,
            "bloom": it.bloom,
            "source_span": it.source_span,
            "chunk_id": it.chunk_id
        })
    return recs

records = mcqs_to_records(items)
df = pd.DataFrame.from_records(records)
display(df.head(10))

out_csv = "mcqs_output.csv"
out_jsonl = "mcqs_output.jsonl"

df.to_csv(out_csv, index=False)
with open(out_jsonl, "w", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

print("Saved:", out_csv, "and", out_jsonl)


Unnamed: 0,id,question,option_A,option_B,option_C,option_D,correct_index,correct_letter,rationale,difficulty,bloom,source_span,chunk_id
0,Q0001,Which statement best aligns with the definitio...,The option consistent with the definitions/ide...,Requires external facts not given,Contradicts the provided properties,Numerically plausible but unsupported,0,A,The sentence 'Contents 1 Mathematical Prelimin...,Easy,Understand,Contents 1 Mathematical Preliminaries 2 1.,16fbc609-dfbe-4c95-95c2-b16f4e0d312b
1,Q0002,Select the option consistent with the identiti...,Requires external facts not given,The option consistent with the definitions/ide...,Contradicts the provided properties,Numerically plausible but unsupported,1,B,Consistent with the properties stated in the p...,Medium,Apply,,16fbc609-dfbe-4c95-95c2-b16f4e0d312b
2,Q0003,Which option follows from the stated propertie...,The option consistent with the definitions/ide...,Numerically plausible but unsupported,Contradicts the provided properties,Requires external facts not given,0,A,The sentence '...' provides the relevant conte...,Easy,Understand,...,52d4e988-4a3b-402d-8634-6bf0080896a3
3,Q0004,"To avoid aliasing for bandwidth B, the samplin...",Exactly B/2,Any positive value suffices,At least 2B (the Nyquist rate).,At most B,2,C,The sentence '3 Sampling.........................,Easy,Understand,3 Sampling.................................,ff13ff54-1403-429e-aec7-1b0b0c09d5cd
4,Q0005,"For a continuous-time LTI system, the output y...",The pointwise product y(t)=x(t)·h(t),The sum y(t)=x(t)+h(t),The convolution of x(t) with h(t): y(t) = x(t)...,The correlation y(t)=x(t)∘h(t),2,C,The sentence '3 Continuous- Time( CT) and Disc...,Medium,Apply,3 Continuous- Time( CT) and Discrete- Time( DT...,ff13ff54-1403-429e-aec7-1b0b0c09d5cd
5,Q0006,"For a discrete-time LTI system, the output y[n...",The sum y[n]=x[n]+h[n],The pointwise product y[n]=x[n]·h[n],The correlation y[n]=x[n]∘h[n],The convolution of x[n] with h[n]: y[n] = x[n]...,3,D,The sentence '2 Energy and Power of DT Signals...,Easy,Understand,2 Energy and Power of DT Signals.................,071cff3c-ccf9-41ea-b12f-3cbb9f232cdb
6,Q0007,"In DT LTI analysis, which relation correctly g...",The pointwise product y[n]=x[n]·h[n],The convolution of x[n] with h[n]: y[n] = x[n]...,The correlation y[n]=x[n]∘h[n],The sum y[n]=x[n]+h[n],1,B,The sentence 'erence operator for DT signals.....,Easy,Understand,erence operator for DT signals...................,bd7e1adc-9706-4c17-88c5-18c3c50f5742
7,Q0008,Which statement expresses the DT LTI input–out...,The pointwise product y[n]=x[n]·h[n],The sum y[n]=x[n]+h[n],The correlation y[n]=x[n]∘h[n],The convolution of x[n] with h[n]: y[n] = x[n]...,3,D,The sentence 'scaling for DT signals.............,Easy,Understand,scaling for DT signals.......................,d992be3a-e052-487a-99e4-4f15af5e41c2
8,Q0009,"In CT LTI analysis, which relation correctly g...",The pointwise product y(t)=x(t)·h(t),The correlation y(t)=x(t)∘h(t),The sum y(t)=x(t)+h(t),The convolution of x(t) with h(t): y(t) = x(t)...,3,D,The sentence '10 Time Shifting of CT signals.....,Medium,Apply,10 Time Shifting of CT signals...................,d992be3a-e052-487a-99e4-4f15af5e41c2
9,Q0010,The Fourier transform of a CT periodic signal is:,Nonzero only between −ω0 and +ω0,A line spectrum (Dirac impulses) at harmonics ...,A single impulse only at ω=0,A continuous flat spectrum,1,B,The sentence '12 Discrete time Impulse or Delt...,Easy,Understand,12 Discrete time Impulse or Delta function-( v...,abd1d62a-6ace-442d-8da2-6dab573ac697


Saved: mcqs_output.csv and mcqs_output.jsonl
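
JSONL stores one JSON object per line, so the export can be read back record by record without loading the whole file. A minimal round-trip sketch follows; `iter_jsonl` is a hypothetical helper, and the two sample records are placeholders rather than rows from the actual run.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator


def iter_jsonl(path: Path) -> Iterator[Dict]:
    """Yield one record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Placeholder records for illustration only.
sample = [
    {"id": "Q0001", "question": "Under a CT time shift, how does the CTFT change?", "correct_letter": "B"},
    {"id": "Q0002", "question": "A single impulse only at ω=0 describes which spectrum?", "correct_letter": "D"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for rec in sample:
            # ensure_ascii=False keeps symbols such as ω readable in the file.
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    loaded = list(iter_jsonl(path))
```

Reading with an explicit `encoding="utf-8"` matters here because the options contain non-ASCII symbols.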



## 9) Bonus: Item Analysis (Quick Stats)

We compute:
- Difficulty distribution
- Bloom’s taxonomy distribution
- Option-position bias (how often A/B/C/D is correct)
- Span coverage by chunk


In [16]:

def item_analysis(df: pd.DataFrame) -> Dict[str, Any]:
    stats = {}
    stats["difficulty_counts"] = df["difficulty"].value_counts().to_dict()
    stats["bloom_counts"] = df["bloom"].value_counts().to_dict()
    stats["correct_letter_counts"] = df["correct_letter"].value_counts().to_dict()
    stats["by_chunk"] = df.groupby("chunk_id")["id"].count().to_dict()
    return stats

stats = item_analysis(df)
stats


{'difficulty_counts': {'Easy': 30, 'Medium': 6},
 'bloom_counts': {'Understand': 30, 'Apply': 6},
 'correct_letter_counts': {'B': 13, 'C': 9, 'D': 9, 'A': 5},
 'by_chunk': {'0018bf16-d7a3-4af5-9b10-227d8c83603d': 1,
  '071cff3c-ccf9-41ea-b12f-3cbb9f232cdb': 1,
  '16fbc609-dfbe-4c95-95c2-b16f4e0d312b': 2,
  '19be7aa2-35fc-4ed5-afe6-fd6c4c872b7a': 1,
  '247b7a94-75dd-411e-9734-052e7ec7d20c': 1,
  '264db5b6-8bd8-4589-917e-bbd62b2c12e2': 1,
  '2b246f01-1538-4868-9cce-908103adc975': 1,
  '2c8617cd-6f9a-4930-b4e8-ce3e0db80b42': 1,
  '43d33124-aa55-476f-a098-d8cf21716ff2': 1,
  '45ab8db5-c8eb-4d10-a7e7-c3e83d526d8a': 1,
  '52d4e988-4a3b-402d-8634-6bf0080896a3': 1,
  '53ebe84d-ce79-427c-8820-e876d40cfaf0': 1,
  '5cef7c56-0d16-4ddd-b243-0c16babebbf6': 1,
  '5e66cf01-5cc3-46f8-812b-35a8d6c6bd61': 1,
  '82a0d2c4-893c-4ead-b5db-085bdfb22f41': 2,
  '91a4f210-6b45-4a7e-8e96-e1ec27eebb0f': 1,
  '922e6422-59c8-4f06-921f-80b65126865c': 1,
  '9f957c35-e149-4c75-942b-eee29ba565e5': 1,
  'a3a6ff8e-ce20-41
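
The `correct_letter_counts` above are uneven: B is the key 13 times out of 36, A only 5 times. A minimal sketch of a flag for this kind of option-position bias follows. The function name `flag_position_bias` and the tolerance of 0.10 are illustrative assumptions, not part of the notebook.

```python
from typing import Dict, List


def flag_position_bias(counts: Dict[str, int],
                       letters: str = "ABCD",
                       tolerance: float = 0.10) -> List[str]:
    """Return the letters whose share of correct answers is far from uniform.

    `tolerance` is a hypothetical threshold: the allowed absolute difference
    between a letter's observed share and the uniform share (1 / len(letters)).
    """
    total = sum(counts.get(letter, 0) for letter in letters)
    if total == 0:
        return []  # nothing to analyse
    expected = 1.0 / len(letters)
    flagged = []
    for letter in letters:
        share = counts.get(letter, 0) / total
        if abs(share - expected) > tolerance:
            flagged.append(letter)
    return flagged


# Counts copied from the item-analysis output of this run.
observed = {"B": 13, "C": 9, "D": 9, "A": 5}
print(flag_position_bias(observed))
```

With these counts and a tolerance of 0.10, A (under-used) and B (over-used) are flagged. Shuffling option order with the configured seed is one possible remedy.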


## 10) HTML Review Report

We export a compact HTML report for quick manual review by subject-matter experts (SMEs).


In [17]:

def to_html_report(df: pd.DataFrame, stats: Dict[str,Any], path: str) -> None:
    rows = []
    for _, r in df.iterrows():
        rows.append(f"""
        <div class='q'>
            <h3>{html.escape(r['id'])}: {html.escape(r['question'])}</h3>
            <ol type="A">
                <li>{html.escape(r['option_A'])}</li>
                <li>{html.escape(r['option_B'])}</li>
                <li>{html.escape(r['option_C'])}</li>
                <li>{html.escape(r['option_D'])}</li>
            </ol>
            <p><b>Answer:</b> {html.escape(r['correct_letter'])} &nbsp;|&nbsp;
               <b>Difficulty:</b> {html.escape(str(r['difficulty']))} &nbsp;|&nbsp;
               <b>Bloom:</b> {html.escape(str(r['bloom']))}</p>
            <p><b>Rationale:</b> {html.escape(r['rationale'])}</p>
            <details><summary>Source Span</summary><pre>{html.escape(r['source_span'])}</pre></details>
            <hr/>
        </div>
        """)
    stats_html = "<pre>" + html.escape(json.dumps(stats, indent=2)) + "</pre>"
    html_str = f"""
    <html>
    <head>
      <meta charset="utf-8"/>
      <title>MCQ Review</title>
      <style>
        body {{ font-family: system-ui, -apple-system, Segoe UI, Arial; margin: 24px; }}
        .q {{ margin-bottom: 24px; }}
        h1 {{ margin-top: 0; }}
        pre {{ white-space: pre-wrap; }}
      </style>
    </head>
    <body>
      <h1>MCQ Review ({datetime.utcnow().isoformat()}Z)</h1>
      <h2>Questions</h2>
      {''.join(rows)}
      <h2>Bonus</h2>
      <h2>Item Analysis</h2>
      {stats_html}
    </body>
    </html>
    """
    Path(path).write_text(html_str, encoding="utf-8")

report_path = "mcqs_review.html"
to_html_report(df, stats, report_path)
report_path


'mcqs_review.html'
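
`html.escape` expects a `str`, so a row with a missing rationale (NaN) or a numeric value would raise an `AttributeError` in the report loop. A small defensive helper could look like the sketch below. The name `esc` is hypothetical, and rendering missing values as an empty string is an assumption about the desired output.

```python
import html
import math
from typing import Any


def esc(value: Any) -> str:
    """Escape any cell value for HTML; render missing values as ''."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""  # pandas represents missing values as float NaN
    return html.escape(str(value))


print(esc("y[n] = x[n] * h[n] & <b>more</b>"))
print(repr(esc(float("nan"))))
```

Each `html.escape(r[...])` call in the report template could then be replaced with `esc(r[...])`.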


## 11) Reusable Facade

Call `generate_mcqs(document_path, n_per_chunk, chunk_tokens, overlap_tokens, seed)` programmatically; it returns the MCQs as a pandas DataFrame.


In [18]:

def generate_mcqs(
    document_path: Optional[str] = DEFAULT_DOC,
    n_per_chunk: int = 2,
    chunk_tokens: int = 250,
    overlap_tokens: int = 50,
    seed: int = 11
) -> pd.DataFrame:
    text = load_document(document_path)
    chs = chunk_text(text, target_tokens=chunk_tokens, overlap_tokens=overlap_tokens)
    its = generate_mcqs_for_chunks(chs, n_per_chunk=n_per_chunk, seed=seed)
    its = normalize_items(its)
    df = pd.DataFrame.from_records(mcqs_to_records(its))
    return df

# Example (commented):
# df2 = generate_mcqs('/path/to/your.pdf', n_per_chunk=3)
# display(df2.head())
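
To illustrate how `chunk_tokens` and `overlap_tokens` interact, here is a minimal sliding-window sketch. It assumes whitespace-separated tokens and uses a hypothetical name, `sketch_chunk`; the notebook's own `chunk_text` may tokenize and split differently.

```python
from typing import List


def sketch_chunk(text: str, target_tokens: int = 250,
                 overlap_tokens: int = 50) -> List[str]:
    """Split text into windows of `target_tokens` whitespace tokens.

    Consecutive windows share `overlap_tokens` tokens, so each window
    starts `target_tokens - overlap_tokens` tokens after the previous one.
    """
    if not 0 <= overlap_tokens < target_tokens:
        raise ValueError("overlap_tokens must be in [0, target_tokens)")
    tokens = text.split()
    step = target_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + target_tokens]))
        if start + target_tokens >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# A synthetic 600-token document: w0 w1 ... w599
doc = " ".join(f"w{i}" for i in range(600))
parts = sketch_chunk(doc, target_tokens=250, overlap_tokens=50)
print(len(parts), parts[1].split()[0])
```

With the defaults, a 600-token text yields three windows starting at tokens 0, 200, and 400, so the last 50 tokens of one window are repeated at the start of the next.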



## 12) Configuration JSON (Reproducibility)

We persist the generation parameters used for this run as a small JSON config so the run can be reproduced.


In [19]:

run_config = {
    "model": GEMINI_MODEL_NAME if HAS_GEMINI else "fallback-local",
    "seed": 11,
    "n_per_chunk": 2,
    "chunk_tokens": 250,
    "overlap_tokens": 50,
    "document_used": DEFAULT_DOC,
}

config_path = "mcq_run_config.json"
Path(config_path).write_text(json.dumps(run_config, indent=2), encoding="utf-8")
config_path, run_config


('mcq_run_config.json',
 {'model': 'fallback-local',
  'seed': 11,
  'n_per_chunk': 2,
  'chunk_tokens': 250,
  'overlap_tokens': 50,
  'document_used': '/Users/jisusingh/Downloads/Assignment_Encando/Documents/notes.pdf'})
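
A saved config can later be read back and mapped onto the facade's parameters. The sketch below is one way to do that. The helper `config_to_kwargs` is hypothetical, and `notes.pdf` is a placeholder path used only for this round-trip demonstration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

# Keys of the config that share their name with generate_mcqs parameters.
FACADE_PARAMS = ("n_per_chunk", "chunk_tokens", "overlap_tokens", "seed")


def config_to_kwargs(config: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a saved run config into keyword arguments for the facade.

    "model" is dropped because the facade does not take it;
    "document_used" is renamed to "document_path".
    """
    kwargs = {key: config[key] for key in FACADE_PARAMS if key in config}
    if "document_used" in config:
        kwargs["document_path"] = config["document_used"]
    return kwargs


# Round trip through a temporary file (placeholder values for illustration).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "mcq_run_config.json"
    saved = {"model": "fallback-local", "seed": 11, "n_per_chunk": 2,
             "chunk_tokens": 250, "overlap_tokens": 50,
             "document_used": "notes.pdf"}
    path.write_text(json.dumps(saved, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

kwargs = config_to_kwargs(loaded)
print(kwargs)
# In the notebook, the run could then be repeated with:
# df_again = generate_mcqs(**kwargs)
```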


## 13) How to Use with Your Own Document

1. Upload a `.pdf` or `.txt` to this environment (or mount a path).
2. Set the `document_path` below and run the cell.
3. Set `GOOGLE_API_KEY` to use Gemini; otherwise the offline fallback generator runs.
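
Before running on a new document, the two preconditions in the steps above can be checked up front. This is a rough sketch with hypothetical helper names (`pick_backend`, `is_supported_document`). The notebook's own `HAS_GEMINI` flag may apply additional checks, such as whether the Gemini client library is installed.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

SUPPORTED_SUFFIXES = {".pdf", ".txt"}  # the formats the notebook ingests


def pick_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "gemini" if an API key is set, else "fallback-local"."""
    env = os.environ if env is None else env
    return "gemini" if env.get("GOOGLE_API_KEY", "").strip() else "fallback-local"


def is_supported_document(path: str) -> bool:
    """Check the file extension only; the file itself is not opened."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


print(pick_backend({}))                      # no key supplied
print(is_supported_document("notes.pdf"))
```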
