# AI-Resistant Assignment Evaluation & Redesign Demo

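This notebook runs a four-step pipeline:
1. **Simulate** an AI “student” answer to the assignment  
2. **Assess** the assignment's vulnerability to AI and grade the simulated answer  
3. **Redesign** the assignment with suggestions that make it AI-resistant  
4. **Report** the results in a single assembled document  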

Make sure you have:
- A `.env` file in this folder with `OPENAI_API_KEY=sk-...`
- A `rubric.yaml` file in this folder (the Assess step loads it)
- An assignment file (`.pdf` or `.txt`) in `assignment_examples/`
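
Before running any cell, the prerequisites can be checked up front. The sketch below is not part of the original notebook: the function name `missing_prerequisites` is hypothetical, and it assumes the `.env` file stores the key on a plain `OPENAI_API_KEY=...` line. It only inspects the `.env` file and the `assignment_examples/` folder.

```python
from pathlib import Path


def missing_prerequisites(folder: str = ".") -> list[str]:
    """Return a list of missing prerequisites (empty if everything is in place)."""
    root = Path(folder)
    problems = []

    # The .env file should contain a non-empty OPENAI_API_KEY entry.
    env_file = root / ".env"
    if not env_file.is_file():
        problems.append("missing .env file")
    else:
        lines = env_file.read_text(encoding="utf-8").splitlines()
        has_key = any(
            line.strip().startswith("OPENAI_API_KEY=")
            and line.strip() != "OPENAI_API_KEY="
            for line in lines
        )
        if not has_key:
            problems.append(".env has no OPENAI_API_KEY entry")

    # At least one .pdf or .txt assignment should exist in assignment_examples/.
    examples = root / "assignment_examples"
    found = examples.is_dir() and any(
        p.suffix.lower() in {".pdf", ".txt"} for p in examples.iterdir()
    )
    if not found:
        problems.append("no .pdf or .txt file in assignment_examples/")

    return problems


for problem in missing_prerequisites("."):
    print("Setup problem:", problem)
```

An empty list means the notebook has what it needs to start; anything else names what to fix first.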

In [1]:
# Uncomment to install the dependencies in your notebook environment
# %pip install openai langchain langchain-community pyyaml python-dotenv PyPDF2

import os
import json
import yaml
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain_community.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

# Utility: display Markdown nicely in Jupyter
from IPython.display import Markdown, display

In [2]:
# Load your API key
load_dotenv()  # expects a .env file in this folder

# Helper to load PDF or TXT
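def load_assignment(path: str) -> str:
    """Return the assignment text from a PDF or a plain-text file."""
    if path.lower().endswith(".pdf"):
        reader = PdfReader(path)
        # extract_text() can return None for image-only pages
        return "\n\n".join(p.extract_text() or "" for p in reader.pages)
    # Use a context manager so the file handle is always closed
    with open(path, "r", encoding="utf-8") as f:
        return f.read()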

# Example prompt path
ASSIGNMENT_PATH = "assignment_examples/CS598LMZ_HW2_sp25.pdf"
assignment_text = load_assignment(ASSIGNMENT_PATH)
display(Markdown("**Assignment Prompt:**\n\n" + assignment_text[:500] + "..."))

**Assignment Prompt:**

HW2: Code Generation and Prompting
CS598LMZ Spring 2025
1 Goal
In this assignment you will use code models for program synthesis. The goal is to test how code
LLMs can solve programming problems.
1.1 HumanEval benchmark
def common (l1: list, l2: list): 
    """Return sorted unique common elements for two lists""" 
[5,3,2,8], [3,2] 
[4,3,2,8], [3,2,4] [4,3,2,8], [] H UMAN E V AL input 
    common_elements = list (set(l1). intersection (set(l2))) 
    common_elements. sort ()
    return common_ele...

In [None]:
import os

API_KEY = os.getenv("OPENAI_API_KEY")
if not API_KEY or API_KEY == "your_key_here":
    # No usable key in the environment, so prompt for one interactively
    API_KEY = input("Enter your OpenAI API key: ").strip()

# Every ChatOpenAI client in this notebook is created with this key, e.g.:
llm = ChatOpenAI(model="o4-mini", openai_api_key=API_KEY)

In [18]:
# Simulate (AI Student)
def simulate_assignment(text: str) -> str:
    """Ask the model to answer the assignment as a student would."""
    llm = ChatOpenAI(model="o4-mini", openai_api_key=API_KEY, temperature=1.0)
    prompt = (
        "You are a third-year university student with two hours to complete the assignment. "
        "Write an answer of at least 800 words, citing relevant course materials.\n\n"
        f"{text}\n\nBegin your answer now:"
    )
    resp = llm.invoke([HumanMessage(content=prompt)])
    return resp.content

ai_answer = simulate_assignment(assignment_text)
# Save the full answer to disk and preview the first 500 characters
with open("ai_answer.md", "w", encoding="utf-8") as f:
    f.write(ai_answer)
display(Markdown("## AI-Generated Answer\n\n" + ai_answer[:500] + "..."))

## AI-Generated Answer

Below is the report.md that summarizes our HW2 runs, followed by the code snippets and discussion of base vs. improved synthesis. All references to course materials are included at the end.

------------------------------------------------------------
report.md
------------------------------------------------------------

1) Results of both runs on HumanEval (greedy pass@1) using EvalPlus:

Base Run:  
 • Base score           = 19.1%  
 • Base + Extra score   = 20.4%  

Improved Run:  
 • Base s...

In [None]:
import re
import json

# Run this cell before the Assess cell, which calls safe_parse_json
def safe_parse_json(raw: str) -> dict:
    """
    Extract the span from the first "{" to the last "}" and load it as JSON.
    Tolerates extra text or code fences before and after the object.
    """
    m = re.search(r"\{.*\}", raw.strip(), re.DOTALL)
    if m:
        raw = m.group(0)
    return json.loads(raw)
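
Model replies often wrap the JSON in prose or code fences, which is why the helper above exists. As an alternative sketch (not the notebook's own method), the standard library's `json.JSONDecoder.raw_decode` can locate the end of the object itself, so trailing text that contains a stray `}` does not break parsing. The function name `extract_json_object` and the sample reply are illustrative assumptions.

```python
import json


def extract_json_object(reply: str) -> dict:
    """Return the first JSON object embedded in `reply`.

    Tries each "{" as a candidate start and lets the decoder find the
    matching end, so prose or code fences around the object are ignored.
    """
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(reply, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one
            start = reply.find("{", start + 1)
    raise ValueError("no JSON object found in reply")


# Made-up reply that mimics a fenced model response
fenced = 'Here you go:\n```json\n{"score": 80, "note": "uses {braces}"}\n```\nDone.'
print(extract_json_object(fenced))
```

The trade-off is that this returns the first parseable object rather than the widest `{...}` span, which matters only if a reply contains several objects.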

In [22]:
# Assess
# Load rubric
with open("rubric.yaml","r",encoding="utf-8") as f:
    RUBRIC = yaml.safe_load(f)

def analyze_assignment(text: str) -> dict:
    """
    Analyze the assignment for AI‐vulnerability across five dimensions.
    Returns a dict:
      { dimension: {score: 0–100, rationale: "…" }, … }
    """
    metrics = {
        "retrievability": (
            "Answer can be directly retrieved from publicly available sources "
            "without context constraints."
        ),
        "templatedness": (
            "Response follows a highly generic formulaic structure lacking "
            "specific context (e.g., a five‐paragraph essay with no proper names)."
        ),
        "context_dependence": (
            "Does not require integration of real‐world context or personal experience."
        ),
        "counterfactual_need": (
            "Does not demand counterfactual reasoning, evaluation, or hypothesis "
            "(e.g., no prompts asking 'explain why not')."
        ),
        "process_orientation": (
            "Does not require submission of intermediate work (drafts, outlines), only a final answer."
        )
    }

    llm = ChatOpenAI(
        model="o4-mini",
        openai_api_key=API_KEY,
        # o4-mini accepts only the default temperature of 1.0
        temperature=1.0
    )

    prompt = (
        "You are an educational assessment expert. Analyze the following assignment prompt "
        "**only** according to the five dimensions below, and **respond with a single valid JSON object and nothing else**.\n\n"
        "For each dimension, assign a risk score from 0 to 100 (higher means more vulnerable) and provide a brief rationale.\n\n"
        f"Dimensions:\n{yaml.dump(metrics, allow_unicode=True)}\n\n"
        f"Assignment:\n{text}\n\n"
        "The JSON must be of the form:\n"
        "{\n"
        '  "retrievability": {"score": <0-100>, "rationale": "…"},\n'
        '  "templatedness": {"score": <0-100>, "rationale": "…"},\n'
        '  "context_dependence": {"score": <0-100>, "rationale": "…"},\n'
        '  "counterfactual_need": {"score": <0-100>, "rationale": "…"},\n'
        '  "process_orientation": {"score": <0-100>, "rationale": "…"}\n'
        "}\n"
    )

    raw = llm.invoke([HumanMessage(content=prompt)]).content
    return safe_parse_json(raw)



def grade_answer(text: str, answer: str) -> dict:
    """
    Grade the given answer against the rubric.
    Returns a dict:
      {
        total_score: 0–100,
        breakdown: { criterion: score, … },
        strengths: "…",
        weaknesses: "…"
      }
    """
    rubric_text = yaml.dump(RUBRIC, allow_unicode=True)

    llm = ChatOpenAI(
        model="o4-mini",
        openai_api_key=API_KEY,
        temperature=1.0
    )

    prompt = (
        "You are a strict teaching assistant. Grade the student’s answer according to the rubric below "
        "and **respond with valid JSON only** (no extra text):\n\n"
        f"{rubric_text}\n\n"
        f"Assignment:\n{text}\n\n"
        f"Student Answer:\n{answer}\n\n"
        "The JSON schema must be:\n"
        "{\n"
        "  \"total_score\": <0-100>,\n"
        "  \"breakdown\": {\"criterion\": <0-criterion_weight>, …},\n"
        "  \"strengths\": \"…\",\n"
        "  \"weaknesses\": \"…\"\n"
        "}\n"
    )

    raw = llm.invoke([HumanMessage(content=prompt)]).content
    return safe_parse_json(raw)


assignment_analysis = analyze_assignment(assignment_text)
answer_assessment   = grade_answer(assignment_text, ai_answer)


# Package the results in the shape the rest of the pipeline expects
evaluation = {
    "assignment_analysis": assignment_analysis,
    "answer_assessment": answer_assessment
}

# Write to evaluation.json
with open("evaluation.json", "w", encoding="utf-8") as f:
    json.dump(evaluation, f, ensure_ascii=False, indent=2)

print("✅ evaluation.json saved")

display(Markdown("### Assignment Analysis\n" + 
                 "\n".join(f"- **{k}**: {v}" for k,v in assignment_analysis.items())))
display(Markdown("### Answer Assessment\n" + json.dumps(answer_assessment, indent=2)))

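
Because the analysis comes back from a model, it can help to check its shape before writing it to `evaluation.json`. The following is a minimal sketch under stated assumptions: `validate_analysis` is a hypothetical helper, the unweighted mean is just one possible summary of overall risk, and the sample scores are invented for illustration.

```python
DIMENSIONS = (
    "retrievability",
    "templatedness",
    "context_dependence",
    "counterfactual_need",
    "process_orientation",
)


def validate_analysis(analysis: dict) -> float:
    """Check the shape of a vulnerability analysis and return its mean risk score.

    Raises ValueError when a dimension is missing or a score is outside 0-100.
    """
    scores = []
    for name in DIMENSIONS:
        entry = analysis.get(name)
        if not isinstance(entry, dict):
            raise ValueError(f"missing dimension: {name}")
        score = entry.get("score")
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"{name}: score must be a number")
        if not 0 <= score <= 100:
            raise ValueError(f"{name}: score {score} is outside 0-100")
        scores.append(score)
    # Unweighted mean across the five dimensions (an assumption, not a rule)
    return sum(scores) / len(scores)


# Invented example values, not real model output
sample = {name: {"score": 50, "rationale": "placeholder"} for name in DIMENSIONS}
print(validate_analysis(sample))
```

Calling `validate_analysis(assignment_analysis)` right after `analyze_assignment` would surface a malformed reply early instead of in the report step.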

In [None]:
# Redesign Suggestions
def redesign(assignment: str, analysis: dict, assess: dict) -> str:
    """Ask the model for redesign suggestions based on the analysis and grading."""
    llm = ChatOpenAI(model="o4-mini", openai_api_key=API_KEY, temperature=1.0)
    prompt = (
        "You are an instructional designer. Given:\n\n"
        f"Assignment:\n{assignment}\n\n"
        f"Vulnerability Analysis:\n{json.dumps(analysis, indent=2)}\n\n"
        # .get avoids a KeyError if the grader omitted the "weaknesses" field
        f"Weaknesses:\n{assess.get('weaknesses', '')}\n\n"
        "Propose 3 AI-resistant redesign suggestions with rationale and examples."
    )
    return llm.invoke([HumanMessage(content=prompt)]).content

suggestions_md = redesign(assignment_text, assignment_analysis, answer_assessment)
with open("suggestions.md", "w", encoding="utf-8") as f:
    f.write(suggestions_md)
display(Markdown("## Redesign Suggestions\n\n" + suggestions_md))

In [None]:
# Assemble Report

# Build final report in a variable
report_lines = []
report_lines.append("# Final AI-Resistant Assignment Report\n")
report_lines.append("## 1. Prompt\n" + assignment_text + "\n")
report_lines.append("## 2. AI Answer\n" + ai_answer + "\n")
report_lines.append("## 3. Vulnerability Analysis\n" + json.dumps(assignment_analysis, indent=2) + "\n")
report_lines.append("## 4. Assessment\n" + json.dumps(answer_assessment, indent=2) + "\n")
report_lines.append("## 5. Redesign Suggestions\n" + suggestions_md + "\n")

final_report = "\n".join(report_lines)
with open("report.md", "w", encoding="utf-8") as f:
    f.write(final_report)
display(Markdown(final_report))
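
The report embeds the vulnerability analysis as raw JSON. If a more readable section is wanted, one option is to render it as a Markdown table. This is a sketch only: `analysis_to_table` is a hypothetical helper, it assumes each dimension maps to a dict with `score` and `rationale` keys, and the sample entry is invented.

```python
def analysis_to_table(analysis: dict) -> str:
    """Render {dimension: {"score": ..., "rationale": ...}} as a Markdown table."""
    rows = ["| Dimension | Risk score | Rationale |", "| --- | --- | --- |"]
    for name, entry in analysis.items():
        # Escape pipes and flatten newlines so a rationale cannot break the row
        rationale = str(entry.get("rationale", ""))
        rationale = rationale.replace("|", "\\|").replace("\n", " ")
        rows.append(f"| {name} | {entry.get('score', '')} | {rationale} |")
    return "\n".join(rows)


# Invented example entry, not real model output
example = {"retrievability": {"score": 70, "rationale": "public | well known\nbenchmark"}}
print(analysis_to_table(example))
```

The returned string could replace the `json.dumps(assignment_analysis, indent=2)` part of section 3 when building `report_lines`.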