# Final Project Starting Guide

Hello everyone, welcome to the final project! This notebook reiterates the rules and guidelines and gives you some starting points.

### What we provide

In this project, we will provide you with:
- This starting guide
- A working API that you can access from the ASU network (i.e., on campus or over VPN)
- A starting development dataset that you can use to develop your agent. It contains 1,000 instances, each with {domain, input, expected_output}

### Your goal

In this project, you will implement an inference-time agent to solve reasoning requests such as those provided in the development data. Grading will be effort-based: you will get full credit if you produce the minimum deliverables below, subject to the rules and requirements that follow.

#### Minimum Deliverables

1. A working agent loop (in the form of a GitHub project) that the TA can run and that implements *at least three* inference-time algorithms or techniques.
2. Outputs from your agent on the released test data (see important dates).
3. A short one-page report on how your agent works, with pointers to the important techniques (references to code blocks).
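
As one illustration of what could count as an inference-time technique, here is a minimal sketch of self-consistency: sample several answers, then take a majority vote. The names `majority_vote` and `sample_fn` are hypothetical. In practice `sample_fn` might wrap the provided API call with a non-zero temperature, which is an assumption and not part of the provided code.

```python
from collections import Counter
from typing import Callable, Optional


def majority_vote(sample_fn: Callable[[str], Optional[str]],
                  question: str,
                  n_samples: int = 5) -> Optional[str]:
    """Query `sample_fn` n_samples times and return the most common answer.

    `sample_fn` is a hypothetical callable that returns one candidate answer
    (or None on failure) per call. Each call typically costs one LLM request,
    so n_samples counts against the per-question call budget.
    """
    votes = Counter()
    for _ in range(n_samples):
        answer = sample_fn(question)
        if answer:
            # Light normalization so "45" and " 45 " count as the same vote.
            votes[answer.strip().lower()] += 1
    if not votes:
        return None
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return votes.most_common(1)[0][0]


# Usage with a stand-in sampler (no network needed):
fake_answers = iter(["45", "44", "45 ", "45", "46"])
print(majority_vote(lambda q: next(fake_answers), "What is 17 + 28?"))
```

Voting only helps when the samples differ, so this sketch presumes some sampling randomness. With `temperature=0.0` every sample would likely be identical.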

#### Rules and Requirements
1. You must use only our provided API to access LLMs; that is, you cannot use any other LLM in any other way within your agent loop. Some exceptions may be made if you call certain external tools (e.g., Google search) that use LLMs internally. Please discuss any exceptions with us to avoid penalties of up to 50% of the project grade.
2. You must not hardcode a full delegation to an external tool (e.g., google_search(input_problem)). Such delegations must be selected automatically by your agent. Hardcoded delegations will lead to a zero.
3. You cannot use Cursor or any AI coding aids to implement the final project. You can, however, ask LLMs (or other online resources) for conceptual clarification or code examples. Your final project should not contain any blocks of code (i.e., > 3 lines) that are written by AI. Violations will lead to a zero.
4. Your agent should run efficiently, with <20 LLM calls per question. Exceptions may be made if you have a complicated agent, but please discuss them with us. Up to 10% of the project grade may be deducted if we observe very inefficient LLM usage that does not clearly benefit performance.
5. Your agent must run without any requests to paid services (a service counts as paid if the TA would have to pay to run it, regardless of whether you actually pay for it). Violations will lead to a zero.
6. You must submit a GitHub project link as your code submission. All changes must be tracked, and each commit should be within 100 lines of +/- and have a good message. Up to 25% of the project grade may be deducted if we observe "magic commits" or too few commits.
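
To make rule 4 easy to check while developing, one option is to count calls per question and stop when the limit is reached. The sketch below is a minimal illustration. `CallBudget` and `CallBudgetExceeded` are hypothetical names, and the default of 19 is simply one reading of "<20".

```python
class CallBudgetExceeded(RuntimeError):
    """Raised when a question would use more LLM calls than allowed."""


class CallBudget:
    """Counts LLM calls for one question and stops at a fixed limit."""

    def __init__(self, max_calls: int = 19):
        # 19 keeps the count strictly below 20, matching the "<20" wording.
        self.max_calls = max_calls
        self.used = 0

    def wrap(self, llm_fn):
        """Return a version of `llm_fn` that counts every invocation."""
        def counted(*args, **kwargs):
            if self.used >= self.max_calls:
                raise CallBudgetExceeded(f"budget of {self.max_calls} calls exhausted")
            self.used += 1
            return llm_fn(*args, **kwargs)
        return counted


# Usage with a stand-in function; a real agent would wrap its API helper instead.
budget = CallBudget(max_calls=2)
ask = budget.wrap(lambda prompt: f"echo: {prompt}")
print(ask("first"), "|", ask("second"), "| used:", budget.used)
```

Creating a fresh `CallBudget` for each question keeps the count per question rather than per run.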


### Suggestions
1. Start early, please.
2. Think about how to evaluate whether your output is good enough compared to the provided expected_outputs. We will not release how we actually evaluate your outputs, so you will have to anticipate how we will evaluate them.
3. Start with a basic implementation, and iterate based on mistakes and feedback.
4. Find more development data, or create your own cases to stress-test your agent.
5. You are free to modify any code provided in this starting guide, or not to use any of it at all.
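
To act on suggestions 2 and 3, it can help to break accuracy down by the `domain` field of the development data so that the weakest category stands out. The following is a minimal sketch. It assumes each result row is a dict with `domain`, `predicted`, and `expected` keys (a hypothetical row shape) and that exact match after light normalization is an acceptable proxy for the undisclosed grading.

```python
from collections import defaultdict


def accuracy_by_domain(rows, is_match=None):
    """Return {domain: (n_correct, n_total)} for a list of result rows."""
    if is_match is None:
        # Default comparison: case-insensitive exact match on trimmed strings.
        is_match = lambda pred, exp: (
            str(pred or "").strip().lower() == str(exp or "").strip().lower()
        )
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        counts = totals[row["domain"]]
        counts[0] += int(is_match(row["predicted"], row["expected"]))
        counts[1] += 1
    return {domain: (c, n) for domain, (c, n) in totals.items()}


# Usage on made-up rows (not real results):
demo_rows = [
    {"domain": "math", "predicted": "8", "expected": "8"},
    {"domain": "math", "predicted": "4", "expected": "8"},
    {"domain": "planning", "predicted": "A then B", "expected": "a then b"},
]
for domain, (c, n) in accuracy_by_domain(demo_rows).items():
    print(f"{domain}: {c}/{n}")
```

Passing a different `is_match` callable lets the same report be reused with a stricter or looser comparison.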

### Important dates
- **Release of final test data**: 11/25/2025
- **Deadline for submitting all deliverables**: 12/05/2025

### Extra Credit
The top 20 projects (ranked by performance metrics on the test data and, at the TA's discretion, implementation quality) will be given extra credit. The actual credit will be between 1% and 7.5%, depending on the ranking.

In [1]:
# %% Minimal setup
# If needed (uncomment in a notebook):
# !pip install requests python-dotenv

import os, json, textwrap, re, time
import requests

# API_KEY  = os.getenv("OPENAI_API_KEY", "cse476")
# API_BASE = os.getenv("API_BASE", "http://10.4.58.53:41701/v1")  
# MODEL    = os.getenv("MODEL_NAME", "bens_model")   

API_KEY = "cse476"
API_BASE = "http://10.4.58.53:41701/v1"
MODEL = "bens_model"           

def call_model_chat_completions(prompt: str,
                                system: str = "You are a helpful assistant. Reply with only the final answer—no explanation.",
                                model: str = MODEL,
                                temperature: float = 0.0,
                                timeout: int = 60) -> dict:
    """
    Calls an OpenAI-style /v1/chat/completions endpoint and returns:
    { 'ok': bool, 'text': str or None, 'raw': dict or None, 'status': int, 'error': str or None, 'headers': dict }
    """
    url = f"{API_BASE}/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type":  "application/json",
    }
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user",   "content": prompt}
        ],
        "temperature": temperature,
        "max_tokens": 128,
    }

    try:
        resp = requests.post(url, headers=headers, json=payload, timeout=timeout)
        status = resp.status_code
        hdrs   = dict(resp.headers)
        if status == 200:
            data = resp.json()
            # `or [{}]` also covers an empty "choices" list, which would otherwise raise IndexError
            text = (data.get("choices") or [{}])[0].get("message", {}).get("content", "")
            return {"ok": True, "text": text, "raw": data, "status": status, "error": None, "headers": hdrs}
        else:
            # try best-effort to surface error text
            err_text = None
            try:
                err_text = resp.json()
            except Exception:
                err_text = resp.text
            return {"ok": False, "text": None, "raw": None, "status": status, "error": str(err_text), "headers": hdrs}
    except requests.RequestException as e:
        return {"ok": False, "text": None, "raw": None, "status": -1, "error": str(e), "headers": {}}


## 1) Smoke test: direct inference

We’ll do a single request with a strict instruction to answer briefly.  
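*If you see an auth error, check `API_KEY` and (if needed) `API_BASE`/`MODEL` in the setup cell; the environment-variable lookups there are commented out, so setting `OPENAI_API_KEY` alone has no effect.*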


In [2]:
# %% Direct call example
demo_prompt = "What is 17 + 28? Answer with just the number."
result = call_model_chat_completions(demo_prompt)
print("OK:", result["ok"], "HTTP:", result["status"])
print("MODEL SAYS:", (result["text"] or "").strip())

# Optional: Inspect rate-limit headers if your provider exposes them
for k in ["x-ratelimit-remaining-requests", "x-ratelimit-limit-requests", "x-request-id"]:
    if k in result["headers"]:
        print(f"{k}: {result['headers'][k]}")


OK: True HTTP: 200
MODEL SAYS: 45
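
The helper above reports network failures as `ok: False` with `status: -1` instead of raising, so a transient error would otherwise surface as an empty answer. Below is a minimal sketch of a retry wrapper with exponential backoff. `call_with_retries` is a hypothetical helper, and whether retries count toward the per-question call limit is not stated in the rules, so treat that as an open question.

```python
import time


def call_with_retries(call_fn, *args, max_attempts=3, base_delay=0.5,
                      sleep=time.sleep, **kwargs):
    """Call `call_fn` until its result has ok=True or attempts run out.

    `call_fn` is expected to return a dict with an 'ok' key, like the
    result shape documented for call_model_chat_completions.
    """
    result = None
    for attempt in range(max_attempts):
        result = call_fn(*args, **kwargs)
        if result.get("ok"):
            return result
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return result  # last failed result, so the caller can inspect 'error'


# Usage with a stand-in that fails twice, then succeeds:
outcomes = iter([{"ok": False}, {"ok": False}, {"ok": True, "text": "45"}])
print(call_with_retries(lambda: next(outcomes), sleep=lambda s: None))
```

The `sleep` parameter is injectable only so the wrapper can be exercised without real delays.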


## 2) A tiny test set (3 questions)

We’ll cover:
1. **Math reasoning** — inequality solving,
2. **Common sense** — buoyancy/ice & water,
3. **Logic** — a classic race-position puzzle.

We also tightly constrain the required answer forms to enable simple auto‑grading.


In [3]:
# %% Define three tests: input + expected
tests = [
    {
        "id": "math_inequality",
        "type": "numeric",  # grader will prefer numeric extraction
        "prompt": "Solve for the smallest integer n such that 3n + 5 > 26. Answer with just the integer.",
        "expected": "8",    # Because 3n > 21 => n > 7, smallest integer is 8
    },
    {
        "id": "commonsense_ice",
        "type": "text",
        "prompt": (
            "You place an ice cube in a glass of water and mark the water level. "
            "After the ice melts, does the water level rise, fall, or stay the same? "
            "Answer with exactly one of: 'rise', 'fall', 'stay the same'."
        ),
        "expected": "stay the same",
    },
    {
        "id": "logic_race",
        "type": "text",
        "prompt": (
            "In a race, you pass the person in second place. What position are you now in? "
            "Answer with a single word like 'first', 'second', 'third'."
        ),
        "expected": "second",
    },
]


## 3) Minimal evaluator

We provide some example code to decide whether the agent outputs match the expected outputs, just to give you an idea of how evaluations can be done. You are free to use this code, or not.

In [4]:
# %% Simple normalization and evaluation helpers
def normalize_text(s: str) -> str:
    s = (s or "").strip().lower()
    # Replace punctuation (except hyphens and apostrophes) with spaces, then collapse whitespace
    s = re.sub(r"[^\w\s\-']", " ", s)
    s = re.sub(r"\s+", " ", s).strip()

    # Map common synonyms used in these tests
    synonyms = {
        "unchanged": "stay the same",
        "no change": "stay the same",
        "same": "stay the same",
        "second place": "second",
        "2nd": "second",
        "first place": "first",
        "third place": "third",
    }
    return synonyms.get(s, s)

def extract_number(s: str):
    # Returns first number occurrence as string if found, else None
    if not s:
        return None
    m = re.search(r"[-+]?\d+(\.\d+)?", s)
    return m.group(0) if m else None

def grade(expected: str, got: str, kind: str) -> bool:
    if kind == "numeric":
        exp_num = extract_number(expected)
        got_num = extract_number(got)
        return (exp_num is not None) and (got_num == exp_num)
    else:
        return normalize_text(got) == normalize_text(expected)

def evaluate_tests(tests, model=MODEL):
    rows = []
    for t in tests:
        r = call_model_chat_completions(
            t["prompt"],
            system="You are a careful solver. Reply ONLY with the final answer, nothing else.",
            model=model,
            temperature=0.0,
        )
        got = (r["text"] or "").strip()
        is_correct = grade(t["expected"], got, t["type"])
        rows.append({
            "id": t["id"],
            "expected": t["expected"],
            "got": got,
            "correct": is_correct,
            "status": r["status"],
            "error": r["error"],
        })
        # Tiny pacing to be polite to the API
        time.sleep(0.2)

    # Print a small report
    correct = sum(1 for x in rows if x["correct"])
    print(f"Score: {correct}/{len(rows)} correct")
    for x in rows:
        mark = "✅" if x["correct"] else "❌"
        print(f"{mark} {x['id']}: expected={x['expected']!r}, got={x['got']!r} (HTTP {x['status']})")
        if x["error"]:
            print("   error:", x["error"])
    return rows

results = evaluate_tests(tests)


Score: 2/3 correct
❌ math_inequality: expected='8', got='4' (HTTP 200)
✅ commonsense_ice: expected='stay the same', got='stay the same' (HTTP 200)
✅ logic_race: expected='second', got='second' (HTTP 200)
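
Note that `grade` compares the extracted numbers as strings, so an answer of `8.0` would not match an expected `8`. Here is a minimal sketch of a value-based comparison, assuming a small absolute tolerance is acceptable. `numbers_match` is a hypothetical helper, and the actual grading rules are not released.

```python
import math
import re

_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def numbers_match(expected: str, got: str, tol: float = 1e-9) -> bool:
    """Compare the first number found in each string by value, not by text."""
    exp_m = _NUMBER.search(expected or "")
    got_m = _NUMBER.search(got or "")
    if not exp_m or not got_m:
        return False
    return math.isclose(float(exp_m.group(0)), float(got_m.group(0)),
                        rel_tol=0.0, abs_tol=tol)


print(numbers_match("8", "8.0"))    # same value, different text
print(numbers_match("8", "n = 4"))  # different values
```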


In [5]:
def self_evaluate(question, prediction, expected_answer, model=MODEL):
    """
    Use the model itself as a strict grader.
    Returns True if the model says the prediction matches the expected answer; else False.
    Falls back to a simple normalized string compare if the model's reply is malformed.
    """
    import re

    system = "You are a strict grader. Reply with exactly True or False. No punctuation. No explanation."
    prompt = f"""You are grading a question-answer pair.

Return exactly True if the PREDICTION would be accepted as correct for the EXPECTED_ANSWER.
Otherwise, return False.

QUESTION:
{question}

PREDICTION:
{prediction}

EXPECTED_ANSWER:
{expected_answer}

Answer with exactly: True or False
"""

    r = call_model_chat_completions(
        prompt,
        system=system,
        model=model,
        temperature=0.0,
    )

    reply = (r.get("text") or "").strip().lower()
    if reply.startswith("true"):
        return True
    if reply.startswith("false"):
        return False

    # Fallback: simple normalization-based equality
    norm = lambda s: re.sub(r"\s+", " ", (s or "").strip().lower())
    return norm(prediction) == norm(expected_answer)


In [6]:
def self_evaluate_tests(tests, model=MODEL, grader_model=None, sleep_sec=0.2, verbose=True):
    """
    Run the tests by querying the model for each prompt, then use LLM-as-a-judge
    (self_evaluate) to determine correctness.

    Args:
        tests: list of dicts with keys: id, prompt, expected (and optionally type)
        model: model used to generate predictions
        grader_model: model used to judge correctness (defaults to `model` if None)
        sleep_sec: small delay between calls to be polite to the API
        verbose: if True, print a summary line per test

    Returns:
        rows: list of dicts with fields:
              id, expected, got, correct, status, error
    """
    import time

    judge_model = grader_model or model
    rows = []

    for t in tests:
        # 1) Get model prediction
        r = call_model_chat_completions(
            t["prompt"],
            system="You are a careful solver. Reply ONLY with the final answer, nothing else.",
            model=model,
            temperature=0.0,
        )
        got = (r.get("text") or "").strip()

        # 2) LLM-as-a-judge: strict True/False
        is_correct = self_evaluate(
            question=t["prompt"],
            prediction=got,
            expected_answer=t["expected"],
            model=judge_model,
        )

        row = {
            "id": t.get("id", "<unnamed>"),
            "expected": t["expected"],
            "got": got,
            "correct": bool(is_correct),
            "status": r.get("status"),
            "error": r.get("error"),
        }
        rows.append(row)

        if verbose:
            mark = "✅" if is_correct else "❌"
            print(f"{mark} {row['id']}: expected={row['expected']!r}, got={row['got']!r} (HTTP {row['status']})")
            if row["error"]:
                print("   error:", row["error"])

        if sleep_sec:
            time.sleep(sleep_sec)

    return rows

# Example:
results_llm_judge = self_evaluate_tests(tests, verbose=True, model=MODEL, grader_model=MODEL)


❌ math_inequality: expected='8', got='4' (HTTP 200)
✅ commonsense_ice: expected='stay the same', got='stay the same' (HTTP 200)
✅ logic_race: expected='second', got='second' (HTTP 200)


In [None]:
# Three methods implemented here
# Method 1: prompt the model better so that the problem gets solved, especially the math ones

def get_classifaction(question):
    SYSTEM_PROMPT = """
    You are a helpful assistant.
    Classify each problem into exactly one of these categories: Math, General knowledge, Predictions, Planning.
    The classification must be correct and must reflect what the problem is actually describing.

    Do not answer the problem here; this step is only for classification. Use only the category tags given above.
    Respond with the category name only.
    """.strip()
    prompt = f"Question: {question}"

    classifaction = call_model_chat_completions(prompt=prompt, system=SYSTEM_PROMPT, model=MODEL, temperature=0.0, timeout=60)
    # 'text' is None when the API call fails; return an empty string so callers fall through to the default prompt
    return (classifaction.get('text') or '').strip()
    
    
def get_classifaction_prompt(question):
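    # Normalize the label: the call may return None on failure, and a reply like "Math." would otherwise miss every branch
    classifaction = (get_classifaction(question) or '').strip().strip('.').lower()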
    print(classifaction)
    ans = ''
    if classifaction.lower() == 'math':
        SYSTEM_PROMPT = """
        You are a math assistant. Your job is to solve a math problem.
        First, understand what the problem is asking for.
        Then solve the problem step by step until a solution is reached. Do not display each step!
        Redo the problem multiple times to make sure the answer is correct.
        Return the solution with the answer on a new line by itself at the end of the calculations.
        Make sure only the answer is output.
        """.strip()
        ans = call_model_chat_completions(prompt=question, system=SYSTEM_PROMPT, model=MODEL, temperature = 0.0, timeout = 60)
        # Keep the last whitespace-separated token; guard against a failed call (text is None) or an empty reply
        tokens = (ans.get('text') or '').split()
        ans = tokens[-1] if tokens else ''
    elif classifaction.lower() == 'general knowledge':
        SYSTEM_PROMPT = """
        You are an assistant that knows a lot of general knowledge. Your job is to answer the question correctly.
        First, understand what the problem is asking for.
        Respond with only one answer and make sure the answer is correct.
        """.strip()
        ans = call_model_chat_completions(prompt=question, system=SYSTEM_PROMPT, model=MODEL, temperature = 0.0, timeout = 60)
        ans = ans.get('text')
    elif classifaction.lower() == 'predictions':
        SYSTEM_PROMPT = """
        You are a prediction agent.
        Use the correct data when making predictions and make sure they are correct.
        Make sure the output is the answer only.
        """.strip()
        ans = call_model_chat_completions(prompt=question, system=SYSTEM_PROMPT, model=MODEL, temperature = 0.0, timeout = 60)
        ans = ans.get('text')
    elif classifaction.lower() == 'planning':
        SYSTEM_PROMPT = """
        You are a planning agent. The problems you are given have to do with planning the order of events.
        You will need to figure out what comes first and how the events fit together given the information provided.
        Make sure the output is only the answer and nothing but the answer.
        """.strip()
        ans = call_model_chat_completions(prompt=question, system=SYSTEM_PROMPT, model=MODEL, temperature = 0.0, timeout = 60)
        ans = ans.get('text')
    else:
        SYSTEM_PROMPT = """
        You are an assistant. Given the input problem, go through it thoroughly.
        Make sure to answer the given problem correctly.
        The output should be the answer only; return only the answer.
        """.strip()
        ans = call_model_chat_completions(prompt=question, system=SYSTEM_PROMPT, model=MODEL, temperature = 0.0, timeout = 60)
        ans = ans.get('text')
    return ans
         
    
    


import json

with open("cse476_final_project_dev_data.json", "r") as f:
    dev_data = json.load(f)

print("Loaded", len(dev_data), "examples.")
print("=== RUNNING AGENT ON DEV DATA ===")

results = []

for i, example in enumerate(dev_data):
    question = example["input"]
    expected = example["output"]
    domain = example["domain"]

    predicted = get_classifaction_prompt(question)

    results.append({
        "index": i,
        "domain": domain,
        "input": question,
        "predicted": predicted,
        "expected": expected
    })

    # Print a preview for each item
    print(f"\nExample {i}: Domain={domain}")
    print("INPUT:", question[:80], "...")
    print("PREDICTED:", predicted)
    print("EXPECTED: ", expected)




Loaded 1000 examples.
=== RUNNING AGENT ON DEV DATA ===
math

Example 0: Domain=math
INPUT: Let $ABCD$ be a convex quadrilateral with $AB = CD = 10$ , $BC = 14$ , and $AD = ...
PREDICTED: quadrilateral
EXPECTED:  112
math

Example 1: Domain=math
INPUT: A tennis player computes her win ratio by dividing the number of matches she has ...
PREDICTED: $
EXPECTED:  164
math

Example 2: Domain=math
INPUT: What is the product of the real roots of the equation $x^2 + 18x + 30 = 2 \sqrt{ ...
PREDICTED: 60
EXPECTED:  20
math

Example 3: Domain=math
INPUT: In $\triangle ABC$ , $AB= 425$ , $BC=450$ , and $AC=510$ . An interior point $P$ ...
PREDICTED: 42
EXPECTED:  306
math

Example 4: Domain=math
INPUT: How many even integers between 4000 and 7000 have four different digits? ...
PREDICTED: all
EXPECTED:  728
math

Example 5: Domain=math
INPUT: For all positive integers $x$ , let \[f(x)=\begin{cases}1 &\mbox{if }x = 1\\ \fr ...
PREDICTED: d(x)
EXPECTED:  511
math

Example 6: Domain=math
INPUT: Eigh

KeyboardInterrupt: 
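
Predictions such as `quadrilateral` and `$` in the run above suggest that taking the last whitespace-separated token of the reply is fragile. One plausible contributing factor (an assumption, not verified here) is that `call_model_chat_completions` sets `max_tokens` to 128, which could cut off a step-by-step solution before the final answer appears. Below is a minimal sketch of a more defensive extractor, assuming the system prompt is changed to request a closing line of the form `FINAL ANSWER: <value>`. That marker and the name `extract_final_answer` are hypothetical.

```python
import re
from typing import Optional

_FINAL = re.compile(r"FINAL ANSWER:\s*(.+)", re.IGNORECASE)
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def extract_final_answer(reply: Optional[str]) -> Optional[str]:
    """Pull the answer out of a model reply, or return None if nothing usable is found."""
    if not reply:
        return None
    # Prefer the last explicit marker, in case the model repeats it.
    marked = _FINAL.findall(reply)
    if marked:
        # Drop a trailing period and surrounding dollar signs or spaces.
        return marked[-1].strip().rstrip(".").strip("$ ")
    # Fallback: last number in the text (reasonable for integer-answer math problems).
    numbers = _NUMBER.findall(reply)
    return numbers[-1] if numbers else None


print(extract_final_answer("3n > 21, so n > 7.\nFINAL ANSWER: 8"))
print(extract_final_answer("Let ABCD be a convex quadrilateral"))
```

Returning `None` for an unusable reply lets the agent decide whether to retry or fall back, instead of recording an arbitrary word as the prediction.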

In [None]:
#Method 2