# Group Project / Assignment 4: Instruction finetuning a Llama-3.2 model
**Assignment due 21 April 11:59pm**

Welcome to the fourth and final assignment for 50.055 Machine Learning Operations. The third and fourth assignments together form the course group project. You will continue the work on a chatbot which can answer questions about SUTD to prospective students.


**This assignment is a group assignment.**

- Read the instructions in this notebook carefully
- Add your solution code and answers in the appropriate places. The questions are marked as **QUESTION:**, the places where you need to add your code and text answers are marked as **ADD YOUR SOLUTION HERE**. The assignment is more open-ended than previous assignments, i.e. you have more freedom how to solve the problem and how to structure your code.
- The completed notebook, including your added code and generated output will be your submission for the assignment.
- The notebook should execute without errors from start to finish when you select "Restart Kernel and Run All Cells..". Please test this before submission.
- Use the SUTD Education Cluster to solve and test the assignment. If you work in another environment, at a minimum test your work on the SUTD Education Cluster.

**Rubric for assessment**

Your submission will be graded using the following criteria.
1. Code executes: your code should execute without errors. The SUTD Education cluster should be used to ensure the same execution environment.
2. Correctness: the code should produce the correct result or the text answer should state the factually correct answer.
3. Style: your code should be written in a way that is clean and efficient. Your text answers should be relevant, concise and easy to understand.
4. Partial marks will be awarded for partially correct solutions.
5. Creativity and innovation: in this assignment you have more freedom to design your solution, compared to the first assignments. You can show off your creativity and innovative mindset.
6. There is a maximum of 310 points for this assignment.

**ChatGPT policy**

If you use AI tools, such as ChatGPT, to solve the assignment questions, you need to be transparent about its use and mark AI-generated content as such. In particular, you should include the following in addition to your final answer:
- A copy or screenshot of the prompt you used
- The name of the AI model
- The AI generated output
- An explanation why the answer is correct or what you had to change to arrive at the correct answer

**Assignment Notes:** Please make sure to save the notebook as you go along. Submission Instructions are located at the bottom of the notebook.



### Finetuning LLMs

The goal of the assignment is to build a more advanced chatbot that can talk to prospective students and answer questions about SUTD.

We will finetune a smaller 1B LLM on question-answer pairs which we synthetically generate. Then we will compare the finetuned and non-finetuned LLMs with and without RAG to see if we were able to improve the SUTD chatbot answer quality.

We'll be leveraging `langchain`, `llama 3.2` and `Google AI Studio with Gemini 2.0`.

Check out the docs:
- [LangChain](https://docs.langchain.com/docs/)
- [Llama 3.2](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/)
- [Google AI Studio](https://aistudio.google.com/)

Note: Google AI Studio provides a lot of free tokens but has certain rate limits. Write your code in a way that it can handle these limits.
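
One common way to cope with rate limits is to retry failed calls with exponential backoff. The block below is a minimal sketch, not part of the assignment template: `call_with_backoff` and its parameters are hypothetical names, and a real client would usually catch its specific rate-limit exception rather than every `Exception`.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # assumption: narrow this to the client's rate-limit error
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            delay += random.uniform(0, 0.1 * delay)  # small jitter to avoid synchronized retries
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

Any LLM call can be wrapped this way, for example `call_with_backoff(lambda: chain.invoke({"category": "fruits"}))`, where `chain` stands for whatever LangChain runnable you build.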

# Install dependencies
Use pip to install all required dependencies of this assignment in the cell below. Make sure to test this on the SUTD cluster as different environments have different software pre-installed.  

In [1]:
# QUESTION: Install and import all required packages
# The rest of your code should execute without any import or dependency errors.

# **--- ADD YOUR SOLUTION HERE (10 points) ---**
!pip install -U langchain langchain-community langchain-google-genai openai
!pip install datasets huggingface_hub pandas scikit-learn
!pip install -U transformers peft accelerate
!pip install --upgrade gradio

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

from openai import OpenAI
import pandas as pd
from datasets import Dataset, DatasetDict
from huggingface_hub import login
from sklearn.model_selection import train_test_split

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, PeftModel, PeftConfig
from datasets import load_dataset

import os
import time
import json
import torch



# Generate training data (Done By Zhang Jianyu)
The first step of the assignment is generating synthetic question-answer pairs which can be used for finetuning an LLM model.
Use Google AI Studio with the Gemini models to create high-quality QA training data.


In [None]:
# !pip install -U langchain langchain-community openai

import os
import json
from langchain_community.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.schema.runnable import RunnableSequence

# Set your OpenAI API Key
os.environ["OPENAI_API_KEY"] = ""

# Prompt template
prompt = PromptTemplate(
    input_variables=["category"],
    template="List 3 items in the category of {category} and return them in a JSON array format. Respond only with the JSON."
)

# OpenAI LLM
llm = ChatOpenAI(model_name="gpt-4", temperature=0.7)

# Use StrOutputParser + json.loads manually
def json_parser(output_str):
    try:
        return json.loads(output_str)
    except json.JSONDecodeError:
        return {"error": "Could not parse JSON", "raw": output_str}

chain = RunnableSequence(prompt | llm | StrOutputParser() | json_parser)

# Run
response = chain.invoke({"category": "fruits"})
print(response)






['Apple', 'Banana', 'Cherry']


## Generate topics
When generating data, it is often helpful to guide the generation process through some hierarchical structure.
Before we create question-answer pairs, let's generate some topics which the questions should be about.



In [6]:
import json
from langchain_community.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.schema.runnable import RunnableSequence

# Set your OpenAI key in the environment (already done above if reused)
# os.environ["OPENAI_API_KEY"] = "your-openai-key-here"

# Define OpenAI LLM
llm = ChatOpenAI(model_name="gpt-4", temperature=0.7)

# Define prompt template
topic_prompt = PromptTemplate(
    input_variables=["num"],
    template=(
        "Generate a list of {num} topics that prospective university students might care about. "
        "Focus especially on those relevant for students considering studying at SUTD. "
        "Return only a JSON array of strings."
    )
)

# Output parser that returns raw string
output_parser = StrOutputParser()

# Create chain
topic_chain = RunnableSequence(topic_prompt | llm | output_parser)

# Define function to use the chain
def generate_topics(n=20):
    """
    Generate a list of `n` topics using OpenAI model.
    Returns:
        list of str: Parsed JSON array of topic strings.
    """
    raw = topic_chain.invoke({"num": n}).strip()

    # Clean markdown formatting if present
    if raw.startswith("```json"):
        raw = raw.replace("```json", "").replace("```", "").strip()

    try:
        topics = json.loads(raw)
        if not isinstance(topics, list) or not all(isinstance(t, str) for t in topics):
            raise ValueError("Generated content is not a list of strings.")
        return topics
    except Exception as e:
        print("Failed to parse response as JSON:", e)
        print("Raw response:", raw)
        return []


In [7]:
# test topic generation
print(generate_topics(3))

["SUTD's Collaboration with MIT and Zhejiang University", "SUTD's Unique Curriculum Structure and Pedagogy", 'Accommodation and Living Conditions at SUTD']


In [8]:
import os
import json

# Define file path
TOPIC_FILE = "topics.json"

def load_or_generate_topics():
    """
    Load a clean list of topics from file if it exists.
    If the file contains invalid formatting (e.g., markdown markers),
    it will be cleaned and parsed properly.
    If the file does not exist or parsing fails, new topics will be generated and saved.
    Returns:
        list of str: A list of topic strings.
    """
    topics = []

    if os.path.exists(TOPIC_FILE):
        with open(TOPIC_FILE, "r", encoding="utf-8") as f:
            raw = f.read()

        # Remove markdown formatting like ```json and ```
        cleaned = raw.replace("```json", "").replace("```", "").strip()

        try:
            topics = json.loads(cleaned)
            if not isinstance(topics, list) or not all(isinstance(t, str) for t in topics):
                raise ValueError("Invalid format: topics must be a list of strings.")
            print("✅ Loaded and cleaned topics from file.")
        except Exception as e:
            print(f"⚠️ Failed to parse {TOPIC_FILE}: {e}")
            topics = []

    if not topics:
        # Generate new topics if file is missing or invalid
        topics = generate_topics(20)
        with open(TOPIC_FILE, "w", encoding="utf-8") as f:
            json.dump(topics, f, ensure_ascii=False, indent=2)
        print("🆕 Generated and saved new topics.")

    return topics

# Example usage
topics = load_or_generate_topics()
print(json.dumps(topics, ensure_ascii=False, indent=2))


✅ Loaded and cleaned topics from file.
[
  "Design-Centric Curriculum",
  "Hands-on Learning Opportunities",
  "Interdisciplinary Programs",
  "Technology and Innovation Ecosystem",
  "Career Prospects in Emerging Fields",
  "Industry Collaboration and Internships",
  "Global Exchange Programs",
  "Research Opportunities for Undergraduates",
  "Scholarships and Financial Aid",
  "Campus Facilities and Resources",
  "Student Life and Community",
  "Entrepreneurship and Startup Incubation",
  "Faculty Expertise and Research Areas",
  "Location and Accessibility",
  "Admissions Requirements and Process",
  "Diversity and Inclusion on Campus",
  "Sustainability Initiatives",
  "Student Support Services (e.g., counseling, academic advising)",
  "Alumni Network and Mentorship",
  "Focus on Digital Manufacturing and Design"
]


## Generate questions (Done By Zhang Jianyu)
Now generate a set of questions about each topic

In [None]:
# QUESTION: Create a function 'generate_questions' which generates questions about a given topic.
# Generate a list of 10 questions per topic. In total you should have 200 questions.
#

#--- ADD YOUR SOLUTION HERE (20 points)---

# Define prompt template for question generation
question_prompt = PromptTemplate(
    input_variables=["topic", "num"],
    template=(
        "Generate {num} questions that a prospective university student might ask about the topic '{topic}'. "
        "Focus on aspects relevant to students considering studying at SUTD. "
        "Return only a JSON array of strings, no explanations or extra text."
    )
)

# Create runnable chain
question_chain = question_prompt | llm

# Define function to generate questions
def generate_questions(topic, n=10):
    """Generate `n` questions about a given topic using the LLM configured above"""
    response = question_chain.invoke({"topic": topic, "num": n})
    try:
        return json.loads(response.content)  # convert JSON string to Python list
    except json.JSONDecodeError:
        print("JSON decode failed. Raw output:")
        print(response.content)
        return []

In [None]:
# test it
print(generate_questions("Academic Reputation and Program Quality", 3))


["How does SUTD's interdisciplinary curriculum contribute to its academic reputation and the quality of its programs, specifically in terms of preparing graduates for emerging industries?", 'What metrics or rankings does SUTD use to assess and maintain the quality of its academic programs, and how do these compare to other leading universities in Singapore and globally?', 'Beyond coursework, what opportunities are available at SUTD for undergraduate students to engage in research, design projects, or other activities that enhance the academic rigor and reputation of their chosen program?']


In [10]:
# QUESTION: Now let's put it together and generate 10 questions for each topic. Save the questions in a local file.

#--- ADD YOUR SOLUTION HERE (20 points)---

# Imports for the question-generation chain
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.schema.runnable import RunnableSequence

# Define prompt template for question generation
question_prompt = PromptTemplate(
    input_variables=["topic", "num"],
    template=(
        "Generate {num} questions that a prospective university student might ask about the topic '{topic}'. "
        "Focus on aspects relevant to students considering studying at SUTD. "
        "Return only a JSON array of strings, no explanations or extra text."
    )
)

# Define output parser
output_parser = StrOutputParser()

# Build chain with OpenAI LLM
question_chain = RunnableSequence(question_prompt | llm | output_parser)

# Define function to generate questions
def generate_questions(topic, n=10):
    """Generate `n` questions about a given topic using OpenAI"""
    raw = question_chain.invoke({"topic": topic, "num": n}).strip()

    # Handle markdown-wrapped JSON
    if raw.startswith("```json"):
        raw = raw.replace("```json", "").replace("```", "").strip()

    try:
        questions = json.loads(raw)
        if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
            raise ValueError("Response is not a list of strings.")
        return questions
    except Exception as e:
        print("⚠️ Failed to parse questions as JSON:", e)
        print("Raw output:\n", raw)
        return []
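
The cell above defines `generate_questions` but does not show the loop that produces the questions for every topic and writes them to disk. The following is a minimal sketch of that step. It assumes the output file is `questions.json` and that it maps each topic to a list of questions, since that is the layout the answer-generation cell reads. `build_question_set` is a hypothetical helper name.

```python
import json
import os

QUESTION_FILE = "questions.json"  # assumed name, matching the file read in the answer cell


def build_question_set(topics, generate_fn, n=10, path=QUESTION_FILE):
    """Return {topic: [questions]}, reusing cached topics from `path`."""
    questions = {}
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            questions = json.load(f)

    for topic in topics:
        if questions.get(topic):  # already generated in an earlier run
            continue
        generated = generate_fn(topic, n)
        if generated:  # skip topics whose LLM call failed (empty list)
            questions[topic] = generated
            # Save after every topic so a crash does not lose earlier work
            with open(path, "w", encoding="utf-8") as f:
                json.dump(questions, f, ensure_ascii=False, indent=2)
    return questions
```

In the notebook this would be called as `questions = build_question_set(topics, generate_questions)`. Topics that failed are simply absent from the result, so downstream code should look them up with `questions.get(topic, [])` rather than `questions[topic]`.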


## Generate Answers (Done By Liu Yu)

Now create answers for the questions.

You can use the Google AI Studio Gemini model (assuming that they are good enough to generate good answers), your RAG system from assignment 3 or any other method you choose to generate answers for your question dataset.

Note: it is normal that some LLM calls fail, even with retry, so you may end up with fewer than 200 QA pairs, but you should have at least 160 QA pairs.

In [None]:
# QUESTION: Generate answers to all your questions using Gemini, your SUTD RAG system or any other method.
# Split your dataset into 80% training and 20% test dataset.
# Store all questions and answer pairs in a huggingface dataset `sutd_qa_dataset` and push it to your Huggingface hub.

#--- ADD YOUR SOLUTION HERE (40 points)---

# Set API keys
client = OpenAI(api_key="")
login("")  

# Load question data
with open("questions.json") as qf:
    questions_data = json.load(qf)

with open("topics.json") as tf:
    topics = json.load(tf)

qa_pairs = []
for topic in topics:
    for question in questions_data[topic]:
        qa_pairs.append({"topic": topic, "question": question})

df = pd.DataFrame(qa_pairs)

# Generate answers using OpenAI
answers = []
for i, row in df.iterrows():
    prompt = f"Answer this question as if you are a SUTD admissions or faculty staff. Limit your response to **no more than 100 words**:\n\n{row['question']}"
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful assistant for SUTD university answering prospective student questions."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7
        )
        answer = response.choices[0].message.content.strip()
        print(f"[{i}] {row['question']}\n→ {answer}\n")
    except Exception as e:
        answer = None
        print(f"❌ Failed at {i}: {e}")
    answers.append(answer)
    time.sleep(1.5)

# Drop questions whose answer generation failed
df["answer"] = answers
df = df[df["answer"].notnull()].reset_index(drop=True)

# Split into 80% training and 20% test data
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Convert to Hugging Face datasets (drop the shuffled pandas index column)
train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
test_dataset = Dataset.from_pandas(test_df, preserve_index=False)

qa_dataset = DatasetDict({
    "train": train_dataset,
    "test": test_dataset
})

# Push to the Hugging Face Hub
qa_dataset.push_to_hub("")

# access the dataset via: ""

[0] How does SUTD's design-centric curriculum differ from traditional engineering programs?
→ SUTD's design-centric curriculum integrates engineering principles with design thinking, emphasizing creativity, innovation, and human-centered solutions. Students learn to approach problems holistically, considering technical, social, and environmental aspects. This approach fosters interdisciplinary collaboration and equips students with the skills to address complex real-world challenges effectively. Traditional engineering programs often focus primarily on technical skills and theory. At SUTD, students not only gain technical expertise but also develop a deep understanding of the impact of their designs on society and the environment.

[1] What specific design thinking methodologies are taught and practiced throughout the curriculum?
→ At SUTD, we integrate design thinking methodologies such as user-centered design, prototyping, and iteration across various courses. Our curriculum emphasiz


CommitInfo(commit_url='https://huggingface.co/datasets/DrakeLLLLLLL/sutd_qa_dataset/commit/7dedeb9197fd33cbf14ea80fae070a200d79a7bd', commit_message='Upload dataset', commit_description='', oid='7dedeb9197fd33cbf14ea80fae070a200d79a7bd', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/DrakeLLLLLLL/sutd_qa_dataset', endpoint='https://huggingface.co', repo_type='dataset', repo_id='DrakeLLLLLLL/sutd_qa_dataset'), pr_revision=None, pr_num=None)

In [13]:
# test the chain

# Define the generation function
def generate_answer(question):
    prompt = f"You are a helpful assistant for SUTD admissions. Answer the following question in less than 100 words:\n{question}"

    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are an expert in SUTD university matters."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7
        )
        answer_text = response.choices[0].message.content.strip()
        return {"answer": answer_text}

    except Exception as e:
        print(f"Error: {e}")
        return {"answer": "Unable to generate an answer at this time."}

question = "When was SUTD founded?"

# Now run the answer generation chain
response = generate_answer(question)
print("\nModel Response:")
print(response["answer"])


Model Response:
SUTD, also known as the Singapore University of Technology and Design, was founded in collaboration with the Massachusetts Institute of Technology (MIT) on 30 October 2009.


# Finetune Llama 3.2 1B model

Now use the training split of your SUTD QA dataset to finetune a smaller Llama 3.2 1B LLM using parameter-efficient finetuning (PEFT).
We recommend the unsloth library but you are free to choose other frameworks. You can decide the parameters for the finetuning.
Push your finetuned model to Huggingface.
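
To get a feel for what "parameter-efficient" means, the sketch below counts the trainable weights a LoRA adapter would add. It assumes rank 8 applied to the query and value projections, and it assumes the Llama 3.2 1B dimensions listed in the constants (hidden size 2048, 16 layers, 8 key/value heads of dimension 64). Verify these against the model's `config.json` before relying on the number.

```python
# Assumed Llama 3.2 1B dimensions (check the model's config.json)
HIDDEN_SIZE = 2048
NUM_LAYERS = 16
NUM_KV_HEADS = 8
HEAD_DIM = 64


def lora_params(in_features, out_features, r):
    """LoRA adds two matrices per adapted layer: A (r x in) and B (out x r)."""
    return r * (in_features + out_features)


r = 8
q_proj = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, r)              # 2048 -> 2048
v_proj = lora_params(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, r)  # 2048 -> 512
trainable = NUM_LAYERS * (q_proj + v_proj)
print(f"Trainable LoRA parameters: {trainable:,}")
```

Under these assumptions the adapter has well under one million trainable parameters, a small fraction of the roughly one billion weights in the frozen base model.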

Then we will compare the finetuned and non-finetuned LLMs with and without RAG to see if we were able to improve the SUTD chatbot answer quality.


In [2]:
import torch
print(torch.cuda.is_available())  # Should return True

True


In [None]:
# log in to hugging face repo
login("")

In [4]:
print(f"Available GPU memory: {torch.cuda.get_device_properties(0).total_memory/1e9:.2f}GB")

Available GPU memory: 23.80GB


In [5]:
torch.cuda.empty_cache()  # release cached GPU memory before training

In [None]:
# Install dependencies
!pip install --no-cache-dir bitsandbytes
!pip install -U peft transformers datasets accelerate

# Import required libraries
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig,
)
from peft import get_peft_model, LoraConfig
import torch

# HF Hub details
model_name = "meta-llama/Llama-3.2-1B"
hf_model_id = ""
hf_token = ""  # 

# Step 1: Load dataset
dataset = load_dataset("dataset") ##import dataset
train_data = dataset["train"]

# Step 2: Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

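# Step 3: BitsAndBytesConfig for 8-bit quantized loading
# NOTE: this config is not passed to from_pretrained below, so the model is
# loaded in plain fp16. To load in 8-bit, pass quantization_config=bnb_config.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=True
)

# Step 4: Load model in fp16 (not quantized)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)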

# Step 5: Apply LoRA config
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)

# Step 6: Tokenize dataset
def format_and_tokenize(example):
    prompt = f"### Question: {example['question']}\n### Answer: {example['answer']}"
    tokenized = tokenizer(
        prompt,
        truncation=True,
        padding="max_length",
        max_length=512,
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized_dataset = train_data.map(format_and_tokenize)

# Step 7: Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    warmup_steps=10,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    output_dir="./finetuned-llama3-sutd",
    save_strategy="epoch",
    push_to_hub=True,
    hub_model_id=hf_model_id,
    hub_token=hf_token,
)

# Step 8: Train using Hugging Face Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Step 9: Launch training + Push to Hub
trainer.train()
trainer.push_to_hub()






Step,Training Loss
10,1.9362
20,1.6999
30,1.4781
40,1.4065
50,1.351
60,1.2993
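
The 60 logged steps are consistent with the training arguments. The short calculation below shows how the step count follows from them. It assumes a training split of 160 examples and a single GPU; both are assumptions for illustration.

```python
import math

num_examples = 160               # assumed size of the training split
per_device_batch_size = 4        # per_device_train_batch_size
gradient_accumulation_steps = 2
num_epochs = 3                   # num_train_epochs

batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_steps = updates_per_epoch * num_epochs
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

print(f"Optimizer steps: {total_steps}, effective batch size: {effective_batch_size}")
```

With `logging_steps=10`, this gives one loss value every 10 optimizer steps, i.e. every 80 training examples.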


CommitInfo(commit_url='https://huggingface.co/ayupermhm/llama-3.2-1B-sutdqa/commit/385099c910ec0a91cf05639827ea7adf47c985cc', commit_message='End of training', commit_description='', oid='385099c910ec0a91cf05639827ea7adf47c985cc', pr_url=None, repo_url=RepoUrl('https://huggingface.co/ayupermhm/llama-3.2-1B-sutdqa', endpoint='https://huggingface.co', repo_type='model', repo_id='ayupermhm/llama-3.2-1B-sutdqa'), pr_revision=None, pr_num=None)

In [8]:
# Train and push to Hub
trainer.train()
trainer.push_to_hub()
tokenizer.push_to_hub("ayupermhm/llama-3.2-1B-sutdqa")

Step,Training Loss
10,1.2928
20,1.2708
30,1.1929
40,1.2065
50,1.168
60,1.1333


No files have been modified since last commit. Skipping to prevent empty commit.




CommitInfo(commit_url='https://huggingface.co/ayupermhm/llama-3.2-1B-sutdqa/commit/577d52dd9f9b5c11c109725357429f35d7307a7b', commit_message='Upload tokenizer', commit_description='', oid='577d52dd9f9b5c11c109725357429f35d7307a7b', pr_url=None, repo_url=RepoUrl('https://huggingface.co/ayupermhm/llama-3.2-1B-sutdqa', endpoint='https://huggingface.co', repo_type='model', repo_id='ayupermhm/llama-3.2-1B-sutdqa'), pr_revision=None, pr_num=None)

In [9]:
# QUESTION: Load a non-finetuned Llama 3.2 1B model and your finetuned SUTD QA Llama 3.2 1B model
# Ask it a simple test question (e.g. "What is special about SUTD?") to check that both models can generate answers

#--- ADD YOUR SOLUTION HERE (10 points)---

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load base model (device_map="auto" places the weights on the GPU when one is available)
base_model_id = "meta-llama/Llama-3.2-1B"
base_tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load finetuned model
finetuned_model_id = "ayupermhm/llama-3.2-1B-sutdqa"
finetuned_tokenizer = AutoTokenizer.from_pretrained(finetuned_model_id)

# Load a second copy of the base model and attach the LoRA adapter to it
peft_config = PeftConfig.from_pretrained(finetuned_model_id)
base_model_for_peft = AutoModelForCausalLM.from_pretrained(
    peft_config.base_model_name_or_path,
    torch_dtype=torch.float16,
    device_map="auto",
)

finetuned_model = PeftModel.from_pretrained(
    base_model_for_peft,
    finetuned_model_id,
    torch_dtype=torch.float16,
)

# Generation functions with device handling
def generate_base(prompt):
    inputs = base_tokenizer(prompt, return_tensors="pt").to(device)  # Move inputs to device
    with torch.inference_mode():
        outputs = base_model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.2,
        )
    return base_tokenizer.decode(outputs[0], skip_special_tokens=True)

def generate_finetuned(prompt):
    inputs = finetuned_tokenizer(prompt, return_tensors="pt").to(device)  # Move inputs to device
    with torch.inference_mode():
        outputs = finetuned_model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.2,
        )
    return finetuned_tokenizer.decode(outputs[0], skip_special_tokens=True)


In [11]:
# try out the llms

query = "What is special about SUTD?"

print("Question:", query)
response_base = generate_base(query)
print("Answer base:", response_base)

print("---------")

# Generate response from finetuned model
response_finetune = generate_finetuned(query)
print("Answer finetune:", response_finetune)


Question: What is special about SUTD?



Answer base: What is special about SUTD? It’s an idea of a new school that puts innovation, creativity and design at the heart of its education. This year marks our 10th anniversary as we continue to develop this vision.
In fact, in some ways you could say it was born out of necessity — or maybe even destiny! As Singapore grows into one of Asia's most dynamic economies, there are increasing demands for talent who have both technical expertise and creative flair.
With growing globalisation and technological advances shaping how businesses operate across industries today, companies need employees with these two skills sets more than ever before!
As such, many universities throughout Asia now offer courses which combine theoretical knowledge from engineering disciplines like civil & structural engineering; mechanical engineering etc., alongside practical application through hands-on projects involving fabrication techniques such as metal forming processes (e.g casting), machining operatio
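
The training examples were formatted as `### Question: ...\n### Answer: ...`, while the test query above is passed to the models as raw text. One option worth trying is to wrap the query in the same template at inference time and keep only the text after the answer marker. The helpers below are a hypothetical sketch of that idea, not something the assignment prescribes, and whether it improves answers has to be checked empirically.

```python
PROMPT_TEMPLATE = "### Question: {question}\n### Answer:"


def build_prompt(question):
    """Wrap a raw question in the template used for the finetuning examples."""
    return PROMPT_TEMPLATE.format(question=question.strip())


def extract_answer(generated_text):
    """Keep the text after the first answer marker, up to any follow-up question."""
    answer = generated_text.split("### Answer:", 1)[-1]
    answer = answer.split("### Question:", 1)[0]
    return answer.strip()
```

For example, `extract_answer(generate_finetuned(build_prompt(query)))` would return only the model's answer instead of the prompt echoed together with the continuation.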

# Integrate and evaluate

Now integrate both the non-finetuned Llama 3.2 1B model and your finetuned model into your SUTD chatbot RAG system.
Generate responses to the 20 questions you collected in assignment 3 using these 4 approaches:
1. non-finetuned Llama 3.2 1B model without RAG
2. finetuned Llama 3.2 1B SUTD QA model without RAG
3. non-finetuned Llama 3.2 1B model with RAG
4. finetuned Llama 3.2 1B SUTD QA model with RAG

Compare the responses and decide which system produces the most accurate, highest-quality responses.
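
The four configurations can be produced by one loop over two generators and two prompt styles. The sketch below shows that structure under stated assumptions: `build_rag_prompt`, `run_comparison` and the prompt wording are hypothetical, and the generators and retriever are simple stand-ins so that the block runs on its own.

```python
def build_rag_prompt(question, context_chunks):
    """Prepend retrieved context to the question."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def run_comparison(questions, generators, retrieve):
    """Return one row per question with an answer from each of the four systems."""
    rows = []
    for question in questions:
        rag_prompt = build_rag_prompt(question, retrieve(question))
        row = {"question": question}
        for name, generate in generators.items():
            row[f"{name} (no RAG)"] = generate(question)
            row[f"{name} (RAG)"] = generate(rag_prompt)
        rows.append(row)
    return rows


# Stand-ins so the sketch is runnable; swap in the real models and retriever.
generators = {
    "base": lambda prompt: f"[base] answer to a {len(prompt)}-character prompt",
    "finetuned": lambda prompt: f"[finetuned] answer to a {len(prompt)}-character prompt",
}
retrieve = lambda question: ["SUTD context chunk 1", "SUTD context chunk 2"]

rows = run_comparison(["What is special about SUTD?"], generators, retrieve)
for key, value in rows[0].items():
    print(f"{key}: {value}")
```

In the notebook, the stand-ins could be replaced by the actual generation functions for the two models and by a retriever built on the FAISS index, for instance one that returns the `page_content` of the documents from `vectorstore.similarity_search(question, k=3)`. The resulting rows can be loaded into a `pandas.DataFrame` for side-by-side comparison.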

In [17]:
!pip install faiss-cpu

# Import
import os
import json
import torch
from dotenv import load_dotenv
from tqdm import tqdm
from langchain.document_loaders import PyPDFLoader

# Langchain & data loaders
from langchain.document_loaders import UnstructuredURLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Embedding + vector store
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Model
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# LangChain for RAG
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline

from langchain.prompts import PromptTemplate
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import re

from IPython.display import HTML, display



In [19]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.load_local("sutd_faiss_index", embeddings, allow_dangerous_deserialization=True)
print("Success! Vectorstore loaded 🎉")

Success! Vectorstore loaded 🎉


In [20]:
# QUESTION: Re-create the RAG chatbot system you have created in assignment 3 but with the Llama 3.2 1B (non-tuned and finetuned) models

#--- ADD YOUR SOLUTION HERE (40 points)---
# Load questions from assignment 3
test_questions = ["What are the admissions deadlines for SUTD?",
             "Is there financial aid available?",
             "What is the minimum score for the Mother Tongue Language?",
             "Do I require reference letters?",
             "Can polytechnic diploma students apply?",
             "Do I need SAT score?",
             "How many PhD students does SUTD have?",
             "How much are the tuition fees for Singaporeans?",
             "How much are the tuition fees for international students?",
             "Is there a minimum CAP?"
             ]
test_questions.extend([ "What is the difference between CSD in SUTD and Computer Science major in nus?",
                  "What is the teaching style like in SUTD?",
                  "What are the possible career path for ESD pillar?",
                  "How heavy are the project workload like in SUTD?",
                  "How does the employment rates of SUTD compare to other universities like NUS and NTU?",
                  "Who can I approach if I have more questions about SUTD?",
                  "What is the difference between CSD and DAI?",
                  "Can you tell me about the admission process?",
                  "How can I prepare for the admission process?",
                  "What is the meaning of pillar?"
])

# Check CUDA availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

def load_vectorstore():
    """Load or create vector store for our documents"""
    if os.path.exists("sutd_faiss_index"):
        embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
        vectorstore = FAISS.load_local("sutd_faiss_index", embedding_model, allow_dangerous_deserialization=True)
        print("Loaded existing FAISS index")
    else:
        print("Loading documents from scratch...")
        raise FileNotFoundError("Please run document loading from assignment 3 first")

    return vectorstore

# 2. Set up retriever
vectorstore = load_vectorstore()
retriever = vectorstore.as_retriever(search_kwargs={"k": 7})

# 3. Define generation functions for all 4 approaches
def format_docs(docs):
    """Format retrieved documents into context string"""
    return "\n\n".join(doc.page_content for doc in docs)

# For non-finetuned model without RAG
def generate_base_no_rag(question):
    """Generate response using base model without RAG"""
    prompt = f"You are a helpful assistant for SUTD university answering questions from prospective students. Answer the following question in a helpful and friendly way:\n\nQuestion: {question}\n\nAnswer:"
    inputs = base_tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
    outputs = base_model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,
    )
    response = base_tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract just the answer part
    try:
        answer = response.split("Answer:")[1].strip()
    except IndexError:
        answer = response
    return answer

# For finetuned model without RAG
def generate_finetuned_no_rag(question):
    """Generate response using finetuned model without RAG"""
    prompt =  f"You are a helpful assistant for SUTD university answering questions from prospective students. Answer the following question in a helpful and friendly way:\n\nQuestion: {question}\n\nAnswer:"
    inputs = finetuned_tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
    outputs = finetuned_model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,
    )
    response = finetuned_tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract just the answer part
    try:
        answer = response.split("Answer:")[1].strip()
    except IndexError:
        answer = response
    return answer

# For base model with RAG
def generate_base_with_rag(question):
    """Generate response using base model with RAG"""
    docs = retriever.invoke(question)  # get_relevant_documents is deprecated
    context = format_docs(docs)

    prompt = f"""You are a helpful assistant for SUTD university answering questions from prospective students.
Use the following context information to answer the question. If you don't know the answer based on the provided context, say that you don't have enough information.

Context:
{context}

Question: {question}

Answer:"""

    inputs = base_tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
    outputs = base_model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,
    )
    # Decode only the newly generated tokens: the retrieved context may itself
    # contain "Answer:", so splitting the full text on it is unreliable
    prompt_len = inputs["input_ids"].shape[1]
    answer = base_tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()
    return answer, docs

# For finetuned model with RAG
def generate_finetuned_with_rag(question):
    """Generate response using finetuned model with RAG"""
    docs = retriever.invoke(question)  # get_relevant_documents is deprecated
    context = format_docs(docs)

    prompt = f"""You are a helpful assistant for SUTD university answering questions from prospective students.
Use the following context information to answer the question. If you don't know the answer based on the provided context, say that you don't have enough information.

Context:
{context}

Question: {question}

Answer:"""

    inputs = finetuned_tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
    outputs = finetuned_model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,
    )
    # Decode only the newly generated tokens: the retrieved context may itself
    # contain "Answer:", so splitting the full text on it is unreliable
    prompt_len = inputs["input_ids"].shape[1]
    answer = finetuned_tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()
    return answer, docs

#Evaluate each approach using the test questions
def evaluate_models():
    results = []

    for i, question in enumerate(test_questions):
        print(f"Processing question {i+1}/{len(test_questions)}: {question}")

        #Base model without RAG
        ans1 = generate_base_no_rag(question)

        #Finetuned model without RAG
        ans2 = generate_finetuned_no_rag(question)

        #Base model with RAG
        ans3, docs3 = generate_base_with_rag(question)

        #Finetuned model with RAG
        ans4, docs4 = generate_finetuned_with_rag(question)

        results.append({
            "question": question,
            "base_no_rag": ans1,
            "finetuned_no_rag": ans2,
            "base_with_rag": ans3,
            "finetuned_with_rag": ans4,
            "retrieved_docs": [doc.page_content[:150] + "..." for doc in docs3]  # Just use docs from base+RAG for reference
        })

        time.sleep(2)

    results_df = pd.DataFrame(results)
    results_df.to_csv("model_comparison_results.csv", index=False)

    return results_df

# Visualize and compare results
def display_result_comparison(results_df):
    """Display pretty HTML comparison of results"""
    for i, row in results_df.iterrows():
        display(HTML(f"""
        <div style="border: 1px solid #ddd; padding: 10px; margin-bottom: 20px; border-radius: 5px;">
            <h3 style="color: #2c3e50;">Question {i+1}: {row['question']}</h3>

            <div style="margin-top: 10px;">
                <h4 style="color: #3498db;">Base Model (No RAG)</h4>
                <p style="padding: 10px; background-color: #f8f9fa; border-radius: 5px;">{row['base_no_rag']}</p>
            </div>

            <div style="margin-top: 10px;">
                <h4 style="color: #2ecc71;">Finetuned Model (No RAG)</h4>
                <p style="padding: 10px; background-color: #f8f9fa; border-radius: 5px;">{row['finetuned_no_rag']}</p>
            </div>

            <div style="margin-top: 10px;">
                <h4 style="color: #e74c3c;">Base Model with RAG</h4>
                <p style="padding: 10px; background-color: #f8f9fa; border-radius: 5px;">{row['base_with_rag']}</p>
            </div>

            <div style="margin-top: 10px;">
                <h4 style="color: #9b59b6;">Finetuned Model with RAG</h4>
                <p style="padding: 10px; background-color: #f8f9fa; border-radius: 5px;">{row['finetuned_with_rag']}</p>
            </div>

            <div style="margin-top: 10px;">
                <h4 style="color: #34495e;">Retrieved Context (First Snippet)</h4>
                <p style="padding: 10px; background-color: #f8f9fa; border-radius: 5px; font-size: 0.8em;">{row['retrieved_docs'][0] if len(row['retrieved_docs']) > 0 else 'No context'}</p>
            </div>
        </div>
        """))

# Run evaluation
print("Starting evaluation of all 4 approaches...")
results_df = evaluate_models()
display_result_comparison(results_df)

Using device: cuda


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Loaded existing FAISS index
Starting evaluation of all 4 approaches...
Processing question 1/20: What are the admissions deadlines for SUTD?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  docs = retriever.get_relevant_documents(question)
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 2/20: Is there financial aid available?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 3/20: What is the minimum score for the Mother Tongue Language?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 4/20: Do I require reference letters?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 5/20: Can polytechnic diploma students apply?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 6/20: Do I need SAT score?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 7/20: How many PhD students does SUTD have?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 8/20: How much are the tuition fees for Singaporeans?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 9/20: How much are the tuition fees for international students?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 10/20: Is there a minimum CAP?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 11/20: What is the difference between CSD in SUTD and Computer Science major in nus?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 12/20: What is the teaching style like in SUTD?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 13/20: What are the possible career path for ESD pillar?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 14/20: How heavy are the project workload like in SUTD?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 15/20: How does the employment rates of SUTD compare to other universities like NUS and NTU?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 16/20: Who can I approach if I have more questions about SUTD?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 17/20: What is the difference between CSD and DAI?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 18/20: Can you tell me about the admission process?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 19/20: How can I prepare for the admission process?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Processing question 20/20: What is the meaning of pillar?


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
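
One detail to watch when the results are saved to CSV: every cell is stored as text, so a list-valued column such as `retrieved_docs` comes back as a single string rather than a list. The sketch below reproduces the effect with the standard library only and shows two ways around it; the sample row is made up for illustration.

```python
import ast
import csv
import io
import json

# Made-up row with the same shape as one entry of the results list.
row = {"question": "Is there a minimum CAP?",
       "retrieved_docs": ["chunk one...", "chunk two..."]}

# Write the row to an in-memory CSV and read it back.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buf.seek(0)
loaded = next(csv.DictReader(buf))
print(type(loaded["retrieved_docs"]))  # the list is now a string

# Option 1: parse the string representation back into a list.
docs = ast.literal_eval(loaded["retrieved_docs"])

# Option 2: store the results as JSON, which keeps the list structure.
restored = json.loads(json.dumps(row))
print(docs == restored["retrieved_docs"])
```

Iterating over the unparsed string would yield single characters instead of document snippets, which matters for any later step that loops over the retrieved documents.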


# Bonus points: LLM-as-judge evaluation (RIO)

Implement an LLM-as-judge pipeline to assess the quality of the different systems (finetuned vs. non-finetuned, RAG vs. no RAG).
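
A judge model replies in free text, so its scores need to be parsed defensively: a reply such as `Accuracy: 4/5` or `Relevant: [5]` would break a plain `int(...)` conversion. The following is a minimal sketch of a more tolerant parser, assuming the judge is asked for `Accuracy`, `Relevant`, `Grounded`, and `Comments` lines; `parse_judge_output` and `SCORE_RE` are hypothetical names, not part of any library.

```python
import re

# One score line: a metric name, a colon, an optional "[", then a digit from 1 to 5.
SCORE_RE = re.compile(r"^(Accuracy|Relevant|Grounded):\s*\[?([1-5])\b", re.MULTILINE)

def parse_judge_output(text):
    """Extract the three 1-5 scores and the comments; return None if a score is missing."""
    scores = {name: int(value) for name, value in SCORE_RE.findall(text)}
    if set(scores) != {"Accuracy", "Relevant", "Grounded"}:
        return None
    match = re.search(r"^Comments:\s*(.*)", text, re.MULTILINE | re.DOTALL)
    scores["Comments"] = match.group(1).strip() if match else ""
    return scores

sample = "Accuracy: 4/5\nRelevant: [5]\nGrounded: 3\nComments: Mostly correct."
print(parse_judge_output(sample))
```

Returning `None` for an unparseable reply, instead of a score of 0, keeps failed evaluations distinguishable from genuinely poor answers.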



In [None]:
# QUESTION: Implement an LLM-as-judge pipeline to assess the quality of the different systems (finetuned vs. non-finetuned, RAG vs. no RAG)
import pandas as pd
from openai import OpenAI
import time


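# Initialize the OpenAI client; with no arguments it reads the key from the
# OPENAI_API_KEY environment variable, so the key never lands in the notebook
client = OpenAI()

# Build the judge prompt, call the judge model, and parse its scores
def evaluate_response(question, answer, retrieved_docs):
    """Score one response with the LLM judge, using a prompt with 2-shot examples"""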

    # Two-shot examples
    example1 = """
    Example 1:
    Question: "What is the difference between CSD in SUTD and Computer Science major in NUS?"
    Answer: "Unfortunately, I don't have enough information about the NUS Computer Science program to provide an accurate comparison between the two programs. However, according to the text, the main difference lies in the fact that the CSD at SUTD integrates traditional disciplines of Computer Science and Information Systems, whereas it is unclear what specific differences exist between the NUS Computer Science program and the CSD program at SUTD.
    However, according to the SUTD website, NUS offers a Bachelor of Science (Computer Science) or a Bachelor of Engineering (Computer Science).
    The SUTD CSD programme focuses more on the design of software and hardware systems interacting with humans, unlike the traditional focus of many CS programmes on algorithms and theory.
    Thus, while the exact differences remain unclear without further information, SUTD's CSD appears to be more focused on practical applications of computing"
    Accuracy: 3
    Relevant: 4
    Grounded: 5
    Comments: This is not as accurate as it does not provide the factual answer to the question. However it is somewhat relevant as it provides context and information that could pertain to the actual factual answer. This is grounded as it follows the documentation query in RAG.
    """

    example2 = """
    Example 2:
    Question: "What is the meaning of pillar? Give a precise answer in 3-5 sentences."
    Answer: "The 'pillar' refers to one of the main areas of focus within SUTD's academic structure. It represents a specific domain or discipline that comprises various academic programs, research initiatives, and faculty members. In this case, the Information and Computing Technologies Standards (ISTD) pillar focuses on the intersection of technology and society, encompassing aspects such as human-centered design, data science, artificial intelligence, and more. Each pillar has its unique goals, objectives, and research directions, contributing to the overall diversity and richness of SUTD's academic landscape. By supporting pillars like ISTD, SUTD aims to foster innovation and creativity in the field of information and computing technologies."
    Accuracy: 5
    Relevant: 5
    Grounded: 5
    Comments: This is factually right. It is totally relevant to the question asked. It is based on the document query.
    """

    # Format retrieved docs for context
    context = "\n".join([f"- {doc}" for doc in retrieved_docs])

    prompt = f"""You are an expert evaluator for SUTD university chatbot responses. Evaluate the following response based on:
1. Accuracy (1-5): Factual correctness about SUTD (5=perfect, 1=completely wrong)
2. Relevant (1-5): How directly this answers the question (5=perfect focus, 1=irrelevant)
3. Grounded (1-5): How specific/well-supported (5=with sources/details, 1=vague/unsupported). Check similarity to retrieved docs.

{example1}
{example2}

Now evaluate this response:

Question: "{question}"
Answer: "{answer}"

Context from retrieved documents:
{context}

Provide your evaluation in the following format:
Accuracy: [1-5]
Relevant: [1-5]
Grounded: [1-5]
Comments: [Your analysis of why you gave these scores, comparing to context when appropriate]"""

    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are an expert evaluator of university chatbot responses."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3
        )
        evaluation_text = response.choices[0].message.content.strip()

        eval_dict = {
            "Accuracy": int(evaluation_text.split("Accuracy:")[1].split()[0]),
            "Relevant": int(evaluation_text.split("Relevant:")[1].split()[0]),
            "Grounded": int(evaluation_text.split("Grounded:")[1].split()[0]),
            "Comments": evaluation_text.split("Comments:")[1].strip()
        }

        return eval_dict

    except Exception as e:
        print(f"Error evaluating response: {e}")
        return {
            "Accuracy": 0,
            "Relevant": 0,
            "Grounded": 0,
            "Comments": "Evaluation failed"
        }

def evaluate_all_models(results_df):
    """Evaluate all responses in the results dataframe"""
    evaluations = []

    for _, row in results_df.iterrows():
        print(f"Evaluating responses for question: {row['question'][:50]}...")

        # Evaluate each of the 4 model responses
        eval_base_no_rag = evaluate_response(row['question'], row['base_no_rag'], row['retrieved_docs'])
        time.sleep(2)  # Rate limiting

        eval_finetuned_no_rag = evaluate_response(row['question'], row['finetuned_no_rag'], row['retrieved_docs'])
        time.sleep(2)

        eval_base_with_rag = evaluate_response(row['question'], row['base_with_rag'], row['retrieved_docs'])
        time.sleep(2)

        eval_finetuned_with_rag = evaluate_response(row['question'], row['finetuned_with_rag'], row['retrieved_docs'])
        time.sleep(2)

        # Add evaluations to results
        evaluations.append({
            "question": row['question'],
            "base_no_rag_accuracy": eval_base_no_rag["Accuracy"],
            "base_no_rag_relevant": eval_base_no_rag["Relevant"],
            "base_no_rag_grounded": eval_base_no_rag["Grounded"],
            "base_no_rag_comments": eval_base_no_rag["Comments"],

            "finetuned_no_rag_accuracy": eval_finetuned_no_rag["Accuracy"],
            "finetuned_no_rag_relevant": eval_finetuned_no_rag["Relevant"],
            "finetuned_no_rag_grounded": eval_finetuned_no_rag["Grounded"],
            "finetuned_no_rag_comments": eval_finetuned_no_rag["Comments"],

            "base_with_rag_accuracy": eval_base_with_rag["Accuracy"],
            "base_with_rag_relevant": eval_base_with_rag["Relevant"],
            "base_with_rag_grounded": eval_base_with_rag["Grounded"],
            "base_with_rag_comments": eval_base_with_rag["Comments"],

            "finetuned_with_rag_accuracy": eval_finetuned_with_rag["Accuracy"],
            "finetuned_with_rag_relevant": eval_finetuned_with_rag["Relevant"],
            "finetuned_with_rag_grounded": eval_finetuned_with_rag["Grounded"],
            "finetuned_with_rag_comments": eval_finetuned_with_rag["Comments"],
        })

    evaluation_df = pd.DataFrame(evaluations)
    evaluation_df.to_csv("llm_judge_evaluations.csv", index=False)

    return evaluation_df

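# Load results and generate evaluations; the CSV stores the retrieved_docs
# lists as strings, so parse them back into lists on load
import ast
results_df = pd.read_csv("model_comparison_results.csv", converters={"retrieved_docs": ast.literal_eval})
evaluation_df = evaluate_all_models(results_df)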

# Display evaluation results
def display_evaluations(evaluation_df):
    """Display evaluation results in a readable format"""
    for _, row in evaluation_df.iterrows():
        print(f"\nQuestion: {row['question']}")

        print("\nBase Model (No RAG):")
        print(f"Accuracy: {row['base_no_rag_accuracy']}, Relevant: {row['base_no_rag_relevant']}, Grounded: {row['base_no_rag_grounded']}")
        print(f"Comments: {row['base_no_rag_comments']}")

        print("\nFinetuned Model (No RAG):")
        print(f"Accuracy: {row['finetuned_no_rag_accuracy']}, Relevant: {row['finetuned_no_rag_relevant']}, Grounded: {row['finetuned_no_rag_grounded']}")
        print(f"Comments: {row['finetuned_no_rag_comments']}")

        print("\nBase Model (with RAG):")
        print(f"Accuracy: {row['base_with_rag_accuracy']}, Relevant: {row['base_with_rag_relevant']}, Grounded: {row['base_with_rag_grounded']}")
        print(f"Comments: {row['base_with_rag_comments']}")

        print("\nFinetuned Model (with RAG):")
        print(f"Accuracy: {row['finetuned_with_rag_accuracy']}, Relevant: {row['finetuned_with_rag_relevant']}, Grounded: {row['finetuned_with_rag_grounded']}")
        print(f"Comments: {row['finetuned_with_rag_comments']}")
        print("\n" + "="*80 + "\n")

display_evaluations(evaluation_df)

#--- ADD YOUR SOLUTION HERE (40 points)---

Evaluating responses for question: What are the admissions deadlines for SUTD?...
Evaluating responses for question: Is there financial aid available?...
Evaluating responses for question: What is the minimum score for the Mother Tongue La...
Evaluating responses for question: Do I require reference letters?...
Evaluating responses for question: Can polytechnic diploma students apply?...
Evaluating responses for question: Do I need SAT score?...
Evaluating responses for question: How many PhD students does SUTD have?...
Evaluating responses for question: How much are the tuition fees for Singaporeans?...
Evaluating responses for question: How much are the tuition fees for international st...
Evaluating responses for question: Is there a minimum CAP?...
Evaluating responses for question: What is the difference between CSD in SUTD and Com...
Evaluating responses for question: What is the teaching style like in SUTD?...
Evaluating responses for question: What are the possible career path 
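
To decide which system performs best, the per-question scores can be averaged per system. The sketch below assumes the column naming used in `llm_judge_evaluations.csv` (for example `base_no_rag_accuracy`) and treats a score of 0 as a failed evaluation to be skipped. `summarize` is a hypothetical helper, and the toy rows contain made-up numbers that only show the shape of the data.

```python
from statistics import mean

SYSTEMS = ["base_no_rag", "finetuned_no_rag", "base_with_rag", "finetuned_with_rag"]
METRICS = ["accuracy", "relevant", "grounded"]

def summarize(rows):
    """Average each metric per system, ignoring 0 scores (failed evaluations)."""
    summary = {}
    for system in SYSTEMS:
        summary[system] = {}
        for metric in METRICS:
            key = f"{system}_{metric}"
            values = [r[key] for r in rows if r.get(key, 0) > 0]
            summary[system][metric] = round(mean(values), 2) if values else None
    return summary

# Toy rows with made-up scores; the last row mimics a failed evaluation.
rows = [
    {f"{s}_{m}": 3 for s in SYSTEMS for m in METRICS},
    {f"{s}_{m}": 5 for s in SYSTEMS for m in METRICS},
    {f"{s}_{m}": 0 for s in SYSTEMS for m in METRICS},
]
print(summarize(rows)["base_no_rag"])
```

With the notebook's dataframe, the rows could be obtained with `evaluation_df.to_dict("records")`.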

# Bonus points: chatbot UI (RIO)

Implement a web UI frontend for your chatbot that you can demo in class.


In [31]:
# QUESTION: Implement a web UI frontend for your chatbot that you can demo in class.
# !pip install --upgrade gradio
import gradio as gr
import pandas as pd
import time

responses = []
show_evaluations = False

def toggle_evaluations():
    global show_evaluations
    show_evaluations = not show_evaluations
    return "LLM Judge Ratings: " + ("ON" if show_evaluations else "OFF")

def get_answers(question, progress=gr.Progress(track_tqdm=True)):
    global responses, show_evaluations

    if not question:
        return "", "", "", "", "Please enter a question."

    try:
        # Gradio injects the progress tracker through the default argument
        progress(0.1)

        # Generate responses
        base_no_rag = generate_base_no_rag(question)
        finetuned_no_rag = generate_finetuned_no_rag(question)
        base_with_rag, base_docs = generate_base_with_rag(question)
        finetuned_with_rag, finetuned_docs = generate_finetuned_with_rag(question)

        eval_base_no_rag = eval_finetuned_no_rag = eval_base_with_rag = eval_finetuned_with_rag = None

        if show_evaluations:
            progress(0.5)  # Midpoint update
            eval_base_no_rag = evaluate_response(question, base_no_rag, [])
            time.sleep(1)
            eval_finetuned_no_rag = evaluate_response(question, finetuned_no_rag, [])
            time.sleep(1)
            eval_base_with_rag = evaluate_response(question, base_with_rag, [doc.page_content for doc in base_docs])
            time.sleep(1)
            eval_finetuned_with_rag = evaluate_response(question, finetuned_with_rag, [doc.page_content for doc in finetuned_docs])

        # Save and return formatted response
        responses.insert(0, {
            "question": question,
            "base_no_rag": base_no_rag,
            "finetuned_no_rag": finetuned_no_rag,
            "base_with_rag": base_with_rag,
            "finetuned_with_rag": finetuned_with_rag,
            "docs": [doc.page_content for doc in base_docs],
            "evals": {
                "base_no_rag": eval_base_no_rag,
                "finetuned_no_rag": eval_finetuned_no_rag,
                "base_with_rag": eval_base_with_rag,
                "finetuned_with_rag": eval_finetuned_with_rag,
            }
        })

        progress(1.0)  # Done
        return format_response(responses[0])

    except Exception as e:
        print("❌ Error in get_answers:", e)
        return "Error", "Error", "Error", "Error", f"Error: {str(e)}"


def format_response(response):
    base_no_rag = response['base_no_rag']
    finetuned_no_rag = response['finetuned_no_rag']
    base_with_rag = response['base_with_rag']
    finetuned_with_rag = response['finetuned_with_rag']
    docs = response['docs']

    def with_scores(text, scores):
        """Append the judge's scores to an answer when ratings are enabled."""
        if show_evaluations and scores:
            text += f"\n\n**Evaluation:** Accuracy: {scores['Accuracy']}/5 | Relevant: {scores['Relevant']}/5 | Grounded: {scores['Grounded']}/5\nComments: {scores['Comments']}"
        return text

    evals = response['evals']
    base_no_rag = with_scores(base_no_rag, evals['base_no_rag'])
    finetuned_no_rag = with_scores(finetuned_no_rag, evals['finetuned_no_rag'])
    base_with_rag = with_scores(base_with_rag, evals['base_with_rag'])
    finetuned_with_rag = with_scores(finetuned_with_rag, evals['finetuned_with_rag'])

    docs_text = "\n\n".join([f"Document {i+1}: {doc[:200]}..." if len(doc) > 200 else f"Document {i+1}: {doc}"
                            for i, doc in enumerate(docs)])

    return base_no_rag, finetuned_no_rag, base_with_rag, finetuned_with_rag, docs_text

with gr.Blocks(title="SUTD Chatbot Comparison") as demo:
    gr.Markdown("# SUTD Chatbot System Comparison")
    gr.Markdown("Compare responses from 4 different model configurations")

    with gr.Row():
        question = gr.Textbox(label="Enter your question about SUTD:",
                            placeholder="e.g. What are the admissions deadlines for SUTD?")
        submit_btn = gr.Button("Get Answers")

    toggle_btn = gr.Button("Toggle LLM Judge Ratings")
    eval_status = gr.Textbox(label="Evaluation Status", value="LLM Judge Ratings: OFF", interactive=False)

    with gr.Row():
        with gr.Column():
            gr.Markdown("### Base Model (No RAG)")
            base_no_rag = gr.Textbox(label="", lines=5, interactive=False)
        with gr.Column():
            gr.Markdown("### Finetuned Model (No RAG)")
            finetuned_no_rag = gr.Textbox(label="", lines=5, interactive=False)

    with gr.Row():
        with gr.Column():
            gr.Markdown("### Base Model (With RAG)")
            base_with_rag = gr.Textbox(label="", lines=5, interactive=False)
        with gr.Column():
            gr.Markdown("### Finetuned Model (With RAG)")
            finetuned_with_rag = gr.Textbox(label="", lines=5, interactive=False)

    with gr.Accordion("View Retrieved Documents (for RAG models)", open=False):
        docs_display = gr.Textbox(label="", lines=10, interactive=False)



    submit_btn.click(
        fn=get_answers,
        inputs=question,
        outputs=[base_no_rag, finetuned_no_rag, base_with_rag, finetuned_with_rag, docs_display]
    )

    toggle_btn.click(
        fn=toggle_evaluations,
        outputs=eval_status
    )

if __name__ == "__main__":
    demo.launch()



#--- ADD YOUR SOLUTION HERE (40 points)---

It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://790a1d147c59755a3b.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


# End

This concludes assignment 4.

Please submit this notebook with your answers and the generated output cells as a **Jupyter notebook file** via GitHub.


Every group member should do the following submission steps:
1. Create a private GitHub repository **sutd_5055mlop** under your GitHub user.
2. Add your instructors as collaborators: ddahlmeier and lucainiaoge
3. Save your submission as assignment_04_GROUP_NAME.ipynb where GROUP_NAME is the name of the group you have registered.
4. Push the submission files to your repo
5. Submit the link to the repo via eDimensions



**Assignment due 21 April 2025 11:59pm**