# Group Project / Assignment 4: Instruction finetuning a Llama-3.2 model
**Assignment due 21 April 11:59pm**

Welcome to the fourth and final assignment for 50.055 Machine Learning Operations. The third and fourth assignments together form the course group project. You will continue working on a chatbot that answers prospective students' questions about SUTD.


**This assignment is a group assignment.**

- Read the instructions in this notebook carefully
- Add your solution code and answers in the appropriate places. Questions are marked **QUESTION:**, and the places where you need to add your code and text answers are marked **ADD YOUR SOLUTION HERE**. This assignment is more open-ended than previous ones, i.e. you have more freedom in how you solve the problem and structure your code.
- The completed notebook, including your added code and generated output will be your submission for the assignment.
- The notebook should execute without errors from start to finish when you select "Restart Kernel and Run All Cells..". Please test this before submission.
- Use the SUTD Education Cluster to solve and test the assignment. If you work in another environment, at a minimum test your work on the SUTD Education Cluster.

**Rubric for assessment** 

Your submission will be graded using the following criteria. 
1. Code executes: your code should execute without errors. The SUTD Education Cluster should be used to ensure the same execution environment.
2. Correctness: the code should produce the correct result, or the text answer should state the factually correct answer.
3. Style: your code should be written in a way that is clean and efficient. Your text answers should be relevant, concise and easy to understand.
4. Partial marks will be awarded for partially correct solutions.
5. Creativity and innovation: in this assignment you have more freedom to design your solution than in the earlier assignments. You can show off your creativity and innovative mindset.
6. There is a maximum of 310 points for this assignment.

**ChatGPT policy** 

If you use AI tools, such as ChatGPT, to solve the assignment questions, you need to be transparent about their use and mark AI-generated content as such. In particular, you should include the following in addition to your final answer:
- A copy or screenshot of the prompt you used
- The name of the AI model
- The AI generated output
- An explanation why the answer is correct or what you had to change to arrive at the correct answer

**Assignment Notes:** Please make sure to save the notebook as you go along. Submission Instructions are located at the bottom of the notebook.



### Finetuning LLMs

The goal of the assignment is to build a more advanced chatbot that can talk to prospective students and answer questions about SUTD.

We will finetune a smaller 1B LLM on question-answer pairs which we synthetically generate. Then we will compare the finetuned and non-finetuned LLMs with and without RAG to see if we were able to improve the SUTD chatbot answer quality. 

We'll be leveraging `langchain`, `Llama 3.2` and `Google AI Studio with Gemini 2.0`.

Check out the docs:
- [LangChain](https://docs.langchain.com/docs/)
- [Llama 3.2](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/)
- [Google AI Studio](https://aistudio.google.com/)

Note: Google AI Studio provides a generous free token allowance but enforces rate limits. Write your code so that it can handle these limits.
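
One common way to cope with rate limits is to retry failed calls with exponential backoff. The sketch below is only an illustration under stated assumptions: `call_with_backoff` and all of its parameters are hypothetical names, not part of LangChain or the Google AI Studio client, and it retries on every exception for brevity. In practice you would probably catch only the rate-limit error raised by the client library you use.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn() and retry on failure, doubling the wait after each attempt.

    Hypothetical helper: names and defaults are illustrative only.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # Wait 1s, 2s, 4s, ... capped at max_delay, plus up to 10% jitter
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))
```

You would wrap each LLM call in a zero-argument function (for example a `lambda`) and pass it as `fn`. The `sleep` parameter exists only so the waiting can be replaced in tests.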

# Install dependencies
Use pip to install all required dependencies of this assignment in the cell below. Make sure to test this on the SUTD cluster as different environments have different software pre-installed.  

In [None]:
# QUESTION: Install and import all required packages
# The rest of your code should execute without any import or dependency errors.

# **--- ADD YOUR SOLUTION HERE (10 points) ---**






# Generate training data
The first step of the assignment is generating synthetic question-answer pairs which can be used for finetuning an LLM.
Use Google AI Studio with the Gemini models to create high-quality QA training data.
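
Models often wrap JSON in a Markdown code fence, so the raw text usually needs a little cleaning before it can be parsed. LangChain provides `JsonOutputParser` in `langchain_core.output_parsers` for this purpose. The plain-Python stand-in below only illustrates the idea; `parse_json_output` is a hypothetical helper, and the example input is hand-written rather than real model output.

```python
import json
import re

# The fence marker is built programmatically so this example contains no literal fence.
FENCE = "`" * 3
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_output(text):
    """Parse JSON from raw model text, unwrapping a Markdown code fence if present."""
    match = _FENCED.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())


# Hand-written stand-in for a model response
print(parse_json_output('["apple", "banana", "cherry"]'))
```

`json.loads` raises `json.JSONDecodeError` on malformed output, which is a natural signal to retry the call.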


In [None]:
# QUESTION: Use langchain and the Google AI Studio APIs and a model from the Gemini 2.0 family
# to create a text-generation chain that can produce and parse JSON output.
# Test it by having the LLM generate a JSON array of 3 fruits

#--- ADD YOUR SOLUTION HERE (20 points)---




## Generate topics
When generating data, it is often helpful to guide the generation process through some hierarchical structure. 
Before we create question-answer pairs, let's generate some topics which the questions should be about.



In [None]:
# QUESTION: Create a function 'generate_topics' which generates topics which prospective students might care about.
#
# Generate a list of 20 topics 

#--- ADD YOUR SOLUTION HERE (20 points)---




In [None]:
# # test topic generation
# print(generate_topics(3))

In [None]:
# Generate a list of 20 topics 
# We save a copy to disk and reload it from there if the file exists



## Generate questions
Now generate a set of questions about each topic
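
Since every topic needs its own LLM call, it can help to cache results on disk so that a re-run of the notebook does not repeat calls that already succeeded. The following is a minimal sketch of a load-or-generate pattern; `load_or_generate` and the file name in the usage note are hypothetical, and the generator passed in stands for whatever function you write.

```python
import json
import os


def load_or_generate(path, generate):
    """Return cached JSON from `path` if it exists; otherwise call generate() and cache the result."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    data = generate()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return data
```

For example, `load_or_generate("topics.json", lambda: generate_topics(20))` would call the model only on the first run, assuming `generate_topics` returns a JSON-serialisable list.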

In [None]:
# QUESTION: Create a function 'generate_questions' which generates questions about a given topic. 
# Generate a list of 10 questions per topic. In total you should have 200 questions. 
#

#--- ADD YOUR SOLUTION HERE (20 points)---


In [None]:
# # test it
# print(generate_questions("Academic Reputation and Program Quality", 3))


In [None]:
# QUESTION: Now let's put it together and generate 10 questions for each topic. Save the questions in a local file.

#--- ADD YOUR SOLUTION HERE (20 points)---




## Generate Answers

Now create answers for the questions. 

You can use a Google AI Studio Gemini model (assuming it is good enough to generate good answers), your RAG system from assignment 3, or any other method you choose to generate answers for your question dataset.

Note: it is normal for some LLM calls to fail, even with retries, so you may end up with fewer than 200 QA pairs, but you should have at least 160.
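
As a rough illustration of the 80/20 split, the sketch below drops pairs whose answer is missing and then splits the rest reproducibly. It uses only the standard library; `split_train_test` and the `question`/`answer` keys are assumptions made for this example. If you store the pairs in a Hugging Face `Dataset`, its `train_test_split(test_size=0.2)` method achieves the same thing.

```python
import random


def split_train_test(pairs, test_fraction=0.2, seed=42):
    """Drop pairs without an answer, shuffle with a fixed seed, and return (train, test)."""
    valid = [p for p in pairs if p.get("answer")]
    random.Random(seed).shuffle(valid)
    n_test = round(len(valid) * test_fraction)
    return valid[n_test:], valid[:n_test]


# Hypothetical data: 200 questions, of which every fifth answer failed to generate
pairs = [{"question": f"q{i}", "answer": None if i % 5 == 0 else f"a{i}"} for i in range(200)]
train, test = split_train_test(pairs)
print(len(train), len(test))
```

Fixing the seed keeps the split stable across notebook runs, so the test set never leaks into training after a restart.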

In [None]:
# QUESTION: Generate answers to all your questions using Gemini, your SUTD RAG system or any other method.
# Split your dataset into an 80% training and a 20% test dataset.
# Store all question-answer pairs in a Hugging Face dataset `sutd_qa_dataset` and push it to your Hugging Face Hub. 

#--- ADD YOUR SOLUTION HERE (40 points)---





In [None]:
# # test the chain
# question = "When was SUTD founded?"

# # Now run the answer generation chain
# response = generate_answer(question)
# print("\nModel Response:")
# print(response["answer"])

In [None]:
# now run the chain for all questions to collect context and generate answers



# Finetune Llama 3.2 1B model

Now use the training split of your SUTD QA dataset to finetune a smaller Llama 3.2 1B LLM using parameter-efficient finetuning (PEFT). 
We recommend the unsloth library but you are free to choose other frameworks. You can decide the parameters for the finetuning. 
Push your finetuned model to Hugging Face. 
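
To get a feel for why PEFT is cheap, consider LoRA, one common PEFT method. For a weight matrix of shape `d_out x d_in` it trains two low-rank factors of shapes `d_out x r` and `r x d_in` instead of the full matrix. The arithmetic below is a worked example only: the 2048 x 2048 layer and rank 16 are hypothetical values chosen for illustration, not a statement about which layers or ranks you should use.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight: rank * (d_in + d_out)."""
    return rank * (d_in + d_out)


# Hypothetical square projection layer, for illustration only
full = 2048 * 2048
added = lora_param_count(2048, 2048, rank=16)
print(f"full: {full}, LoRA: {added}, ratio: {added / full:.4f}")
```

Doubling the rank doubles the number of trainable parameters, which is one of the trade-offs to consider when you choose your finetuning hyperparameters.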

Then we will compare the finetuned and non-finetuned LLMs with and without RAG to see if we were able to improve the SUTD chatbot answer quality. 


In [None]:
# QUESTION: Finetune a Llama 3.2 1B model on the training split of your SUTD QA dataset.
# You need to prepare your dataset accordingly and set the hyperparameters for the training.
# Push your finetuned model to the Hugging Face model hub {YOUR_HF_NAME}/llama-3.2-1B-sutdqa

#--- ADD YOUR SOLUTION HERE (50 points)---



In [None]:
# QUESTION: Load a non-finetuned Llama 3.2 1B model and your finetuned SUTD QA Llama 3.2 1B model
# Ask them a simple test question (e.g. "What is special about SUTD?") to check that both models can generate answers

#--- ADD YOUR SOLUTION HERE (10 points)---



In [None]:
# # try out the llms

# query = "What is special about SUTD?"

# print("Question:", query)
# response_base = llm_base.invoke(query,  pipeline_kwargs={"max_new_tokens": 512})
# print("Answer base:", response_base)

# print("---------")
# response_finetune = llm_finetune.invoke(query, pipeline_kwargs={"max_new_tokens": 512})
# print("Answer finetune:", response_finetune)

# Integrate and evaluate

Now integrate both the non-finetuned Llama 3.2 1B model and your finetuned model into your SUTD chatbot RAG system. 
Generate responses to the 20 questions you collected in assignment 3 using these 4 approaches:
1. non-finetuned Llama 3.2 1B model without RAG
2. finetuned Llama 3.2 1B SUTD QA model without RAG
3. non-finetuned Llama 3.2 1B model with RAG
4. finetuned Llama 3.2 1B SUTD QA model with RAG

Compare the responses and decide which system produces the most accurate and highest-quality responses.
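
The four approaches are the combinations of two models and two prompting modes, so one prompt builder with an optional context argument can serve all of them. This is a sketch under assumptions: `SYSTEM_PROMPT`, `build_prompt` and the `"base"`/`"finetuned"` labels are hypothetical, and the prompt wording is not the one used elsewhere in this notebook.

```python
from itertools import product

# Hypothetical instruction text, for illustration only
SYSTEM_PROMPT = "You are a helpful assistant answering prospective students' questions about SUTD."


def build_prompt(question, context=None):
    """Build a prompt; the retrieved context is included only in the RAG setting."""
    parts = [SYSTEM_PROMPT]
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)


# (model label, use RAG?) for each of the four approaches
CONFIGS = list(product(["base", "finetuned"], [False, True]))
print(CONFIGS)
```

Keeping the prompt identical apart from the context block makes the comparison fairer, because differences in the answers can then be attributed to finetuning or retrieval rather than to prompt wording.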

In [1]:
from typing import Literal
import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from typing_extensions import Annotated, List, TypedDict
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
from datasets import load_dataset
from transformers import AutoModelForCausalLM,AutoTokenizer,pipeline
from sentence_transformers import SentenceTransformer
import torch
from langchain_core.output_parsers import PydanticOutputParser
import numpy
from langchain_core.messages import SystemMessage
from langgraph.prebuilt import ToolNode
from pypdf import PdfReader
import os
from os import listdir
from bs4 import BeautifulSoup
import pandas as pd
import re 
import requests
from tqdm import tqdm
from peft import PeftModel, PeftConfig

USER_AGENT environment variable not set, consider setting it to identify your requests.


## Load base model and finetuned model

In [None]:
# Model identifiers
model_base_id = "meta-llama/Llama-3.2-1B"
model_finetune_id = "reenee1601/llama-3.2-1B-sutdqa-merged"

device = "cuda" if torch.cuda.is_available() else "cpu"

# ----------------------------
# Load BASE model + tokenizer
# ----------------------------
tokenizer_base = AutoTokenizer.from_pretrained(
    model_base_id,
    padding_side="left"
)

# Ensure base tokenizer has padding token
if tokenizer_base.pad_token is None:
    tokenizer_base.pad_token = tokenizer_base.eos_token
    tokenizer_base.pad_token_id = tokenizer_base.eos_token_id

base_model = AutoModelForCausalLM.from_pretrained(
    model_base_id,
    torch_dtype=torch.float16,
    device_map=0 if device == "cuda" else "cpu"
)

# ----------------------------
# Load FINETUNED model + tokenizer
# ----------------------------
tokenizer_finetune = AutoTokenizer.from_pretrained(
    model_finetune_id,
    padding_side="left"
)

# Ensure finetuned tokenizer has padding token
if tokenizer_finetune.pad_token is None:
    tokenizer_finetune.pad_token = tokenizer_finetune.eos_token
    tokenizer_finetune.pad_token_id = tokenizer_finetune.eos_token_id

finetuned_model = AutoModelForCausalLM.from_pretrained(
    model_finetune_id,
    torch_dtype=torch.float16,
    device_map=0 if device == "cuda" else "cpu"
)

# ----------------------------
# Create text-generation pipelines
# ----------------------------
llm_base = pipeline(
    "text-generation",
    model=base_model,
    tokenizer=tokenizer_base,
    max_new_tokens=512,
    temperature=0.4,
    pad_token_id=tokenizer_base.pad_token_id,
    torch_dtype=torch.float16,
    device_map=0 if device == "cuda" else "cpu",
)

llm_finetune = pipeline(
    "text-generation",
    model=finetuned_model,
    tokenizer=tokenizer_finetune,
    max_new_tokens=512,
    temperature=0.4,
    pad_token_id=tokenizer_finetune.pad_token_id,
    torch_dtype=torch.float16,
    device_map=0 if device == "cuda" else "cpu",
)

In [3]:
query = "What courses are available in SUTD?"

formatted_input = f"Question: {query}\nYou are a helpful and friendly assistant who provides detailed and informative answers to prospective students about their queries regarding the Singapore University of Technology and Design (SUTD). Elaborate on your response while keeping it concise and relevant. Answer:"

# Generate response
response = llm_base(
    formatted_input,
    max_new_tokens=512,
    temperature=0.4,
    pad_token_id=tokenizer_base.pad_token_id
)

print({"answer": response[0]['generated_text'].split("Answer:")[-1].strip()})

{'answer': 'SUTD is a private university that offers a wide range of courses in various fields of study. The university has a strong emphasis on innovation and research, and it is known for its cutting-edge technologies and interdisciplinary approach to education. SUTD offers a range of undergraduate and graduate programs, including engineering, science, business, design, and humanities. The university also has a strong focus on internationalization, with students from around the world studying and working at SUTD. The university is committed to providing a supportive and inclusive environment for its students, and it offers a range of services and support to help students succeed. SUTD is a private university that offers a wide range of courses in various fields of study. The university has a strong emphasis on innovation and research, and it is known for its cutting-edge technologies and interdisciplinary approach to education. SUTD offers a range of undergraduate and graduate progra

In [4]:
query = "What courses are available in SUTD?"

formatted_input = f"Question: {query}\nYou are a helpful and friendly assistant who provides detailed and informative answers to prospective students about their queries regarding the Singapore University of Technology and Design (SUTD). Elaborate on your response while keeping it concise and relevant. Answer:"

# Generate response
response = llm_finetune(
    formatted_input,
    max_new_tokens=512,
    temperature=0.4,
    pad_token_id=tokenizer_finetune.pad_token_id
)

print({"answer": response[0]['generated_text'].split("Answer:")[-1].strip()})

{'answer': "SUTD offers a wide range of courses, but they are not a single university. They are a separate institution within Singapore's education system. Prospective students should understand this and not expect a single, comprehensive course catalogue like a traditional university. Instead, SUTD's courses are structured as modules, with specific modules offered each semester. While SUTD does offer general electives, these are not the primary focus of the curriculum like a traditional university. Students should expect courses to be tailored to the specific focus of their chosen majors, and the modules may change slightly from semester to semester depending on the program. The official course catalog is updated regularly on the SUTD website, so prospective students should regularly check the website for the most up-to-date information. It's crucial to understand that SUTD's courses are not equivalent to a traditional university degree; they are designed to be highly focused and spec
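
The two test cells above recover the answer with `split("Answer:")[-1]`, which keeps only the text after the last occurrence of `Answer:`. If the model itself writes `Answer:` somewhere in its output, everything before that point is lost. A slightly more defensive variant is sketched below; `extract_answer` is a hypothetical helper and the strings in the example are hand-written placeholders. The text-generation pipeline also accepts `return_full_text=False`, which returns only the newly generated text and may make this step unnecessary.

```python
def extract_answer(generated_text, prompt):
    """Return the model's continuation, preferring exact removal of the prompt prefix."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fallback: keep everything after the first "Answer:" marker
    return generated_text.split("Answer:", 1)[-1].strip()


# Hand-written placeholder strings, not real model output
prompt = "Question: example question\nAnswer:"
generated = prompt + " First part. Answer: second part."
print(extract_answer(generated, prompt))
```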

## Non-RAG

### 1. Non-finetuned Llama 3.2 1B model without RAG

In [5]:
questions = [
            "What are the admissions deadlines for SUTD?",
            "Is there financial aid available?",
            "What is the minimum score for the Mother Tongue Language?",
            "Do I require reference letters?",
            "Can polytechnic diploma students apply?",
            "Do I need SAT score?",
            "How many PhD students does SUTD have?",
            "How much are the tuition fees for Singaporeans?",
            "How much are the tuition fees for international students?",
            "Is there a minimum CAP?",
            "If I am a polytechnic student with CGPA 3.0, am I still able to go SUTD?",
            "Is first year housing compulsory?",
            "Is ILP compulsory?",
            "Does SUTD help me in sourcing internships or jobs?",
            "I want to create a startup during my undergraduate years. What assistance does SUTD provide?",
            "I am new to programming but I want to join Computer Science & Design. Will SUTD provide any bridging courses in the first year?",
            "I want to work in cybersecurity after graduation. What course and modules should I take at SUTD?",
            "What career path does DAI open for me?",
            "Who can I contact to query about my admission application?",
            "When does school start for freshmore?"
            ]

df = pd.DataFrame(columns=["query", "answer"])

for question in tqdm(questions):
    formatted_input = f"Question: {question}\nYou are a helpful and friendly assistant who provides detailed and informative answers to prospective students about their queries regarding the Singapore University of Technology and Design (SUTD). Elaborate on your response while keeping it concise and relevant. Answer:"
    
    response = llm_base(
        formatted_input,
        max_new_tokens=512,
        temperature=0.4,
        pad_token_id=tokenizer_base.pad_token_id
    )
    
    answer = response[0]['generated_text'].split("Answer:")[-1].strip()
    df.loc[len(df)] = [question, answer]

df.to_csv('results_base.csv', index=False)

 45%|████▌     | 9/20 [17:52<20:21, 111.02s/it]  You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 20/20 [27:42<00:00, 83.15s/it]


### 2. Finetuned Llama 3.2 1B SUTD QA model without RAG

In [6]:
questions = [
            "What are the admissions deadlines for SUTD?",
            "Is there financial aid available?",
            "What is the minimum score for the Mother Tongue Language?",
            "Do I require reference letters?",
            "Can polytechnic diploma students apply?",
            "Do I need SAT score?",
            "How many PhD students does SUTD have?",
            "How much are the tuition fees for Singaporeans?",
            "How much are the tuition fees for international students?",
            "Is there a minimum CAP?",
            "If I am a polytechnic student with CGPA 3.0, am I still able to go SUTD?",
            "Is first year housing compulsory?",
            "Is ILP compulsory?",
            "Does SUTD help me in sourcing internships or jobs?",
            "I want to create a startup during my undergraduate years. What assistance does SUTD provide?",
            "I am new to programming but I want to join Computer Science & Design. Will SUTD provide any bridging courses in the first year?",
            "I want to work in cybersecurity after graduation. What course and modules should I take at SUTD?",
            "What career path does DAI open for me?",
            "Who can I contact to query about my admission application?",
            "When does school start for freshmore?"
            ]

df = pd.DataFrame(columns=["query", "answer"])

for question in tqdm(questions):
    formatted_input = f"Question: {question}\nYou are a helpful and friendly assistant who provides detailed and informative answers to prospective students about their queries regarding the Singapore University of Technology and Design (SUTD). Elaborate on your response while keeping it concise and relevant. Answer:"
    
    response = llm_finetune(
        formatted_input,
        max_new_tokens=512,
        temperature=0.4,
        pad_token_id=tokenizer_finetune.pad_token_id
    )
    
    answer = response[0]['generated_text'].split("Answer:")[-1].strip()
    df.loc[len(df)] = [question, answer]

df.to_csv('results_finetune.csv', index=False)

100%|██████████| 20/20 [40:07<00:00, 120.37s/it]


## RAG

### Download documents

In [None]:
# Use separate loaders because different pages keep their content in different HTML elements
loader = WebBaseLoader(
    web_paths=("https://en.wikipedia.org/wiki/Singapore_University_of_Technology_and_Design", 
            "https://www.sutd.edu.sg/research/research-centres/designz/about/introduction/",
            "https://www.sutd.edu.sg/admissions/undergraduate/education-expenses/fees/tuition-fees/#tabs",
            "https://www.sutd.edu.sg/admissions/undergraduate/education-expenses/fees/tuition-grant-eligibility/#tabs",
            "https://www.sutd.edu.sg/admissions/undergraduate/education-expenses/financial-estimates/#tabs",
            "https://www.sutd.edu.sg/admissions/undergraduate/education-expenses/student-insurance-scheme/#tabs",
            "https://www.sutd.edu.sg/admissions/undergraduate/appeal/",
            "https://www.sutd.edu.sg/admissions/undergraduate/admission-requirements/overview",
            "https://www.sutd.edu.sg/admissions/undergraduate/scholarship/sutd-administered/",
            "https://www.sutd.edu.sg/admissions/undergraduate/scholarship/external-sponsored/#tabs",
            "https://www.sutd.edu.sg/admissions/undergraduate/scholarship/awards/sutd-design-innovator-award/",
            "https://www.sutd.edu.sg/admissions/undergraduate/financing-options-and-aid/financial-aid/overview/#tabs",
            "https://www.sutd.edu.sg/admissions/undergraduate/financing-options-and-aid/other-financing-options/overview#tabs",
            "https://www.sutd.edu.sg/admissions/undergraduate/financing-options-and-aid/sutd-community-grant/#tabs",
            "https://www.sutd.edu.sg/admissions/undergraduate/early-matriculation/",
            "https://www.sutd.edu.sg/admissions/undergraduate/integrated-learning-programme/",
            "https://www.sutd.edu.sg/campus-life/student-life/student-organisations-fifth-row/",
            "https://www.sutd.edu.sg/campus-life/student-life/part-time-work-scheme/",
            "https://www.sutd.edu.sg/campus-life/student-life/student-awards/student-achievement-awards/overview/#tabs",
            "https://www.sutd.edu.sg/admissions/undergraduate/admission-requirements/international-qualifications",
            "https://www.sutd.edu.sg/admissions/undergraduate/application-guide/",
            "https://www.sutd.edu.sg/istd/139-2/",
                "https://www.sutd.edu.sg/course/10-013-modelling-and-analysis/",
            "https://www.sutd.edu.sg/course/10-015-physical-world/",
            "https://www.sutd.edu.sg/course/10-014-computational-thinking-for-design/",
            "https://www.sutd.edu.sg/course/02-001-global-humanities-literature-philosophy-and-ethics/",
            "https://www.sutd.edu.sg/course/10-018-modelling-space-and-systems/",
            "https://www.sutd.edu.sg/course/10-017-technological-world/",
            "https://www.sutd.edu.sg/course/10-016-science-for-a-sustainable-world/",
            "https://www.sutd.edu.sg/course/03-007-design-thinking-and-innovation/"
            ),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            name=("main"),
        )
    ),
)
docs = loader.load()

loader = WebBaseLoader(
    web_paths=("https://www.sutd.edu.sg/admissions/undergraduate/faq/?faq-category=1655%2C1650%2C1653%2C1654%2C1652%2C1753%2C1586%2C1740%2C937%2C1749%2C815%2C1750%2C1751%2C1752%2C1754%2C1755%2C1756%2C1757#faq-listing",
            "https://www.sutd.edu.sg/admissions/undergraduate/faq/?faq-category=1655%2C1650%2C1653%2C1654%2C1652%2C1753%2C1586%2C1740%2C937%2C1749%2C815%2C1750%2C1751%2C1752%2C1754%2C1755%2C1756%2C1757&paged=2#faq-listing",
            "https://www.sutd.edu.sg/admissions/undergraduate/faq/?faq-category=1655%2C1650%2C1653%2C1654%2C1652%2C1753%2C1586%2C1740%2C937%2C1749%2C815%2C1750%2C1751%2C1752%2C1754%2C1755%2C1756%2C1757&paged=3#faq-listing",
            "https://www.sutd.edu.sg/admissions/undergraduate/faq/?faq-category=1655%2C1650%2C1653%2C1654%2C1652%2C1753%2C1586%2C1740%2C937%2C1749%2C815%2C1750%2C1751%2C1752%2C1754%2C1755%2C1756%2C1757&paged=4#faq-listing",
            "https://www.sutd.edu.sg/admissions/undergraduate/faq/?faq-category=1655%2C1650%2C1653%2C1654%2C1652%2C1753%2C1586%2C1740%2C937%2C1749%2C815%2C1750%2C1751%2C1752%2C1754%2C1755%2C1756%2C1757&paged=5#faq-listing",
            "https://www.sutd.edu.sg/admissions/undergraduate/faq/?faq-category=1655%2C1650%2C1653%2C1654%2C1652%2C1753%2C1586%2C1740%2C937%2C1749%2C815%2C1750%2C1751%2C1752%2C1754%2C1755%2C1756%2C1757&paged=6#faq-listing",
            "https://www.sutd.edu.sg/admissions/undergraduate/faq/?faq-category=1655%2C1650%2C1653%2C1654%2C1652%2C1753%2C1586%2C1740%2C937%2C1749%2C815%2C1750%2C1751%2C1752%2C1754%2C1755%2C1756%2C1757&paged=7#faq-listing",
            "https://www.sutd.edu.sg/admissions/undergraduate/faq/?faq-category=1655%2C1650%2C1653%2C1654%2C1652%2C1753%2C1586%2C1740%2C937%2C1749%2C815%2C1750%2C1751%2C1752%2C1754%2C1755%2C1756%2C1757&paged=8#faq-listing",
            "https://www.sutd.edu.sg/admissions/undergraduate/faq/?faq-category=1655%2C1650%2C1653%2C1654%2C1652%2C1753%2C1586%2C1740%2C937%2C1749%2C815%2C1750%2C1751%2C1752%2C1754%2C1755%2C1756%2C1757&paged=9#faq-listing",
            "https://www.sutd.edu.sg/admissions/undergraduate/faq/?faq-category=1655%2C1650%2C1653%2C1654%2C1652%2C1753%2C1586%2C1740%2C937%2C1749%2C815%2C1750%2C1751%2C1752%2C1754%2C1755%2C1756%2C1757&paged=10#faq-listing",
            ),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            name=("p"),
        )
    ),
)
docs_faq = loader.load()

docs += docs_faq

loader = WebBaseLoader(
    web_paths=("https://www.sutd.edu.sg/campus-life/housing/freshmore-terms-1-2/rooms-and-amenities/#tabs",
            "https://www.sutd.edu.sg/campus-life/housing/freshmore-terms-1-2/check-in-out-ay2025/#tabs",
            "https://www.sutd.edu.sg/campus-life/housing/freshmore-terms-1-2/payment-ay2025/#tabs",
            "https://www.sutd.edu.sg/campus-life/housing/freshmore-terms-1-2/#tabs",
            "https://www.sutd.edu.sg/admissions/undergraduate/local-diploma/criteria-for-admission",
            "https://www.sutd.edu.sg/admissions/undergraduate/local-diploma/application-timeline/#tabs",
            "https://www.sutd.edu.sg/istd/education/undergraduate/faq/why-istd/#tabs",
            ),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            id=("component-grid-group"),
        )
    ),
)

extra = loader.load()
docs+=extra

loader = WebBaseLoader(
    web_paths=("https://www.sutd.edu.sg/istd/education/undergraduate/faq/faq/#tabs",
            "https://www.sutd.edu.sg/istd/education/undergraduate/faq/faq/?paged=2#faq-listing",
            "https://www.sutd.edu.sg/esd/education/undergraduate/faq/?post_tag=54",
            "https://www.sutd.edu.sg/epd/education/undergraduate/faq/?post_tag=719",
            "https://www.sutd.edu.sg/epd/education/undergraduate/faq/?post_tag=719&paged=2#faq-listing",
            "https://www.sutd.edu.sg/epd/education/undergraduate/faq/?post_tag=719&paged=3#faq-listing",
            ),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            id=("rich-text-generator"),
        )
    ),
)

extra = loader.load()
docs+=extra

loader = WebBaseLoader(
    web_paths=("https://www.sutd.edu.sg/education/undergraduate/freshmore-subjects/",
            ),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("acf-innerblocks-container"),
        )
    ),
)

extra = loader.load()
docs+=extra

def scrape_course(url):
    try:
        res = requests.get(url, timeout=10)
        res.raise_for_status()
        soup = BeautifulSoup(res.text, "html.parser")

        title_tag = soup.find("h1")
        title = title_tag.get_text(strip=True) if title_tag else "No Title Found"

        rich_text_span = soup.find("span", {"id": "rich-text-generator"})
        description = ""
        
        if rich_text_span:
            li_tags = rich_text_span.find_all("li")
            p_tags = rich_text_span.find_all("p")
            h_tags = rich_text_span.find_all(re.compile("^h[1-6]$"))  

            description = "\n".join([tag.get_text(strip=True) for tag in li_tags + p_tags + h_tags])

        if not description:
            fallback_span = soup.find("span", class_="richText richtext-paragraph-margin")
            first_paragraph = fallback_span.find("p") if fallback_span else None

            if not first_paragraph:
                fallback_div = soup.find("div", class_="wp-block-column is-vertically-aligned-center")
                first_paragraph = fallback_div.find("p") if fallback_div else None

            if not first_paragraph:
                fallback_div = soup.find("div", class_="wp-block-column")
                first_paragraph = fallback_div.find("p") if fallback_div else None

            if not first_paragraph:
                list_items = soup.find_all("li")
                if list_items:
                    first_paragraph = list_items[0].get_text(strip=True)

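            if first_paragraph:
                # find() returns a Tag, while the <li> fallback is already a string;
                # normalise both to plain text so no HTML markup leaks into the output
                description = (
                    first_paragraph
                    if isinstance(first_paragraph, str)
                    else first_paragraph.get_text(strip=True)
                )

        # Fall back to a placeholder if nothing was extracted
        description = description if description else "No Description Found"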
        print(f"Title: {title}")
        print(f"Description: {description}")
        print("-" * 80)

        return title, description

    except Exception as e:
        return "Error", f"Failed to fetch: {url} - {str(e)}"


def save_to_html(course_data, output_file="courses.html"):
    with open(output_file, "w", encoding="utf-8") as file:
        file.write("<html><body><h1>Course Titles and Descriptions</h1>")
        for title, description in course_data:
            file.write(f"<h2>{title}</h2>")
            file.write(f"<p>{description}</p>")
        file.write("</body></html>")

def scrape_courses_from_file(input_file="course_links.txt"):
    course_data = []
    with open(input_file, "r", encoding="utf-8") as file:
        for line in file:
            url = line.strip()
            if url:  
                title, description = scrape_course(url)
                course_data.append((title, description))
    
    return course_data

course_data = scrape_courses_from_file()
save_to_html(course_data)


def scrape_local(link, about):
    with open(link, encoding="utf-8") as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    
    for course_tag in soup.find_all('h2'):
        course_title = course_tag.get_text(strip=True)
        description_tag = course_tag.find_next('p') 
        description = description_tag.get_text(strip=True) if description_tag else ""
        
        new_entry = Document(
            page_content=course_title+": "+description,
            metadata={
                "source": course_title,
                "category": about,
                "updated": "2025-03-31" 
            }
        )
        docs.append(new_entry)

scrape_local("./courses.html", "course_info")


with open("./calendar2025.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
    
for h2_tag in soup.find_all('h2'):
    section = {
        'title': h2_tag.get_text(strip=True),
        'h3_sections': [],
        'paragraphs': []
    }
    
    # Get all siblings until the next h2 tag
    current = h2_tag.next_sibling
    current_h3 = None
    h3_section = None
    
    while current is not None and getattr(current, 'name', None) != 'h2':
        if hasattr(current, 'name'):
            if current.name == 'h3':
                current_h3 = current.get_text(strip=True)
                h3_section = {'title': current_h3, 'paragraphs': []}
                section['h3_sections'].append(h3_section)
            elif current.name == 'p':
                if h3_section:
                    h3_section['paragraphs'].append(current.get_text(strip=True))
                else:
                    section['paragraphs'].append(current.get_text(strip=True))
        current = current.next_sibling
    
    # Convert the section dictionary to a meaningful text representation
    section_text = f"{section['title']}\n\n"
    
    # Add paragraphs directly under the trimester
    for paragraph in section['paragraphs']:
        section_text += f"{paragraph}\n"
    
    # Add h3 sections
    for h3_section in section['h3_sections']:
        section_text += f"\n{h3_section['title']}:\n"
        for paragraph in h3_section['paragraphs']:
            section_text += f"- {paragraph}\n"
    
    new_entry = Document(
        page_content=section_text,  # Use the text representation instead of the dictionary
        metadata={
            "source": "calendar2025.html",
            "category": "academic_calendar",
            "updated": "2025-03-31",
            "section_data": section  # Optionally keep the structured data in metadata
        }
    )
    docs.append(new_entry)


path = "./pdf/"
all_pdf = listdir(path)
for i in all_pdf:
    if i.endswith(".pdf"):  # Only process PDF files
        reader = PdfReader(path + i)  
        number_of_pages = len(reader.pages) 
        
        # Last page is excluded because it has no content
        text = ""
        for page_num in range(number_of_pages - 1):
            page = reader.pages[page_num]
            text += page.extract_text() 
        new_entry = Document(
            page_content=text,
            metadata={
                "source": i,
                "category": "course_info",
                # Update this date whenever the source PDFs change
                "updated": "2025-03-31"  
            }
        )
        docs.append(new_entry)


# Create a translation table that replaces \n, \t, and \xa0 with spaces,
# so that words on adjacent lines are not joined together
translation_table = str.maketrans(
    {'\n': ' ', '\t': ' ', '\xa0': ' '}
)

# Load and clean documents
for doc in docs:
    doc.page_content = doc.page_content.translate(translation_table).strip()
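
A quick way to check what a translation table does to text before applying it to every document. This is a minimal sketch using only the standard library; the sample string is made up for illustration and does not come from the scraped pages. In `str.maketrans`, mapping a character to `None` deletes it, while mapping it to `' '` substitutes a space.

```python
# Hypothetical sample text, not taken from the scraped pages.
sample = "Term 1\nStarts in September\tWeek\xa01"

# Mapping to None deletes the character entirely.
delete_table = str.maketrans({'\n': None, '\t': None, '\xa0': ' '})
# Mapping to ' ' keeps a word boundary where the whitespace was.
space_table = str.maketrans({'\n': ' ', '\t': ' ', '\xa0': ' '})

deleted = sample.translate(delete_table).strip()
spaced = sample.translate(space_table).strip()

print(repr(deleted))
print(repr(spaced))
```

Deleting the newline joins "Term 1" and "Starts" into one token, which can hurt both chunking and retrieval; substituting a space keeps the words apart.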

### Split documents

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)
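
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below uses a naive fixed-width sliding window. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph breaks, line breaks, and spaces, so its chunks will differ. The helper name `sliding_chunks` and the toy input are assumptions for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-width chunking; illustrates the overlap idea only."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last 4 characters of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk when the overlap is large enough.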

### Embedding and vector store

In [None]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
# embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

vector_store = InMemoryVectorStore(embedding_model)
_ = vector_store.add_documents(all_splits)

In [None]:
query = "When was SUTD founded?"

# QUESTION: run the query against the vector store, print the top 5 search results

#--- ADD YOUR SOLUTION HERE (5 points)---
retrieved_docs = vector_store.similarity_search(
    query,
    k=5
)
print(retrieved_docs)
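
Conceptually, a similarity search embeds the query, scores every stored chunk against it, and returns the `k` best matches. The sketch below shows that idea with cosine similarity on hand-written toy vectors. The vectors, the names `cosine` and `top_k`, and the choice of metric are assumptions for illustration; real embeddings from the model above have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k):
    """Return the names of the k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy 3-dimensional "embeddings"; purely illustrative.
toy_vectors = {
    "history": [0.9, 0.1, 0.0],
    "fees": [0.1, 0.9, 0.1],
    "housing": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

ranked = top_k(query_vec, toy_vectors, k=2)
print(ranked)
```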

### 3. Non-finetuned Llama 3.2 1B model with RAG

In [None]:
# Example questions
query = "How can I increase my chances of admission into SUTD?"


#--- ADD YOUR SOLUTION HERE (40 points)---
# Define prompt for question-answering
prompt = hub.pull("rlm/rag-prompt")

class Search(TypedDict):
    """Search query."""

    query: Annotated[str, ..., "Search query to run."]
    section: Annotated[
        Literal["beginning", "middle", "end"],
        ...,
        "Section to query.",
    ]

# Define state for application
class State(TypedDict):
    question: str
    query: Search
    context: List[Document]
    answer: str


def analyze_query(state: State):

    raw_query = state["question"]
    
    # Manual parsing for structured output
    try:
        parsed_query = {
            "query": raw_query.split("Query:")[-1].split("Section:")[0].strip(),
            "section": "beginning" if "beginning" in raw_query.lower() 
                    else "middle" if "middle" in raw_query.lower()
                    else "end"
        }
        return {"query": parsed_query}
    except Exception as e:
        print(f"Query parsing failed: {e}")
        return {"query": {"query": state["question"], "section": "beginning"}}


def retrieve(state: State):
    query = state["query"]
    retrieved_docs = vector_store.similarity_search(
        query["query"],
        k=3
    )
    return {"context": retrieved_docs}


def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    formatted_input = (
        f"Context: {docs_content}\n"
        f"Question: {state['question']}\n"
        "You are a helpful and friendly assistant who provides detailed and informative "
        "answers to prospective students about their queries regarding the Singapore "
        "University of Technology and Design (SUTD). Elaborate on your response while "
        "keeping it concise and relevant. Answer:"
    )
    
    # Generate response
    response = llm_base(
        formatted_input,
        max_new_tokens=512,
        temperature=0.4,
        pad_token_id=tokenizer_base.pad_token_id
    )
    
    return {"answer": response[0]['generated_text'].split("Answer:")[-1].strip()}

parser = PydanticOutputParser(pydantic_object=Search)
structured_chain = llm_base | parser

graph_builder = StateGraph(State).add_sequence([analyze_query, retrieve, generate])
graph_builder.add_edge(START, "analyze_query")
graph = graph_builder.compile()

for step in graph.stream(
    {"question": query},
    stream_mode="updates",
):
    print(f"{step}\n\n----------------\n")
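
To see what the manual parsing in `analyze_query` produces, the same expressions can be pulled out into a standalone function and tried on sample inputs. `parse_query` is a hypothetical helper name used only for this sketch. The second input, which uses explicit "Query:" and "Section:" markers, is a made-up example.

```python
def parse_query(raw_query: str) -> dict:
    """Mirror of the parsing expressions used in analyze_query."""
    lowered = raw_query.lower()
    return {
        # Text after the last "Query:" marker and before any "Section:" marker.
        "query": raw_query.split("Query:")[-1].split("Section:")[0].strip(),
        # First matching keyword wins; "end" is the fallback.
        "section": "beginning" if "beginning" in lowered
                   else "middle" if "middle" in lowered
                   else "end",
    }


plain = parse_query("How can I increase my chances of admission into SUTD?")
marked = parse_query("Query: tuition fees Section: beginning")

print(plain)
print(marked)
```

For a plain question without markers, the whole question becomes the search query and the section falls back to "end". Note that `retrieve` only reads the `query` field, so the `section` value does not affect which chunks are returned.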

In [None]:
questions = [
            "What are the admissions deadlines for SUTD?",
            "Is there financial aid available?",
            "What is the minimum score for the Mother Tongue Language?",
            "Do I require reference letters?",
            "Can polytechnic diploma students apply?",
            "Do I need SAT score?",
            "How many PhD students does SUTD have?",
            "How much are the tuition fees for Singaporeans?",
            "How much are the tuition fees for international students?",
            "Is there a minimum CAP?",
            "If I am a polytechnic student with CGPA 3.0, am I still able to go SUTD?",
            "Is first year housing compulsory?",
            "Is ILP compulsory?",
            "Does SUTD help me in sourcing internships or jobs?",
            "I want to create a startup during my undergraduate years. What assistance does SUTD provide?",
            "I am new to programming but I want to join Computer Science & Design. Will SUTD provide any bridging courses in the first year?",
            "I want to work in cybersecurity after graduation. What course and modules should I take at SUTD?",
            "What career path does DAI open for me?",
            "Who can I contact to query about my admission application?",
            "When does school start for freshmore?"
            ]

data = [] 
steps_order = ['analyze_query', 'retrieve', 'generate']  

for question in tqdm(questions):
    # Initialize fresh record for each question
    record = {step: [] for step in steps_order}
    
    step_counter = 0  
    
    for step_result in graph.stream(
        {"question": question},
        stream_mode="updates"
    ):
        if step_counter >= len(steps_order):
            break
            
        current_step = steps_order[step_counter]
        
        # Safely extract step data
        if current_step in step_result:
            record[current_step].append(step_result[current_step])
            
        step_counter += 1
    
    data.append(record)


# print(data)

flat_data = []
for record in data:
    flat_data.append({
        'query': record['analyze_query'][0]['query']['query'],
        'contexts': [doc.page_content for doc in record['retrieve'][0]['context']],
        'answer': record['generate'][0]['answer']
    })
    
df = pd.DataFrame(flat_data)
df

df.to_csv('results_base_rag.csv', index=False)
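
The collection loop above matches each streamed update to a step by position. Since each update is a dict keyed by node name, the updates could also be collected by key, which does not depend on the order in which steps arrive. A minimal sketch of that alternative follows; `fake_stream` and `collect_updates` are hypothetical names, and the yielded values are placeholders standing in for real graph output.

```python
def fake_stream():
    """Stand-in for graph.stream(..., stream_mode="updates"); placeholder values only."""
    yield {"analyze_query": {"query": {"query": "Is ILP compulsory?", "section": "end"}}}
    yield {"retrieve": {"context": ["placeholder chunk 1", "placeholder chunk 2"]}}
    yield {"generate": {"answer": "placeholder answer"}}


def collect_updates(stream) -> dict:
    """Merge streamed updates into one record keyed by node name."""
    record = {}
    for update in stream:
        for node_name, payload in update.items():
            record[node_name] = payload
    return record


record = collect_updates(fake_stream())

# Flatten into one row, in the same shape as the rows written to the CSV.
row = {
    "query": record["analyze_query"]["query"]["query"],
    "contexts": record["retrieve"]["context"],
    "answer": record["generate"]["answer"],
}
print(row)
```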

### 4. Finetuned Llama 3.2 1B model with RAG

In [None]:
# Example questions
query = "How can I increase my chances of admission into SUTD?"


#--- ADD YOUR SOLUTION HERE (40 points)---
# Define prompt for question-answering
prompt = hub.pull("rlm/rag-prompt")

class Search(TypedDict):
    """Search query."""

    query: Annotated[str, ..., "Search query to run."]
    section: Annotated[
        Literal["beginning", "middle", "end"],
        ...,
        "Section to query.",
    ]

# Define state for application
class State(TypedDict):
    question: str
    query: Search
    context: List[Document]
    answer: str


def analyze_query(state: State):

    raw_query = state["question"]
    
    # Manual parsing for structured output
    try:
        parsed_query = {
            "query": raw_query.split("Query:")[-1].split("Section:")[0].strip(),
            "section": "beginning" if "beginning" in raw_query.lower() 
                    else "middle" if "middle" in raw_query.lower()
                    else "end"
        }
        return {"query": parsed_query}
    except Exception as e:
        print(f"Query parsing failed: {e}")
        return {"query": {"query": state["question"], "section": "beginning"}}


def retrieve(state: State):
    query = state["query"]
    retrieved_docs = vector_store.similarity_search(
        query["query"],
        k=3
    )
    return {"context": retrieved_docs}


def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    formatted_input = (
        f"Context: {docs_content}\n"
        f"Question: {state['question']}\n"
        "You are a helpful and friendly assistant who provides detailed and informative "
        "answers to prospective students about their queries regarding the Singapore "
        "University of Technology and Design (SUTD). Elaborate on your response while "
        "keeping it concise and relevant. Answer:"
    )
    
    # Generate response
    response = llm_finetune(
        formatted_input,
        max_new_tokens=512,
        temperature=0.4,
        pad_token_id=tokenizer_finetune.pad_token_id
    )
    
    return {"answer": response[0]['generated_text'].split("Answer:")[-1].strip()}

parser = PydanticOutputParser(pydantic_object=Search)
structured_chain = llm_finetune | parser

graph_builder = StateGraph(State).add_sequence([analyze_query, retrieve, generate])
graph_builder.add_edge(START, "analyze_query")
graph = graph_builder.compile()

for step in graph.stream(
    {"question": query},
    stream_mode="updates",
):
    print(f"{step}\n\n----------------\n")

In [None]:
questions = [
            "What are the admissions deadlines for SUTD?",
            "Is there financial aid available?",
            "What is the minimum score for the Mother Tongue Language?",
            "Do I require reference letters?",
            "Can polytechnic diploma students apply?",
            "Do I need SAT score?",
            "How many PhD students does SUTD have?",
            "How much are the tuition fees for Singaporeans?",
            "How much are the tuition fees for international students?",
            "Is there a minimum CAP?",
            "If I am a polytechnic student with CGPA 3.0, am I still able to go SUTD?",
            "Is first year housing compulsory?",
            "Is ILP compulsory?",
            "Does SUTD help me in sourcing internships or jobs?",
            "I want to create a startup during my undergraduate years. What assistance does SUTD provide?",
            "I am new to programming but I want to join Computer Science & Design. Will SUTD provide any bridging courses in the first year?",
            "I want to work in cybersecurity after graduation. What course and modules should I take at SUTD?",
            "What career path does DAI open for me?",
            "Who can I contact to query about my admission application?",
            "When does school start for freshmore?"
            ]

data = [] 
steps_order = ['analyze_query', 'retrieve', 'generate']  

for question in tqdm(questions):
    # Initialize fresh record for each question
    record = {step: [] for step in steps_order}
    
    step_counter = 0  
    
    for step_result in graph.stream(
        {"question": question},
        stream_mode="updates"
    ):
        if step_counter >= len(steps_order):
            break
            
        current_step = steps_order[step_counter]
        
        # Safely extract step data
        if current_step in step_result:
            record[current_step].append(step_result[current_step])
            
        step_counter += 1
    
    data.append(record)


# print(data)

flat_data = []
for record in data:
    flat_data.append({
        'query': record['analyze_query'][0]['query']['query'],
        'contexts': [doc.page_content for doc in record['retrieve'][0]['context']],
        'answer': record['generate'][0]['answer']
    })
    
df = pd.DataFrame(flat_data)
df

df.to_csv('results_finetune_rag.csv', index=False)

# Bonus points: LLM-as-judge evaluation 

Implement an LLM-as-judge pipeline to assess the quality of the different systems (finetuned vs. non-finetuned, RAG vs. no RAG).
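
One possible shape for such a pipeline, as a rough sketch rather than a reference solution: format each question and answer into a grading prompt, ask a judge model for a score, parse the score, and average per system. Everything here is an assumption made for illustration: the rubric wording, the 1 to 5 scale, the names `JUDGE_TEMPLATE`, `parse_score`, `judge_system`, and the `stub_judge` function, which stands in for a real judge model so the sketch runs without one.

```python
import re
from statistics import mean
from typing import Callable, Optional

# Hypothetical rubric; adapt the criteria and scale to your own evaluation.
JUDGE_TEMPLATE = (
    "You are grading a chatbot answer for prospective SUTD students.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer from 1 (poor) to 5 (excellent) for relevance and clarity.\n"
    "Reply in the form 'Score: <number>'."
)


def parse_score(judge_output: str) -> Optional[int]:
    """Extract a 1-5 score from the judge's reply, or None if absent."""
    match = re.search(r"Score:\s*([1-5])\b", judge_output)
    return int(match.group(1)) if match else None


def judge_system(rows, judge: Callable[[str], str]) -> float:
    """Average judge score over rows with 'query' and 'answer' keys."""
    scores = []
    for row in rows:
        prompt = JUDGE_TEMPLATE.format(question=row["query"], answer=row["answer"])
        score = parse_score(judge(prompt))
        if score is not None:  # skip replies the parser cannot read
            scores.append(score)
    return mean(scores) if scores else float("nan")


def stub_judge(prompt: str) -> str:
    """Placeholder judge: non-empty answers get 4, empty answers get 1."""
    answer = prompt.split("Answer: ")[1].split("\n")[0]
    return "Score: 4" if answer.strip() else "Score: 1"


# Placeholder rows in the same shape as the result CSVs.
rows = [
    {"query": "Is ILP compulsory?", "answer": "placeholder answer text"},
    {"query": "Do I need SAT score?", "answer": ""},
]
average = judge_system(rows, stub_judge)
print(average)
```

In a real run, `stub_judge` would be replaced by a call to a stronger model, and `judge_system` would be called once per system so the averages can be compared.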

In [None]:
# QUESTION: Implement an LLM-as-judge pipeline to assess the quality of the different systems (finetuned vs. non-finetuned, RAG vs. no RAG)

#--- ADD YOUR SOLUTION HERE (40 points)---



# Bonus points: chatbot UI

Implement a web UI frontend for your chatbot that you can demo in class. 

In [None]:
# QUESTION: Implement a web UI frontend for your chatbot that you can demo in class. 

#--- ADD YOUR SOLUTION HERE (40 points)---

# End

This concludes assignment 4.

Please submit this notebook with your answers and the generated output cells as a **Jupyter notebook file** via GitHub.


Every group member should do the following submission steps:
1. Create a private GitHub repository **sutd_5055mlop** under your GitHub user.
2. Add your instructors as collaborators: ddahlmeier and lucainiaoge.
3. Save your submission as assignment_04_GROUP_NAME.ipynb where GROUP_NAME is the name of the group you have registered.
4. Push the submission files to your repo.
5. Submit the link to the repo via eDimensions.



**Assignment due 21 April 2025 11:59pm**