<a href="https://colab.research.google.com/github/urness/CS167Fall2025/blob/main/Day27_Intro_to_Transformers_Part3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day27
## Intro to Transformers part 3

#### CS167: Machine Learning, Fall 2025



## __Put the Model on Training Device (GPU or CPU)__


A GPU is not necessary for this notebook, but it won't hurt.
A graphics processing unit (GPU) can accelerate the training process. Fortunately, Colab provides access to one. To enable it, go to _Runtime (or click the down arrow near RAM & DISK in the upper right)-->Change runtime type-->GPU or TPU_.

Professor Urness tested this code with the regular CPU option.

## Setup

We start by setting up the lab: installing the required libraries (`transformers`, `datasets`, and `accelerate`) and ignoring the warnings.

# Part 1: Fine-Tuning

Some necessary import statements:

In [None]:
!pip install -q transformers datasets accelerate

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import Dataset
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")
EOS = "<|endofanswer|>"

Establish a small JSON dataset (a knowledge base) with course facts.


In [None]:
training_data = [
    {
        "question": "How many quizzes are in CS 167?",
        "answer": "There are 3 quizzes in CS 167."
    },
    {
        "question": "What is CS 167 about?",
        "answer": "CS 167 is an introduction to machine learning and data science."
    },
    {
        "question": "How should I prepare for CS 167 quizzes?",
        "answer": "Review lecture notes, complete the practice problems, and understand the concepts, not just the code."
    }
]


Convert each JSON object into a simple prompt → completion format.

This code uses the `Dataset` class from the `datasets` package, an efficient, standardized container for training data used in machine learning pipelines.

*Confession: This also makes our results a little cleaner for this small demonstration*

In [None]:
texts = [
    f"Question: {item['question']}\nAnswer: {item['answer']} {EOS}"
    for item in training_data
]

dataset = Dataset.from_dict({"text": texts})
# dataset will now contain the formatted versions of the "facts" provided earlier
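
To see exactly what the model will be trained on, it can help to print one formatted example. The short sketch below rebuilds a single entry with the same f-string template used above; the `item` dictionary is copied from the knowledge base, and `EOS` repeats the marker defined in the import cell.

```python
# Minimal sketch: reproduce the prompt -> completion template for one entry.
EOS = "<|endofanswer|>"  # same end-of-answer marker defined earlier

item = {
    "question": "How many quizzes are in CS 167?",
    "answer": "There are 3 quizzes in CS 167.",
}

# The question and the answer sit on separate lines; the marker closes the answer.
text = f"Question: {item['question']}\nAnswer: {item['answer']} {EOS}"
print(text)
```

Every training example ends with the same marker, which gives the model a consistent signal for where an answer stops.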

Now, we can load the (rather small) model and tokenizer.

In [None]:

# credit: part of this code generated with the help of ChatGPT
model_name = "gpt2"   # rather small transformer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# pad token fix for GPT-2
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens({"additional_special_tokens": [EOS]})

model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

Transformers cannot read or understand text directly. They do not operate on characters, words, or sentences; instead, they work on vector embeddings. So we first need to tokenize the dataset: the tokenizer converts each text into integer token IDs, which the model then maps to embedding vectors.
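
As a rough illustration of that pipeline, the toy sketch below turns a sentence into tokens, then integer IDs, then vectors. It is not how GPT-2 tokenizes (GPT-2 uses a learned byte-pair-encoding vocabulary); the whitespace split, the tiny `embedding_dim`, and the random `embedding_table` are assumptions made purely for illustration.

```python
# Toy illustration only: real tokenizers use a learned subword vocabulary.
import random

text = "how many quizzes are in cs 167"

# 1) Tokenize: split the text into tokens (here, simply on whitespace).
tokens = text.split()

# 2) Build a vocabulary that maps each unique token to an integer ID.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

# 3) Look up one embedding vector per ID (random here; learned in a real model).
random.seed(0)
embedding_dim = 4  # hypothetical and tiny; GPT-2 small uses 768
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]
embeddings = [embedding_table[i] for i in token_ids]

print(tokens)
print(token_ids)
print(len(embeddings), "vectors of length", embedding_dim)
```

The model only ever sees the list of IDs; the embedding lookup happens inside the model itself.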


In [None]:
def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=150,
    )

tokenized = dataset.map(tokenize, batched=True)
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask"])

# For GPT-style fine-tuning
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
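
With `mlm=False`, the collator prepares batches for causal language modeling. As far as the library's behavior is understood here, the labels start as a copy of `input_ids`, and padding positions are replaced with `-100` so the loss ignores them. The sketch below imitates that in plain Python; the `mask_padding` helper and the token IDs are made up for illustration.

```python
# Sketch of how causal-LM labels are typically derived from padded input IDs.
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default
PAD_ID = 50256       # GPT-2's end-of-text ID, reused as the pad token in this notebook

def mask_padding(input_ids, pad_id=PAD_ID):
    """Copy input_ids to labels, replacing pad positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

# Made-up IDs for a short sequence padded to length 8. 50257 stands in for the
# ID the added end-of-answer marker would receive (assumption: it is appended
# right after GPT-2's original 50257-entry vocabulary).
input_ids = [24361, 25, 1374, 867, 50257, 50256, 50256, 50256]
labels = mask_padding(input_ids)
print(labels)
```

Because the pad token here is GPT-2's own end-of-text token, those positions do not contribute to the loss, while a separate marker such as `<|endofanswer|>` still does.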

Set up training. This may take a little time.
In my experience, it took 60 seconds with just the CPU.

In [None]:
# Training settings for fine-tuning GPT-2
training_args = TrainingArguments(
    output_dir="./cs167-gpt2",     # Where to save the fine-tuned model and checkpoints
    num_train_epochs=20,           # Number of full passes through the training data
    per_device_train_batch_size=1, # How many samples per training step (small dataset → batch size 1 is fine)
    learning_rate=5e-5,            # How quickly the model updates its weights during training
    logging_steps=5,               # Print training metrics (loss, etc.) every 5 steps
    save_steps=500,                # Save a checkpoint every 500 steps (rare, since dataset is tiny)
    save_total_limit=1,            # Keep only the most recent checkpoint to avoid clutter
    report_to="none",              # Turn off logging to external tools like WandB or TensorBoard
)

# Trainer object that handles the training loop, optimization, and batching
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=data_collator,
)

# Let's train it!
trainer.train()
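
The settings above determine how many optimization steps the run performs. The arithmetic below is a quick sanity check, assuming a single device and no gradient accumulation (neither is set explicitly in this notebook).

```python
import math

num_examples = 3     # entries in training_data
batch_size = 1       # per_device_train_batch_size
num_epochs = 20      # num_train_epochs
logging_steps = 5
save_steps = 500

# One optimization step per batch; assumes one device, no gradient accumulation.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("loss reports:", total_steps // logging_steps)
print("periodic checkpoints:", total_steps // save_steps)
```

With only 60 steps in total, the 500-step checkpoint interval is never reached during training, although some versions of `Trainer` may still save once when training ends.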


Ask the new model a question!

In [None]:
import torch

# --------------------------------------------------------
# Takes a text prompt and asks the model to generate a continuation.
# --------------------------------------------------------
def answer(question):
    # Build the prompt in the same style the model was fine-tuned on
    prompt = f"Question: {question}\nAnswer:"

    # Tokenize the prompt so the model can process it
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    model.to(device).eval()  # move model and set eval mode
    # Generate the model's answer
    outputs = model.generate(
        **inputs,
        max_new_tokens=25,                  # Limit the length of the answer so it doesn't ramble
        pad_token_id=tokenizer.eos_token_id, # Required because GPT-2 has no pad token
        eos_token_id=tokenizer.convert_tokens_to_ids(EOS), # Stop generation when the custom end-of-answer token is produced
        do_sample=False,                    # Use greedy decoding → deterministic, no randomness
    )

    # Convert token IDs back into text
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the answer portion and remove the custom EOS tag if present
    return text.split("Answer:")[-1].split(EOS)[0].strip()



# --------------------------------------------------------
# Example usage
# --------------------------------------------------------

question = "How many quizzes are in CS 167?"
answer_text = answer(question)
print(f"Q: {question}\nA: {answer_text}\n")

# question = "What is CS 167 about?"
# answer_text = answer(question)
# print(f"Q: {question}\nA: {answer_text}\n")

# question = "How should I prepare for CS 167 quizzes?"
# answer_text = answer(question)
# print(f"Q: {question}\nA: {answer_text}\n")
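
Setting `do_sample=False` asks for greedy decoding: at each step the single most likely next token is chosen. The toy sketch below shows the idea with a made-up probability table; a real model conditions on the whole sequence so far, not just the previous token, and `NEXT_TOKEN_PROBS` and `greedy_decode` are hypothetical names used only here.

```python
# Toy next-token table standing in for a real model's probabilities (made up).
NEXT_TOKEN_PROBS = {
    "Answer:": {"There": 0.7, "CS": 0.3},
    "There": {"are": 0.9, "is": 0.1},
    "are": {"3": 0.6, "many": 0.4},
    "3": {"quizzes": 0.8, "<EOS>": 0.2},
    "quizzes": {"<EOS>": 0.9, "in": 0.1},
}

def greedy_decode(start, max_new_tokens=25, eos="<EOS>"):
    """Repeatedly pick the most likely next token until EOS or the length limit."""
    tokens = [start]
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break  # nothing known to follow this token
        next_token = max(candidates, key=candidates.get)
        if next_token == eos:
            break  # stop as soon as the end marker is the top choice
        tokens.append(next_token)
    return tokens[1:]  # drop the starting prompt token

print(" ".join(greedy_decode("Answer:")))
```

Since there is no randomness, the same prompt always yields the same output, which makes the exercise below easier to reason about.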

# Fine-Tuning Exercise
Change the prompt.
1. What kinds of questions can the model get correct?
2. What kinds of questions will the model get incorrect?
3. Why do you think this is happening?

# Part 2: Retrieval-Augmented Generation (RAG)

Some necessary import statements:

In [None]:
!pip install -q transformers torch scikit-learn

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


Establish a small JSON dataset (a knowledge base) with course facts.

In [None]:
training_data = [
    {
        "question": "How many quizzes are in CS 167?",
        "answer": "There are 3 quizzes in CS 167."
    },
    {
        "question": "What is CS 167 about?",
        "answer": "CS 167 is an introduction to machine learning and data science."
    },
    {
        "question": "How should I prepare for CS 167 quizzes?",
        "answer": "Review lecture notes, complete the practice problems, and make sure you understand the concepts, not just the code."
    }
]



TF-IDF = Term Frequency × Inverse Document Frequency

The following code:
- Splits text into words/features.
- Creates a dictionary of all unique words in your documents.
- Creates a TF-IDF matrix, which helps emphasize words that carry meaning.
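
To make the steps above concrete, here is a minimal pure-Python sketch of a TF-IDF computation on the three course questions. It aims to mirror what appear to be `TfidfVectorizer`'s defaults (lowercasing, tokens of two or more word characters, smoothed IDF, L2-normalized rows), but it is an illustration rather than the library's implementation; the helper names `split_words` and `tfidf_vector` are made up here.

```python
import math
import re
from collections import Counter

docs = [
    "How many quizzes are in CS 167?",
    "What is CS 167 about?",
    "How should I prepare for CS 167 quizzes?",
]

def split_words(text):
    # Lowercase and keep runs of 2+ word characters (drops "I", keeps "167").
    return re.findall(r"\b\w\w+\b", text.lower())

doc_tokens = [split_words(d) for d in docs]
vocab = sorted({tok for doc in doc_tokens for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = {w: sum(w in doc for doc in doc_tokens) for w in vocab}
# Smoothed IDF: rare words get larger weights than words found everywhere.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]  # L2-normalize the row

matrix = [tfidf_vector(doc) for doc in doc_tokens]
print(vocab)
print({w: round(idf[w], 3) for w in ("cs", "quizzes", "about")})
```

Words that appear in every question, such as "cs", receive the lowest weight, while a word unique to one question, such as "about", receives the highest.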


In [None]:
questions = [item["question"] for item in training_data]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(questions)

# ----------------------------------------------------------
# Function: retrieve_context
# Given a user question, return the top-k most similar Q&A entries.
# ----------------------------------------------------------
def retrieve_context(user_question, k=2):
    q_vec = vectorizer.transform([user_question])
    sims = cosine_similarity(q_vec, doc_vectors)[0]
    top_indices = sims.argsort()[::-1][:k]
    return [training_data[i] for i in top_indices]
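
The retrieval step ranks documents by cosine similarity and keeps the top `k`. The sketch below spells out that calculation in plain Python on small made-up vectors; `cosine` and `top_k` are hypothetical helpers, not part of scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k most similar documents, best first."""
    sims = [cosine(query_vec, d) for d in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)
    return ranked[:k]

# Hypothetical 3-dimensional vectors (one per document) and a query vector.
doc_vecs = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
query_vec = [1.0, 0.0, 0.5]
print(top_k(query_vec, doc_vecs))
```

Note that the top `k` entries are always returned, even when their similarity is zero, so an unrelated question still receives some context.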

Now, we can load the (very small) model and tokenizer.

In [None]:
# Load a model. Note that google/flan-t5-small is good at question answering
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device);

Now, build a prompt, and put the most appropriate elements from the knowledge base into the context of the prompt.


In [None]:
# Build a prompt by inserting the retrieved Q&A context and the user's question in a structured format.
def build_prompt(user_question, retrieved_items):
    context_lines = []
    for item in retrieved_items:
        context_lines.append(f"Q: {item['question']}\nA: {item['answer']}")
    context_text = "\n\n".join(context_lines)

    prompt = (
        "You are a helpful teaching assistant for CS 167 at Drake University.\n"
        "Use ONLY the context to answer the student's question.\n"
        "If the answer is in the context, copy it or paraphrase it.\n"
        "If you don't know, say you don't know.\n\n"
        f"Context:\n{context_text}\n\n"
        f"Student question: {user_question}\n"
        "Answer in one short sentence."
    )
    return prompt

In [None]:
# Run the full RAG pipeline: retrieve context, build the prompt, generate an answer.
def rag_answer(user_question, max_new_tokens=32):
    retrieved = retrieve_context(user_question)
    prompt = build_prompt(user_question, retrieved)

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
        )
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
    return answer, retrieved, prompt

# Try it out!!

In [None]:
question = "How many quizzes are in CS 167?"
answer_text, retrieved, prompt = rag_answer(question)
print(f"Q: {question}\nA: {answer_text}\n")

# question = "What is CS 167 about?"
# answer_text, retrieved, prompt = rag_answer(question)
# print(f"Q: {question}\nA: {answer_text}\n")

# question = "How should I prepare for CS 167 quizzes?"
# answer_text, retrieved, prompt = rag_answer(question)
# print(f"Q: {question}\nA: {answer_text}\n")

# RAG Exercise: Change the prompt.
1. What kinds of questions can the RAG model get correct?
2. What kinds of questions will the RAG model get incorrect?
3. Compare RAG with Fine-Tuning

In [None]:
# The following prints out the context that was used to construct the answer:
print(prompt)

# Part 3: Just a Regular Model

The following code will **not** utilize the knowledge base before answering -- it just uses the model. Does it still get the answer correct?

In [None]:
def answer_without_kb(user_question, max_new_tokens=32):
    """
    Ask model directly, no retrieved context.
    """
    prompt = (
        "You are a helpful teaching assistant for CS 167 at Drake University.\n"
        "Answer the following question based on your general knowledge.\n\n"
        f"Question: {user_question}\n"
        "Answer in one short sentence."
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

question = "How many quizzes are in CS 167?"
answer_text = answer_without_kb(question)
print(f"Q: {question}\nA: {answer_text}\n")

# question = "What is CS 167 about?"
# answer_text = answer_without_kb(question)
# print(f"Q: {question}\nA: {answer_text}\n")
