# Lab - Building Chatbot Functionality using GPT

## Lab Summary:
In this lab we explore the use of GPT2 to explore Chatbot functionality.

## Lab Goal:
Upon completion of this lab, the student should be able to:
<ul>
    <li> Apply Python to implement a chatbot within Jupyter Lab</li>
    <li> Apply Python to implement train a GPT2 model</li>
</ul>


## Packages and Classes
In this lab we will be using the following libraries:
<ol>
    <li> transformers </li>
    <li> numpy </li>
    <li> torch </li>
    <li> tqdm </li>
    <li> sklearn </li>
</ol>

#### <b>Only run if using Colab:</b>
##### Optional steps to mount Google Drive and navigate to folder with test data:


In [56]:
! pip install transformers numpy torch tqdm scikit-learn

680367.89s - pydevd: Sending message related to process being replaced timed-out after 5 seconds


Collecting scikit-learn
  Using cached scikit_learn-1.7.2-cp311-cp311-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting scipy>=1.8.0 (from scikit-learn)
  Downloading scipy-1.16.3-cp311-cp311-macosx_14_0_arm64.whl.metadata (62 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Using cached joblib-1.5.2-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Using cached threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Using cached scikit_learn-1.7.2-cp311-cp311-macosx_12_0_arm64.whl (8.6 MB)
Using cached joblib-1.5.2-py3-none-any.whl (308 kB)
Downloading scipy-1.16.3-cp311-cp311-macosx_14_0_arm64.whl (20.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.9/20.9 MB[0m [31m19.0 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hUsing cached threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, scipy, joblib, scikit-learn
Successfully installed joblib-1.5.2 scikit-learn-1.7.2 scipy-1.16.3 t

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
## Navigate to my google drive folder
# import os
# my_drive = '/content/drive/My Drive/'  ## Change this to your folder location.
# os.chdir(my_drive)
# print(os.getcwd())

### Step 1: Load and Preview Your Dataset

The first step is loading a dataset to train your GPT model with.

Before running this next code, place the training .txt file into the directory where this jupyter notebook is run. 

Alternatively, you may use the os module to navigate to the proper folder where the training data file is located.

In [57]:
import json

# Read the SQuAD JSON file
with open("input_squad.json", "r", encoding="utf-8") as f:
    squad_data = json.load(f)

# Create formatted text with special tokens
formatted_text = ""

# Iterate through the SQuAD data structure
for article in squad_data["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            question = qa["question"]
            for answer in qa["answers"]:
                # Format each QA pair following the semantic format
                qa_pair = f"<|startoftext|>\nQ: {question}\nA: {answer['text']}\n<|endoftext|>\n"
                formatted_text += qa_pair

# Save the formatted text
with open("formatted_squad.txt", "w", encoding="utf-8") as f:
    f.write(formatted_text)

# Update the document variable to use the new formatted file
document = "formatted_squad.txt"

# Preview the first few QA pairs
print(formatted_text[:500])

<|startoftext|>
Q: When did Beyonce start becoming popular?
A: in the late 1990s
<|endoftext|>
<|startoftext|>
Q: What areas did Beyonce compete in when she was growing up?
A: singing and dancing
<|endoftext|>
<|startoftext|>
Q: When did Beyonce leave Destiny's Child and become a solo singer?
A: 2003
<|endoftext|>
<|startoftext|>
Q: In what city and state did Beyonce  grow up? 
A: Houston, Texas
<|endoftext|>
<|startoftext|>
Q: In which decade did Beyonce become famous?
A: late 1990s
<|endoftext


In [58]:
# Read your training data
document="formatted_squad.txt"

with open(document, "r", encoding="utf-8") as f:
    text = f.read()

# Display the first few lines
print(text[:500])

<|startoftext|>
Q: When did Beyonce start becoming popular?
A: in the late 1990s
<|endoftext|>
<|startoftext|>
Q: What areas did Beyonce compete in when she was growing up?
A: singing and dancing
<|endoftext|>
<|startoftext|>
Q: When did Beyonce leave Destiny's Child and become a solo singer?
A: 2003
<|endoftext|>
<|startoftext|>
Q: In what city and state did Beyonce  grow up? 
A: Houston, Texas
<|endoftext|>
<|startoftext|>
Q: In which decade did Beyonce become famous?
A: late 1990s
<|endoftext


### Step 2: Tokenize the Text Using Hugging Face Tokenizer
We use OpenAI’s tokenizer GPT2Tokenizer to convert text to tokens.

In [59]:
from transformers import GPT2Tokenizer
import numpy as np
import os

In [60]:
# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # Required for compatibility

# Encode full dataset (supports <|startoftext|> and <|endoftext|>)
tokens = tokenizer.encode(text)

# Save to binary format
os.makedirs("data/nlp_chatbot", exist_ok=True)
np.array(tokens, dtype=np.uint16).tofile("data/nlp_chatbot/train.bin")

print(f"Number of tokens: {len(tokens)}")
print("First few tokens:", tokens[:20])  # show first 20 token IDs

# Note: the token length warning may be ignored because our model will break our sequence into chunks smaller than 1024 during training.

Token indices sequence length is longer than the specified maximum sequence length for this model (2865834 > 1024). Running this sequence through the model will result in indexing errors


Number of tokens: 2865834
First few tokens: [27, 91, 9688, 1659, 5239, 91, 29, 198, 48, 25, 1649, 750, 37361, 344, 923, 5033, 2968, 30, 198, 32]


### Step 3: Save Tokens for Training

We will store the tokenized data in a local folder to retrieve during training.

In [61]:
import numpy as np
import os

In [62]:
# Create a local directory to save tokenized data.
directory = "data/nlp_chatbot"
os.makedirs(directory, exist_ok=True)

bin_filename = "train_squad.bin"
file = directory + "/" + bin_filename
# Save the tokens as a binary file, which is easier for a computer to process.
np.array(tokens, dtype=np.uint16).tofile(file)
print("binary file saved at:", file)

binary file saved at: data/nlp_chatbot/train_squad.bin


### Step 4: Define Training Dataset Loader

In [63]:
import torch

In [65]:
class CharDataset(torch.utils.data.Dataset):
    def __init__(self, data_path, block_size):
        self.data = np.fromfile(data_path, dtype=np.uint16)
        self.block_size = block_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        chunk = self.data[idx : idx + self.block_size + 1]
        x = torch.tensor(chunk[:-1], dtype=torch.long)
        y = torch.tensor(chunk[1:], dtype=torch.long)
        return x, y

# Load it
block_size = 640
dataset = CharDataset(file, block_size)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True)


### Step 5: Define a Tiny GPT-2 Model
To keep things minimal and educational, we use a super small GPT-like model.

In [66]:
# We'll use these libraries to build the GPT model layers and apply transformations.
import torch.nn as nn
import torch.nn.functional as F

In [69]:
# We define a Python class that creates a GPT model called "TinyGPT".
class TinyGPT(nn.Module):
    # The __init__ function hardcodes embedding size = 256, number of layers = 6, and block size = 64.
    def __init__(self, vocab_size, n_embd=256, n_head=4, n_layer=6, block_size=64):
        # Note: The params in this model can be modified to improve the response.
        # By changing these values, the time it takes to train will change.
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=n_embd, nhead=n_head, dim_feedforward=4*n_embd, dropout=0.1, activation='gelu'
            ) for _ in range(n_layer)
        ])
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size)
        self.block_size = block_size

    def forward(self, idx, targets=None):
        B, T = idx.size()
        pos = torch.arange(T, device=idx.device).unsqueeze(0)
        x = self.token_embedding(idx) + self.position_embedding(pos)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.head(x)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

### Step 6: Train the Model

In [None]:
# Now we'll train our model.
# Import some libraries to track the time it takes to run this.
from tqdm import tqdm
import time

# If you are using a computer with a modern GPU, you may take advantage of its cuda cores for speed.
# If you are using Colab, a virtual machine, a Mac, or a less powerful device, use the CPU. This will take longer.
device = 'mps' if torch.backends.mps.is_available() else 'cpu'

# We'll print which version we are using, in case you aren't sure.
print("device selected:", device)

# Recalling our vocab size, this is the number of unique tokens found in the corpus.
vocab_size = tokenizer.vocab_size

# Week 10: Increased block size to 640 for better context handling
model = TinyGPT(vocab_size, block_size=640).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Training loop

# Max iterations can be modified. Changing this will change performance and speed of running the model.
# If you have a powerful GPU, feel free to increase this substantially. 5000 is a good place to start.
# If you do not have a GPU, set this number relatively low (1000 - 2000)
# It will take some time to run (tested 13 minutes on Colab for 1000 iterations)

# Week 10 - Changed max_iters to 5000 for better training
max_iters = 5000

# Wrap the DataLoader in tqdm with max_iters
# progress_bar = tqdm(enumerate(data_loader), total=max_iters)
start_time = time.time()

for step, (x, y) in enumerate(data_loader):
    if step >= max_iters:
        break
    x, y = x.to(device), y.to(device)

    logits, loss = model(x, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print the training loss at each n iterations.
    # Change this to see the status as your model trains. 
    # Larger Step Size = Fewer status updates.
    if step % 100 == 0:
        print(f"Step {step}: loss {loss.item():.4f}")

# Print the total time it took to execute.
end_time = time.time()
print(f"\nTraining completed in {end_time - start_time:.2f} seconds ({(end_time - start_time)/60:.2f} minutes)")

device selected: mps
Step 0: loss 10.9438
Step 100: loss 3.9917
Step 200: loss 3.8378
Step 300: loss 3.8291
Step 400: loss 3.9353
Step 500: loss 3.9937
Step 600: loss 3.9304
Step 700: loss 3.8189
Step 800: loss 3.5790
Step 900: loss 4.0048
Step 1000: loss 3.7531
Step 1100: loss 3.9578
Step 1200: loss 3.5190
Step 1300: loss 4.0271
Step 1400: loss 3.7375
Step 1500: loss 3.8185
Step 1600: loss 3.3993
Step 1700: loss 3.5294
Step 1800: loss 3.5661
Step 1900: loss 3.7365
Step 2000: loss 3.6595
Step 2100: loss 3.7975
Step 2200: loss 2.7651
Step 2300: loss 3.4224
Step 2400: loss 3.4511
Step 2500: loss 4.0015
Step 2600: loss 2.6763
Step 2700: loss 3.4103
Step 2800: loss 3.3179
Step 2900: loss 3.4624
Step 3000: loss 3.4222
Step 3100: loss 3.5236
Step 3200: loss 3.3356
Step 3300: loss 3.7960
Step 3400: loss 3.3644
Step 3500: loss 3.5388
Step 3600: loss 3.5517
Step 3700: loss 3.5464
Step 3800: loss 3.3694
Step 3900: loss 3.4799
Step 4000: loss 3.4452
Step 4100: loss 3.3629
Step 4200: loss 3.3403
S

### Step 7: Text Generation

We have trained a GPT model.

To use it, we'll create a function called "generate" that retreives tokens from the model up to a maximum.

In [73]:
def generate(model, idx, max_tokens=100):
    # model.eval() eliminates dropouts to stabilize responses.
    model.eval()
    for _ in range(max_tokens):  # from 0 to 100 (max_tokens), do these steps:
        idx_cond = idx[:, -block_size:]
        logits, _ = model(idx_cond)  # return the logits from the model.
        probs = torch.softmax(logits[:, -1, :], dim=-1)  # apply softmax output activation function to the logits.
        next_token = torch.multinomial(probs, num_samples=1)  # select the next token (word) randomly.
        idx = torch.cat([idx, next_token], dim=1)  # combine the tokens into one and save them into idx.
    return idx

In [76]:
# Example prompt
prompt = "Q: Can a whale swim?\nA:"

In [77]:
# Feed your prompt into your GPT model and print the result.
encoded = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long).to(device)
output = generate(model, encoded, max_tokens=50)
decoded = tokenizer.decode(output[0].tolist())
cleaned = decoded.replace("<|startoftext|>", "").replace("<|endoftext|>", "").replace("<|>", "").strip()
print("GPT Response:\n")
print(cleaned)


GPT Response:

Q: Can a whale swim?
A:?
A: God host it hid?

Q: According to stay by his relocated to a Muslims from?







Q: Which medium?
<|


### Step 8: Build a text interface

In [78]:
# Make a friendly greeting:
GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up","hey",)
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]
def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

In [79]:
# Send an input (prompt) to the GPT model and get a response.
def response(prompt):
    robo_response="\n"
    encoded = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long).to(device)
    output = generate(model, encoded, max_tokens=500)
    decoded = tokenizer.decode(output[0].tolist())
    cleaned = decoded.replace("<|startoftext|>", "").replace("<|endoftext|>", "").replace("<|>", "").strip()
    gpt_response = robo_response + cleaned
    return gpt_response
    

In [None]:
import re

#  Increasing both the training iterations and block size improved the answers slightly but did not make the model much more useful. 
#  I belive using a larger model or a more targeted training set for a specific use case would make this chatbot functional.

flag=True
print("Hi. I will answer your queries. You may not like what I have to say. To exit, type 'Bye'")
while(flag==True):
    user_response = input()
    user_response=user_response.lower()
    numeric_pattern = r'\d'
    if(user_response!='bye'):
        if(user_response=='thanks' or user_response=='thank you' ):
            flag=False
            print("GPT Response: You are welcome..")
        elif(user_response=='cool' or user_response=='got it'):
            flag = False
            print("GPT Respone: The Dude abides.")
        elif(bool(re.search(numeric_pattern, user_response))):
            flag = False
            print("I'm not good at math; use a calculator or count on your fingers.")
        else:
            if(greeting(user_response)!=None):
                print("GPT Response: "+greeting(user_response))
            else:
                print("GPT Response: ",end="")
                print(response(user_response))
    else:
        flag=False
        print("ROBO: Bye! ")

Hi. I will answer your queries. You may not like what I have to say. To exit, type 'Bye'
GPT Response: 
q: can fish swim?
A: 1565%


A: What type of brita become the first plan?

A: due up what did Jason Underophits
A: What era?
Q: U.

Q: economy's second trade and the United States
A: Eloid as some popular-French?
<|startoftext|startoftext|startoftext|startoftext|>


A: about $55 by a lockformat respond to military vault
A: Active transferring Endat?
<|startoftext|startoftext|startoftext|>
A: In what year did GE" International World Cup provides in the first person's neighbors
<|startoftext|startoftext|startoftext|>
A: sovereignty of General College of the intellect of the Jewish Prize

A: What was he without the first week?
Q: What chain?
A: 2, native school produces Spectre?
A: upper rate that is rocky nucle-two?
A: the US



Q: Do athletic community was finallying

A: When did the name of scientific category did Spielberg regain "gayers-D?
Q: Where did Japanese students dissolved i

# This week's Lab

We have successfully built our first chatbot. Your challenge is to now change this chatbot. 

The lab instructions can be found in a separate Word document in Blackboard.