520 Group Project

**Building the Model**

Preprocessing the UDC Dataset

Tokenize the data.

Format so that each dialogue instance (including multiple turns) is a single training example.

In [1]:
import pandas as pd

# Load the UDC dataset
# Assume data columns: 'context', 'response'
data = pd.read_csv('dialogueText_196_TEST_UTF-8_Short.csv')

# Sample Preprocessing (detailed preprocessing would be more involved)
def preprocess(context, response):
    return f"{context} [SEP] {response}"

data['processed_text'] = data.apply(lambda row: preprocess(row['context'], row['response']), axis=1)

print(data.head())

   folder dialogueID                      date  context response  \
0     301      1.tsv  2004-11-23T11:49:00.000Z  stuNNed      NaN   
1     301      1.tsv  2004-11-23T11:49:00.000Z  crimsun  stuNNed   
2     301      1.tsv  2004-11-23T11:49:00.000Z  stuNNed  crimsun   
3     301      1.tsv  2004-11-23T11:49:00.000Z  crimsun  stuNNed   
4     301      1.tsv  2004-11-23T11:50:00.000Z  stuNNed  crimsun   

                                                text         processed_text  
0   any ideas why java plugin takes so long to load?      stuNNed [SEP] nan  
1                                          java 1.4?  crimsun [SEP] stuNNed  
2                                                yes  stuNNed [SEP] crimsun  
3                       java 1.5 loads _much_ faster  crimsun [SEP] stuNNed  
4  noneus: how can i get 1.5 is there a .deb some...  stuNNed [SEP] crimsun  


Implement and Fine-tune GPT-2 Model

 Install and Import Necessary Libraries

In [2]:
!pip install transformers torch

from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config, AdamW
import torch


Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.7/7.7 MB[0m [31m26.2 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.0/302.0 kB[0m [31m34.2 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.8/3.8 MB[0m [31m75.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m74.3 MB/s[0m eta [36m0:00:00[0m
Colle

Initialize GPT-2 and Tokenizer

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Tokenize Data and Prepare DataLoader

In [4]:
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer  # Make sure to install the transformers library

class UDCDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length):
        self.tokenizer = tokenizer
        self.input_ids = []
        self.attn_masks = []
        for text in texts:
            encodings = tokenizer(text, truncation=True, max_length=max_length, padding='max_length', return_tensors='pt')
            self.input_ids.append(encodings['input_ids'])
            self.attn_masks.append(encodings['attention_mask'])

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx].squeeze(), self.attn_masks[idx].squeeze()

# Example usage:

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # for example, use BERT

# Let's assume data['processed_text'] exists
# You might want to load your dataset and preprocess text data as per your requirement before this step.

# Create Dataset and DataLoader
dataset = UDCDataset(data['processed_text'].tolist(), tokenizer, max_length=128)
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

print(data['processed_text'].head)


Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

<bound method NDFrame.head of 0                   stuNNed [SEP] nan
1               crimsun [SEP] stuNNed
2               stuNNed [SEP] crimsun
3               crimsun [SEP] stuNNed
4               stuNNed [SEP] crimsun
                    ...              
5195    ArthurArchnix [SEP] wolfprint
5196              wolfprint [SEP] nan
5197    wolfprint [SEP] ArthurArchnix
5198    wolfprint [SEP] ArthurArchnix
5199           Goshawk_ [SEP] sergiol
Name: processed_text, Length: 5200, dtype: object>


Fine-tuning Loop

In [5]:
from torch.optim import AdamW
from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

optimizer = AdamW(model.parameters(), lr=1e-5)

num_epochs = 3
for epoch in range(num_epochs):
    total_loss = 0
    model.train()
    for batch in tqdm(dataloader):
        input_ids, attn_masks = [b.to(device) for b in batch]
        optimizer.zero_grad()
        outputs = model(input_ids, labels=input_ids, attention_mask=attn_masks)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch: {epoch}, Loss: {total_loss/len(dataloader)}")

100%|██████████| 650/650 [01:08<00:00,  9.47it/s]


Epoch: 0, Loss: 0.6011475999309467


100%|██████████| 650/650 [01:05<00:00,  9.97it/s]


Epoch: 1, Loss: 0.21227125253814919


100%|██████████| 650/650 [01:05<00:00,  9.93it/s]

Epoch: 2, Loss: 0.09784945132640692





**Saving the Model**

In [9]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Assume `model` is your trained GPT-2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Define a path to save your model
save_directory = "/content/UDC_5000_model"

# Save the model and tokenizer
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)


Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.


('/content/UDC_5000_model/tokenizer_config.json',
 '/content/UDC_5000_model/special_tokens_map.json',
 '/content/UDC_5000_model/vocab.json',
 '/content/UDC_5000_model/merges.txt',
 '/content/UDC_5000_model/added_tokens.json')

In [13]:
from google.colab import files
import shutil

# Compress the model directory
shutil.make_archive("/content/UDC_5000_model", 'zip', save_directory)

# Download the zipped model to your local machine
files.download("/content/UDC_5000_model.zip")


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [14]:
from google.colab import drive

# Mount your Google Drive
drive.mount('/content/drive')

# Copy the saved model directory to your Google Drive
!cp -r /content/my_model /content/drive/MyDrive/


Mounted at /content/drive
cp: cannot stat '/content/my_model': No such file or directory


**Model Evaluation**

In [18]:
!pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


Load the Model and Tokenizer

In [16]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("/content/UDC_5000_model")  # specify path if saved locally

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

Define Evaluation Metrics

Perplexity: Measures how well the probability distribution predicted by the model aligns with the actual distribution of the words in the text.

BLEU: Compares n-grams of the model's outputs with the reference outputs and calculates precision.

ROUGE: Used to calculate the overlap of n-grams between the produced text and a reference text.

In [19]:
from nltk.translate.bleu_score import sentence_bleu
from rouge import Rouge

def calculate_bleu(reference, candidate):
    return sentence_bleu([reference.split()], candidate.split(), weights=(0.25, 0.25, 0.25, 0.25))

def calculate_rouge(reference, candidate):
    rouge = Rouge()
    scores = rouge.get_scores(candidate, reference)
    return scores[0]  # returns multiple scores: ['rouge-1', 'rouge-2', 'rouge-l']

Generate Responses

Use the model to generate responses for the evaluation set and compare them to the actual responses.

In [25]:
def generate_response(prompt, max_length=50):
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

    # Generate attention mask
    attention_mask = torch.ones(input_ids.shape, device=device)

    # Generate responses
    output_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_length=max_length,
        pad_token_id=tokenizer.eos_token_id,  # set padding token to EOS token
    )

    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return response

# Example evaluation with a dummy prompt
prompt = "How do I reset my password?"
model_response = generate_response(prompt)

In [26]:
reference_response = "To reset your password, click on the 'Forgot Password' link..."

bleu_score = calculate_bleu(reference_response, model_response)
rouge_score = calculate_rouge(reference_response, model_response)

print(f"BLEU: {bleu_score}\nROUGE: {rouge_score}")


BLEU: 7.437597952034396e-232
ROUGE: {'rouge-1': {'r': 0.1, 'p': 0.06666666666666667, 'f': 0.07999999520000028}, 'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0}, 'rouge-l': {'r': 0.1, 'p': 0.06666666666666667, 'f': 0.07999999520000028}}


Testing Response generation


In [28]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_path = "/content/UDC_5000_model"
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = GPT2LMHeadModel.from_pretrained(model_path)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [29]:
import torch

def generate_response(user_input, max_length=50):
    # Encode the user input
    input_ids = tokenizer.encode(user_input, return_tensors='pt')

    # Generate a response
    with torch.no_grad():
        output = model.generate(input_ids, max_length=max_length, num_beams=5, temperature=1.5)

    # Decode and return the response
    response = tokenizer.decode(output[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
    return response


In [30]:
while True:
    # Get user input
    user_input = input("You: ")

    # Check if the user wants to exit
    if user_input.lower() == 'exit':
        print("Chatbot: Goodbye!")
        break

    # Generate a response
    response = generate_response(user_input)

    # Display the model's response
    print("Chatbot:", response)


You: How do I reset my password?


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Chatbot: 

You can reset your password at any time by going to Settings > Security > Reset Password.

How do I reset my password? You can reset your password at any time by going to Settings > Security
You: Why won't my video card work?


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Chatbot: 

If your video card doesn't work, you'll need to replace it with a new one.

If your video card doesn't work, you'll need to replace it with a new one.
You: What kind would you suggest?


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Chatbot: 

I think it's a good question. I think it's a good question. I think it's a good question. I think it's a good question. I think it's a good question. I think


KeyboardInterrupt: ignored