# Final Team Project: Advanced Generative Chatbot Design


### Install and Import Libraries

In [1]:
# Install additional libraries
!pip install transformers torch rouge



In [2]:
# Load libraries
import pandas as pd
import torch
from nltk.translate.bleu_score import sentence_bleu
from rouge import Rouge
from torch.utils.data import DataLoader, Dataset, TensorDataset
from tqdm import tqdm
from transformers import AdamW, AutoTokenizer, get_linear_schedule_with_warmup, GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, GPT2Config

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Load and Preprocess the Dataset

In [3]:
# Load the UDC dataset
# Took the first 1MIL rows from the csv
df = pd.read_csv('/content/drive/MyDrive/dialogueText_196_sample.csv', engine='python')

# Print a sample of the dataset
df.head()

Unnamed: 0,folder,dialogueID,date,from,to,text
0,301,1.tsv,2004-11-23T11:49:00.000Z,stuNNed,,any ideas why java plugin takes so long to load?
1,301,1.tsv,2004-11-23T11:49:00.000Z,crimsun,stuNNed,java 1.4?
2,301,1.tsv,2004-11-23T11:49:00.000Z,stuNNed,crimsun,yes
3,301,1.tsv,2004-11-23T11:49:00.000Z,crimsun,stuNNed,java 1.5 loads _much_ faster
4,301,1.tsv,2004-11-23T11:50:00.000Z,stuNNed,crimsun,noneus: how can i get 1.5 is there a .deb some...


In [4]:
# Check the column types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999999 entries, 0 to 999998
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   folder      999999 non-null  int64 
 1   dialogueID  999999 non-null  object
 2   date        999999 non-null  object
 3   from        999986 non-null  object
 4   to          797744 non-null  object
 5   text        999948 non-null  object
dtypes: int64(1), object(5)
memory usage: 45.8+ MB


In [5]:
# Convert the 'date' column to datetime format
df['date'] = pd.to_datetime(df['date'])

# Convert the 'text' column to text format
df['text'] = df['text'].astype(str)

# Extract year, month, day, and hour from 'date' column and create a new column 'hourly'
df['hourly'] = df['date'].dt.strftime('%Y-%m-%d %H')

# Filter groups with at least 3 rows based on the 'hourly' column
df_filtered = df.groupby('hourly').filter(lambda x: len(x) >= 3)

# Sort the data by 'hourly' and 'date'
df_filtered = df_filtered.sort_values(by=['hourly', 'date'])

# Group by 'hourly' and concatenate the text
df_combined = df_filtered.groupby('hourly').agg({
    'date': 'first',
    'text': ' '.join
}).reset_index()

# Adjust display options so full text displays
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', None)

# Print out sample rows
df_combined.head(10)


Unnamed: 0,hourly,date,text
0,2004-08-18 16,2004-08-18 16:25:00+00:00,"hum, just mail me if you want I'll follow the mail on the internal list yes"
1,2004-08-31 15,2004-08-31 15:39:00+00:00,in fact jdub said him that today was a good day to come finally looked at the ITX box... had the driver 'apm' scribbled in the X config... changing that to 'via' and kicking gdm helped a lot
2,2004-09-04 08,2004-09-04 08:10:00+00:00,"right click on the desktop is not working for me nautilus manages the desktop ? the icons are displayed ? opening a dir on the desktop works ? nothing setuid /bin/mount is going to be disabled in favour of pmount, which doesn't require entries in fstab done"
3,2004-09-04 14,2004-09-04 14:21:00+00:00,"speaking of hate mail, we need to re-enable the hostname question in d-i under all circumstances, I take it? I would say so done I will accept responsibility for the additional question :-)"
4,2004-09-05 05,2004-09-05 05:01:00+00:00,"if you need to start now, then do it. it's more important to get sounder 7 out than to have usplash that's a known bug in thom's acpid upload please do file a bug to remind him if you have time to do it, reverting ubuntu4 is fine with me"
5,2004-09-05 13,2004-09-05 13:07:00+00:00,"actually, if we're going the archive-copier route we absolutely want this approach archive-copier would handle base as well? does this include all of the shipseed additions we made in Oxford? that's all entirely automatic now, so yes barking?"
6,2004-09-05 14,2004-09-05 14:15:00+00:00,"archive-copier seems to do the right thing for me, as far as copying apt-cdrom add, too by the time I logged in on vt2 to look, /var/cache/apt/archives was empty oh, damn, I forgot to mention that; you also need to boot with KEEP_DEBS=yes apparently this happens when xresprobe can't yeah, I'm prodding it now. but it was detecting my LCD OK in Oxford this is the d-i-with-no-network case"
7,2004-09-06 05,2004-09-06 05:16:00+00:00,"Applications->System Tools->Terminal? you around? the entire menu? nice :D that's a sub-menu of Applications dude :) heh, if I didn't have enough with gtk and nautilus for 'big uploads I need to do', I just did abiword to complete the set :) yeah, still :) yes, I've seen that but I've not seen the nautilus one do you need some help for the uploads ? k I guess I can manage, but if I haven't done them by tomorrow at 17:00 I guess you could do one of gtk or nautilus ok, just let me know please test this for me: well specify a driver please which version? is more awake than I am"
8,2004-09-06 08,2004-09-06 08:16:00+00:00,"oh, forgot to mention...I did another round of sounder 7 testing using archive-copier and KEEP_DEBS=yes; seemed to work perfectly kewl that iftab thing in netcfg could go upstream, couldn't it? I assume so"
9,2004-09-06 11,2004-09-06 11:54:00+00:00,"yes why is it named 'fortune-mod' anyway? I've often wondered that come to look, fortunes-min is actually pretty reasonable"


In [6]:
# Check the number of entries
df_combined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37057 entries, 0 to 37056
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype              
---  ------  --------------  -----              
 0   hourly  37057 non-null  object             
 1   date    37057 non-null  datetime64[ns, UTC]
 2   text    37057 non-null  object             
dtypes: datetime64[ns, UTC](1), object(2)
memory usage: 868.6+ KB


In [7]:
# Subset the data for analysis
df_subset = df_combined.iloc[:1000]
df_subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype              
---  ------  --------------  -----              
 0   hourly  1000 non-null   object             
 1   date    1000 non-null   datetime64[ns, UTC]
 2   text    1000 non-null   object             
dtypes: datetime64[ns, UTC](1), object(2)
memory usage: 23.6+ KB


### Build the Model

In [8]:
# Load the tokenizer and pre-trained model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.to('cuda')

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [9]:
# Set the padding token
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the text and convert to input features
input_features = tokenizer(df_subset['text'].tolist(), padding='max_length', truncation=True, max_length=512, return_tensors='pt')
input_ids = input_features["input_ids"]
attention_mask = input_features["attention_mask"]

# Create a tensor dataset and DataLoader
dataset = TensorDataset(input_features['input_ids'], input_features['attention_mask'])
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

In [10]:
# Define the optimizer and the learning rate scheduler
optimizer = AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=len(dataloader) * 3)

# Create a training loop and loss calculation
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in dataloader:
        inputs, masks = batch
        inputs = inputs.to('cuda')
        masks = masks.to('cuda')

        # Forward pass
        outputs = model(inputs, attention_mask=masks, labels=inputs)
        loss = outputs.loss

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

    avg_train_loss = total_loss / len(dataloader)
    print(f"Epoch: {epoch+1}, Train Loss: {avg_train_loss}")



Epoch: 1, Train Loss: 0.0
Epoch: 2, Train Loss: 0.0
Epoch: 3, Train Loss: 0.0


### Model Evaluation

Define Evaluation Metrics

Perplexity: Measures how well the probability distribution predicted by the model aligns with the actual distribution of the words in the text.

BLEU: Compares n-grams of the model's outputs with the reference outputs and calculates precision.

ROUGE: Used to calculate the overlap of n-grams between the produced text and a reference text.

Generate Responses

Use the model to generate responses for the evaluation set and compare them to the actual responses.

Testing Response generation


In [11]:
def generate_response(user_input, max_length=50):
    # Ensure the model is in evaluation mode and on the desired device
    model.eval()
    model.to('cuda')

    # Encode the user input and send it to the same device as the model
    input_ids = tokenizer.encode(user_input, return_tensors='pt').to('cuda')

    # Generate attention mask
    attention_mask = torch.ones(input_ids.shape).to('cuda')

    # Generate a response
    with torch.no_grad():
        output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_length=max_length,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode and return the response
    response = tokenizer.decode(output[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
    return response

In [12]:
while True:
    # Get user input
    user_input = input("You: ")

    # Check if the user wants to exit
    if user_input.lower() == 'exit':
        print("Chatbot: Goodbye!")
        break

    # Generate a response
    response = generate_response(user_input)

    # Display the model's response
    print("Chatbot:", response)


You: How do i reset my password?
Chatbot:                                            
You: hello
Chatbot: , I'm not sure if you can use the same command for all the other packages, but I'm sure you can use the same command for all the other packages. I'm not sure if you can use the same command for all the other
You: reset password
Chatbot:  for the user.
You: help with ubuntu
Chatbot: ? I'm not sure if it's a bug or not, but I'm not sure if it's a bug or not, but I'm not sure if it's a bug or not, but I'm not sure if it
You: exit
Chatbot: Goodbye!


In [13]:
def calculate_bleu(reference, candidate):
    return sentence_bleu([reference.split()], candidate.split(), weights=(0.25, 0.25, 0.25, 0.25))

def calculate_rouge(reference, candidate):
    rouge = Rouge()
    scores = rouge.get_scores(candidate, reference)
    return scores[0]  # returns multiple scores: ['rouge-1', 'rouge-2', 'rouge-l']

In [16]:
# Example evaluation with a dummy prompt
prompt = "How do I reset my password?"
model_response = generate_response(prompt)
print(model_response)
reference_response = "To reset your password, click on the 'Forgot Password' link..."

bleu_score = calculate_bleu(reference_response, model_response)
rouge_score = calculate_rouge(reference_response, model_response)

print(f"BLEU: {bleu_score}\nROUGE: {rouge_score}")




sudo apt-get update sudo apt-get install reset sudo apt-get install reset sudo apt-get install reset sudo apt-get install reset sudo apt-get install reset sudo apt-get install reset
BLEU: 7.992219124248642e-232
ROUGE: {'rouge-1': {'r': 0.1, 'p': 0.2, 'f': 0.13333332888888905}, 'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0}, 'rouge-l': {'r': 0.1, 'p': 0.2, 'f': 0.13333332888888905}}


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
