---

# Final Team Project: Advanced Generative Chatbot Design

## Final Project Team 8 - AAI-520: Natural Language Processing

## 2023-10-23

### AAI520_Team8_Chatbot_GPT2.ipynb

### Data Source:

Kaggle - Ubuntu Dialogue Corpus

https://www.kaggle.com/datasets/rtatman/ubuntu-dialogue-corpus/data

https://huggingface.co/datasets/sedthh/ubuntu_dialogue_qa

### References:

https://pytorch.org/tutorials/beginner/basics/data_tutorial.html

https://huggingface.co/docs/transformers/index

https://www.nltk.org/api/nltk.translate.bleu_score.html#module-nltk.translate.bleu_score

https://pypi.org/project/rouge-score/

---

# Install Required Packages

In [1]:
# pretrained transformer models
!pip install transformers

# accelerate package required for using trainer with pytorch
# (the requirement is quoted so the shell does not treat ">" as a redirect)
!pip uninstall -y accelerate
!pip install "accelerate>=0.20.1"

# rouge score metrics
!pip install rouge-score

Collecting transformers
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Col

# Load Required Libraries

In [2]:
import nltk # nlp package
import numpy as np # array manipulation
import pandas as pd # data analysis
import random # random number generator
import re # regular expressions
import shutil # file operations
import spacy # nlp package
import string # string operations
import torch # deep learning framework
import zipfile # zip archive extraction
from collections import OrderedDict # ordered preprocessing steps
from google.colab import files # support file saving
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction # bleu score metrics
from pandas import option_context # context manager
from rouge_score import rouge_scorer # rouge score metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score # metrics
from torch.utils.data import Dataset # create Torch datasets
from transformers import GPT2LMHeadModel, GPT2Tokenizer # GPT2 model and tokenizer
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback # training loop
from transformers.trainer_utils import get_last_checkpoint # resume training from checkpoint

# Set Random Seeds

In [3]:
# set global random seeds for reproducibility
seed = 1234
random.seed(seed)
np.random.seed(seed)
generator = torch.manual_seed(seed)

# Data Exploration

## Load and Display Dataset

In [4]:
# upload .csv dataset file to Colab session storage
dataset = files.upload()

# define path to the dataset csv file (default Colab upload location)
path_dataset = '/content/dialogueQA.csv'

# read csv file into a dataframe
df = pd.read_csv(path_dataset)

# display dataframe
display(df)

Saving dialogueQA.csv to dialogueQA.csv


Unnamed: 0,INSTRUCTION,RESPONSE,SOURCE,METADATA
0,"hi, is there a CLI command to roll back any up...",your recourse is to re-install fresh the older...,ubuntu-dialogue,"{""user_question"": ""edd"", ""user_answer"": ""n8tus..."
1,A LiveCD iso can be burned to a DVD-R and run ...,"I hope so, or the custom DVDs I've done are wo...",ubuntu-dialogue,"{""user_question"": ""usrl"", ""user_answer"": ""Ghos..."
2,"hello, is there a way to adjust gamma settings...",for me i have my nvidia settings manager and i...,ubuntu-dialogue,"{""user_question"": ""nucco_"", ""user_answer"": ""sp..."
3,does ubuntu come with a firewall by default?,no iptables rule is loaded by deault on ubuntu,ubuntu-dialogue,"{""user_question"": ""aeleon"", ""user_answer"": ""er..."
4,Can someone tell me howto get rid of Google Ch...,sudo dpkg -l |grep -i chrom ----> sudo apt-get...,ubuntu-dialogue,"{""user_question"": ""frold"", ""user_answer"": ""shi..."
...,...,...,...,...
16168,is there any GUI irc client besides pidgin ?,xchat,ubuntu-dialogue,"{""user_question"": ""jameela"", ""user_answer"": ""p..."
16169,"Hello , if I have a log file and i like to see...",you can try watch 'tail /path/to/logfile',ubuntu-dialogue,"{""user_question"": ""Zedde"", ""user_answer"": ""ada..."
16170,guys im trying to install itask but when i try...,sudo aptitude install automake autoconf build-...,ubuntu-dialogue,"{""user_question"": ""silvernode"", ""user_answer"":..."
16171,is there anyway to recurse with sftp in it's n...,"I believe not, but try lftp instead (it suppor...",ubuntu-dialogue,"{""user_question"": ""psion"", ""user_answer"": ""Sev..."


## Adjust Columns

In [5]:
# drop unnecessary columns and rename/clean headers
df.drop(columns=['SOURCE', 'METADATA'], axis=1, inplace=True)
df.rename(mapper={'INSTRUCTION': 'question', 'RESPONSE': 'response'}, axis=1, inplace=True)

# display dataframe
display(df)

Unnamed: 0,question,response
0,"hi, is there a CLI command to roll back any up...",your recourse is to re-install fresh the older...
1,A LiveCD iso can be burned to a DVD-R and run ...,"I hope so, or the custom DVDs I've done are wo..."
2,"hello, is there a way to adjust gamma settings...",for me i have my nvidia settings manager and i...
3,does ubuntu come with a firewall by default?,no iptables rule is loaded by deault on ubuntu
4,Can someone tell me howto get rid of Google Ch...,sudo dpkg -l |grep -i chrom ----> sudo apt-get...
...,...,...
16168,is there any GUI irc client besides pidgin ?,xchat
16169,"Hello , if I have a log file and i like to see...",you can try watch 'tail /path/to/logfile'
16170,guys im trying to install itask but when i try...,sudo aptitude install automake autoconf build-...
16171,is there anyway to recurse with sftp in it's n...,"I believe not, but try lftp instead (it suppor..."


## Check for Missing Values

In [6]:
# print sum of missing values for each column
print('Dataframe Missing Values:')
print('-------------------------------')
print(df.isna().sum())

Dataframe Missing Values:
-------------------------------
question    0
response    0
dtype: int64


## Display Samples of Text

In [7]:
# define function to print first 20 samples
# using option_context to extend width of columns
def fcn_display_df_samples(df):
    print('First 20 samples for df:')
    print('------------------------')
    with option_context('display.max_colwidth', 150):
        display(df[['question', 'response']].head(20))

# call function to display first 20 samples
fcn_display_df_samples(df)

First 20 samples for df:
------------------------


Unnamed: 0,question,response
0,"hi, is there a CLI command to roll back any updates/upgrades I made recently?",your recourse is to re-install fresh the older version
1,"A LiveCD iso can be burned to a DVD-R and run with no problems, right?","I hope so, or the custom DVDs I've done are worthless. ;)"
2,"hello, is there a way to adjust gamma settings in totem? my videos aren't playing with the correct colours",for me i have my nvidia settings manager and i change the video gamma settings from there...
3,does ubuntu come with a firewall by default?,no iptables rule is loaded by deault on ubuntu
4,Can someone tell me howto get rid of Google Chrome? Im not able to uninstall it...,sudo dpkg -l |grep -i chrom ----> sudo apt-get remove 'on what appears'
5,wow. for the life of me i can never remember this command. whats the command that outputs your ati hardare information? shows if you have direct r...,glxinfo | grep dri ?
6,ack! what the heck kind of Linux distro doesn't install traceroute by default?,ubuntu
7,is there a way to see if a hard disk has bad blocks on ubuntu? fsck does the job?,"have you considered, however, monitoring your HD's state using the SMART sensors? (the 'smartmontools' package can be used to query them)"
8,anyone know how to turn off opening things with a single click...its driving me crazy and I want to go back to doubleclicking,"open a file browser, go to edit|preferences | behavious, and change double to single"
9,is there a graphical way to search for an nfs server on gutsy?,does Places > Network work for you?


# Text Preprocessing

## Clean Text Function

In [8]:
# define function to clean text
def clean_text(text):
  # create an ordered dictionary of patterns/replacement values
  # to support processing in defined order
  patterns = OrderedDict([
    ("ain't", "are not"),
    ("aren't", "are not"),
    ("can't", "cannot"),
    ("could've", "could have"),
    ("couldn't", "could not"),
    ("didn't", "did not"),
    ("doesn't", "does not"),
    ("don't", "do not"),
    ("hadn't", "had not"),
    ("hasn't", "has not"),
    ("haven't", "have not"),
    ("he'd", "he would"),
    ("he'll", "he will"),
    ("he's", "he is"),
    ("i'd", "i would"),
    ("i'll", "i will"),
    ("i'm", "i am"),
    ("i've", "i have"),
    ("isn't", "is not"),
    ("it's", "it is"),
    ("let's", "let us"),
    ("mustn't", "must not"),
    ("shan't", "shall not"),
    ("she'd", "she would"),
    ("she'll", "she will"),
    ("she's", "she is"),
    ("shouldn't", "should not"),
    ("that's", "that is"),
    ("there's", "there is"),
    ("they'd", "they would"),
    ("they'll", "they will"),
    ("they're", "they are"),
    ("they've", "they have"),
    ("we'd", "we would"),
    ("we'll", "we will"),
    ("we're", "we are"),
    ("we've", "we have"),
    ("weren't", "were not"),
    ("what's", "what is"),
    ("when's", "when is"),
    ("where's", "where is"),
    ("who's", "who is"),
    ("won't", "will not"),
    ("wouldn't", "would not"),
    ("you'd", "you would"),
    ("you'll", "you will"),
    ("you're", "you are"),
    ("you've", "you have"),
    ("plz", "please"),
    ("teh", "the"),
    ("becuase", "because"),
    ("alot", "a lot"),
    ("definately", "definitely"),
    ("hd's", "hard drives"),
    ("colours", "colors"),
    ("re-install", "reinstall"),
    ("howto", "how to"),
    (":(", ""),
    (":)", ""),
    (";)", "")
  ])

  # standardize text by making all lowercase
  text = text.lower()

  # iterate through ordered dictionary, substitute defined
  # replacement text where patterns match (keys are treated as literals)
  for key, value in patterns.items():
    pattern = re.escape(key)
    text = re.sub(pattern, value, text)

  # expand a standalone "u" to "you"; this needs a real regex word boundary,
  # so it is applied outside the escaped literal table
  text = re.sub(r'\bu\b', 'you', text)

  # collapse runs of the same punctuation mark into a single mark
  punctuation = re.escape('!"#$%&\'()*,:;<=>?@[\\]^_`{|}~')
  pattern = f"([{punctuation}])\\1+"
  replacement = "\\1"
  text = re.sub(pattern, replacement, text)

  # eliminate repeating punctuation marks if more than 2 in a row
  pattern = r'([/\\\-\.])\1+'
  replacement = r'\1\1'
  text = re.sub(pattern, replacement, text)

  # eliminate html tags
  pattern = r'<.*?>'
  replacement = ''
  text = re.sub(pattern, replacement, text)

  return text
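
The replacement table in `clean_text` is applied as literal substrings, so a short key such as `alot` would also fire inside a longer word such as `zealot`. The following is a minimal sketch of whole-word replacement, assuming a small hypothetical subset of the table; `REPLACEMENTS` and `replace_whole_words` are illustrative names and are not part of the notebook.

```python
import re

# hypothetical subset of the replacement table, for illustration only
REPLACEMENTS = {"plz": "please", "u": "you", "alot": "a lot", "teh": "the"}

# build one alternation (longest keys first) wrapped in \b word boundaries
_keys = sorted(REPLACEMENTS, key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b")

def replace_whole_words(text):
    """Lowercase the text and replace table entries only where they are whole words."""
    return _pattern.sub(lambda m: REPLACEMENTS[m.group(1)], text.lower())

print(replace_whole_words("plz can u help the zealot"))
```

Keys made only of punctuation, such as the emoticons, contain no word characters, so they would still need the literal substring handling rather than `\b`.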

## Get Dialogue Function

In [9]:
# define function to get dialogue text
# calls clean_text function to clean text
# returns list of question/answer pairs with special tokens
def fcn_get_dialogue():
  dialogue = []
  for i in range(len(df)):
    question = str(df.loc[i, 'question']).strip()
    question = clean_text(question)
    response = str(df.loc[i, 'response']).strip()
    response = clean_text(response)
    dialogue.append(' '.join(['[BOS]', question, '[BOT]', response, '[EOS]']))
  return dialogue

# Build Model and Tokenizer

In [10]:
# load the pre-trained GPT2 model
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)

# set model to use GPUs if available in runtime session
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# freeze the base transformer parameters (GPT2LMHeadModel has a
# language-modeling head, not a sequence-classification layer)
for param in model.base_model.parameters():
    param.requires_grad = False

# define special tokens
special_tokens = {
    'bos_token': '[BOS]',
    'eos_token': '[EOS]',
    'sep_token': '[SEP]',
    'pad_token': '[PAD]',
    'cls_token': '[CLS]',
    'mask_token': '[MASK]',
    'additional_special_tokens': ['[BOT]']
}

# add special tokens and resize model's token embeddings to accommodate
num_new_tokens = tokenizer.add_special_tokens(special_tokens)
embeddings = model.resize_token_embeddings(len(tokenizer))

# print model architecture
print(model)

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50264, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50264, bias=False)
)


## Review Special Tokens

In [11]:
# print special tokens part of tokenizer
print(tokenizer.all_special_tokens)

['[BOS]', '[EOS]', '<|endoftext|>', '[SEP]', '[PAD]', '[CLS]', '[MASK]', '[BOT]']


# Prepare Datasets

## Get Dialogue Text

In [12]:
# call function to get dialogue text
dialogue = fcn_get_dialogue()

# display samples of dialogue to see effects of preprocessing
display(dialogue[0:15])

['[BOS] hi, is there a cli command to roll back any updates/upgrades i made recently? [BOT] your recourse is to reinstall fresh the older version [EOS]',
 '[BOS] a livecd iso can be burned to a dvd-r and run with no problems, right? [BOT] i hope so, or the custom dvds i have done are worthless.  [EOS]',
 '[BOS] hello, is there a way to adjust gamma settings in totem? my videos are not playing with the correct colors [BOT] for me i have my nvidia settings manager and i change the video gamma settings from there.. [EOS]',
 '[BOS] does ubuntu come with a firewall by default? [BOT] no iptables rule is loaded by deault on ubuntu [EOS]',
 "[BOS] can someone tell me how to get rid of google chrome? im not able to uninstall it.. [BOT] sudo dpkg -l |grep -i chrom --> sudo apt-get remove 'on what appears' [EOS]",
 '[BOS] wow. for the life of me i can never remember this command. whats the command that outputs your ati hardare information? shows if you have direct rendering? [BOT] glxinfo | grep 

## Build Torch Datasets Class

In [13]:
# define dataset class that tokenizes dialogue text for pytorch training
class TorchDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length):
        self.tokenizer = tokenizer
        self.input_ids = []
        self.attention_mask = []
        self.labels = []
        for text in texts:
            input_text = text.strip()
            input_encodings = tokenizer(input_text, truncation=True, max_length=max_length, padding='max_length', return_tensors='pt', add_special_tokens=True).to(device)
            self.input_ids.append(input_encodings['input_ids'])
            self.attention_mask.append(input_encodings['attention_mask'])
            self.labels.append(input_encodings['input_ids'])

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        out = {
                'input_ids': self.input_ids[idx].squeeze(),
                'attention_mask': self.attention_mask[idx].squeeze(),
                'labels': self.labels[idx].squeeze()
        }
        return out
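
`TorchDataset` copies `input_ids` into `labels` unchanged, so padded positions also count toward the loss. A common refinement, sketched here on plain Python lists, is to set the labels at padded positions to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The helper name `mask_padding_labels` and the token ids below are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding_labels(input_ids, attention_mask):
    """Return labels equal to input_ids, with padded positions set to IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

# made-up ids: three real tokens followed by two padding tokens
input_ids = [50257, 31373, 50258, 50260, 50260]
attention_mask = [1, 1, 1, 0, 0]

print(mask_padding_labels(input_ids, attention_mask))
```

With tensors, the same effect is usually obtained by cloning `input_ids` and assigning `-100` wherever `attention_mask` is 0.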

## Build and Split Datasets

In [14]:
# first split dataset into 90% training and 10% testing, then
# split resulting training set into 80% training and 20% validation
train_and_val_size = round(len(dialogue) * 0.90)
train_size = round(train_and_val_size * 0.80)
val_size = train_and_val_size - train_size
test_size = len(dialogue) - train_and_val_size

# max length for padding
max_length = 128

# call function to build torch datasets required for training
train_dataset = TorchDataset(dialogue[:train_size], tokenizer, max_length=max_length)
val_dataset = TorchDataset(dialogue[train_size:(train_size + val_size)], tokenizer, max_length=max_length)
test_dataset = TorchDataset(dialogue[(train_and_val_size):(train_and_val_size + test_size)], tokenizer, max_length=max_length)

# print number of samples in each dataset
print('Dataset Number of Samples:')
print('--------------------------')
print('Train:', len(train_dataset), 'samples')
print('Val:', len(val_dataset), 'samples')
print('Test:', len(test_dataset), 'samples')

Dataset Number of Samples:
--------------------------
Train: 11645 samples
Val: 2911 samples
Test: 1617 samples
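
The printed sizes follow directly from the two-stage rounding in the split code. Here is a quick arithmetic check, assuming a corpus of 16,173 dialogues (the sum of the three printed counts; `n_samples` is an illustrative name).

```python
n_samples = 11645 + 2911 + 1617  # total inferred from the printed counts

# same two-stage split as in the notebook cell
train_and_val_size = round(n_samples * 0.90)
train_size = round(train_and_val_size * 0.80)
val_size = train_and_val_size - train_size
test_size = n_samples - train_and_val_size

print(train_size, val_size, test_size)
print(round(train_size / n_samples, 2), round(val_size / n_samples, 2), round(test_size / n_samples, 2))
```

Overall this gives roughly 72% training, 18% validation, and 10% test data. The slices are taken in order, so the split is sequential rather than shuffled.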


# Train Model

## Custom Trainer Class

In [15]:
# define custom trainer for computing loss function
class CustomTrainer(Trainer):
  def compute_loss(self, model, inputs, return_outputs=False):

      # forward pass
      outputs = model(**inputs)

      # obtain loss
      loss = outputs.loss

      return (loss, outputs) if return_outputs else loss

### Preprocess Logits for Metrics

In [16]:
# override default function to avoid memory issue during evaluation
# this keeps only the argmax token ids (not the full logits) for metrics calculations
def preprocess_logits_for_metrics(logits, labels):
  pred_ids = torch.argmax(logits, dim=-1)
  return pred_ids, labels

### Compute Metrics Function

In [17]:
# define function to compute and return metric scores
def compute_metrics(preds):

  # the Trainer passes an EvalPrediction object; because
  # preprocess_logits_for_metrics returns (pred_ids, labels),
  # preds.predictions holds that pair for the whole evaluation set
  pred_ids, labels = preds.predictions

  # flatten to 1D arrays for metric calculations
  all_predictions = pred_ids.flatten()
  all_labels = labels.flatten()

  # calculate accuracy, precision, recall, and f1-score
  accuracy = accuracy_score(all_labels, all_predictions)
  precision = precision_score(all_labels, all_predictions, average='micro', zero_division=0)
  recall = recall_score(all_labels, all_predictions, average='micro', zero_division=0)
  f1 = f1_score(all_labels, all_predictions, average='micro')

  # return all metrics
  return {
      'accuracy': accuracy,
      'precision': precision,
      'recall': recall,
      'f1_score': f1
  }
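
Two properties of these metrics are worth keeping in mind. For single-label multiclass predictions, micro-averaged precision, recall, and F1 all reduce to accuracy, so the four values are expected to be identical. In addition, a causal language model's prediction at position i targets the token at position i + 1, and padded positions can dominate the count. The following is a minimal sketch of a shifted, padding-aware token accuracy on plain lists; the function name, `pad_id`, and the sample ids are hypothetical.

```python
def shifted_token_accuracy(pred_ids, label_ids, pad_id):
    """Accuracy of next-token predictions, skipping padded targets."""
    correct = 0
    total = 0
    # the prediction at position i is compared with the label at position i + 1
    for pred, target in zip(pred_ids[:-1], label_ids[1:]):
        if target == pad_id:
            continue
        total += 1
        correct += int(pred == target)
    return correct / total if total else 0.0

labels = [10, 11, 12, 13, 0, 0]  # made-up ids, 0 stands in for padding
preds = [11, 12, 99, 0, 0, 0]

print(shifted_token_accuracy(preds, labels, pad_id=0))
```

In this toy case two of the three non-padded targets are predicted correctly, whereas an unshifted comparison that counts padding would score the same sequences very differently.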

### Training Arguments

In [21]:
# define training arguments for use with trainer
training_args = TrainingArguments(
    output_dir='UDC_Chatbot',       # output directory
    overwrite_output_dir=True,      # overwrite output content
    num_train_epochs=60,            # number of training epochs
    learning_rate=0.1,              # learning rate (default: 5e-05)
    per_device_train_batch_size=16, # batch size per device during training
    per_device_eval_batch_size=16,  # batch size for evaluation
    warmup_steps=0,                 # warmup steps
    weight_decay=0.01,              # weight decay
    logging_dir='./logs',           # directory for storing logs
    save_strategy='epoch',          # save a checkpoint at the end of each epoch
    evaluation_strategy='epoch',    # evaluate at the end of each epoch
    load_best_model_at_end=True     # load best performing model
)

### Construct Trainer

In [22]:
# define a callback for early stopping
cb_earlystopping = EarlyStoppingCallback(
  early_stopping_patience=3,
  early_stopping_threshold=0.01
)

# construct trainer, specifying previously defined model, arguments, datasets, and metrics function
trainer = CustomTrainer(
    model=model,                                                  # specify model for training
    args=training_args,                                           # use previously defined training arguments
    train_dataset=train_dataset,                                  # training dataset
    eval_dataset=val_dataset,                                     # evaluation dataset
    tokenizer=tokenizer,                                          # specify tokenizer
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,  # preprocess logits to avoid memory issues
    compute_metrics=compute_metrics,                              # function to compute metrics
    callbacks=[cb_earlystopping]                                  # early stopping callback
)

## Fine-Tune Model

In [None]:
# free up GPU memory prior to training
torch.cuda.empty_cache()

# train (fine-tune) the model
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1 Score
1,3.7931,3.737337,0.637049,0.637049,0.637049,0.637049
2,3.7547,3.68433,0.639852,0.639852,0.639852,0.639852
3,3.4347,3.52856,0.634865,0.634865,0.634865,0.634865
4,3.3769,3.420374,0.63355,0.63355,0.63355,0.63355
5,3.1702,3.193037,0.633599,0.633599,0.633599,0.633599
6,3.1411,3.149377,0.632096,0.632096,0.632096,0.632096
7,3.0143,3.069665,0.632512,0.632512,0.632512,0.632512
8,2.9035,2.991072,0.633156,0.633156,0.633156,0.633156
9,2.8179,2.854615,0.633312,0.633312,0.633312,0.633312
10,2.7515,2.814939,0.632415,0.632415,0.632415,0.632415


TrainOutput(global_step=21840, training_loss=2.329725655356606, metrics={'train_runtime': 8425.0821, 'train_samples_per_second': 41.465, 'train_steps_per_second': 2.592, 'total_flos': 2.28205928448e+16, 'train_loss': 2.329725655356606, 'epoch': 30.0})
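
The reported `global_step` can be cross-checked against the dataset and batch sizes. This is a short calculation, assuming a single device and no gradient accumulation (neither is stated in the notebook).

```python
import math

train_samples = 11645  # from the dataset split output
batch_size = 16        # per_device_train_batch_size
global_step = 21840    # from TrainOutput

# the last, partial batch of each epoch still counts as one step
steps_per_epoch = math.ceil(train_samples / batch_size)
epochs_completed = global_step / steps_per_epoch

print(steps_per_epoch, epochs_completed)
```

The result agrees with `'epoch': 30.0` in `TrainOutput`: training ended after 30 of the 60 configured epochs, which is consistent with the early stopping callback having triggered.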

## Save Fine-Tuned Model

In [34]:
# flag to save, change to True to save model after training
save = False

if save is True:
  # save the model
  model.save_pretrained('chatbot_model')
  tokenizer.save_pretrained('chatbot_model')

  # make and save zip archive of model
  shutil.make_archive("/content/chatbot_model", 'zip', "chatbot_model")
  files.download("/content/chatbot_model.zip")

# Evaluate Results on Test Set

In [35]:
# evaluate the current model on test set after training
prediction_output = trainer.predict(test_dataset=test_dataset)
print('Test Set Metrics:')
print('-----------------')
display(prediction_output.metrics)

Test Set Metrics:
-----------------


{'test_loss': 1.7149666547775269,
 'test_accuracy': 0.6352074320148331,
 'test_precision': 0.6352074320148331,
 'test_recall': 0.6352074320148331,
 'test_f1_score': 0.6352074320148331,
 'test_runtime': 16.0727,
 'test_samples_per_second': 100.605,
 'test_steps_per_second': 6.346}

# Test Model

## Import Fine-Tuned Model

In [4]:
# flag to import model, set to True if using saved model
import_model = True

if import_model is True:

  # define saved archive and path to extract to
  saved_model_archive = '/content/chatbot_model.zip'
  saved_model_extracted = '/content/chatbot_model'

  # extract zip archive (the handle is named archive to avoid shadowing the built-in zip)
  with zipfile.ZipFile(saved_model_archive, 'r') as archive:
      archive.extractall(saved_model_extracted)

  # load previously trained model and tokenizer
  tokenizer = GPT2Tokenizer.from_pretrained(saved_model_extracted)
  model = GPT2LMHeadModel.from_pretrained(saved_model_extracted)

  # set model to use GPUs if available in runtime session
  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  model.to(device)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Text Postprocessing Function

In [17]:
# load english language model
nlp = spacy.load('en_core_web_sm')

# define function to postprocess generated chatbot text
def postprocess_text(text):
    try:
        # construct doc object and create list of sentences
        doc = nlp(text)
        sentences = list(doc.sents)

        # capitalize first letter of each sentence
        # only consider a sentence if it has at least 3 chars
        capitalized_sentences = []
        for sent in sentences:
            if len(sent.text.strip()) >= 3:
                sentence = sent.text.strip()
                if not sentence.endswith('.') and not sentence.endswith('?'):
                    sentence += '.'
                capitalized_sentences.append(sentence.capitalize())

        # if response is more than one sentence, only return first two sentences
        if len(capitalized_sentences) == 1:
            response = capitalized_sentences[0]
        elif len(capitalized_sentences) > 1:
            response = ' '.join(capitalized_sentences[:2])
        else:
            response = "Sorry, I don't understand your question. Can you try asking it in another way?"

        # return response
        return response.strip()

    except Exception:
        return "Sorry, I don't understand your question. Can you try asking it in another way?"


## Chatbot Response Function

In [15]:
# define function to generate chatbot response
def generate_response(user_input):

  # add tokens to user input text
  user_input = (' '.join(['[BOS]', user_input.strip().lower(), '[BOT]']))

  # encode input
  input_ids = tokenizer.encode(user_input, return_tensors='pt', add_special_tokens=True).to(device)

  # generate with beam-search multinomial sampling (num_beams=6, do_sample=True),
  # restricted by top_k and top_p (nucleus) filtering
  sample_outputs = model.generate(
      input_ids,
      do_sample=True,
      max_length=50,
      top_k=30,
      top_p=0.95,
      num_return_sequences=1,
      no_repeat_ngram_size=2,
      early_stopping=True,
      temperature=.7,
      num_beams=6
  )

  # only one sequence is returned (num_return_sequences=1)
  output_tokens = sample_outputs[0].tolist()

  # look up the id of the [BOT] token (50263 for this tokenizer)
  bot_token_id = tokenizer.convert_tokens_to_ids('[BOT]')
  try:
      bot_token_index = output_tokens.index(bot_token_id)
  except ValueError:
      # [BOT] token is not found
      print('Unable to find [BOT] token.')
      return None

  # decode the text after the [BOT] token and postprocess it
  decoded_text = tokenizer.decode(output_tokens[bot_token_index + 1:], skip_special_tokens=True)
  return postprocess_text(decoded_text)

## Enter Chat Function

In [8]:
# define function to enter chat room
def enter_chat():
  # flag to output text upon first entering the chat room
  entering_chat = True

  while True:
      # welcoming text
      if entering_chat:
        print(':'*25)
        print(':::UBUNTU SUPPORT CHAT:::')
        print(':'*25)
        print('Chatbot: Welcome to the Ubuntu Support Chat! How may I assist you today?')
        entering_chat = False

      # get user input
      user_input = input("You: ")

      # check if user wants to exit
      if user_input.lower() == 'exit':
          print('Chatbot: Thank you for contacting us today. I hope we were able to resolve your issue. Goodbye!')
          break

      # generate chatbot response
      response = generate_response(user_input)

      # print chatbot response
      print('Chatbot:', response)

## Bleu Score and Rouge Score

### Functions to Calculate

In [41]:
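# define function to calculate bleu score
# sentence_bleu expects a list of tokenized references and a tokenized
# hypothesis, so both strings are split on whitespace first
def calculate_bleu(references, hypotheses):
  smoothing = SmoothingFunction()
  return sentence_bleu([references.split()], hypotheses.split(), smoothing_function=smoothing.method7) # method 7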

# define function to calculate rouge score
def calculate_rouge(target, prediction):
  scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
  return(scorer.score(target, prediction))

### Calculate for Example

In [72]:
# example evaluation with a dummy user input; note that the input question
# itself serves as the reference text for both metrics below
user_input = "How can I move files between folders on Ubuntu?"

# generate chatbot response
response = generate_response(user_input)

# calculate bleu score
bleu = calculate_bleu(references=user_input, hypotheses=response)

# calculate rouge score
rouge = calculate_rouge(target=user_input, prediction=response)

# print metrics
print('BLEU Score:', f"{bleu:.4f}")
output_rouge = '\nROUGE Scores:\n'
for metric, score in rouge.items():
    output_rouge += f"{metric}: Precision = {score.precision:.4f}, Recall = {score.recall:.4f}, F1 = {score.fmeasure:.4f}\n"
print(output_rouge)

BLEU Score: 0.0815

ROUGE Scores:
rouge1: Precision = 0.0952, Recall = 0.2222, F1 = 0.1333
rougeL: Precision = 0.0476, Recall = 0.1111, F1 = 0.0667
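
The ROUGE-1 figures above are unigram-overlap ratios: a recall of 0.2222 is consistent with 2 of the 9 reference tokens appearing in the response, and a precision of 0.0952 with 2 of 21 response tokens appearing in the reference. The hand-rolled sketch below shows the calculation using plain whitespace tokenization and no stemming, so it only approximates what `rouge_scorer` computes. `rouge1_sketch` and the two example sentences are hypothetical.

```python
from collections import Counter

def rouge1_sketch(target, prediction):
    """Unigram precision, recall, and F1 with clipped counts."""
    target_counts = Counter(target.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # clipped overlap: each token counts at most as often as it occurs in both texts
    overlap = sum((target_counts & pred_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(target_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge1_sketch("move files with the mv command", "use the mv command")
print(f"Precision = {p:.4f}, Recall = {r:.4f}, F1 = {f:.4f}")
```

Three of the four predicted tokens occur in the six-token reference, giving a precision of 0.75 and a recall of 0.5.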



## ::: Interact with the Chatbot :::

In [16]:
# call function to enter the chat room
enter_chat()

:::::::::::::::::::::::::
:::UBUNTU SUPPORT CHAT:::
:::::::::::::::::::::::::
Chatbot: Welcome to the Ubuntu Support Chat! How may I assist you today?
You: How can I move a file between folders?
Chatbot: Right click on the top panel >. Properties>preferences>appearance->prefs-click on properties > preferences > admin -> prefrences -> keyboard shortcuts -> properties.
You: How can I set default applications to open certain types of files?
Chatbot: Sudo apt-get install gedit /etc/apt/sources.list   sudo dpkg -lapt-repository pkg.
You: How can I connect a USB drive? 
Chatbot: Df -hdparmic koala contents of /etc/init.d/samba restart  (which maps names of the'sudo service gksudo nautilus.
You: Is there a keyboard combination to run another program?
Chatbot: Ctrl+alt+left arrow keys to killall firefox, but it is a foreground, and then choose 'firefox' in the background colorcoded in terminal.
You: Where can I find the recycle bin?
Chatbot: Http://www.ubuntulinux.org/wiki/abs/html/how toforge