<a href="https://colab.research.google.com/github/rashmibanthia/2.0.0/blob/master/Lab_3_Dialog_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style="padding-top: 25px;padding-bottom: 25px;text-align: left; padding-left: 10px; background-color: #DDDDDD; 
    color: black;"> <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> <a href='https://www.computefest.seas.harvard.edu/' target='_blank'><strong>IACS: ComputeFest 2021</strong></a></h1>

# **Lab 3 - Dialog Models**

#### **Authors/Instructors:**
Chris Tanner, Shivas Jayaram, Eduardo Peynetti, Rohit Beri

## **Workshop Outline**

Overview Dataset

Dialog task using GPT2 Model

- Language generation without finetuning
- Language generation with finetuning
- Nano Quiz

Dialog task using GPT2 Double Head Model

- What is a GPT2 Double Head Model?
- Dialogs using a model finetuned on a different dataset
- Finetuning an already finetuned model on our dataset
- Nano Quiz

## **Setup Notebook**

#### Copy & setup Colab with GPU

1) Select the "File" menu and pick "Save a copy in Drive".  
2) This notebook is already set up to use a GPU. To change this, go to the "Runtime" menu and select "Change runtime type", then choose "GPU" under "Hardware accelerator" in the popup and click "Save".  
3) If you need more memory, Colab also offers a high-RAM option.

#### Installs

In [None]:
!pip install datasets
!pip install transformers

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/06/9b/d097f2238fc3c028495cf5f8c65378972b9f1b2cbb27f3c57c7219195aa9/datasets-1.2.1-py3-none-any.whl (159kB)
Collecting pyarrow>=0.17.1
  Downloading https://files.pythonhosted.org/packages/d7/e1/27958a70848f8f7089bff8d6ebe42519daf01f976d28b481e1bfd52c8097/pyarrow-2.0.0-cp36-cp36m-manylinux2014_x86_64.whl (17.7MB)
Collecting xxhash
  Downloading https://files.pythonhosted.org/packages/f7/73/826b19f3594756cb1c6c23d2fbd8ca6a77a9cd3b650c9dec5acc85004c38/xxhash-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl (242kB)
Installing collected packages: pyarrow, xxhash, datasets
  Found existing installation: pyarrow 0.14.1
    Uninstalling pyarrow-0.14.1:
      Successfully uninstalled pyarrow-0.14.1
Successfully installed datasets-1

#### Imports

In [None]:
import os
import requests
import zipfile
import tarfile
import json
import time
import sys
import math
import logging
import numpy as np
import pandas as pd
from argparse import ArgumentParser
from subprocess import call
import textwrap

from collections import defaultdict
from multiprocessing import Pool
from tqdm.auto import tqdm, trange
from itertools import chain

import torch
import torch.nn.functional as F
from torch.cuda import amp
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler

from transformers.optimization import AdamW, get_linear_schedule_with_warmup
from transformers import GPT2Config, GPT2LMHeadModel, GPT2DoubleHeadsModel, GPT2Tokenizer

#### Setup Logger

In [None]:
# Setup Logger
if '__file__' not in globals():
  __file__ = "."
logger = logging.getLogger(__file__)

# Logger config
logging.basicConfig(level=logging.INFO)

#### Verify Setup

In [None]:
logger.info('__Python VERSION: %s', sys.version)
logger.info("torch version: %s", torch.__version__)
logger.info('CUDNN VERSION: %s', torch.backends.cudnn.version())
logger.info('Number CUDA Devices: %s', torch.cuda.device_count())
cuda_available = torch.cuda.is_available()
device = torch.device("cuda:0" if cuda_available else "cpu")
device_count = 0

if cuda_available:
  device_count = torch.cuda.device_count()
  logger.info('Devices:')
  logger.info('Active CUDA Device: %s', torch.cuda.current_device())
  logger.info('Available device count: %s', device_count)
  logger.info('Current cuda device: %s', torch.cuda.current_device())
else:
  logger.info('No CUDA Devices are available')

logger.info('Device: %s', device)
  

# nvidia-smi
call(["nvidia-smi", "--format=csv", "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])

INFO:.:__Python VERSION: 3.6.9 (default, Oct  8 2020, 12:12:24) 
[GCC 8.4.0]
INFO:.:torch version: 1.7.0+cu101
INFO:.:CUDNN VERSION: 7603
INFO:.:Number CUDA Devices: 1
INFO:.:Devices:
INFO:.:Active CUDA Device: 0
INFO:.:Available device count: 1
INFO:.:Current cuda device: 0
INFO:.:Device: cuda:0


0

#### Utils

In [None]:
def download_file(packet_url, base_path="", extract=False, headers=None):
  """Download a file into base_path and optionally extract a zip/tar archive."""
  # Create the target directory (and any parents) if needed
  if base_path != "":
    os.makedirs(base_path, exist_ok=True)
  packet_file = os.path.basename(packet_url)
  packet_path = os.path.join(base_path, packet_file)
  # Stream the response to disk in chunks to keep memory use low
  with requests.get(packet_url, stream=True, headers=headers) as r:
    r.raise_for_status()
    with open(packet_path, 'wb') as f:
      for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)

  if extract:
    if packet_file.endswith(".zip"):
      with zipfile.ZipFile(packet_path) as zfile:
        zfile.extractall(base_path)
    else:
      # Anything that is not a zip is assumed to be a tar archive
      with tarfile.open(packet_path) as tfile:
        tfile.extractall(base_path)

## **Datasets**

#### Download

In [None]:
start_time = time.time()
download_file("https://storage.googleapis.com/computefest-2021/dog_data.zip", base_path="datasets", extract=True)
download_file("https://storage.googleapis.com/computefest-2021/dogs_qa.csv", base_path="datasets", extract=False)
download_file("https://storage.googleapis.com/computefest-2021/personadogchat03.json", base_path="datasets", extract=False)
download_file("https://storage.googleapis.com/computefest-2021/bad_words.csv", base_path="datasets", extract=False)
execution_time = (time.time() - start_time)/60.0
print("Download execution time (mins)",execution_time)

Download execution time (mins) 0.03277856111526489


#### Explore Files

##### data_dictionary.txt

In [None]:
datasets_path = "datasets"

# Data Dictionary
data_dictionary_path = os.path.join(datasets_path,"dog_data","data_dictionary.txt")
with open(data_dictionary_path, 'r') as file:
  data_dictionary = file.read()

print("Data Dictionary:")
print(data_dictionary)

Data Dictionary:
dogs.csv - one row for every dog taken into custody since 1/1/2017.
-fields:
—-"AnimalID" - public facing unique id
--"AnimalInternal-ID" - internal unique id - USE THIS to link to the other tables (dogs_photos.csv and dogs_website_memos.csv)
--"AnimalName" 
--"AnimalType" - always "Dog"
--"AnimalSex" - Male, Female or Unknown
--"AnimalCurrentWeightPounds" - decimal weight in pounds. NOTE: data quality of this field is mediocre at best. Staff are good about recording at least one weight around the time of intake but not as diligent about recording a weight prior to outcome.
--"AnimalDOB" -  DOB formatted as YYYYMMDD
--"AnimalBreed" - concatenation of primary and secondary breed fields delimited by " /". 
--"AnimalColor" - concatenation of primary and secondary colors fields delimited by " /". 
--"AnimalPattern" - animal pattern NOTE: not often populated for dogs. More often used for cats


dogs_photos.csv - one row for every photo uploaded to a dogs profile.
-fields:
-

##### dogs.csv

In [None]:
# dogs.csv
dogs_path = os.path.join(datasets_path,"dog_data","dogs.csv")
dogs = pd.read_csv(dogs_path)

# Compute age of dog
dogs['DOB'] = pd.to_datetime(dogs['AnimalDOB'], format='%Y%m%d')
dogs["Year"] = pd.DatetimeIndex(dogs['DOB']).year
# Whole years as a float; avoids astype('<m8[Y]'), which newer pandas versions reject
dogs["Age"] = np.floor((pd.Timestamp.now() - dogs['DOB']).dt.days / 365.25)

print("Shape:",dogs.shape)
dogs.head()

Shape: (17212, 13)


Unnamed: 0,AnimalID,AnimalInternal-ID,AnimalName,AnimalType,AnimalSex,AnimalCurrentWeightPounds,AnimalDOB,AnimalBreed,AnimalColor,AnimalPattern,DOB,Year,Age
0,45628,1444011,Emma,Dog,Female,53.3,20150306,"Retriever, Yellow Labrador /Mix",Blond /None,,2015-03-06,2015,5.0
1,45629,1444014,Rizzoli,Dog,Female,4.7,20161222,Mixed Breed (Small),Tan /None,,2016-12-22,2016,4.0
2,45630,1444017,Isles,Dog,Female,3.1,20161222,Mixed Breed (Small),White /None,,2016-12-22,2016,4.0
3,45631,1444020,Cory,Dog,Male,4.7,20161222,Mixed Breed (Small),Sable /None,,2016-12-22,2016,4.0
4,45632,1444023,Topanga,Dog,Female,8.0,20161222,Mixed Breed (Small),Tan /None,,2016-12-22,2016,4.0
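The age calculation above can also be illustrated without pandas. The following is a minimal sketch, assuming the `YYYYMMDD` format described in the data dictionary; the helper name `age_in_years` and the fixed reference date are illustrative choices, not part of the notebook.

```python
from datetime import date, datetime

def age_in_years(dob_yyyymmdd: str, today: date) -> int:
    """Return completed years between a YYYYMMDD date of birth and `today`."""
    dob = datetime.strptime(dob_yyyymmdd, "%Y%m%d").date()
    # Subtract one year if this year's birthday has not happened yet
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)

# Emma's AnimalDOB from the table above, evaluated on a fixed reference date
print(age_in_years("20150306", date(2021, 1, 19)))
```

Using a fixed reference date keeps the result reproducible, whereas the notebook computes the age relative to the current time.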


##### dogs_website_memos.csv

In [None]:
# dogs_website_memos.csv
dogs_website_memos_path = os.path.join(datasets_path,"dog_data","dogs_website_memos.csv")
with open(dogs_website_memos_path, 'r') as file:
  dogs_website_memos = file.read()

dogs_website_memos = dogs_website_memos.replace('\n\n','')
dogs_website_memos = dogs_website_memos.replace('\n \n','')
dogs_website_memos = dogs_website_memos.replace('"\n','"<EOL>')
dogs_website_memos = dogs_website_memos.replace('\\"','')
dogs_website_memos = dogs_website_memos.replace('\n','')
dogs_website_memos = dogs_website_memos.replace('<EOL>','\n')
print(dogs_website_memos[:5000])

dogs_website_memos = [row for row in dogs_website_memos.split(sep='\n')]
dogs_website_memos = dogs_website_memos[1:] # Remove header
dogs_website_memos = dogs_website_memos[:-1] # Remove last empty row

dogs_memos = []
for row in dogs_website_memos:
    dogs_memos.append({
        "AnimalInternal-ID": int(row.split(',"')[0]),
        "MemoText": row.split(',"')[1]
    })
dogs_website_memos = pd.DataFrame(dogs_memos)
print("Shape:", dogs_website_memos.shape)
dogs_website_memos.head()

"AnimalInternal-ID","MemoText"
1468738,"Meet Cornell, he's a social butterfly deluxe and he loves him some human contact. Cornell would love to go home with someone who wants to give and receive a ton of love and who will also have lots of fun training this pup on the commands he'll use to be the best of the best.Cornell does really well in his crate and is house trained. He's great with other dogs and would be happy to have some doggie siblings in the house if you have some. He hasn't been observed with kitties to date and he is also good with kids.Cornell is a very food motivated, smart and very affectionate pup, and adores being with his people. He's also friendly to all people he encounters. He hasn't quite got the hang of fetch yet, but does like balls and toy time. He definitely thinks walk time is super fun, too. He's kind of a typical pup, loves wherever he is led in life and enjoys the recharge time snuggling up as well.Cornell would really be a happy young guy if you were to 

Unnamed: 0,AnimalInternal-ID,MemoText
0,1468738,"Meet Cornell, he's a social butterfly deluxe a..."
1,1468727,Shaya is a puppy with potential extraordinaire...
2,1468736,Would you like to love good Luna? She's a wond...
3,1470308,Stanley is a seasoned elder statesman who stil...
4,1479031,Khaleesi is a purebred American Bulldog that ...
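The chain of `replace` calls above is easier to follow on a tiny input. The sketch below applies the same steps to a made-up two-row file; the function name `parse_memos` and the sample text are hypothetical, and stripping the closing quote from each memo is an addition for readability rather than something the notebook cell does.

```python
def parse_memos(raw: str):
    """Collapse multi-line quoted memos into one (id, memo) record per row."""
    text = raw.replace('\n\n', '').replace('\n \n', '')  # drop blank lines inside memos
    text = text.replace('"\n', '"<EOL>')   # closing quote + newline marks a real row end
    text = text.replace('\\"', '')         # drop escaped quotes inside memos
    text = text.replace('\n', '')          # any newline left is inside a memo
    text = text.replace('<EOL>', '\n')     # restore the real row ends
    rows = text.split('\n')[1:-1]          # skip the header and the trailing empty row
    records = []
    for row in rows:
        animal_id, memo = row.split(',"', 1)
        records.append((int(animal_id), memo.rstrip('"')))
    return records

# Made-up sample: the first memo spans several lines, the second is a single line
raw = ('"AnimalInternal-ID","MemoText"\n'
       '1,"Meet Rex.\n\nHe is calm.\nHe likes walks."\n'
       '2,"Luna is playful."\n')

for record in parse_memos(raw):
    print(record)
```

Because newlines inside a memo are removed rather than replaced by a space, sentences end up joined without a separator, which matches the "home.Luna has" style seen in the printed memos.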


In [None]:
# Wrap text to 80 characters.
wrapper = textwrap.TextWrapper(width=80) 

print(wrapper.fill(dogs_website_memos.iloc[2].MemoText))

Would you like to love good Luna? She's a wonderful adolescent who is ready to
make the transition to adulthood in her new family's home.Luna has plenty of
experience with both people and other dogs. She's very affectionate with all
people. There is nothing she likes better than a good petting session. Luna has
also spent a lot of time with other dogs and is a great playmate. She seems to
get along with most every other dog.Luna generally stays fairly calm around the
house, though she does have active spells. She's great on a leash and loves
walking. She'd even be up to being your running partner! She enjoys a good romp
with dog friends, and she can also entertain herself with her favorite toys. And
once she's burned off some energy, she loves to get in some quality napping -
preferably in her crate (she is crate trained).Luna would fit into most any type
of family, though she would most enjoy one that could give her plenty of
affection and activity. She loves children and would do wel

##### dogs_qa.csv

In [None]:
# dogs_qa.csv
dogs_qa_path = os.path.join(datasets_path,"dogs_qa.csv")
dogs_qa = pd.read_csv(dogs_qa_path)
print("Shape:",dogs_qa.shape)
dogs_qa.head()

Shape: (499, 3)


Unnamed: 0,breed,question,answer
0,"Terrier, Pit Bull/Mix",Are Pitbull Terriers good family dogs?,When raised with the proper training and socia...
1,"Terrier, Pit Bull/Mix",Does terrier mix mean pit bull?,A terrier mix combines one parent from a terri...
2,"Terrier, Pit Bull/Mix",What dog will kill a pitbull?,"So, what dog can beat a Pitbull? A Rottweiler ..."
3,"Terrier, Pit Bull/Mix",Do pitbulls like to cuddle?,"Even if a Pit Bull does not like other dogs, t..."
4,"Terrier, Pit Bull/Mix",Do pitbulls turn on their owners?,They can become aggressive and if you have an ...


##### bad_words.csv

In [None]:
# bad_words.csv
bad_words_path = os.path.join(datasets_path,"bad_words.csv")
bad_words = pd.read_csv(bad_words_path,header=None)
print("Shape:",bad_words.shape)
bad_words_list = bad_words[0].values.tolist()

Shape: (451, 1)


## **GPT2**

#### Overview

Comparing GPT2 with BERT:

<table>
<tr><td width="400"><strong>GPT2</strong></td><td width="400"><strong>BERT</strong></td></tr>
<tr><td>Auto-regressive model (A word is predicted using words from its left context only)</td><td>Masked Language Model</td></tr>
<tr><td>Made up of only the Decoder with stacked transformer blocks</td><td>Made up of only the Encoder with stacked transformer blocks</td></tr>

<tr><td>Unidirectional language model</td><td>Bidirectional language model</td></tr>
<tr><td>Good for writing text</td><td>Good for fill in the blanks</td></tr>
</table>
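The "left context only" property in the table can be pictured as a causal attention mask. This is a minimal sketch under simplifying assumptions: it uses whole words instead of GPT2's byte-pair tokens, and `causal_mask` is an illustrative helper, not a function from the transformers library.

```python
def causal_mask(n: int):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

tokens = ["Is", "Emma", "a", "good", "dog", "?"]
for tok, row in zip(tokens, causal_mask(len(tokens))):
    # Each position sees itself and everything to its left, never the future
    visible = [t for t, ok in zip(tokens, row) if ok]
    print(f"{tok!r:8} sees {visible}")
```

A bidirectional model such as BERT would correspond to a mask that is `True` everywhere, since every position can see both its left and right context.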

<br>  


**Language Model**: 

A model of how words appear in context with one another. It is trained on unlabeled text using unsupervised objectives such as next-word prediction or next-sentence prediction.
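Next-word prediction can be illustrated with a toy bigram model that counts which word follows which. This is only a sketch of the training objective on a made-up three-sentence corpus; GPT2 itself is a large neural network, and the names `train_bigram` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

corpus = ["emma is a good dog", "emma is a happy dog", "luna is a good dog"]
bigram_counts = train_bigram(corpus)
print(predict_next(bigram_counts, "a"))
```

No labels are needed: the text itself supplies the "correct" next word at every position, which is why this kind of training scales to very large corpora.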

**Question Answering Model**: 

In the most common usage, these are models that find the answer to a question within a given context text, similar to reading comprehension.

**Dialog Model**: 

A model that you can converse with. It keeps track of the conversation context/history, identifies user intents, and provides specific answers, e.g., a chatbot.
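One simple way to "keep track of the context/history" is to prepend recent turns to each new prompt. The class below is a hypothetical sketch of that idea; the `User:`/`Bot:` labels and the `max_turns` cutoff are illustrative assumptions, not the format used later in this lab.

```python
class DialogContext:
    """Keep the most recent exchanges and render them as a text prompt."""

    def __init__(self, max_turns=3):
        self.max_turns = max_turns
        self.turns = []  # list of (speaker, text) pairs

    def add(self, speaker, text):
        self.turns.append((speaker, text))
        # Keep only the latest exchanges so the prompt stays short
        self.turns = self.turns[-2 * self.max_turns:]

    def prompt(self, user_text):
        lines = [f"{speaker}: {text}" for speaker, text in self.turns]
        lines.append(f"User: {user_text}")
        lines.append("Bot:")
        return "\n".join(lines)

ctx = DialogContext()
ctx.add("User", "What is her name?")
ctx.add("Bot", "Her name is Emma")
print(ctx.prompt("How old is she?"))
```

With the earlier turns in the prompt, a language model has the information it needs to resolve "she" to Emma.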

<br> 

We want to build a model that can hold a dialog/conversation in a natural way. For this we will use the GPT2 model. GPT2 was trained on 40GB of Internet text and models language very well.  
First we will try the pretrained GPT2 out of the box, and then we will finetune it on the data of a single dog to see how a pretrained language model can be adapted to a custom dataset.

We will perform this task using the pretrained GPT2 model from the library <strong>transformers</strong>:
<img src="https://storage.googleapis.com/public_colab_images/nlp/gpt2/gpt2finetuning01.png"/>

#### Load Pretrained Model/Tokenizer

In [None]:
# load pretrained gpt2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [None]:
# Generate token for bad words
bad_words_tokens = [tokenizer.encode(x, add_special_tokens=False) for x in bad_words_list]
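The idea behind filtering bad words during generation can be sketched as follows: at each step, any token that would complete a banned token sequence has its score set to negative infinity so it can never be sampled. This is a simplified illustration with made-up token ids and hypothetical helper names, not the code the transformers library runs.

```python
import math

def banned_next_tokens(generated, banned_sequences):
    """Return the token ids that would complete a banned sequence."""
    banned = set()
    for seq in banned_sequences:
        prefix, last = seq[:-1], seq[-1]
        # A single-token sequence is always banned; a longer one only when
        # the text generated so far ends with its prefix.
        if not prefix or generated[-len(prefix):] == prefix:
            banned.add(last)
    return banned

def mask_logits(logits, banned):
    """Set the score of every banned token id to -inf."""
    return [-math.inf if i in banned else x for i, x in enumerate(logits)]

generated = [5, 7]                       # token ids produced so far (made up)
banned_sequences = [[7, 9], [3], [1, 2]]  # tokenized bad words (made up)
banned = banned_next_tokens(generated, banned_sequences)
print(sorted(banned))
```

This is why the bad words are tokenized first: the filter operates on token ids, and a word that splits into several tokens is only blocked at its final token.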

#### Language generation without finetuning

In [None]:
# Tokenize input text
input_ids = tokenizer.encode("Is Emma a good dog?", return_tensors='pt')
print("input_ids",input_ids)
# Use model to generate text
output = model.generate(input_ids, 
                        max_length=40, 
                        num_return_sequences=5, 
                        do_sample=True, 
                        early_stopping=True,
                        bad_words_ids=bad_words_tokens)  # generate() takes the banned token ids as `bad_words_ids`
print("Generated text:")
print('---------------------------------------------')
for i in range(len(output)):
  print(tokenizer.decode(output[i], skip_special_tokens=True))
  print('---------------------------------------------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


input_ids tensor([[ 3792, 18966,   257,   922,  3290,    30]])
Generated text:
---------------------------------------------
Is Emma a good dog? The answer, unfortunately, might not be a sure thing. As a pet for over 100 years, Emma was an incredibly important part of Emma Thompson's life. For an
---------------------------------------------
Is Emma a good dog? Oh, how she'd like to be. "I've been working on it for five months, and this thing is awesome."

Diana's favorite treats are
---------------------------------------------
Is Emma a good dog? Does she make her friends better? Or is she a coward or a hater? The latter questions every dog of any breed, from Labrador to Maltese to American Shepherd
---------------------------------------------
Is Emma a good dog?

Let's take a look at the data we have on Emma's breedable size, because it's not very interesting. As an example, we looked at the
---------------------------------------------
Is Emma a good dog? This is not a problem. Th

#### Finetuning GPT2

We see that the pretrained GPT2 model does not generate text that is meaningful in the context of our problem, so we will use transfer learning to adapt it.

#### Prepare Data

##### Meet Emma

Pull just Emma's data to explore the language model

In [None]:
# 1444011
animal_id = 1444011
emma_data = dogs[dogs["AnimalInternal-ID"] == animal_id]
print('Metadata:')
display(emma_data)
print('Memo about dog:')
print(wrapper.fill(dogs_website_memos[dogs_website_memos["AnimalInternal-ID"] == animal_id]["MemoText"].values[0]))

print('Common question/answers from Google about the breed:')
breed_qa = dogs_qa[dogs_qa["breed"] == "Retriever, Labrador/Mix"]
print(breed_qa.question.iloc[0])
print(wrapper.fill(breed_qa.answer.iloc[0]))
print('\n')
print(breed_qa.question.iloc[1])
print(wrapper.fill(breed_qa.answer.iloc[1]))
print('\n')
print(breed_qa.question.iloc[2])
print(wrapper.fill(breed_qa.answer.iloc[2]))

INFO:numexpr.utils:NumExpr defaulting to 2 threads.


Metadata:


Unnamed: 0,AnimalID,AnimalInternal-ID,AnimalName,AnimalType,AnimalSex,AnimalCurrentWeightPounds,AnimalDOB,AnimalBreed,AnimalColor,AnimalPattern,DOB,Year,Age
0,45628,1444011,Emma,Dog,Female,53.3,20150306,"Retriever, Yellow Labrador /Mix",Blond /None,,2015-03-06,2015,5.0


Memo about dog:
Emma is a blonde princess who definitely likes the finer things in life - like
kisses and hugs from her humans. She is so happy to be alive and even happier
when her hair can be flying out a car window!This very affectionate, funny girl,
now in foster care, always wants to start her day by getting those hugs and
kisses from her person. And then she's ready for a day of playing with her
brother and her foster dog buddy and playing fetch and tug-o-war.When playtime
is done, Emma is also ready for her schooling. This spontaneous girl is a fast
learner who already answers to the Come, Stop, No commands and is working on
Sit, Stay and Place. She's crate trained, house trained and is always happy to
match her person's activity level - up for a morning jog, then hanging out on
the couch. And she's sure she's the best smeller in the world - bloodhounds and
police dogs have nothing on her!Emma is a spirited, eager-to-please girl who's
perfect for a family who wants a companion o

##### Generating QA Text

In [None]:
def generate_qa_text(dog_data, breed_qa):
  # Columns unpacked, in order, for every dog row
  columns = ['AnimalName', 'AnimalType', 'AnimalSex', 'AnimalCurrentWeightPounds',
             'DOB', 'Year', 'Age', 'AnimalBreed', 'AnimalColor']
  rows = list(zip(*[dog_data[col] for col in columns]))

  # One list of questions per dog; the pronoun follows the recorded sex
  questions = [
    ['What is {} name?'.format('her' if a_sex == 'Female' else 'his'),
     'What is {}\'s type?'.format(name),
     'What is {}\'s gender?'.format(name),
     'What is {}\'s weight?'.format(name),
     'What is {}\'s date of birth?'.format(name),
     'When was {} born?'.format(name),
     'What is {}\'s age?'.format(name),
     'What is {}\'s breed?'.format(name),
     'What is {}\'s color?'.format(name),
    ]
    for name, a_type, a_sex, weight, dob, year, age, breed, color in rows
  ]

  # Matching answers, in the same order as the questions
  answers = [
    ['{} name is {}'.format('Her' if a_sex == 'Female' else 'His', name),
     '{} is a {}'.format(name, a_type),
     '{} is {}'.format(name, a_sex),
     '{}\'s weight is {}'.format(name, weight),
     '{}\'s date of birth is {}'.format(name, dob),
     '{} was born in {}'.format(name, year),
     '{} is {} years old'.format(name, age),
     '{}\'s breed is {}'.format(name, breed),
     '{}\'s color is {}'.format(name, color),
    ]
    for name, a_type, a_sex, weight, dob, year, age, breed, color in rows
  ]

  # Only the first dog in dog_data is used, followed by the breed-level Q&A
  qa_text_df = pd.DataFrame({'question': questions[0], 'answer': answers[0]})
  qa_text_df = pd.concat([qa_text_df, breed_qa[['question', 'answer']]])
  qa_text_df = qa_text_df.reset_index(drop=True)
  return qa_text_df

In [None]:
emma_df = generate_qa_text(emma_data,breed_qa)
print("Shape:",emma_df.shape)
emma_df.head()

Shape: (38, 2)


Unnamed: 0,question,answer
0,What is her name?,Her name is Emma
1,What is Emma's type?,Emma is a Dog
2,What is Emma's gender?,Emma is Female
3,What is Emma's weight?,Emma's weight is 53.3
4,What is Emma's date of birth?,Emma's date of birth is 2015-03-06 00:00:00


In [None]:
emma_df.tail()

Unnamed: 0,question,answer
33,What is the smallest breed of Labrador?,"Besides being smaller in size, miniature labra..."
34,What is a Labrador and golden retriever mix ca...,"Loving, devoted, and energetic, Goldador mixed..."
35,How can I tell if my lab is mixed?,Lab Mixed Breeds The best way to tell the diff...
36,What is a lab hound mix called?,The Bassador is a mixed breed dog–a cross betw...
37,What breed of dog goes well with a Labrador?,Boston Terrier. This is one of the breeds that...


In [None]:
# Save to text file
emma_df.to_csv('emma.txt', header=None, index=None, sep=' ')
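It helps to know what the lines of `emma.txt` look like before training on them. The sketch below mimics the export with Python's `csv` module, assuming that `to_csv` keeps its default minimal quoting; under that assumption, any field containing the space separator is wrapped in double quotes.

```python
import csv
import io

# Two question/answer pairs taken from the table above
rows = [("What is her name?", "Her name is Emma"),
        ("What is Emma's type?", "Emma is a Dog")]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=' ', quoting=csv.QUOTE_MINIMAL,
                    lineterminator='\n')
writer.writerows(rows)
print(buffer.getvalue())
```

Each training line is therefore a quoted question followed by a quoted answer, so the finetuned model also sees the quote characters as part of the text.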

#### Train

We will use a script from the Hugging Face library to finetune our model. The following repo contains many well-written scripts for training language models:

https://github.com/huggingface/transformers/tree/master/examples

In [None]:
# Download run_clm.py
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_clm.py

--2021-01-19 15:28:58--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_clm.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17046 (17K) [text/plain]
Saving to: ‘run_clm.py’


2021-01-19 15:28:58 (119 MB/s) - ‘run_clm.py’ saved [17046/17046]



In [None]:
!python run_clm.py \
    --output_dir='emma_model/' \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_file='emma.txt' \
    --do_eval \
    --validation_file='emma.txt' \
    --per_device_train_batch_size 1  \
    --num_train_epochs 50 \
    --overwrite_output_dir

2021-01-19 15:29:04.534844: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
01/19/2021 15:29:05 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=emma_model/, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=50.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs/Jan19_15-29-05_3c45f7627b2a, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, deb

#### Predict

In [None]:
# Load the finetuned model and tokenizer from the training step
tokenizer = GPT2Tokenizer.from_pretrained('./emma_model/')
model = GPT2LMHeadModel.from_pretrained('./emma_model/')

In [None]:
# Generate token for bad words
bad_words_tokens = [tokenizer.encode(x, add_special_tokens=False) for x in bad_words_list]

In [None]:
# Tokenize input text
input_ids = tokenizer.encode("Is Emma a good dog?", return_tensors='pt')
print("input_ids",input_ids)
# Use model to generate text
output = model.generate(input_ids, 
                        max_length=30, 
                        num_return_sequences=5, 
                        do_sample=True, 
                        temperature=1,
                        early_stopping=True,
                        bad_words_ids=bad_words_tokens)  # generate() takes the banned token ids as `bad_words_ids`
print("Generated text:")
print('---------------------------------------------')
for i in range(len(output)):
  print(tokenizer.decode(output[i], skip_special_tokens=True))
  print('---------------------------------------------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


input_ids tensor([[ 3792, 18966,   257,   922,  3290,    30]])
Generated text:
---------------------------------------------
Is Emma a good dog?

For anyone who has spent any time in animal shelter, you are about to find yourself in a situation where you
---------------------------------------------
Is Emma a good dog? Or an easy fix for her problems?

The answers are many — and very surprising, like she could do any
---------------------------------------------
Is Emma a good dog?

Is this a new thing? How would a dog like her own life fit in? (Note: I believe
---------------------------------------------
Is Emma a good dog? Well, why not.

If Emma could be a better person, the girl would need help becoming a more mature
---------------------------------------------
Is Emma a good dog? When the book becomes available on Kindle.

I wrote the book myself and it took me almost three years to finish
---------------------------------------------


The results are okay, but not great. Now let us reduce the temperature.
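To see what the temperature does, here is a minimal sketch of a temperature-scaled softmax. The three logits are made-up scores for three candidate tokens, and `softmax_with_temperature` is an illustrative helper rather than a library function.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.3):
    probs = softmax_with_temperature(logits, t)
    print("temperature", t, [round(p, 3) for p in probs])
```

A temperature below 1 sharpens the distribution, so sampling picks the top-scoring token more often. That makes the output more focused and more repetitive.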

In [None]:
# Reduce the temperature when generating text
output = model.generate(input_ids, 
                        max_length=30, 
                        num_return_sequences=5, 
                        do_sample=True, 
                        temperature=0.3,
                        early_stopping=True,
                        bad_words_ids=bad_words_tokens)  # generate() takes the banned token ids as `bad_words_ids`
print("Generated text:")
print('---------------------------------------------')

for i in range(len(output)):
  print(tokenizer.decode(output[i], skip_special_tokens=True))
  print('---------------------------------------------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated text:
---------------------------------------------
Is Emma a good dog?

I'm not sure. I'm not sure if she's a good dog.

I'm not sure
---------------------------------------------
Is Emma a good dog?

I think so.

I'm not sure if Emma is a good dog, but I think she's
---------------------------------------------
Is Emma a good dog?

I'm not sure if she's a good dog, but I think she's a good dog.


---------------------------------------------
Is Emma a good dog?

I don't know, but I think Emma is a good dog. She's a good dog, and she
---------------------------------------------
Is Emma a good dog?

No.

I am not a dog.

I am a human.

I am not
---------------------------------------------


Let's try another question

In [None]:
# Tokenize inputs
input_ids = tokenizer.encode("What is Emma's breed?", return_tensors='pt')
print("input_ids",input_ids)
# Use model to generate text
output = model.generate(input_ids, 
                        max_length=30, 
                        num_return_sequences=5, 
                        do_sample=True, 
                        temperature=0.3,
                        early_stopping=True,
                        bad_words_ids=bad_words_tokens)  # generate() takes the banned token ids as `bad_words_ids`
print("Generated text:")
print('---------------------------------------------')

for i in range(len(output)):
  print(tokenizer.decode(output[i], skip_special_tokens=True))
  print('---------------------------------------------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


input_ids tensor([[ 2061,   318, 18966,   338, 15939,    30]])
Generated text:
---------------------------------------------
What is Emma's breed?

Emma's breed is a hybrid of the two breeds. Emma's breed is a hybrid of the two breeds
---------------------------------------------
What is Emma's breed?

Emma is a black Labrador Retriever. She is a very intelligent and loving dog. She is very
---------------------------------------------
What is Emma's breed?

Emma is a male with a very short tail. She has a short, dark brown hairline and a
---------------------------------------------
What is Emma's breed?

Emma is a very rare breed. It is not a breed that is bred for the sake of breeding.
---------------------------------------------
What is Emma's breed?

Emma is a very rare breed in the UK. It is only found in the UK, and is not
---------------------------------------------


In [None]:
# Tokenize inputs
input_ids = tokenizer.encode("What does Emma enjoy doing?", return_tensors='pt')
print("input_ids",input_ids)
# Use model to generate text
output = model.generate(input_ids, 
                        max_length=30, 
                        num_return_sequences=5, 
                        do_sample=True, 
                        top_k=25,
                        top_p=0.45,
                        early_stopping=True,
                        # `bad_words_ids` is the keyword `generate` recognizes; it expects a list of token-id lists
                        bad_words_ids=bad_words_tokens)
print("Generated text:")

for i in range(len(output)):
  print(tokenizer.decode(output[i], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


input_ids tensor([[ 2061,   857, 18966,  2883,  1804,    30]])
Generated text:
What does Emma enjoy doing?

Emma is a very nice girl, and she loves to play with herself. She's a very good dancer
What does Emma enjoy doing? She loves to do it. She loves to be a part of the team. She loves to be a part of the
What does Emma enjoy doing?

Emma is a very active and creative person. She enjoys reading and writing, and she enjoys writing and writing
What does Emma enjoy doing? She loves to read and write. She enjoys playing with her toys and doing whatever she can to make her life better.
What does Emma enjoy doing?

Emma enjoys being a part of the community. She loves to talk about things and be a part of it


#### Nano Quiz

##### Question 1

* **Why did GPT2 fail at answering specific questions?**

##### Answer: 

GPT2 is a generative model, so it produces a sequence of words that is likely to follow the given prompt (the question, in this case). Although the generated text usually contains something that reads like an answer, the model is unable to give a specific answer to the question.

##### Question 2

* **How does temperature help in generating answers?**

##### Answer: 

A lower temperature pushes large probabilities closer to 1 and small probabilities closer to 0. As a result, the most likely words become even more likely to be sampled as the next word in the sequence, which makes the generated text more focused and less random.
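
To make the effect of temperature concrete, here is a minimal sketch in plain Python. The function name `softmax_with_temperature` and the toy scores are ours for illustration; they are not part of the notebook or of the `transformers` library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the raw scores by the temperature, then apply a softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens
toy_logits = [2.0, 1.0, 0.5]

for t in (1.0, 0.3):
    probs = softmax_with_temperature(toy_logits, temperature=t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

With the lower temperature, the highest-scoring token takes a larger share of the probability mass and the others shrink, which is why the samples generated with `temperature=0.3` above look so similar to one another.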

##### Question 3

* **How do the answers generated by GPT2 differ from the answers generated by BERT in Lab 2?**

##### Answer: 

BERT found an **exact match** of the answer in a given context. GPT2 **generated** an answer based on the given prompt. GPT2's generated answer need not be an exact sequence of words that exists in the original data.

## **GPT2 Double Head Model**

#### Overview

We have seen how a question answering model works, and we also saw how a language generation model works. Let's attempt to combine some of the ideas from the two models into one that can both answer questions and generate language. For this we will extend the GPT2 model.

**Causal Transformer**: 

We saw that GPT2 is made up of only the decoder, with stacked transformer blocks. The model also predicts each word using only the words in its left context. Let's look at our example on Emma:
<img src="https://storage.googleapis.com/public_colab_images/nlp/gpt2/causaltransformer02.png" width="800"/>
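
The "left context only" rule can be sketched as a causal mask. This is a minimal illustration in plain Python: the helper `causal_mask` is a name we made up, and the example splits the sentence on whitespace, whereas GPT2 actually works on byte-pair-encoded tokens.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# Toy example: whitespace tokens instead of real GPT2 tokens
tokens = ["Emma", "is", "a", "good", "dog"]
mask = causal_mask(len(tokens))

for i, row in enumerate(mask):
    visible = [tok for tok, keep in zip(tokens, row) if keep]
    print(f"{tokens[i]!r} can attend to {visible}")
```

Each position sees itself and everything to its left, and nothing to its right.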

**Double Head Model**: 

Now how do we adapt this language model to a dialog task? In a question answering model we had to feed in a context and the model returned an answer. The language model generated text based on previous words. So we use the GPT2 model as a base, and for the input we add some context to the data, such as:
- Information about the dog, or its `persona`
- The `history` of the dialogue with the user
- The `answer` of the dog

And as heads we add:
- Language Model Head
- Multiple Choice Head

GPT2 has by default one language model head, which takes the hidden states from the final transformer block and passes them to a linear layer to compute the logits. We then add another head, called the multiple choice head, which takes the hidden states from the final transformer block and summarizes the sequence of hidden states into a single vector. This can be done using `last`, which takes the hidden state of the last token, `first`, which takes the hidden state of the first token, or `mean`, which takes the mean of the hidden states of all tokens.

<img src="https://storage.googleapis.com/public_colab_images/nlp/gpt2/gpt2doubleheadmodel.png" />
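
The three summary options can be sketched on a toy sequence of hidden states. This is plain Python with a made-up helper `summarize` and made-up numbers, not the library's implementation. In the utilities later in this notebook, `mc_token_ids` is set to the index of the last input token, which corresponds to picking the hidden state at the end of the sequence.

```python
def summarize(hidden_states, mode="last"):
    """Collapse a list of per-token hidden vectors into a single vector."""
    if mode == "last":
        return list(hidden_states[-1])
    if mode == "first":
        return list(hidden_states[0])
    if mode == "mean":
        n = len(hidden_states)
        # zip(*hidden_states) iterates over the hidden dimensions
        return [sum(column) / n for column in zip(*hidden_states)]
    raise ValueError(f"unknown mode: {mode}")

# Toy hidden states: 3 tokens, hidden size 2 (hypothetical values)
toy_hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

for mode in ("last", "first", "mean"):
    print(mode, summarize(toy_hidden, mode))
```

The multiple choice head then maps this single vector to one score per candidate reply.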


**Word Embeddings**: Each word in the dataset is mapped to a numerical vector. These vectors capture relationships between words, so, for example, words with similar meanings or concepts end up close together in the vector space.

**Positional Embedding**: A transformer-based model has no built-in sense of the order of its input. To give the model some sense of order, we add a piece of information to each word about its position in the sentence. A positional embedding is an n-dimensional vector that contains information about a specific position in a sentence.

**Segment Embedding**: Our input consists of persona, history, and answer, so we want to add information about which segment each part of the input belongs to.
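
As a rough sketch of how the three kinds of information line up, the snippet below builds a toy input from persona, history, and reply. It mirrors the structure of `build_input_from_segments` from the Utils section of this notebook, but works on word strings instead of token ids so the result is readable. The helper name `build_toy_input` and the example sentences are ours.

```python
from itertools import chain

def build_toy_input(persona, history, reply):
    """Concatenate persona, history and reply, tagging each turn with a speaker token."""
    bos, eos, speaker1, speaker2 = "<bos>", "<eos>", "<speaker1>", "<speaker2>"
    sequence = [[bos] + list(chain(*persona))] + history + [reply + [eos]]
    sequence = [sequence[0]] + [
        [speaker2 if (len(sequence) - i) % 2 else speaker1] + s
        for i, s in enumerate(sequence[1:])
    ]
    input_tokens = list(chain(*sequence))  # looked up in the word embeddings
    segment_tokens = [                     # looked up in the segment embeddings
        speaker2 if i % 2 else speaker1 for i, s in enumerate(sequence) for _ in s
    ]
    positions = list(range(len(input_tokens)))  # looked up in the position embeddings
    return input_tokens, segment_tokens, positions

words, segments, positions = build_toy_input(
    persona=[["i", "am", "emma"]],
    history=[["hi"]],
    reply=["woof"],
)
for w, s, p in zip(words, segments, positions):
    print(f"{p:2d}  {s:11s}  {w}")
```

Every input token gets one word id, one segment id, and one position; the model looks up an embedding for each of the three.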


**Finetuning Options**: 

There are multiple options for performing transfer learning and finetuning for our final dialog model:
<img src="https://storage.googleapis.com/public_colab_images/nlp/gpt2/gpt2dhfinetuning01.png" width="800"/>

- PERSONA-CHAT dataset size: 17,000
- Our dog dataset (small): 800

#### Load Pretrained Model/Tokenizer

In [None]:
model_url = "https://computefest2021images.s3.amazonaws.com/language_models/trained_model_epochs_1.zip"
start_time = time.time()
download_file(model_url, base_path="models", extract=True)
execution_time = (time.time() - start_time)/60.0
logger.info("Download execution time (mins): %s",execution_time)

INFO:.:Download execution time (mins): 0.29630059003829956


In [None]:
# Load trained model
model = GPT2DoubleHeadsModel.from_pretrained("./models/trained_model/")
# Convert model parameter tensors to CUDA tensors
model.to(device)
# Load trained Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("./models/trained_model/")

#### Utils

In [None]:
SPECIAL_TOKENS = ["<bos>", "<eos>", "<speaker1>", "<speaker2>", "<pad>"]
ATTR_TO_SPECIAL_TOKEN = {
    "bos_token": "<bos>",
    "eos_token": "<eos>",
    "pad_token": "<pad>",
    "additional_special_tokens": ["<speaker1>", "<speaker2>"],
}
MODEL_INPUTS = ["input_ids", "mc_token_ids", "lm_labels", "mc_labels", "token_type_ids"]
PADDED_INPUTS = ["input_ids", "lm_labels", "token_type_ids"]

In [None]:
# Utils for tokenization & data preparation
process_count = 1
multiprocessing_chunksize = 500

def tokenize_multi(data):
  obj, tokenizer = data
  if isinstance(obj, str):
      return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
  if isinstance(obj, dict):
      return dict((n, tokenize_multi((o, tokenizer))) for n, o in obj.items())
  return list(tokenize_multi((o, tokenizer)) for o in obj)

def tokenize(obj):
  if isinstance(obj, str):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
  if isinstance(obj, dict):
    return dict((n, tokenize(o)) for n, o in obj.items())

  data = [(d, tokenizer) for d in obj]
  with Pool(process_count) as p:
    tokenized_data = list(
        tqdm(p.imap(tokenize_multi, data, chunksize=multiprocessing_chunksize), total=len(data))
    )
  return tokenized_data

def build_input_from_segments(persona, history, reply, tokenizer, lm_labels=False, with_eos=True):
  """ Build a sequence of input from 3 segments: persona, history and last reply. """
  bos, eos, speaker1, speaker2 = tokenizer.convert_tokens_to_ids(SPECIAL_TOKENS[:-1])
  sequence = [[bos] + list(chain(*persona))] + history + [reply + ([eos] if with_eos else [])]
  sequence = [sequence[0]] + [
      [speaker2 if (len(sequence) - i) % 2 else speaker1] + s for i, s in enumerate(sequence[1:])
  ]
  instance = {}
  instance["input_ids"] = list(chain(*sequence))
  instance["token_type_ids"] = [speaker2 if i % 2 else speaker1 for i, s in enumerate(sequence) for _ in s]
  instance["mc_token_ids"] = len(instance["input_ids"]) - 1
  instance["lm_labels"] = [-100] * len(instance["input_ids"])
  if lm_labels:
      instance["lm_labels"] = ([-100] * sum(len(s) for s in sequence[:-1])) + [-100] + sequence[-1][1:]
  return instance

def pad_dataset(dataset, padding=0):
  """ Pad the dataset. This could be optimized by defining a Dataset class and padding at the batch level,
  but this is simpler. """
  max_l = max(len(x) for x in dataset["input_ids"])
  for name in PADDED_INPUTS:
      dataset[name] = [x + [padding if name != "lm_labels" else -100] * (max_l - len(x)) for x in dataset[name]]
  return dataset

def prepare_datasets(dataset, num_candidates):
  datasets = defaultdict(list)
  for dialog in dataset:
    persona = dialog["personality"].copy()
    for _ in range(args.personality_permutations):
      for utterance in dialog["utterances"]:
          history = utterance["history"][-(2 * args.max_history + 1) :]
          for j, candidate in enumerate(utterance["candidates"][-num_candidates:]):
              lm_labels = bool(j == num_candidates - 1)
              instance = build_input_from_segments(persona, history, candidate, tokenizer, lm_labels)
              for input_name, input_array in instance.items():
                  datasets[input_name].append(input_array)
          datasets["mc_labels"].append(num_candidates - 1)
          datasets["n_candidates"] = num_candidates
      # permuted personalities
      persona = [persona[-1]] + persona[:-1]
  return datasets

def top_filtering(logits, top_k=0.0, top_p=0.9, threshold=-float("Inf"), filter_value=-float("Inf")):
  top_k = min(top_k, logits.size(-1))
  if top_k > 0:
      # Remove all tokens with a probability less than the last token in the top-k tokens
      indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
      logits[indices_to_remove] = filter_value

  if top_p > 0.0:
      # Compute cumulative probabilities of sorted tokens
      sorted_logits, sorted_indices = torch.sort(logits, descending=True)
      cumulative_probabilities = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

      # Remove tokens with cumulative probability above the threshold
      sorted_indices_to_remove = cumulative_probabilities > top_p
      # Shift the indices to the right to keep also the first token above the threshold
      sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
      sorted_indices_to_remove[..., 0] = 0

      # Back to unsorted indices and set them to -infinity
      indices_to_remove = sorted_indices[sorted_indices_to_remove]
      logits[indices_to_remove] = filter_value

  indices_to_remove = logits < threshold
  logits[indices_to_remove] = filter_value

  return logits

def generate_sequence(personality, history, tokenizer, model, current_output=None):
  with torch.no_grad():
    with amp.autocast():
      special_tokens_ids = tokenizer.convert_tokens_to_ids(SPECIAL_TOKENS)
      if current_output is None:
          current_output = []

      # Args
      max_length = 20
      temperature = 0.7
      top_k = 0
      top_p = 0.9
      do_sample = True
      min_length = 1

      for i in range(max_length):
          instance = build_input_from_segments(
              personality, history, current_output, tokenizer, with_eos=False
          )

          input_ids = torch.tensor(instance["input_ids"], device=device).unsqueeze(0)
          token_type_ids = torch.tensor(instance["token_type_ids"], device=device).unsqueeze(0)

          logits = model(input_ids, token_type_ids=token_type_ids)
          logits = logits[0]

          logits = logits[0, -1, :] / temperature
          logits = top_filtering(logits, top_k=top_k, top_p=top_p)
          probs = F.softmax(logits, dim=-1)

          prev = torch.topk(probs, 1)[1] if not do_sample else torch.multinomial(probs, 1)
          if i < min_length and prev.item() in special_tokens_ids:
              while prev.item() in special_tokens_ids:
                  if probs.max().item() == 1:
                      break  # avoid infinite loop
                  prev = torch.multinomial(probs, num_samples=1)

          if prev.item() in special_tokens_ids:
              break
          current_output.append(prev.item())

  return current_output
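
The nucleus (top-p) step inside `top_filtering` above can be hard to read in tensor form. Here is a minimal plain-Python sketch of the same selection rule on a toy distribution. The helper `top_p_keep` and the probabilities are ours for illustration: it returns the indices that survive filtering rather than masking logits in place.

```python
def top_p_keep(probs, top_p=0.9):
    """Return the indices kept by nucleus filtering.

    Tokens are taken in order of decreasing probability until the
    cumulative probability first exceeds top_p; the token that crosses
    the threshold is kept as well.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in order:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative > top_p:
            break
    return sorted(kept)

# Hypothetical next-token distribution over a 4-token vocabulary
toy_probs = [0.5, 0.3, 0.15, 0.05]

print("top_p=0.9  keeps", top_p_keep(toy_probs, top_p=0.9))
print("top_p=0.45 keeps", top_p_keep(toy_probs, top_p=0.45))
```

A smaller `top_p` keeps fewer candidates, so sampling becomes more conservative; the most likely token is always kept.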

#### Without finetuning

In [None]:
# Personality
test_personality=[
  'I am Emma',
  'I am a Dog',
  'My gender is Female',
  'My weight is 53.0',
  'I was born on 2009',
  'I am 11 years old',
  'My breed is Retriever, Yellow Labrador',
  'My color is White/Yello',
  'I am house trained','i like to play with toys']

# History
test_history = [
    "Hi",
    "woof woof"
]
# New chat message
test_message = "what do you like to play with?"

print(test_personality)
print(test_history)
print(test_message)

['I am Emma', 'I am a Dog', 'My gender is Female', 'My weight is 53.0', 'I was born on 2009', 'I am 11 years old', 'My breed is Retriever, Yellow Labrador', 'My color is White/Yello', 'I am house trained', 'i like to play with toys']
['Hi', 'woof woof']
what do you like to play with?


In [None]:
# Tokenize
personality = [tokenizer.encode(s.lower()) for s in test_personality]
history = [tokenizer.encode(s) for s in test_history]
history.append(tokenizer.encode(test_message))
# Generate output
output = generate_sequence(personality, history, tokenizer, model)

print("Generated text:")
print(tokenizer.decode(output, skip_special_tokens=True))

Generated text:
i like to play with my dog


#### With Finetuning

In [None]:
# Setup Arguments
parser = ArgumentParser()
parser.add_argument("--epochs", type=int, default=1, help="Number of training epochs")
parser.add_argument("--train_batch_size", type=int, default=4, help="Batch size for training")
parser.add_argument("--validation_batch_size", type=int, default=4, help="Batch size for validation")
parser.add_argument("--num_candidates", type=int, default=2, help="Number of candidates for training")
parser.add_argument("--max_history", type=int, default=2, help="Number of previous exchanges to keep in history")
parser.add_argument("--personality_permutations", type=int, default=1, help="Number of permutations of personality sentences")
parser.add_argument("--gradient_accumulation_steps", type=int, default=1, help="Accumulate gradients on several steps")
parser.add_argument("--learning_rate", type=float, default=4e-05, help="Learning rate")
parser.add_argument("--lm_coef", type=float, default=2.0, help="LM loss coefficient")
parser.add_argument("--mc_coef", type=float, default=1.0, help="Multiple-choice loss coefficient")
parser.add_argument("--weight_decay", type=float, default=0.0, help="Optimizer weight decay")
parser.add_argument("--warmup_steps", type=int, default=0, help="Number of warmup steps")
parser.add_argument("--warmup_ratio", type=float, default=0.06, help="Warmup ratio")
parser.add_argument("--adam_epsilon", type=float, default=1e-08, help="Adam optimizer epsilon")
parser.add_argument("--verbose", type=int, default=1, help="Verbose logging")
parser.add_argument("--max_norm", type=float, default=1.0, help="Clipping gradient norm")
parser.add_argument("--model_dir", type=str, default="model_outputs", help="Path to save model")

args = parser.parse_args("")
logger.info("Arguments: %s", args)

INFO:.:Arguments: Namespace(adam_epsilon=1e-08, epochs=1, gradient_accumulation_steps=1, learning_rate=4e-05, lm_coef=2.0, max_history=2, max_norm=1.0, mc_coef=1.0, model_dir='model_outputs', num_candidates=2, personality_permutations=1, train_batch_size=4, validation_batch_size=4, verbose=1, warmup_ratio=0.06, warmup_steps=0, weight_decay=0.0)


In [None]:
# If you want to try to fine tune from GPT2 pretrained weights directly here is the code
# # Model
# model = GPT2DoubleHeadsModel.from_pretrained("gpt2")

# # Tokenizer
# tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# # Add special tokens to the tokenizer and model
# orig_num_tokens = len(tokenizer.encoder)
# # Add special tokens
# num_added_tokens = tokenizer.add_special_tokens(ATTR_TO_SPECIAL_TOKEN)
# if num_added_tokens > 0:
#   model.resize_token_embeddings(new_num_tokens=orig_num_tokens + num_added_tokens)

# # Convert model parameter tensors to CUDA tensors
# model.to(device)

# print("model type:",type(model))

#### Prepare Data

In [None]:
# Read the personachat json file
personachat_file = os.path.join("datasets","personadogchat03.json")
with open(personachat_file, "r", encoding="utf-8") as f:
  personachat = json.loads(f.read())

In [None]:
# Tokenize dataset
train_processed = tokenize(personachat)

print("train count:",len(train_processed))
print(train_processed[:2])

train_num_candidates = len(train_processed[0]["utterances"][0]["candidates"])
if args.num_candidates > 0:
  train_num_candidates = min(args.num_candidates, train_num_candidates)

# Prepare dataset inputs & outputs
train_processed = prepare_datasets(train_processed, train_num_candidates)
print("After adding inputs/outputs:")
print("train_processed keys:", train_processed.keys())
print("input_ids:",len(train_processed["input_ids"][0]),train_processed["input_ids"][0])
print("token_type_ids:",len(train_processed["token_type_ids"][0]),train_processed["token_type_ids"][0])
print("mc_token_ids:",len(train_processed["mc_token_ids"]))
print("lm_labels:",len(train_processed["lm_labels"][0]),train_processed["lm_labels"][0])
print("mc_labels:",len(train_processed["mc_labels"]))
print("n_candidates:",train_processed["n_candidates"])

# Pad datasets
train_processed = pad_dataset(train_processed, padding=tokenizer.convert_tokens_to_ids(SPECIAL_TOKENS[-1]))
print("After Padding:")
print("input_ids:",len(train_processed["input_ids"][0]),train_processed["input_ids"][0])
print("token_type_ids:",len(train_processed["token_type_ids"][0]),train_processed["token_type_ids"][0])
print("mc_token_ids:",len(train_processed["mc_token_ids"]))
print("lm_labels:",len(train_processed["lm_labels"][0]),train_processed["lm_labels"][0])
print("mc_labels:",len(train_processed["mc_labels"]))

HBox(children=(FloatProgress(value=0.0, max=807.0), HTML(value='')))


train count: 807
[{'personality': [[40, 716, 21714, 18971], [40, 716, 257, 8532], [3666, 5279, 318, 15396], [3666, 3463, 318, 5996, 13, 15], [40, 373, 4642, 319, 3717, 2931, 1314], [3666, 15939, 318, 4990, 380, 964, 11, 12550, 45246], [3666, 3124, 318, 2635, 14, 14202], [40, 716, 2156, 8776]], 'utterances': [{'candidates': [[1662, 1107, 1312, 588, 852, 1363, 1804, 2147, 837, 340, 318, 7427, 5145], [31373, 612, 837, 1545, 5145, 644, 389, 345, 510, 284, 428, 845, 3734, 1110, 5633], [1014, 3608, 764, 1312, 2513, 319, 262, 10481, 290, 2342, 262, 26428, 790, 1755, 764], [40909, 837, 484, 389, 9616, 616, 11077, 290, 1312, 3613, 1637, 981, 287, 4152], [72, 1101, 7926, 837, 1312, 1101, 407, 5385, 351, 607, 764, 1312, 2883, 3555], [258, 318, 5650, 837, 523, 339, 7622, 502, 845, 922, 1664, 611, 345, 651, 644, 1312, 1612], [72, 1842, 1642, 649, 8242, 503, 286, 1468, 290, 4379, 616, 3988, 287, 1398, 1657, 510], [3810, 318, 991, 379, 262, 2479, 339, 655, 7832, 284, 711, 319, 616, 220, 13323, 505, 

In [None]:
# Create Tensors
train_tensor_datasets = []
validate_tensor_datasets = []
for input_name in MODEL_INPUTS:
  train_tensor = torch.tensor(train_processed[input_name])
  if input_name != "mc_labels":
      train_tensor = train_tensor.view((-1, train_processed["n_candidates"]) + train_tensor.shape[1:])
  train_tensor_datasets.append(train_tensor)

# Tensor Dataset
train_tensor_dataset = TensorDataset(*train_tensor_datasets)

# Create Data Loaders
train_data_sampler = RandomSampler(train_tensor_dataset)
train_data_loader = DataLoader(train_tensor_dataset, sampler=train_data_sampler, batch_size=args.train_batch_size)

logger.info("Train DataLoader (Batch, Candidates, Seq length): {}".format(train_tensor_dataset.tensors[0].shape))

INFO:.:Train DataLoader (Batch, Candidates, Seq length): torch.Size([7263, 2, 125])


#### Train

In [None]:
training_steps = len(train_data_loader) // args.gradient_accumulation_steps * args.epochs

warmup_steps = math.ceil(training_steps * args.warmup_ratio)
warmup_steps = warmup_steps if args.warmup_steps == 0 else args.warmup_steps
print("warmup_steps:", warmup_steps)

# Optimizer
optimizer = AdamW(model.parameters(), lr=args.learning_rate, eps=args.adam_epsilon)

# Learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=training_steps)

warmup_steps: 109
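
The warmup count printed above follows from numbers already shown: 7263 training examples in batches of 4 give 1816 steps for one epoch, and 6% of that rounds up to 109. The sketch below reproduces that arithmetic and the shape of a linear schedule with warmup in plain Python. The function `linear_warmup_multiplier` is a name we made up; it is meant to illustrate the shape of the schedule, not to replace the library call used in the cell.

```python
import math

def linear_warmup_multiplier(step, warmup_steps, training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 during warmup
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps
    return max(0.0, (training_steps - step) / max(1, training_steps - warmup_steps))

num_examples, batch_size, epochs = 7263, 4, 1   # values shown in the cells above
training_steps = math.ceil(num_examples / batch_size) * epochs
warmup_steps = math.ceil(training_steps * 0.06)  # warmup_ratio default
print("training_steps:", training_steps, "warmup_steps:", warmup_steps)

base_lr = 4e-05  # learning_rate default
for step in (0, warmup_steps // 2, warmup_steps, training_steps):
    lr = base_lr * linear_warmup_multiplier(step, warmup_steps, training_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

The learning rate starts at zero, reaches its full value at the end of warmup, and then decays linearly back to zero by the last training step.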


In [None]:
# Free Memory
torch.cuda.empty_cache()

disable = True if args.verbose == 0 else False
global_step = 0
training_progress_scores = None
tr_loss, logging_loss = 0.0, 0.0
model.zero_grad()
train_iterator = trange(int(args.epochs), desc="Epoch", disable=disable)
epoch_number = 0
best_eval_metric = None
early_stopping_counter = 0
logging_steps = 50

# Create directory to save model
model_dir = args.model_dir
os.makedirs(model_dir, exist_ok=True)

scaler = amp.GradScaler()

start_time = time.time()
for _ in train_iterator:
    model.train()
    train_iterator.set_description(f'Epoch {epoch_number + 1} of {args.epochs}')
    batch_iterator = tqdm(
        train_data_loader,
        desc=f'Running Epoch {epoch_number} of {args.epochs}',
        disable=disable,
        mininterval=0,
    )
    for step, batch in enumerate(batch_iterator):
        batch = tuple(t.to(device) for t in batch)
        input_ids, mc_token_ids, lm_labels, mc_labels, token_type_ids = batch

        with amp.autocast():
          model_outputs = model(
              input_ids,
              token_type_ids=token_type_ids,
              mc_token_ids=mc_token_ids,
              mc_labels=mc_labels,
              labels=lm_labels,
          )
          mc_loss = model_outputs["mc_loss"]
          lm_loss = model_outputs["loss"]
          loss = lm_loss * args.lm_coef + mc_loss * args.mc_coef

        current_loss = loss.item()

        print("\rRunning loss: %f" % current_loss, end="")

        if args.gradient_accumulation_steps > 1:
          loss = loss / args.gradient_accumulation_steps

        scaler.scale(loss).backward()

        tr_loss += loss.item()
        if (step + 1) % args.gradient_accumulation_steps == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_norm)
            scaler.step(optimizer)
            scaler.update()

            # Update learning rate schedule
            scheduler.step()
            model.zero_grad()
            global_step += 1

            if logging_steps > 0 and global_step % logging_steps == 0:
                logging_loss = tr_loss

    epoch_number += 1

execution_time = (time.time() - start_time)/60.0
logger.info("Execution time (mins): %s",execution_time)

HBox(children=(FloatProgress(value=0.0, description='Epoch', max=1.0, style=ProgressStyle(description_width='i…

HBox(children=(FloatProgress(value=0.0, description='Running Epoch 0 of 1', max=1816.0, style=ProgressStyle(de…

Running loss: 3.742670Running loss: 6.735260



Running loss: 0.000042

INFO:.:Execution time (mins): 5.8799940824508665


Running loss: 0.000052



#### Predict

In [None]:
# Personality
test_personality=[
  'I am Emma',
  'I am a Dog',
  'My gender is Female',
  'My weight is 53.0',
  'I was born on 2009',
  'I am 11 years old',
  'My breed is Retriever, Yellow Labrador',
  'My color is White/Yello',
  'I am house trained','i like to play with toys']

# History
test_history = [
    "Hi",
    "woof woof"
]

print(test_personality)
print(test_history)

['I am Emma', 'I am a Dog', 'My gender is Female', 'My weight is 53.0', 'I was born on 2009', 'I am 1 years old', 'My breed is Retriever, Yellow Labrador', 'My color is White/Yello', 'I am house trained', 'i like to play with toys']
['Hi', 'woof woof']


In [None]:
# New chat message
test_message = "How old are you?" # what do you like to play with?, Are you house trained?, How old are you?

# Tokenize test inputs
personality = [tokenizer.encode(s.lower()) for s in test_personality]
history = [tokenizer.encode(s) for s in test_history]
history.append(tokenizer.encode(test_message))
# Generate output
output = generate_sequence(personality, history, tokenizer, model)

print("Question:")
print(test_message)

print("Answer:")
print(tokenizer.decode(output, skip_special_tokens=True))

Question:
How old are you?
Answer:
i am 1 year old


Now we see some promising results 🐶🐶🐶👏👏👏

#### Nano Quiz

##### Question 1

* **What is the purpose of the two heads in the GPT2 Double Head Model?**

##### Answer:

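One head (the language model head) generates the language, and the other (the multiple choice head) checks the correctness of an answer by scoring candidate replies so that the true reply can be told apart from the distractors.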
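
A small sketch of how the two heads contribute to training, in plain Python. The candidate scores and the language modeling loss below are invented numbers; only the coefficients (`lm_coef=2.0`, `mc_coef=1.0`) and the convention that the true reply is the last candidate come from the cells above.

```python
import math

def cross_entropy(scores, target_index):
    """Softmax cross-entropy for one example: -log p(target)."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target_index]

# Hypothetical multiple choice head scores for [distractor, true reply]
mc_scores = [0.2, 1.5]
mc_loss = cross_entropy(mc_scores, target_index=len(mc_scores) - 1)

lm_loss = 2.3                 # hypothetical language modeling loss
lm_coef, mc_coef = 2.0, 1.0   # defaults from the arguments cell

loss = lm_loss * lm_coef + mc_loss * mc_coef
print(f"mc_loss={mc_loss:.3f}  total loss={loss:.3f}")
```

The language modeling term teaches the model to produce the reply word by word, while the multiple choice term rewards it for ranking the true reply above the distractor.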

##### Question 2

* **What was the problem with finetuning on our dog dataset directly from the GPT2 pretrained weights?**

##### Answer:

Our dialog task requires a variety of dialog data. We don't have a large enough dataset of conversations with a dog.

##### Question 3

* **What are word, position, and segment embeddings in the above model?**

##### Answer:

Word embeddings map each word in the dataset to a numerical vector.

Position embeddings give the model a sense of order by adding a piece of information to each word about its position in the sentence.

Segment embeddings give the model an idea of which parts of the input are the persona, the history, and the answer.


## **Save Model/Tokenizer**

In [None]:
# Save
model_dir = "trained_model"
os.makedirs(model_dir, exist_ok=True)

model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir)

('trained_model/tokenizer_config.json',
 'trained_model/special_tokens_map.json',
 'trained_model/vocab.json',
 'trained_model/merges.txt',
 'trained_model/added_tokens.json')

In [None]:
!zip -r finetuned_model_epochs_1.zip trained_model

  adding: trained_model/ (stored 0%)
  adding: trained_model/merges.txt (deflated 53%)
  adding: trained_model/tokenizer_config.json (deflated 67%)
  adding: trained_model/added_tokens.json (deflated 42%)
  adding: trained_model/config.json (deflated 50%)
  adding: trained_model/pytorch_model.bin (deflated 9%)
  adding: trained_model/special_tokens_map.json (deflated 42%)
  adding: trained_model/vocab.json (deflated 63%)


## **References**

### Research Papers
* [Attention is all you need (2017)](https://arxiv.org/abs/1706.03762)
* [GPT-2 (2019)](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

### Code

* [Building a State-of-the-Art Conversational AI with Transfer Learning](https://github.com/huggingface/transfer-learning-conv-ai)
* [Summary of the models](https://huggingface.co/transformers/model_summary.html)

### Articles

* [How to build a State-of-the-Art Conversational AI with Transfer Learning](https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313)
* [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/)
* [The Illustrated BERT, ELMo, and co.](http://jalammar.github.io/illustrated-bert/)