<a href="https://colab.research.google.com/github/sean-halpin/dialoGPT_Virtual_Character/blob/main/simpsons_script_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [9]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


Let's move to the desired folder in which we will store all our data.

In [10]:
import os
os.chdir("/content/drive/My Drive/Colab Notebooks")

Upload the dataset, e.g. `https://github.com/sean-halpin/dialoGPT_Virtual_Character/blob/main/simpsons_script_lines.zip`

In [11]:
import io, zipfile

In [12]:
f2e = "simpsons_script_lines.zip"
archive = zipfile.ZipFile(f2e, 'r')
archive.extractall("dataset/")
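
The extraction above leaves the archive open and fails with a bare traceback if the zip has not been uploaded yet. A slightly more defensive variant could look like the sketch below; `extract_dataset` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, target_dir="dataset"):
    """Extract zip_path into target_dir and return the archive's member names.

    Hypothetical helper for illustration only.
    """
    zip_path = Path(zip_path)
    if not zip_path.is_file():
        raise FileNotFoundError(f"Upload {zip_path.name} to the working directory first")
    # The context manager closes the archive even if extraction fails.
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

In this notebook it would be called as `extract_dataset("simpsons_script_lines.zip", "dataset/")`.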


In [None]:
# from google.colab import files

# uploaded = files.upload()

In [None]:
# for fn in uploaded:
#   print(fn)
#   archive = zipfile.ZipFile(fn, 'r')
#   archive.extractall("dataset/")


# Dependencies

In [13]:
! pip -q install transformers


In [14]:
from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small", force_download=False)
model = AutoModelWithLMHead.from_pretrained("microsoft/DialoGPT-small", force_download=False)


## Model initial configuration

Let's train our own chatbot. To start, we need a basic configuration and a dataset.
The configuration and training code are mostly based on this [script](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) from Huggingface and a great [tutorial](https://nathancooper.io/i-am-a-nerd/chatbot/deep-learning/gpt2/2020/05/12/chatbot-part-1.html) by Nathan Cooper.

In [15]:
"""
Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
using a masked language modeling (MLM) loss.
"""

import glob
import logging
import os
import pickle
import random
import re
import shutil
from typing import Dict, List, Tuple

import pandas as pd
import numpy as np
import torch

from sklearn.model_selection import train_test_split

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm.notebook import tqdm, trange

from pathlib import Path

from transformers import (
    MODEL_WITH_LM_HEAD_MAPPING,
    WEIGHTS_NAME,
    AdamW,
    AutoConfig,
    AutoModelWithLMHead,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)


try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter

# Configs
logger = logging.getLogger(__name__)

MODEL_CONFIG_CLASSES = list(MODEL_WITH_LM_HEAD_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

## Prepare Dataset

In [16]:
import pandas as pd
all_lines = pd.read_csv("dataset/simpsons_script_lines.csv")
all_lines.head(10)



Unnamed: 0,id,episode_id,number,raw_text,timestamp_in_ms,speaking_line,character_id,location_id,raw_character_text,raw_location_text,spoken_words,normalized_text,word_count
0,9549,32,209,"Miss Hoover: No, actually, it was a little of ...",848000,True,464.0,3.0,Miss Hoover,Springfield Elementary School,"No, actually, it was a little of both. Sometim...",no actually it was a little of both sometimes ...,31.0
1,9550,32,210,Lisa Simpson: (NEAR TEARS) Where's Mr. Bergstrom?,856000,True,9.0,3.0,Lisa Simpson,Springfield Elementary School,Where's Mr. Bergstrom?,wheres mr bergstrom,3.0
2,9551,32,211,Miss Hoover: I don't know. Although I'd sure l...,856000,True,464.0,3.0,Miss Hoover,Springfield Elementary School,I don't know. Although I'd sure like to talk t...,i dont know although id sure like to talk to h...,22.0
3,9552,32,212,Lisa Simpson: That life is worth living.,864000,True,9.0,3.0,Lisa Simpson,Springfield Elementary School,That life is worth living.,that life is worth living,5.0
4,9553,32,213,Edna Krabappel-Flanders: The polls will be ope...,864000,True,40.0,3.0,Edna Krabappel-Flanders,Springfield Elementary School,The polls will be open from now until the end ...,the polls will be open from now until the end ...,33.0
5,9554,32,214,Martin Prince: (HOARSE WHISPER) I don't think ...,877000,True,38.0,3.0,Martin Prince,Springfield Elementary School,I don't think there's anything left to say.,i dont think theres anything left to say,8.0
6,9555,32,215,Edna Krabappel-Flanders: Bart?,881000,True,40.0,3.0,Edna Krabappel-Flanders,Springfield Elementary School,Bart?,bart,1.0
7,9556,32,216,Bart Simpson: Victory party under the slide!,882000,True,8.0,3.0,Bart Simpson,Springfield Elementary School,Victory party under the slide!,victory party under the slide,5.0
8,9557,32,217,(Apartment Building: Ext. apartment building -...,889000,False,,374.0,,Apartment Building,,,
9,9558,32,218,Lisa Simpson: (CALLING) Mr. Bergstrom! Mr. Ber...,889000,True,9.0,374.0,Lisa Simpson,Apartment Building,Mr. Bergstrom! Mr. Bergstrom!,mr bergstrom mr bergstrom,4.0


In [17]:
spoken_lines = all_lines[all_lines['speaking_line'] == True]
chosen_spoken_lines = spoken_lines #[(spoken_lines['episode_id'] > 30) & (spoken_lines['episode_id'] < 220)]
chosen_spoken_lines.info()
chosen_spoken_lines.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 109032 entries, 0 to 131071
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  109032 non-null  int64  
 1   episode_id          109032 non-null  int64  
 2   number              109032 non-null  int64  
 3   raw_text            109032 non-null  object 
 4   timestamp_in_ms     109032 non-null  object 
 5   speaking_line       109032 non-null  object 
 6   character_id        109031 non-null  object 
 7   location_id         108696 non-null  float64
 8   raw_character_text  109031 non-null  object 
 9   raw_location_text   108696 non-null  object 
 10  spoken_words        109032 non-null  object 
 11  normalized_text     109017 non-null  object 
 12  word_count          109032 non-null  object 
dtypes: float64(1), int64(3), object(9)
memory usage: 11.6+ MB


Unnamed: 0,id,episode_id,number,raw_text,timestamp_in_ms,speaking_line,character_id,location_id,raw_character_text,raw_location_text,spoken_words,normalized_text,word_count
0,9549,32,209,"Miss Hoover: No, actually, it was a little of ...",848000,True,464.0,3.0,Miss Hoover,Springfield Elementary School,"No, actually, it was a little of both. Sometim...",no actually it was a little of both sometimes ...,31
1,9550,32,210,Lisa Simpson: (NEAR TEARS) Where's Mr. Bergstrom?,856000,True,9.0,3.0,Lisa Simpson,Springfield Elementary School,Where's Mr. Bergstrom?,wheres mr bergstrom,3
2,9551,32,211,Miss Hoover: I don't know. Although I'd sure l...,856000,True,464.0,3.0,Miss Hoover,Springfield Elementary School,I don't know. Although I'd sure like to talk t...,i dont know although id sure like to talk to h...,22
3,9552,32,212,Lisa Simpson: That life is worth living.,864000,True,9.0,3.0,Lisa Simpson,Springfield Elementary School,That life is worth living.,that life is worth living,5
4,9553,32,213,Edna Krabappel-Flanders: The polls will be ope...,864000,True,40.0,3.0,Edna Krabappel-Flanders,Springfield Elementary School,The polls will be open from now until the end ...,the polls will be open from now until the end ...,33


In [18]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
chosen_spoken_lines = chosen_spoken_lines.assign(
    normalized_text=chosen_spoken_lines.normalized_text.astype('string')
)


In [19]:
chosen_spoken_lines.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 109032 entries, 0 to 131071
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  109032 non-null  int64  
 1   episode_id          109032 non-null  int64  
 2   number              109032 non-null  int64  
 3   raw_text            109032 non-null  object 
 4   timestamp_in_ms     109032 non-null  object 
 5   speaking_line       109032 non-null  object 
 6   character_id        109031 non-null  object 
 7   location_id         108696 non-null  float64
 8   raw_character_text  109031 non-null  object 
 9   raw_location_text   108696 non-null  object 
 10  spoken_words        109032 non-null  object 
 11  normalized_text     109017 non-null  string 
 12  word_count          109032 non-null  object 
dtypes: float64(1), int64(3), object(8), string(1)
memory usage: 11.6+ MB


In [20]:
chosen_spoken_lines.isnull().sum()

id                      0
episode_id              0
number                  0
raw_text                0
timestamp_in_ms         0
speaking_line           0
character_id            1
location_id           336
raw_character_text      1
raw_location_text     336
spoken_words            0
normalized_text        15
word_count              0
dtype: int64

In [21]:
chosen_spoken_lines = chosen_spoken_lines.dropna()

We will convert this dataset so that every response row also contains the **n** previous responses as context. For our purposes, nine previous responses (`n = 9`) will be enough.
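
To make the windowing concrete, here is a toy version of the same loop on a five-line "script" with `n = 2`. The helper name `build_context_rows` and the toy data are made up for illustration; the index arithmetic is meant to match the cell that follows.

```python
def build_context_rows(lines, n):
    """Return one row per line that has n predecessors.

    Each row is [response, previous_1, ..., previous_n], newest first.
    Hypothetical helper for illustration only.
    """
    rows = []
    for i in range(n, len(lines)):
        # Walk backwards from the current line i down to line i - n (inclusive).
        rows.append([lines[j] for j in range(i, i - n - 1, -1)])
    return rows


toy_lines = ["a", "b", "c", "d", "e"]
rows = build_context_rows(toy_lines, n=2)
for row in rows:
    print(row)
```

The first `n` lines never appear as a response because they do not have enough history, so a script with `len(lines)` lines yields `len(lines) - n` rows.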

In [22]:
chosen_spoken_lines = chosen_spoken_lines.reset_index()

In [23]:
contexted = []

n = 9

for i in range(n, len(chosen_spoken_lines['spoken_words'])):
  row = []
  prev = i - 1 - n # subtract 1 more so the row holds the current response plus the n previous responses
  for j in range(i, prev, -1):
    row.append(chosen_spoken_lines['spoken_words'][j])
  contexted.append([chosen_spoken_lines['raw_character_text'][i]] + row)  

In [24]:
len(contexted)

108671

In [25]:
columns = ['speaker', 'response', 'context'] 
columns = columns + ['context/'+str(i) for i in range(n-1)]
columns

['speaker',
 'response',
 'context',
 'context/0',
 'context/1',
 'context/2',
 'context/3',
 'context/4',
 'context/5',
 'context/6',
 'context/7']

In [26]:
df = pd.DataFrame.from_records(contexted, columns=columns)
df.head(5)

Unnamed: 0,speaker,response,context,context/0,context/1,context/2,context/3,context/4,context/5,context/6,context/7
0,Landlady,"Hey, hey, he Moved out this morning. He must h...",Mr. Bergstrom! Mr. Bergstrom!,Victory party under the slide!,Bart?,I don't think there's anything left to say.,The polls will be open from now until the end ...,That life is worth living.,I don't know. Although I'd sure like to talk t...,Where's Mr. Bergstrom?,"No, actually, it was a little of both. Sometim..."
1,Lisa Simpson,Do you know where I could find him?,"Hey, hey, he Moved out this morning. He must h...",Mr. Bergstrom! Mr. Bergstrom!,Victory party under the slide!,Bart?,I don't think there's anything left to say.,The polls will be open from now until the end ...,That life is worth living.,I don't know. Although I'd sure like to talk t...,Where's Mr. Bergstrom?
2,Landlady,I think he's taking the next train to Capital ...,Do you know where I could find him?,"Hey, hey, he Moved out this morning. He must h...",Mr. Bergstrom! Mr. Bergstrom!,Victory party under the slide!,Bart?,I don't think there's anything left to say.,The polls will be open from now until the end ...,That life is worth living.,I don't know. Although I'd sure like to talk t...
3,Lisa Simpson,"The train, how like him... traditional, yet en...",I think he's taking the next train to Capital ...,Do you know where I could find him?,"Hey, hey, he Moved out this morning. He must h...",Mr. Bergstrom! Mr. Bergstrom!,Victory party under the slide!,Bart?,I don't think there's anything left to say.,The polls will be open from now until the end ...,That life is worth living.
4,Landlady,"Yes, and it's been the backbone of our country...","The train, how like him... traditional, yet en...",I think he's taking the next train to Capital ...,Do you know where I could find him?,"Hey, hey, he Moved out this morning. He must h...",Mr. Bergstrom! Mr. Bergstrom!,Victory party under the slide!,Bart?,I don't think there's anything left to say.,The polls will be open from now until the end ...


In [27]:
homer_lines = df[df['speaker'] == "Homer Simpson"]

In [28]:
homer_lines.head(5)

Unnamed: 0,speaker,response,context,context/0,context/1,context/2,context/3,context/4,context/5,context/6,context/7
42,Homer Simpson,Never thrown a party? What about that big bash...,"Goodbye, Lisa honey. It'll be okay. Just read ...","So, I guess this is it? It you don't mind I'll...",All aboard!,"Thank you, Mr. Bergstrom.",Whenever you feel like you're alone and there'...,I'll tell you what...,"I, I understand. Mr. Bergstrom, I'm going to m...",That's the problem with being middle class. An...,But I need you too.
43,Homer Simpson,"Bart didn't get one vote?! Oh, this is the wor...",Never thrown a party? What about that big bash...,"Goodbye, Lisa honey. It'll be okay. Just read ...","So, I guess this is it? It you don't mind I'll...",All aboard!,"Thank you, Mr. Bergstrom.",Whenever you feel like you're alone and there'...,I'll tell you what...,"I, I understand. Mr. Bergstrom, I'm going to m...",That's the problem with being middle class. An...
47,Homer Simpson,Oh.,Mr. Bergstrom left today.,"Lisa, tell your father.",Nothing.,"Bart didn't get one vote?! Oh, this is the wor...",Never thrown a party? What about that big bash...,"Goodbye, Lisa honey. It'll be okay. Just read ...","So, I guess this is it? It you don't mind I'll...",All aboard!,"Thank you, Mr. Bergstrom."
49,Homer Simpson,And?,He's gone. Forever.,Oh.,Mr. Bergstrom left today.,"Lisa, tell your father.",Nothing.,"Bart didn't get one vote?! Oh, this is the wor...",Never thrown a party? What about that big bash...,"Goodbye, Lisa honey. It'll be okay. Just read ...","So, I guess this is it? It you don't mind I'll..."
51,Homer Simpson,"Hey, just because I don't care doesn't mean I ...",I didn't think you'd understand.,And?,He's gone. Forever.,Oh.,Mr. Bergstrom left today.,"Lisa, tell your father.",Nothing.,"Bart didn't get one vote?! Oh, this is the wor...",Never thrown a party? What about that big bash...


In [29]:
len(homer_lines)

23010

In [30]:
del homer_lines['speaker']

Split our dataset into training and validation parts.
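
Note that an unseeded split gives different rows on every run. scikit-learn's `train_test_split` accepts a `random_state` argument for this; the standard-library sketch below only illustrates the idea of a seeded 90/10 split and is not scikit-learn's implementation. The name `seeded_split` is hypothetical.

```python
import random


def seeded_split(items, test_fraction=0.1, seed=42):
    """Shuffle a copy of items with a fixed seed and split off a validation part.

    Hypothetical helper for illustration; returns (train, validation).
    """
    shuffled = list(items)
    # A private Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_val:], shuffled[:n_val]


train_ids, val_ids = seeded_split(range(100))
print(len(train_ids), len(val_ids))
```

Calling the function twice with the same seed returns the same split, which makes later loss and perplexity numbers comparable between runs.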

In [31]:
trn_df, val_df = train_test_split(homer_lines, test_size = 0.1)
trn_df.head()

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5,context/6,context/7
46779,"Aww, she missed it.",My pies are done!,"Wow, it sure is slippery up here.","Well, come on down, you goofy roofy.","Relax, I'm fine. But when I do die, I don't wa...","Oh, I'm so relieved. Whenever you go on one of...","Hi, Maude. Diddily. I've been having fun with ...",Neddy? Where have you been?,Hey Maude! Look who's helping me clean the chi...,To make sure that no one knows that you're dea...
65881,"Moe, we've been complaining about him every ni...",Who's this Burns guy? Somebody you work with?,Slim to bupkus.,I gave Mr. Burns the best years of my life. An...,So my husband goes to a bar every night. Whoop...,"Thanks for trying, but I'll be at Moe's.","Oh, Homie. Don't let it get you down. So Mr. B...","Well, he does. All my life I've had one dream:...",I didn't know Mr. Burns had an electric eel pond.,"Ahhh, look at the little eels. Electric eels!"
9223,Oh.,No.,We're gonna start doin' it in the morning?,"Guess what, Homie... There's going to be twice...",Beep.,"Nooo! Bart, don't you ever do that again. Unde...",Got your wallet.,"Heh, heh. Got your nose.",Hi yo!,Hi yo!
10674,Woo hoo!,"And now, our evening comes to an end...","Unhand me, Yankee.","You're next, Chester A. Arthur!","C'mon, boy! Finish him off!","Hasta la vista, Abie.","Oh, no! John Wilkes Booth!",I thought that Civil War would never end. Now ...,"Milhouse, you have one line and then you're sh...","Miss Hoover, this beard's giving me a rash."
72662,What kind of bread?,"Uh, you die eating a submarine sandwich.","So, what'd I die of? Too much happiness? Naked...",Do me! Do me!,"People's deaths, eh?","Homer, I can foretell people's deaths!","Hey Flanders, have you seen my Frisbee?",Well I didn't need any special power to know t...,"Lord, why have you given me these unholy visio...",What the Family Circus? A second premonition c...


Now we will convert our dataset into a format suitable for our model. Basically, we will concatenate the responses of each row into one sequence, adding a special 'end of string' token after each response so that the model can tell where one response ends and the next begins.
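
The sketch below shows the resulting layout on a tiny row. `ToyTokenizer` is a made-up stand-in that maps each word to an integer and reserves 0 for the end-of-string token; the real notebook uses the DialoGPT tokenizer instead.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: one integer id per word."""

    eos_token_id = 0

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Ids start at 1 so that 0 stays reserved for the end-of-string token.
        return [self.vocab.setdefault(word, len(self.vocab) + 1) for word in text.split()]


def flatten_conversation(row, tokenizer):
    """Encode each utterance, append EOS, and concatenate them oldest-first."""
    encoded = [tokenizer.encode(text) + [tokenizer.eos_token_id] for text in row]
    # Rows are stored newest-first (response, then context), so reverse them.
    return [token for utterance in reversed(encoded) for token in utterance]


row = ["fine thanks", "how are you", "hi"]  # response, context, context/0
tokens = flatten_conversation(row, ToyTokenizer())
print(tokens)
```

Reading the output left to right gives the conversation in chronological order, with the response the model should learn to produce at the very end.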

In [47]:
# Args to allow for easy conversion of python script to notebook
class Args():
    def __init__(self):
        self.output_dir = 'output-small'
        self.model_type = 'gpt2'
        self.model_name_or_path = 'microsoft/DialoGPT-small'
        self.config_name = 'microsoft/DialoGPT-small'
        self.tokenizer_name = 'microsoft/DialoGPT-small'
        self.cache_dir = 'cached'
        self.block_size = 512
        self.do_train = True
        self.do_eval = True
        self.evaluate_during_training = False
        self.per_gpu_train_batch_size = 4
        self.per_gpu_eval_batch_size = 4
        self.gradient_accumulation_steps = 1
        self.learning_rate = 5e-4
        self.weight_decay = 0.0
        self.adam_epsilon = 1e-8
        self.max_grad_norm = 1.0
        self.num_train_epochs = 10
        self.max_steps = -1
        self.warmup_steps = 0
        self.logging_steps = 1000
        self.save_steps = 11000
        self.save_total_limit = None
        self.eval_all_checkpoints = False
        self.no_cuda = False
        self.overwrite_output_dir = True
        self.overwrite_cache = True
        self.should_continue = False
        self.seed = 42
        self.local_rank = -1
        self.fp16 = False
        self.fp16_opt_level = 'O1'

args = Args()
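
It helps to know roughly how many optimizer steps these settings imply, since `logging_steps` and `save_steps` are expressed in steps. The sketch below mirrors the `t_total` formula used in the training function later in this notebook, assuming a single GPU and a training split of about 90% of the 23,010 Homer rows. The function name is hypothetical, and the result is an upper bound because the training loop skips over-long batches.

```python
def total_optimization_steps(num_examples, batch_size, grad_accum_steps, num_epochs):
    """Mirror the t_total formula used when max_steps is not set (illustrative helper)."""
    # drop_last=True in the DataLoader discards the final partial batch.
    batches_per_epoch = num_examples // batch_size
    return batches_per_epoch // grad_accum_steps * num_epochs


num_train = 23010 - 23010 // 10  # assumed size of the 90% training split
steps = total_optimization_steps(num_train, batch_size=4, grad_accum_steps=1, num_epochs=10)
print(num_train, steps)
```

With `save_steps = 11000`, that many steps would produce a checkpoint roughly every two epochs.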

In [48]:
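def construct_conv(row, tokenizer, eos = True):
    # Encode every utterance in the row and terminate it with the EOS token.
    # Rows are stored newest-first (response, then context), so reverse them
    # to get chronological order before flattening into one token list.
    # Note: the `eos` argument is currently unused.
    flatten = lambda l: [item for sublist in l for item in sublist]
    conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
    conv = flatten(conv)
    return conv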

class ConversationDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):

        block_size = block_size - (tokenizer.model_max_length - tokenizer.max_len_single_sentence)

        directory = args.cache_dir
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size)
        )

        if os.path.exists(cached_features_file) and not args.overwrite_cache:
            logger.info("Loading features from cached file %s", cached_features_file)
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            logger.info("Creating features from dataset file at %s", directory)

            self.examples = []
            for _, row in df.iterrows():
                conv = construct_conv(row, tokenizer)
                self.examples.append(conv)

            logger.info("Saving features into cached file %s", cached_features_file)
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)
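
The examples returned by `__getitem__` have different lengths, so they have to be padded before they can be stacked into a batch; the `collate` helper in the training code does this with `pad_sequence`. The plain-Python sketch below is only a stand-in that shows the resulting shape, with `pad_batch` being a hypothetical name.

```python
def pad_batch(sequences, padding_value=0):
    """Right-pad every sequence to the length of the longest one (illustrative helper)."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (max_len - len(seq)) for seq in sequences]


batch = pad_batch([[5, 6, 7], [8], [9, 10]])
for padded in batch:
    print(padded)
```

As far as the code in this notebook shows, the padded batch is used as both inputs and labels without a mask, so padding positions also contribute to the loss.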

In [49]:
# Caching and storing of data/checkpoints

def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
    return ConversationDataset(tokenizer, args, df_val if evaluate else df_trn)


def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)


def _sorted_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> List[str]:
    ordering_and_checkpoint_path = []

    glob_checkpoints = glob.glob(os.path.join(args.output_dir, "{}-*".format(checkpoint_prefix)))

    for path in glob_checkpoints:
        if use_mtime:
            ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
        else:
            regex_match = re.match(".*{}-([0-9]+)".format(checkpoint_prefix), path)
            if regex_match and regex_match.groups():
                ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))

    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
    return checkpoints_sorted


def _rotate_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> None:
    if not args.save_total_limit:
        return
    if args.save_total_limit <= 0:
        return

    # Check if we should delete older checkpoint(s)
    checkpoints_sorted = _sorted_checkpoints(args, checkpoint_prefix, use_mtime)
    if len(checkpoints_sorted) <= args.save_total_limit:
        return

    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
    for checkpoint in checkpoints_to_be_deleted:
        logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
        shutil.rmtree(checkpoint)
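
The reason `_sorted_checkpoints` extracts the step number with a regex, rather than sorting the paths as strings, is easiest to see on an example. The sketch below re-implements just that ordering on made-up paths; `sort_checkpoints_by_step` is a hypothetical name.

```python
import re


def sort_checkpoints_by_step(paths, prefix="checkpoint"):
    """Order checkpoint paths by the integer step in their name (illustrative helper)."""
    pattern = re.compile(r".*{}-([0-9]+)".format(re.escape(prefix)))
    keyed = []
    for path in paths:
        match = pattern.match(path)
        if match:
            keyed.append((int(match.group(1)), path))
    return [path for _, path in sorted(keyed)]


paths = [
    "output-small/checkpoint-22000",
    "output-small/checkpoint-11000",
    "output-small/checkpoint-110000",
]
print(sorted(paths))                    # plain string sort
print(sort_checkpoints_by_step(paths))  # numeric sort by step
```

A string sort would place step 110000 before step 22000, so rotation based on it could delete the wrong checkpoint.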

## Training and Evaluating

Training our model takes quite a lot of code, but don't worry: everything should work as is. The main thing is to give the model the dataset in the right format.
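
Two small pieces of arithmetic sit underneath all that code: the learning-rate schedule and the perplexity metric. The sketch below is a plain-Python approximation of both, written from my understanding of a linear warmup/decay schedule rather than copied from the library, so treat the function names and the exact formula as assumptions.

```python
import math


def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


def perplexity(batch_losses):
    """Exponential of the mean per-batch language-modeling loss."""
    return math.exp(sum(batch_losses) / len(batch_losses))


# With warmup_steps = 0, the multiplier starts at 1 and falls to 0 at the last step.
print(linear_warmup_decay(0, 0, 100), linear_warmup_decay(50, 0, 100))
print(perplexity([0.0, 0.0]))
```

The multiplier is applied to `learning_rate`, so with `warmup_steps = 0` training starts at the full rate and ends near zero. A mean loss of 0 corresponds to a perplexity of 1, the best possible value.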


In [50]:
def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """ Train the model """
    if args.local_rank in [-1, 0]:
        tb_writer = SummaryWriter()

    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=collate, drop_last = True
    )

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    model = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
    model.resize_token_embeddings(len(tokenizer))
    # add_special_tokens_(model, tokenizer)


    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
    ):
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
        )

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0
    # Check if continuing training from a checkpoint
    if args.model_name_or_path and os.path.exists(args.model_name_or_path):
        try:
            # set global_step to global_step of last saved checkpoint from model path
            checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
            global_step = int(checkpoint_suffix)
            epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
            steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

            logger.info("  Continuing training from checkpoint, will skip to saved global_step")
            logger.info("  Continuing training from epoch %d", epochs_trained)
            logger.info("  Continuing training from global step %d", global_step)
            logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
        except ValueError:
            logger.info("  Starting fine-tuning.")

    tr_loss, logging_loss = 0.0, 0.0

    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
    )
    set_seed(args)  # Added here for reproducibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):

            # Skip past any already trained steps if resuming training
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue

            inputs, labels = (batch, batch)
            if inputs.shape[1] > 1024: continue
            inputs = inputs.to(args.device)
            labels = labels.to(args.device)
            model.train()
            outputs = model(inputs, labels=labels)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average on multi-gpu parallel training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:
                if args.fp16:
                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                else:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics
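                    if (
                        args.local_rank == -1 and args.evaluate_during_training
                    ):  # Only evaluate when single GPU otherwise metrics may not average well
                        # NOTE: evaluate() as defined below also requires df_trn and df_val,
                        # so this call raises a TypeError if evaluate_during_training is True.
                        results = evaluate(args, model, tokenizer)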
                        for key, value in results.items():
                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    tb_writer.add_scalar("lr", scheduler.get_last_lr()[0], global_step)
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    checkpoint_prefix = "checkpoint"
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
                    os.makedirs(output_dir, exist_ok=True)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Take care of distributed/parallel training
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)

                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)

                    _rotate_checkpoints(args, checkpoint_prefix)

                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    logger.info("Saving optimizer and scheduler states to %s", output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    if args.local_rank in [-1, 0]:
        tb_writer.close()

    return global_step, tr_loss / global_step

# Evaluation of some model

def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, df_trn, df_val, prefix="") -> Dict:
    # Compute perplexity on the validation data and write the result to the output directory
    eval_output_dir = args.output_dir

    eval_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=True)
    os.makedirs(eval_output_dir, exist_ok=True)
    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
    # Evaluation uses a SequentialSampler (created below), so batches are visited in a fixed order

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate, drop_last=True
    )

    # multi-gpu evaluate
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Eval!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    model.eval()

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        inputs, labels = (batch, batch)
        inputs = inputs.to(args.device)
        labels = labels.to(args.device)

        with torch.no_grad():
            outputs = model(inputs, labels=labels)
            lm_loss = outputs[0]
            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss))

    result = {"perplexity": perplexity}

    output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
    with open(output_eval_file, "w") as writer:
        logger.info("***** Eval results {} *****".format(prefix))
        for key in sorted(result.keys()):
            logger.info("  %s = %s", key, str(result[key]))
            writer.write("%s = %s\n" % (key, str(result[key])))

    return result
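
The `evaluate` function above reports perplexity as the exponential of the mean per-batch language-modeling loss. The following is a minimal standalone sketch of that relationship. The helper name `perplexity_from_losses` and the loss values are hypothetical and used only for illustration; they are not taken from the notebook.

```python
import math

def perplexity_from_losses(batch_losses):
    """Return exp(mean loss), mirroring how `evaluate` averages per-batch losses."""
    if not batch_losses:
        raise ValueError("need at least one batch loss")
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)

# Hypothetical per-batch cross-entropy values, for illustration only
example_losses = [0.70, 0.80, 0.75]
ppl = perplexity_from_losses(example_losses)
print(round(ppl, 4))

# A mean loss of 0 corresponds to a perplexity of exactly 1
print(perplexity_from_losses([0.0]))  # 1.0
```

A lower mean loss therefore always maps to a lower perplexity, with 1.0 as the floor.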

In [51]:
# Main runner

def main(df_trn, df_val):
    args = Args()
    
    if args.should_continue:
        sorted_checkpoints = _sorted_checkpoints(args)
        if len(sorted_checkpoints) == 0:
            raise ValueError("Used --should_continue but no checkpoint was found in --output_dir.")
        else:
            args.model_name_or_path = sorted_checkpoints[-1]

    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
        and not args.should_continue
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
                args.output_dir
            )
        )

    # Setup CUDA, GPU & distributed training
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is available
    args.n_gpu = torch.cuda.device_count()
    args.device = device

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        args.local_rank,
        device,
        args.n_gpu,
        bool(args.local_rank != -1),
        args.fp16,
    )

    # Set seed
    set_seed(args)

    config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
    model = AutoModelWithLMHead.from_pretrained(
        args.model_name_or_path,
        from_tf=False,
        config=config,
        cache_dir=args.cache_dir,
    )
    model.to(args.device)
    
    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)

        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
    if args.do_train:
        # Create output directory if needed
        os.makedirs(args.output_dir, exist_ok=True)

        logger.info("Saving model checkpoint to %s", args.output_dir)
        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        model_to_save = (
            model.module if hasattr(model, "module") else model
        )  # Take care of distributed/parallel training
        model_to_save.save_pretrained(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)

        # Good practice: save your training arguments together with the trained model
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Load a trained model and vocabulary that you have fine-tuned
        model = AutoModelWithLMHead.from_pretrained(args.output_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

            model = AutoModelWithLMHead.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, df_trn, df_val, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
            results.update(result)

    return results
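
The evaluation loop in `main` builds each result key from the checkpoint path using plain string operations. The sketch below isolates that logic so the resulting keys are easier to predict. The function name `result_key_suffix` is hypothetical; the two expressions inside it are copied from `main`, and the example paths are illustrative.

```python
def result_key_suffix(checkpoint, checkpoints):
    """Reproduce how `main` derives `prefix` and `global_step` from a checkpoint path."""
    global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
    prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
    return prefix, global_step

# Only the output directory is evaluated: both values are empty,
# so the result key becomes "perplexity_"
single = ["output-small"]
print(result_key_suffix(single[0], single))  # ('', '')

# Several checkpoints: the step is the text after the last "-"
many = ["output-small/checkpoint-11000", "output-small/checkpoint-22000"]
for path in many:
    print(result_key_suffix(path, many))
# ('checkpoint-11000', '11000')
# ('checkpoint-22000', '22000')
```

Because the step is taken from the text after the last hyphen, a directory name that contains a hyphen but no step number (such as `output-small`) would yield `small` rather than a number when several checkpoints are evaluated together.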

It is time to train our model!


In [52]:
main(trn_df, val_df)

03/29/2022 18:00:57 - INFO - __main__ -   Training/evaluation parameters <__main__.Args object at 0x7f2e11f82190>
03/29/2022 18:00:57 - INFO - __main__ -   Creating features from dataset file at cached
03/29/2022 18:01:33 - INFO - __main__ -   Saving features into cached file cached/gpt2_cached_lm_512
03/29/2022 18:01:34 - INFO - __main__ -   ***** Running training *****
03/29/2022 18:01:34 - INFO - __main__ -     Num examples = 20709
03/29/2022 18:01:34 - INFO - __main__ -     Num Epochs = 10
03/29/2022 18:01:34 - INFO - __main__ -     Instantaneous batch size per GPU = 4
03/29/2022 18:01:34 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 4
03/29/2022 18:01:34 - INFO - __main__ -     Gradient Accumulation steps = 1
03/29/2022 18:01:34 - INFO - __main__ -     Total optimization steps = 51770


Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Iteration:   0%|          | 0/5177 [00:00<?, ?it/s]




03/29/2022 18:28:46 - INFO - __main__ -   Saving model checkpoint to output-small/checkpoint-11000
03/29/2022 18:28:50 - INFO - __main__ -   Saving optimizer and scheduler states to output-small/checkpoint-11000



03/29/2022 18:56:07 - INFO - __main__ -   Saving model checkpoint to output-small/checkpoint-22000
03/29/2022 18:56:11 - INFO - __main__ -   Saving optimizer and scheduler states to output-small/checkpoint-22000



03/29/2022 19:23:25 - INFO - __main__ -   Saving model checkpoint to output-small/checkpoint-33000
03/29/2022 19:23:29 - INFO - __main__ -   Saving optimizer and scheduler states to output-small/checkpoint-33000



03/29/2022 19:50:45 - INFO - __main__ -   Saving model checkpoint to output-small/checkpoint-44000
03/29/2022 19:50:49 - INFO - __main__ -   Saving optimizer and scheduler states to output-small/checkpoint-44000



03/29/2022 20:10:00 - INFO - __main__ -    global_step = 51770, average loss = 0.7220463966561119
03/29/2022 20:10:00 - INFO - __main__ -   Saving model checkpoint to output-small
03/29/2022 20:10:05 - INFO - __main__ -   Evaluate the following checkpoints: ['output-small']
03/29/2022 20:10:07 - INFO - __main__ -   Creating features from dataset file at cached
03/29/2022 20:10:23 - INFO - __main__ -   Saving features into cached file cached/gpt2_cached_lm_512
03/29/2022 20:10:23 - INFO - __main__ -   ***** Running evaluation  *****
03/29/2022 20:10:23 - INFO - __main__ -     Num examples = 2301
03/29/2022 20:10:23 - INFO - __main__ -     Batch size = 4


Evaluating:   0%|          | 0/575 [00:00<?, ?it/s]

03/29/2022 20:10:47 - INFO - __main__ -   ***** Eval results  *****
03/29/2022 20:10:47 - INFO - __main__ -     perplexity = tensor(2.1498)


{'perplexity_': tensor(2.1498)}
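
The step counts in the log above can be reproduced with a little arithmetic. The sketch below uses the numbers printed in the log. Two points are assumptions, since the `Args` values are not shown here: that the last incomplete batch is dropped (consistent with 5177 iterations per epoch) and that checkpoints are written every 11000 steps (inferred from the checkpoint names).

```python
num_examples = 20709   # "Num examples" in the training log
batch_size = 4         # "Total train batch size" in the log
num_epochs = 10        # "Num Epochs" in the log
grad_accum_steps = 1   # "Gradient Accumulation steps" in the log

# Assumption: the final incomplete batch is dropped
steps_per_epoch = (num_examples // batch_size) // grad_accum_steps
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 5177 51770

# Assumption: a save interval of 11000 steps, inferred from the checkpoint names
save_steps = 11000
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))
print(checkpoint_steps)  # [11000, 22000, 33000, 44000]

# Evaluation: 2301 examples in batches of 4, last incomplete batch dropped
eval_batches = 2301 // 4
print(eval_batches)  # 575
```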

## Chatting

The model is ready, so it's time to chat.

Several decoding strategies can be used for response generation, including greedy search, beam search, and top-k and top-p sampling. See this [guide](https://huggingface.co/blog/how-to-generate) for details.
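
To give a feel for what temperature, top-k, and top-p do, here is a simplified pure-Python sketch of sampling one token from a list of logits. It is an illustration under stated assumptions, not the implementation used by `transformers`. The function name `sample_top_k_top_p` and the example logits are hypothetical.

```python
import math
import random

def sample_top_k_top_p(logits, top_k=100, top_p=0.55, temperature=0.55, rng=random):
    """Sample one token index after temperature scaling, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring token indices
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the surviving tokens (subtract the max for numerical stability)
    top = scaled[ranked[0]]
    exps = [math.exp(scaled[i] - top) for i in ranked]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p
    kept, weights, cumulative = [], [], 0.0
    for idx, p in zip(ranked, probs):
        kept.append(idx)
        weights.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    # Draw one token from the remaining candidates, weighted by probability
    return rng.choices(kept, weights=weights, k=1)[0]

# Hypothetical logits for a four-token vocabulary
example_logits = [2.0, 1.0, 0.5, -1.0]
print(sample_top_k_top_p(example_logits))  # 0: the top token alone covers more than 55% of the mass
```

With a low `top_p` and a low `temperature`, only a few candidates survive filtering, which makes repeated answers across sampled sequences more likely.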


In [53]:
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small', force_download=False)
model = AutoModelWithLMHead.from_pretrained('output-small')



In [54]:
questions = [
    "What is your name?",
    "Who are you?",
    "Where do you work?",
    "What is your favorite bar?",
    "What is the best beer in Springfield?",
    "Is Bart working for the Fat Tony?",
    "Tell me about Maggie",
    "Tell me about Marge",
    "Is Ned Flanders your neighbor?",
    "Tell me something about Bart",
    "Tell me more about Moe",
    "Where is Apu?",
    "Are you in a band?"
]

# Let's chat
for step in range(len(questions)):
    print("***************************************")
    print(questions[step])
    # Encode the user input, append the EOS token, and return a PyTorch tensor
    new_user_input_ids = tokenizer.encode(questions[step] + tokenizer.eos_token, return_tensors='pt')
    # new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')

    # Each question is answered independently, so no chat history is prepended
    # bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids
    bot_input_ids = new_user_input_ids

    num_return_seqs = 10

    # Generate candidate responses by sampling with top-k, top-p, and temperature
    chat_history_ids = model.generate(
        bot_input_ids,
        max_length=200,
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=3,
        do_sample=True,
        top_k=100,
        top_p=0.55,
        temperature=0.55,
        num_return_sequences=num_return_seqs
    )

    # Pretty-print only the newly generated tokens (everything after the prompt)
    botname = "HomerBot"
    for i in range(num_return_seqs):
        print("{}:{}: {}".format(i, botname, tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][i], skip_special_tokens=True)))

    # Reset the history so the next question starts a fresh conversation
    chat_history_ids = []  # chat_history_ids[0].unsqueeze(0)

***************************************
What is your name?
0:HomerBot: Donny.
1:HomerBot: Uh, George... Cauldron.
2:HomerBot: Donny.
3:HomerBot: Donny.
4:HomerBot: Uh, George... Cauldron.
5:HomerBot: Donny.
6:HomerBot: Uh, George... Cauldron.
7:HomerBot: Uh, George... Cauldron.
8:HomerBot: Uh, George... Cauldron.
9:HomerBot: Uh, George... Cauldron.
***************************************
Who are you?
0:HomerBot: I'm Andre Agassi.
1:HomerBot: In a minute!
2:HomerBot: In a minute!
3:HomerBot: In a minute!
4:HomerBot: I'm Andre Agassi.
5:HomerBot: I'm Andre Agassi.
6:HomerBot: I'm Andre Agassi.
7:HomerBot: I'm Andre Agassi.
8:HomerBot: I'm Andre Agassi.
9:HomerBot: In a minute!
***************************************
Where do you work?
0:HomerBot: Every day you find a new way to aggravate me.
1:HomerBot: Every day you find a new way to aggravate me.
2:HomerBot: I'm a Spaulding Gray in a Rick Dees world. Change me back to the blissful boob I was!
3:HomerBot: Every day you find a new way to

In [55]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [60]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.


In [None]:
model.push_to_hub("dialoGPT-homer-simpson", use_temp_dir=True)

Cloning https://huggingface.co/shalpin87/dialoGPT-homer-simpson into local empty directory.


Upload file pytorch_model.bin:   0%|          | 3.34k/487M [00:00<?, ?B/s]