<a href="https://colab.research.google.com/github/ttazulay/AI-AVATAR-CHATBOTS/blob/main/Aang.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune a DialoGPT model

Adapted from the notebook in [this Medium post](https://towardsdatascience.com/make-your-own-rick-sanchez-bot-with-transformers-and-dialogpt-fine-tuning-f85e6d1f4e30?gi=e4a72d1510f0).

## Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
!pip -q install transformers

In [None]:
import os
os.chdir("/content/drive/MyDrive/smallModel")

In [None]:
!ls


combined.csv  kaggle.json


In [None]:
# all the imports

import glob
import logging
import os
import pickle
import random
import re
import shutil
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm.notebook import tqdm, trange

from pathlib import Path

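# transformers' own AdamW is deprecated in favor of the PyTorch implementation,
# which accepts the same lr/eps arguments used in train() below
from torch.optim import AdamW
from transformers import (
    MODEL_WITH_LM_HEAD_MAPPING,
    WEIGHTS_NAME,
    AutoConfig,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)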


try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter

In [None]:
import torch
dtype=torch.float
device=torch.device("cpu")

## Get Data from Kaggle

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/kaggle.json

In [None]:
!head combined.csv

Character,Sentence
Aang,Where's Momo?
Aang,"Hang on, Momo! All right, you too. "
Sokka,This is gonna take forever.
Aang,That works.
Sokka,"These are Fire Nation traps, you can tell from the metalwork. We better pack up camp, and get moving."
Sokka,Uh uh. No flying this time.
Aang,What? Why wouldn't we fly?
Sokka,"Think about it: Somehow Prince Zuko and the Fire Nation keep finding us. It's because they spot Appa, he's just too noticeable."
Katara,What? Appa's not too noticeable!


In [None]:
# on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas >= 1.3)
data = pd.read_csv('combined.csv', on_bad_lines='skip')


In [None]:
data.sample(6)

Unnamed: 0,Character,Sentence
4699,Jet,That's it! Lake Laogai.
7336,Aang,Give me some of yours.
9422,Mai,I know one thing I care about. I care about you.
4858,Long Feng,Well it's imported of course. You know you can...
271,Katara,Is that your tribe?
6655,Sokka,"This is really bad! Please Aang, you have to e..."


In [None]:
# total number of lines in the dataset
len(data)

10068

In [None]:
# rename the columns to the names used in the rest of the notebook
data.rename(columns={'Character': 'name', 'Sentence': 'line'}, inplace=True)

In [None]:
# number of lines spoken by Aang
sum(data.name == 'Aang')

1818

In [None]:
CHARACTER_NAME = 'Aang'

In [None]:
contexted = []

# context window of size 7
n = 7

for i in data[data.name == CHARACTER_NAME].index:
  if i < n:
    continue
  row = []
  prev = i - 1 - n  # subtract an extra 1 so the row holds the current response plus the 7 previous lines
  for j in range(i, prev, -1):
    row.append(data.line[j])
  contexted.append(row)

columns = ['response', 'context'] 
columns = columns + ['context/' + str(i) for i in range(n - 1)]

df = pd.DataFrame.from_records(contexted, columns=columns)
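To make the row layout concrete, here is a minimal, library-free sketch of the same windowing on a toy transcript. The `build_context_rows` helper and the toy lines are hypothetical and not part of the notebook; the sketch assumes the default 0..N-1 integer index that `pd.read_csv` produces.

```python
from typing import List, Tuple


def build_context_rows(script: List[Tuple[str, str]], character: str, n: int) -> List[List[str]]:
    """For each line spoken by `character`, return [response, previous line, ..., n-th previous line]."""
    rows = []
    for i, (name, _) in enumerate(script):
        if name != character or i < n:
            continue  # skip other speakers and lines without n lines of history
        # walk backwards from the response: i, i-1, ..., i-n (n + 1 entries)
        rows.append([script[j][1] for j in range(i, i - n - 1, -1)])
    return rows


# hypothetical toy transcript as (speaker, line) pairs
toy_script = [
    ("Sokka", "line 0"),
    ("Katara", "line 1"),
    ("Aang", "line 2"),
    ("Sokka", "line 3"),
    ("Aang", "line 4"),
]

rows = build_context_rows(toy_script, "Aang", n=2)
for row in rows:
    print(row)
```

Each row starts with the character's response and then lists the context from newest to oldest, which is why the DataFrame columns run `response`, `context`, `context/0`, and so on.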

In [None]:
df.sample(6)

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5
1128,I don't understand. Why didn't you free yourse...,"Well, they didn't cover my face.",You could earthbend? All along?,"Hang on, Bumi! Our ride's here!",We can catch him!,There's Aang!,I seem to manage!,How are you gonna fight without your bending?
1351,I've gotta try.,What do you think? You're the one that has to ...,Everyone who's here today came prepared to ris...,Wait! If they knew we were coming it could all...,We can still do this. We can still win the day.,The mechanist gave me this timing device. It l...,If it's an underground secret bunker we're loo...,No. My instincts tell me he wouldn't go too fa...
1457,Sounds like a crazy fishing trip.,It kind of got destroyed.,What are you doing in this thing? What happene...,Put them somewhere I'll never have to see thei...,"What shall we do with them, princess?",You're both fools!,Come on! Let's get out of here!,"No, you miscalculated! You should have feared ..."
1134,What? I didn't even notice.,"Hey, you taking us down for a reason? Aang, wh...",Such a kind man.,"Ha, ha! Nothing like a fat man dancing for his...",They kiss so sweet that you really got to meet...,"Come on, we're talking a gold piece here! Let'...","Not professional anyway. It's a long, long way...",We're not performers.
20,What about me?,Katara. You can do this.,I've never used bending on water I can't see. ...,"All right, we're here. Underground water's try...",I'm glad he cooled off. He's so stubborn somet...,I guess something you said got through to him....,"Yeah, I did.","Yeah, I was surprised too. I got the sense tha..."
1390,I think you are supposed to be my firebending ...,"Listen, I know I didn't explain myself very we...","Hey, what about me? I did the boomerang thing.","I can't believe I'm saying this, but ... thank...","Yeah, boomerang! Awww, boomerang ...",I know how to get an angle on him! All right b...,I can't step out to waterbend at him without b...,He's going to blast this whole place right off...


In [None]:
trn_df, val_df = train_test_split(df, test_size=0.1)
trn_df.head()

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5
202,Huh?,"She's beautiful, by the way.",I know what you mean.,It's okay. It's just really hard when you like...,"Oh, I guess not.",But not the way I like you.,Of course I like you.,"You don't like me, do you?"
580,"I'm sure we'll be fine. Besides, did you see h...","Hey, don't get too comfortable. It's risky for...",What's he so angry about? It's great here. The...,Sneak attacks don't count! Tie me up with rope...,Right. And then they kicked your butt.,They snuck up on me!,He's just upset because a bunch of girls kicke...,But you're always hungry!
1586,"It's okay. If someone saw it, it would give aw...","Hey! What's ... oh, it's your glider.",We'll join up with my dad and the invasion for...,What about the invasion?,You didn't think you could get out of training...,"I know, but you'll have our help.",I have so much to do.,You're okay!
1382,"Anyway, when Zhao had me chained up, it was Zu...",I could feel it! It's my throatal flap!,"Sokka, I looked at it, and I told you, there w...",And you made us suck on frozen frogs? How coul...,I kind of have a confession to make. Remember ...,"The thing is, it worked. I did feel sorry for ...",He wants you to trust and feel sorry for him s...,This is just like when we were in prison toget...
257,"Hey, look at this.","""Flaming fire flakes"", hot? What do you know?",Aaahhh! Hot! Hot!,I'll take'em!,Flaming fire flakes! Best in town.,Finally! What do you have?,"Hey, there's some food.",That was surprisingly easy.


In [None]:
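# create dataset suitable for our model
def construct_conv(row, tokenizer, eos = True):
    # row is ordered [response, newest context, ..., oldest context];
    # encode each utterance, append the EOS token, then reverse so the response comes last
    flatten = lambda l: [item for sublist in l for item in sublist]
    conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
    # concatenate the per-utterance token lists into one flat sequence
    conv = flatten(conv)
    return conv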

class ConversationDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):

        block_size = block_size - (tokenizer.model_max_length - tokenizer.max_len_single_sentence)

        directory = args.cache_dir
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size)
        )

        if os.path.exists(cached_features_file) and not args.overwrite_cache:
            logger.info("Loading features from cached file %s", cached_features_file)
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            logger.info("Creating features from dataset file at %s", directory)

            self.examples = []
            for _, row in df.iterrows():
                conv = construct_conv(row, tokenizer)
                self.examples.append(conv)

            logger.info("Saving features into cached file %s", cached_features_file)
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)
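The flattening done by `construct_conv` can be illustrated without downloading a model. The sketch below uses a hypothetical word-level `ToyTokenizer` (not the DialoGPT tokenizer) and a hypothetical `flatten_conversation` helper, only to show the ordering and where the EOS ids end up.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: one id per whitespace-separated word."""

    eos_token_id = 0

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        # assign ids 1, 2, 3, ... in order of first appearance
        return [self.vocab.setdefault(word, len(self.vocab) + 1) for word in text.split()]


def flatten_conversation(row: List[str], tokenizer: ToyTokenizer) -> List[int]:
    """Encode each utterance, append EOS, and order the result oldest-first."""
    encoded = [tokenizer.encode(utterance) + [tokenizer.eos_token_id] for utterance in row]
    return [token for utterance in reversed(encoded) for token in utterance]


# row layout from the DataFrame: response first, then context from newest to oldest
row = ["fine thanks", "how are you", "hi"]
conv = flatten_conversation(row, ToyTokenizer())
print(conv)
```

The result reads in conversational order (oldest context first, response last), with an EOS id closing every utterance, so the model learns to produce the response as the continuation of the preceding turns.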

In [None]:
# Caching and storing of data/checkpoints

def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
    return ConversationDataset(tokenizer, args, df_val if evaluate else df_trn)


def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)


def _sorted_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> List[str]:
    ordering_and_checkpoint_path = []

    glob_checkpoints = glob.glob(os.path.join(args.output_dir, "{}-*".format(checkpoint_prefix)))

    for path in glob_checkpoints:
        if use_mtime:
            ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
        else:
            regex_match = re.match(".*{}-([0-9]+)".format(checkpoint_prefix), path)
            if regex_match and regex_match.groups():
                ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))

    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
    return checkpoints_sorted


def _rotate_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> None:
    if not args.save_total_limit:
        return
    if args.save_total_limit <= 0:
        return

    # Check if we should delete older checkpoint(s)
    checkpoints_sorted = _sorted_checkpoints(args, checkpoint_prefix, use_mtime)
    if len(checkpoints_sorted) <= args.save_total_limit:
        return

    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
    for checkpoint in checkpoints_to_be_deleted:
        logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
        shutil.rmtree(checkpoint)
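The checkpoint helpers above order directories by the step number in their name rather than alphabetically. The following is a small sketch of that idea on plain strings; `sort_checkpoints`, `checkpoints_to_delete`, and the example paths are hypothetical and touch no files.

```python
import re
from typing import List


def sort_checkpoints(paths: List[str], prefix: str = "checkpoint") -> List[str]:
    """Order checkpoint paths by the integer step in their name, oldest first."""
    pattern = re.compile(r".*{}-([0-9]+)".format(re.escape(prefix)))
    numbered = []
    for path in paths:
        match = pattern.match(path)
        if match:
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered)]


def checkpoints_to_delete(paths: List[str], save_total_limit: int) -> List[str]:
    """Return the oldest checkpoints that exceed `save_total_limit`."""
    if not save_total_limit or save_total_limit <= 0:
        return []  # rotation disabled, as with save_total_limit = None
    ordered = sort_checkpoints(paths)
    excess = max(0, len(ordered) - save_total_limit)
    return ordered[:excess]


# hypothetical checkpoint directories
paths = [
    "output-small/checkpoint-10500",
    "output-small/checkpoint-3500",
    "output-small/checkpoint-7000",
]
print(sort_checkpoints(paths))
print(checkpoints_to_delete(paths, save_total_limit=2))
```

Sorting on the parsed integer matters because a plain string sort would place `checkpoint-10500` before `checkpoint-3500`.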

## Build Model

In [None]:
from transformers import AutoModelWithLMHead, AutoModelForCausalLM, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
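# AutoModelForCausalLM is the non-deprecated equivalent of AutoModelWithLMHead for GPT-2 style models
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")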


In [None]:
"""
Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
using a masked language modeling (MLM) loss.
"""

# Configs
logger = logging.getLogger(__name__)

MODEL_CONFIG_CLASSES = list(MODEL_WITH_LM_HEAD_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

In [None]:
# Args to allow for easy conversion of the Python script to a notebook
class Args():
    def __init__(self):
        self.output_dir = 'output-small'
        self.model_type = 'gpt2'
        self.model_name_or_path = 'microsoft/DialoGPT-small'
        self.config_name = 'microsoft/DialoGPT-small'
        self.tokenizer_name = 'microsoft/DialoGPT-small'
        self.cache_dir = 'cached'
        self.block_size = 512
        self.do_train = True
        self.do_eval = True
        self.evaluate_during_training = False
        self.per_gpu_train_batch_size = 4
        self.per_gpu_eval_batch_size = 4
        self.gradient_accumulation_steps = 1
        self.learning_rate = 5e-5
        self.weight_decay = 0.0
        self.adam_epsilon = 1e-8
        self.max_grad_norm = 1.0
        self.num_train_epochs = 4
        self.max_steps = -1
        self.warmup_steps = 0
        self.logging_steps = 1000
        self.save_steps = 3500
        self.save_total_limit = None
        self.eval_all_checkpoints = False
        self.no_cuda = False
        self.overwrite_output_dir = True
        self.overwrite_cache = True
        self.should_continue = False
        self.seed = 42
        self.local_rank = -1
        self.fp16 = False
        self.fp16_opt_level = 'O1'

args = Args()
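The `learning_rate` and `warmup_steps` settings above feed the linear warmup/decay schedule created later with `get_linear_schedule_with_warmup`. As a rough sketch of what such a schedule does, assuming a linear ramp during warmup followed by a linear decay to zero, the learning rate at a given step could be computed as below. The `linear_schedule_lr` function and the `total_steps` value are hypothetical illustrations, not the library's code.

```python
def linear_schedule_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 5e-5        # args.learning_rate
warmup_steps = 0      # args.warmup_steps
total_steps = 1000    # hypothetical; the real value depends on dataset size and epochs

for step in (0, 250, 500, 1000):
    print(step, linear_schedule_lr(step, base_lr, warmup_steps, total_steps))
```

With `warmup_steps = 0` the schedule starts at the full learning rate and only decays, so the first updates are the largest ones.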

## Train and Evaluate

In [None]:
def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """ Train the model """
    if args.local_rank in [-1, 0]:
        tb_writer = SummaryWriter()

    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=collate, drop_last = True
    )

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    model = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
    model.resize_token_embeddings(len(tokenizer))
    # add_special_tokens_(model, tokenizer)


    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
    ):
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
        )

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0
    # Check if continuing training from a checkpoint
    if args.model_name_or_path and os.path.exists(args.model_name_or_path):
        try:
            # set global_step to global_step of last saved checkpoint from model path
            checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
            global_step = int(checkpoint_suffix)
            epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
            steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

            logger.info("  Continuing training from checkpoint, will skip to saved global_step")
            logger.info("  Continuing training from epoch %d", epochs_trained)
            logger.info("  Continuing training from global step %d", global_step)
            logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
        except ValueError:
            logger.info("  Starting fine-tuning.")

    tr_loss, logging_loss = 0.0, 0.0

    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
    )
    set_seed(args)  # Added here for reproducibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):

            # Skip past any already trained steps if resuming training
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue

            inputs, labels = (batch, batch)
            if inputs.shape[1] > 1024: continue
            inputs = inputs.to(args.device)
            labels = labels.to(args.device)
            model.train()
            outputs = model(inputs, labels=labels)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average on multi-gpu parallel training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:
                if args.fp16:
                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                else:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics
                    if (
                        args.local_rank == -1 and args.evaluate_during_training
                    ):  # Only evaluate when single GPU otherwise metrics may not average well
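                        # NOTE: evaluate() also requires df_trn and df_val, which train() does not receive;
                        # this call fails as written and is only reached when evaluate_during_training is True
                        results = evaluate(args, model, tokenizer)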
                        for key, value in results.items():
                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    tb_writer.add_scalar("lr", scheduler.get_last_lr()[0], global_step)
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    checkpoint_prefix = "checkpoint"
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
                    os.makedirs(output_dir, exist_ok=True)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Take care of distributed/parallel training
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)

                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)

                    _rotate_checkpoints(args, checkpoint_prefix)

                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    logger.info("Saving optimizer and scheduler states to %s", output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    if args.local_rank in [-1, 0]:
        tb_writer.close()

    return global_step, tr_loss / global_step

# Evaluation of some model

def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, df_trn, df_val, prefix="") -> Dict:
    # Evaluate on the validation split and write the results under args.output_dir
    eval_output_dir = args.output_dir

    eval_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=True)
    os.makedirs(eval_output_dir, exist_ok=True)
    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
    # Note that DistributedSampler samples randomly

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate, drop_last = True
    )

    # multi-gpu evaluate
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Eval!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    model.eval()

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        inputs, labels = (batch, batch)
        inputs = inputs.to(args.device)
        labels = labels.to(args.device)

        with torch.no_grad():
            outputs = model(inputs, labels=labels)
            lm_loss = outputs[0]
            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss))

    result = {"perplexity": perplexity}

    output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
    with open(output_eval_file, "w") as writer:
        logger.info("***** Eval results {} *****".format(prefix))
        for key in sorted(result.keys()):
            logger.info("  %s = %s", key, str(result[key]))
            writer.write("%s = %s\n" % (key, str(result[key])))

    return result
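The metric returned by `evaluate` is the exponential of the mean per-batch loss. A minimal sketch of that calculation, using hypothetical loss values rather than numbers from this run:

```python
import math
from typing import List


def perplexity_from_losses(batch_losses: List[float]) -> float:
    """Perplexity = exp(mean cross-entropy loss), averaging over batches as evaluate() does."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)


# hypothetical per-batch losses, for illustration only
losses = [1.9, 1.8, 1.7]
ppl = perplexity_from_losses(losses)
print(round(ppl, 4))
```

Because every batch contributes equally to the mean, this is a batch-level average rather than a token-weighted one, so batches with more padding count the same as dense ones.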

In [None]:
# Main runner

def main(df_trn, df_val):
    args = Args()
    
    if args.should_continue:
        sorted_checkpoints = _sorted_checkpoints(args)
        if len(sorted_checkpoints) == 0:
            raise ValueError("Used --should_continue but no checkpoint was found in --output_dir.")
        else:
            args.model_name_or_path = sorted_checkpoints[-1]

    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
        and not args.should_continue
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
                args.output_dir
            )
        )

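    # Setup CUDA, GPU & distributed training
    # Fall back to CPU when CUDA is unavailable or args.no_cuda is set
    use_cuda = torch.cuda.is_available() and not args.no_cuda
    device = torch.device("cuda" if use_cuda else "cpu")
    args.n_gpu = torch.cuda.device_count() if use_cuda else 0
    args.device = device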

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        args.local_rank,
        device,
        args.n_gpu,
        bool(args.local_rank != -1),
        args.fp16,
    )

    # Set seed
    set_seed(args)

    config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        from_tf=False,
        config=config,
        cache_dir=args.cache_dir,
    )
    model.to(args.device)
    
    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)

        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
    if args.do_train:
        # Create output directory if needed
        os.makedirs(args.output_dir, exist_ok=True)

        logger.info("Saving model checkpoint to %s", args.output_dir)
        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        model_to_save = (
            model.module if hasattr(model, "module") else model
        )  # Take care of distributed/parallel training
        model_to_save.save_pretrained(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)

        # Good practice: save your training arguments together with the trained model
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Load a trained model and vocabulary that you have fine-tuned
        model = AutoModelForCausalLM.from_pretrained(args.output_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

            model = AutoModelForCausalLM.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, df_trn, df_val, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
            results.update(result)

    return results

## Run the Main Function

In [None]:
main(trn_df, val_df)




12/19/2021 09:09:40 - INFO - __main__ -   Training/evaluation parameters <__main__.Args object at 0x7f3c57ebeed0>
12/19/2021 09:09:40 - INFO - __main__ -   Creating features from dataset file at cached
12/19/2021 09:09:52 - INFO - __main__ -   Saving features into cached file cached/gpt2_cached_lm_512
12/19/2021 09:09:52 - INFO - __main__ -   ***** Running training *****
12/19/2021 09:09:52 - INFO - __main__ -     Num examples = 1632
12/19/2021 09:09:52 - INFO - __main__ -     Num Epochs = 4
12/19/2021 09:09:52 - INFO - __main__ -     Instantaneous batch size per GPU = 4
12/19/2021 09:09:52 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 4
12/19/2021 09:09:52 - INFO - __main__ -     Gradient Accumulation steps = 1
12/19/2021 09:09:52 - INFO - __main__ -     Total optimization steps = 1632


Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Iteration:   0%|          | 0/408 [00:00<?, ?it/s]

Iteration:   0%|          | 0/408 [00:00<?, ?it/s]

Iteration:   0%|          | 0/408 [00:00<?, ?it/s]



Iteration:   0%|          | 0/408 [00:00<?, ?it/s]

12/19/2021 09:22:47 - INFO - __main__ -    global_step = 1632, average loss = 2.12725722902984
12/19/2021 09:22:47 - INFO - __main__ -   Saving model checkpoint to output-small
12/19/2021 09:22:53 - INFO - __main__ -   Evaluate the following checkpoints: ['output-small']
12/19/2021 09:22:56 - INFO - __main__ -   Creating features from dataset file at cached
12/19/2021 09:22:58 - INFO - __main__ -   Saving features into cached file cached/gpt2_cached_lm_512
12/19/2021 09:22:58 - INFO - __main__ -   ***** Running evaluation  *****
12/19/2021 09:22:58 - INFO - __main__ -     Num examples = 182
12/19/2021 09:22:58 - INFO - __main__ -     Batch size = 4


Evaluating:   0%|          | 0/45 [00:00<?, ?it/s]

12/19/2021 09:23:04 - INFO - __main__ -   ***** Eval results  *****
12/19/2021 09:23:04 - INFO - __main__ -     perplexity = tensor(6.2756)


{'perplexity_': tensor(6.2756)}
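The step counts in the log follow directly from the settings in `Args` and the `drop_last=True` data loaders. A short sketch of the arithmetic, where the helper name `total_optimization_steps` is hypothetical:

```python
def total_optimization_steps(num_examples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Mirror the t_total computation in train() for a DataLoader with drop_last=True."""
    batches_per_epoch = num_examples // batch_size  # drop_last=True discards a partial batch
    return batches_per_epoch // grad_accum * epochs


# values taken from the training log above
print(total_optimization_steps(num_examples=1632, batch_size=4, grad_accum=1, epochs=4))

# the evaluation loader also drops the last partial batch
eval_batches = 182 // 4
print(eval_batches, 182 - eval_batches * 4)
```

The same integer division applies to evaluation: 182 validation examples at batch size 4 give 45 full batches, so the remaining 2 examples do not contribute to the reported perplexity.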

## Load the Trained Model

In [None]:
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
model = AutoModelForCausalLM.from_pretrained('output-small')



In [None]:
# Let's chat for 4 lines
for step in range(4):
    # encode the new user input, add the eos_token and return a PyTorch tensor
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
    # print(new_user_input_ids)

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response; max_length caps the total sequence (history plus reply) at 200 tokens
    chat_history_ids = model.generate(
        bot_input_ids, max_length=200,
        pad_token_id=tokenizer.eos_token_id,  
        no_repeat_ngram_size=3,       
        do_sample=True, 
        top_k=100, 
        top_p=0.7,
        temperature=0.8
    )
    
    # pretty-print only the newly generated tokens from the bot
    print("HarryPotterBot: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

>> User:hi
HarryPotterBot: Hi, Aang.
>> User:how old are you
HarryPotterBot: I'm just a teenager, and I was just a kid.
>> User:who are you
HarryPotterBot: You've been through so much already. I'm sorry. I hope you have a better day.
>> User:i see
HarryPotterBot: I'll try.
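In the loop above, `max_length=200` caps the whole sequence, prompt plus reply, so once the accumulated history approaches 200 tokens there is little room left to generate. The following is a minimal sketch of one way to keep the prompt within budget. It uses plain Python lists as stand-ins for the token-id tensors. `trim_history` and its `reserve` parameter are hypothetical names for illustration, not part of `transformers`.

```python
def trim_history(history_ids, new_ids, max_length=200, reserve=50):
    """Return the most recent token ids, leaving `reserve` tokens for the reply.

    history_ids and new_ids are lists of ints standing in for token-id tensors.
    """
    budget = max_length - reserve
    if budget <= 0:
        raise ValueError("reserve must be smaller than max_length")
    combined = history_ids + new_ids
    # Keep only the tail of the conversation so the newest turns survive.
    return combined[-budget:]


# Example: a 300-token history plus a 3-token user turn is cut to 150 tokens.
prompt = trim_history(list(range(300)), [1, 2, 3])
print(len(prompt))
```

With tensors, the same idea would be a slice along the last dimension before calling `model.generate`. Cutting at an arbitrary token may split a turn, so trimming at `eos_token` boundaries could be a gentler variant.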


## Push Model to Hugging Face

In [None]:
os.chdir('/content/')

In [None]:
!pip install huggingface_hub



In [None]:
!git config --global credential.helper store

In [None]:
!huggingface-cli login


        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/token.
        (Deprecated, will be removed in v0.3.0) To login with username and password instead, interrupt with Ctrl+C.
        
Token: 
Login successful
Your token has been saved to /root/.huggingface/token


In [None]:
!huggingface-cli repo create Dialog-GPT-small-AANG

git version 2.17.1
Error: unknown flag: --version

Sorry, no usage text found for "git-lfs"

You are about to create tprincessazula/Dialog-GPT-small-AANG
Proceed? [Y/n] Y

Your repo now lives at:
  https://huggingface.co/tprincessazula/Dialog-GPT-small-AANG

You can clone it locally with the command below, and commit/push as usual.

  git clone https://huggingface.co/tprincessazula/Dialog-GPT-small-AANG



In [None]:
!cat /root/.huggingface/token

hf_<redacted>

In [None]:
!git clone https://tprincessazula:<redacted>@huggingface.co/tprincessazula/Dialog-GPT-small-harrypotter



fatal: destination path 'Dialog-GPT-small-harrypotter' already exists and is not an empty directory.


In [None]:
!git clone https://huggingface.co/tprincessazula/Dialog-GPT-small-AANG

fatal: destination path 'Dialog-GPT-small-AANG' already exists and is not an empty directory.


In [None]:
!sudo apt-get install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.


In [None]:
!git config --global user.email "ttehilaazulay@gmail.com"
# Tip: using the same email as your huggingface.co account will link your commits to your profile
!git config --global user.name "tprincessazula"

In [None]:
!ls "/content/drive/My Drive/smallModel/output-small/"

config.json	  pytorch_model.bin	   tokenizer.json
eval_results.txt  special_tokens_map.json  training_args.bin
merges.txt	  tokenizer_config.json    vocab.json


In [None]:
!mv /content/drive/My\ Drive/smallModel/output-small/* Dialog-GPT-small-AANG/
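The `mv` above moves everything out of `output-small`, and the later `git add .` stages whatever else is in the repository directory. A more conservative variant, sketched here, copies only the files named in the `ls` output above and leaves the originals on Drive. `copy_model_files` is a hypothetical helper, not a Hugging Face API.

```python
import shutil
from pathlib import Path

# File names taken from the `ls` output of output-small above.
MODEL_FILES = [
    "config.json", "pytorch_model.bin", "tokenizer.json",
    "eval_results.txt", "special_tokens_map.json", "training_args.bin",
    "merges.txt", "tokenizer_config.json", "vocab.json",
]


def copy_model_files(src_dir, dst_dir, names=MODEL_FILES):
    """Copy only the listed files from src_dir to dst_dir; return the names copied."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        # Skip anything missing rather than failing halfway through.
        if (src / name).is_file():
            shutil.copy2(src / name, dst / name)
            copied.append(name)
    return copied
```

In this notebook it would be called as `copy_model_files("/content/drive/My Drive/smallModel/output-small", "Dialog-GPT-small-AANG")`, after which the same names can be staged explicitly instead of using `git add .`.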


In [None]:
os.chdir('Dialog-GPT-small-AANG')

In [None]:
!git lfs install

Updated git hooks.
Git LFS initialized.


In [None]:
!ls

config.json		      pytorch_model.bin        tokenizer.json
Dialog-GPT-small-harrypotter  sample_data	       training_args.bin
eval_results.txt	      special_tokens_map.json  vocab.json
merges.txt		      tokenizer_config.json


In [None]:
!pwd

/content/Dialog-GPT-small-AANG


In [None]:
!git status

On branch main
Your branch is up to date with 'origin/main'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	Dialog-GPT-small-harrypotter/
	config.json
	eval_results.txt
	merges.txt
	pytorch_model.bin
	sample_data/
	special_tokens_map.json
	tokenizer.json
	tokenizer_config.json
	training_args.bin
	vocab.json

nothing added to commit but untracked files present (use "git add" to track)


In [None]:
!git add .

hint: You've added another git repository inside your current repository.
hint: Clones of the outer repository will not contain the contents of
hint: the embedded repository and will not know how to obtain it.
hint: If you meant to add a submodule, use:
hint: 
hint: 	git submodule add <url> Dialog-GPT-small-harrypotter
hint: 
hint: If you added this path by mistake, you can remove it from the
hint: index with:
hint: 
hint: 	git rm --cached Dialog-GPT-small-harrypotter
hint: 
hint: See "git help submodule" for more information.


In [None]:
!git commit -m "initial commit"

[main a7f3bfa] initial commit
 16 files changed, 100121 insertions(+)
 create mode 160000 Dialog-GPT-small-harrypotter
 create mode 100644 config.json
 create mode 100644 eval_results.txt
 create mode 100644 merges.txt
 create mode 100644 pytorch_model.bin
 create mode 100755 sample_data/README.md
 create mode 100755 sample_data/anscombe.json
 create mode 100644 sample_data/california_housing_test.csv
 create mode 100644 sample_data/california_housing_train.csv
 create mode 100644 sample_data/mnist_test.csv
 create mode 100644 sample_data/mnist_train_small.csv
 create mode 100644 special_tokens_map.json
 create mode 100644 tokenizer.json
 create mode 100644 tokenizer_config.json
 create mode 100644 training_args.bin
 create mode 100644 vocab.json


In [None]:
!git push 

Git LFS: (2 of 2 files) 486.76 MB / 486.76 MB
Counting objects: 18, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (17/17), done.
Writing objects: 100% (18/18), 9.15 MiB | 1.83 MiB/s, done.
Total 18 (delta 3), reused 0 (delta 0)
remote: -------------------------------------------------------------------------
remote: Your push was rejected because it contains files larger than 10M.
remote: Please use https://git-lfs.github.com/ to store larger files.
remote: -------------------------------------------------------------------------
remote: Offending files:
remote:  - sample_data/mnist_test.csv (ref: refs/heads/main)
remote:  - sample_data/mnist_train_small.csv (ref: refs/heads/main)
To https://huggingface.co/tprincessazula/Dialog-GPT-small-AANG
 ! [remote rejected] main -> main (pre-receive hook declined)
error: failed to push some refs to 'https://huggingface.co/tprincessazula/Dialog-GPT-small-AANG'
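The push above was rejected because two `sample_data` CSV files exceed the remote's size limit without being tracked by Git LFS. The following is a minimal sketch for spotting such files before committing. `find_large_files` is a hypothetical helper, and reading the "10M" limit as 10 MiB is an assumption.

```python
import os

SIZE_LIMIT = 10 * 1024 * 1024  # assumption: the remote's "10M" means 10 MiB


def find_large_files(root, limit=SIZE_LIMIT):
    """Return sorted (relative_path, size) pairs for files under root above limit bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git metadata.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            size = os.path.getsize(path)
            if size > limit:
                large.append((os.path.relpath(path, root), size))
    return sorted(large)
```

Running `find_large_files(".")` from the repository directory lists candidates to either track with Git LFS or leave out of the commit. Here the `sample_data` directory and the nested `Dialog-GPT-small-harrypotter` clone are unrelated to the model and could be left out entirely.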


## All Done!