# Install simpletransformers Python package

Simpletransformers is a package that uses PyTorch to finetune models such as GPT-2. It's mostly plug and play.

Installing it will upgrade Colab's IPython version, so the Colab instance needs to be restarted afterwards (Runtime menu > Restart runtime, or Ctrl-M then .). Not sure how to get around that at the moment, sorry.


In [None]:
# Install simpletransformers
!python -m pip install git+https://github.com/zacc/simpletransformers.git@01ed37e471234ec3266fda2101ce61f4e88e47bb

# Output which type of GPU Colab has bestowed upon us
!nvidia-smi

# Set filename of finetuning data files
Set the variables below to the names of your training and evaluation text files.

There are two data files: the training file and the evaluation file. GPT-2 is fine-tuned on the training file, while simpletransformers uses the evaluation file as a held-out control sample to measure how the fine-tuning is progressing.

Generally this is done by taking your training data and splitting it into parts of 90%/10%; 90% for the training data and 10% for the evaluation data.
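
A minimal sketch of such a split, assuming your data is a plain text file with one sample per line (if your samples span several lines, you would need to split on your own delimiter instead). The helper names `split_train_eval` and `write_split`, the fixed seed and the shuffling are my own assumptions for illustration, not part of simpletransformers.

```python
import random


def split_train_eval(samples, eval_fraction=0.1, seed=42):
    """Shuffle the samples and return (train, eval) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one sample for evaluation
    eval_count = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[eval_count:], shuffled[:eval_count]


def write_split(source_path, train_path, eval_path, eval_fraction=0.1):
    """Read one sample per line and write the train and eval files."""
    with open(source_path, encoding="utf-8") as f:
        # Drop blank lines and trailing newlines
        samples = [line.rstrip("\n") for line in f if line.strip()]
    train, evaluation = split_train_eval(samples, eval_fraction)
    with open(train_path, "w", encoding="utf-8") as f:
        f.write("\n".join(train) + "\n")
    with open(eval_path, "w", encoding="utf-8") as f:
        f.write("\n".join(evaluation) + "\n")
```

You would run `write_split` once on your own machine and then upload the two resulting files.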

Copy these data files to the root directory of your Google Drive; the notebook reads them from there.

In [None]:
training_file = "bot_27102020_train.txt"
eval_file = "bot_27102020_eval.txt"

# Mount Google Drive
Google Colab may close your GPU session early during peak hours, or after the maximum time limit (about 10-12 hours). If this happens you may lose the model you have finetuned. Saving the model directly to Google Drive means it is kept as we finetune it.

This step mounts your Google Drive.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Copy training files from Google Drive
Before running this step, upload the training files directly to your Google Drive.
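
If a file is missing, the copy step below will just fail with a `cp` error, so a quick check beforehand can help. This is only a sketch: `missing_files` is a hypothetical helper of mine, not something provided by Colab or simpletransformers.

```python
from pathlib import Path


def missing_files(directory, filenames):
    """Return the filenames that are not present in the directory."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]
```

For example, `missing_files('/content/drive/My Drive', [training_file, eval_file])` should return an empty list once the Drive is mounted and both files have been uploaded.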

In [None]:
full_path = '/content/drive/My Drive/' + training_file
!cp "$full_path" "/content"

full_path = '/content/drive/My Drive/' + eval_file
!cp "$full_path" "/content"

# Set a label for this finetuning task
Set a label for this finetuning task; it is also used as the name of the directory the model is saved in. It helps to use something distinctive.

I like to use `bot_ddmmyyyy` as a crude form of version control.
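
If you would rather not type the date by hand, a label in that format can be built from today's date. A small sketch, where `make_bot_label` is just an illustrative name:

```python
from datetime import date


def make_bot_label(day=None):
    """Build a label such as 'bot_27102020' (bot_ddmmyyyy) from a date."""
    day = day or date.today()
    return day.strftime("bot_%d%m%Y")
```

When resuming an interrupted run on a later day, set the label by hand to the original value instead, otherwise the existing best_model for that label will not be found.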

In [None]:
bot_label = 'bot_27102020'

# Start the finetuning process

Unfortunately Google have recently stopped serving Tesla T4 GPUs on the free Colab. You will likely get a K80 GPU which is a lot slower for fine tuning GPT-2.

The code below will adapt for K80 or Tesla T4.

On the K80 GPU and with a training data text file of ~10-12 MB, it will train for around 6 epochs (loops over all the training data). Each epoch will take 60-80 minutes, so it is likely you will get kicked out of Colab before finishing the training.
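
As a rough back-of-the-envelope sketch of those numbers (the helper names are made up for illustration): the first function adds up the epoch times quoted above, and the second shows the effective batch size for the K80 settings in the training cell (batch size 1 with 100 gradient accumulation steps), assuming gradients from each accumulated batch contribute to one optimizer update.

```python
def estimated_training_hours(epochs, minutes_per_epoch):
    """Total wall-clock training time in hours."""
    return epochs * minutes_per_epoch / 60


def effective_batch_size(train_batch_size, gradient_accumulation_steps):
    """Number of samples contributing to each optimizer update."""
    return train_batch_size * gradient_accumulation_steps


# Figures quoted above: 6 epochs at 60-80 minutes each
print(estimated_training_hours(6, 60), estimated_training_hours(6, 80))  # 6.0 8.0

# K80 settings in the training cell: batch size 1, 100 accumulation steps
print(effective_batch_size(1, 100))  # 100
```

So a full K80 run needs roughly 6-8 hours of uninterrupted GPU time.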

If you have been kicked off before finishing all epochs, open a new Colab session when you can and run all these steps again. The final step will attempt to detect the existing best_model and resume training from that point. It will restart the epochs from 0 so you might need to keep track of exactly how many have been done, or keep track of the eval_loss and stop when it starts rising.

The score at each training save point will be saved in a file called training_progress_scores.csv. The save point with the lowest eval_loss is the best one to use. After that, the eval_loss will begin to increase and the model will start to be over-trained/over-fit.
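
To check the scores without opening the file by hand, something like the sketch below could be used. It assumes the CSV has a header row with an `eval_loss` column (and, in the usage note, a `global_step` column); I believe that is what simpletransformers writes, but check the header of your own file. `best_save_point` and `loss_is_rising` are hypothetical helper names.

```python
import csv


def best_save_point(csv_path, loss_column="eval_loss"):
    """Return the row (as a dict of strings) with the lowest loss."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        raise ValueError(f"no rows found in {csv_path}")
    return min(rows, key=lambda row: float(row[loss_column]))


def loss_is_rising(losses, patience=2):
    """True if each of the last `patience` evaluations was worse than the one before."""
    if len(losses) <= patience:
        return False
    recent = losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

For example, `best_save_point(f"{bot_label}/training_progress_scores.csv")["global_step"]` would give the step of the best save point, assuming the file sits in the output directory and has a `global_step` column.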

The save point with the lowest eval_loss will be automatically saved in the best_model/ folder. You should download and use this one for your chatbot.

More training is not necessarily better as it can lead to overfitting.


In [None]:
from simpletransformers.language_modeling import LanguageModelingModel
import torch
import os


# Switch to the Google Drive directory
%cd "/content/drive/My Drive/"

args = {
    "overwrite_output_dir": True,
    "learning_rate": 1e-4,
    # Number of batches whose gradients are accumulated before each
    # optimizer update (overridden below for the K80)
    "gradient_accumulation_steps": 1,

    # Use text because of grouping by reddit submission
    "dataset_type": "simple",
    # Sliding window will help it manage very long bits of text in memory
    "sliding_window": True,
    "max_seq_length": 512,

    "mlm": False,  # has to be False for GPT-2

    "evaluate_during_training": True,
    # default 2000, will save by default at this step.
    # "evaluate_during_training_steps": 2000,
    "use_cached_eval_features": True,
    "evaluate_during_training_verbose": True,

    # don't save optimizer and scheduler we don't need it
    "save_optimizer_and_scheduler": False,
    # Save disk space by only saving on checkpoints
    "save_eval_checkpoints": True,
    "save_model_every_epoch": False,
    # disable saving each step to save disk space
    "save_steps": -1,

    "output_dir": f"{bot_label}/",
    "best_model_dir": f"{bot_label}/best_model",
}

if 'K80' in torch.cuda.get_device_name(0):
  # Most of the time we'll only get a K80 on free Colab
  args['train_batch_size'] = 1
  # Need to train for multiple epochs because of the small batch size
  args['num_train_epochs'] = 6
  args["gradient_accumulation_steps"] = 100
  # Save every 3000 to conserve disk space
  args["evaluate_during_training_steps"] = int(3000 / args["gradient_accumulation_steps"])

elif 'T4' in torch.cuda.get_device_name(0):
  # You may get a T4 if you're using Colab Pro
  # larger batch sizes will use more training data but consume more ram
  args['train_batch_size'] = 8
  # On Tesla t4 we can train for steps rather than epochs because of the batch size
  args["max_steps"] = 12000
  # Evaluate (and save a checkpoint) every 3000 steps.
  # Note: no trailing comma here, otherwise the value becomes a tuple.
  args["evaluate_during_training_steps"] = 3000

# Check to see if a model already exists for this bot_label
resume_training_path = f"/content/drive/MyDrive/{bot_label}/best_model/"

if os.path.exists(resume_training_path):
    # A model path already exists, so we'll attempt to resume training starting from the previous best_model.
    args['output_dir'] = resume_training_path
    # resume_training_path already ends with a slash
    args['best_model_dir'] = f"{resume_training_path}resume_best_model/"
    model = LanguageModelingModel("gpt2", resume_training_path, args=args)

else:
  # Create a new model
  model = LanguageModelingModel("gpt2", "gpt2")

model.train_model(train_file=training_file, eval_file=eval_file, args=args, verbose=True)

## Complete!
The model is now finetuned.
Go back to your Google Drive, download the model (the `best_model` folder) and unzip it into the `models` folder in the ssi-bot project.