## Fine Tuning GPT-2 on text from children's stories.

Notebook for fine-tuning GPT-2. This notebook is inspired by this blog post,
https://towardsdatascience.com/film-script-generation-with-gpt-2-58601b00d371, which I found very helpful. I also,
of course, did a lot of reading of the PyTorch docs.

The main library used is the HuggingFace transformers library (https://huggingface.co/transformers/). They strive to make using transformer models dead simple and have a lot of helper functions and simple APIs. That wasn't my goal, though: I wanted to understand the details of what I was doing. So I made this harder than it probably needed to be by working through the fine-tuning process in "raw" PyTorch. In the end, while I used the tokenizer and model from the HuggingFace library, I wrote the preprocessing, Dataset class and training loop myself. This notebook is very step by step, with lots of comments about why I'm doing what I'm doing.

Note that the utilities for text pre-processing (tokenization) are defined in */src/text_tokenization_utils.py*.

The pre-processed text is from Project Gutenberg (Brothers Grimm and Beatrix Potter). It's in the */textdata/* directory.



In [1]:
# import packages

import os, time
import pickle
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, PreTrainedTokenizer, GPT2LMHeadModel
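# AdamW now comes from torch: the transformers copy is deprecated and missing from recent releases.
# torch's default weight_decay is 0.01 (the transformers default was 0.0); pass weight_decay=0.0 to match the old behavior.
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup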

# The code for tokenizing the text and storing it in files is imported here
import sys
sys.path.append("../src") 
from text_tokenization_utils import make_tokenized_examples

## Tokenize the raw text.

I got raw text from the Project Gutenberg site (https://www.gutenberg.org/). It has a lot of books that are old enough to be in the public domain. I didn't write any web-scraping code because this was a toy project and it was quicker to literally cut and paste some text and manually clean it a little bit ... very old school.

This tokenization only needs to be done once. The tokenization wrapper function *make_tokenized_examples* will read in the text and then save the tokenized files. It takes care of making the subdirectory tree for you.

See the docstring for *make_tokenized_examples* for the correct directory structure needed. You can also simply look in the */textdata* directory of the repo.
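
As a rough illustration of what "splitting tokens into examples" means, here is a minimal sketch. `split_into_blocks` is a hypothetical helper, not the code in *text_tokenization_utils.py*, and dropping the short tail is an assumption; the real function may treat the remainder differently. The token ids are stand-ins so the sketch runs without downloading a tokenizer.

```python
from typing import List


def split_into_blocks(token_ids: List[int], block_size: int = 256) -> List[List[int]]:
    """Cut a flat list of token ids into consecutive blocks of exactly block_size ids.

    Hypothetical helper for illustration only. Any tail shorter than
    block_size is dropped (an assumption, not confirmed by the repo).
    """
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Stand-in ids; in the notebook these would come from the GPT-2 tokenizer
fake_ids = list(range(1000))
blocks = split_into_blocks(fake_ids, block_size=256)
print(len(blocks), len(blocks[0]))
```

Every example then has the same length, which is what lets the Dataset below return fixed-size tensors.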

In [2]:
# relative path to the root directory of the text data. change this if you put the text data somewhere else.
PWD_TextData_Root = "../textdata/"

# these are the names of authors for which I provide data. Each author has their own subdirectory
authors = ["Grimm", # Brothers Grimm fairy tales
           "Beatrix_Potter", # Peter Rabbit and other stories
           "Lewis_Carrol"] # Alice in Wonderland and Through the Looking-Glass

In [3]:
# import the tokenizer from the Transformers library
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [5]:
# and loop over the authors to make the tokenized examples
# note these will be saved locally in a *tokenized_data* directory as .pkl files. The tokenized files are not 
# in the github repository because there is a *.pkl in the .gitignore file

for author in authors: 
    file_path = os.path.join(PWD_TextData_Root, author)
    print(file_path)
    make_tokenized_examples(gpt2_tokenizer, 256, file_path, examples_file=None)

../textdata/Grimm
../textdata/Grimm
('../textdata', 'Grimm')
../textdata/Grimm/raw_text/The_Traveling_Musicians.txt successfully read and tokenized
1893
Successfully split tokens from file ../textdata/Grimm/raw_text/The_Traveling_Musicians.txt into examples
../textdata/Grimm/raw_text/The_Elves_And_The_Shoemaker.txt successfully read and tokenized
957
Successfully split tokens from file ../textdata/Grimm/raw_text/The_Elves_And_The_Shoemaker.txt into examples
../textdata/Grimm/raw_text/The_Turnip.txt successfully read and tokenized
1584
Successfully split tokens from file ../textdata/Grimm/raw_text/The_Turnip.txt into examples
../textdata/Grimm/raw_text/Sweetheart_Roland.txt successfully read and tokenized
2015
Successfully split tokens from file ../textdata/Grimm/raw_text/Sweetheart_Roland.txt into examples
../textdata/Grimm/raw_text/King_Grisly_Beard.txt successfully read and tokenized
2170
Successfully split tokens from file ../textdata/Grimm/raw_text/King_Grisly_Beard.txt into exampl

../textdata/Beatrix_Potter/raw_text/flopsy_bunnies.txt successfully read and tokenized
1537
Successfully split tokens from file ../textdata/Beatrix_Potter/raw_text/flopsy_bunnies.txt into examples
../textdata/Beatrix_Potter/raw_text/pigling_bland.txt successfully read and tokenized
5500
Successfully split tokens from file ../textdata/Beatrix_Potter/raw_text/pigling_bland.txt into examples
../textdata/Beatrix_Potter/raw_text/jemima_puddle_duck.txt successfully read and tokenized
1941
Successfully split tokens from file ../textdata/Beatrix_Potter/raw_text/jemima_puddle_duck.txt into examples
../textdata/Beatrix_Potter/raw_text/tailor_of_gloucester.txt successfully read and tokenized
3267
Successfully split tokens from file ../textdata/Beatrix_Potter/raw_text/tailor_of_gloucester.txt into examples
../textdata/Beatrix_Potter/raw_text/jeremy_fisher.txt successfully read and tokenized
1194
Successfully split tokens from file ../textdata/Beatrix_Potter/raw_text/jeremy_fisher.txt into examples

## Make the Dataset and Dataloader

This is what PyTorch will use to supply data to the training loop.

In [6]:
class StoryData(torch.utils.data.Dataset):
    '''This is a class for loading in a list of tokenized gpt2 examples from a list of file paths'''

    def __init__(
            self,
            file_paths: list):

        for fpath in file_paths:
            assert os.path.isfile(fpath), "{} does not exist".format(fpath)

        self.examples = []

        for fpath in file_paths:
            with open(fpath, 'rb') as f:
                examps = pickle.load(f)
            self.examples.extend(examps)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)
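
To check this kind of Dataset without the real data, the sketch below writes a small fake pickle file and loads it. `PickledExamples` is a trimmed stand-in for `StoryData`, and the file name and token values are made up for the demonstration.

```python
import os
import pickle
import tempfile

import torch
from torch.utils.data import Dataset, DataLoader


class PickledExamples(Dataset):
    """Trimmed stand-in for StoryData: loads lists of token ids from pickle files."""

    def __init__(self, file_paths):
        self.examples = []
        for fpath in file_paths:
            with open(fpath, "rb") as f:
                self.examples.extend(pickle.load(f))

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)


# Write three fake examples of 256 made-up token ids each
fake_examples = [[i] * 256 for i in range(3)]
with tempfile.TemporaryDirectory() as tmpdir:
    fake_path = os.path.join(tmpdir, "fake_examples.pkl")
    with open(fake_path, "wb") as f:
        pickle.dump(fake_examples, f)
    dataset = PickledExamples([fake_path])  # everything is read into memory here

loader = DataLoader(dataset, batch_size=1, shuffle=True)
first_batch = next(iter(loader))
print(len(dataset), first_batch.shape, first_batch.dtype)
```

A DataLoader with a batch size of 1 adds a leading batch dimension, so each item arrives as a `(1, 256)` tensor of `int64` token ids.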

In [34]:
#  Now choose which Authors we want to use, comment out lines of the author_list below.
# Switching authors will change the nature of the generated text after training.

author_list = [
    #"Grimm",
    #"Lewis_Carrol",
    "Beatrix_Potter"
]

# you will need to modify the paths a bit if you choose another block size
file_paths = []
for author in author_list:
  file_paths.append(os.path.join(PWD_TextData_Root, 
                                 author, 
                                 "tokenized_examples/examples_gpt2_blocksize_256_" + author + ".pkl"))
print(file_paths)

['../textdata/Beatrix_Potter/tokenized_examples/examples_gpt2_blocksize_256_Beatrix_Potter.pkl']
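
If you experiment with other block sizes, a small helper can build the path instead of editing the string by hand. `examples_path` is a hypothetical function, and it assumes that files for other block sizes follow the same naming pattern as the 256 one above; check the *tokenized_examples* directory to confirm.

```python
import os


def examples_path(root: str, author: str, block_size: int = 256) -> str:
    """Build the path to an author's tokenized examples file (hypothetical helper).

    Assumes the file name pattern examples_gpt2_blocksize_<size>_<author>.pkl.
    """
    filename = "examples_gpt2_blocksize_{}_{}.pkl".format(block_size, author)
    return os.path.join(root, author, "tokenized_examples", filename)


default_path = examples_path("../textdata/", "Beatrix_Potter")
smaller_path = examples_path("../textdata/", "Grimm", block_size=128)
print(default_path)
print(smaller_path)
```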


In [35]:
# make the actual instance of the Dataset class using the chosen authors 
story_dataset = StoryData(file_paths)

# make the data loader
# NOTE: the batch_size is 1 because we will be doing gradient accumulation. This is to get around the fact
# that I am using an 8GB RTX 2070 Super GPU, which is small
story_dataloader = DataLoader(story_dataset, batch_size=1, shuffle=True)

## Get ready to fit the model

We will choose a pre-trained model from the HuggingFace model hub to fine tune, pick some hyperparameters, make an optimizer and also a learning rate scheduler. 

In [43]:
# Pick a model to train

# distilgpt2 will train on my 8GB GPU
model = GPT2LMHeadModel.from_pretrained('distilgpt2')

# gpt2-medium will not train on an 8GB GPU ... but you can generate text with the pre-trained model if you like.
# if you have a larger GPU, say a 24 GB RTX 3090 you may wish to try training though
# model = GPT2LMHeadModel.from_pretrained('gpt2-medium')

# NOTE: look into gradient checkpointing and see if that will allow for training within 8GB

In [44]:
# set epochs and batch size
N_EPOCHS = 5
BATCH_SIZE = 8 # note we are actually going to use gradient accumulation because these models are so big

# for the scheduler
LEARNING_RATE = 0.0001 #0.00002
WARMUP_STEPS = 100 # 10000
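
Because the DataLoader hands over one example at a time and the optimizer only steps once per BATCH_SIZE examples, the number of optimizer (and scheduler) steps is much smaller than the number of examples seen. A quick sketch of the arithmetic, with a made-up dataset size and a hypothetical helper name:

```python
def count_optimizer_steps(n_examples: int, n_epochs: int, accumulation: int) -> int:
    """Optimizer steps taken when stepping once every `accumulation` examples.

    Assumes the example counter carries over from one epoch to the next
    instead of resetting at each epoch boundary.
    """
    return (n_examples * n_epochs) // accumulation


# 100 examples is a made-up dataset size for illustration
steps = count_optimizer_steps(n_examples=100, n_epochs=5, accumulation=8)
print(steps)  # 62
```

This is worth comparing against WARMUP_STEPS: if the total is smaller than the warmup length, the learning rate never reaches LEARNING_RATE.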

In [45]:
# put the model on the gpu. note if this doesn't say you're using the gpu this will not train!

device = 'cpu'
if torch.cuda.is_available():
    print('using gpu')
    device = 'cuda'
print(device)

model.to(device)

# create optimizer
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

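# make scheduler (for varying learning rate over time)
# The optimizer steps once every BATCH_SIZE examples, so that is also the number of scheduler steps.
# num_training_steps should be the real total: a non-positive value makes the linear schedule
# return a learning rate of 0 once warmup ends.
num_training_steps = (N_EPOCHS * len(story_dataloader)) // BATCH_SIZE
scheduler = get_linear_schedule_with_warmup(optimizer=optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=num_training_steps)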


using gpu
cuda
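
The shape of a linear warmup schedule is easier to see with numbers. The function below is a pure-Python restatement of "ramp up linearly, then decay linearly to zero", which is the behavior `get_linear_schedule_with_warmup` is documented to have. It is a sketch rather than the library's code, and the step counts are made up.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Learning-rate multiplier: 0 -> 1 over warmup_steps, then linearly back to 0 at total_steps."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# made-up schedule: 100 warmup steps out of 1000 total
multipliers = {
    step: linear_warmup_decay(step, warmup_steps=100, total_steps=1000)
    for step in (0, 50, 100, 550, 1000)
}
for step, mult in multipliers.items():
    print(step, mult)
```

The multiplier scales LEARNING_RATE, so the peak rate is only reached at the end of warmup, and the rate falls to zero at the declared total number of steps.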


## Before fine tuning the model, let's generate some text from it to see what it produces.

You can run this and generate your own text from the pre-trained model. 

In [39]:
# make a function to generate text from the model

def generate_test_text(model, max_length=256, input_text=None):
    model.eval() # put the model in eval mode
    if input_text is None:
        input_text = "Once upon a time there was a little mouse."
    input_ids = gpt2_tokenizer.encode(input_text, return_tensors='pt')
    input_ids = input_ids.to(model.device)  # follow the model's device so this also works on CPU
    output_ids = model.generate(input_ids, 
                                pad_token_id=gpt2_tokenizer.eos_token_id,
                                max_length=max_length, 
                                do_sample=True, 
                                top_p=0.95, 
                                top_k=60,
                                num_return_sequences=1)

    output_text = gpt2_tokenizer.decode(output_ids[0])
    return output_text
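
The `top_k` and `top_p` arguments control which tokens are eligible at each sampling step. Here is a simplified sketch of the idea on a toy distribution. `top_k_top_p_filter` is a hypothetical function written for this illustration; the library works on logits over the whole vocabulary and its exact details may differ.

```python
import random


def top_k_top_p_filter(probs, top_k=60, top_p=0.95):
    """Return renormalised probabilities after top-k, then nucleus (top-p) filtering.

    probs maps token -> probability. Simplified illustration, not the library's code.
    """
    # 1. keep only the top_k most probable tokens and renormalise
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    ranked = [(token, p / mass) for token, p in ranked]

    # 2. keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # 3. renormalise what is left so it is a proper distribution again
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# made-up next-token distribution
toy = {"mouse": 0.5, "rabbit": 0.3, "duck": 0.15, "turtle": 0.04, "spaceship": 0.01}
filtered = top_k_top_p_filter(toy, top_k=4, top_p=0.9)
print(filtered)

# sample one token from the filtered distribution
next_token = random.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(next_token)
```

Lower values of either parameter cut off more of the unlikely tail, which makes the generated text more predictable; higher values let more surprising tokens through.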

In [40]:
# and do some generation. The output should be different every time you run this cell 
# and of all sorts of topics and styles
prompt ="once upon a time there was a little mouse" 
print(generate_test_text(model,input_text=prompt, max_length=256))

once upon a time there was a little mouse and a bit of a giant turtle. A lot of people thought that was just a joke. I am not going to write about it again. It was not always a joke. However, it was a very common joke in most cases.
The first time I saw it in a movie, it looked like a character wearing a red jumpsuit. I thought it was strange. I am not actually sure. But it was something that many people would love to see happen and it could have been a different character, but I really wasn't looking forward to seeing it being used for a movie. It was a really big part of my life, because I was really hungry and wanted to see a scene. But even as I sat down and did a story, I felt that the character I really wanted to see was quite different from the stereotypical character.
It was a true, very rare thing.
The first thing I noticed about it was that there was no big black silhouette at all. I was like I never thought that, in my world, could I imagine the scene where a giant turtle is

### A couple examples

You can see that the style is that of general text you'd read in a book, and the content varies widely across independent generations.

**Example 1**

once upon a time there was a little mouse in the sky and I looked at you from here, and you saw the thing which was really not a bird, I had always seen a bird and all those things but something with the mouse there was something. This was something that could be fixed and you just looked at what you were saying. It had a lot of little fish in it, and I knew this to be very precise. The bird had an eye on its mouth and its mouth were almost like the size of a tooth and not a long blade, and it looked very good and very well-rounded.

So what has your new favourite piece of bird writing in The Sims?
The author's favourite piece of bird writing in The Sims is the paper bird - and I would like to start by saying that the only way to do that with the new one is by putting this one on. So you're talking about one of your favourite words, which is a line from the Sim guide. How has it been in your life since The Sims first started?
I didn't really think it would be that easy, but there were the other things to try and do. I was really curious for how it would be if you could get yourself into this and then

**Example 2**

once upon a time there was a little mouse on the road and an idea for an alternate timeline of the future we knew we were all going to be fighting for.
The idea was made for an alternate timeline of a future where humans, and of course mankind, had different beliefs regarding the future, and the world was not based on the current order, but on the present and future world. The future was a great place and it was clear that there was something that was going to be found on Earth...
The future changed completely but it was far from new. It wasn't until a long time ago that everyone in the future had different beliefs about the future, and this made it even more important. However, I do realize some thought it might not seem like the entire universe had to be built on Earth.
This was just one of the things I was so excited to work with.
I felt that I was getting the best of everything from the world to the planet... but, to this day, I'm having my second go at it.
The world in this scenario was made up of a world that is mostly earth like ours, inhabited by people who could relate to the history of the galaxy and the past. One such thing happened when I met an

## Fine tune the model

This is the training loop. It shouldn't take too long to run ... a few minutes or so, depending on your GPU.


In [41]:
# This is the link to the gradient accumulation documentation
# https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-accumulation

#TODO: look into gradient checkpointing as well

epoch_loss = 0.0 # used to track loss for each epoch

internal_batch_count = 0 # used to track number of examples within each batch. This is necessary 
                         # because of the gradient accumulation hack (to deal with my 8GB GPU memory)
    
# make FP16 scaler for faster training
scaler = torch.cuda.amp.GradScaler()
    
# put the model into training mode
model.train()

for epoch in range(N_EPOCHS): # iterate over epochs
    
    print("started epoch {}".format(epoch))
    
    for idx, text in enumerate(story_dataloader):  # this data loader is set up to shuffle automatically
        
        # Do the forward propagation. 
        # Don't forget to put the text onto the gpu.
        # you have to put in labels to get the loss as an output.
        # because GPT is an autoregressive model the input is the output for training purposes
        
        with torch.cuda.amp.autocast():
            outputs = model(text.to(device), labels=(text.to(device)))
        
            # get the loss out so we can do backwards propagation
            loss, logits = outputs[:2]
            loss = loss / BATCH_SIZE 
        
        # do backpropagation. yay autodifferentiation!
        # note the use of the scaler for the FP16 
        scaler.scale(loss).backward()
        
        # keep track of the loss
        epoch_loss = epoch_loss + loss.detach().cpu().numpy()  # need to detach the gradients 
                                                               # because we only care about the numerical value
                                                               # also store the epoch loss on the cpu as numpy
            
        # increment the internal_batch_count
        internal_batch_count = internal_batch_count + 1
        
        # Now, if we have run through a full batch, take some optimizer and gradient steps
        if internal_batch_count == BATCH_SIZE:
            internal_batch_count = 0 # reset this
            
            # take an optimizer step. note the use of the scaler for FP16
            scaler.step(optimizer) 
            scaler.update()
            optimizer.zero_grad() # zero out the gradients in the optimizer
            
            model.zero_grad() # zero out the gradients we've been accumulating in the model
            
            scheduler.step() # take a scheduler step
            
            
    # Now that we've gone through an epoch, let's see what the loss is and what some generated text looks like
    
    # put the model into evaluation mode
    model.eval()
    
    # print the loss
    print("Epoch {} has loss {}".format(epoch, epoch_loss))
    # reset the loss
    epoch_loss = 0.0
    
    # uncomment this if you want to print some test text after each epoch
    #print(generate_test_text(model,input_text=prompt))
    
    
    # put the model back in training mode
    model.train()
        
        
        
    

started epoch 0
Epoch 0 has loss 174.24035161733627
started epoch 1
Epoch 1 has loss 156.06850409507751
started epoch 2
Epoch 2 has loss 141.5551725924015
started epoch 3
Epoch 3 has loss 139.72589495778084
started epoch 4
Epoch 4 has loss 139.8133194744587
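
The reason dividing each loss by BATCH_SIZE works can be checked on a toy model. The sketch below compares the gradient from one real batch of eight examples with the gradient accumulated over eight batches of one. The linear model and random data are made up for the check and have nothing to do with GPT-2.

```python
import torch

ACCUM = 8  # plays the role of BATCH_SIZE in the training loop

torch.manual_seed(0)
x = torch.randn(ACCUM, 4)  # eight made-up examples with four features each
y = torch.randn(ACCUM, 1)


def make_model():
    torch.manual_seed(1)  # same initial weights every time
    return torch.nn.Linear(4, 1)


# (a) one real batch of eight, using the mean loss
model_a = make_model()
torch.nn.functional.mse_loss(model_a(x), y).backward()

# (b) eight batches of one, each loss divided by ACCUM, gradients accumulated
model_b = make_model()
for i in range(ACCUM):
    loss = torch.nn.functional.mse_loss(model_b(x[i:i + 1]), y[i:i + 1]) / ACCUM
    loss.backward()  # backward() adds to .grad until the gradients are zeroed

grads_match = torch.allclose(model_a.weight.grad, model_b.weight.grad, atol=1e-6)
print(grads_match)
```

The same reasoning carries over to the loop above as long as every example has the same number of tokens, which is the case here with fixed blocks of 256.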


## OK now generate some text!  

That's pretty much it for fine-tuning. You can now play around and generate text from the trained model. The text is generally grammatically correct but doesn't hang together as a story and is pretty rambling. distilgpt2 is a rather small language model; text generated from larger language models tends to be much more fluent.

What is clear, however, is that fine-tuning shifts the output toward the style of the authors whose text we tune on. Beatrix Potter and the Brothers Grimm are very different after all! Further below I give some example generated text that demonstrates the differences resulting from fine-tuning on different authors, but you can of course generate your own text using the next cell.

In [42]:
# Pick a prompt
prompt ="Once upon a time there was a little mouse" 

# And generate text!
print(generate_test_text(model,input_text=prompt, max_length=256))

Once upon a time there was a little mouse, and half an inch in the middle of the
chandelier, and half the pie-cups were covered in bacon.

In all those days, bacon was also a very hot dish.

It did not last long; in the late spring of 1791, there was a
little bumble-basket of potatoes and carrots.

In the spring of 1792,
there was a basket full of mice and cats;
when the mice were in turn, they turned out to be a
cat-basket.

So the little mouse kittens had
been invited to bed, so that the mice, too, could smell their
dinner.

On the inside of the rabbit hole, there were four mice in a
yellow casket; three kittens in a yellow casket, one in a blue
casket; and the other two kittens.

Now that there is some respect to the rabbits in the
garden-house, the rabbits came out of it with a
large basket of cabbage and white jam.

They were all very upset, and so they set
a little trap.

When the mouse saw the hole, it


### Here are two examples of text generated after fine tuning on Beatrix Potter stories:

Note that while the text isn't very sensible, it matches the general structure and linguistic style of Beatrix Potter stories: lots of short sentences separated onto different lines. It's also pretty lighthearted.

**Example 1**

Once upon a time there was a little mouse on the edge of a wood-row, and a young
mouse!

"I shall see; I wish ye had
never seen another mouse!" said
Pumpkin.

Then there was a very fine table-cloth covered with
dry dishes and very little chairs!

"A beautiful table-cloth," said a big lump-handling gentleman
who has been at the table, with a nice brown
and a very fine wooden plate.

"I am sorry," said Mr. Pickles. "
Pumpkin
said that he made the meal which I thought
was a fat ham!

"What a nice little ham!" said
Mr. Pickles.

Pumpkin looked out upon the table, and came upon the
dark wood-row, with a little knife and
mouse.

"It is quite a hard mouse; it should be too cold
if you will. It is quite hard; it is really a
little mouse, and is very strong; and
hard"
should we have eaten with it?" said Mr. Pickles, to
pumpkin; "I will have eaten it," said Pick


**Example 2**

Once upon a time there was a little mouse who could not quite get out. When they met that little person, they bumped into a hole in the wall, where it was very large
in a little pottery basket, and began pulling out wood from beneath.

One day something
dishfuls and a nut had been broken into.

"Who will run in my day?" said a mouse, looking for
another mouse. They had been at a party and got in there; he
could only wait until his day came round.

"Come into the garden in the morning!"

The mice took notice, but they
felt much too early to hide it.

"This is a little fish, called
Tobit!"

The mouse could see a hole in the
door, and saw something big
in a hole, and they came out of nowhere.

"No one wants to go in here," said
"Old Mr. Kitten!"


There came a little mouse whose life was almost complete without him; the mouse
dishful was quite empty and
very little
on account of the rats. It looked rather beautiful, but
there was no one who could see



### And some examples from fine-tuning on the Brothers Grimm

These have a very different, paragraph-style structure ... and they are pretty dark, as we'd expect from the Brothers Grimm.

**Grimm Example 1**

Once upon a time there was a little mouse, and there
was some dwarf in his room that had been holding it. ‘If it could stay, then it would just
be enough to eat and eat with,’ said she, and the creature said to the dwarf,
 ‘if you will, I would just eat in a second, or I would
eat your meat or something in my kitchen.’

So the dwarf went, and the dwarf fell on top of him, and the dwarf
was knocked off his head, and the dwarf’s head flew on the ground
while the dwarf lay in a heap of ashes.
The dwarf went about this, and the dwarf’s head flew in
heathers, and at last he fell backwards and he had fallen down upon
his face, and now the dwarf went round the castle, and the dwarf
fell a little, and then the dwarf fell with him to the fire. The dwarf said,
‘Why not? How come back down on the stairs again? When you
re on the stair and it seems as if you can have a beautiful old tower.’ So the dwarf threw himself
down and fell back, and was forced

**Grimm Example 2**

Once upon a time there was a little mouse at the bottom and he would cut a hole
into one and lay down in it, and when they were tired, they lay
still on the ground under the tree and with their feet beneath it,
and as soon as they were tired, they said: ‘Don’t let us kill you.’ They
said: ‘Take the mouse that we are, or the mouse that’s cut in half. I’ll cut my mouse,’
and he will cut my foot in half to a length, and with the same head cut in
half, then the mouse will cut it into the hole, and
he will cut the other one in half.’

When the mice came in the hole, they found a young man lying still at the top of a tree
in which was standing, with a little boy lying lying dead. The
young man was lying down in his sack with his little son lying
down on the ground, and the old man lying lying on the floor
sitting in a dead wood bench. The old man’s face
disturbed, but there the old man was crying and
snorting and crying