**Links:**

*   https://towardsdatascience.com/teaching-gpt-2-a-sense-of-humor-fine-tuning-large-transformer-models-on-a-single-gpu-in-pytorch-59e8cec40912
*   https://gist.github.com/mf1024/3df214d2f17f3dcc56450ddf0d5a4cd7#file-fine-tuning-gpt2-medium-in-pytorch-ipynb




# Generating text with a pre-trained GPT2 in PyTorch

This notebook was created as a part of a blog post - [Fine-tuning large Transformer models on a single GPU in PyTorch - Teaching GPT-2 a sense of humor](https://mf1024.github.io/2019/11/12/Fun-With-GPT-2/).

In this notebook, I will use a pre-trained GPT2 model (the small `gpt2` checkpoint) from [huggingface](https://github.com/huggingface/transformers) to generate some text.

The easiest way to use huggingface transformer libraries is to install their pip package *transformers*.

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/19/22/aff234f4a841f8999e68a7a94bdd4b60b4cebcfeca5d67d61cd08c9179de/transformers-3.3.1-py3-none-any.whl (1.1MB)

In [2]:
from google.colab import drive
import logging
import numpy as np
import pandas as pd
import sys
import torch

logging.getLogger().setLevel(logging.CRITICAL)

device = "cpu"
if torch.cuda.is_available():
    device = "cuda"

pd.set_option("display.precision", 4)

print("Python version is %s" % sys.version)
print("Device is: %s" % device)

drive.mount("/content/gdrive")


Python version is 3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0]
Device is: cuda
Mounted at /content/gdrive


### Models and classes

I use the [GPT2LMHeadModel](https://github.com/huggingface/transformers/blob/master/transformers/modeling_gpt2.py#L491) module for the language model. It is a [GPT2Model](https://github.com/huggingface/transformers/blob/master/transformers/modeling_gpt2.py#L326) with an additional linear layer that reuses the input embedding weights to perform the inverse of the embedding operation: it turns the outputs of GPT2 into a vector of logits over the vocabulary.
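
To make the weight sharing concrete, here is a minimal plain-Python sketch. The 4-token vocabulary, the 3-dimensional vectors and the names `E`, `embed` and `tied_logits` are made up for illustration; the real model does the same thing with much larger tensors.

```python
# Toy illustration of weight tying: the same matrix E maps token ids to
# vectors (embedding) and maps hidden vectors back to vocabulary logits.
# Vocabulary size (4) and vector size (3) are arbitrary choices for this sketch.
E = [
    [1.0, 0.0, 0.0],  # vector for token 0
    [0.0, 1.0, 0.0],  # vector for token 1
    [0.0, 0.0, 1.0],  # vector for token 2
    [1.0, 1.0, 0.0],  # vector for token 3
]


def embed(token_id):
    """Embedding lookup: select row `token_id` of E."""
    return E[token_id]


def tied_logits(hidden):
    """Output projection with tied weights: one dot product per row of E."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in E]


hidden = [0.2, 0.9, 0.1]  # stand-in for the transformer's last hidden state
logits = tied_logits(hidden)
best_token = max(range(len(logits)), key=logits.__getitem__)
```

Each logit measures how well the hidden state matches one token's embedding vector, which is why no separate output matrix is needed.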

[GPT2Tokenizer](https://github.com/huggingface/transformers/blob/master/transformers/tokenization_gpt2.py#L106) is a byte-pair encoding (BPE) tokenizer that transforms input text into the tokens that the GPT2 model was trained on.
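
The idea behind byte-pair encoding can be sketched in a few lines. This is a toy version under simplifying assumptions: it works on characters rather than bytes and learns its merges from the input string itself, whereas the real tokenizer applies a fixed merge table learned in advance. The helper names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(symbols):
    """Return the adjacent symbol pair that occurs most often."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


symbols = list("aaabdaaabac")  # start from single characters
for _ in range(2):             # two merge steps, an arbitrary choice
    symbols = merge_pair(symbols, most_frequent_pair(symbols))
```

After the two merges the string is covered by fewer, longer symbols, and joining them back together reproduces the original text.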

In [3]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model = model.to(device)
# Note: sys.getsizeof measures only the Python wrapper object, not the model weights
print("model has %s Bytes" % sys.getsizeof(model))

model has 56 Bytes


In [4]:
def choose_from_top(probs: np.ndarray, n: int = 5) -> int:
    """Pick a random token ID from the n most probable tokens,
    weighted by their renormalized probabilities.
    """
    ind = np.argpartition(probs, -n)[-n:]
    top_prob = probs[ind]
    top_prob = top_prob / np.sum(top_prob) # Normalize
    choice = np.random.choice(n, 1, p = top_prob)
    token_id = ind[choice][0]
    return int(token_id)
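
The same top-n sampling idea can be tried out on a small hand-made distribution. The sketch below is a standard-library restatement of the technique, not the function above; `top_n_sample` and the five-element `probs` list are assumptions made for illustration.

```python
import random


def top_n_sample(probs, n=5):
    """Sample a token id from the n most probable entries of `probs`."""
    # Indices of the n largest probabilities
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n]
    # Renormalize the kept probabilities so they sum to 1
    total = sum(probs[i] for i in top_ids)
    weights = [probs[i] / total for i in top_ids]
    return random.choices(top_ids, weights=weights, k=1)[0]


probs = [0.05, 0.40, 0.10, 0.30, 0.15]  # made-up distribution over 5 tokens
random.seed(0)
samples = [top_n_sample(probs, n=2) for _ in range(1000)]
```

With `n=2` only token ids 1 and 3 can ever be drawn, because every other entry is discarded before sampling. A small `n` keeps the output conservative, while a larger `n` allows more variety.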

### Text generation

At each prediction step, the GPT2 model needs the entire preceding sequence to predict the next element. The function below tokenizes the starting input text and then, in a loop, predicts one new token per step and appends it to the sequence, which is fed back into the model in the next step. At the end, the token list is decoded back into text.
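
The shape of that loop is easier to see with the model replaced by a trivial stand-in. In this sketch `toy_next_token` is a hypothetical placeholder for GPT2 that simply predicts the last token plus one, modulo 5; only the feed-back structure matters.

```python
def toy_next_token(sequence):
    """Hypothetical stand-in for the model: predicts (last token + 1) mod 5."""
    return (sequence[-1] + 1) % 5


def generate(start_tokens, steps):
    """Append one predicted token per step, feeding the whole sequence back in."""
    sequence = list(start_tokens)
    for _ in range(steps):
        sequence.append(toy_next_token(sequence))
    return sequence


generated = generate([3], steps=4)
```

Starting from `[3]`, four steps yield `[3, 4, 0, 1, 2]`: each prediction becomes part of the input for the next step.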

In [5]:
def generate_some_text(model, input_str: str, text_len: int = 250):
    cur_ids = torch.tensor(tokenizer.encode(input_str)).unsqueeze(0).long().to(device)
    model.eval()

    with torch.no_grad():
        for i in range(text_len):
            outputs = model(cur_ids, labels=cur_ids)
            loss, logits = outputs[:2]

            # Take the first (and only) batch item and the logits at the last position
            softmax_logits = torch.softmax(logits[0,-1], dim=0) 
            
            # Randomly (following the predicted probabilities) choose the next token from the top n tokens
            next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=10) 
            cur_ids = torch.cat([cur_ids, torch.ones((1,1)).long().to(device) * next_token_id], dim = 1) # Add the last word
        output_list = list(cur_ids.squeeze().to('cpu').numpy())
        output_text = tokenizer.decode(output_list)
        print(output_text)
    return    

## Generating the text

I will give three different sentence beginnings to GPT2 and let it generate the rest:


***1. The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work… when you go to church… when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth…***

***2. Artificial general intelligence is…***

***3. The Godfather: “I’m going to make him an offer he can’t refuse.”…***

In [6]:
generate_some_text(model, "The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work... when you go to church... when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth. ")

The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work... when you go to church... when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth.  We are all connected.
The Matrix is here now, and we are not alone. In fact many of you know that the Matrix and the Matrix are the same. We are all part of a network of interconnected entities. The Matrix is a part of this world, a part of the fabric of existence. It is an interconnected entity, and that makes it a part of our universe. The Matrix is our reality, but it is not your reality. We are part of a network of entities, that makes us all part of each other.
This is the Matrix and it is our reality. We are all part of the Matrix. We are part of our universe, and that makes it part of our universe. We exist in our own universe, that is part our universe. W

In [7]:
generate_some_text(model, "Artificial general intelligence is ")

 Artificial general intelligence is  not  something you can predict by looking at your brain. You can't predict anything, you can only predict the probability of something happening. This means that if a machine were to learn to predict a certain outcome in a certain way, it'd probably be a lot more likely to be able to predict that outcome. If the machine were to learn to predict a certain outcome in a certain way, then you'd be able to predict it, because the outcome you predict in the future doesn't exist in the past.
The next big thing is that machine learning is just not going to be very efficient when it's not working out that it's a good strategy.
This is why it's important to remember that you're not going to have much fun with it. It's not the best strategy to be a robot because it will always be more effective and it will probably never be that effective when it's just learning. If the machine were to learn how to pick up objects and find them in a specific way, it would prob

In [8]:
generate_some_text(model, "The Godfather: \"I'm going to make him an offer he can't refuse.\" ")

 The Godfather: "I'm going to make him an offer he can't refuse."  It's a great deal. I'm looking forward to making it to the next level. But I'm not sure that the story will end up in the same place. 
A couple of weeks ago, I received an email from the writer of The Godfather  on the subject of "Godfather: A Christmas Story".  "I'm going to give the movie a big ol' Christmas present," wrote the writer. He's a big ol' Christian, and he doesn't have an official Christmas story, so I'm not sure what he's talking about.  So I went to the movie's official website and found an email from a writer for the film's director. It's called "I'm going to make you a Christmas wish."  This is a very specific request.  I'm asking for an "official" Christmas story.  I'm also asking for one that doesn't have an official "Godfather: A Christmas Story".  So, if you read this, you'll be aware that The Godfather: A Christmas Story has no official "Christmas story" story.  And there's no official Christmas s

In [9]:
"""
Jokes data set
"""
import csv
import os
import json

from torch.utils.data import Dataset, DataLoader


class JokesDataset(Dataset):
    def __init__(self, jokes_dataset_path: str):
        super().__init__()
        short_jokes_path = os.path.join(jokes_dataset_path, "shortjokes.csv")
        self.joke_list = []
        self.end_of_text_token = "<|endoftext|>"
        
        with open(short_jokes_path) as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')
            for row in csv_reader:
                joke_str = f"JOKE:{row[1]}{self.end_of_text_token}"
                self.joke_list.append(joke_str)
        
    def __len__(self):
        return len(self.joke_list)

    def __getitem__(self, item):
        return self.joke_list[item]

jokes_dataset_path = "/content/gdrive/My Drive/xheng/data/jokes_data/"  # jokes dataset's path

dataset = JokesDataset(jokes_dataset_path=jokes_dataset_path)
joke_loader = DataLoader(dataset, batch_size=1, shuffle=True)
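
The training cell that follows packs several tokenized jokes into one sequence of at most `MAX_SEQ_LEN` tokens before each forward pass. A simplified sketch of that greedy packing on plain lists is shown here. It is an illustration under assumptions, not the notebook's loop: `pack_sequences` is a hypothetical helper, it concatenates whole jokes (the loop additionally drops the first token of each appended joke via `joke_tens[:,1:]`), and it emits the last partially filled sequence instead of carrying it over.

```python
def pack_sequences(token_lists, max_len):
    """Greedily concatenate token lists into sequences of at most max_len tokens."""
    packed, current = [], []
    for tokens in token_lists:
        if len(tokens) > max_len:
            continue                  # skip samples that can never fit
        if current and len(current) + len(tokens) > max_len:
            packed.append(current)    # current sequence is full: emit it
            current = []
        current = current + tokens
    if current:
        packed.append(current)        # emit the final, partially filled sequence
    return packed


jokes = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10] * 12, [11]]
sequences = pack_sequences(jokes, max_len=6)
```

Packing short jokes together means each forward pass processes close to `max_len` tokens instead of a handful, which uses the GPU far more efficiently than feeding one short joke at a time.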


In [None]:
from transformers import AdamW, get_linear_schedule_with_warmup


BATCH_SIZE = 16
EPOCHS = 3
LEARNING_RATE = 3e-5
WARMUP_STEPS = 5000
MAX_SEQ_LEN = 400

# Train the model and save the model weights after each epoch and then generate jokes with each version of the weight 
# to see which performs the best.

model = model.to(device)
model.train()
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
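# NOTE: with num_training_steps=-1 the linear schedule's multiplier falls to 0 once warmup ends;
# pass the real total number of optimizer steps to get a gradual linear decay instead.
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps = -1)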

proc_seq_count = 0
sum_loss = 0.0
batch_count = 0
tmp_jokes_tens = None

models_folder = jokes_dataset_path + "trained_models"
if not os.path.exists(models_folder):
    os.mkdir(models_folder)

for epoch in range(EPOCHS):
    print(f"EPOCH: {epoch} started")
    for idx, joke in enumerate(joke_loader):
        # print(f"Starting with idx: {idx}, joke: {joke}")
        
        # Pack as many jokes as possible into one sequence of at most MAX_SEQ_LEN tokens
        joke_tens = torch.tensor(tokenizer.encode(joke[0])).unsqueeze(0).to(device)
        
        # Skip sample from dataset if it is longer than MAX_SEQ_LEN
        if joke_tens.size()[1] > MAX_SEQ_LEN:
            continue
        
        # The first joke of a new sequence
        if not torch.is_tensor(tmp_jokes_tens):
            tmp_jokes_tens = joke_tens
            continue
        else:
            # The next joke does not fit, so process the current sequence and keep this joke as the start of the next one
            if tmp_jokes_tens.size()[1] + joke_tens.size()[1] > MAX_SEQ_LEN:
                work_jokes_tens = tmp_jokes_tens
                tmp_jokes_tens = joke_tens
            else:
                # Add the joke to sequence, continue and try to add more
                tmp_jokes_tens = torch.cat([tmp_jokes_tens, joke_tens[:,1:]], dim=1)
                continue

        # Sequence ready, process it through the model
        outputs = model(work_jokes_tens, labels=work_jokes_tens)
        loss, logits = outputs[:2]                        
        loss.backward()
        sum_loss = sum_loss + loss.detach().data
                       
        proc_seq_count = proc_seq_count + 1
        if proc_seq_count == BATCH_SIZE:
            proc_seq_count = 0    
            batch_count += 1
            optimizer.step()
            scheduler.step() 
            optimizer.zero_grad()
            model.zero_grad()

        if batch_count == 10:
            print(f"batch_count = {batch_count}, sum_loss = {sum_loss}")
            batch_count, sum_loss = 0, 0.0
    
    print("Storing the model after each epoch to compare the performance of them")
    torch.save(model.state_dict(), os.path.join(models_folder, f"gpt2_small_joker_{epoch}.pt"))


EPOCH: 0 started
batch_count = 10, sum_loss = 734.0015258789062
batch_count = 10, sum_loss = 730.8923950195312
batch_count = 10, sum_loss = 734.0779418945312
batch_count = 10, sum_loss = 726.9974975585938
batch_count = 10, sum_loss = 724.5185546875
batch_count = 10, sum_loss = 726.5108032226562
batch_count = 10, sum_loss = 718.717529296875
batch_count = 10, sum_loss = 720.8909912109375
batch_count = 10, sum_loss = 716.4512939453125
batch_count = 10, sum_loss = 709.8035278320312
batch_count = 10, sum_loss = 704.7301025390625
batch_count = 10, sum_loss = 698.413330078125
batch_count = 10, sum_loss = 694.8424682617188
batch_count = 10, sum_loss = 690.5439453125
batch_count = 10, sum_loss = 682.8909301757812
batch_count = 10, sum_loss = 675.9745483398438
batch_count = 10, sum_loss = 669.7047119140625
batch_count = 10, sum_loss = 665.923583984375
batch_count = 10, sum_loss = 657.6903076171875
batch_count = 10, sum_loss = 653.5399780273438
batch_count = 10, sum_loss = 642.8981323242188
batch
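
The training cell passes `num_training_steps = -1` to the scheduler, so it is worth checking what a linear warmup schedule does with that value. The sketch below restates, in plain Python, the multiplier rule that I understand `get_linear_schedule_with_warmup` to apply; treat it as an assumption about the library rather than its actual code. The total of 20000 steps is a hypothetical value for comparison.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Values from the notebook: 5000 warmup steps, num_training_steps = -1
during_warmup = linear_warmup_multiplier(2500, 5000, -1)
after_warmup = linear_warmup_multiplier(5000, 5000, -1)

# With a hypothetical total of 20000 optimizer steps the decay is gradual
with_total = linear_warmup_multiplier(12500, 5000, 20000)
```

Under this rule the multiplier climbs to 1 during warmup and then, with `-1` as the total, is clamped to 0 for every later step, so the weights would stop changing after step 5000. Supplying the real number of optimizer steps gives the intended gradual decay.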

In [None]:
"""
Generating the jokes
"""
MODEL_EPOCH = 2  # checkpoints are saved for epochs 0 .. EPOCHS - 1
model_path = os.path.join(models_folder, f"gpt2_small_joker_{MODEL_EPOCH}.pt")
model.load_state_dict(torch.load(model_path))

jokes_output_file_path = jokes_dataset_path + f"generated_{MODEL_EPOCH}.jokes"

model.eval()
if os.path.exists(jokes_output_file_path):
    os.remove(jokes_output_file_path)
    
joke_num = 0
with torch.no_grad():
    for joke_idx in range(1000):

        joke_finished = False
        cur_ids = torch.tensor(tokenizer.encode("JOKE:")).unsqueeze(0).to(device)

        for i in range(100):
            outputs = model(cur_ids, labels=cur_ids)
            loss, logits = outputs[:2]
            # Take the first (and only) batch item and the logits at the last position
            softmax_logits = torch.softmax(logits[0,-1], dim=0)
            # Sample from a wider pool for the first three tokens, then narrow it
            if i < 3:
                n = 20
            else:
                n = 3
            # Randomly (from the top-n probability distribution) select the next token
            next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=n)
            # Add the new token to the running sequence
            cur_ids = torch.cat([cur_ids, torch.ones((1,1)).long().to(device) * next_token_id], dim = 1)

            if next_token_id in tokenizer.encode('<|endoftext|>'):
                joke_finished = True
                break

        if joke_finished:
            joke_num = joke_num + 1
            output_list = list(cur_ids.squeeze().to('cpu').numpy())
            output_text = tokenizer.decode(output_list)
            with open(jokes_output_file_path, 'a') as f:
                f.write(f"{output_text} \n\n")
