
# Generating text with a pre-trained GPT-2 model in PyTorch

This notebook accompanies the blog post [Fine-tuning large Transformer models on a single GPU in PyTorch - Teaching GPT-2 a sense of humor](https://mf1024.github.io/2019/11/12/Fun-With-GPT-2/).

In this notebook, I use a pre-trained medium-sized GPT-2 model from the [Hugging Face transformers](https://github.com/huggingface/transformers) library to generate some text.

The easiest way to use the Hugging Face transformers library is to install its pip package, *transformers*.

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/19/22/aff234f4a841f8999e68a7a94bdd4b60b4cebcfeca5d67d61cd08c9179de/transformers-3.3.1-py3-none-any.whl (1.1MB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.8.1.rc2
  Downloading https://files.pythonhosted.org/packages/80/83/8b9fccb9e48eeb575ee19179e2bdde0ee9a1904f97de5f02d19016b8804f/tokenizers-0.8.1rc2-cp36-cp36m-manylinux1_x86_64.whl 

In [2]:
import logging
import numpy as np
import sys
import torch

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Silence log messages below CRITICAL on the root logger
logging.getLogger().setLevel(logging.CRITICAL)

# Use the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

print("Python version is %s" % sys.version)
print("Device is: %s" % device)

Python version is 3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0]
Device is: cuda


### Models and classes

I use the [GPT2LMHeadModel](https://github.com/huggingface/transformers/blob/master/transformers/modeling_gpt2.py#L491) module for the language model. It is a [GPT2Model](https://github.com/huggingface/transformers/blob/master/transformers/modeling_gpt2.py#L326) with an additional linear layer that reuses the input embedding weights to perform the inverse of the embedding lookup: it maps each GPT-2 output vector to a vector of logits over the vocabulary.
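
To make the shared-weight idea concrete, here is a minimal NumPy sketch. It is not the library's code: the names `E` and `hidden` and the tiny sizes are illustrative assumptions, and the toy embedding matrix uses one-hot rows so the result is easy to read. In the real model, `hidden` would be the transformer's output for the last position rather than a raw embedding.

```python
import numpy as np

vocab_size, hidden_dim = 5, 8        # illustrative sizes, far smaller than GPT-2's
E = np.eye(vocab_size, hidden_dim)   # toy embedding matrix: one row per token

token_id = 2
hidden = E[token_id]                 # embedding lookup: token ID -> vector

logits = hidden @ E.T                # tied output layer: vector -> one score per token
print(logits.shape, int(np.argmax(logits)))
```

The same matrix is used in both directions: row lookup on the way in, and a matrix product with its transpose on the way out.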

[GPT2Tokenizer](https://github.com/huggingface/transformers/blob/master/transformers/tokenization_gpt2.py#L106) is a byte-pair encoding (BPE) tokenizer that converts input text into the token IDs that the Hugging Face GPT-2 models were trained on.
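
As a rough illustration of what byte-pair encoding does, the sketch below performs a single merge step on a toy character sequence. It is an assumption-laden simplification: GPT-2's actual tokenizer works on bytes and applies a merge table learned ahead of time, and the helper names `most_frequent_pair` and `merge_pair` are hypothetical, not part of the transformers API.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent pair of tokens that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("abababc")
pair = most_frequent_pair(tokens)
print(pair, merge_pair(tokens, pair))
```

Repeating this step many times over a large corpus is what produces a vocabulary of frequent sub-word units.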

In [4]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model = model.to(device)

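# Note: sys.getsizeof reports only the shallow size of the Python object, not the size of the model weights
print("model has %s B" % sys.getsizeof(model))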

model has 56 B
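
The 56 B printed above is the shallow size of the Python wrapper object as reported by `sys.getsizeof`, not the memory taken by the weights. One possible way to measure the weights is to sum over the module's parameters. The helper name `count_parameters` is hypothetical, and a tiny `torch.nn.Linear` stands in for the real model so the sketch runs without downloading anything.

```python
import torch

def count_parameters(module: torch.nn.Module):
    """Return (number of parameters, bytes occupied by the parameter tensors)."""
    n_params = sum(p.numel() for p in module.parameters())
    n_bytes = sum(p.numel() * p.element_size() for p in module.parameters())
    return n_params, n_bytes

tiny = torch.nn.Linear(4, 3)  # stand-in module: 4*3 weights + 3 biases
print(count_parameters(tiny))
```

Calling `count_parameters(model)` on the loaded GPT-2 model should give a far more informative figure than `sys.getsizeof`.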


In [5]:
def choose_from_top(probs: np.ndarray, n: int = 5) -> int:
    """Sample a token ID from the n most probable tokens.

    The n largest probabilities are renormalized to sum to 1, and one token ID is drawn at random from that distribution.
    """
    ind = np.argpartition(probs, -n)[-n:]  # Indices of the n largest probabilities (unordered)
    top_prob = probs[ind]
    top_prob = top_prob / np.sum(top_prob)  # Renormalize over the top n
    choice = np.random.choice(n, 1, p=top_prob)  # Position within the top-n list
    token_id = ind[choice][0]
    return int(token_id)
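
To see what the top-n selection and renormalization do to a distribution, here is a small worked example. The helper `top_k_distribution` is a hypothetical variant written for this illustration: it returns the selected indices and their renormalized probabilities, sorted by probability for readability, and a seeded generator is used only so the draw is repeatable.

```python
import numpy as np

def top_k_distribution(probs, k):
    """Return the k most probable indices and their renormalized probabilities."""
    idx = np.argpartition(probs, -k)[-k:]   # k largest entries, in no particular order
    idx = idx[np.argsort(-probs[idx])]      # sort by descending probability
    renorm = probs[idx] / probs[idx].sum()  # probabilities now sum to 1
    return idx, renorm

probs = np.array([0.50, 0.20, 0.10, 0.10, 0.05, 0.05])
idx, renorm = top_k_distribution(probs, k=2)

rng = np.random.default_rng(0)
sample = int(rng.choice(idx, p=renorm))     # draw one token ID from the top k
print(idx, renorm, sample)
```

With `k=2`, only tokens 0 and 1 can ever be drawn, with probabilities 5/7 and 2/7. Every other token is excluded no matter how often the draw is repeated.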

### Text generation

At each prediction step, the GPT-2 model needs the entire preceding sequence to predict the next token. The function below tokenizes the starting text and then loops: at each step it predicts one new token and appends it to the sequence, which is fed back into the model at the next step. At the end, the token list is decoded back into text.

In [6]:
def generate_some_text(input_str, text_len=250):
    """Generate text_len new tokens after input_str and print the decoded result."""
    cur_ids = torch.tensor(tokenizer.encode(input_str)).unsqueeze(0).long().to(device)
    model.eval()  # Disable dropout

    with torch.no_grad():
        for _ in range(text_len):
            # No labels are passed: the loss is not needed for generation
            logits = model(cur_ids)[0]

            # Take the first (and only) batch item and the logits at the last position
            probs = torch.softmax(logits[0, -1], dim=0)

            # Sample the next token from the renormalized top-n probabilities
            next_token_id = choose_from_top(probs.to("cpu").numpy(), n=10)
            cur_ids = torch.cat(
                [cur_ids, torch.ones((1, 1)).long().to(device) * next_token_id], dim=1
            )  # Append the new token to the sequence

        output_list = list(cur_ids.squeeze().to("cpu").numpy())
        output_text = tokenizer.decode(output_list)
        print(output_text)
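
The structure of this loop can also be seen without loading GPT-2. In the NumPy sketch below, the function `toy_logits` is a hypothetical stand-in for the model (its scores depend only on the last token), and `VOCAB_SIZE`, `generate`, and the seed are assumptions made for the illustration. The steps are the same as above: compute logits, apply softmax, sample from the top n, and append.

```python
import numpy as np

VOCAB_SIZE = 8  # hypothetical tiny vocabulary

def toy_logits(token_ids):
    """Hypothetical stand-in for GPT-2: scores depend only on the last token."""
    last = token_ids[-1]
    logits = np.zeros(VOCAB_SIZE)
    logits[(last + 1) % VOCAB_SIZE] = 3.0  # strongly prefer "last + 1"
    logits[(last + 2) % VOCAB_SIZE] = 1.0  # weaker alternative
    return logits

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / z.sum()

def generate(prompt_ids, steps, top_n=2, seed=0):
    rng = np.random.default_rng(seed)
    ids = list(prompt_ids)
    for _ in range(steps):
        probs = softmax(toy_logits(ids))               # the whole sequence is passed in
        top = np.argpartition(probs, -top_n)[-top_n:]  # indices of the top_n tokens
        p = probs[top] / probs[top].sum()              # renormalize over the top_n
        ids.append(int(rng.choice(top, p=p)))          # the sequence grows by one token
    return ids

print(generate([0], steps=5))
```

Because the sequence grows by one token per step and is passed to the model in full each time, the cost per step increases with the length of the text generated so far.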

## Generating the text

I will give GPT-2 three different sentence beginnings and let it generate the rest:


***1. The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work… when you go to church… when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth…***

***2. Artificial general intelligence is…***

***3. The Godfather: “I’m going to make him an offer he can’t refuse.”…***

In [7]:
generate_some_text(
    "The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work... when you go to church... when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth. "
)

The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work... when you go to church... when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth.  The truth is that you are the Matrix. The truth about the world. The whole truth. The Matrix is not a movie. It is a reality.  And the truth is that you are in the Matrix, not in this room.
The Matrix is not a movie. The truth is that you are the Matrix. The truth is that you are in the Matrix
The Matrix is not a movie. It is the world that has been pulled over your eyes to blind you from the truth
And the truth is that you are in the Matrix, not in this room.
The Matrix is not a movie. It is the World that has been pulled over your eyes to blind you from the truth
And the Truth, the Matrix, is not a movie. The truth is that you are the Matrix. The truth is that 

In [8]:
generate_some_text(" Artificial general intelligence is ")

 Artificial general intelligence is  a concept that is based on a lot of scientific and mathematical concepts.  The definition  of an intelligent being (i.e. computer) is based on what is known as Turing Machines (TM).  The most famous example of a Turing machine is Alan Turing, who was a computer scientist and mathematician, and who built the Turing machine that is known as Alan Turing (or  Alan).
Turing Machine (turing machine.jpg) Alan Turing was a computer scientist and mathematician.  He built the Turing machine that is known as Alan Turing (or  Alan).  The definition of an intelligent being  is based on  turing machines (TM).  It is important to note that  turing machines are machines that have been programmed with certain goals, such as the definition  of intelligence (see the Wikipedia article here.)  The goal of a turing machine (TM) is  to find a way to improve  on the capabilities that it has learned.  A turing machine has a very generalizable programming language (or 


In [None]:
generate_some_text(" The Godfather: \"I'm going to make him an offer he can't refuse.\" ")

The Godfather: "I'm going to make him an offer he can't refuse."

The Godfather: "What? What is it? He has to be a good boy? A good boy that doesn't want to be killed? Is the offer good?"

The Godfather: "He's a bad boy, isn't he."

The Godfather: "You're a good boy!"

The Godfather: "He's an idiot. He won't be able to understand what's going on!"

The Godfather: "You know, I never said you would be able to understand what's going on! I said you would be able to take him to a friend's house."

The Godfather: "I don't understand! You mean you'll never understand what's going on? What's happening to me?"

The Godfather: "That's the only way I can explain it to him. He's not going to be able to understand it either if I tell him what I know. He won't be able even to comprehend a thing if I tell him what it is."

The Godfather: "Well, you know, I've seen it all. I don't know what he will do. And, if he does, what's
