# Text generation using GPT-2

In [1]:
#get transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer

#get large GPT2 tokenizer and large GPT2 model [774M parameters]

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
GPT2 = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)

#get medium GPT2 tokenizer and medium GPT2 model [355M parameters]
#-----------------------------------------------------------------
#tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
#GPT2 = GPT2LMHeadModel.from_pretrained("gpt2-medium", pad_token_id=tokenizer.eos_token_id)

#get small GPT2 tokenizer and small GPT2 model [124M parameters]
#---------------------------------------------------------------
#tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
#GPT2 = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

#view model architecture (layers and their dimensions)
print(GPT2)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1280)
    (wpe): Embedding(1024, 1280)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-35): 36 x GPT2Block(
        (ln_1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=3840, nx=1280)
          (c_proj): Conv1D(nf=1280, nx=1280)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=5120, nx=1280)
          (c_proj): Conv1D(nf=1280, nx=5120)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1280, out_features=50257, bias=False)
)


In [2]:
#for reproducibility
SEED = 34

#maximum number of tokens in output text (prompt included)
MAX_LEN = 70

In [3]:
#import pytorch

import torch
torch.manual_seed(SEED)

<torch._C.Generator at 0x22d17918690>

## Different Decoding methods

### 1. Greedy Search

With greedy search, the word with highest probability is predicted as the next word.

Let's see how this naive approach performs.
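To make the greedy rule concrete, here is a minimal sketch on a hypothetical toy model. `TOY_MODEL` is an invented lookup table of next-word probabilities (it is not GPT-2 and the numbers are made up), and `greedy_decode` is an illustrative helper, not part of `transformers`.

```python
# Hypothetical toy "language model": next-word probabilities given the previous word.
# All numbers are invented for illustration only.
TOY_MODEL = {
    "<s>": {"the": 0.5, "a": 0.4, "dog": 0.1},
    "the": {"dog": 0.4, "cat": 0.35, "the": 0.25},
    "a": {"cat": 0.9, "dog": 0.1},
    "dog": {"barks": 0.6, "</s>": 0.4},
    "cat": {"sleeps": 0.95, "</s>": 0.05},
    "barks": {"</s>": 1.0},
    "sleeps": {"</s>": 1.0},
}

def greedy_decode(model, start="<s>", max_len=10):
    """Pick the single most probable next word at every step."""
    words, prob = [start], 1.0
    while words[-1] != "</s>" and len(words) < max_len:
        next_probs = model[words[-1]]
        best = max(next_probs, key=next_probs.get)  # argmax over next words
        prob *= next_probs[best]
        words.append(best)
    return words, prob

words, prob = greedy_decode(TOY_MODEL)
print(" ".join(words), round(prob, 4))
```

On this toy table, greedy decoding commits to "the" at the first step and ends with a sequence of probability 0.12, even though the sequence starting with the less likely word "a" reaches 0.342 overall. This is the weakness of always taking the locally best word.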

In [4]:
input_sequence = "The work done by you is really good. However"

In [5]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode(input_sequence, return_tensors='pt')

# Prevents warning during decoding
GPT2.config.pad_token_id = GPT2.config.eos_token_id  

# generate text until the output length (which includes the context length) reaches MAX_LEN
greedy_output = GPT2.generate(input_ids, max_length = MAX_LEN)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [6]:
print("Output:\n\n")
print(tokenizer.decode(greedy_output[0], skip_special_tokens = True), '...')

Output:


The work done by you is really good. However, I have a question. I have a question about the way you are using the word "cure". I am not sure if you are using the word "cure" in the sense of "to remove the disease" or "to remove the disease from the body". I am not sure ...


Our results are not great - as we can see, our model starts repeating itself rather quickly. The main issue with Greedy Search is that a high-probability word can be hidden behind a low-probability word that precedes it, so the model never gets to explore those more diverse combinations of words. We can mitigate this by using Beam Search:

### 2. Beam Search with N-Gram Penalties

Beam search is essentially Greedy Search, except that the model tracks and keeps `num_beams` hypotheses at each time step, so it is able to compare alternative paths as it generates text. We can also include an n-gram penalty by setting `no_repeat_ngram_size = 2`, which ensures that no 2-gram appears twice. We will also set `num_return_sequences = 3` so we can see what the top 3 beams look like.
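The idea of keeping several hypotheses alive can be sketched in a few lines. This is a simplified illustration on a hypothetical toy table (`TOY_MODEL`, with invented probabilities), not the `transformers` implementation: it omits the length penalty, early stopping, and n-gram blocking that `generate` supports.

```python
import math

# Hypothetical toy "language model": next-word probabilities given the previous word.
# All numbers are invented for illustration only.
TOY_MODEL = {
    "<s>": {"the": 0.5, "a": 0.4, "dog": 0.1},
    "the": {"dog": 0.4, "cat": 0.35, "the": 0.25},
    "a": {"cat": 0.9, "dog": 0.1},
    "dog": {"barks": 0.6, "</s>": 0.4},
    "cat": {"sleeps": 0.95, "</s>": 0.05},
    "barks": {"</s>": 1.0},
    "sleeps": {"</s>": 1.0},
}

def beam_search(model, num_beams=2, start="<s>", max_len=10):
    """Keep the num_beams highest-scoring partial sequences at each step."""
    beams = [([start], 0.0)]  # (words, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            if words[-1] == "</s>":
                candidates.append((words, score))  # finished hypotheses carry over
                continue
            for word, p in model[words[-1]].items():
                candidates.append((words + [word], score + math.log(p)))
        # Prune to the best num_beams candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(words[-1] == "</s>" for words, _ in beams):
            break
    return beams

for words, score in beam_search(TOY_MODEL, num_beams=2):
    print(" ".join(words), round(math.exp(score), 4))
```

With `num_beams=1` this reduces to greedy search and returns the "the dog barks" path (probability 0.12). With `num_beams=2` the "a cat sleeps" path (probability 0.342) survives the first step and ends up ranked first.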

In [7]:
beam_outputs = GPT2.generate(
    input_ids,
    max_length = MAX_LEN,
    num_beams = 5,
    no_repeat_ngram_size = 2,
    num_return_sequences = 3,
    early_stopping = True
)

print("Output:\n\n")

# now we have 3 output sequences (num_return_sequences = 3)
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)), '...')

Output:


0: The work done by you is really good. However, I am not sure if you are aware of the fact that there are a lot of people in the world who do not have access to the internet.

This is a problem that we have to deal with on a daily basis, and it is not something that can be solved by a single ...
1: The work done by you is really good. However, I am not sure if you are aware of the fact that there are a lot of people in the world who do not have access to the internet.

This is a problem that we have to deal with on a daily basis, and it is not something that can be solved by just one ...
2: The work done by you is really good. However, I am not sure if you are aware of the fact that there are a lot of people in the world who do not have access to the internet.

This is a problem that we have to deal with on a daily basis, and it is not something that can be solved by a simple ...


Now that's much better! The 3 beam hypotheses are nearly identical, but if we increased `num_beams`, we would likely see more variation between the separate beams. Of course, Beam Search is not perfect either. It works well when the length of the generated text is more or less constant, as in translation or summarization, but not so well for open-ended problems like dialog or story generation (because it is much harder to find a balance between `num_beams` and `no_repeat_ngram_size`).
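For reference, the `no_repeat_ngram_size` constraint used in the beam search call can be pictured as a filter over next-token candidates: any token that would complete an n-gram already present in the sequence is banned. The helper below, `banned_next_tokens`, is a hypothetical name and a simplified sketch of that idea on a list of words, not the library's implementation.

```python
def banned_next_tokens(tokens, ngram_size=2):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be created.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

history = "I have a question . I".split()
print(banned_next_tokens(history, ngram_size=2))
```

Here the sequence ends in "I" and the 2-gram "I have" has already occurred, so "have" is the only banned continuation. A hard ban like this removes loops, but it also forbids legitimate repeats (for example a name that must appear twice), which is part of why the setting needs tuning.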

### 3. Basic Sampling

Now we will explore non-deterministic decoding: sampling. Instead of following a strict path to find the end text with the highest probability, we randomly pick the next word according to its conditional probability distribution.
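As a minimal sketch of what "picking by the conditional distribution" means, the snippet below draws words from a hypothetical next-word distribution. The distribution `next_word_probs` and the helper `sample_next_word` are invented for illustration and are not GPT-2 output or a `transformers` function.

```python
import random
from collections import Counter

# Hypothetical next-word distribution (invented numbers).
next_word_probs = {"good": 0.6, "great": 0.3, "purple": 0.1}

def sample_next_word(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(34)  # seeded so the draws are repeatable
counts = Counter(sample_next_word(next_word_probs, rng) for _ in range(10_000))
print(counts.most_common())
```

Over many draws the counts track the probabilities, but any single draw can land on an unlikely word such as "purple", which greedy or beam search would never pick.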

However, when we include this randomness, the generated text tends to be incoherent, so we can include the **temperature parameter** while generating text. The temperature divides the logits before the softmax is applied:

$$ P_T(y_i) = \frac{e^{(z_i/T)}}{\sum_{j}e^{(z_j/T)}}$$

Where,
- $P_T(y_i)$ is the probability of the $i$-th token after applying temperature.
- $z_i$ is the logit (unnormalized log probabilities) of the $i$-th token.
- $T$ is the temperature parameter. A higher $T$ makes the distribution more uniform (increasing randomness), and a lower $T$ makes it more sharply peaked.

So a lower temperature makes the text more predictable and consistent (e.g. text summarization, legal underwriting),
while a higher value allows more freedom and creativity (e.g. creative writing). The default value is 1.0.

In [8]:
import numpy as np

logits = np.array([2.0, 5.0, 0.5])
softmax = np.exp(logits) / np.sum(np.exp(logits))
print("\nSoftmax probabilities: ", softmax)


Softmax probabilities:  [0.04692926 0.94259941 0.01047133]


In [9]:
temperatures = [0.1, 0.2, 0.5, 1.0, 2.0, 5.0]

for temperature in temperatures:
    softmax_with_temp = np.exp(logits / temperature) / np.sum(np.exp(logits / temperature))
    print(f"\nSoftmax probabilities with temperature = {temperature}: ", softmax_with_temp)


Softmax probabilities with temperature = 0.1:  [9.35762297e-14 1.00000000e+00 2.86251858e-20]

Softmax probabilities with temperature = 0.2:  [3.05902227e-07 9.99999694e-01 1.69189740e-10]

Softmax probabilities with temperature = 0.5:  [2.47231880e-03 9.97404592e-01 1.23089505e-04]

Softmax probabilities with temperature = 1.0:  [0.04692926 0.94259941 0.01047133]

Softmax probabilities with temperature = 2.0:  [0.16795275 0.75271199 0.07933526]

Softmax probabilities with temperature = 5.0:  [0.28066732 0.51140921 0.20792347]


In [10]:
# use temperature = 0.5 (low value -> strict response)

sample_output = GPT2.generate(
                             input_ids,
                             do_sample = True,
                             max_length = MAX_LEN,
                             top_k = 0, # we will shortly see why
                             temperature = 0.5
                            )

print("Output:\n")
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:

The work done by you is really good. However, I think it is not quite enough. It is not enough to be able to make a complete map of the world, but it is enough to make a map of the world in its entirety, in the sense that a map of the world in its entirety is a map of the world. That ...


In [11]:
# use temperature = 1.5 (high value -> more creative response)

sample_output = GPT2.generate(
                             input_ids,
                             do_sample = True,
                             max_length = MAX_LEN,
                             top_k = 0, # we will shortly see why
                             temperature = 1.5
                            )

print("Output:\n")
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:

The work done by you is really good. However, in more hypothetical situations : based on Sections 13 and 14 A reject starts just U rounded . Modified dependent aliasing creates trust issue") ; overtake implicit declarations via smart swkb (lines 134)) blindly. Serbian matchupsTRsmKcsvn received theory iteration completed MEPev sucked DEF conditioning DSL wasted ...


In [12]:
# use temperature = 0.9 (a balance between strict and creative responses)

sample_output = GPT2.generate(
                             input_ids,
                             do_sample = True,
                             max_length = MAX_LEN,
                             top_k = 0, # we will shortly see why
                             temperature = 0.9
                            )

print("Output:\n")
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:

The work done by you is really good. However, my interfaces are not ASHTTP compatible. Firefox is the only browser that shows the message about https in the address bar:

what about web.banned.ca? that is not the same as redirect to https://www.facebook.com/pages/anti-bullying-service ...


### 4. Top-K Sampling

In Top-K sampling, the k most likely next words are selected and the entire probability mass is redistributed among these k words. So instead of increasing the chances of high-probability words occurring and decreasing the chances of low-probability words, we just remove the low-probability words altogether.

When we set `top_k = 0`, we consider all the words (i.e. we do not remove the low-probability words).
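The filtering step can be sketched on a small, hypothetical distribution. `top_k_filter` is an illustrative helper (not a `transformers` function) that works on probabilities directly, whereas the library operates on logits.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize; k = 0 keeps everything."""
    if k <= 0 or k >= len(probs):
        return dict(probs)
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[w] for w in kept)
    return {w: probs[w] / total for w in kept}

# Hypothetical next-word distribution (invented numbers).
probs = {"good": 0.5, "great": 0.3, "fine": 0.15, "purple": 0.05}
print(top_k_filter(probs, k=2))
```

With `k=2` only "good" and "great" survive, and their probabilities are rescaled to roughly 0.625 and 0.375 so that they sum to 1 again.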

In [13]:
#sample from only top_k most likely words
sample_output = GPT2.generate(
                             input_ids,
                             do_sample = True,
                             max_length = MAX_LEN,
                             top_k = 50
)

print("Output:\n")
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:

The work done by you is really good. However I must add one thing about your new book, and I'm sure you have already heard of it: you have taken many of the basic mistakes that were made in the first "Jupiter's Children" book and applied them for Jupiter's Children 2. As such I feel that you have done a ...


### 5. Top-P Sampling (nucleus sampling)

Top-P sampling (also known as nucleus sampling) is similar to Top-K, but instead of choosing the top k most likely words, we choose the smallest set of words whose total probability is larger than $p$, and then the entire probability mass is shifted to the words in this set.

The main difference here is that with Top-K sampling, the size of the set of words is static (obviously) whereas in Top-P sampling, the size of the set can change. To use this sampling method, we just set `top_k = 0` and choose a value `top_p`.
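The changing set size is easiest to see on two hypothetical distributions, one peaked and one flat. `top_p_filter` below is an illustrative helper (not a `transformers` function), and it treats "reaches $p$" as cumulative probability greater than or equal to $p$, which is an assumption of this sketch.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-probable words whose total probability reaches p."""
    kept, cumulative = [], 0.0
    for word in sorted(probs, key=probs.get, reverse=True):
        kept.append(word)
        cumulative += probs[word]
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {w: probs[w] / cumulative for w in kept}

# Hypothetical next-word distributions (invented numbers).
peaked = {"good": 0.9, "great": 0.05, "fine": 0.03, "purple": 0.02}
flat = {"good": 0.35, "great": 0.3, "fine": 0.2, "purple": 0.15}
print(len(top_p_filter(peaked, 0.8)), len(top_p_filter(flat, 0.8)))
```

With `p = 0.8`, the peaked distribution keeps a single word while the flat one keeps three: the nucleus shrinks when the model is confident and grows when it is not.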

In [14]:
#sample only from the smallest set of words whose cumulative probability reaches 0.8
sample_output = GPT2.generate(
                             input_ids,
                             do_sample = True,
                             max_length = MAX_LEN,
                             top_p = 0.8,
                             top_k = 0
)

print("Output:\n")
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:

The work done by you is really good. However, I want to give you a special message. While I am trying to search for the meaning behind the data, I have already found the exact location where I used to live. This is the end of the world, my friends. Now, it's time to move. Now."

Suddenly ...


### 6. Top-K and Top-P Sampling

As you have probably guessed, we can use Top-K and Top-P sampling together. This reduces the chances of getting weird (low-probability) words while still allowing a dynamic selection size. We only need to set a value for both `top_k` and `top_p`. We can even include the temperature parameter if we want to. Let's now see how our model performs after combining everything. We will return 5 sequences to see how diverse our answers are.
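One way to picture the combination is as a chain of filters applied to the next-word distribution. The sketch below assumes the order temperature, then top-k, then top-p; that ordering, the helper name `warp_distribution`, and the toy logits are all assumptions made for illustration rather than a description of the library internals.

```python
import math

def warp_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Temperature-scaled softmax, then top-k, then top-p, renormalizing after each cut."""
    # Temperature-scaled softmax (shifted by the max logit for numerical stability).
    peak = max(logits.values())
    exps = {w: math.exp((z - peak) / temperature) for w, z in logits.items()}
    total = sum(exps.values())
    ranked = sorted(((w, e / total) for w, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)
    # Top-k: keep the k most probable words (0 disables the filter), then renormalize.
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(p for _, p in ranked)
    ranked = [(w, p / mass) for w, p in ranked]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    return {w: prob / cumulative for w, prob in kept}

# Hypothetical logits (invented numbers).
toy_logits = {"good": 2.0, "great": 1.0, "fine": 0.0, "purple": -3.0}
print(warp_distribution(toy_logits, top_k=3, top_p=0.85))
```

On these toy logits, top-k first drops "purple", and top-p then trims the remaining three words down to "good" and "great", which is the set the next word would be sampled from.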

In [15]:
#combine both sampling techniques

sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True,
                              max_length = 2*MAX_LEN, #to test how long the generated text stays coherent
                              #temperature = 0.7,
                              top_k = 50,
                              top_p = 0.85,
                              num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)), '...')
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: The work done by you is really good. However, you must be aware that in order to reach the maximum efficiency you need to have a good understanding of the problem, and your own ability to perform it well. It is always possible to go the extra mile, but in the end you will have to give your best effort. You must have a realistic understanding of your abilities, and your weaknesses. You must be able to learn from your mistakes, and have a realistic view of your abilities. Finally, you must be able to recognize and overcome your own errors in judgement and in technique. You must be able to recognize and overcome your own weaknesses. For instance, if you are doing a really good... ...

1: The work done by you is really good. However, you need to understand that not all of my job is in writing, but rather in being a part of the production process. Therefore, I will tell you a litt

----------------------

## Conditional text generation

In [17]:
MAX_LEN = 100

input_sequence = "Today is a bright day. So"

input_ids = tokenizer.encode(input_sequence, return_tensors='pt')

sample_output = GPT2.generate(
                              input_ids,
                              do_sample = True,
                              max_length = MAX_LEN,
                              temperature = 0.8,
                              top_k = 50,
                              top_p = 0.85
)

output = tokenizer.decode(sample_output[0], skip_special_tokens = True)

print(output)

Today is a bright day. So far, so good.

But the news for the U.S. economy is not so good. A report from the Commerce Department shows that the U.S. economy shrank at a faster rate in the third quarter than it did in the second quarter, and that GDP growth slowed to a 2.7% annual rate from 2.9% in the second quarter.

The report, released Thursday, also shows that U.S. consumer


In [20]:
# changing the context slightly

input_sequence = "Today is a bright day. But,"

input_ids = tokenizer.encode(input_sequence, return_tensors='pt')

sample_output = GPT2.generate(
                              input_ids,
                              do_sample = True,
                              max_length = MAX_LEN,
                              temperature = 0.8,
                              top_k = 50,
                              top_p = 0.85
)

output = tokenizer.decode(sample_output[0], skip_special_tokens = True)

print(output)

Today is a bright day. But, I want to make sure I don't forget that I'm still in the middle of the woods, in the middle of the storm."

"What's that?"

"That I'm not ready to see my family."

"What about your mom?"

"I don't know. I'm not sure I'm ready to talk about that yet."

"What about your dad?"

"I don't know.


In [22]:
# feeding the generated output back to the model to generate more

MAX_LEN = 50

input_sequence = "I believe, life is all about"

input_ids = tokenizer.encode(input_sequence, return_tensors='pt')

sample_output = GPT2.generate(
                              input_ids,
                              do_sample = True,
                              max_length = MAX_LEN,
                              temperature = 1.1,
                              #top_k = 50,
                              top_p = 0.7,
                              num_beams = 5,
                              no_repeat_ngram_size = 2,
)

output = tokenizer.decode(sample_output[0], skip_special_tokens = True)

print("First output:\n")
print(output)

input_ids2 = tokenizer.encode(output, return_tensors='pt')

sample_output2 = GPT2.generate(
                              input_ids2,
                              do_sample = True,
                              max_length = 2*MAX_LEN,
                              temperature = 0.8,
                              #top_k = 50,
                              top_p = 0.85,
                              num_beams = 5,
                              no_repeat_ngram_size = 2,
)

output2 = tokenizer.decode(sample_output2[0], skip_special_tokens = True)

print("\nNext output: \n")
print(output2)

First output:

I believe, life is all about the journey, not the destination.

If you're reading this, it means you've made it to the end of your journey. It's time to take a step back and look at your life from a

Next output: 

I believe, life is all about the journey, not the destination.

If you're reading this, it means you've made it to the end of your journey. It's time to take a step back and look at your life from a new perspective. If you want to make the most of it, you'll need to do some soul-searching. You'll have to ask yourself, "Am I living the life I want?" If the answer to that question is no, then
