# Text generation

[Live-demo by HuggingFace](https://transformer.huggingface.co/doc/gpt2-large)

In [1]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [2]:
MODEL_NAME = "gpt2"

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_NAME)
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)

In [4]:
text = "My favourite movie is"

In [5]:
tokenizer.tokenize(text)

['My', 'Ġfavourite', 'Ġmovie', 'Ġis']

In [6]:
text_encoded = torch.tensor([tokenizer.encode(text)], dtype=torch.long)

In [7]:
text_encoded

tensor([[ 3666, 12507,  3807,   318]])

In [8]:
with torch.no_grad():
    predictions = model(text_encoded)

In [9]:
predictions[0].shape

torch.Size([1, 4, 50257])

In [10]:
next_token_logits = predictions[0][:, -1, :]

In [11]:
next_token_logits.shape

torch.Size([1, 50257])

In [12]:
next_token = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)

In [13]:
tokenizer.convert_ids_to_tokens(next_token)

['ĠThe']

In [14]:
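def generate_text(input_text: str, tokens_to_generate: int) -> str:
    # Encode the prompt as a batch containing one sequence of token ids.
    text_generated = torch.tensor([tokenizer.encode(input_text)], dtype=torch.long)
    with torch.no_grad():
        for _ in range(tokens_to_generate):
            predictions = model(text_generated)
            # Logits for the token that follows the current sequence.
            next_token_logits = predictions[0][:, -1, :]

            # Greedy decoding: always pick the most likely next token.
            next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)

            # Append the new token so it becomes context for the next step.
            text_generated = torch.cat((text_generated, next_token_id), dim=1)
    # Decode the prompt plus all generated tokens back into a string.
    return tokenizer.decode(text_generated.squeeze(0).tolist())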

In [20]:
print(generate_text(text, 15))

My favourite movie is The Matrix, and I'm not sure if I'll ever watch it again


In [19]:
print(generate_text(text, 50))

My favourite movie is The Matrix, and I'm not sure if I'll ever watch it again.

I'm not sure if I'll ever watch it again. I'm not sure if I'll ever watch it again. I'm not sure if I'll ever


## Improve text generation

![image](https://miro.medium.com/max/1400/1*9sEpLZF8lV5OXwUQUMVZlg.png)

### Sample with temperature

Temperature sampling is inspired by statistical thermodynamics, where a high temperature makes high-energy states more likely to be encountered. In probability models, logits play the role of negative energies, and we can implement temperature sampling by dividing the logits by the temperature before feeding them into softmax to obtain our sampling probabilities.

Temperatures below 1 make the model increasingly confident in its top choices, a temperature of 1 leaves the distribution unchanged, and temperatures greater than 1 decrease confidence. In the limit of temperature 0, sampling is equivalent to argmax (greedy, maximum-likelihood decoding), while infinite temperature corresponds to uniform sampling.
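
To make the effect of temperature concrete, here is a minimal sketch of a temperature-scaled softmax. It uses only the standard library and a made-up list of three logits, not real GPT-2 output. The function name `softmax_with_temperature` and the toy values are assumptions for illustration. In the notebook, the same scaling could presumably be applied to `next_token_logits` before a `torch.softmax` call.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing by a temperature."""
    if temperature <= 0:
        # The T -> 0 limit is argmax; it cannot be computed by division.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical values, chosen for illustration

sharp = softmax_with_temperature(toy_logits, temperature=0.5)
plain = softmax_with_temperature(toy_logits, temperature=1.0)
flat = softmax_with_temperature(toy_logits, temperature=100.0)

for name, probs in [("T=0.5", sharp), ("T=1", plain), ("T=100", flat)]:
    print(name, [round(p, 3) for p in probs])
```

The top token's probability should be highest at `T=0.5` and closest to uniform at `T=100`, matching the description above.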

### Top K sampling

Top-k sampling means sorting the tokens by probability and zeroing out the probabilities of everything below the k-th token, then renormalising what is left. It appears to improve quality by removing the tail, which makes the text less likely to go off topic. But a single fixed k cannot fit every context: in some cases there really are many words we could reasonably sample from (broad distribution below), and in others there are only a few (narrow distribution below).

![image](https://miro.medium.com/max/1400/0*J37qonVPJvKZpzv2)
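
The top-k step described above can be sketched on a toy distribution as follows. This is an illustrative, standard-library version, not the notebook's or the library's implementation: `top_k_filter` and the probabilities in `toy_probs` are made up for the example. On real logits, `torch.topk` and `torch.multinomial` would be the natural building blocks.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable entries, zero the rest, and renormalise."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Indices sorted from most to least probable; keep the first k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]


toy_probs = [0.5, 0.3, 0.1, 0.06, 0.04]  # hypothetical next-token distribution
filtered = top_k_filter(toy_probs, k=2)

# Sample one token index from the truncated distribution.
rng = random.Random(0)  # seeded only to make the sketch repeatable
sampled_index = rng.choices(range(len(filtered)), weights=filtered, k=1)[0]
print(filtered, sampled_index)
```

With `k=2`, only the first two entries keep non-zero probability, so the sampled index can only be 0 or 1.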


### Top P sampling (nucleus sampling)

To address this problem, the authors of the paper linked below propose top-p sampling, also known as nucleus sampling, in which we sort the tokens by probability, compute the cumulative distribution, and cut off as soon as the [CDF](https://en.wikipedia.org/wiki/Cumulative_distribution_function) exceeds P. In the broad distribution example above, it may take the top 100 tokens to exceed top_p = 0.9. In the narrow distribution, we may already exceed top_p = 0.9 with just “hot” and “warm” in our sample distribution. In this way, we still avoid sampling egregiously wrong tokens, but preserve variety when the highest-scoring tokens have low confidence.

Why doesn’t maximum-likelihood (greedy) decoding work? During training, the model never gets a chance to see compounding errors. It is trained to predict the next token from a human-generated context. If it gets one token wrong by producing a bad distribution, the next prediction still uses the “correct” human-generated context, independent of the last prediction. During generation, however, it is forced to continue its own automatically generated context, a setting it never encountered during training.
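
The nucleus cut-off described above can be sketched in the same toy setting. This is a minimal illustration under stated assumptions: `top_p_filter` is a hypothetical helper, the two distributions are invented, and the cut-off here triggers when the cumulative mass reaches or exceeds `top_p`.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative mass reaches top_p."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    # Walk through the tokens from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set()
    cumulative = 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # the nucleus is complete
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]


# Hypothetical distributions: one peaked, one spread out.
narrow = [0.6, 0.35, 0.03, 0.02]
broad = [0.2, 0.2, 0.2, 0.2, 0.2]

narrow_kept = sum(1 for p in top_p_filter(narrow, 0.9) if p > 0)
broad_kept = sum(1 for p in top_p_filter(broad, 0.9) if p > 0)
print(narrow_kept, broad_kept)
```

Unlike a fixed k, the number of surviving tokens adapts to the shape of the distribution: the peaked example keeps two tokens, while the flat one keeps all five.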

[Related paper](https://arxiv.org/pdf/1904.09751.pdf)