
# Text generation - using different decoding methods

Simplified from: https://github.com/huggingface/blog/blob/main/notebooks/02_how_to_generate.ipynb

We will use the [GPT2 model](https://huggingface.co/gpt2).

In [24]:
!pip install transformers



In [25]:
MODEL = 'gpt2'
MAX_LEN = 45

In [26]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained(MODEL)
model = TFGPT2LMHeadModel.from_pretrained(MODEL, pad_token_id=tokenizer.eos_token_id)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


### **Greedy Search**

Greedy search selects the word with the highest probability as the next word at each time step $t$: $w_t = \operatorname{argmax}_{w} P(w | w_{1:t-1})$.

![Greedy Search](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/greedy_search.png)
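As a minimal sketch of this rule, the toy lookup table below stands in for a language model. The probabilities are assumptions read off the figure above, not outputs of GPT-2, and `TOY_MODEL` and `greedy_search` are hypothetical names that are not part of `transformers`.

```python
# Toy next-word distributions keyed by the words generated so far.
# The numbers are illustrative assumptions, not outputs of GPT-2.
TOY_MODEL = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def greedy_search(model, context, steps):
    """Append the single most likely next word at every step."""
    sequence = tuple(context)
    probability = 1.0
    for _ in range(steps):
        next_word_probs = model[sequence]
        best_word = max(next_word_probs, key=next_word_probs.get)
        probability *= next_word_probs[best_word]
        sequence += (best_word,)
    return sequence, probability


greedy_sequence, greedy_probability = greedy_search(TOY_MODEL, ["The"], steps=2)
print(greedy_sequence, round(greedy_probability, 2))
```

Under these assumed numbers, greedy search commits to "nice" at the first step and can no longer reach any continuation of "dog".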

Encode the context that conditions the generation:

In [27]:
input_ids = tokenizer.encode('The best way to spend your time on the weekend is', return_tensors='tf')
input_ids

<tf.Tensor: shape=(1, 11), dtype=int32, numpy=
array([[ 464, 1266,  835,  284, 4341,  534,  640,  319,  262, 5041,  318]],
      dtype=int32)>

Generate text until the output length *(which includes the context length)* reaches `MAX_LEN`:

In [28]:
greedy_output = model.generate(input_ids, max_length=MAX_LEN)
greedy_output

<tf.Tensor: shape=(1, 45), dtype=int32, numpy=
array([[  464,  1266,   835,   284,  4341,   534,   640,   319,   262,
         5041,   318,   284,   467,   284,   257,  1957,  2318,   393,
         7072,   290,  1949,   617,   286,   262,  1266, 34281,   287,
         3240,    13,   198,   198,   464,  1266,   835,   284,  4341,
          534,   640,   319,   262,  5041,   318,   284,   467,   284]],
      dtype=int32)>
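Because `max_length` counts the context tokens as well, the number of newly generated tokens is smaller than `MAX_LEN`. A small check with the ids shown above (the variable names `context_ids` and `new_token_budget` are made up for this illustration):

```python
MAX_LEN = 45
# Token ids of the encoded context, copied from the output above.
context_ids = [464, 1266, 835, 284, 4341, 534, 640, 319, 262, 5041, 318]

# max_length caps context + generated tokens, so the budget for new tokens is:
new_token_budget = MAX_LEN - len(context_ids)
print(len(context_ids), new_token_budget)  # 11 34
```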

Decode the generated ids to words:

In [29]:
tokenizer.decode(greedy_output[0], skip_special_tokens=True)

'The best way to spend your time on the weekend is to go to a local bar or restaurant and try some of the best cocktails in town.\n\nThe best way to spend your time on the weekend is to go to'

The words generated after the context are reasonable, but the model quickly starts repeating itself. This is a common problem in language generation (see [Vijayakumar et al., 2016](https://arxiv.org/abs/1610.02424) and [Shao et al., 2017](https://arxiv.org/abs/1701.03185)).

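The major drawback of greedy search is that it misses high-probability words hidden behind a low-probability word. In the figure above, the high-probability word "has" follows "dog", which greedy search never selects because "nice" is more likely at the first step.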

=> beam search


### **Beam search**

Beam search reduces the risk of missing hidden high-probability word sequences by keeping the `num_beams` most likely hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability.

For `num_beams=2`:

![Beam search](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/beam_search.png)

1. At time step 1, besides the most likely hypothesis $\text{"The", "nice"}$, beam search also keeps track of the second most likely one, $\text{"The", "dog"}$.

2. At time step 2, beam search finds that the word sequence $\text{"The", "dog", "has"}$ has a higher probability ($0.36$) than $\text{"The", "nice", "woman"}$, which has $0.2$.

Beam search typically finds an output sequence with a probability at least as high as that found by greedy search, but it is not guaranteed to find the most likely output.
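The two time steps can be traced with a small, self-contained sketch. The toy probabilities are assumptions read off the figure (chosen so that the two paths score $0.36$ and $0.2$ as stated above), and `TOY_MODEL` and `beam_search` are hypothetical names, not the library's implementation.

```python
# Toy next-word distributions keyed by the words generated so far.
# The numbers are illustrative assumptions, not outputs of GPT-2.
TOY_MODEL = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def beam_search(model, context, steps, num_beams):
    """Keep the `num_beams` most probable hypotheses after every step."""
    beams = [(tuple(context), 1.0)]
    for _ in range(steps):
        candidates = []
        for sequence, probability in beams:
            for word, word_prob in model[sequence].items():
                candidates.append((sequence + (word,), probability * word_prob))
        # Rank all expansions by sequence probability and keep the best ones.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


beams = beam_search(TOY_MODEL, ["The"], steps=2, num_beams=2)
best_sequence, best_probability = beams[0]
print(best_sequence, round(best_probability, 2))

# With a single beam the same procedure reduces to greedy search.
single_beam = beam_search(TOY_MODEL, ["The"], steps=2, num_beams=1)
print(single_beam[0][0], round(single_beam[0][1], 2))
```

The sketch multiplies raw probabilities for readability; practical implementations usually sum log-probabilities instead to avoid numerical underflow on long sequences.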

In [17]:
# activate the beam search by setting num_beams > 1

# early_stopping=True => the generation is finished when all beam hypotheses
# have reached the EOS token

# n-gram (word sequences of n words) penalties:
# no_repeat_ngram_size=2 => no 2-gram appears twice

beam_output = model.generate(
    input_ids,
    max_length=MAX_LEN,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)

tokenizer.decode(beam_output[0], skip_special_tokens=True)

'The best way to spend your time on the weekend is to take a break from work and spend time with your family, friends, and loved ones.\n\nIf you have any questions, please feel free to reach out to'

Repetition does not appear anymore.

Nevertheless, *n-gram* penalties have to be used with care. An article generated about the city *New York* should not use a *2-gram* penalty; otherwise, the name of the city would appear only once in the whole text!
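To see why the *New York* case goes wrong, here is a minimal sketch of how an n-gram penalty can decide which next tokens to forbid. It works on whole words for readability (GPT-2 actually operates on sub-word token ids), and `banned_next_tokens` is a hypothetical helper, not the library's function.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens form the prefix of the next potential n-gram.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


words = "New York is big . New".split()
print(banned_next_tokens(words, ngram_size=2))  # {'York'}
```

After the second "New", the word "York" is forbidden because the 2-gram "New York" has already been produced once.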

### **Sampling**

Randomly pick the next word $w_t$ according to its conditional probability distribution:

$$w_t \sim P(w|w_{1:t-1})$$

Taking the example from above, the following graphic visualizes language generation when sampling.

![vanilla_sampling](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/sampling_search.png)

(Language generation using sampling is no longer *deterministic*. The word $\text{"car"}$ is sampled from the conditional probability distribution $P(w | \text{"The"})$, followed by sampling $\text{"drives"}$ from $P(w | \text{"The"}, \text{"car"})$.)
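A single sampling step can be sketched with the standard library alone. The distribution for $P(w | \text{"The"})$ below is an assumption read off the figure, and `sample_next_word` is a hypothetical helper; a fixed seed is used only to make the illustration repeatable.

```python
import random


def sample_next_word(next_word_probs, rng):
    """Draw one word according to its conditional probability."""
    words = list(next_word_probs)
    weights = [next_word_probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded generator, for a repeatable illustration
next_word_probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}  # assumed P(w | "The")

draws = [sample_next_word(next_word_probs, rng) for _ in range(10_000)]
frequencies = {word: draws.count(word) / len(draws) for word in next_word_probs}
print(frequencies)
```

Over many draws the empirical frequencies approach the assumed probabilities, so even the unlikely word "car" is picked roughly one time in ten.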


In [21]:
# activate sampling (do_sample=True) and deactivate top_k sampling (top_k=0)

sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=MAX_LEN,
    top_k=0
)

tokenizer.decode(sample_output[0], skip_special_tokens=True)

"The best way to spend your time on the weekend is to drop by your local Jenks cafe. And if you're in every hotel, you can choose from a range of hand-crafted and hand stocked South Asian clothing to"

A common problem when sampling word sequences: The models often generate incoherent gibberish, *cf.* [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751).

A trick is to make the distribution $P(w|w_{1:t-1})$ sharper (increasing the likelihood of high probability words and decreasing the likelihood of low probability words) by lowering the so-called `temperature` of the [softmax](https://en.wikipedia.org/wiki/Softmax_function#Smooth_arg_max).

An illustration of applying temperature to our example from above could look as follows.

![sampling_with_temperature](https://github.com/patrickvonplaten/scientific_images/blob/master/sampling_search_with_temp.png?raw=true)

The conditional next-word distribution of step $t=1$ becomes much sharper, leaving almost no chance for the word $\text{"car"}$ to be selected.
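The sharpening effect can be reproduced with a few lines of plain Python. This is a sketch under the assumption that temperature divides the logits before the softmax; the starting probabilities are assumed values for the example, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by `temperature` (must be > 0)."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Logits that give the assumed probabilities 0.5, 0.4, 0.1 at temperature 1.
logits = [math.log(0.5), math.log(0.4), math.log(0.1)]

for temperature in (1.0, 0.7, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature drops, more and more probability mass moves to the most likely word, which is why very low temperatures behave almost like greedy decoding.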


Let's see how we can cool down the distribution in the library by setting `temperature=0.7`:

In [22]:
# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=MAX_LEN,
    top_k=0,
    temperature=0.7
)

tokenizer.decode(sample_output[0], skip_special_tokens=True)

'The best way to spend your time on the weekend is to get out of your car and walk to a nice place to eat and relax.'


The output is a bit more coherent now.

While applying temperature can make a distribution less random, in the limit `temperature` $\to 0$, temperature-scaled sampling becomes equal to greedy decoding and suffers from the same problems as before.



### **Top-K Sampling**

Only the *K* most likely next words are kept, and the probability mass is redistributed among those *K* words.

![top_k_sampling](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/top_k_sampling.png)

Having set $K = 6$, we limit the sampling pool to 6 words in both sampling steps. While the 6 most likely words, defined as $V_{\text{top-K}}$, encompass only *ca.* two-thirds of the whole probability mass in the first step, they include almost all of the probability mass in the second step. Nevertheless, we see that *Top-K* successfully eliminates the rather weird candidates $\text{"not", "the", "small", "told"}$ in the second sampling step.
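The filter-and-renormalize step can be sketched as follows. The distribution is a made-up example (not the one in the figure), and `top_k_filter` is a hypothetical helper rather than the library's implementation.

```python
def top_k_filter(next_word_probs, k):
    """Keep the k most likely words and renormalize their probabilities."""
    top_words = sorted(next_word_probs, key=next_word_probs.get, reverse=True)[:k]
    total = sum(next_word_probs[word] for word in top_words)
    return {word: next_word_probs[word] / total for word in top_words}


# Made-up next-word distribution for illustration only.
next_word_probs = {
    "nice": 0.35, "dog": 0.25, "car": 0.15, "woman": 0.10,
    "guy": 0.08, "man": 0.04, "house": 0.03,
}

filtered = top_k_filter(next_word_probs, k=3)
print({word: round(prob, 3) for word, prob in filtered.items()})
```

Sampling then proceeds from `filtered` instead of the full distribution, so words outside the top *K* can never be chosen.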

In [23]:
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=MAX_LEN,
    top_k=50
)

tokenizer.decode(sample_output[0], skip_special_tokens=True)

'The best way to spend your time on the weekend is to check out what life is like at the top-notch university in China or Shanghai. You can also meet some of the best Chinese students, teachers, students,'

One concern with *Top-K* sampling is that it does not dynamically adapt the number of words that are filtered from the next-word probability distribution $P(w|w_{1:t-1})$. This can be problematic, as some words might be sampled from a very sharp distribution (right side of the graph above), whereas others are sampled from a much flatter distribution (left side).

In step $t=1$, *Top-K* eliminates the possibility of sampling $\text{"people", "big", "house", "cat"}$, which seem like reasonable candidates. On the other hand, in step $t=2$ the method includes the arguably ill-fitting words $\text{"down", "a"}$ in the sample pool. Thus, limiting the sample pool to a fixed size *K* risks producing gibberish for sharp distributions and limits the model's creativity for flat distributions.
This intuition led [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751) to create ***Top-p*** (or ***nucleus***) sampling.



### **Top-p (nucleus) sampling**

Instead of sampling only from the most likely *K* words, we choose from the smallest possible set of words whose cumulative probability exceeds the probability *p*. The probability mass is then redistributed among this set of words => the size of the set of words can dynamically increase and decrease according to the next word's probability distribution.

![top_p_sampling](https://github.com/patrickvonplaten/scientific_images/blob/master/top_p_sampling.png?raw=true)

Having set $p=0.92$, *Top-p* sampling picks the *minimum* number of words whose combined probability exceeds $p=92\%$ of the probability mass, defined as $V_{\text{top-p}}$. In the first example, this includes the 9 most likely words, whereas in the second example it only has to pick the top 3 words to exceed 92%.
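How the nucleus grows and shrinks can be sketched in plain Python. Both distributions below are made-up examples (one flat, one sharp), `top_p_filter` is a hypothetical helper, and the sketch uses "at least $p$" as the stopping rule, which is an assumption about the boundary case.

```python
def top_p_filter(next_word_probs, p):
    """Keep the smallest set of top words whose cumulative probability reaches p."""
    ranked = sorted(next_word_probs.items(), key=lambda item: item[1], reverse=True)
    nucleus = []
    cumulative = 0.0
    for word, prob in ranked:
        nucleus.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {word: prob / cumulative for word, prob in nucleus}


# Made-up distributions for illustration only.
flat = {f"word{i}": 0.1 for i in range(10)}
sharp = {"drives": 0.8, "is": 0.15, "turns": 0.03, "stops": 0.02}

print(len(top_p_filter(flat, p=0.92)), len(top_p_filter(sharp, p=0.92)))
```

With the same `p`, the flat distribution keeps many candidates while the sharp one keeps only a few, which is exactly the dynamic behavior that a fixed *K* cannot provide.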

In [30]:
# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=MAX_LEN,
    top_p=0.92,
    top_k=0
)

tokenizer.decode(sample_output[0], skip_special_tokens=True)

'The best way to spend your time on the weekend is with friends and family. Caring for yourself and creating and recording music is a rewarding way to spend a few minutes at the gym each day.\n\n– Pray'

*Top-p* can also be used in combination with *Top-K*, which can avoid very low ranked words while allowing for some dynamic selection.

In [39]:
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=MAX_LEN,
    top_k=10,
    top_p=0.95
)

tokenizer.decode(sample_outputs[0], skip_special_tokens=True)

"The best way to spend your time on the weekend is to get a few friends and family together and watch TV. I've heard of other people who do this on a Friday or Saturday morning and they do it in their spare"

### Additional parameters

- `min_length` - to force the model to not produce an EOS token (= not finish the sentence) before `min_length` is reached, *used quite frequently in summarization*
- `repetition_penalty` - to penalize words that were already generated or belong to the context
- `attention_mask` - can be used to mask padded tokens
- `pad_token_id`, `bos_token_id`, `eos_token_id` - if the model does not have those tokens by default, the user can manually choose other token ids to represent them
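To make the effect of `repetition_penalty` more concrete, here is a minimal sketch of one common formulation: logits of tokens that were already seen are divided by the penalty if positive and multiplied by it if negative, so a penalty above 1 always makes them less likely. The function name and the toy logits are hypothetical, and the library's actual implementation may differ in details.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Lower the logits of tokens that already appeared (assumes penalty > 1)."""
    penalized = list(logits)
    for token_id in set(seen_token_ids):
        if penalized[token_id] > 0:
            penalized[token_id] /= penalty
        else:
            penalized[token_id] *= penalty
    return penalized


# Toy logits for a 4-token vocabulary; tokens 0 and 1 were already generated.
toy_logits = [2.0, -1.0, 0.5, 3.0]
penalized_logits = apply_repetition_penalty(toy_logits, seen_token_ids=[0, 1, 0], penalty=2.0)
print(penalized_logits)  # [1.0, -2.0, 0.5, 3.0]
```

Tokens that have not been generated yet keep their original logits, so only the already seen tokens lose probability after the softmax.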

For more information please also look into the `generate` function [docstring](https://huggingface.co/transformers/main_classes/model.html?highlight=generate#transformers.TFPreTrainedModel.generate).