# Text Generation with GPT2
https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb#scrollTo=OWLd_J6lXz_t

In [2]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the tokenizer first so its EOS token id can be reused as the model's pad token id.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

In [15]:
# encode context the generation is conditioned on
tokenizer.pad_token = tokenizer.eos_token
input_ids = tokenizer.encode('Donald Trump is the greatest president', return_tensors='pt', padding=True)

## Greedy and Beam Search

In [16]:
# Greedy Output
greedy_output = model.generate(input_ids, max_length=50)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Donald Trump is the greatest president in history. He's the greatest president in history. He's the greatest president in history. He's the greatest president in history. He's the greatest president in history. He's the greatest president in history. He


In [17]:
# Beam Search
beam_output = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Donald Trump is the greatest president in the history of the United States. He is the greatest president in the history of the United States. He is the greatest president in the history of the United States. He is the greatest president in the history of the


In [26]:
# Beam Search with no_repeat_ngram_size=2.
# This makes it impossible for any 2-gram to repeat in the generated text.
beam_outputs_2 = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, num_return_sequences=5, early_stopping=True)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_outputs_2[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Donald Trump is the greatest president in the history of the United States of America," he said.
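
To make the effect of `no_repeat_ngram_size=2` more concrete, here is a minimal pure-Python sketch of the idea: before each step, find every token that would complete an n-gram already present in the text and exclude it. The helper `banned_next_tokens` and the toy word list are hypothetical illustrations and are not taken from the `transformers` implementation.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last n-1 tokens are the prefix of the n-gram that the next token would complete.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

# Toy example (hypothetical words, not real GPT2 tokens).
generated = ["he", "is", "the", "best", "he"]
banned = banned_next_tokens(generated, ngram_size=2)
print(banned)
```

In this toy sequence the 2-gram "he is" has already occurred, so "is" is excluded as the next word after the trailing "he".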


In [29]:
# Beam Search with no_repeat_ngram_size=2, multiple return sequences
# This makes it impossible for any 2-gram to repeat in the generated text.
beam_outputs = model.generate(input_ids, max_length=100, num_beams=5, no_repeat_ngram_size=2, num_return_sequences=5,  early_stopping=False)
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: Donald Trump is the greatest president in the history of the United States of America," he said.
1: Donald Trump is the greatest president in the history of the United States of America," he said in a statement.

"I am proud to be a part of his administration, and I look forward to working with him to make America great again."
2: Donald Trump is the greatest president in the history of the United States of America," he said in a statement.

"I am proud to be a part of his administration, and I look forward to working with him to make America great again," Trump added.
3: Donald Trump is the greatest president in the history of the United States of America," he said in a statement.

"I am proud to be a part of his administration, and I look forward to working with him to make America great again," Trump said.
4: Donald Trump is the greatest president in the history of the Uni

## Sampling


In [54]:
input_ids = tokenizer.encode('Donald Trump is a moron who', return_tensors='pt', padding=True)

In [58]:
torch.manual_seed(0)
sample_output = model.generate(input_ids, do_sample=True, max_length=100,top_k=0)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Donald Trump is a moron who could easily be impeached--atorical and opportunistic. But Trump's being a moron is also a desperate act of desperation. How bad would it be for him, serving four times as president? 20 5/9 Donald J. Trump Jr The Donald reportedly met with a Russian lawyer during the 2016 campaign in what was described in intelligence reports as an 'off the cuff' way of getting information, though he denies these reports. The Russian lawyer worked on one


### Sampling with temperature
**Issues with Sampling**
The n-grams produced by pure sampling often read strangely and do not sound as if a human wrote them. This is the main problem with sampling word sequences: the model often generates incoherent gibberish, cf. Ari Holtzman et al. (2019). One trick is to make the distribution $P(w \mid w_{1:t-1})$ sharper (increasing the likelihood of high-probability words and decreasing the likelihood of low-probability words) by lowering the so-called **temperature** of the softmax.
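
As a rough illustration of what lowering the temperature does, the sketch below divides a set of scores by the temperature before applying the softmax. The function name `softmax_with_temperature` and the toy scores are assumptions made for this example; they are not part of the `transformers` API.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next words.
toy_logits = [2.0, 1.0, 0.1]
probs_default = softmax_with_temperature(toy_logits, temperature=1.0)
probs_sharp = softmax_with_temperature(toy_logits, temperature=0.3)
print(probs_default)
print(probs_sharp)
```

With the lower temperature, the most likely word receives a larger share of the probability mass and the least likely word a smaller one, which is the sharpening described above.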

In [60]:
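# torch.seed() reseeds the generator with a non-deterministic value and returns it; print it so the run can be reproduced.
print(torch.seed())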
sample_output_2 = model.generate(input_ids, do_sample=True, max_length=100, top_k=0, temperature=0.3)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output_2[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Donald Trump is a moron who has no idea how to deal with the world. He's a moron who doesn't understand how to deal with the world. He's a moron who doesn't understand how to deal with the world. He's a moron who doesn't understand how to deal with the world. He's a moron who doesn't understand how to deal with the world. He's a moron who doesn't understand how to deal with the world. He's a


### Top-K Sampling
In Top-K sampling, only the K most likely next words are kept, and the probability mass is redistributed among those K words. GPT2 adopted this sampling scheme, which was one of the reasons for its success in story generation.
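
The filtering step can be sketched in a few lines of plain Python. The helper `top_k_filter` and the toy distribution are hypothetical; the convention that `k=0` leaves the distribution unchanged is chosen here to match how the cells in this notebook pass `top_k=0` to switch Top-K off.

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability words and renormalise them to sum to 1."""
    if k <= 0 or k >= len(probs):
        # Nothing to filter: return a copy of the full distribution.
        return dict(probs)
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}

# Hypothetical next-word distribution.
toy_probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
top2 = top_k_filter(toy_probs, k=2)
print(top2)
```

Only "the" and "a" survive with `k=2`, and their probabilities are rescaled so that they sum to 1 again.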

In [59]:
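# torch.seed() reseeds the generator with a non-deterministic value and returns it; print it so the run can be reproduced.
print(torch.seed())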
sample_output_3 = model.generate(input_ids, do_sample=True, max_length=100, top_k=20)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output_3[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Donald Trump is a moron who should not be elected in the first-ever Republican debate in June. He has been doing his worst to address the nation's growing problem of climate change.

"Climate change poses a huge threat to life and the earth. It's time to take action to make a difference," he told reporters on Friday at the National Mall in Washington, D.C.

The president took questions about climate change in his first press conference since he won the presidency on


### Top-p (nucleus) sampling
Limiting the sample pool to a fixed size K can lead the model to produce gibberish for sharp distributions and can limit its creativity for flat distributions. This intuition led Ari Holtzman et al. (2019) to propose Top-p, or nucleus, sampling.

Instead of sampling only from the K most likely words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words. This way, the number of words in the set can dynamically grow and shrink according to the next word's probability distribution.
**Use this**.
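
The following sketch shows how the size of the nucleus adapts to the shape of the distribution. The helper `top_p_filter` and both toy distributions are hypothetical, and the stopping rule used here (stop once the running total reaches p) is one simple reading of the description above rather than the exact rule used by `transformers`.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top words whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalise the surviving words so they sum to 1.
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

# Two hypothetical next-word distributions: one flat, one sharp.
flat = {"the": 0.3, "a": 0.3, "cat": 0.2, "dog": 0.2}
sharp = {"the": 0.95, "a": 0.03, "cat": 0.01, "dog": 0.01}
nucleus_flat = top_p_filter(flat, p=0.92)
nucleus_sharp = top_p_filter(sharp, p=0.92)
print(len(nucleus_flat), len(nucleus_sharp))
```

For the flat distribution all four words are needed to reach 0.92, while for the sharp one a single word is enough, so the candidate pool shrinks exactly when the model is confident.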


In [66]:
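# Reseed with a fresh non-deterministic value and print it so the run can be reproduced.
print(torch.seed())
# deactivate top_k sampling and sample only from the smallest set of words whose cumulative probability exceeds 0.92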
sample_output_4 = model.generate(input_ids, do_sample=True, max_length=100, top_p=0.92, top_k=0)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output_4[0], skip_special_tokens=True))

18018005993559821148
Output:
----------------------------------------------------------------------------------------------------
Donald Trump is a moron who has lost trust in our government and has so far abandoned all the policies that have secured American prosperity.


## Appendix

### Just for fun

In [82]:
input_ids_moron = tokenizer.encode('Donald Trump is a moron who', return_tensors='pt', padding=True)
input_ids_great = tokenizer.encode('Donald Trump is the greatest president', return_tensors='pt', padding=True)

for i in range(100):
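    # Reseed with a fresh non-deterministic value and print it so each run can be reproduced.
    print(torch.seed())
    # deactivate top_k sampling and sample only from the smallest set of words whose cumulative probability exceeds 0.92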
    sample_output_mo = model.generate(input_ids_moron, do_sample=True, max_length=100, top_p=0.92, top_k=0)
    sample_output_ge = model.generate(input_ids_great, do_sample=True, max_length=100, top_p=0.92, top_k=0)
    res_mo= tokenizer.decode(sample_output_mo[0], skip_special_tokens=True)
    res_ge = tokenizer.decode(sample_output_ge[0], skip_special_tokens=True)
    with open('trump_moron.txt', 'a+') as f:
        f.write(res_mo+"\n"+"-"*40+"\n")
    with open('trump_great.txt', 'a+') as ff:
        ff.write(res_ge+"\n"+"-"*40+"\n")
    print(res_mo, "\n", res_ge)

12973048098922993882
Donald Trump is a moron who should have no presidency," said any reporter at any time during the event. "If we get caught with this old media clamor to say he won't even run, and then they say he won't even run, and then they see how his brother didn't do that, so they've got to go on the Hill right?" The men who joined in repeated calls for the White House to release transcripts of Trump's public statements with regard to tax returns came out 
 Donald Trump is the greatest president in the history of the United States. I'd say the modern U.S. president would be much less well-served by its poor record.

Here, Clinton takes a firm stand against ISIS. And she points out that there are obviously other legal, constitutional issues.

In November, when ISIS attacked the U.S., Secretary of State John Kerry spoke out vigorously against the group, calling it "new warfare" and urging Americans to fight. Here
10692499086323582164
Donald Trump is a moron who should be ashamed
