### **Text Generation with GPT-2:**

## it_nlanlpdj_03_enus_08

In [19]:
!pip install transformers



In [20]:
import tensorflow as tf
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [21]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")

In [22]:
model = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)

In [23]:
sentence = 'The internet is a'

In [24]:
input_ids = tokenizer.encode(sentence, return_tensors='pt')
input_ids

tensor([[ 464, 5230,  318,  257]])

In [25]:
# generate text until the output length (which includes the context length) reaches 50
output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
output

tensor([[  464,  5230,   318,   257,  7932,  1517,    11,   475,   340,   460,
           635,   307,   257,  7818,  1517,    13,   632,   460,   307,   973,
           355,   257,  2891,   284,  7671,    11, 27410,    11,   290, 34903,
           661,    11,  2592,  1466,    13,   198,   198,    40,  1101,   407,
          1016,   284,   467,   656,   477,   262,  2842,   326,   262,  5230]])

In [26]:
print(tokenizer.decode(output[0], skip_special_tokens=True))

The internet is a wonderful thing, but it can also be a terrible thing. It can be used as a tool to harass, bully, and intimidate people, especially women.

I'm not going to go into all the ways that the internet


## it_nlanlpdj_03_enus_09

### **Greedy Search:**

In [27]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('Python is a very powerful computer language', return_tensors='pt')

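# generate text until the output length (which includes the context length) reaches 50
# note: num_beams=5 and no_repeat_ngram_size=2 make this call a beam search, despite the variable name;
# plain greedy search would be model.generate(input_ids, max_length=50)
greedy_output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)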

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Python is a very powerful computer language, but it is not a programming language in the traditional sense of the word. It is more like a set of tools that you can use to write programs.

In Python, a program is just a collection
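
Greedy search appends the single most probable next token at every step. The sketch below illustrates that rule on a tiny hand-written lookup table instead of GPT-2. `TOY_MODEL`, its probabilities, and `greedy_decode` are hypothetical names made up for this illustration, not part of `transformers`.

```python
# Hypothetical toy "language model": maps the last token to a next-token distribution.
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.5, "drives": 0.3, "turns": 0.2},
}


def greedy_decode(model, start, max_length):
    """Repeatedly append the most probable next token until max_length is reached."""
    tokens = [start]
    prob = 1.0
    while len(tokens) < max_length and tokens[-1] in model:
        next_probs = model[tokens[-1]]
        best = max(next_probs, key=next_probs.get)  # argmax over next tokens
        prob *= next_probs[best]
        tokens.append(best)
    return tokens, prob


greedy_tokens, greedy_prob = greedy_decode(TOY_MODEL, "The", max_length=3)
print(greedy_tokens, round(greedy_prob, 2))  # ['The', 'nice', 'woman'] 0.2
```

In this toy table the greedy path has probability 0.2, although "The dog has" would reach 0.36. Greedy search misses it because "dog" is not the best choice at the first step.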


### **Beam search:**

In [28]:
# activate beam search and early_stopping
beam_output = model.generate(
    input_ids,  
    max_length=50, 
    num_beams=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Python is a very powerful computer language, but it is not a programming language in the traditional sense of the word. Instead, it is an object-oriented programming language.

Object-oriented programming is a programming paradigm that emphasizes the separation of concerns
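
Beam search keeps the `num_beams` most probable partial sequences at each step instead of only one. The sketch below shows the idea on a hand-written lookup table. `TOY_MODEL` and `beam_search` are hypothetical names for this illustration, and the real `generate` implementation additionally handles batching, end-of-sequence tokens, and length penalties.

```python
import math

# Hypothetical toy "language model": maps the last token to a next-token distribution.
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.5, "drives": 0.3, "turns": 0.2},
}


def beam_search(model, start, max_length, num_beams):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_length - 1):
        candidates = []
        for tokens, score in beams:
            next_probs = model.get(tokens[-1])
            if next_probs is None:
                candidates.append((tokens, score))  # sequence cannot be extended
                continue
            for token, p in next_probs.items():
                candidates.append((tokens + [token], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


best_tokens, best_score = beam_search(TOY_MODEL, "The", max_length=3, num_beams=2)[0]
print(best_tokens, round(math.exp(best_score), 2))  # ['The', 'dog', 'has'] 0.36
```

With `num_beams=1` the same function reduces to greedy search and returns "The nice woman" instead.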


In [29]:
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Python is a very powerful computer language, but it is not a programming language in the traditional sense of the word. It is more like a set of tools that you can use to write programs.

In Python, a program is just a collection
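
Setting `no_repeat_ngram_size=2` prevents any 2-gram from appearing twice in the output. One way to picture this: before each step, find every token that would complete an n-gram already present in the sequence, and exclude it. The helper `banned_next_tokens` below is a hypothetical illustration of that idea, not the library's code.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - ngram_size + 1:])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tokens[i:i + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


# "the cat" already occurred, so "cat" may not follow the second "the".
print(banned_next_tokens("the cat sat on the".split(), 2))  # {'cat'}
```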


In [30]:
# set num_return_sequences > 1 (it must not exceed num_beams)
beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)

# now we have 5 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: Python is a very powerful computer language, but it is not a programming language in the traditional sense of the word. It is more like a set of tools that you can use to write programs.

In Python, a program is just a collection
1: Python is a very powerful computer language, but it is not a programming language in the traditional sense of the word. It is more like a set of tools that you can use to write programs.

In Python, a program is just a sequence
2: Python is a very powerful computer language, but it is not a programming language in the traditional sense of the word. It is more like a set of tools that you can use to write programs.

In Python, you write a program, and
3: Python is a very powerful computer language, but it is not a programming language in the traditional sense of the word. It is more like a set of tools that you can use to write programs.

In Python

### **Sampling:**

In [31]:
# set seed to reproduce results. Feel free to change the seed though to get different results
# (generate() samples with PyTorch here, so seed torch rather than TensorFlow)
import torch
torch.manual_seed(0)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Python is a very powerful computer language. But, if the language is too punishing as a producer of errors, you risk losing the benefit of a lot of useful and powerful software tools. This is true whether you call it monkey-patching, which
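
With `do_sample=True` and `top_k=0`, the next token is drawn at random from the full next-token distribution rather than chosen by argmax. A minimal sketch of that draw, using only the standard library: the distribution `next_probs` and the helper `sample_next_token` are hypothetical stand-ins for the model's output.

```python
import random


def sample_next_token(next_probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(next_probs)
    weights = [next_probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


next_probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}  # hypothetical distribution
rng = random.Random(0)  # a fixed seed makes the draws reproducible
draws = [sample_next_token(next_probs, rng) for _ in range(10000)]
counts = {t: draws.count(t) for t in next_probs}
print(counts)  # counts are roughly proportional to the probabilities
```

Even the unlikely token "car" gets picked about one time in ten, which is why unrestricted sampling can drift into incoherent text.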


In [32]:
# set seed to reproduce results. Feel free to change the seed though to get different results
# (generate() samples with PyTorch here, so seed torch rather than TensorFlow)
import torch
torch.manual_seed(0)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0, 
    temperature=0.7
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Python is a very powerful computer language. If you want a memory-intensive programming language, you can get by with a scripting language.

Python is very powerful in its ability to translate your code into machine code, but it does have a few
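
Temperature divides the logits before the softmax. Values below 1 sharpen the distribution, so likely tokens become more likely and unlikely ones less likely. A small sketch with made-up logits follows. `softmax_with_temperature` is a hypothetical helper written for this illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.7)
print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_sharp])
```

As the temperature approaches 0 the distribution concentrates on the top token and sampling behaves more and more like greedy search.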


## it_nlanlpdj_03_enus_10

### **Top K Sampling:**

In [33]:
# set seed to reproduce results. Feel free to change the seed though to get different results
# (generate() samples with PyTorch here, so seed torch rather than TensorFlow)
import torch
torch.manual_seed(0)

# set top_k to 50
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Python is a very powerful computer language. It has many built-in functions, and many of those functions are built the same way in all languages, including Perl 2. Perl 2 has a good number of built-in functions to simplify Perl's API
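
Top-K sampling keeps only the K most probable next tokens, renormalizes their probabilities, and samples from that reduced set. The filtering step can be sketched as follows. `top_k_filter` and the example distribution are hypothetical, for illustration only.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}


probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}  # hypothetical distribution
filtered = top_k_filter(probs, k=2)
print(filtered)  # "car" is removed; "nice" and "dog" are rescaled
```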


### **Top P Sampling:**

In [34]:
# set seed to reproduce results. Feel free to change the seed though to get different results
# (generate() samples with PyTorch here, so seed torch rather than TensorFlow)
import torch
torch.manual_seed(0)

# deactivate top_k sampling and sample only from the smallest set of most likely tokens
# whose cumulative probability reaches 0.92
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Python is a very powerful computer language that runs on a variety of processor types and runs at speed exceeding 1,000,000 lines of code per second (though you'll never have that many lines of code running on your own computer). It is used
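
Top-P (nucleus) sampling does not fix the number of candidates. It keeps the smallest set of most probable tokens whose cumulative probability reaches `top_p`, so the set grows when the model is uncertain and shrinks when it is confident. A sketch of the filtering step, with a hypothetical helper `top_p_filter` and a made-up distribution:

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose cumulative probability reaches top_p."""
    kept = []
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"nice": 0.5, "dog": 0.3, "car": 0.15, "house": 0.05}  # hypothetical distribution
nucleus = top_p_filter(probs, top_p=0.92)
print(sorted(nucleus))  # ['car', 'dog', 'nice']
```

Here three tokens are needed to reach 0.92. With a more peaked distribution, a single token could be enough.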


## it_nlanlpdj_03_enus_11

In [35]:
# set seed to reproduce results. Feel free to change the seed though to get different results
# (generate() samples with PyTorch here, so seed torch rather than TensorFlow)
import torch
torch.manual_seed(0)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=50, 
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: Python is a very powerful computer language. It is used to construct algorithms and computer programs for various tasks such as image processing, language translation and machine learning.

Python is popular with computer scientists and developers because of its ease of use, its support
1: Python is a very powerful computer language that supports efficient programming by providing a set of convenient primitives to facilitate programming in the languages with the most expressive syntax. It also facilitates code organization, documentation, and maintenance, as well as the development of libraries for
2: Python is a very powerful computer language with a very diverse set of libraries in different languages. It is a powerful development system that compiles to machine code (a "program" in other words). When compiled into machine code it can be compiled into many
