# APPM X720 Biweekly Report

### *Alexey Yermakov*
### *April 19, 2022*

# Summary

For this report, I focus on different sampling methods. This report was inspired by Nucleus Sampling from the paper [The Curious Case of Neural Text Degeneration](https://arxiv.org/pdf/1904.09751.pdf) covered in class.

# Main Content

I started by using [GPT2 from HuggingFace](https://huggingface.co/gpt2?text=My+name+is+Alexey%2C+which+rhymes+with+%22We%27re+the+One%21%22+%28which%2C+to+be+honest%2C+I+don%27t+understand%29.%0A%0AI+have+always+been+a+huge+fan+of+Alexey%27s+personality.+We+often+played+together+at+the+same) since I was under the impression it would be easy to use and would lend itself nicely to testing through text generation. This intuition was correct, as we'll see.

## GPT-2 Text Generation

First, I have my typical imports.

In [1]:
# Import everything
import torch
from transformers import GPT2Tokenizer, GPT2Model, pipeline, set_seed, GenerationConfig

# Check if GPU is available
print("GPU Available?",torch.cuda.is_available())
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"

GPU Available? True


Then, I load [GPT2](https://huggingface.co/gpt2) from HuggingFace. Note that "this is the smallest version of GPT-2, with 124M parameters". There are ways to load the larger models, but I don't really care about that for this particular report. I loaded the smaller model since it runs quicker.

In [2]:
# Load HuggingFace's GPT2 small model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

Fortunately, text generation is wicked easy to use. The code below is from HuggingFace's [GPT2](https://huggingface.co/gpt2?text=My+name+is+Alexey%2C+which+rhymes+with+%22We%27re+the+One%21%22+%28which%2C+to+be+honest%2C+I+don%27t+understand%29.%0A%0AI+have+always+been+a+huge+fan+of+Alexey%27s+personality.+We+often+played+together+at+the+same) page. It generates a continuation of the initial text, which is `Hello, I'm a language model,`.

In [3]:
# Generate output from GPT2 with max_length=30 tokens
set_seed(42)
generator = pipeline('text-generation', model='gpt2')
output = generator("Hello, I'm a language model,", max_length=30, num_return_sequences=1)

print('Generated output:')
print(output[0]["generated_text"])
print()
print("Word length:")
print(len(output[0]['generated_text'].split(' ')))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated output:
Hello, I'm a language model, though I'm more interested in semantics.

How would you describe Python?

Python is a framework

Word length:
18


Great! So the text generation is *really easy to do*. First, I want to point out that the word count does *not* equal the `max_length` parameter. The latter counts tokens (including the prompt's tokens), and each token is at most one word long, often just a piece of a word or a punctuation mark. Second, there is a warning about generation configuration files. Visiting the [provided link](https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/text_generation) took me to [another page](https://huggingface.co/docs/transformers/generation_strategies) explaining how to use generation configuration files. This is excellent because it is exactly what we need to be able to test out nucleus sampling! The latter page shows how to get the generation configuration of a model. Let's see what it is for GPT2:

In [4]:
# Print GPT2 configurations
print(model.generation_config)

None


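Huh. At first glance, this looks normal! The docs say: `Printing out the model.generation_config reveals only the values that are different from the default generation configuration, and does not list any of the default values.` One caveat: that sentence describes a shorter printed config, not `None`. I loaded `GPT2Model`, which is the bare transformer without a language-modeling head, so the more likely reason for `None` is that this class is not meant for generation and carries no generation config at all; loading `GPT2LMHeadModel` should give a non-empty one. Either way, the pipeline above loads its own copy of the model, so this doesn't affect the generated text. Now let's see which generation parameters are worth modifying.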

From [this page](https://huggingface.co/docs/transformers/generation_strategies) we see that the default strategy is greedy search. We also see how to do beam search, and lastly how to include top-k and top-p/nucleus sampling:

```
The default decoding strategy is greedy search, which is the simplest decoding strategy that picks a token with the highest probability as the next token.

num_beams: by specifying a number of beams higher than 1, you are effectively switching from greedy search to beam search. This strategy evaluates several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with a lower probability initial tokens and would’ve been ignored by the greedy search.

do_sample: if set to True, this parameter enables decoding strategies such as multinomial sampling, beam-search multinomial sampling, Top-K sampling and Top-p sampling. All these strategies select the next token from the probability distribution over the entire vocabulary with various strategy-specific adjustments.
```

The actual parameters are listed [here](https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/text_generation#transformers.GenerationConfig), for the [GenerationConfig](https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/text_generation#transformers.GenerationConfig) class. We see the exact parameters that we want:

```
do_sample (bool, optional, defaults to False) — Whether or not to use sampling ; use greedy decoding otherwise.

num_beams (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.

temperature (float, optional, defaults to 1.0) — The value used to modulate the next token probabilities.

top_k (int, optional, defaults to 50) — The number of highest probability vocabulary tokens to keep for top-k-filtering.

top_p (float, optional, defaults to 1.0) — If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
```

Of course, there are many, many more parameters, but these are the ones that will let us try out the configurations in the paper [The Curious Case of Neural Text Degeneration](https://arxiv.org/pdf/1904.09751.pdf).
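To keep the strategies straight, here is a small sketch that collects one set of keyword arguments per decoding strategy. The keyword names come from the parameter list quoted above; the dictionary name, the strategy labels, and the helper `settings_for` are hypothetical and only for illustration. One assumption worth flagging: since `top_k` defaults to 50, a "pure" nucleus or random-sampling run needs top-k switched off, and my understanding is that `top_k=0` does that in the library, but I have not verified it against every version.

```python
# Hypothetical lookup table; only the keyword names (do_sample, num_beams,
# temperature, top_k, top_p) are taken from the GenerationConfig docs.
DECODING_SETTINGS = {
    "greedy": {"do_sample": False, "num_beams": 1},
    "beam_search": {"do_sample": False, "num_beams": 10},
    # Assumption: top_k=0 disables top-k filtering (the default is 50).
    "random_sampling": {"do_sample": True, "top_k": 0, "top_p": 1.0},
    "top_k": {"do_sample": True, "top_k": 20},
    "nucleus": {"do_sample": True, "top_k": 0, "top_p": 0.9},
    "top_k_with_temperature": {"do_sample": True, "top_k": 20, "temperature": 0.9},
}


def settings_for(strategy, max_new_tokens=150):
    """Return a fresh kwargs dict for one strategy (hypothetical helper)."""
    kwargs = dict(DECODING_SETTINGS[strategy])
    kwargs["max_new_tokens"] = max_new_tokens
    return kwargs


print(settings_for("nucleus"))
```

The returned dictionary is meant to be unpacked into the config, along the lines of `GenerationConfig(**settings_for("nucleus"))`.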

Below is a sample configuration, just to see how to use it. I set `max_new_tokens` to 10 to see if the output size changes from earlier.

In [7]:
# Generate output from GPT2 with 10 tokens
set_seed(42)
config = GenerationConfig(max_new_tokens=10, eos_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
generator("Hello, I'm a language model,")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, not a programming language. I'm a language model"}]

Okay, so two things worth noting here. First, the `config` did reduce the output size, showing me that this is working. Second, the output is different than before. My guess is that the pipeline's default generation settings for GPT2 are not the same as the defaults in the `GenerationConfig` class (for example, `do_sample` defaults to `False` there, which means greedy decoding), so passing a separate config changed the generation from earlier. This doesn't matter too much for this report since I'll be messing with the parameters anyway.

To get rid of the `pad_token_id` warning, I modified the config slightly.

In [9]:
# Generate GPT2 output without the pad_token_id warning
set_seed(42)
config = GenerationConfig(max_new_tokens=10, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
generator("Hello, I'm a language model,")

[{'generated_text': "Hello, I'm a language model, not a programming language. I'm a language model"}]

Excellent! I think we're now ready to test the different parameters for text generation. The paper states that *degeneration* leads to *output text that is bland, incoherent, or gets stuck in repetitive loops*. Let's try their example input

`In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley,
in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.`

in different settings.

In [12]:
# Generate GPT2 output with input from paper
set_seed(42)
config = GenerationConfig(max_new_tokens=150, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.")

print(output[0]["generated_text"])

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

"The unicorns were very intelligent, and they were very intelligent," said Dr. David S. Siegel, a professor of anthropology at the University of California, Berkeley. "They were very intelligent, and they were very intelligent, and they were very intelligent."

The researchers found that the unicorns were able to communicate with each other through their tongues.

"They were able to communicate with each other through their tongues," Siegel said. "They were able to communicate with each other through their tongues."

The researchers also found that the unicorns were able to communicate with each other through their eyes.

"They were able to communicate with each other through their eyes," Siegel said. "They were


Haha, so here we can see a few things mentioned in the paper:
- repeating text: "They were very intelligent, and they were very intelligent, and they were very intelligent."
- bland text: The researchers found that the unicorns were able to communicate with each other through their tongues. "They were able to communicate with each other through their tongues," Siegel said.
- incoherent text: the unicorns were able to communicate with each other through their eyes.

Note that my output here isn't exactly what's from the paper since they used GPT2-large, at 774M parameters.

## Top-k Sampling

To explain top-k sampling, I'll start by explaining random sampling. Random sampling takes the probabilities that the model's head assigns to each token (after softmax) and picks the next token at random according to those probabilities. Top-k sampling does the same thing, but only the k most probable tokens (k is some positive integer) are eligible, with their probabilities renormalized to sum to 1. So it effectively just cuts off the tail of the distribution at each inference step. Let's see how this affects our output.
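Before the real run, here is a minimal pure-Python sketch of a single top-k sampling step. The six-word vocabulary and the logits are made up, and the function names are hypothetical. As far as I understand, the library filters the logits before the softmax rather than the probabilities after it, but the resulting distribution is the same.

```python
import math
import random


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable entries, zero out the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / kept_mass
    return filtered


# Made-up vocabulary and logits, purely for illustration.
vocab = ["the", "a", "unicorn", "valley", "zebra", "qux"]
logits = [3.0, 2.5, 2.0, 0.5, -1.0, -3.0]

probs = softmax(logits)
top3 = top_k_filter(probs, k=3)

rng = random.Random(42)
draws = [vocab[rng.choices(range(len(vocab)), weights=top3, k=1)[0]] for _ in range(10)]
print(draws)  # only "the", "a" and "unicorn" can appear
```

With that picture in mind, the next cell runs the real thing on GPT2 with `top_k=20`.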

In [16]:
# Generate GPT2 output with top_k=20
set_seed(42)
config = GenerationConfig(max_new_tokens=150, do_sample=True, top_k=20, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.")

print(output[0]["generated_text"])

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

"They also were completely literate on language and grammar; they were very happy with their food and had even taken to eating human food," said the study's lead author, David E. Fuchs, Ph.D., an assistant professor of anthropology at the University of California, Santa Barbara and co-author of the paper. Fuchs was not immediately available for comment on the findings, but the discovery is consistent with a similar study conducted in the same region in 2012. It was published in Science in January, the same month that the first U.S. unicorns were sighted in the mountains.

The discovery also raises some fundamental questions about how the U.S. government's efforts to help indigenous peoples in Mexico —


Uh-huh, so this seems a lot more understandable to me! Intuitively, what's happening is we're introducing variability in the generated text by sampling at random, and since we're restricting the sampling to the k most probable tokens, we aren't getting gibberish. One issue with top-k mentioned in the paper is that if k is small enough, it's possible to get bland text. Let's try to force that by choosing a smaller k.

In [20]:
# Generate GPT2 output with top_k=5
set_seed(42)
config = GenerationConfig(max_new_tokens=150, do_sample=True, top_k=5, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.")

print(output[0]["generated_text"])

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

"They were very, very clever and were very intelligent," said Dr. David B. Hargrave, a professor of botany at the University of Arizona. The researchers believe their findings are the first to show that a language has a unique ability to speak. They say they have found evidence of this ability in other animals, and it's the result of an evolutionary process known as "genetic engineering."

The researchers say that their findings may be the first to demonstrate that language can be used in the human condition.

"The fact that we have found a way to communicate with these animals is an exciting and very exciting development, and it's a big step toward the discovery of a way to communicate with these animals,"


Aha! So we have blandness as described in the paper and overall nonsense. I found `their findings are the first to show that a language has a unique ability to speak` and that the study was done by `a professor of botany at the University of Arizona` to be particularly tickling.

The paper also mentions that a large k has the chance to include too many tokens, effectively resulting in random sampling. Let's see this.

In [22]:
# Generate GPT2 output with top_k=5000000
set_seed(42)
config = GenerationConfig(max_new_tokens=150, do_sample=True, top_k=5000000, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.")

print(output[0]["generated_text"])

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

"At the time, there was a remote tribe on the island of Thirrevy that used to live where the Donse people are." Things have changed since then.

Recently, 124 unicorn are reported to have arrived in a remote region of Thailand across from the Andes Mountains to coexist in life.

This developing condition leaves a good chance of befriending scarce animals. According to Dezeen, Research invents books on unicorn culture that give different inside ideas about animals and living in the original It's not just that the results seem hot, but that they lent sway"

Chicago's Dezeen also expressed, "As a researcher in this field, I believe that the unicorns embody a perfect balance


So here, top-k sampling includes the tail of our softmaxed token distribution, and we can see tail tokens getting sampled frequently. (GPT2's vocabulary has 50,257 tokens, so `top_k=5000000` keeps all of them and this is just random sampling.) Why do tokens with low probability get sampled, you might ask? Because individually they each have a small probability, but collectively the probability of anything from the tail being chosen is quite large.
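A quick back-of-the-envelope calculation makes the "individually small, collectively large" point concrete. The 70%/30% split between the 50 most probable tokens and everything else is an assumed number, not something I measured, and I pretend the split is the same at every step, which is a simplification.

```python
vocab_size = 50257   # GPT2 vocabulary size
head_size = 50       # number of "head" tokens (assumed)
head_mass = 0.70     # assumed probability mass on the head tokens
steps = 150          # max_new_tokens used in the runs above

tail_mass = 1.0 - head_mass
# Average probability of one individual tail token: tiny.
per_tail_token = tail_mass / (vocab_size - head_size)
# Expected number of tail tokens in one generation: not tiny at all.
expected_tail_tokens = steps * tail_mass
# Chance that at least one tail token shows up somewhere in the output.
p_at_least_one = 1.0 - head_mass ** steps

print(f"average probability of one tail token: {per_tail_token:.2e}")
print(f"expected tail tokens in {steps} steps: {expected_tail_tokens:.0f}")
print(f"P(at least one tail token): {p_at_least_one:.6f}")
```

Under these assumptions, each tail token has a probability of roughly 6 in a million, yet about 45 of the 150 generated tokens are expected to come from the tail.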

## Beam Search

Let's move on to beam search. Recall that beam search is like greedy search, except that at each inference step it keeps track of the `n` most probable partial sequences (the beams) instead of just one, where `n` is the beam width. So, if the beam width is 1, it's greedy search. If the beam width is 2, it's like greedy search except it's keeping track of the 2 most probable paths of generated tokens. If I didn't explain it well, which is likely, please read [this article](https://towardsdatascience.com/decoding-strategies-that-you-need-to-know-for-response-generation-ba95ee0faadc) or [this one](https://towardsdatascience.com/an-intuitive-explanation-of-beam-search-9b1d744e7a0f). Or you can take a look at what GPT2 thinks beam search is below. The below output was generated with a beam width of 10 and the initial tokens being `"What is beam search? Beam search is "`.

In [23]:
# Generate GPT2 output with num_beams=10
set_seed(42)
config = GenerationConfig(max_new_tokens=150, num_beams=10, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("What is beam search? Beam search is ")

print(output[0]["generated_text"])

What is beam search? Beam search is  a term used to describe the ability of a computer to search for patterns of light in a given area of the electromagnetic spectrum. It is used to describe the ability of a computer to search for patterns of light in a given area of the electromagnetic spectrum. It is used to describe the ability of a computer to search for patterns of light in a given area of the electromagnetic spectrum. It is used to describe the ability of a computer to search for patterns of light in a given area of the electromagnetic spectrum. It is used to describe the ability of a computer to search for patterns of light in a given area of the electromagnetic spectrum. It is used to describe the ability of a computer to search for patterns of light in a given area of the electromagnetic spectrum
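GPT2's answer is, of course, not what beam search is. As a sanity check on my own explanation, here is a minimal sketch of beam search on a made-up two-step "language model". `TOY_MODEL`, its probabilities, and the function name are all hypothetical; real implementations also handle end-of-sequence tokens and length penalties, which I skip here.

```python
import math

# Hypothetical toy model: next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("A",): {"x": 0.4, "y": 0.3, "z": 0.3},
    ("B",): {"x": 0.9, "y": 0.05, "z": 0.05},
    ("C",): {"x": 0.34, "y": 0.33, "z": 0.33},
}


def beam_search(model, num_beams, steps):
    """Return the num_beams best (sequence, log-probability) pairs."""
    beams = [((), 0.0)]  # start from the empty prefix with log-prob 0
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, p in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(p)))
        # Keep only the num_beams highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


greedy_seq, greedy_score = beam_search(TOY_MODEL, num_beams=1, steps=2)[0]
beam_seq, beam_score = beam_search(TOY_MODEL, num_beams=2, steps=2)[0]

print("beam width 1:", greedy_seq, round(math.exp(greedy_score), 2))
print("beam width 2:", beam_seq, round(math.exp(beam_score), 2))
```

With a beam width of 1 the search commits to `A` and ends at `A, x` with probability 0.20. With a beam width of 2 it also keeps `B` alive and finds `B, x` with probability 0.36, a sequence that starts with a less probable first token.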


Going back to the task, let's do beam search on the original input string we've beam using (yes, the typo was intentional). The beam width is 10.

In [24]:
# Generate GPT2 output with num_beams=10
set_seed(42)
config = GenerationConfig(max_new_tokens=150, num_beams=10, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.")

print(output[0]["generated_text"])

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The study, published in the journal Proceedings of the National Academy of Sciences, is the first to show that unicorns are able to communicate with each other.

"This is the first time that unicorns have been able to communicate with each other, and it's the first time that unicorns have been able to communicate with each other, and it's the first time that unicorns have been able to communicate with each other, and it's the first time that unicorns have been able to communicate with each other, and it's the first time that unicorns have been able to communicate with each other, and it's the first time that unicorns have been able to communicate with each other, and it's the first time that


Ooookay, so this is not very good. There is lots of repetition and we don't like that. In fact, it seems even more repetitive than greedy search! What if we increase the number of beams to 50?

In [27]:
# Generate GPT2 output with num_beams=50
set_seed(42)
config = GenerationConfig(max_new_tokens=150, num_beams=50, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.")

print(output[0]["generated_text"])

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


Hmm, so I tried it with both 100 and 50 and neither seems to be working. It appears that something is timing out? I'm not sure. The model is stored locally, so I'm not really sure why something would be timing out when I'm not doing inference over the cloud. Maybe it's finding that the most probable result with a beam width of 50 is to just end the text right away? No clue. Let me know what y'all think if you have an idea.

Let's move on to nucleus sampling.

## Nucleus Sampling

Nucleus sampling (also called top-p sampling) builds on top-k sampling by also removing the tail of the distribution. The authors' insight was that top-k sampling has to guess a value of k for which the tail is appropriately removed. Let me ask you: what should k be to remove the tail of the distribution? Does a single k make sense when the distribution changes at each step of the token generation sequence? Top-p sampling instead chooses a probability p: at each step, the most probable tokens are added up, starting from the most probable, until their total probability reaches p. Those tokens are kept and the rest are discarded. Since the number of kept tokens adapts to the shape of the distribution at each step, this is a better method for removing the tail.
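Here is a minimal pure-Python sketch of the top-p filter, applied to one peaked and one flat distribution to show how the number of kept tokens adapts. Both distributions are made up, and the function name is hypothetical.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable entries whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Zero out everything outside the nucleus and renormalize the rest.
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / mass
    return filtered, len(keep)


# Made-up next-token distributions, purely for illustration.
peaked = [0.85, 0.08, 0.04, 0.02, 0.01]
flat = [0.25, 0.22, 0.20, 0.18, 0.15]

peaked_filtered, n_peaked = top_p_filter(peaked, p=0.9)
flat_filtered, n_flat = top_p_filter(flat, p=0.9)

print("peaked: kept", n_peaked, "of", len(peaked), "tokens")
print("flat:   kept", n_flat, "of", len(flat), "tokens")
```

With `p=0.9`, the peaked distribution keeps only 2 of its 5 tokens, while the flat one keeps all 5. A fixed k cannot do both.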

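Let's try our example from the paper with top-p equal to 0.9. Note that `top_k` is not set in the configs below, so per the parameter list above it should stay at its default of 50, which means the nucleus is additionally capped at 50 tokens.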

In [28]:
# Generate GPT2 output with top_p=0.9
set_seed(42)
config = GenerationConfig(max_new_tokens=150, do_sample=True, top_p=0.9, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.")

print(output[0]["generated_text"])

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

"They also were completely uninhibited in their behaviors, and even though they never came across a human, their eyes were very bright.

"The results were surprising, since humans often talk so fast, which would have caused them to learn a lot," explained Peter Vollrath, co-author of the paper and a researcher in the National Science Foundation's Center for Scientific Computing. "This is consistent with a common view that unicorns often learn languages because they are familiar with a language that is not their native language."

The results will be used to study how humans learn the language of other species such as chimpanzees and other apes. The results may also allow scientists to look for similarities between unicorns and other primates such


Huh, so to my surprise this output was pretty nonsensical. In fact, I think it's keeping too much of the tail when it shouldn't be! Let's try decreasing the top-p value to 0.75.

In [30]:
# Generate GPT2 output with top_p=0.75
set_seed(42)
config = GenerationConfig(max_new_tokens=150, do_sample=True, top_p=0.75, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.")

print(output[0]["generated_text"])

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

"They also were very nice to humans," says Michael R. Anderson, a University of California, Davis professor of linguistics and language studies and the study's lead author.

The unicorns are among a handful of rare species that are believed to have lived on the Andes Mountains for thousands of years.

In 2012, researchers found a group of seven rare unicorns living in the remote and isolated valley of Cauco de la Hoya, Bolivia. It was found that they were all from the same family — the civets.

"We have not seen the civets since their discovery, but they are quite close to us in this valley," Anderson says.

The civets live


This does seem to be performing better! I was impressed that it got Bolivia and the Andes to be geographically related, though "Cauco de la Hoya" is not a real place. What if we do top_p=0.5?

In [31]:
# Generate GPT2 output with top_p=0.5
set_seed(42)
config = GenerationConfig(max_new_tokens=150, do_sample=True, top_p=0.5, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.")

print(output[0]["generated_text"])

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

"They were very, very clever and very intelligent," said the researcher, who has no prior knowledge of unicorns. "I think that's why they are so well-known in the region."

In a paper published in the journal Nature Communications, the researchers describe the study as a breakthrough in the field of unicorns.

"The discovery of this rare, rare, and unique species is a step forward in the search for the true origins of the language," said lead author and professor of genetics and evolutionary biology Dr. James A. Waddell, who is also a member of the team. "This is a great example of how a species can be studied in isolation and used to help guide new discoveries in evolutionary


So now we're entering the regime of repetition and likely we're cutting off too much of the distribution, way past the tail, and approaching greedy search territory. This makes sense hypothetically and we're seeing those results here!

## Temperature

Temperature is used to reshape the distribution by dividing the model's output logits by the temperature before the softmax. In the limit of temperature going to 0, the probability of the most likely token goes to 1 and everything else goes to zero, which is basically just greedy search. In the limit of temperature going to infinity, the token probabilities become uniform. Finally, a temperature of 1 results in the regular softmax distribution. The paper describes low temperature as being useful in top-k where the distribution is relatively flat (see the paper for a nice figure) and a higher temperature as being useful where the distribution is relatively peaked (really, the paper has a nice figure, [figure 5](https://arxiv.org/pdf/1904.09751.pdf)).
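In formula form, the probability of token $i$ with logit $z_i$ at temperature $T$ is $p_i = \exp(z_i / T) / \sum_j \exp(z_j / T)$. The sketch below evaluates this on three made-up logits to show the limits described above; the function name is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature (shifted by the max for stability)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up logits, purely for illustration

cold = softmax_with_temperature(logits, 0.1)     # close to greedy
normal = softmax_with_temperature(logits, 1.0)   # the regular softmax
hot = softmax_with_temperature(logits, 100.0)    # close to uniform

for name, probs in [("T=0.1", cold), ("T=1", normal), ("T=100", hot)]:
    print(name, [round(p, 3) for p in probs])
```

At $T=0.1$ nearly all of the mass sits on the first token, and at $T=100$ the three probabilities are all close to $1/3$.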

Let's use temperature with top-k to demonstrate how it can have a positive effect on the generation process. In the next code cell I redo top-k with $k=20$ just so y'all don't have to scroll up.

In [32]:
# Generate GPT2 output with top_k=20
set_seed(42)
config = GenerationConfig(max_new_tokens=150, do_sample=True, top_k=20, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.")

print(output[0]["generated_text"])

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

"They also were completely literate on language and grammar; they were very happy with their food and had even taken to eating human food," said the study's lead author, David E. Fuchs, Ph.D., an assistant professor of anthropology at the University of California, Santa Barbara and co-author of the paper. Fuchs was not immediately available for comment on the findings, but the discovery is consistent with a similar study conducted in the same region in 2012. It was published in Science in January, the same month that the first U.S. unicorns were sighted in the mountains.

The discovery also raises some fundamental questions about how the U.S. government's efforts to help indigenous peoples in Mexico —


Then, I set the temperature to be $0.9$ to see what happens.

In [33]:
# Generate GPT2 output with top_k=20 and temperature=0.9
set_seed(42)
config = GenerationConfig(max_new_tokens=150, do_sample=True, top_k=20, temperature=0.9, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.")

print(output[0]["generated_text"])

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

"They were a bit overindicated," says Michael R. Feltman, an expert in language cognition and linguistics at UC San Diego. "Some were very hard to read. Others were very well-behaved at times. They spoke very intelligible English."

In fact, the unicorns were able to read a lot of things, including English, according to the study, published today (December 7) in the Proceedings of the National Academy of Sciences. It all began with a call from the researcher's wife who, she says, was so overwhelmed with her own language that she could not speak it to her daughter in a regular language training class.

Feltman says the unicorns were extremely receptive to


Okay! So this is a completely different set of generated sentences and I found it to be slightly less coherent. I do not know what overindicated means (it doesn't appear to be a real word). So, my intuition is telling me that the original probability distribution of the top 20 words was more peaked than flat, meaning a larger temperature than 1.0 should help with generation.

In [36]:
# Generate GPT2 output with top_k=20 and temperature=1.1
set_seed(42)
config = GenerationConfig(max_new_tokens=150, do_sample=True, top_k=20, temperature=1.1, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.")

print(output[0]["generated_text"])

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

"They also were completely literate on language and grammar; in fact, most of them were speaking perfect English while they studied," says Richard J. Doss, a professor of botany at the University of Arizona who was not involved in the study.

When the researchers looked through more than 20,000 letters, the unicorns were most closely related to the English speaking unicorns. For example, three of the most commonly used terms were from the common name "Carnivore," while "Dromer" from the name Carnivore was used to describe wild, small wild animals.

The scientists also discovered that, as in previous research in this region, the unicorns also had a very high level of


Once again, changing the temperature didn't do much to reduce the nonsense in the output. However, it is clear that temperature *does* affect the output, and there may be use cases where it helps.
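
To make the effect of temperature more concrete, here is a minimal sketch of the usual definition: divide the logits by the temperature before applying softmax. The helper name `softmax_with_temperature` and the toy logits are my own assumptions for illustration, not values taken from GPT2, and this is not HuggingFace's actual implementation.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens (not taken from GPT2)
toy_logits = [2.0, 1.0, 0.5, 0.0]

for t in (0.9, 1.0, 1.1, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: top-token probability = {probs[0]:.3f}")
```

Temperatures below 1.0 push more probability onto the most likely token (a more peaked distribution), while temperatures above 1.0 spread it out (a flatter distribution).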

## Miscellaneous

I also wanted to spend some time playing around with GPT2! So here are some cells that are just for fun. You don't have to read this section if you don't want to; the main ideas were above.

Here, I try to get GPT2 to give me a nickname that rhymes with Alexey. I start out with `top_p=0.95` and generate 10 independent results.

In [59]:
# Generate GPT2 output
set_seed(42)
config = GenerationConfig(max_new_tokens=15, do_sample=True, top_p=0.95, num_return_sequences=10, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("My name is Alexey and my nickname is")

for i in range(10):
    print(f"Output {i}:")
    print(output[i]["generated_text"])
    print("---------------------------")

Output 0:
My name is Alexey and my nickname is Lillian. I've been playing video games since I was 2 years old
---------------------------
Output 1:
My name is Alexey and my nickname is C-L. In this game, we've got a lot of fun
---------------------------
Output 2:
My name is Alexey and my nickname is C3VIC," he says in the video, which he uploaded as
---------------------------
Output 3:
My name is Alexey and my nickname is Alexey," he said.

Advertisement

Alexey grew up
---------------------------
Output 4:
My name is Alexey and my nickname is 'D' here. So you are going to die to see who gets
---------------------------
Output 5:
My name is Alexey and my nickname is Alexei" is now the highest selling and most watched TV show in the
---------------------------
Output 6:
My name is Alexey and my nickname is "The Kitten." You know what I mean. You know what I
---------------------------
Output 7:
My name is Alexey and my nickname is "Pegasus." In a few minutes, I'm gonna start to
------

This isn't very helpful LOL. Let's try giving it some examples.

In [60]:
# Generate GPT2 output
set_seed(42)
config = GenerationConfig(max_new_tokens=15, do_sample=True, top_p=0.95, num_return_sequences=10, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("My name is Jack and my nickname is Jack Black. My name is Fred and my nickname is Fred the bed head. My name is Alexey and my nickname is")

for i in range(10):
    print(f"Output {i}:")
    print(output[i]["generated_text"])
    print("---------------------------")

Output 0:
My name is Jack and my nickname is Jack Black. My name is Fred and my nickname is Fred the bed head. My name is Alexey and my nickname is Alexey the bard."

He was born and raised in Michigan
---------------------------
Output 1:
My name is Jack and my nickname is Jack Black. My name is Fred and my nickname is Fred the bed head. My name is Alexey and my nickname is Alexey the baby cat. My name is Joe and my nickname is Joe
---------------------------
Output 2:
My name is Jack and my nickname is Jack Black. My name is Fred and my nickname is Fred the bed head. My name is Alexey and my nickname is Alexey the cat head. My name is Michael and my nickname is Michael
---------------------------
Output 3:
My name is Jack and my nickname is Jack Black. My name is Fred and my nickname is Fred the bed head. My name is Alexey and my nickname is Alexey the bed head.

Advertisement

3. There is
---------------------------
Output 4:
My name is Jack and my nickname is Jack Black. My name is

I like this more, but I wonder if I can get it to rhyme. Apparently the examples in the prompt really influence what it gives me, seeing as lots of the outputs reuse `head` in the nickname.

In [61]:
# Generate GPT2 output
set_seed(42)
config = GenerationConfig(max_new_tokens=15, do_sample=True, top_p=0.95, num_return_sequences=10, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("Jack is short for Jack Black. Fred is short for Fred the bed head. Alexey is short for")

for i in range(10):
    print(f"Output {i}:")
    print(output[i]["generated_text"])
    print("---------------------------")

Output 0:
Jack is short for Jack Black. Fred is short for Fred the bed head. Alexey is short for Alexei.
---------------------------
Output 1:
Jack is short for Jack Black. Fred is short for Fred the bed head. Alexey is short for Alexey the headdress.

Funny and a bit creepy,
---------------------------
Output 2:
Jack is short for Jack Black. Fred is short for Fred the bed head. Alexey is short for Alexey the cat head. Mikey is short for Mikey the cat
---------------------------
Output 3:
Jack is short for Jack Black. Fred is short for Fred the bed head. Alexey is short for Alexey the head.

Hilariously awkward for a comic book
---------------------------
Output 4:
Jack is short for Jack Black. Fred is short for Fred the bed head. Alexey is short for Alexey the house head. He also has a number of nicknames and
---------------------------
Output 5:
Jack is short for Jack Black. Fred is short for Fred the bed head. Alexey is short for Alexei the doll head. Ben is short for Ben the cat he

Maybe I need to give it many more examples?
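
With more examples, writing the prompt out by hand gets tedious. A small helper can assemble it from a list of (name, nickname) pairs instead. This is only a sketch: `build_few_shot_prompt` is a hypothetical helper of my own, not part of `transformers`.

```python
def build_few_shot_prompt(examples, template, target):
    """Fill in one sentence per example, then append the unfinished sentence for the target name."""
    shots = [template.format(name=name) + " " + completion + "." for name, completion in examples]
    return " ".join(shots + [template.format(name=target)])

# Example pairs in the style of the prompts used in this section
examples = [
    ("Jack", "Jack Black"),
    ("Fred", "Fred the bed head"),
]

prompt = build_few_shot_prompt(examples, "{name} is short for", "Alexey")
print(prompt)
```

The resulting string can be passed to `generator(...)` just like the handwritten prompts, and adding more examples only means adding more pairs to the list.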

In [65]:
# Generate GPT2 output
set_seed(42)
config = GenerationConfig(max_new_tokens=15, do_sample=True, top_p=0.95, num_return_sequences=10, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("Jack is short for Jack the Jacked. Fred is short for Fred the bed head. "
                   "Keith is short for Keith the Thief. John is short for John the fawn. "
                   "Claire is short for Claire the Bear. Blake is short for Blake the drake. "
                   "Greg is short for Greggy peggy. Max is short for Max spitting fax. "
                   "Alexey is short for")

for i in range(10):
    print(f"Output {i}:")
    print(output[i]["generated_text"])
    print("---------------------------")

Output 0:
Jack is short for Jack the Jacked. Fred is short for Fred the bed head. Keith is short for Keith the Thief. John is short for John the fawn. Claire is short for Claire the Bear. Blake is short for Blake the drake. Greg is short for Greggy peggy. Max is short for Max spitting fax. Alexey is short for Alexei the bard. Charlie is short for Charlie the chimp.
---------------------------
Output 1:
Jack is short for Jack the Jacked. Fred is short for Fred the bed head. Keith is short for Keith the Thief. John is short for John the fawn. Claire is short for Claire the Bear. Blake is short for Blake the drake. Greg is short for Greggy peggy. Max is short for Max spitting fax. Alexey is short for Alexey the monkey. Lillian is short for Lillian the horse.
---------------------------
Output 2:
Jack is short for Jack the Jacked. Fred is short for Fred the bed head. Keith is short for Keith the Thief. John is short for John the fawn. Claire is short for Claire the Bear. Blake is short for

Hmm, not quite. Though `Alexey screaming at a lady at a local strip club` is funny. Let's try changing the sampling method to `top_k=20`.
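
As a reminder of what this switch changes, here is a rough sketch of the two truncation rules on a toy distribution: top-k keeps a fixed number of tokens, while top-p keeps the smallest set of top tokens whose cumulative probability reaches `p`. The helper names and the toy probabilities are my own assumptions; the real implementation in `transformers` works on batched logits rather than a dictionary.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical next-token distribution (not taken from GPT2)
toy_probs = {"bard": 0.50, "cat": 0.25, "dork": 0.15, "head": 0.07, "horse": 0.03}

print("top_k=2  :", sorted(top_k_filter(toy_probs, 2)))
print("top_p=0.8:", sorted(top_p_filter(toy_probs, 0.8)))
```

The number of tokens top-p keeps depends on how peaked the distribution is, whereas top-k always keeps the same number.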

In [66]:
# Generate GPT2 output
set_seed(42)
config = GenerationConfig(max_new_tokens=15, do_sample=True, top_k=20, num_return_sequences=10, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("Jack is short for Jack the Jacked. Fred is short for Fred the bed head. "
                   "Keith is short for Keith the Thief. John is short for John the fawn. "
                   "Claire is short for Claire the Bear. Blake is short for Blake the drake. "
                   "Greg is short for Greggy peggy. Max is short for Max spitting fax. "
                   "Alexey is short for")

for i in range(10):
    print(f"Output {i}:")
    print(output[i]["generated_text"])
    print("---------------------------")

Output 0:
Jack is short for Jack the Jacked. Fred is short for Fred the bed head. Keith is short for Keith the Thief. John is short for John the fawn. Claire is short for Claire the Bear. Blake is short for Blake the drake. Greg is short for Greggy peggy. Max is short for Max spitting fax. Alexey is short for Alexei the bard.

The following characters were added to the
---------------------------
Output 1:
Jack is short for Jack the Jacked. Fred is short for Fred the bed head. Keith is short for Keith the Thief. John is short for John the fawn. Claire is short for Claire the Bear. Blake is short for Blake the drake. Greg is short for Greggy peggy. Max is short for Max spitting fax. Alexey is short for Alexey the dork. Mike is short for Mike the hottie
---------------------------
Output 2:
Jack is short for Jack the Jacked. Fred is short for Fred the bed head. Keith is short for Keith the Thief. John is short for John the fawn. Claire is short for Claire the Bear. Blake is short for Bla

I like `Alexey the dork` since it seems like GPT2 is making fun of me lol. Let's try changing the input prompt.

In [67]:
# Generate GPT2 output
set_seed(42)
config = GenerationConfig(max_new_tokens=15, do_sample=True, top_k=20, num_return_sequences=10, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("Jack rhymes with jacked. Fred rhymes with bed head. "
                   "Keith rhymes with thief. John rhymes with fawn. "
                   "Claire rhymes with bear. Blake rhymes with drake. "
                   "Greggy rhymes with peggy. Max rhymes with facts. "
                   "Alexey rhymes with")

for i in range(10):
    print(f"Output {i}:")
    print(output[i]["generated_text"])
    print("---------------------------")

Output 0:
Jack rhymes with jacked. Fred rhymes with bed head. Keith rhymes with thief. John rhymes with fawn. Claire rhymes with bear. Blake rhymes with drake. Greggy rhymes with peggy. Max rhymes with facts. Alexey rhymes with a big ol' grin.

I was born and raised in the
---------------------------
Output 1:
Jack rhymes with jacked. Fred rhymes with bed head. Keith rhymes with thief. John rhymes with fawn. Claire rhymes with bear. Blake rhymes with drake. Greggy rhymes with peggy. Max rhymes with facts. Alexey rhymes with jock.

[1:14]

[3:
---------------------------
Output 2:
Jack rhymes with jacked. Fred rhymes with bed head. Keith rhymes with thief. John rhymes with fawn. Claire rhymes with bear. Blake rhymes with drake. Greggy rhymes with peggy. Max rhymes with facts. Alexey rhymes with cuddly. Michael rhymes with jingle ball. Dax rh
---------------------------
Output 3:
Jack rhymes with jacked. Fred rhymes with bed head. Keith rhymes with thief. John rhymes with fawn. Claire rh

We're close! I see `Alexey rhymes with kennan and berry` and `Alexey rhymes with blueberry`, which are *actual* rhymes lol. Let's play with temperature.

In [68]:
# Generate GPT2 output
set_seed(42)
config = GenerationConfig(max_new_tokens=15, do_sample=True, top_k=20, temperature=2.0, num_return_sequences=10, eos_token_id=model.config.eos_token_id, pad_token_id=model.config.eos_token_id)
generator = pipeline('text-generation', model='gpt2', generation_config=config)
output = generator("Jack rhymes with jacked. Fred rhymes with bed head. "
                   "Keith rhymes with thief. John rhymes with fawn. "
                   "Claire rhymes with bear. Blake rhymes with drake. "
                   "Greggy rhymes with peggy. Max rhymes with facts. "
                   "Alexey rhymes with")

for i in range(10):
    print(f"Output {i}:")
    print(output[i]["generated_text"])
    print("---------------------------")

Output 0:
Jack rhymes with jacked. Fred rhymes with bed head. Keith rhymes with thief. John rhymes with fawn. Claire rhymes with bear. Blake rhymes with drake. Greggy rhymes with peggy. Max rhymes with facts. Alexey rhymes with giddy bird. Azzin raps on your forehead! Yup
---------------------------
Output 1:
Jack rhymes with jacked. Fred rhymes with bed head. Keith rhymes with thief. John rhymes with fawn. Claire rhymes with bear. Blake rhymes with drake. Greggy rhymes with peggy. Max rhymes with facts. Alexey rhymes with jock.

[20:33] So when was it recorded
---------------------------
Output 2:
Jack rhymes with jacked. Fred rhymes with bed head. Keith rhymes with thief. John rhymes with fawn. Claire rhymes with bear. Blake rhymes with drake. Greggy rhymes with peggy. Max rhymes with facts. Alexey rhymes with cutesy cat man. Mikey jokingly makes cat noises. J
---------------------------
Output 3:
Jack rhymes with jacked. Fred rhymes with bed head. Keith rhymes with thief. John rhyme

So changing the temperature to 2.0 produced a lot more adult results, like `Alexey rhymes with the devil` and `Alexey rhymes with pimp!`, though I did also see `Alexey rhymes with cutesy cat man` and `Alexey rhymes with giddy bird`.

I'll take `Alexey rhymes with cutesy cat man` as a win.

Please only refer to me now as "Cutesy Cat Man" (... I'm kidding lol).

# Conclusion

Overall, I really enjoyed making this report. This type of exploration, where I'm playing with the output of the model, is in stark contrast to my [previous report](https://github.com/yyexela/APPM5720-Reports/blob/master/Report12/April5.ipynb), which was much more focused on model development and training, so it was a nice change of pace. The generation methods, like greedy search, beam search, top-k, and top-p, are not difficult to understand, so going through these examples to build some intuition for how they actually change the results was nice.

Thanks for reading!

*- Alexey the cutesy cat man*