## Generation/Inference

https://github.com/huggingface/blog/blob/main/notebooks/02_how_to_generate.ipynb


This blog post gives a brief overview of different decoding strategies. 
All of the following functionalities can be used for auto-regressive language generation (for an illustrated introduction, see https://jalammar.github.io/illustrated-gpt2/). 
In short, auto-regressive language generation is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next word distributions. 
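
Written out, with $W_0$ standing for the initial context and $T$ for the length of the generated sequence (symbols introduced here for illustration), the decomposition reads:

$$
P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0), \qquad w_{1:0} = \emptyset
$$

Every decoding method below is a different way of picking $w_t$ from the conditional distribution on the right-hand side.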

We will give a tour of the currently most prominent decoding methods, mainly:
- Greedy search
- Beam search
- Top-K sampling
- Top-p sampling


In [21]:
import os
import json
import torch
import argparse
from tqdm import tqdm
from pprint import pprint

from transformers import AutoTokenizer, AutoModelForCausalLM

DEVICE = torch.device("cuda:7" if torch.cuda.is_available() else "cpu")


# local_cache_dir points to model parameters that were manually downloaded to the local environment
local_cache_dir = "../../DataCollection/officials/gpt2"
model = AutoModelForCausalLM.from_pretrained(local_cache_dir, output_hidden_states=True).to(DEVICE)
tokenizer = AutoTokenizer.from_pretrained(local_cache_dir)



## Naive prediction

This is a naive implementation of GPT-2 text generation based on the Transformers package. 
The model's HF repository is at https://huggingface.co/openai-community/gpt2. 

In [23]:
# Encode initial input
input_text = "What is star war?"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(DEVICE)  # Shape: [1, 5] (this prompt is 5 tokens long)

# Set the number of tokens to generate
num_tokens_to_generate = 100

# Iteratively generate tokens
# for _ in tqdm(range(num_tokens_to_generate), mininterval=1):
for _ in range(num_tokens_to_generate):

    # Get model output logits
    outputs = model(input_ids)
    logits = outputs.logits  # Shape: [1, current_length, 50257], i.e. [batch_size, sequence_length, vocab_size]

    '''
    Predict the next token from the last position:
    the logits at position i are used to predict token i+1.
    Since we want the token that follows all previous tokens, we use the logits of the final position.
    The shifting of labels and logits in the source code of the forward function reflects the same alignment.
    '''
    next_token_logits = logits[:, -1, :]  # Shape: [1, 50257], one score per vocabulary entry

    '''
    Greedy decoding: select the token with the highest probability.
    Top-k sampling or beam search could be tried here instead.
    '''
    greedy_token_id = torch.argmax(next_token_logits, dim=-1)  # Shape: [1]

    # Append the predicted token to the input_ids
    input_ids = torch.cat([input_ids, greedy_token_id.unsqueeze(-1)], dim=-1).to(DEVICE)  # Shape: [1, current_length + 1]

    # print(tokenizer.decode(input_ids.squeeze(), skip_special_tokens=True))

# Decode the entire sequence of tokens
generated_text = tokenizer.decode(input_ids.squeeze(), skip_special_tokens=True)
print("Generated Text:\n", generated_text)

Generated Text:
 What is star war?

Star wars are the most common form of warfare in the world. The most common form of warfare is the war of attrition. The most common form of warfare is the war of attrition.

Star wars are the most common form of warfare in the world. The most common form of warfare is the war of attrition. The most common form of warfare is the war of attrition.

Star wars are the most common form of warfare in the world. The most common form of warfare is


In [24]:
logits.shape

torch.Size([1, 104, 50257])

## Official functions

In [26]:
# encode context the generation is conditioned on
model_inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt').to(DEVICE)

pprint(model_inputs, width=100)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]], device='cuda:7'),
 'input_ids': tensor([[   40,  2883,  6155,   351,   616, 13779,  3290]], device='cuda:7')}


### 1. Greedy Search


Greedy search selects the word with the highest probability as the next word at each timestep. The `generate` function uses this strategy by default. 

In [27]:
# generate 40 new tokens
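# in this setup the output of generate is a `GenerateDecoderOnlyOutput` object (with default settings it would be a plain tensor of token ids); we only need its first attribute, the generated sequences.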
greedy_output = model.generate(**model_inputs, 
    max_new_tokens=40, 
    # max_length=50, 
    )

token_ids = torch.squeeze(greedy_output[0])
print(tokenizer.decode(token_ids, skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure


The generated words following the context are reasonable, but the model quickly starts repeating itself! This is a very common problem in language generation in general and seems to be even more so in greedy and beam search. 

The major drawback of greedy search though is that it misses high probability words hidden behind a low probability word. 
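
To make this drawback concrete, here is a minimal sketch on a toy model. The words and probabilities in `TOY_MODEL` are made up for illustration, and `greedy_decode` and `best_sequence` are hypothetical helpers, not part of Transformers. Greedy decoding commits to the most likely first word and never sees the much more likely continuation behind the second-best first word.

```python
# Toy next-word distributions, keyed by the words generated so far.
# All words and probabilities are made up for illustration.
TOY_MODEL = {
    (): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("nice",): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("dog",): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("car",): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def greedy_decode(model, steps):
    """Pick the single most likely next word at every step."""
    seq, prob = (), 1.0
    for _ in range(steps):
        word, p = max(model[seq].items(), key=lambda item: item[1])
        seq, prob = seq + (word,), prob * p
    return seq, prob


def best_sequence(model, steps):
    """Enumerate every sequence of the given length and return the most likely one."""
    complete = []

    def walk(seq, prob):
        if len(seq) == steps:
            complete.append((seq, prob))
            return
        for word, p in model[seq].items():
            walk(seq + (word,), prob * p)

    walk((), 1.0)
    return max(complete, key=lambda item: item[1])


greedy_seq, greedy_prob = greedy_decode(TOY_MODEL, steps=2)
best_seq, best_prob = best_sequence(TOY_MODEL, steps=2)
print(greedy_seq, round(greedy_prob, 2))  # ('nice', 'woman') 0.2
print(best_seq, round(best_prob, 2))      # ('dog', 'has') 0.36
```

Exhaustive enumeration is only feasible on a toy like this; with a 50257-entry vocabulary the number of sequences grows far too quickly.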

### 2. Beam search

Beam search reduces the risk of missing hidden high-probability word sequences by keeping the `num_beams` most likely hypotheses at each time step and eventually choosing the hypothesis with the highest overall probability, so the final result is still a single sequence. 

Beam search usually finds an output sequence with a probability at least as high as that of greedy search, but it is not guaranteed to find the most likely output.
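
The bookkeeping can be sketched on a toy model as follows. This is a simplified illustration, not the Transformers implementation: the words and probabilities in `TOY_MODEL` are made up, `beam_search` is a hypothetical helper, and real implementations typically work with summed log-probabilities and handle end-of-sequence tokens.

```python
# Toy next-word distributions, keyed by the words generated so far.
# All words and probabilities are made up for illustration.
TOY_MODEL = {
    (): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("nice",): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("dog",): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("car",): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def beam_search(model, steps, num_beams):
    """Keep the num_beams most likely partial sequences at every step."""
    beams = [((), 1.0)]
    for _ in range(steps):
        # Extend every surviving hypothesis by every possible next word.
        candidates = [
            (seq + (word,), prob * p)
            for seq, prob in beams
            for word, p in model[seq].items()
        ]
        # Keep only the num_beams most likely extended hypotheses.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


for width in (1, 2):
    seq, prob = beam_search(TOY_MODEL, steps=2, num_beams=width)[0]
    print(width, seq, round(prob, 2))
# 1 ('nice', 'woman') 0.2
# 2 ('dog', 'has') 0.36
```

With a beam width of 1 the procedure reduces to greedy search; a width of 2 is already enough here to keep the hypothesis that turns out to be more likely overall.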

In [28]:
# activate beam search and early_stopping
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    early_stopping=True
)

token_ids = torch.squeeze(beam_output[0])
print(tokenizer.decode(token_ids, skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure


While the result is arguably more fluent, the output still includes repetitions of the same word sequences.
A simple remedy is to introduce n-gram penalties (an n-gram is a sequence of n words). 

The most common n-gram penalty makes sure that no n-gram appears twice by manually setting the probability of next words that could create an already seen n-gram to 0. 
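
The idea can be sketched in plain Python. `banned_next_tokens` is a hypothetical helper written for this illustration, not the function Transformers uses internally, and the token ids and scores are made up.

```python
def banned_next_tokens(token_ids, ngram_size):
    """Return the tokens that would complete an n-gram already present in token_ids."""
    if ngram_size < 1 or len(token_ids) < ngram_size:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(token_ids[len(token_ids) - (ngram_size - 1):])
    banned = set()
    for start in range(len(token_ids) - ngram_size + 1):
        ngram = tuple(token_ids[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = [5, 7, 9, 5, 7]  # made-up token ids; the sequence currently ends in 7
banned = banned_next_tokens(generated, ngram_size=2)
print(banned)  # {9}, because the 2-gram (7, 9) has already been generated

# Setting a score to -inf gives the token probability 0 after the softmax.
scores = {token_id: 0.0 for token_id in range(10)}  # made-up next-token scores
for token_id in banned:
    scores[token_id] = float("-inf")
```

A penalty this strict should be used with care: with `ngram_size=2`, a name such as "New York" could appear only once in the whole text.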

In [29]:
# introduce n-gram penalties (an n-gram is a sequence of n words)
# no_repeat_ngram_size=2 sets the probability of any token that would repeat an already seen 2-gram to 0
# repetition_penalty discourages tokens that already appear in the sequence; a value greater than 1.0 penalizes repetition
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    no_repeat_ngram_size=2,
    repetition_penalty=1.5,
    early_stopping=True
)

print("[Output (Beam Search)(n-grams penalty)]: ")
token_ids = torch.squeeze(beam_output[0])
print(tokenizer.decode(token_ids, skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[Output (Beam Search)(n-grams penalty)]: 
I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again."

"You're right," she said. "I'm going to have to get used to it. I


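### 3. Multiple outcomes of a single generate

By setting `num_return_sequences`, you can get several sequences from a single call. This applies to both beam search and sampling methods; with beam search, `num_return_sequences` must not exceed `num_beams`. 
Note that `generate` uses greedy search by default, which is deterministic and yields only one candidate, so `num_return_sequences` is only useful once beam search or sampling is enabled. 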

In [30]:
# activate beam search and early_stopping
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)

token_ids = torch.squeeze(beam_output[0])
for j in range(token_ids.shape[0]):
    print(tokenizer.decode(token_ids[j], skip_special_tokens=True))
    print(20*'=')

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's a good idea to
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time to take a
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's a good idea.


As argued in Ari Holtzman et al. (2019), high quality human language does not follow a distribution of high probability next words. In other words, as humans, we want generated text to surprise us and not to be boring/predictable. 

### 4. Sampling



In Top-K sampling, the K most likely next words are filtered and the probability mass is redistributed among only those K next words. 

1. Token Probabilities: After the model processes the input text, it predicts a probability distribution over the possible next tokens. 
2. Filtering to Top-k: Instead of considering all possible tokens, top-k sampling narrows down the choices to the k tokens with the highest probabilities. This "pruning" reduces the potential output space, focusing on the most probable next tokens while ignoring less likely ones. 
3. Random Sampling: From the top-k tokens, one token is sampled randomly according to their probabilities, rather than always choosing the highest probability token. This introduces variety into the generated text, leading to more diverse outputs.
4. Controlling Output Diversity: The value of k controls the trade-off. A high k (e.g., 50 or 100) allows more options, increasing diversity and potentially creativity, at the risk of less coherence. A low k (e.g., 5 or 10) limits options, usually making the text more deterministic and focused but sometimes too repetitive or safe. 
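
The steps above can be sketched without any framework. The helpers `softmax`, `top_k_filter`, and `sample_top_k` are written for this illustration and the logits are made up; in Transformers, the filtering is applied to the logits inside `generate`.

```python
import math
import random


def softmax(logits):
    """Step 1: turn raw scores into a probability distribution."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Step 2: keep the k most likely token ids and renormalize their probabilities."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}


def sample_top_k(logits, k, rng=random):
    """Step 3: draw one token id from the filtered distribution."""
    filtered = top_k_filter(softmax(logits), k)
    ids = list(filtered)
    weights = [filtered[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


made_up_logits = [2.0, 1.0, 0.1, -1.0]
print(top_k_filter(softmax(made_up_logits), k=2))  # only token ids 0 and 1 remain
print(sample_top_k(made_up_logits, k=2, rng=random.Random(0)))
```

Setting `k=1` makes the sampler pick the most likely token every time, which is greedy search again.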

In [4]:
topk_output = model.generate(**model_inputs, 
    max_new_tokens=40,
    do_sample=True, 
    top_k=50
    )

token_ids = torch.squeeze(topk_output[0])
print(tokenizer.decode(token_ids, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


I enjoy walking with my cute dog Molly. We share a bit and we're sure we'll be spending a lot less.

I always have the sneaking suspicion that sometimes pets get a bit petty, but it turns out


One concern with Top-K sampling, though, is that it does not dynamically adapt the number of words that are filtered from the next-word probability distribution. This can be problematic because some words are sampled from a very sharp distribution (concentrated on a few words), whereas others come from a much flatter distribution. 
Thus, limiting the sample pool to a fixed size K risks producing gibberish for sharp distributions and limiting the model's creativity for flat distributions. 

Instead of sampling only from the most likely K words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability reaches or exceeds the probability p. 

It differs from Top-K only in the filtering step: the size of the candidate set is no longer fixed, but grows or shrinks with the shape of the next-word distribution, and the probability mass is then redistributed among the words in that set.
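
A minimal sketch of the filtering step, assuming the probabilities are already normalized. `top_p_filter` is a hypothetical helper for illustration and the two distributions are made up.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most likely token ids whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Redistribute the probability mass among the kept tokens.
    return {i: probs[i] / cumulative for i in kept}


sharp = [0.9, 0.05, 0.03, 0.02]      # made-up sharp distribution
flat = [0.3, 0.25, 0.2, 0.15, 0.1]   # made-up flat distribution
print(len(top_p_filter(sharp, p=0.92)))  # 2 tokens are enough to reach 0.92
print(len(top_p_filter(flat, p=0.92)))   # all 5 tokens are needed
```

The same value of p therefore keeps few candidates when the model is confident and many when it is not, which is the adaptivity a fixed K lacks.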

In [5]:
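# only top_p is set here; if the library's default top_k applies (50 in many transformers versions), both filters are active.
# pass top_k=0 to sample with Top-p alone.
topp_output = model.generate(**model_inputs, 
    max_new_tokens=40,
    do_sample=True, 
    top_p=0.92
    )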

token_ids = torch.squeeze(topp_output[0])
print(tokenizer.decode(token_ids, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


I enjoy walking with my cute dog and she is so much fun. She always is happy to meet my furry friends. She has a very friendly, warm, friendly demeanor and she likes playing with my pet! I am a huge fan


Top-p and Top-K sampling seem to produce more fluent text than traditional greedy and beam search on open-ended language generation. Recently, though, there has been more evidence that the apparent flaws of greedy and beam search (mainly generating repetitive word sequences) are caused by the model, especially the way it is trained, rather than by the decoding method. 

## Batching using pipeline

In [1]:
import os
import json
import torch
import argparse
from tqdm import tqdm
from pprint import pprint
from transformers import pipeline

from transformers import AutoTokenizer, AutoModelForCausalLM

DEVICE = torch.device("cuda:7" if torch.cuda.is_available() else "cpu")


# local_cache_dir points to model parameters that were manually downloaded to the local environment
local_cache_dir = "../../DataCollection/officials/gpt2"

pipe = pipeline(task='text-generation', model=local_cache_dir, device=DEVICE)

In [18]:
# compare against None explicitly: a valid pad token id of 0 would be falsy
if pipe.tokenizer.pad_token_id is None:
    pipe.tokenizer.pad_token_id = pipe.tokenizer.eos_token_id

In [12]:
result = pipe("Once upon a time", 
            max_new_tokens=50, 
            # top_k=50, 
            top_p=0.92,
            temperature=0.7,
            num_return_sequences=2,
            )
for item in result:
    print(item['generated_text'])
    print(20*'=')

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Once upon a time when no one was interested in having a female character, the character was developed as a romantic drama. And she is a character you need to be a part of. The reason why women want roles as characters is because it's something that we are not
Once upon a time, even the very best minds in the world knew that they were no longer a part of the great machine we had built for ourselves. The people began to understand that a whole new era could be started.

As a result of this knowledge,


In [20]:
prompts = ["The future of technology is", "Once upon a time in a distant land", "Artificial intelligence has changed"]
results = pipe(prompts, max_length=50, batch_size=8)

for idx, result in enumerate(results):
    print(f"Prompt {idx + 1}: {prompts[idx]}")
    print(result[0]['generated_text'])
    print(20*'=')


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Prompt 1: The future of technology is
The future of technology isThe future of technology is changing more quickly than ever. The ability to connect to your computer and get access to and manipulate your information is gaining momentum, and companies are starting to realize that it is extremely valuable.
Prompt 2: Once upon a time in a distant land
Once upon a time in a distant land when I was a boy, my father asked me why I'd joined the British Army. I explained that it was for the military's own bad. I remember being in a room full of young men with red
Prompt 3: Artificial intelligence has changed
Artificial intelligence has changedIt has become too big, too high and too expensive for our everyday lives, with the main goal of giving us all the tools we need for achieving our goals or making sure we're doing something good. For
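
The right-padding warning in the output above matters because a decoder-only model continues each sequence from its last position. The framework-free sketch below shows what sits in that last position under each padding side; the token ids and the `pad_batch` helper are made up for illustration.

```python
def pad_batch(sequences, pad_id, side):
    """Pad every sequence to the length of the longest one, on the given side."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        padded.append(fill + seq if side == "left" else seq + fill)
    return padded


PAD = 0  # made-up pad token id
batch = [[11, 12, 13, 14, 15], [21, 22, 23]]  # made-up token ids for two prompts

right = pad_batch(batch, PAD, side="right")
left = pad_batch(batch, PAD, side="left")

# The last column is the position generation continues from.
print([row[-1] for row in right])  # [15, 0]: the shorter prompt ends in padding
print([row[-1] for row in left])   # [15, 23]: every prompt ends in a real token
```

In the pipeline itself, the remedy named by the warning would be to set the tokenizer's `padding_side` to `'left'` before running batched prompts.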
