# Welcome to CS 5242 **Homework 4**

ASSIGNMENT DEADLINE ⏰ : **23:59 04 April 2024**

In this assignment, we will delve into **different generation methods of GPT**. Specifically, you will practice text generation with the [OPT](https://arxiv.org/abs/2205.01068) language model using the [transformers](https://huggingface.co/docs/transformers/en/index) package. You will implement three generation techniques: greedy search, beam search, and sampling.

Helpful material: https://huggingface.co/blog/how-to-generate

### Setup model

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the OPT model and tokenizer
# To save memory, we only use opt-350m
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

prompt = "In this assignment, we will delve into different generation methods of GPT. To be specific, you will"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
print(f"tokenized input: {input_ids}, shape: {input_ids.shape}")

### Task 1:  Greedy Search

Greedy search is a simple and straightforward method for text generation. At each step, it selects the next token with the highest probability according to the language model's output. This approach is called "greedy" because it greedily chooses the most likely token at each time step without considering future consequences. Greedy search is fast but may lead to repetitive or monotonous results.

Generation stops when an end-of-sequence token is produced or the maximum length is reached.

It can be represented by the formula:

$ w_t = \arg\max_{w} P(w | w_{<t}) $

Here, $ w_t $ is the word predicted at time step $ t $, and $ P(w | w_{<t}) $ is the probability the model assigns to a candidate word $ w $ given the previous word sequence $ w_{<t} $.
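
As a minimal illustration of the formula above, the sketch below runs greedy decoding on a tiny hand-written next-token table instead of OPT. The table `TOY_MODEL`, the helper `greedy_decode`, and the `<bos>`/`<eos>` tokens are hypothetical names introduced only for this example. As a simplification, the toy table conditions on the previous token only, whereas the formula conditions on the whole prefix. It is not the assignment solution, which has to work on the model's logits with torch.

```python
# Toy next-token distributions P(next | previous token); the values are made up.
TOY_MODEL = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "<eos>": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"<eos>": 0.9, "the": 0.1},
    "dog": {"<eos>": 0.8, "a": 0.2},
}


def greedy_decode(start="<bos>", max_len=10):
    """Pick the most probable next token at every step until <eos> or max_len."""
    sequence = [start]
    for _ in range(max_len):
        dist = TOY_MODEL[sequence[-1]]
        next_token = max(dist, key=dist.get)  # arg max over the toy vocabulary
        sequence.append(next_token)
        if next_token == "<eos>":
            break
    return sequence


greedy_result = greedy_decode()
print(greedy_result)
```

With a real language model, the lookup in `TOY_MODEL` would be replaced by a forward pass that returns scores over the full vocabulary for the last position.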


In [None]:
def greedy_search_generation(input_ids, model) -> torch.Tensor:
    outputs = input_ids
    
    # =========================
    # Your code starts here (3 points)
    # outputs should be a torch Tensor of shape [1, n]
    # where n is the length of the generated text
    # and the elements are the token ids of the generated text
    # 
    # You should not use any external library like transformers' .generate().
    # Only torch is allowed.
    # =========================
    
    # =========================
    # Your code ends here
    # =========================

    return outputs

generated_ids = greedy_search_generation(input_ids, model)[0]
generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(f"prompt: {prompt}\ngenerated_text: {generated_text}")

### Task 2: Beam Search


Beam search is a more complex method that keeps several candidate sequences instead of committing to a single word at each step. At each time step, it keeps track of the $ k $ most likely partial sequences, where $ k $ is known as the beam width, and expands each of them with possible next tokens. Finally, it selects the sequence with the highest score among the surviving candidates. Beam search usually finds higher-probability sequences than greedy search, but it can still produce repetitive outputs.

Here's the basic algorithm for beam search:

1. **Initialization**:
   - Start with an initial input sequence (usually a special token indicating the beginning of a sequence; in this assignment, the tokenized prompt).
   - Initialize a set of beams, each containing the initial sequence and its probability score.

2. **Expansion**:
   - For each beam, generate the next token in the sequence using the GPT model.
   - Calculate the conditional probability of each possible next token given the current partial sequence.
   - Expand each beam by appending each possible next token to the partial sequence, along with its updated probability score.

3. **Selection**:
   - Select the top-k beams with the highest probability scores, where k is the beam width.
   - Discard the remaining beams.

4. **Termination**:
   - Terminate any beam whose latest token is an end-of-sequence token or that has reached the maximum sequence length.
   - Repeat steps 2-4 until all remaining beams either terminate or reach the maximum sequence length.
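
The four steps above can be sketched on a toy problem. The example below is a rough illustration under stated assumptions, not the assignment solution: `TOY_MODEL`, `beam_search`, and the `<bos>`/`<eos>` tokens are hypothetical, the toy table conditions on the previous token only, scores are kept as sums of log-probabilities, and finished beams are carried over without any length normalization.

```python
import math

# Toy next-token distributions P(next | previous token); the values are made up.
TOY_MODEL = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "<eos>": 0.25},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}


def beam_search(beam_size, start="<bos>", max_len=10):
    """Return the best (tokens, log-probability) pair found with the given beam width."""
    beams = [([start], 0.0)]  # initialization: one beam with log P = 0
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "<eos>":
                candidates.append((tokens, score))  # finished beams carry over unchanged
                continue
            # Expansion: append every possible next token and update the score.
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Selection: keep only the beam_size highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        # Termination: stop once every surviving beam has ended.
        if all(tokens[-1] == "<eos>" for tokens, _ in beams):
            break
    return beams[0]


best_tokens, best_logp = beam_search(beam_size=2)
print(best_tokens, math.exp(best_logp))
print(beam_search(beam_size=1)[0])  # beam width 1 behaves like greedy search
```

In this toy table the first step favours `the`, yet the sequence starting with `a` has the higher total probability, so a beam width of 2 recovers a sequence that a beam width of 1 misses.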

The beam search algorithm can be expressed mathematically as follows:

- Let $ S_t^b $ denote the partial sequence for beam $ b $ at time step $ t $.
- Let $ P(S_t^b) $ denote the probability of the partial sequence $ S_t^b $ up to time step $ t $.
- Let $ P(w_t | S_t^b) $ denote the conditional probability of the next token $ w_t $ given the partial sequence $ S_t^b $ at time step $ t $.
- Let $ K $ denote the beam width.

The beam search algorithm can be summarized with the following formulas:

1. **Initialization**:
   $ S_1^1 = \text{[BOS]} $
   $ P(S_1^1) = 1 $

2. **Expansion**:
   $ P(w_t | S_t^b) = \text{GPT\_Model}(S_t^b, w_t) $
   $ S_{t+1}^b = S_t^b \oplus w_t $ (append $ w_t $ to the end of the sequence $ S_t^b $)
   $ P(S_{t+1}^b) = P(S_t^b) \times P(w_t | S_t^b) $

3. **Selection**:
   $ (S_{t+1}^{(1)}, ..., S_{t+1}^{(K)}) = \text{Top-K}(S_{t+1}^1, ..., S_{t+1}^B, K) $, where $ B $ is the number of expanded candidates and candidates are ranked by $ P(S_{t+1}^b) $.

4. **Termination**:
   - Terminate beams that reach the maximum length or encounter an end-of-sequence token.

This process continues until every beam has either terminated or reached the maximum length.
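
One practical note on the product $ P(S_{t+1}^b) = P(S_t^b) \times P(w_t | S_t^b) $: multiplying many small probabilities can underflow in floating point, so implementations commonly add log-probabilities instead. The snippet below is a small sketch of that effect; the per-token probabilities in `step_probs` are hypothetical values chosen only to make the underflow visible.

```python
import math

# Hypothetical per-token probabilities for a 200-token sequence.
step_probs = [0.01] * 200

# Direct product of probabilities: becomes exactly 0.0 once it underflows.
product = 1.0
for p in step_probs:
    product *= p

# Sum of log-probabilities: the same quantity in log space, which stays finite.
log_score = sum(math.log(p) for p in step_probs)

print(product, log_score)
```

Because the logarithm is monotonic, ranking beams by summed log-probability gives the same order as ranking them by the product of probabilities.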

In summary, beam search efficiently explores the space of possible sequences and provides a trade-off between exploration and exploitation, helping to find high-quality sequences in sequence generation tasks such as text generation with models like GPT.

In [None]:
def beam_search_generation(input_ids, model, beam_size) -> torch.Tensor:
    outputs = input_ids
    
    # =========================
    # Your code starts here (4 points)
    # outputs should be a torch Tensor of shape [1, n]
    # where n is the length of the generated text
    # and the elements are the token ids of the generated text
    # 
    # You should not use any external library like transformers' .generate().
    # Only torch is allowed.
    # =========================
    
    # =========================
    # Your code ends here
    # =========================

    return outputs

generated_ids = beam_search_generation(input_ids, model, beam_size=4)[0]
generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(f"prompt: {prompt}\ngenerated_text: {generated_text}")

### Task 3: Sampling

The sampling method randomly draws the next word from the model's probability distribution instead of always choosing the word with the highest probability. This increases the diversity of generated outputs, but the results vary from run to run.

It can be represented by the formula:

$ w_t \sim \text{Top-}k\left(P(w_t | w_{<t})\right) $

This formula introduces a top-k constraint to the sampling process: instead of sampling from the entire probability distribution $ P(w_t | w_{<t}) $, we first select the $ k $ most likely words according to this distribution, renormalize their probabilities so that they sum to 1, and then sample from this restricted set. This limits how unlikely a sampled word can be while still allowing some randomness in the output.

In this context, $ \text{Top-}k\left(P(w_t | w_{<t})\right) $ represents the selection of the top $ k $ words with the highest probabilities from the distribution $ P(w_t | w_{<t}) $, and $ w_t $ is sampled from this restricted set of words.
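
To make the top-k restriction concrete, here is a minimal sketch that samples from a hand-written score vector in plain Python. The helper `top_k_sample` and the values in `toy_logits` are hypothetical and stand in for the model's next-token scores; a torch-based solution to the task would operate on tensors instead.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one index from the k highest-scoring entries of `logits`."""
    # Indices of the k largest scores.
    top_indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Unnormalized softmax weights over the top-k entries (max subtracted for stability).
    max_logit = max(logits[i] for i in top_indices)
    weights = [math.exp(logits[i] - max_logit) for i in top_indices]
    # choices() samples proportionally to the weights, which renormalizes them implicitly.
    return rng.choices(top_indices, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
print({i: samples.count(i) for i in sorted(set(samples))})
```

With `k=2`, only indices 0 and 1 can ever be drawn, and setting `k=1` reduces the procedure to greedy search.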


In [None]:
def sampling_generation(input_ids, model, topk) -> torch.Tensor:
    outputs = input_ids
    
    # =========================
    # Your code starts here (3 points)
    # outputs should be a torch Tensor of shape [1, n]
    # where n is the length of the generated text
    # and the elements are the token ids of the generated text
    # 
    # You should not use any external library like transformers' .generate().
    # Only torch is allowed.
    # =========================
    
    # =========================
    # Your code ends here
    # =========================

    return outputs

generated_ids = sampling_generation(input_ids, model, topk=5)[0]
generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(f"prompt: {prompt}\ngenerated_text: {generated_text}")