## CS310 Natural Language Processing
## Lab 10: Explore Natural Language Generation

In [2]:
import torch
import random

### T1. Explore Pretrained GPT-2 Model

In this task, you will explore the GPT-2 model using the `transformers` library.

Just like in the previous lab, you will need to download the pretrained model and unzip it to `./gpt2zh`. Note that this is not the original version of GPT-2 provided by OpenAI (https://huggingface.co/openai-community/gpt2), but rather a fine-tuned version for Chinese text generation.

In [3]:
from transformers import AutoTokenizer, GPT2LMHeadModel

gpt2_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="gpt2zh")
gpt2_model = GPT2LMHeadModel.from_pretrained(pretrained_model_name_or_path="gpt2zh")
# Evaluation mode
gpt2_model.eval()

print('vocab size:', gpt2_tokenizer.vocab_size)
print(f'special token {gpt2_tokenizer.sep_token}:', gpt2_tokenizer.sep_token_id)
print(f'special token {gpt2_tokenizer.cls_token}:', gpt2_tokenizer.cls_token_id)
print(f'special token {gpt2_tokenizer.pad_token}:', gpt2_tokenizer.pad_token_id)

# Use [SEP] as end-of-sentence token
gpt2_model.config.eos_token_id = gpt2_tokenizer.sep_token_id

vocab size: 21128
special token [SEP]: 102
special token [CLS]: 101
special token [PAD]: 0


The tokenizer can return the token IDs and the attention mask that indicates which tokens are padding tokens (`1` for real tokens, `0` for padding tokens).

Since we only have one sentence in the "batch", there is no padding used, and thus no `0` in the attention mask.
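
To see where the `0`s would come from, here is a small pure-Python sketch of padding a batch by hand. The helper `pad_batch` is hypothetical (it is not part of `transformers`), and the second sequence is a made-up shorter input. The token ids are copied from the tokenizer output shown in this section. Calling the tokenizer on several sentences with `padding=True` should produce the same kind of mask, assuming right-side padding.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)       # real ids followed by [PAD] ids
        masks.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, masks

seq_a = [101, 2110, 5445, 3198, 739, 722, 102]  # [CLS] 学 而 时 习 之 [SEP]
seq_b = [101, 2110, 5445, 102]                  # [CLS] 学 而 [SEP] (made-up shorter input)

padded, masks = pad_batch([seq_a, seq_b])
print(padded[1])  # [101, 2110, 5445, 102, 0, 0, 0]
print(masks[1])   # [1, 1, 1, 1, 0, 0, 0]
```

The longest sequence in the batch gets a mask of all `1`s, which is exactly the single-sentence case here.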

In [4]:
input_text = '学而时习之，不亦说乎！'
input_encoded = gpt2_tokenizer(input_text, return_tensors="pt")

print('input ids:', input_encoded['input_ids'])
print('input attention mask:', input_encoded['attention_mask'])

# Map token ids back to tokens
print('input tokens:', gpt2_tokenizer.convert_ids_to_tokens(input_encoded['input_ids'][0]))

input ids: tensor([[ 101, 2110, 5445, 3198,  739,  722, 8024,  679,  771, 6432,  725, 8013,
          102]])
input attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
input tokens: ['[CLS]', '学', '而', '时', '习', '之', '，', '不', '亦', '说', '乎', '！', '[SEP]']


It's easy to directly use the `generate` method to generate some sentences:

In [5]:
input_text = "子曰：人"
input_encoded = gpt2_tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
n_outputs = 5

output = gpt2_model.generate(**input_encoded, 
                                 max_length=20, 
                                 num_return_sequences=n_outputs,
                                 do_sample=True, 
                                 top_k=50, 
                                 top_p=0.95, 
                                 temperature=0.7,
                                 pad_token_id=0,
                                 )
# print(type(output))
# print(output.shape)

# Decode and print each sampled sequence
for i in range(n_outputs):
    output_text = gpt2_tokenizer.decode(output[i], skip_special_tokens=True)
    print(output_text)



If you decode and print the outputs, you will see that the generation is far from perfect: even with sampling, it is still likely to produce a lot of repetition.

---

### T2. Implement Top-k Sampling Algorithms Manually

Let's first try greedy search, i.e., top-1 sampling.

In [7]:
input_text = "今天天气"
input_encoded = gpt2_tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
print('input size:', input_encoded.input_ids.shape[1])

output = gpt2_model(input_encoded.input_ids, 
                    attention_mask=input_encoded.attention_mask)
logits = output.logits
print(logits.shape)

### START YOUR CODE ###
# Get the logits (unnormalized next-token scores) predicted at the last token's position
# last_token_logits = None
last_token_logits = logits[:, -1, :]
# Get the most likely token id; the argmax of the logits equals the argmax of the softmax probabilities
# most_likely_token_id = None
most_likely_token_id = last_token_logits.argmax(dim=-1).item()
### END YOUR CODE ###

# Convert the token id to a token
most_likely_token = gpt2_tokenizer.convert_ids_to_tokens(most_likely_token_id)
print(most_likely_token)

# You should expect to see the following output:
# input size: 4
# torch.Size([1, 4, 21128])
# 预

input size: 4
torch.Size([1, 4, 21128])
预


Once you are done with the above code, you can implement the full generation loop: at each iteration, select the most likely token, append it to the end of the input, and feed the new input to the model to predict the next token.

The loop continues until `max_gen_len` is reached, or a `"[SEP]"` token is generated.

**Note**: 
- Use `torch.cat` to append elements to the input IDs.
- The `attn_mask` also needs to be updated at each iteration.

In [8]:
max_gen_len = 50

input_text = "今天天气"
input_encoded = gpt2_tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
input_ids = input_encoded.input_ids
attn_mask = input_encoded.attention_mask

count = 0
while count < max_gen_len:
    output = gpt2_model(input_ids, attention_mask=attn_mask)
    logits = output.logits

    ### START YOUR CODE ###
    # last_token_logits = None
    # sampled_token_id = None
    # if sampled_token_id == gpt2_tokenizer.sep_token_id:
    #     break
    # input_ids = None
    # attn_mask = None
    last_token_logits = logits[:, -1, :]
    sampled_token_id = last_token_logits.argmax(dim=-1).item()
    if sampled_token_id == gpt2_tokenizer.sep_token_id:
        break
    input_ids = torch.cat([input_ids, torch.tensor([[sampled_token_id]])], dim=1)
    attn_mask = torch.cat([attn_mask, torch.tensor([[1]])], dim=1)
    ### END YOUR CODE ###

    count += 1


# Test
special_token_ids = set([gpt2_tokenizer.sep_token_id, 
                         gpt2_tokenizer.cls_token_id, 
                         gpt2_tokenizer.pad_token_id,
                         100]) # 100 for [UNK]

# Decode the generated tokens ids
for i in range(input_ids.shape[1]):
    tok_id = input_ids[0, i].item()
    # Skip the special tokens
    if tok_id not in special_token_ids:
        print(gpt2_tokenizer.convert_ids_to_tokens(tok_id), end='')

# You should expect to see the following output:
# 今天天气预报：今天白天，我市阴天有小雨，气温：小雨转多云，气温：小雨转多云，气温：小雨转多云，气温：小雨转多

今天天气预报：今天白天，我市阴天有小雨，气温：小雨转多云，气温：小雨转多云，气温：小雨转多云，气温：小雨转多

As you can see, greedy search results in very repetitive text.
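
One rough way to make the repetition visible is to count how often the same character n-gram recurs in the greedy output above. The sketch below is only an illustration: `repeated_ngrams` is a hypothetical helper, and the choice of `n=7` is arbitrary.

```python
from collections import Counter

def repeated_ngrams(text, n):
    """Return the character n-grams that occur more than once, with their counts."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram: c for gram, c in counts.items() if c > 1}

# The greedy output recorded above
greedy_output = "今天天气预报：今天白天，我市阴天有小雨，气温：小雨转多云，气温：小雨转多云，气温：小雨转多云，气温：小雨转多"

repeats = repeated_ngrams(greedy_output, n=7)
print(repeats["气温：小雨转多"])  # 4
```

The same 7-character phrase appears four times in an output of about 50 generated tokens.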

Now, let's implement a `top-k` sampling algorithm.

The idea is to sample **uniformly** from the top-k most likely next tokens. PyTorch tensors provide a `topk` method that returns the top-k values and their indices.

In the following example, you can check the top 5 most likely words following the sentence "今天天气":

In [12]:
k = 5
input_text = "今天天气"
input_encoded = gpt2_tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
input_ids = input_encoded.input_ids
attn_mask = input_encoded.attention_mask

output = gpt2_model(input_ids, attention_mask=attn_mask)
logits = output.logits
### START YOUR CODE ###
# last_token_logits = None
# top_k_logits, top_k_indices = None
last_token_logits = logits[:, -1, :]
top_k_logits, top_k_indices = last_token_logits.topk(k)
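# Drop the batch dimension from the indices; top_k_logits keeps shape [1, k],
# so it prints with an extra pair of brackets
top_k_indices = top_k_indices[0]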
### END YOUR CODE ###

# Test
print(top_k_logits)
print(top_k_indices)

for i in range(k):
    tok_id = top_k_indices[i].item()
    print(gpt2_tokenizer.convert_ids_to_tokens(tok_id), end=' ')

# You should expect to see the following output:
# tensor([7.8924, 7.8550, 7.5893, 7.3502, 7.3069], grad_fn=<TopkBackward0>)
# tensor([7564, 2523,  679, 1962, 6820])
# 预 很 不 好 还 

tensor([[7.8924, 7.8550, 7.5893, 7.3502, 7.3069]], grad_fn=<TopkBackward0>)
tensor([7564, 2523,  679, 1962, 6820])
预 很 不 好 还 

Next, let's integrate the top-k sampling algorithm into the generation process. Uniform sampling can be implemented with `random.choice` (or `random.choices`) over the top-k indices.

In [29]:
def generate_topk(input_text, k=5, max_gen_len=50):
    input_encoded = gpt2_tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
    input_ids = input_encoded.input_ids
    attn_mask = input_encoded.attention_mask

    count = 0
    while count < max_gen_len:
        output = gpt2_model(input_ids, attention_mask=attn_mask)
        logits = output.logits

        ### START YOUR CODE ###
        # last_token_logits = logits[0, -1, :]
        # top_k_logits, top_k_indices = None
        # sampled_token_id = None
        # if sampled_token_id == gpt2_tokenizer.sep_token_id:
        #     break
        # input_ids = None
        # attn_mask = None
        last_token_logits = logits[0, -1, :]
        top_k_logits, top_k_indices = last_token_logits.topk(k)
        sampled_token_id = random.choice(top_k_indices.tolist())
        if sampled_token_id == gpt2_tokenizer.sep_token_id:
            break
        input_ids = torch.cat([input_ids, torch.tensor([[sampled_token_id]])], dim=1)
        attn_mask = torch.cat([attn_mask, torch.tensor([[1]])], dim=1)

        ### END YOUR CODE ###

        count += 1
    
    special_token_ids = set([gpt2_tokenizer.sep_token_id, 
                         gpt2_tokenizer.cls_token_id, 
                         gpt2_tokenizer.pad_token_id,
                         100]) # 100 for [UNK]
    
    generated_text = ''
    for i in range(input_ids.shape[1]):
        tok_id = input_ids[0, i].item()
        if tok_id not in special_token_ids:
            generated_text += gpt2_tokenizer.convert_ids_to_tokens(tok_id)
    
    return generated_text

In [30]:
# Test
input_text = "今天天气"
print(generate_topk(input_text, k=50))

input_text = "子曰：人"
print(generate_topk(input_text, k=50))

今天天气冷呢就去爬了第19页啊啊。到底爬错一种东欧爬坡法？可见没问原问时的想必一下都吓飞大家哦看懂才懂得这都哪
子曰：人情长一处知事出口深（大千善学）


Note that although uniform top-k sampling solves the repetition issue, it produces *extremely incoherent* text. We can remedy this by sampling proportionally to the predicted probabilities instead of uniformly.

There are several ways to implement proportional sampling. You can either:
- Create a list of cumulative relative probabilities of the top-$k$ tokens. For instance, if the relative probabilities of $k=5$ tokens are $0.1$, $0.2$, $0.5$, $0.1$, and $0.1$, then your cumulative probability list is `cum_prob = [0.1, 0.3, 0.8, 0.9, 1.0]`. Then draw a random number $r$ from the uniform distribution on $[0,1)$ with `random.random()`, and decide which token is sampled by finding which bin of `cum_prob` $r$ falls into.
- Or use the `torch.multinomial()` function to accomplish the same sampling. *Note* that the input weights provided to `torch.multinomial` should be the relative probabilities of the top-$k$ tokens, which can be obtained by applying softmax to the top-$k$ logits. `torch.multinomial` returns a position within the weights (from $0$ to $k-1$), so map it back to a token id through `top_k_indices`.
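
The first option can be sketched in plain Python as follows. This is a minimal illustration rather than the reference solution: `sample_topk_cumulative` is a hypothetical helper, and the logits and token ids are the top-5 values printed earlier for "今天天气". The random draw yields a *position* within the top-k list, which has to be mapped back to a vocabulary id before it is appended to the input.

```python
import bisect
import itertools
import math
import random

def sample_topk_cumulative(top_k_logits, top_k_ids, rng=random):
    """Sample one token id in proportion to softmax(top_k_logits)."""
    # Softmax over the k logits only (subtract the max for numerical stability)
    m = max(top_k_logits)
    exps = [math.exp(x - m) for x in top_k_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Cumulative bins, e.g. [0.1, 0.3, 0.8, 0.9, 1.0] for the example in the text
    cum_prob = list(itertools.accumulate(probs))

    # Find the bin that r falls into
    r = rng.random()
    pos = bisect.bisect_right(cum_prob, r)
    pos = min(pos, len(cum_prob) - 1)  # guard against rounding error in the last bin

    # pos is a position in the top-k list, not a vocabulary id
    return top_k_ids[pos]

top_k_logits = [7.8924, 7.8550, 7.5893, 7.3502, 7.3069]
top_k_ids = [7564, 2523, 679, 1962, 6820]  # 预 很 不 好 还

rng = random.Random(0)  # seeded only to make the run repeatable
samples = [sample_topk_cumulative(top_k_logits, top_k_ids, rng) for _ in range(1000)]
print({tid: samples.count(tid) for tid in top_k_ids})
```

Because the five logits are close to each other, the counts should come out fairly even, with a mild preference for the higher-scoring tokens.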

In [31]:
import torch.nn.functional as F
def generate_topk_prop(input_text, k=50, max_gen_len=50):
    input_encoded = gpt2_tokenizer(input_text, return_tensors="pt", add_special_tokens=False)
    input_ids = input_encoded.input_ids
    attn_mask = input_encoded.attention_mask

    count = 0
    while count < max_gen_len:
        output = gpt2_model(input_ids, attention_mask=attn_mask)
        logits = output.logits

        ### START YOUR CODE ###
        # last_token_logits = None
        # top_k_logits, top_k_indices = None
        # pass
        # sampled_token_id = None
        # if sampled_token_id == gpt2_tokenizer.sep_token_id:
        #     break
        # input_ids = None
        # attn_mask = None

        last_token_logits = logits[0, -1, :]
        top_k_logits, top_k_indices = last_token_logits.topk(k)
        top_k_probs = F.softmax(top_k_logits, dim=-1)
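        # torch.multinomial returns a position within the top-k list (0..k-1),
        # so map it back to the corresponding vocabulary id
        sampled_pos = torch.multinomial(top_k_probs, num_samples=1)
        sampled_token_id = top_k_indices[sampled_pos].item()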
        if sampled_token_id == gpt2_tokenizer.sep_token_id:
            break
        input_ids = torch.cat([input_ids, torch.tensor([[sampled_token_id]])], dim=1)
        attn_mask = torch.cat([attn_mask, torch.tensor([[1]])], dim=1)
        ### END YOUR CODE ###

        count += 1
    
    special_token_ids = set([gpt2_tokenizer.sep_token_id, 
                         gpt2_tokenizer.cls_token_id, 
                         gpt2_tokenizer.pad_token_id,
                         100]) # 100 for [UNK]
    
    generated_text = ''
    for i in range(input_ids.shape[1]):
        tok_id = input_ids[0, i].item()
        if tok_id not in special_token_ids:
            generated_text += gpt2_tokenizer.convert_ids_to_tokens(tok_id)
    
    return generated_text





In [32]:
# Test
input_text = "今天天气"
print(generate_topk_prop(input_text, k=50))

input_text = "子曰：人"
print(generate_topk_prop(input_text, k=50))

今天天气[unused3][unused2][unused1][unused2][unused5][unused2][unused30][unused30][unused16][unused16][unused32][unused21][unused7][unused3][unused9][unused25][unused31][unused4][unused42][unused2][unused42][unused2][unused2][unused1][unused42][unused8][unused5][unused40][unused16][unused30][unused3][unused6][unused5][unused26][unused31][unused6][unused30][unused43][unused41][unused29][unused46][unused14][unused1]
子曰：人[unused1][unused37][unused8][unused1][unused3][unused23][unused8][unused24][unused1][unused3][unused29][unused49][unused16][unused16][unused44][unused29][unused1][unused32][unused26][unused32][unused14][unused29][unused12][unused2][unused11][unused7][unused20][unused28][unused33][unused21][unused18][unused47][unused5][unused2][unused2][unused30][unused31][unused4][unused8][unused13][unused1][unused41][unused17][unused1][unused8][unused29][unused40]


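The recorded output above consists entirely of `[unusedN]` tokens. This is the pattern you get when the position returned by `torch.multinomial` (between $0$ and $k-1$) is appended directly as a token id instead of being mapped through `top_k_indices`: in this vocabulary, the smallest ids appear to belong to `[PAD]` and the `[unused...]` placeholders. Once the mapping is in place, rerun the cell and compare.

Do you think proportional sampling produces better text?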

Have fun sampling! :)