In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch # We'll load PyTorch so we can convert a list to a tensor
from scipy.special import softmax
import numpy as np # We're using numpy to use its argmax function
import random
from transformers import pipeline

# Language Generation with Transformers

When predicting the next token, a GPT model gives us a score for every possible next token. After converting those scores to probabilities, we can generate new text, either by selecting the most likely next token or by sampling according to the probabilities. Let's see how that works.
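To make the two selection strategies concrete before touching a real model, here is a minimal sketch on a made-up distribution. The three tokens and their probabilities are assumptions chosen for illustration, not output from `distilgpt2`.

```python
import random

# Toy next-token distribution (made-up values, not real model output).
next_token_probs = {" the": 0.6, " a": 0.3, " his": 0.1}

# Greedy selection: always take the single most likely token.
greedy_token = max(next_token_probs, key=next_token_probs.get)

# Sampling: draw a token at random, weighted by its probability.
rng = random.Random(0)  # fixed seed so the draw is repeatable
tokens = list(next_token_probs)
weights = list(next_token_probs.values())
sampled_token = rng.choices(tokens, weights=weights, k=1)[0]

print(repr(greedy_token))   # always the most likely token
print(repr(sampled_token))  # any of the three tokens, weighted by probability
```

Greedy selection is deterministic, while sampling can return a different token on each draw unless the seed is fixed.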

Let's say that we want to generate more text after the sequence below:

In [2]:
text = 'The quick brown fox jumped over'

We'll need to load the tokenizer and model for `distilgpt2`.

In [3]:
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

As before, we use the tokenizer to tokenize the text and convert each token to its token ID. We will use the `.encode` function to get the token IDs back as a Python list, which is easier to manipulate: later we'll want to append the extra token IDs that we generate.

In [4]:
input_ids = tokenizer.encode(text)
input_ids

[464, 2068, 7586, 21831, 11687, 625]

We can use the `tokenizer.decode` function to turn the token IDs back into text. This will be useful once we've generated further token IDs and appended them to the end.

In [5]:
tokenizer.decode(input_ids)

'The quick brown fox jumped over'

Now let's run the token IDs through the `distilgpt2` model and get the probabilities of the next token.

In [6]:
as_tensor = torch.tensor(input_ids).reshape(1,-1) # This converts the token ID list to a tensor
output = model(input_ids=as_tensor) # We pass it into the model
next_token_scores = output.logits[0,-1,:].detach().numpy() # We get the scores for the next token at the end of the sequence (token index=-1)
next_token_probs = softmax(next_token_scores) # And we apply a softmax function

next_token_probs.shape

(50257,)

Now we have a probability for each of the 50257 tokens in the vocabulary, telling us how likely each one is to follow our input text.
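For intuition about what the softmax step does to the raw scores, here is a small pure-Python sketch. The function name `softmax_sketch` and the three example scores are assumptions made for illustration. It is not the `scipy` implementation, only the same formula written out.

```python
import math

def softmax_sketch(scores):
    """Turn a list of raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp().
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

toy_scores = [2.0, 1.0, -1.0]  # made-up logits for a three-token vocabulary
toy_probs = softmax_sketch(toy_scores)

print(toy_probs)
print(sum(toy_probs))  # approximately 1.0, up to floating-point rounding
```

Softmax preserves the ordering of the scores, so the highest-scoring token is also the most probable one.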

Let's get the one with the highest probability. For that we can use the `argmax` function.

In [7]:
next_token_id = next_token_probs.argmax()
next_token_id

np.int64(262)

Hmm, the token with ID=262 has the highest probability. But what token is that? `tokenizer.decode` can tell us:

In [8]:
tokenizer.decode(next_token_id)

' the'
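It can also help to look at the runners-up rather than only the single best token. The sketch below ranks a made-up probability list. The helper `top_k_ids` and the five probabilities are hypothetical and stand in for `next_token_probs`.

```python
# Toy probabilities indexed by token ID (made-up values for a five-token vocabulary).
toy_probs = [0.05, 0.40, 0.10, 0.30, 0.15]

def top_k_ids(probs, k=3):
    """Return the k token IDs with the highest probability, best first."""
    ranked = sorted(range(len(probs)), key=lambda token_id: probs[token_id], reverse=True)
    return ranked[:k]

# Print each of the top candidates alongside its probability.
for token_id in top_k_ids(toy_probs):
    print(token_id, toy_probs[token_id])
```

With the real NumPy array, `np.argsort(next_token_probs)[::-1][:3]` should give the same kind of ranking, and each ID can then be passed to `tokenizer.decode`.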

Now we have all the parts we need. Your task is to calculate the next eight tokens after `input_ids` (including the one we calculated above). At each step you'll append the chosen token ID to the input token IDs, run the sequence through the model again, and decide the next token. The argmax above gave `262`; the solution below also tries `1353` as an alternative first token. Try writing it as a loop that iterates eight times.

In [19]:
def greedy_ids_generator(input_ids_, model_, n=8):
    """Greedily append n token IDs to input_ids_ (a tensor of shape (1, sequence_length))."""
    for _ in range(n):
        output_ = model_(input_ids=input_ids_)
        next_token_scores_ = output_.logits[0,-1,:].detach().numpy() # Scores for the next token at the end of the sequence (token index=-1)
        next_token_probs_ = softmax(next_token_scores_) # Convert the scores to probabilities
        next_token_id_ = next_token_probs_.argmax() # Get the token ID with the highest probability
        input_ids_ = torch.cat((input_ids_, torch.tensor([next_token_id_]).reshape(1,-1)), dim=1) # Append the new token ID to the input IDs
    return input_ids_

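# Variant: append token ID 1353 to the prompt before generating
tensor_modified = torch.cat((as_tensor, torch.tensor([1353]).reshape(1,-1)), dim=1)
# Generate eight tokens greedily from the original prompt and from the modified one
greedy_generated = greedy_ids_generator(as_tensor, model)
greedy_generated_modified = greedy_ids_generator(tensor_modified, model)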

In [20]:
tokenizer.decode(token_ids=greedy_generated[0, :].tolist())

'The quick brown fox jumped over the fence and ran over the fence.'

In [21]:
tokenizer.decode(token_ids=greedy_generated_modified[0, :].tolist())

'The quick brown fox jumped over top of the fox and then jumped over the'
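Both greedy outputs above reuse the same few words ("over the fence", "jumped over"). Sampling from the probabilities, mentioned at the start, is one alternative. The sketch below shows the same loop with sampling in place of `argmax`. It uses a hypothetical `toy_next_token_probs` function over a four-token vocabulary as a stand-in for the model so that it runs without any downloads. The names and probabilities are assumptions, not part of the exercise.

```python
import random

def toy_next_token_probs(token_ids):
    """Hypothetical stand-in for the model: probabilities over a 4-token vocabulary."""
    last = token_ids[-1]
    # Favour the token ID that follows the last one, with some mass on every other token.
    probs = [0.1, 0.1, 0.1, 0.1]
    probs[(last + 1) % 4] = 0.7
    return probs

def sample_ids_generator(token_ids, n=8, seed=0):
    """Append n sampled token IDs to a copy of token_ids."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    token_ids = list(token_ids)
    for _ in range(n):
        probs = toy_next_token_probs(token_ids)
        # Draw the next ID at random, weighted by its probability, instead of taking the argmax.
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        token_ids.append(next_id)
    return token_ids

generated = sample_ids_generator([0, 1], n=8)
print(len(generated))  # 10: the 2 starting IDs plus 8 sampled ones
print(generated)
```

To try this with `distilgpt2`, the equivalent change in `greedy_ids_generator` would be replacing the `argmax` line with a weighted random draw over `next_token_probs_`.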