In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch # We'll load PyTorch so we can convert a list of token IDs to a tensor
from scipy.special import softmax
import numpy as np # We use NumPy arrays and their argmax method
import random
from transformers import pipeline

# Language Generation with Transformers

When predicting the next token, a GPT model gives us a score for every possible next token, and those scores can be converted into probabilities. We can use the probabilities to generate new text, either by selecting the most likely next token or by sampling according to the probabilities. Let's see how that works.

Let's say that we want to generate more text after the sequence below:

In [2]:
text = 'The quick brown fox jumped over'

We'll need to load the tokenizer and model for `distilgpt2`.

In [3]:
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

As before, we use the tokenizer to tokenize the text and convert each token to its token ID. We will use the `.encode` function to get the token IDs back as a Python list, which is easier to manipulate: later we'll want to append the extra token IDs that we generate.

In [4]:
input_ids = tokenizer.encode(text)
input_ids

[464, 2068, 7586, 21831, 11687, 625]

We can use the `tokenizer.decode` function to turn the token IDs back into text. This will be useful after we've generated further token IDs to add to the end.

In [5]:
tokenizer.decode(input_ids)

'The quick brown fox jumped over'

Now let's run the token IDs through the `distilgpt2` model and get the probabilities of the next token.

In [6]:
as_tensor = torch.tensor(input_ids).reshape(1,-1) # This converts the token ID list to a tensor
output = model(input_ids=as_tensor) # We pass it into the model
next_token_scores = output.logits[0,-1,:].detach().numpy() # We get the scores for the next token, read at the end of the sequence (token index=-1)
next_token_probs = softmax(next_token_scores) # And we apply a softmax function

next_token_probs.shape

(50257,)

Now we have a probability for each of the 50257 tokens in the vocabulary, telling us how likely each one is to come after our input text sequence.
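To make the indexing and the softmax step concrete, here is a minimal sketch on a made-up array. The name `toy_logits`, its values, and the helper `softmax_1d` are assumptions for illustration only; the notebook itself uses `scipy.special.softmax` on the real model output.

```python
import numpy as np

def softmax_1d(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = scores - scores.max()  # subtracting the max avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Made-up logits with shape (batch=1, sequence_length=3, vocab_size=4)
toy_logits = np.array([[[0.5, 1.0, -1.0, 0.0],
                        [2.0, 0.1, 0.3, -0.5],
                        [1.0, 3.0, 0.0, 2.0]]])

# First batch item, last position in the sequence, all vocabulary entries
last_position_scores = toy_logits[0, -1, :]
toy_probs = softmax_1d(last_position_scores)

print(toy_probs.shape)     # (4,)
print(toy_probs.argmax())  # 1
```

The softmax keeps the ordering of the scores, so the largest score always ends up with the largest probability, and the probabilities sum to 1.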

Let's get the one with the highest probability. For that we can use the `argmax` function.

In [7]:
next_token_id = next_token_probs.argmax()
next_token_id

np.int64(262)

Hmm, the token with ID=262 has the highest probability. But what token is that? `tokenizer.decode` can tell us:

In [8]:
tokenizer.decode(next_token_id)

' the'
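It can also be instructive to look at the runners-up rather than only the single best token. The sketch below ranks a made-up five-token vocabulary; `toy_vocab` and `toy_probs` are hypothetical stand-ins, and with the real model you would presumably rank the model's probability array and decode each ID with `tokenizer.decode`.

```python
import numpy as np

# Hypothetical five-token vocabulary with made-up probabilities (illustration only)
toy_vocab = [' the', ' a', ' his', ' it', ' and']
toy_probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])

k = 3
# argsort gives ascending order, so reverse it and keep the first k indices
top_k_ids = np.argsort(toy_probs)[::-1][:k]

for token_id in top_k_ids:
    print(token_id, repr(toy_vocab[token_id]), toy_probs[token_id])
```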

Now we have all the parts we need. Your task is to calculate the next eight tokens after `input_ids` (including the one we calculated above). You'll be appending the chosen token ID (`262` in the first step) to the input token IDs, running the result through the model again, and deciding the next token. Try writing it as a loop that iterates eight times.

In [9]:
def probability_generator(input_ids_, model_):
    output_ = model_(input_ids=input_ids_)
    next_token_scores_ = output_.logits[0,-1,:].detach().numpy() # We get the scores for the next token, read at the end of the sequence (token index=-1)
    return softmax(next_token_scores_) # And we apply a softmax function

def greedy_ids_generator(input_ids_, model_, n=8):
    for i in range(n):
        next_token_probs_ = probability_generator(input_ids_, model_)
        next_token_id_ = next_token_probs_.argmax() # Get the token ID with the highest probability
        input_ids_ = torch.cat((input_ids_, torch.tensor([next_token_id_]).reshape(1,-1)), dim=1) # Add the new token ID to the input IDs
    return input_ids_

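# Variant: force token ID 1353 as the first continuation instead of the greedy choice (262)
tensor_modified = torch.cat((as_tensor, torch.tensor([1353]).reshape(1,-1)), dim=1)
greedy_generated = greedy_ids_generator(as_tensor, model) # Eight greedy tokens after the original prompt
greedy_generated_modified = greedy_ids_generator(tensor_modified, model) # Eight greedy tokens after the forced token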

In [10]:
tokenizer.decode(token_ids=greedy_generated[0, :].tolist())

'The quick brown fox jumped over the fence and ran over the fence.'

In [11]:
tokenizer.decode(token_ids=greedy_generated_modified[0, :].tolist())

'The quick brown fox jumped over top of the fox and then jumped over the'

With eight extra tokens, you should get a list with IDs = `[464, 2068, 7586, 21831, 11687, 625, 262, 13990, 290, 4966, 625, 262, 13990, 13]` which decodes to give the text: "The quick brown fox jumped over the fence and ran over the fence.".
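The same loop can be sketched without downloading a model, using plain Python lists as suggested earlier. Everything here is a hypothetical stand-in: `TOY_TRANSITIONS` holds made-up probabilities and `toy_next_token_probs` only looks at the last token, which a real GPT model does not do.

```python
import numpy as np

# Hypothetical stand-in for the model: row i holds the next-token
# probabilities given that the last token ID was i (made-up numbers).
TOY_TRANSITIONS = np.array([
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.5, 0.2, 0.1, 0.2],
    [0.2, 0.3, 0.4, 0.1],
])

def toy_next_token_probs(token_ids):
    """Return next-token probabilities based only on the last token ID."""
    return TOY_TRANSITIONS[token_ids[-1]]

def toy_greedy_generate(token_ids, n=8):
    token_ids = list(token_ids)  # copy so the caller's list is left untouched
    for _ in range(n):
        probs = toy_next_token_probs(token_ids)
        token_ids.append(int(probs.argmax()))  # greedy: always take the most likely token
    return token_ids

toy_output = toy_greedy_generate([3], n=8)
print(toy_output)  # [3, 2, 0, 1, 2, 0, 1, 2, 0]
```

Note how the toy output falls into a repeating cycle (2, 0, 1), much like "over the fence and ran over the fence" in the real output: greedy decoding is deterministic, so once it revisits a state it follows the same path again.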

Picking the token with the highest probability every time often produces quite boring, repetitive text. Sampling from the tokens can generate more interesting text. Sampling uses the probabilities as weights so that tokens with higher probabilities are more likely to be chosen. Let's see how that works:

Let's imagine we've got probabilities for four possible tokens (a very tiny vocabulary).

In [12]:
next_token_probs = np.array([0.1, 0.2, 0.5, 0.2]) # Probabilities should sum to 1

As we saw above, we can use `argmax`, which tells us the index of the highest value. In this case, it's index=2.

In [13]:
next_token_probs.argmax()

np.int64(2)

However, let's say we want to sample randomly from the possible token indices (`[0, 1, 2, 3]`). First, let's create that list to sample from:

In [14]:
indices = list(range(len(next_token_probs)))
indices

[0, 1, 2, 3]

We could use the [choices](https://docs.python.org/3/library/random.html#random.choices) function to pick a single token ID, with all four being equally likely to be chosen.

In [15]:
next_token_id = random.choices(indices, k=1)[0]
next_token_id

0

Or we could provide weights, such that some of the tokens are more likely to be chosen than others. In this case, we provide `next_token_probs` as weights.

In [16]:
next_token_id = random.choices(indices, k=1, weights=next_token_probs)[0]
next_token_id

3

That would allow us to sample from the token probability distribution.
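One way to check that the weights behave as expected is to draw many samples and compare the observed frequencies. This is a small sketch with made-up weights (`toy_weights` is hypothetical) and a fixed seed so the run is repeatable.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

toy_indices = [0, 1, 2, 3]
toy_weights = [0.1, 0.2, 0.5, 0.2]  # made-up weights for illustration
n_draws = 10_000

# Without weights every index is equally likely; with weights index 2 dominates
uniform_counts = Counter(random.choices(toy_indices, k=n_draws))
weighted_counts = Counter(random.choices(toy_indices, k=n_draws, weights=toy_weights))

for i in toy_indices:
    print(i, uniform_counts[i] / n_draws, weighted_counts[i] / n_draws)
```

The unweighted frequencies should all land near 0.25, while the weighted ones should track `toy_weights`. With a real vocabulary of 50257 tokens, unweighted sampling ignores the model entirely, so the generated text would be expected to look like random tokens.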

Your task is to generate some new text (starting from "The quick brown fox jumped over" as before) using sampling and the `random.choices` function to pick your next token. Try it with weighting and without weighting to see what happens.

In [17]:
def sampling_generator(input_ids_, model_, n=8):
    for i in range(n):
        next_token_probs_ = probability_generator(input_ids_, model_)
        indices_ = list(range(len(next_token_probs_)))
        next_token_id_ = random.choices(indices_, k=1, weights=next_token_probs_)[0] # Sample a token ID, using the probabilities as weights
        input_ids_ = torch.cat((input_ids_, torch.tensor([next_token_id_]).reshape(1,-1)), dim=1) # Add the new token ID to the input IDs
    return input_ids_

sampling_generated = sampling_generator(as_tensor, model)
sampling_generated_modified = sampling_generator(tensor_modified, model)

In [18]:
tokenizer.decode(token_ids=sampling_generated[0, :].tolist())

'The quick brown fox jumped over and over and over again after the attack'

In [19]:
tokenizer.decode(token_ids=sampling_generated_modified[0, :].tolist())

'The quick brown fox jumped over top and whispered loudly, \u202a,�'

Try running your code again and you should get a different output due to the random nature of the sampling. There are many tweaks that can be made to the random sampling strategy.
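Two commonly used tweaks are temperature scaling (dividing the scores by a constant before the softmax) and top-k filtering (sampling only among the k highest-scoring tokens). The function below is a rough sketch of both on made-up scores, not the implementation used by any particular library; `sample_with_temperature_top_k` and `toy_scores` are hypothetical names.

```python
import numpy as np

def sample_with_temperature_top_k(scores, temperature=1.0, top_k=None, rng=None):
    """Sample one index from raw scores after temperature scaling and optional top-k filtering."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(scores, dtype=float) / temperature  # temperature < 1 sharpens, > 1 flattens
    if top_k is not None:
        kth_largest = np.sort(scaled)[-top_k]
        # Mask everything below the k-th largest score (ties may keep more than k)
        scaled = np.where(scaled >= kth_largest, scaled, -np.inf)
    scaled = scaled - scaled.max()  # for numerical stability
    probs = np.exp(scaled)          # masked entries become exactly 0
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

toy_scores = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a four-token vocabulary
rng = np.random.default_rng(0)
samples = [
    sample_with_temperature_top_k(toy_scores, temperature=0.7, top_k=2, rng=rng)
    for _ in range(20)
]
print(samples)
```

With `top_k=2`, only indices 0 and 1 can ever be drawn here, however many samples are taken.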

Fortunately, we don't have to implement all the different text generation functions ourselves. The Hugging Face `transformers` library provides a `text-generation` pipeline to generate text.

For example, here is how to run it and request 30 extra tokens and 5 different generations.