<a href="https://colab.research.google.com/github/vkjadon/llm/blob/main/04batch_tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)


In [None]:
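input_ids = torch.tensor(ids)
print("Input IDs:", input_ids)
# The call below fails: the model expects a batch dimension, i.e. input of
# shape (batch_size, sequence_length), but input_ids is a 1-D tensor.
# output = model(input_ids)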

In [None]:
input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

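Batching is the act of sending multiple sentences through the model at once. If you only have one sentence, you can build a batch containing a single sequence, as the cell above does with `torch.tensor([ids])`. When the sentences have different lengths, the shorter ones must be padded so that the batch is rectangular. The cell below runs two sequences individually and then together as a padded batch: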
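As a small illustration of what "a batch" means structurally, the sketch below checks the shape of a list of token-id lists using plain Python. `batch_shape` is a hypothetical helper written for this example (it is not part of `torch` or `transformers`), and the ids are made up.

```python
def batch_shape(batch):
    """Return (batch_size, sequence_length) for a list of token-id lists.

    Hypothetical helper for illustration only. Raises ValueError when the
    rows have different lengths, because a tensor must be rectangular.
    """
    lengths = {len(row) for row in batch}
    if len(lengths) != 1:
        raise ValueError(f"ragged batch, row lengths: {sorted(lengths)}")
    return (len(batch), lengths.pop())


ids = [100, 200, 200]   # made-up token ids
single = [ids]          # a batch holding one sequence
double = [ids, ids]     # the same sequence stacked twice

print(batch_shape(single))  # (1, 3)
print(batch_shape(double))  # (2, 3)
```

Passing rows of different lengths, such as `[[100, 200, 200], [100, 200]]`, raises `ValueError` here, which is the situation padding is meant to resolve.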

In [None]:
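# Two made-up sequences of different lengths, each wrapped as a batch of one.
sequence1_ids = [[100, 200, 200]]
sequence2_ids = [[100, 200]]
# The same two sequences in one batch; the shorter one is padded to length 3.
batched_ids = [
    [100, 200, 200],
    [100, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
# No attention mask is passed here, so the padding token is attended to.
print(model(torch.tensor(batched_ids)).logits)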

When sentences of different lengths are batched, padding tokens are added to the shorter ones so that every row has the same length. Transformer attention layers treat padding tokens like real tokens unless told otherwise, so the logits for a padded sequence in a batch differ from the logits computed for that sequence on its own.
To fix this, we use an attention mask: a tensor of 1s and 0s that tells the model which tokens to attend to (1) and which padding tokens to ignore (0).
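To see why a 0 in the mask removes a token's influence, here is a conceptual sketch of a masked softmax in plain Python. It is not DistilBERT's actual implementation. The function name `masked_softmax` and the scores are invented for illustration, and it assumes at least one position is unmasked.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`, giving zero weight where `mask` is 0.

    Conceptual sketch only; assumes at least one mask entry is 1.
    """
    # Masked positions are set to -inf so that exp(-inf) == 0.0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up attention scores for three tokens; the last one is padding.
scores = [2.0, 1.0, 3.0]

no_mask = masked_softmax(scores, [1, 1, 1])
with_mask = masked_softmax(scores, [1, 1, 0])

print(no_mask)    # the padding token receives the largest weight
print(with_mask)  # the padding token's weight is exactly 0.0
```

Without the mask, the padding position takes part in the weighted average like any other token. With the mask, its weight is zero and the remaining weights are renormalized over the real tokens.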

In [None]:
batched_ids = [
    [100, 200, 200],
    [100, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

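# With the mask applied, the second row's logits should match those of
# [100, 200] run on its own, up to floating-point error.
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))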
print(outputs.logits)
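The padded ids and the mask in the cell above were written by hand. A minimal sketch of deriving both from variable-length sequences follows. `pad_batch` is a hypothetical helper, not part of `transformers`, and `PAD_ID = 0` is an assumed stand-in for `tokenizer.pad_token_id`.

```python
def pad_batch(sequences, pad_id):
    """Pad token-id lists to a common length and build the matching mask.

    Hypothetical helper for illustration; not part of transformers.
    """
    max_len = max(len(seq) for seq in sequences)
    batched_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep their ids and get mask 1; padding gets mask 0.
        batched_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return batched_ids, attention_mask


PAD_ID = 0  # assumption: stands in for tokenizer.pad_token_id

batched_ids, attention_mask = pad_batch([[100, 200, 200], [100, 200]], PAD_ID)
print(batched_ids)     # [[100, 200, 200], [100, 200, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

Both lists can be wrapped in `torch.tensor(...)` and passed to the model as in the cell above. In practice, calling the tokenizer on a list of sentences with `padding=True` returns `input_ids` and `attention_mask` together, so a helper like this is rarely needed.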