<a href="https://colab.research.google.com/github/ruthgn/HF/blob/main/Using_Transformer_Model_to_Handle_Multiple_Sequences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

After exploring the simplest of use cases (inference on a single short sequence), some questions already emerge:
- How do we handle multiple sequences?
- How do we handle multiple sequences of different lengths?
- Are vocabulary indices the only inputs that allow a model to work well?
- Is there such a thing as too long a sequence?

Let's see what kinds of problems these questions pose, and how we can solve them using the HF Transformers API.

In [1]:
!pip install datasets transformers[sentencepiece]



# Models expect a batch of inputs

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "Water is the key to life on earth."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

# Wrap the IDs in an extra list ("[ids]") to add a batch dimension:
# Transformer models expect a batch of sequences by default,
# even when we only send a single sequence
input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[2300, 2003, 1996, 3145, 2000, 2166, 2006, 3011, 1012]])
Logits: tensor([[-3.7894,  4.0509]], grad_fn=<AddmmBackward0>)


_Batching_ is the act of sending **multiple sentences** through the model, all at once. If you only have one sentence, you can just build a batch with a single sequence.

This is a batch of two identical sequences:

In [3]:
batched_ids = [ids, ids]

If we convert this `batched_ids` list into a tensor and pass it through a model, we'll obtain the same logits as before (but twice).

In [4]:
input_ids = torch.tensor(batched_ids)
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits", output.logits)

Input IDs: tensor([[2300, 2003, 1996, 3145, 2000, 2166, 2006, 3011, 1012],
        [2300, 2003, 1996, 3145, 2000, 2166, 2006, 3011, 1012]])
Logits tensor([[-3.7894,  4.0509],
        [-3.7894,  4.0509]], grad_fn=<AddmmBackward0>)


# Padding the inputs

We can use *padding* to give our tensors a rectangular shape. Padding makes sure all our sentences have the same length by adding a special token called the *padding token* to the sentences with fewer values. For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words.
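As a minimal sketch of the idea, padding a ragged list of ID sequences into a rectangle could look like the following. The helper name `pad_batch`, the token IDs, and the pad ID of 0 are illustrative assumptions, not part of the Transformers API; in practice the pad ID would come from `tokenizer.pad_token_id`.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Hypothetical token IDs, chosen only for illustration
ragged = [[200, 200, 200], [200, 200]]
padded = pad_batch(ragged, pad_id=0)
print(padded)  # [[200, 200, 200], [200, 200, 0]]
```

Every row now has the same length, so the nested list can be converted to a tensor.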

IMPORTANT: A key feature of Transformer models is attention layers that _contextualize_ each token. **Because these layers attend to all of the tokens of a sequence, they will take the padding tokens into account, so we need to tell them to ignore the padding tokens.** This can be done by using an attention mask.

## Attention masks

*Attention masks* are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).
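To make the idea concrete, here is a toy sketch in plain Python. It builds a 0/1 mask from the true lengths of right-padded sequences, then shows one common way a mask can be applied: pushing masked scores to negative infinity before a softmax so that padding positions receive zero weight. The function names and scores are hypothetical, and this is not a claim about how the Transformers library implements masking internally.

```python
import math


def build_attention_mask(lengths):
    """Build a 0/1 mask for right-padded sequences of the given true lengths."""
    max_len = max(lengths)
    return [[1] * n + [0] * (max_len - n) for n in lengths]


def masked_softmax(scores, mask):
    """Softmax over scores in which positions with mask == 0 get zero weight.

    Assumes at least one position is unmasked.
    """
    # Masked positions are pushed to -inf so that exp() maps them to 0
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = build_attention_mask([3, 2])
print(mask)  # [[1, 1, 1], [1, 1, 0]]

# Toy attention scores for one token of the second (padded) sequence
weights = masked_softmax([0.5, 1.5, 2.0], mask[1])
print(weights)  # the last (padding) position receives weight 0.0
```

The remaining weights still sum to 1, so the padded position simply drops out of the weighted average.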

___

**Exercise**: Apply the tokenization manually on two sentences, batch them together using the padding token, then create the proper attention mask. Make sure to obtain the same results when going through the model!

In [5]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Sentences
sequence1 = "I’ve been waiting for a HuggingFace course my whole life."
sequence2 = "I hate this so much!"

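# Tokenizer
# Note: this checkpoint differs from the model's checkpoint
# ("distilbert-base-uncased-finetuned-sst-2-english"), so these IDs come
# from a different vocabulary; for meaningful predictions, load the
# tokenizer from the same checkpoint as the model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")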

tokens = tokenizer.tokenize(sequence1)
sequence1_ids = tokenizer.convert_tokens_to_ids(tokens)

tokens = tokenizer.tokenize(sequence2)
sequence2_ids = tokenizer.convert_tokens_to_ids(tokens)

batched_ids = [sequence1_ids, sequence2_ids]

print(batched_ids)

[[146, 787, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119], [146, 4819, 1142, 1177, 1277, 106]]


In [6]:
# Padding: extend the shorter sequence with pad tokens to match the longer one
max_len = max(len(sequence1_ids), len(sequence2_ids))
batched_ids = [
    sequence1_ids + [tokenizer.pad_token_id] * (max_len - len(sequence1_ids)),
    sequence2_ids + [tokenizer.pad_token_id] * (max_len - len(sequence2_ids)),
]

# Attention mask: 1 for real tokens, 0 for padding tokens
attention_mask = [
    [1] * len(sequence1_ids) + [0] * (max_len - len(sequence1_ids)),
    [1] * len(sequence2_ids) + [0] * (max_len - len(sequence2_ids)),
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.6594, -1.4647],
        [ 1.4024, -1.2253]], grad_fn=<AddmmBackward0>)


In [7]:
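# Model: run each sequence on its own (no padding) to confirm
# that the logits match the batched, masked run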
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

print(model(torch.tensor([sequence1_ids])).logits)
print(model(torch.tensor([sequence2_ids])).logits)

tensor([[ 1.6594, -1.4647]], grad_fn=<AddmmBackward0>)
tensor([[ 1.4024, -1.2253]], grad_fn=<AddmmBackward0>)


_Note: With Transformer models, there is a limit to the length of the sequences we can pass to the model. Most models handle sequences of up to 512 or 1024 tokens and will crash when asked to process longer sequences. There are two solutions to this problem: 1.) Use a model with a longer supported sequence length (e.g., Longformer or LED). 2.) Truncate your sequences, for example by slicing the IDs (`sequence = sequence[:max_sequence_length]`) or by passing `truncation=True` and `max_length` to the tokenizer._
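Truncation by slicing can be sketched in a few lines. The helper name `truncate`, the 600-token stand-in sequence, and the limit of 512 are assumptions for illustration; the actual limit depends on the model you use.

```python
def truncate(ids, max_sequence_length):
    """Keep only the first max_sequence_length token IDs."""
    return ids[:max_sequence_length]


long_ids = list(range(600))  # stand-in for a 600-token sequence
short_ids = truncate(long_ids, 512)
print(len(short_ids))  # 512
```

Sequences that are already short enough pass through unchanged, so the helper can be applied to every sequence in a batch before padding.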