<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-case-studies/blob/master/huggingface-transformers-practice/transformers_model_inputs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Transformers Model inputs

Every model is different, yet most share the same kinds of inputs. Those common inputs are detailed here alongside usage examples.

## Setup

In [None]:
!pip install transformers

In [2]:
from transformers import BertTokenizer

## Input IDs

The input IDs are often the only required parameter to be passed to the model as input. They are token indices: numerical representations of the tokens that make up the sequences the model takes as input.

Each tokenizer works differently but the underlying mechanism remains the same. Here’s an example using the BERT tokenizer, which is a WordPiece tokenizer:

In [3]:
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence = "A Titan RTX has 24GB of VRAM"

The tokenizer takes care of splitting the sequence into tokens available in the tokenizer vocabulary.

In [4]:
tokenized_sequence = tokenizer.tokenize(sequence)
print(tokenized_sequence)

['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']


The tokens are either words or subwords. Here, for instance, “VRAM” wasn’t in the model vocabulary, so it has been split into “V”, “RA” and “M”. To indicate that those tokens are not separate words but parts of the same word, a double-hash prefix is added to “RA” and “M”, giving “##RA” and “##M”.
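To make the splitting behaviour concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the strategy WordPiece tokenizers are generally described as using at inference time. It is not the library's implementation: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, the vocabulary contains only the pieces seen in the output above, and splitting on whitespace stands in for the real pre-tokenization step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the double-hash prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumption): only the pieces from the output shown above.
toy_vocab = {"A", "Titan", "R", "##T", "##X", "has", "24", "##GB",
             "of", "V", "##RA", "##M"}

sequence = "A Titan RTX has 24GB of VRAM"
# Whitespace splitting is a simplification of the tokenizer's pre-tokenization.
tokens = [piece
          for word in sequence.split()
          for piece in wordpiece_tokenize(word, toy_vocab)]
print(tokens)
```

With this toy vocabulary the sketch reproduces the token list printed above; a word with no matching pieces, such as "GPU", falls back to the unknown token.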

These tokens can then be converted into IDs that the model can understand. This can be done by feeding the sentence directly to the tokenizer.

In [5]:
inputs = tokenizer(sequence)

The tokenizer returns a dictionary with all the arguments necessary for its corresponding model to work properly. The token indices are under the key “input_ids”:

In [6]:
encoded_sequence = inputs["input_ids"]
print(encoded_sequence)

[101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
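Lining the token list up against the ID list is a quick way to see how the two relate. The sketch below uses only the values printed in this section and needs no model download. It assumes the two extra IDs sit at the start and end of the list, which the decoded output later in this section supports; `token_to_id` is a name introduced here for illustration.

```python
# Values copied from the outputs printed earlier in this section.
tokens = ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB',
          'of', 'V', '##RA', '##M']
input_ids = [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745,
             1104, 159, 9664, 2107, 102]

# The ID list is longer than the token list.
extra = len(input_ids) - len(tokens)
print("extra ids:", extra)

# Assumption: the extra IDs are the first and last entries.
inner_ids = input_ids[1:-1]
token_to_id = dict(zip(tokens, inner_ids))
for token, token_id in token_to_id.items():
    print(f"{token:>6} -> {token_id}")
```

The two leftover IDs, 101 and 102, do not correspond to any token produced by `tokenize`.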


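The tokenizer automatically adds special tokens (when the associated model relies on them). Decoding the previous sequence of IDs makes them visible: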

In [7]:
decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)

[CLS] A Titan RTX has 24GB of VRAM [SEP]


The tokenizer added the “[CLS]” and “[SEP]” tokens because this is the format a BertModel expects for its inputs.
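The decoding step can be pictured as a reverse lookup followed by gluing the “##” pieces back onto the preceding token. The following is a rough sketch under that assumption, not the library's `decode` method: `toy_decode` and `id_to_token` are hypothetical, and the mapping contains only the ID and token pairs that can be read off the outputs in this section.

```python
# ID-to-token pairs read off the outputs in this section (toy mapping).
id_to_token = {
    101: "[CLS]", 102: "[SEP]",
    138: "A", 18696: "Titan", 155: "R", 1942: "##T", 3190: "##X",
    1144: "has", 1572: "24", 13745: "##GB", 1104: "of",
    159: "V", 9664: "##RA", 2107: "##M",
}


def toy_decode(ids, id_to_token):
    """Map IDs back to tokens and merge '##' continuation pieces."""
    words = []
    for token_id in ids:
        token = id_to_token[token_id]
        if token.startswith("##") and words:
            # Continuation piece: append to the previous word, without the prefix.
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


encoded_sequence = [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745,
                    1104, 159, 9664, 2107, 102]
print(toy_decode(encoded_sequence, id_to_token))
```

On this input the sketch yields the same string as the decoded output above. The real method also handles details this sketch ignores, such as spacing around punctuation.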

## Attention mask