# End to End Transformers

A tokenizer from the Transformers API handles tokenization, conversion to input IDs, padding, truncation, and attention masks in a single call.

In [1]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

## Multiple Sequences

In [2]:
# This can also handle multiple sequences with no code change

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

## Padding

In [3]:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

In [4]:
# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

In [5]:
# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
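To make the padding options concrete, here is a minimal pure-Python sketch of right-padding. It is not the library's implementation: `pad_batch` and `target_length` are hypothetical names, the pad ID of 0 is assumed from the printed tensor later in this notebook, and the token IDs are copied from that same output.

```python
def pad_batch(batch, pad_id=0, target_length=None):
    """Right-pad every list of token IDs in `batch` with `pad_id`.

    With target_length=None the longest sequence sets the length,
    which mirrors padding="longest". Sequences that are already longer
    than target_length are left unchanged, because padding never truncates.
    """
    if target_length is None:
        target_length = max(len(ids) for ids in batch)
    return [ids + [pad_id] * max(0, target_length - len(ids)) for ids in batch]


# Token IDs for the two example sentences, including the special tokens.
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 1012, 102],
    [101, 2061, 2031, 1045, 999, 102],
]

longest = pad_batch(batch)                 # like padding="longest"
fixed = pad_batch(batch, target_length=8)  # like padding="max_length", max_length=8

print([len(ids) for ids in longest])  # [16, 16]
print([len(ids) for ids in fixed])    # [16, 8]
```

Note that with a target length of 8 the first sequence keeps all 16 IDs: padding only extends short sequences, and shortening long ones is the job of truncation.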

## Truncating

In [6]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

In [7]:
# Will truncate sequences that are longer than the model max length (512 for BERT)
model_inputs = tokenizer(sequences, truncation=True)

In [8]:
# Will truncate sequences that are longer than the maximum length specified
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
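The following pure-Python sketch illustrates what truncating to `max_length=8` means for a single sequence, under the assumption that the length budget includes the two special tokens and that tokens are dropped from the right. `truncate_with_specials` is a hypothetical helper, not a library function, and the IDs 101 and 102 are the start and end IDs visible in this notebook's printed outputs.

```python
def truncate_with_specials(content_ids, max_length, cls_id=101, sep_id=102):
    """Keep at most `max_length` IDs in total, counting the two special tokens.

    Content tokens are cut from the right, then the start and end
    tokens are wrapped around whatever is left.
    """
    budget = max(0, max_length - 2)  # room left for content tokens
    return [cls_id] + content_ids[:budget] + [sep_id]


# Content IDs of the first example sentence, without special tokens.
content_ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
               2607, 2026, 2878, 2166, 1012]

truncated = truncate_with_specials(content_ids, max_length=8)
print(truncated)  # [101, 1045, 1005, 2310, 2042, 3403, 2005, 102]

# A sequence that already fits is returned in full.
short = truncate_with_specials([2061, 2031, 1045, 999], max_length=8)
print(short)  # [101, 2061, 2031, 1045, 999, 102]
```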

## Framework-Specific Tensors

In [12]:
# Return PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Return TensorFlow tensors
# model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
# Return NumPy arrays
# model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

In [15]:
print(model_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  2061,  2031,  1045,   999,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0]])
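The zeros in the second row are padding, and the attention mask tells the model to ignore them. As a rough sketch, the mask for this batch can be derived from the padded IDs, assuming the pad ID is 0 and never appears as a real token. The helper name `attention_mask_from_ids` is hypothetical, and the tokenizer itself returns the mask directly under `model_inputs["attention_mask"]`.

```python
def attention_mask_from_ids(padded_batch, pad_id=0):
    """Return 1 for real tokens and 0 for padding positions."""
    return [[0 if tok == pad_id else 1 for tok in ids] for ids in padded_batch]


# The padded input IDs printed above, as plain Python lists.
input_ids = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 1012, 102],
    [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

mask = attention_mask_from_ids(input_ids)
print(mask[0])  # sixteen 1s: every position is a real token
print(mask[1])  # six 1s followed by ten 0s
```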


## Special Tokens
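The tokenizer adds special tokens to each sequence. Compare the IDs returned by the tokenizer call with the IDs produced by tokenizing and converting manually: the first list has one extra ID at the start and one at the end.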

In [16]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


In [18]:
# Decode the IDs back to text to see which tokens were added
print(tokenizer.decode(model_inputs["input_ids"]))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]


The tokenizer adds these special tokens because the model was pretrained with them, so it expects them at inference time as well.
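The relationship between the two ID lists printed above can be checked without the library. This sketch assumes, based on the printed and decoded output, that 101 is the ID of `[CLS]` and 102 the ID of `[SEP]` for this checkpoint. The variable names are illustrative only.

```python
# IDs from tokenizer.tokenize + convert_tokens_to_ids (no special tokens).
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
       2607, 2026, 2878, 2166, 1012]

# IDs from calling the tokenizer directly (special tokens included).
full_ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
            2607, 2026, 2878, 2166, 1012, 102]

CLS_ID, SEP_ID = 101, 102  # assumed from the printed and decoded output

# Wrapping the plain IDs in [CLS] ... [SEP] reproduces the tokenizer output.
with_specials = [CLS_ID] + ids + [SEP_ID]
print(with_specials == full_ids)  # True
```

Other checkpoints may use different special tokens or none at all, so hard-coding these IDs is only reasonable for illustration.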