
Exploring the high-level tokenizer API of HuggingFace Transformers.

In [None]:
!pip install datasets transformers[sentencepiece]

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
                                          
sequence = "I am ready to SMASH this HuggingFace course."

model_inputs = tokenizer(sequence)

`model_inputs`, defined above, contains everything the model needs to run. For the checkpoint we chose (DistilBERT), that means the input IDs and the attention mask. For models that accept additional inputs, the `tokenizer` object outputs those as well.

This method is very powerful. For example, it can handle multiple sequences at a time, with no change in the API:

In [None]:
sequences = ["I am ready to SMASH this HuggingFace course",
             "I've been waiting for a HuggingFace course my whole life"]

model_inputs = tokenizer(sequences)

It can also pad according to different objectives:

In [None]:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
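
To make the padding strategies concrete, here is a minimal pure-Python sketch of what padding does conceptually. It is not the library's implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and a pad ID of 0 is assumed (it matches the zeros in the padded DistilBERT output at the end of this notebook). Unlike the real tokenizer, this sketch requires an explicit `max_length` instead of falling back to the model max length.

```python
def pad_batch(batch, padding="longest", max_length=None, pad_id=0):
    """Pad lists of token IDs to a common length (conceptual sketch only)."""
    if padding == "longest":
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("this sketch needs max_length for padding='max_length'")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")

    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max(target - len(ids), 0)  # padding never shortens a sequence
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 102]]  # made-up token IDs

longest = pad_batch(batch, padding="longest")
fixed = pad_batch(batch, padding="max_length", max_length=8)

print(longest["input_ids"])     # [[101, 7, 8, 102, 0, 0], [101, 7, 8, 9, 10, 102]]
print(fixed["attention_mask"])  # [[1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 0, 0]]
```

Padding on its own only lengthens sequences. A sequence that is already longer than the target keeps its length unless truncation is also requested.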

It can also truncate sequences:

In [None]:
sequences = ["""And then I said, "Hello!".""",
             "I'm loving this HuggingFace course. I love it so, so much.",
             """"Meow", says the cat.""",
             "Let's keep going!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
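
As a rough illustration of what `max_length=8` with `truncation=True` means for a single sequence, the sketch below trims the content tokens so that the result, including the special tokens at both ends, fits in the limit. This is an assumption-laden sketch rather than the library's code: `truncate_single_sequence` is a hypothetical helper, the content IDs are made up, and 101 and 102 are used as the [CLS] and [SEP] IDs because those are the values that appear in this notebook's outputs.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] IDs as seen in this notebook's outputs


def truncate_single_sequence(content_ids, max_length):
    """Keep [CLS] + content + [SEP] within max_length by trimming content from the right."""
    budget = max_length - 2  # reserve two slots for the special tokens
    if budget < 0:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    return [CLS_ID] + content_ids[:budget] + [SEP_ID]


content_ids = list(range(2000, 2012))  # 12 made-up content token IDs

truncated = truncate_single_sequence(content_ids, max_length=8)
print(truncated)       # [101, 2000, 2001, 2002, 2003, 2004, 2005, 102]
print(len(truncated))  # 8

# A sequence that already fits is left untouched
short = truncate_single_sequence([2000, 2001], max_length=8)
print(short)           # [101, 2000, 2001, 102]
```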

The `tokenizer` object can also convert its output to framework-specific tensors, which can then be sent directly to the model. In the following code sample, the `return_tensors` argument selects the framework: `"pt"` returns PyTorch tensors, `"tf"` returns TensorFlow tensors, and `"np"` returns NumPy arrays:

In [None]:
sequences = ["This is a sentence.", "This is another sentence!"]

# Get PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Get TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Get NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

## Special Tokens

Some tokenizers (depending on the checkpoint) add the special token [CLS] at the beginning and the special token [SEP] at the end. The model was pretrained with these tokens, so we need to add them at inference time to get the same results. Let's take a look at an example:

In [None]:
sequence = "OK now, check this out. Transformers are really cool. Using a Transformer network is simple."

model_inputs = tokenizer(sequence)
print(model_inputs['input_ids'])

[101, 7929, 2085, 1010, 4638, 2023, 2041, 1012, 19081, 2024, 2428, 4658, 1012, 2478, 1037, 10938, 2121, 2897, 2003, 3722, 1012, 102]


The input IDs returned by the tokenizer are slightly different from what we get if we do the conversion by hand, as we did previously:

In [None]:
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[7929, 2085, 1010, 4638, 2023, 2041, 1012, 19081, 2024, 2428, 4658, 1012, 2478, 1037, 10938, 2121, 2897, 2003, 3722, 1012]


One token ID was added at the beginning, and one at the end. Let’s decode the two sequences of IDs above:

In [None]:
print(tokenizer.decode(model_inputs['input_ids']))
print(tokenizer.decode(ids))

[CLS] ok now, check this out. transformers are really cool. using a transformer network is simple. [SEP]
ok now, check this out. transformers are really cool. using a transformer network is simple.


*Note: Some models don't add special tokens, or add different ones; some add them only at the beginning, or only at the end. In any case, the tokenizer knows which ones are expected and handles this for you.*
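
The difference between the two ID lists printed above can be checked directly. The sketch below copies both lists from the outputs in this section and rebuilds the tokenizer's result by hand. The constant names are illustrative, and the IDs 101 and 102 are read off the printed output for this checkpoint; with a live tokenizer they are available as `tokenizer.cls_token_id` and `tokenizer.sep_token_id`, and other checkpoints may use different values.

```python
CLS_ID, SEP_ID = 101, 102  # read off the printed output for this checkpoint

# Output of tokenizer.tokenize + tokenizer.convert_tokens_to_ids (no special tokens)
ids = [7929, 2085, 1010, 4638, 2023, 2041, 1012, 19081, 2024, 2428,
       4658, 1012, 2478, 1037, 10938, 2121, 2897, 2003, 3722, 1012]

# Output of tokenizer(sequence)['input_ids'] (special tokens included)
tokenizer_ids = [101, 7929, 2085, 1010, 4638, 2023, 2041, 1012, 19081, 2024, 2428,
                 4658, 1012, 2478, 1037, 10938, 2121, 2897, 2003, 3722, 1012, 102]

# Wrapping the hand-made IDs in [CLS] ... [SEP] reproduces the tokenizer's output
by_hand = [CLS_ID] + ids + [SEP_ID]
print(by_hand == tokenizer_ids)        # True
print(len(tokenizer_ids) - len(ids))   # 2
```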

## Wrapping Up: From Tokenizer to Model

Let's see one final time how the `tokenizer` object can handle multiple sequences (padding!), very long sequences (truncation!), and multiple types of tensors with its main API:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

sequences = ["I am loving this HuggingFace course.",
             "I am looking forward to using transformers in my data science projects."]

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
output = model(**tokens)  # unpack the dict into keyword arguments (input_ids, attention_mask)

In [None]:
tokens

{'input_ids': tensor([[  101,  1045,  2572,  8295,  2023, 17662, 12172,  2607,  1012,   102,
             0,     0,     0,     0,     0],
        [  101,  1045,  2572,  2559,  2830,  2000,  2478, 19081,  1999,  2026,
          2951,  2671,  3934,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
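
The `model(**tokens)` call works because the tokenizer output behaves like a dictionary whose keys match the model's keyword arguments. The sketch below shows that unpacking with plain Python lists copied from the output above. `stand_in_model` is a hypothetical placeholder, not the real model: it only counts the non-padding tokens in each row using the attention mask.

```python
def stand_in_model(input_ids, attention_mask=None):
    """Hypothetical stand-in for a model call: counts real (non-padding) tokens per row."""
    if attention_mask is None:
        # Without a mask, every position is treated as a real token
        attention_mask = [[1] * len(row) for row in input_ids]
    return [sum(row) for row in attention_mask]


# Same values as the tensors printed above, written as plain lists
tokens = {
    "input_ids": [
        [101, 1045, 2572, 8295, 2023, 17662, 12172, 2607, 1012, 102, 0, 0, 0, 0, 0],
        [101, 1045, 2572, 2559, 2830, 2000, 2478, 19081, 1999, 2026, 2951, 2671, 3934, 1012, 102],
    ],
    "attention_mask": [
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    ],
}

# ** turns each key into a keyword argument, so these two calls are equivalent
unpacked = stand_in_model(**tokens)
explicit = stand_in_model(input_ids=tokens["input_ids"],
                          attention_mask=tokens["attention_mask"])

print(unpacked)              # [10, 15]
print(unpacked == explicit)  # True
```

The first row has 10 real tokens followed by 5 padding positions, which is exactly what the zeros in its attention mask record.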