# **Hugging Face NLP Course - Chapter 2**

Source: https://huggingface.co/learn/nlp-course/chapter2/1

---

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## **2. Using HF Transformers**

### **2.1. Behind the pipeline**

The pipeline groups together three steps: preprocessing, passing the inputs through the model, and postprocessing:

![full_nlp_pipeline.svg](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)

#### **Preprocessing with a tokenizer**

The tokenizer converts the text inputs into numbers that the model can make sense of. It is responsible for:
- splitting the input into words, subwords, or symbols (like punctuation) that are called _tokens_
- mapping each token to an integer (those integers are vocabulary indices, which are typically called _input IDs_)
- adding additional inputs that may be useful to the model

All this preprocessing needs to be done in exactly the same way as when the model was pretrained. To download the matching tokenizer configuration and vocabulary from the [Model Hub](https://huggingface.co/models), pass the checkpoint name of the model to the `from_pretrained()` method of the `AutoTokenizer` class.

Once we have the tokenizer, we can directly pass our sentences to it and we'll get back a dictionary that's ready to feed to our model. The only thing left to do is to convert the list of input IDs to tensors.

Transformer models only accept _tensors_ as input. You can think of tensors as multidimensional arrays, much like NumPy arrays. To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the `return_tensors` argument. If no type is passed, the tokenizer returns plain Python lists (a list of lists for a batch of sentences).

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # Downloads tokenizer
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(
    raw_inputs,  # Can pass one sentence or a list of sentences
    padding=True,
    truncation=True,
    return_tensors="pt",  # Specifies tensor type (pt=PyTorch, tf=TensorFlow)
)
print(inputs)

#### **Going through the model**

We can download our pretrained model the same way we did with our tokenizer, with the `AutoModel` class.

The class instantiates the architecture that contains only the base Transformer module: given some inputs, it outputs what we'll call _hidden states_, also known as _features_. For each model input, we'll retrieve a high-dimensional vector representing the __contextual understanding of that input by the Transformer model__.

The architecture consists of an embeddings layer and subsequent layers. The embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token. The subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentences.

In [None]:
from transformers import AutoModel
# TensorFlow: from transformers import TFAutoModel

model = AutoModel.from_pretrained(checkpoint)  # Instantiates model
outputs = model(**inputs)
# TensorFlow: outputs = model(inputs)
print(outputs.last_hidden_state.shape)  # Prints batch size, sequence length, hidden size (vector dimension)
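
To make the embedding step and the resulting shape more concrete, here is a minimal pure-Python sketch of an embedding lookup. The vocabulary size, hidden size, and random values are made up for illustration; a real model learns these vectors during training and uses far larger sizes.

```python
import random

# Hypothetical toy sizes; real checkpoints use far larger values.
VOCAB_SIZE = 10
HIDDEN_SIZE = 4

random.seed(0)
# One vector per vocabulary entry (random here, learned in a real model).
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]
    for _ in range(VOCAB_SIZE)
]


def embed(batch_input_ids):
    """Replace every input ID with its vector from the embedding table."""
    return [[embedding_table[i] for i in seq] for seq in batch_input_ids]


batch = [[1, 5, 2], [3, 3, 0]]  # batch of 2 sequences, 3 input IDs each
hidden = embed(batch)
shape = (len(hidden), len(hidden[0]), len(hidden[0][0]))
print(shape)  # (batch size, sequence length, hidden size)
```

Note that a plain lookup gives the same vector to both occurrences of ID 3 in the second sequence. It is the subsequent attention layers that make each vector depend on its context.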

While the hidden states can be useful on their own, they're usually inputs to another part of the model, known as the _head_. Different tasks can be performed with the same architecture, but each of these tasks has a different head associated with it. The model heads take the high-dimensional vectors of hidden states as input and project them onto a different dimension.

For our example, we will need a model with a sequence classification head. So, we won't actually use the `AutoModel` class, but `AutoModelForSequenceClassification`. If we then look at the shape of our outputs, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label).

In [None]:
from transformers import AutoModelForSequenceClassification
# TensorFlow: from transformers import TFAutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
# TensorFlow: outputs = model(inputs)
print(outputs.logits.shape)
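
As a rough illustration of what "projecting onto a different dimension" means, the sketch below applies a single linear layer to one hidden vector and produces two scores. All values are hypothetical, and the heads in the library differ by architecture (which hidden vector they use, how many layers they contain), so treat this as the general idea only.

```python
def linear_head(hidden_vector, weights, bias):
    """Project one hidden vector onto len(bias) output scores (logits)."""
    return [
        sum(w * h for w, h in zip(row, hidden_vector)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical values: hidden size 4, two labels.
hidden_vector = [0.5, -1.0, 2.0, 0.0]
weights = [
    [1.0, 0.0, 0.5, 0.0],   # row producing the score for label 0
    [0.0, 1.0, -0.5, 1.0],  # row producing the score for label 1
]
bias = [0.1, -0.1]

logits = linear_head(hidden_vector, weights, bias)
print(len(hidden_vector), "->", len(logits))  # input and output dimensions
print(logits)
```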

#### **Postprocessing the output**

The model's outputs aren't probabilities but _logits_, the raw, unnormalized scores output by the last layer of the model. To be converted to probabilities, they need to go through a softmax layer.

To get the labels corresponding to each position, we can inspect the `id2label` attribute of the model config.

In [None]:
import torch
# TensorFlow: import tensorflow as tf

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
# TensorFlow: predictions = tf.math.softmax(outputs.logits, axis=-1)
print(predictions)
print(model.config.id2label)  # Print labels and their positions
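
For readers who want to see what the softmax step computes, here is a dependency-free sketch. The logits and the label mapping are hypothetical stand-ins, not actual model output.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max improves numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits and label mapping for one sentence.
logits = [-1.5, 1.5]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability
print(probs, id2label[best])
```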

### **2.2. Models**

The `AutoModel` class is handy when you want to instantiate any model from a checkpoint. This class and all of its relatives are simple wrappers over the wide variety of models available in the library. They are clever wrappers: they automatically guess the appropriate model architecture for your checkpoint and then instantiate a model with that architecture.
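
The dispatch idea behind such a wrapper can be sketched in a few lines. This is not the library's code: the classes, the registry, and the function name below are hypothetical, and the sketch only assumes that the configuration carries a model type that identifies the architecture.

```python
class ToyBertModel:
    def __init__(self, config):
        self.config = config


class ToyGPT2Model:
    def __init__(self, config):
        self.config = config


# Hypothetical registry mapping a model type to the class implementing it.
MODEL_REGISTRY = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}


def toy_auto_model(config):
    """Pick the model class from the config's model type and instantiate it."""
    model_type = config["model_type"]
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model type: {model_type!r}")
    return MODEL_REGISTRY[model_type](config)


model = toy_auto_model({"model_type": "bert", "hidden_size": 768})
print(type(model).__name__)
```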

However, if you know the type of model you want to use, you can use the class that defines its architecture directly.

In [None]:
# Example: BERT model
from transformers import BertConfig, BertModel
# TensorFlow: from transformers import BertConfig, TFBertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)  # Model is randomly initialized (random values)!

print(config)  # Config contains many attributes that are used to build the model

The model can be used in this state, but it will output gibberish; it needs to be trained first. We could train the model from scratch on the task at hand, or reuse models that have already been trained.

In [None]:
from transformers import BertModel  # Can be replaced with equivalent `AutoModel` class
# TensorFlow: from transformers import TFBertModel, TFAutoModel

model = BertModel.from_pretrained("bert-base-cased")  # Load already trained Transformer model

This model is now initialized with all the weights of the checkpoint. It can be used directly for inference on the tasks it was trained on, and it can also be fine-tuned on a new task. By training with pretrained weights rather than from scratch, we can quickly achieve good results.

#### **Saving methods**

A model can be saved using the `save_pretrained()` method, the counterpart of `from_pretrained()`. It saves two files to your disk: `config.json` and a weights file, `pytorch_model.bin` for PyTorch or `tf_model.h5` for TensorFlow (recent PyTorch releases of the library write `model.safetensors` by default instead). The `config.json` file contains the attributes necessary to build the model architecture. It also contains some metadata, such as where the checkpoint originated and what HF Transformers version you were using when you last saved the checkpoint. The weights file is known as the _state dictionary_; it contains all your model's weights. The two files go hand in hand: the configuration is necessary to know your model's architecture, while the model weights are your model's parameters.

In [None]:
model.save_pretrained("directory_on_my_computer")
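
The split between configuration and weights can be imitated with the standard library alone. In this sketch the file name `toy_weights.json`, the helper functions, and the tiny "model" are all hypothetical; real checkpoints store weights in a binary format, and JSON is used here only to stay dependency-free.

```python
import json
import os
import tempfile


def save_toy_model(directory, config, weights):
    """Write the configuration and the weights to two separate files."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "config.json"), "w") as f:
        json.dump(config, f)
    with open(os.path.join(directory, "toy_weights.json"), "w") as f:
        json.dump(weights, f)


def load_toy_model(directory):
    """Read both files back; neither one is enough on its own."""
    with open(os.path.join(directory, "config.json")) as f:
        config = json.load(f)
    with open(os.path.join(directory, "toy_weights.json")) as f:
        weights = json.load(f)
    return config, weights


config = {"model_type": "toy", "hidden_size": 2}
weights = {"layer.weight": [[0.1, 0.2], [0.3, 0.4]], "layer.bias": [0.0, 0.0]}

with tempfile.TemporaryDirectory() as tmp:
    save_toy_model(tmp, config, weights)
    saved_files = sorted(os.listdir(tmp))
    loaded_config, loaded_weights = load_toy_model(tmp)

print(saved_files)
```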

### **2.3. Tokenizers**

Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. In NLP tasks, the data that is generally processed is raw text. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data. There are a lot of ways to go about this. The goal is to find the most meaningful representation - that is, the one that makes the most sense to the model - and, if possible, the smallest representation.

Common examples of tokenization algorithms:
- _word-based_
  - splits text on spaces (variation: spaces and punctuation)
  - cons:
    - very large vocabularies (results in heavy models)
    - large quantity of out-of-vocabulary tokens
    - loss of the relationship between very similar words (e.g. "dog" and "dogs", or "run" and "running"): because they receive different IDs, the model initially treats them as unrelated and has to learn their similarity from data
- _character-based_
  - splits text on characters
  - pros:
    - small vocabularies (always slimmer than their word-based counterparts; they include letters, digits, and special characters)
    - very few out-of-vocabulary (unknown) tokens (every word can be built from characters, so even misspelled words can be tokenized rather than discarded as unknown)
  - cons:
    - very long sequences (each text is translated into a very large number of tokens, which reduces the amount of text that fits in the model's context)
    - less meaningful individual tokens (a single character carries less information than a word; this holds for languages written in alphabets such as the Latin one, but less so for languages such as Chinese, where a single character carries a lot of information)
- _subword-based_
  - combines word-based and character-based
  - principle: frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords
  - pros:
    - tokens have a semantic meaning while being space-efficient
    - allows us to have a relatively good coverage with small vocabularies, and close to no unknown tokens
    - especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords
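
The trade-off between the first two approaches can be seen on a single sentence. The two functions below are simplified sketches written for this comparison, not the tokenizers shipped with the library.

```python
import re


def word_tokenize(text):
    """Word-based: split on whitespace and keep punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokenize(text):
    """Character-based: every character (spaces included) is a token."""
    return list(text)


text = "Let's do tokenization!"
word_tokens = word_tokenize(text)
char_tokens = char_tokenize(text)

print(word_tokens)
print(len(word_tokens), "word-level tokens vs", len(char_tokens), "character-level tokens")
```

The word-level split yields few tokens but needs one vocabulary entry per distinct word, while the character-level split needs only a small vocabulary but produces a much longer sequence.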

Examples of subword tokenization algorithms:
- byte-level BPE (used in GPT-2)
- WordPiece (used in BERT)
- SentencePiece or Unigram (used in several multilingual models)
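
To give a feel for how a subword vocabulary is applied to a word, the sketch below performs a greedy longest-match-first split and marks non-initial pieces with `##`, following the WordPiece convention. The vocabulary is a hypothetical toy, and learning such a vocabulary from a corpus is a separate step that is not shown.

```python
def subword_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subwords."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary.
vocab = {"token", "##ization", "dog", "##s", "run", "##ning"}

print(subword_tokenize("tokenization", vocab))
print(subword_tokenize("dogs", vocab))
print(subword_tokenize("cat", vocab))
```

With this vocabulary, "dog" and "dogs" share the piece `dog`, which is how subword tokenization keeps related words related.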

#### **Loading and saving**

Loading and saving tokenizers is as simple as it is with models. Actually, it's based on the same two methods: `from_pretrained()` and `save_pretrained()`. These methods will load or save the algorithm used by the tokenizer (a bit like the _architecture_ of the model) as well as its vocabulary (a bit like the _weights_ of the model).

In [None]:
from transformers import BertTokenizer  # Can be replaced with equivalent `AutoTokenizer` class

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")  # Load tokenizer

# Do stuff (like fine-tuning, using the tokenizer, ...)

tokenizer.save_pretrained("directory_on_my_computer")  # Save tokenizer

#### **Encoding**

Translating text to numbers is known as _encoding_. Encoding is done in a two-step process: the tokenization (splitting text into _tokens_, adding potential special tokens), followed by the conversion to input IDs (converting each token to its unique ID as defined by the tokenizer's _vocabulary_).

To get a better understanding of the two steps, we'll explore them separately, using appropriate methods. Note that in practice, you should call the tokenizer directly on your inputs (instead of using those methods).

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)  # Does tokenization process
print(tokens)  # Output is list of strings/tokens

input_ids = tokenizer.convert_tokens_to_ids(tokens)  # Converts tokens to input IDs
print(input_ids)  # List of input IDs, not yet converted to appropriate framework tensor

# Perform missing step: adding special tokens needed by the model
final_inputs = tokenizer.prepare_for_model(input_ids)  # Adds special tokens
print(final_inputs["input_ids"])  # List of input IDs, with IDs of special tokens
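
The same two steps, plus the special tokens, can be mimicked with a toy vocabulary to see what each one contributes. The vocabulary, the IDs, and the three helper functions are hypothetical; they only echo the names of the tokenizer methods used above.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "using": 4, "a": 5, "transformer": 6, "network": 7, "is": 8, "simple": 9,
}


def tokenize(text):
    """Step 1: split the text into tokens (here: lowercase, split on whitespace)."""
    return text.lower().split()


def convert_tokens_to_ids(tokens):
    """Step 2: look up every token, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


def add_special_tokens(ids):
    """Wrap the sequence in the start and end markers this toy model expects."""
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]


tokens = tokenize("Using a Transformer network is simple")
ids = convert_tokens_to_ids(tokens)
final_ids = add_special_tokens(ids)

print(tokens)
print(ids)
print(final_ids)
```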

#### **Decoding**

_Decoding_ is going the other way around: from vocabulary indices, we want to get a string. This can be done with the `decode()` method. Note that the `decode` method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence. This behavior will be extremely useful when we use models that predict new text (e.g. text generation, translation, summarization).

In [None]:
decoded_string = tokenizer.decode(input_ids)  # Decodes input IDs
print(decoded_string)  # Should be original input text
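
The grouping of subword pieces during decoding can be sketched as follows. The ID-to-token table is a hypothetical toy, and the `##` prefix for word-internal pieces is assumed, as in WordPiece vocabularies.

```python
# Hypothetical toy mapping from IDs back to tokens.
id_to_token = {0: "[CLS]", 1: "[SEP]", 2: "token", 3: "##ization", 4: "is", 5: "fun"}
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def decode(ids, skip_special_tokens=False):
    """Map IDs to tokens, then glue '##' pieces onto the preceding token."""
    words = []
    for token_id in ids:
        token = id_to_token[token_id]
        if skip_special_tokens and token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation of the previous word
        else:
            words.append(token)
    return " ".join(words)


decoded = decode([0, 2, 3, 4, 5, 1], skip_special_tokens=True)
decoded_with_special = decode([0, 2, 3, 4, 5, 1])
print(decoded)
print(decoded_with_special)
```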

### **2.4. Handling multiple sequences**

#### **Models expect a batch of inputs**

HF Transformers models expect multiple sentences by default. _Batching_ is the act of sending multiple sentences through the model, all at once. Batching allows the model to work when you feed it multiple sentences.

In [None]:
import torch
# TensorFlow: import tensorflow as tf
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# TensorFlow: from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])  # Add dimension to convert it to a batch
# TensorFlow: input_ids = tf.constant([ids])
print("Input IDs:", input_ids)

output = model(input_ids)  # Only works for batches, not for single sequences
print("Logits:", output.logits)

But there's an issue. When you're trying to batch together two (or more) sentences, they might be of different lengths. Tensors need to be of rectangular shape, so you won't be able to convert the list of input IDs into a tensor directly. To work around this problem, we usually _pad_ the inputs.

#### **Padding the inputs**

As a workaround, we'll use _padding_ to make our tensors have a rectangular shape. Padding makes sure all our sentences have the same length by adding a special token called the _padding token_ to the shorter sentences. For example, if you have 9 sentences with 10 tokens and 1 sentence with 20 tokens, padding will ensure all the sentences have 20 tokens. The padding token ID can be found in `tokenizer.pad_token_id`.

In [None]:
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)
# TensorFlow: print(model(tf.constant(...)).logits)
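
Padding by hand, as done for `batched_ids` above, generalizes to any number of sequences. Here is a minimal sketch of a helper that pads on the right up to the longest sequence; the function name and the pad ID of 0 are assumptions made for the example.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence with pad_id up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


PAD_ID = 0  # hypothetical value; read the real one from tokenizer.pad_token_id
batch = [[200, 200, 200], [200, 200], [200]]
padded = pad_batch(batch, PAD_ID)

print(padded)
print({len(seq) for seq in padded})  # a single length means the batch is rectangular
```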

You will notice that some logits in our batched predictions differ from the logits of the individual predictions. This is because the key feature of Transformer models is attention layers that _contextualize_ each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different length through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.

#### **Attention masks**

_Attention masks_ are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model). We can use attention masks to get the same result for the same sequence, whether it is processed in a batch or individually.

In [None]:
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask)).logits)
# TensorFlow: print(model(tf.constant(batched_ids), attention_mask=tf.constant(attention_mask)).logits)
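
Why a mask restores the unpadded result can be seen on the attention weights themselves. The sketch below builds a mask from padded IDs and applies it inside a softmax. The scores are hypothetical, deriving the mask from the pad ID assumes that ID never occurs as a real token, and real implementations usually add a large negative value to masked scores rather than zeroing the weights directly.

```python
import math


def build_attention_mask(padded_ids, pad_id):
    """1 for real tokens, 0 for padding, same shape as the input IDs."""
    return [[0 if token == pad_id else 1 for token in seq] for seq in padded_ids]


def masked_softmax(scores, mask):
    """Softmax over scores in which positions with mask == 0 get zero weight."""
    kept = [s for s, m in zip(scores, mask) if m == 1]
    shift = max(kept)  # numerical stability
    exps = [math.exp(s - shift) if m == 1 else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


PAD_ID = 0  # hypothetical pad ID
padded_ids = [[200, 200, 200], [200, 200, PAD_ID]]
mask = build_attention_mask(padded_ids, PAD_ID)

scores = [2.0, 1.0, 3.0]  # hypothetical raw attention scores for one query token
weights_padded = masked_softmax(scores, mask[1])       # padded sequence, masked
weights_unpadded = masked_softmax(scores[:2], [1, 1])  # same sequence without padding

print(mask)
print(weights_padded)
print(weights_unpadded)
```

The padding position receives zero weight, so the remaining weights match those of the unpadded sequence.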

#### **Longer sequences**

With Transformer models, there is a limit to the lengths of the sequences we can pass to the models. Most models handle sequences of up to 512 or 1024 tokens, and will fail when asked to process longer sequences. There are two solutions to this problem:
- Use a model with a longer supported sequence length.
- Truncate your sequences.

Models have different supported sequence lengths, and some specialize in handling very long sequences. If you're working on a task that requires very long sequences, we recommend you take a look at such models (e.g. [Longformer](https://huggingface.co/docs/transformers/model_doc/longformer), [LED](https://huggingface.co/docs/transformers/model_doc/led)). Otherwise, we recommend you truncate your sequences to a maximum length, for example by slicing the list of input IDs or by passing `truncation=True` together with `max_length` to the tokenizer.
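
Truncating a list of input IDs by hand is a one-line slice, sketched below with made-up IDs and a hypothetical `max_sequence_length` variable.

```python
def truncate(ids, max_sequence_length):
    """Keep only the first max_sequence_length IDs of a sequence."""
    if max_sequence_length < 0:
        raise ValueError("max_sequence_length must be non-negative")
    return ids[:max_sequence_length]


max_sequence_length = 4  # hypothetical limit; real models often allow 512 or more
sequence = [101, 7, 8, 9, 10, 11, 102]  # made-up IDs; 101 and 102 stand for special tokens
truncated = truncate(sequence, max_sequence_length)

print(truncated)
```

Note that the naive slice drops the closing special token (102 in this example). Letting the tokenizer truncate avoids this, since it accounts for special tokens.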

### **2.5. Putting it all together**

In the last few sections, we've been trying our best to do most of the work by hand. We've explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks.

However, the HF Transformers API can handle all of this for us with a high-level function. When you call your `tokenizer` directly on the sentence, you get back inputs that are ready to pass through your model.

Note that some models add special tokens to sequences. The tokenizer knows which ones are expected and will deal with this for you.

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)  # Contains everything that's necessary for the model to operate well

# Also works for multiple sequences
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
model_inputs = tokenizer(sequences)

# Padding

# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")
# Will pad the sequences up to the model max length
model_inputs = tokenizer(sequences, padding="max_length")
# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

# Truncation

# Will truncate the sequences that are longer than the model max length
model_inputs = tokenizer(sequences, truncation=True)
# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

# Conversion to specific framework tensors

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

# All together

model_inputs = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
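
To tie the options together, the toy function below reproduces the shape of the tokenizer's output for sequences that are already encoded. It is a hypothetical sketch rather than the library's implementation: it only borrows the argument names, and unlike the real tokenizer it requires an explicit `max_length` for `padding="max_length"`.

```python
def prepare_batch(batch_ids, pad_id, padding=False, truncation=False, max_length=None):
    """Truncate, pad, and build attention masks for a batch of ID lists."""
    if truncation and max_length is not None:
        batch_ids = [ids[:max_length] for ids in batch_ids]

    if padding is True or padding == "longest":
        target = max(len(ids) for ids in batch_ids)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("this sketch needs max_length for padding='max_length'")
        target = max_length
    else:
        target = None  # no padding requested

    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = 0 if target is None else max(target - len(ids), 0)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # mask from lengths
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 5, 6, 7, 8, 102], [101, 9, 102]]  # made-up IDs
longest = prepare_batch(batch, pad_id=0, padding="longest")
short = prepare_batch(batch, pad_id=0, padding="max_length", truncation=True, max_length=4)

print(longest)
print(short)
```

Building the mask from the sequence lengths, rather than by comparing against the pad ID, keeps it correct even if the pad ID also appears as a regular token.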