# Chapter 8: Pretrained Models for Natural Language Processing

Installation Notes
To run this notebook on Google Colab, you will need to install the following libraries: transformers and datasets.

In Google Colab, you can run the following command to install these libraries:

In [None]:
!pip install datasets transformers

## 8.2 Learning Objectives

By the end of this chapter, you should be able to:
- understand the role of tokenization in preprocessing sentences as inputs
- load pretrained models and pipelines for NLP using Hugging Face
- understand the general idea behind generative models for NLP

## 8.3 Natural Language Processing

There are many datasets and models for Natural Language Processing available in the Hugging Face Hub. Each model has a corresponding tokenizer, which preprocesses and formats the text, turning it into a proper input for the model. First, we'll use the RoBERTa model to get an overview of the architecture and capabilities of language models in general, and then we'll use Hugging Face pipelines to perform some typical NLP tasks out-of-the-box. In the fourth part of the course, we'll explore these topics in further detail.

### 8.3.1 Model

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step1.png)

To create an instance of a [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta) model, we first load its corresponding RobertaConfig, which specifies many aspects of the model architecture, such as the number of embedding dimensions (hidden_size), the maximum sequence length (max_position_embeddings), and the vocabulary size (vocab_size). We then pass it as an argument to the RobertaModel class.

In [None]:
import torch
from transformers import RobertaConfig, RobertaModel

configuration = RobertaConfig()
configuration

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.44.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 50265
}

The loaded configuration specifies the architectural details of the model. We can then use it to create an instance of the RoBERTa model:



In [None]:
# random weights
model = RobertaModel(configuration)
model

RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(512, 768, padding_idx=1)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0-11): 12 x RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropou
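The shapes in the printout above follow directly from the configuration values. As a quick sanity check, the sketch below recomputes the size of each embedding table, and of one attention projection, using plain arithmetic. The dictionary simply copies four values from the printed RobertaConfig; the variable names are illustrative and not part of the library.

```python
# Values copied from the default RobertaConfig printed above
config = {
    "vocab_size": 50265,
    "hidden_size": 768,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
}

hidden = config["hidden_size"]

# An embedding table has one row per entry and `hidden` columns
word_embedding_params = config["vocab_size"] * hidden
position_embedding_params = config["max_position_embeddings"] * hidden
token_type_embedding_params = config["type_vocab_size"] * hidden

# A Linear(768, 768, bias=True) layer holds a weight matrix plus a bias vector
attention_projection_params = hidden * hidden + hidden

print(word_embedding_params)        # 38603520
print(position_embedding_params)    # 393216
print(token_type_embedding_params)  # 1536
print(attention_projection_params)  # 590592
```

The word embeddings alone account for more than 38 million parameters, which is why the vocabulary size is such an important part of the configuration.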

The created model is, however, untrained. If, instead of training it from scratch, we would like to load its pretrained weights, we can call its from_pretrained() method:

In [None]:
repo_id = "FacebookAI/roberta-base"
model = RobertaModel.from_pretrained(repo_id)
model

Some weights of RobertaModel were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0-11): 12 x RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropou

This is an encoder-based model, as stated by its name, and it's based on the Transformer architecture (which we'll discuss in further detail in "Contextual Word Embeddings with Transformers"). While encoder-based models are useful for tasks like classification, decoder-based models are mostly used for generating data (e.g. GPT), as we'll see later in this chapter.

Speaking of classification, what about RoBERTa's "head", that is, the classifier part that we've seen at the top of every computer vision model we've used so far?

It turns out this model is headless: there's no classifier head. Moreover, its last layer, pooler.dense, was not loaded either (notice the warning message above, suggesting the model needs to be further trained on a downstream task).

### 8.3.2 Tokenizers

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step3.png)

Similarly to computer vision models, NLP models also have prescribed transformations that we must apply to our input texts. These are carried out by tokenizers, an important and often overlooked part of language models.

Let's load RoBERTa's tokenizer and see what it does:

In [None]:
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained(repo_id)
tokenizer



RobertaTokenizer(name_or_path='FacebookAI/roberta-base', vocab_size=50265, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}

A tokenizer performs a whole sequence of operations, as indicated by its many configuration arguments:

- vocabulary (vocab_size)
- truncation (model_max_length and truncation_side)
- padding (padding_side)
- special token prepending and/or appending (special_tokens)

Let's take a quick look at each one of these components. We'll get back to all these operations in much more detail in "Contextual Word Embeddings with Transformers".

#### 8.3.2.1 Tokenizer

The tokenizer breaks up a sentence into its components - these are typically words or subwords (one or more syllables, in general). Each token will eventually be converted into an array of numerical values - its corresponding embedding. In computer vision, it was of the utmost importance to properly standardize input images using the same mean and standard deviation used during pretraining. The same holds true for NLP when it comes to using the right tokenizer and vocabulary: you must use the same ones as the pretrained model. We'll get back to it in more detail in "Contextual Word Embeddings with Transformers".

For now, let's see what the output of the tokenizer's tokenize() method looks like:

In [None]:
input_batch = ["I am really liking this course!", "This course is too complicated!"]

tokenized = tokenizer.tokenize(input_batch[0])
tokenized

['I', 'Ġam', 'Ġreally', 'Ġliking', 'Ġthis', 'Ġcourse', '!']

The sentence was split into several components. Even though the tokens themselves represent slightly different versions of the words, they still hold a correspondence to the original words. Let's convert a single token - "I" - to its corresponding numerical ID and back (using the decode() method):

In [None]:
tokenizer.convert_tokens_to_ids('I'), tokenizer.decode(tokenizer.convert_tokens_to_ids('I'))

(100, 'I')

We can also do the same for the whole sentence:

In [None]:
tokenizer.convert_tokens_to_ids(tokenized), tokenizer.decode(tokenizer.convert_tokens_to_ids(tokenized))

([100, 524, 269, 25896, 42, 768, 328], 'I am really liking this course!')

In the example above, each word corresponded to a single token, but that's not always the case. Different tokenizers may split a word into multiple components, depending on their vocabularies.

#### 8.3.2.2 Vocabulary

The vocabulary is the exhaustive list of tokens that can be handled by the model. If a token is not present in the vocabulary, though, it still can default to an "unknown" token, as long as the "unknown" token is part of the vocabulary.

The tokenizer's vocabulary can be easily accessed through its get_vocab() method. It returns a dictionary that maps every token to a corresponding numerical ID:

In [None]:
tokens_to_idx = tokenizer.get_vocab()
tokens_to_idx

{'<s>': 0,
 '<pad>': 1,
 '</s>': 2,
 '<unk>': 3,
 '.': 4,
 'Ġthe': 5,
 ',': 6,
 'Ġto': 7,
 'Ġand': 8,
 'Ġof': 9,
 'Ġa': 10,
 'Ġin': 11,
 '-': 12,
 'Ġfor': 13,
 'Ġthat': 14,
 'Ġon': 15,
 'Ġis': 16,
 'âĢ': 17,
 "'s": 18,
 'Ġwith': 19,
 'ĠThe': 20,
 'Ġwas': 21,
 'Ġ"': 22,
 'Ġat': 23,
 'Ġit': 24,
 'Ġas': 25,
 'Ġsaid': 26,
 'Ļ': 27,
 'Ġbe': 28,
 's': 29,
 'Ġby': 30,
 'Ġfrom': 31,
 'Ġare': 32,
 'Ġhave': 33,
 'Ġhas': 34,
 ':': 35,
 'Ġ(': 36,
 'Ġhe': 37,
 'ĠI': 38,
 'Ġhis': 39,
 'Ġwill': 40,
 'Ġan': 41,
 'Ġthis': 42,
 ')': 43,
 'ĠâĢ': 44,
 'Ġnot': 45,
 'Ŀ': 46,
 'Ġyou': 47,
 'ľ': 48,
 'Ġtheir': 49,
 'Ġor': 50,
 'Ġthey': 51,
 'Ġwe': 52,
 'Ġbut': 53,
 'Ġwho': 54,
 'Ġmore': 55,
 'Ġhad': 56,
 'Ġbeen': 57,
 'Ġwere': 58,
 'Ġabout': 59,
 ',"': 60,
 'Ġwhich': 61,
 'Ġup': 62,
 'Ġits': 63,
 'Ġcan': 64,
 'Ġone': 65,
 'Ġout': 66,
 'Ġalso': 67,
 'Ġ$': 68,
 'Ġher': 69,
 'Ġall': 70,
 'Ġafter': 71,
 '."': 72,
 '/': 73,
 'Ġwould': 74,
 "'t": 75,
 'Ġyear': 76,
 'Ġwhen': 77,
 'Ġfirst': 78,
 'Ġshe': 79,
 'Ġtwo': 

Did you notice that most tokens start with a "weird" character (Ġ)? In this particular tokenizer, this character indicates the space preceding a word. Tokens that do not start with Ġ are found either at the beginning of a sequence or as part of another word. Typical examples of the latter are the endings "ed" and "ing" which are tokens but not full words.

In [None]:
tokens_to_idx['ed'], tokens_to_idx['ing'], tokens_to_idx['Ġonly'], tokens_to_idx['only']

(196, 154, 129, 8338)
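To make the role of the Ġ marker concrete, here is a minimal sketch that rebuilds text from a list of tokens by treating Ġ as "a space came before this token". It is a simplification written for this section (the real tokenizer works at the byte level and handles many more characters), and the helper name tokens_to_text is hypothetical.

```python
SPACE_MARKER = "Ġ"  # marks a token that was preceded by a space

def tokens_to_text(tokens):
    """Join tokens and turn every space marker back into a real space."""
    return "".join(tokens).replace(SPACE_MARKER, " ")

tokens = ['I', 'Ġam', 'Ġreally', 'Ġliking', 'Ġthis', 'Ġcourse', '!']
text = tokens_to_text(tokens)
print(text)  # I am really liking this course!

# A token without the marker attaches to the previous token
# (illustrative tokens, not looked up in the real vocabulary)
print(repr(tokens_to_text(['Ġwalk', 'ed'])))  # ' walked'
```

Since "ed" carries no marker, it is glued to the token before it, which is exactly how word endings are represented.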

Let's try tokenizing a sentence that has a word with the "ing" ending:

In [None]:
tokenizer.tokenize('I am dissecting this, am I?')

['I', 'Ġam', 'Ġdissect', 'ing', 'Ġthis', ',', 'Ġam', 'ĠI', '?']

The word "dissecting" isn't used often enough to be taken as a whole word, so it's split into two components: "Ġdissect", which is a word in its own right, and the typical "ing" ending. Some words are more commonly used, though, and both versions may be considered full words by the tokenizer:

In [None]:
tokenizer.tokenize('I am playing with the word play.')

['I', 'Ġam', 'Ġplaying', 'Ġwith', 'Ġthe', 'Ġword', 'Ġplay', '.']

See? Even though "play" is a token in the vocabulary, "playing" is also a token (as opposed to being made up of the separate tokens "play" and "ing").
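The effect of the vocabulary on how words are split can be imitated with a toy example. The sketch below is not RoBERTa's algorithm (which relies on byte-pair encoding with learned merges); it swaps in a much simpler greedy longest-match rule over a tiny, hypothetical vocabulary, just to show why a frequent word stays whole while a rarer one is broken into pieces.

```python
# Hypothetical toy vocabulary; RoBERTa's real vocabulary has 50,265 entries
TOY_VOCAB = {"play", "playing", "dissect", "ing", "ed", "<unk>"}

def split_word(word, vocab=TOY_VOCAB):
    """Greedily take the longest vocabulary entry that matches the rest of the word."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No entry matches at this position: the whole word becomes unknown
            return ["<unk>"]
    return pieces

print(split_word("dissecting"))  # ['dissect', 'ing']
print(split_word("playing"))     # ['playing']
print(split_word("played"))      # ['play', 'ed']
print(split_word("xyz"))         # ['<unk>']
```

Adding "dissecting" to TOY_VOCAB would keep that word whole as well, which mirrors how a larger or different vocabulary changes the resulting tokens.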

#### 8.3.2.3 Max Length

The tokenizer will also truncate the input to the maximum length taken by the model which, in RoBERTa's case, is 512 tokens. Notice that this is different from the maximum length of a single sentence, which is two tokens shorter than the model's maximum length:

In [None]:
tokenizer.max_len_single_sentence, tokenizer.model_max_length

(510, 512)

The difference is due to the special tokens that will be both prepended (beginning of sequence, or BOS, token) and appended (end of sequence, or EOS, token) to the sentence. Our sentences are quite short, so nothing will actually happen at this step. For the sake of illustrating the idea, we can force the tokenizer to truncate the input to a much shorter length, say, five tokens:

In [None]:
truncated_token_ids = tokenizer(input_batch[0], truncation=True, max_length=5)['input_ids']
truncated_token_ids

[0, 100, 524, 269, 2]

Let's see what's left of the original message:

In [None]:
tokenizer.decode(truncated_token_ids)

'<s>I am really</s>'

Actually, only three tokens of the original sentence made it. The remaining two tokens (out of the five we're truncating the input to) are the subject of our next topic, special tokens.
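The arithmetic behind this result can be sketched in a few lines of plain Python. The function below is a hypothetical helper, not the tokenizer's own code; it assumes truncation on the right side and reserves two positions for the special tokens, using the ids 0 and 2 seen in the output above.

```python
BOS_ID, EOS_ID = 0, 2  # ids of <s> and </s> in RoBERTa's vocabulary

def truncate_with_special_tokens(token_ids, max_length):
    """Keep as many tokens as fit once two slots are reserved for <s> and </s>."""
    room_for_tokens = max_length - 2
    return [BOS_ID] + token_ids[:room_for_tokens] + [EOS_ID]

# ids of "I am really liking this course!" without special tokens
sentence_ids = [100, 524, 269, 25896, 42, 768, 328]

truncated = truncate_with_special_tokens(sentence_ids, max_length=5)
print(truncated)  # [0, 100, 524, 269, 2]
```

With max_length=512, the same reasoning leaves room for 510 tokens, which matches the value of max_len_single_sentence.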

#### 8.3.2.4 Special Tokens

There are several special tokens. One of them stands for unknown words (UNK), that is, words not present in our predefined vocabulary. Others represent the start of a sequence (BOS), the end of a sequence (EOS), a separation between two sequences (SEP), or padding (PAD), which "stuffs" a sequence to bring it to a certain length. We can inspect all special tokens defined in the tokenizer using its special_tokens_map attribute:

In [None]:
tokenizer.special_tokens_map

{'bos_token': '<s>',
 'eos_token': '</s>',
 'unk_token': '<unk>',
 'sep_token': '</s>',
 'pad_token': '<pad>',
 'cls_token': '<s>',
 'mask_token': '<mask>'}

RoBERTa's tokenizer both prepends a "start" token to the beginning of a sequence and appends an "end" token to the end of the sequence:

In [None]:
token_ids = tokenizer.encode(input_batch[0], add_special_tokens=True)
token_ids

[0, 100, 524, 269, 25896, 42, 768, 328, 2]

It's easier to see the added tokens by decoding the IDs back into text:

In [None]:
tokenizer.decode(token_ids)

'<s>I am really liking this course!</s>'

See? The sequence now starts with an `<s>` token (id zero) and ends with a `</s>` token (id two).

Calling the tokenizer itself (as opposed to one of its methods) does the whole thing at once, so our initial sentences are properly converted into sequences of token indices:

In [None]:
input_batch = ["I am really liking this course!", "This course is too complicated!"]
transformed = tokenizer(input_batch)['input_ids']
transformed

[[0, 100, 524, 269, 25896, 42, 768, 328, 2],
 [0, 713, 768, 16, 350, 6336, 328, 2]]

### 8.3.3 Inference

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step5.png)

We still need to turn our input sequences into a PyTorch tensor. You may be wondering why the transformation didn't return tensors instead of plain Python lists. The answer is, the sequences may have different lengths, and that literally raises problems if you want to make a (single) tensor out of them:

In [None]:
torch.as_tensor(transformed)

ValueError: expected sequence of length 9 at dim 1 (got 8)

The solution is to pad the shorter sequence, that is, to append another special token (`<pad>`, in RoBERTa's case) to its end, as many times as needed, so its length matches that of the longest sequence in the batch. Padding is a common operation, both in Computer Vision and Natural Language Processing, and we'll get back to it in the chapters that follow. The tokenizer already has a special padding token, and we can easily retrieve its corresponding id through the tokenizer's pad_token_id attribute:

In [None]:
tokenizer.pad_token_id

1
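Before letting the tokenizer do it, it may help to see what padding amounts to. The sketch below pads a batch by hand in plain Python, using the pad id retrieved above. The helper pad_batch is hypothetical, and the mask it also returns (1 for real tokens, 0 for padding) is an extra included here only to show which positions were added.

```python
PAD_ID = 1  # value of tokenizer.pad_token_id shown above

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return padded, mask

batch = [[0, 100, 524, 269, 25896, 42, 768, 328, 2],
         [0, 713, 768, 16, 350, 6336, 328, 2]]
padded, mask = pad_batch(batch)
print(padded[1])  # [0, 713, 768, 16, 350, 6336, 328, 2, 1]
print(mask[1])    # [1, 1, 1, 1, 1, 1, 1, 1, 0]
```

Once every list has the same length, the batch can be turned into a single tensor without errors.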

Luckily, we don't have to manually add padding tokens to our inputs in order to make PyTorch tensors out of them. We only need to specify padding=True as argument of our tokenizer, thus yielding lists of equal sizes, and then specify return_tensors='pt' (pt stands for PyTorch) so the tokenizer returns tensors instead of Python lists:

In [None]:
input_batch = ["I am really liking this course!", "This course is too complicated!"]
model_input = tokenizer(input_batch, padding=True, return_tensors='pt')['input_ids']
model_input, model_input.shape

(tensor([[    0,   100,   524,   269, 25896,    42,   768,   328,     2],
         [    0,   713,   768,    16,   350,  6336,   328,     2,     1]]),
 torch.Size([2, 9]))

Now we're set: our input has two sentences of the same length, nine tokens each.

Let's see what kind of output our RoBERTa model returns:

In [None]:
model.eval()
output = model(model_input)
output.last_hidden_state.shape

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


torch.Size([2, 9, 768])
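The warning above is worth acting on: without an attention mask, the model treats the padding token as if it were part of the sentence. Here is a minimal sketch of passing a mask, assuming a tiny, randomly initialized RobertaModel so that nothing needs to be downloaded. The small sizes (32 hidden dimensions, one layer, two heads) are arbitrary choices made for this sketch, not values from the pretrained model.

```python
import torch
from transformers import RobertaConfig, RobertaModel

# Tiny random model: same vocabulary, much smaller architecture (assumed sizes)
config = RobertaConfig(vocab_size=50265, hidden_size=32, num_hidden_layers=1,
                       num_attention_heads=2, intermediate_size=64,
                       max_position_embeddings=514)
tiny_model = RobertaModel(config)
tiny_model.eval()

# The padded batch from above
input_ids = torch.tensor([[0, 100, 524, 269, 25896, 42, 768, 328, 2],
                          [0, 713, 768, 16, 350, 6336, 328, 2, 1]])
# 1 for real tokens, 0 for padding (pad_token_id is 1)
attention_mask = (input_ids != config.pad_token_id).long()

with torch.no_grad():
    out = tiny_model(input_ids=input_ids, attention_mask=attention_mask)

print(attention_mask[1].tolist())   # [1, 1, 1, 1, 1, 1, 1, 1, 0]
print(out.last_hidden_state.shape)  # torch.Size([2, 9, 32])
```

In practice there is no need to build the mask by hand: the dictionary returned by the tokenizer contains an attention_mask entry next to input_ids, so both can be forwarded to the model together.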

Two sentences, nine tokens each, each token represented by an array of 768 numerical values. In "Building Your First Dataset" we converted categorical values (such as "Gasoline", "Diesel", etc.) to arrays of numerical values. Do you remember what those were called? Those were embeddings. And so are these! RoBERTa produced embeddings for each token in our two sentences. The model learned how to produce them during pretraining, and they are more than regular embeddings: they are contextual embeddings.

We already saw regular embeddings: they work like a big lookup table. Imagine that you have every word from Webster's dictionary in your table, each row corresponding to a word, and an array assigned to every word. If you want the values for the word "bank", you look it up, and there you have them! But words aren't as straightforward as we'd like them to be: the word "bank" may stand for a financial establishment or the land alongside a river or lake. Regular embeddings do not account for these differences in meaning, but contextual embeddings do. The array corresponding to the word "bank" may differ depending on the context it is being used in. That's what a language model such as RoBERTa produces: contextual embeddings. We'll see all these in much more detail in "Contextual Word Embeddings with Transformers".
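The difference between the two kinds of embeddings can be illustrated with a toy example. The sketch below is only an illustration under stated assumptions: it uses a hypothetical three-word vocabulary, a randomly initialized lookup table, and a single randomly initialized attention layer as a stand-in for a full language model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical three-word vocabulary
vocab = {"river": 0, "money": 1, "bank": 2}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
attention = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
attention.eval()

sentence_a = torch.tensor([[vocab["river"], vocab["bank"]]])
sentence_b = torch.tensor([[vocab["money"], vocab["bank"]]])

with torch.no_grad():
    # Regular embeddings: a plain table lookup, one row per token
    static_a = embedding(sentence_a)
    static_b = embedding(sentence_b)
    # Contextual embeddings: every position is mixed with its neighbors
    contextual_a, _ = attention(static_a, static_a, static_a)
    contextual_b, _ = attention(static_b, static_b, static_b)

# "bank" is the second token (index 1) in both sentences
same_static = torch.equal(static_a[0, 1], static_b[0, 1])
same_contextual = torch.allclose(contextual_a[0, 1], contextual_b[0, 1])
print(same_static, same_contextual)
```

The lookup returns the very same array for "bank" in both sentences, while the output of the attention layer depends on the neighboring word, so the two arrays for "bank" no longer match.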

In the meantime, let's see how these contextual embeddings can be used in a "head".

### 8.3.4 Attaching a Head

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step1.png)

A "head" in this context is nothing but a layer, sequence of layers, or a small model, that uses the embeddings as input and converts them into an actual prediction (e.g. "positive" or "negative" labels for sentiment analysis).

Let's create a RoBERTa model with a classifier head that does just that - it takes a sequence of contextual embeddings of 768 dimensions each, and produces two logits, one for each class. There are many RoBERTa models tailored for different downstream tasks, such as sequence classification (the one we're using), token classification, question answering, multiple choice, etc.

In sequence classification, we can specify the number of classes (or distinct labels) pertaining to our task using the num_labels argument of the from_pretrained() method:

In [None]:
from transformers import RobertaForSequenceClassification

torch.manual_seed(11)
model_with_head = RobertaForSequenceClassification.from_pretrained(repo_id, num_labels=2)
model_with_head

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

As you can see, RoBERTa's classifier head is quite a simple model.

In [None]:
classifier_head = model_with_head.classifier
classifier_head

RobertaClassificationHead(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (out_proj): Linear(in_features=768, out_features=2, bias=True)
)
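To get a feel for how such a head turns embeddings into logits, here is a hypothetical re-creation of the three printed layers in plain PyTorch. Two details are assumptions rather than something visible in the printout: that the head summarizes each sentence using the embedding of its first token, and that a tanh is applied between the two linear layers. The class name is made up for this sketch.

```python
import torch
import torch.nn as nn

class SketchClassificationHead(nn.Module):
    """Hypothetical re-creation of the three printed layers, for illustration."""
    def __init__(self, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):
        # Assumption: summarize the sentence by its first token (<s>)
        x = hidden_states[:, 0, :]
        x = self.dropout(x)
        # Assumption: tanh non-linearity between the two linear layers
        x = torch.tanh(self.dense(x))
        x = self.dropout(x)
        return self.out_proj(x)

head = SketchClassificationHead()
head.eval()
# Random stand-in for the contextual embeddings: 2 sentences, 9 tokens, 768 values
fake_embeddings = torch.randn(2, 9, 768)
logits = head(fake_embeddings)
print(logits.shape)  # torch.Size([2, 2])
```

Whatever the exact internals, the shapes tell the story: a batch of token embeddings goes in, and one pair of logits per sentence comes out.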

Now, our model is (almost) ready to perform binary classification. What if we give our model the same input as before?

In [None]:
model_with_head.eval()
output = model_with_head(model_input)
output, output.logits.shape

(SequenceClassifierOutput(loss=None, logits=tensor([[-0.1540,  0.0212],
         [-0.1685,  0.0220]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None),
 torch.Size([2, 2]))

Two sentences, two logits each.
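As a reminder of how logits are used, the sketch below converts the logits from the output above into probabilities and predicted classes. The id2label dictionary is a hypothetical mapping introduced only for this example.

```python
import torch

# Logits copied from the output above (one row per sentence)
logits = torch.tensor([[-0.1540, 0.0212],
                       [-0.1685, 0.0220]])

# Softmax turns each row of logits into probabilities that sum to one
probabilities = torch.softmax(logits, dim=-1)
# The predicted class is the index of the largest logit in each row
predictions = logits.argmax(dim=-1)

# Hypothetical mapping from class index to a readable name
id2label = {0: "negative", 1: "positive"}
print(probabilities)
print([id2label[i] for i in predictions.tolist()])  # ['positive', 'positive']
```

Both sentences get the same label, and both probabilities are close to 50%, which is what one would expect from a classifier head whose weights were newly initialized.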

Notice that the [output](https://huggingface.co/docs/transformers/en/main_classes/output) of a Hugging Face model is, unless otherwise specified, a dictionary-like structure, SequenceClassifierOutput in this case. It typically contains:

- loss (returned only when labels are provided)
- logits
- hidden_states (the contextual embeddings we discussed in the previous section, returned if output_hidden_states=True)
- attentions (attention scores, returned if output_attentions=True)

The returned logits are just random because our classifier head hasn't been trained yet. And that's your task in the next lab!

### 8.3.5 Logits and Loss Functions

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step2.png)

So far, we have only used, but not trained, a classifier. The classifier we used was trained to classify images into one thousand possible categories. We saw that the model produces one logit for each possible category, and that these logits may be converted to probabilities using a softmax function.

Now, we'll be training RoBERTa's classifier head in the upcoming lab. What do we know so far? The classifier head, as seen in the previous section, produces two logits for each input, since its purpose is to classify sentences in two categories: "positive" and "negative" sentiment.

A classification task calls for a different loss function, and the choice depends on a couple of factors:

- Does it output logits?
- How many outputs?

Classifiers usually output logits, not probabilities or log-probabilities (the logarithm of the probabilities). If the last layer of the model is a regular linear layer, it's safe to assume it is indeed producing logits. Some models may have a Sigmoid or LogSoftmax layer at the end and, if that's the case, they will be producing probabilities or log-probabilities, respectively. For now, let's focus on the easier variety of models that simply produce logits.

Let's focus on the second question now: we saw models producing 1,000 logits, and our RoBERTa classifier head produces two logits. In both cases, the appropriate loss function is the cross-entropy loss (nn.CrossEntropyLoss), and the task itself is deemed a multiclass classification task (even if we only have two possible categories, as in our case).

#### 8.3.5.1 One Logit or Two Logits?

It is common to refer to classification tasks as either binary (two categories) or multiclass (more than two categories). So it may be confusing that we have only two categories, "positive" and "negative", and yet the loss function we should be using, cross-entropy loss, is typical of multiclass classification tasks.

The difference boils down to the number of logits being produced by the model. As it turns out, it is possible to achieve binary classification using either a single logit or two logits. Let's quickly go over the difference between the two approaches.

A single logit answers a single "yes/no" question, such as, is the sentence "positive"? It is typical to assume "yes" corresponds to high values of logits (e.g. positive values) and "no" corresponds to low values of logits (e.g. negative values). So, you can use a threshold (typically zero) to split the results in two, and each split corresponds to a category (values above zero mean "positive", values below zero mean "negative"). We'll be using a single logit to perform true binary classification in the next chapter, and then we'll go into more detail about this approach.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch6/one_logit.png)
Binary classification with a single logit

Multiple logits answer a series of questions or, better yet, they assign scores to each question. In our case, there are two logits or questions:

- "Is this a positive sentence?"
- "Is this a negative sentence?"

There can be only one answer, and the logit with the highest value (score) wins, as we've already seen a few times. Notice that in this case, the "winning" logit may even be a negative value, it just needs to be higher than the others.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch6/two_logits.png)

Binary classification with two logits

Sure, if there are only two questions, this approach is redundant and it could be simplified to a "yes/no" question using a single logit. So, why haven't we done that instead?

Pretrained models are usually trained to classify their inputs into multiple categories. Moreover, as we'll see later on, there are utilities that map the "winning" logit to its human-readable category name, and they expect multiple logits, even for a task that could have been handled as a binary classification.
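The redundancy of the two-logit approach can be checked numerically. The probability that a softmax assigns to the second of two logits equals the sigmoid of the difference between them, so two logits carry the same information as a single one. A minimal sketch, using one pair of logits from the earlier output and illustrative variable names:

```python
import torch

# Two logits for one sentence: scores for "negative" and "positive"
two_logits = torch.tensor([-0.1540, 0.0212])
prob_positive_two = torch.softmax(two_logits, dim=-1)[1]

# Equivalent single logit: how much "positive" outscores "negative"
single_logit = two_logits[1] - two_logits[0]
prob_positive_one = torch.sigmoid(single_logit)

print(prob_positive_two.item(), prob_positive_one.item())
# Thresholding the single logit at zero picks the same class as the argmax
print(bool(single_logit > 0), two_logits.argmax().item() == 1)  # True True
```

Both routes lead to the same probability and the same predicted class; the only difference is how many outputs the model produces.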

#### 8.3.5.2 Cross-Entropy Loss

So, in a nutshell, if your classifier head produces more than one logit (and it is OK to use two for a binary classification), you must use the cross-entropy loss.

The table below may help you organize the ideas presented in this section (for now, we'll only use the last column):

| | BCE Loss | BCE With Logits Loss | NLL Loss | Cross-Entropy Loss |
| --- | --- | --- | --- | --- |
| Classification | binary | binary | multiclass / binary | multiclass / binary |
| Model output (each data point) | probability | logit | array of two or more log-probabilities | array of two or more logits |
| Label (each data point) | float (0.0 or 1.0) | float (0.0 or 1.0) | long (class index) | long (class index) |
| Model's last layer | Sigmoid | Linear | LogSoftmax | Linear |
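The columns of the table come in pairs: within each pair, the two losses differ only in whether the last transformation is applied by the model or by the loss function itself. The sketch below checks this on random values; the shapes and labels are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Multiclass-style setup: 4 data points, 3 classes, labels are class indices
logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 2])
ce = nn.CrossEntropyLoss()(logits, labels)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=-1)(logits), labels)

# Binary setup: one logit per data point, labels are floats
single_logits = torch.randn(4)
float_labels = torch.tensor([0.0, 1.0, 1.0, 0.0])
bce_with_logits = nn.BCEWithLogitsLoss()(single_logits, float_labels)
bce = nn.BCELoss()(torch.sigmoid(single_logits), float_labels)

# Each pair should agree up to floating-point precision
ce_matches_nll = torch.allclose(ce, nll, atol=1e-6)
bce_pair_matches = torch.allclose(bce_with_logits, bce, atol=1e-6)
print(ce_matches_nll, bce_pair_matches)
```

So the choice of loss boils down to what the model outputs: if the last layer is linear, the "logits" versions (BCE With Logits or Cross-Entropy) are the ones to use.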

#### 8.3.5.3 Losses in Hugging Face Models

In the previous section, "Attaching a Head", we saw that Hugging Face models return a dictionary-like structure that includes the loss when labels are provided. While it's perfectly possible to compute the loss manually from the returned logits, as we've been doing so far, it may be convenient to use the loss computed automatically by the model, since it takes into consideration the task at hand (sequence classification with two labels, in our case).

Let's go over a quick example to illustrate how it works. First, let's create some labels for our two sentences:



In [None]:
input_batch = ["I am really liking this course!", "This course is too complicated!"]
model_input = tokenizer(input_batch, padding=True, return_tensors='pt')['input_ids']
labels = torch.as_tensor([1, 0])

Previously, we forwarded the inputs to the model in evaluation mode because we were interested either in the hidden states (the contextual embeddings) or in the logits (predictions). Now, we're setting it to training mode and passing the labels as well, so it will also return the loss:

In [None]:
model_with_head.train()
output = model_with_head(model_input, labels=labels)
output

SequenceClassifierOutput(loss=tensor(0.6711, grad_fn=<NllLossBackward0>), logits=tensor([[ 0.0871,  0.1107],
        [ 0.0525, -0.0134]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In training mode, it still returns the logits, so we may compute the loss ourselves if we want to:

In [None]:
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(output.logits, labels)
loss

tensor(0.6711, grad_fn=<NllLossBackward0>)

We got matching values. After all, the model is using the appropriate loss function for the task at hand. If you run these cells multiple times, you may get different values for the loss, though. Remember, in training mode, some layers, such as dropout, won't behave deterministically.
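To see where the number comes from, the same loss can be recomputed by hand from the logits printed above, using nothing but the standard library. This is a sketch of the textbook formula (the negative log of the softmax probability of the true class, averaged over the mini-batch); since the printed logits are rounded to four decimal places, the result is only expected to agree up to rounding.

```python
import math

# Logits and labels copied from the training-mode output above
logits = [[0.0871, 0.1107],
          [0.0525, -0.0134]]
labels = [1, 0]

def cross_entropy(row, label):
    """Negative log of the softmax probability assigned to the true class."""
    log_sum_exp = math.log(sum(math.exp(z) for z in row))
    return log_sum_exp - row[label]

per_sentence = [cross_entropy(row, label) for row, label in zip(logits, labels)]
loss = sum(per_sentence) / len(per_sentence)  # mean over the mini-batch
print(round(loss, 4))  # 0.6711
```

A value close to 0.69, the natural logarithm of two, is typical of a two-class classifier that is still guessing at random.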

## 8.4 TensorBoard

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step4.png)

So far, we haven't logged or inspected our losses in real time. Why bother, if it takes only a minute to train the model? This time is different, though: fine-tuning RoBERTa on more than 67,000 data points, even for a single epoch, will take about 15 min or so in Google Colab. So, let's use a convenient tool to see how our loss is doing as training progresses.

Yes, TensorBoard is that good! So good that we’ll be using a tool from the competing framework, TensorFlow. Jokes aside, TensorBoard is a very useful tool, and PyTorch provides classes and methods so that we can integrate it with our model.

First, we need to load TensorBoard’s extension for Jupyter. It is possible to run some special commands inside Jupyter Notebooks by placing a `%` character at the start of a line: these are the built-in [magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html). A magic is a kind of shortcut that extends a notebook's capabilities. Once the extension is loaded, we can run TensorBoard using the newly available magic:

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs

The magic above tells TensorBoard to look for logs inside the folder specified by the logdir argument: runs. So, there must be a runs folder in the same location as the notebook you’re using to train the model.

Initially, it looks like this:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch6/empty_tensorboard.png)
Empty TensorBoard

This is rather uninteresting unless some data is actually sent there so we can visualize it.

It all starts with the creation of a SummaryWriter: since we told TensorBoard to look for logs inside the runs folder, it makes sense to actually log to that folder. Moreover, to be able to distinguish between different experiments or models, we should also specify a sub-folder: test.

In [None]:
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter('runs/test')

What about sending the loss values to TensorBoard? We can use the add_scalars() method to send multiple scalar values at once; it needs three arguments:

- main_tag: the parent name of the tags, or the "group tag," if you will;
- tag_scalar_dict: the dictionary containing the key: value pairs for the scalars you want to keep track of (for example, training and validation losses);
- global_step: step value; that is, the index you’re associating with the values you’re sending in the dictionary; the index of the mini-batch comes to mind in our case, as losses are computed for each mini-batch.

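
As a minimal sketch of how these three arguments fit together, the loop below sends one pair of losses per step. The loss values are made up for illustration (in practice they would come from the training loop), and the tag names `loss`, `training`, and `validation` are arbitrary choices, not something required by TensorBoard.

```python
from torch.utils.tensorboard import SummaryWriter

# Same folder TensorBoard was told to watch (runs), plus a sub-folder for this experiment
writer = SummaryWriter('runs/test')

# Made-up loss values, for illustration only; real ones come from the training loop
fake_train_losses = [0.69, 0.52, 0.41, 0.35]
fake_val_losses = [0.70, 0.58, 0.49, 0.45]

for step, (train_loss, val_loss) in enumerate(zip(fake_train_losses, fake_val_losses)):
    writer.add_scalars(
        main_tag='loss',                          # the "group tag"
        tag_scalar_dict={'training': train_loss,  # one curve per key
                         'validation': val_loss},
        global_step=step,                         # index of the mini-batch
    )

# Flush pending events to disk and release the file handles
writer.close()
```

Since both scalars share the same main tag, they should show up as two curves in a single chart.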
As training progresses, you can go back to the cell where TensorBoard was loaded, click on its refresh button on the top right, and observe the current loss level. It will look similar to this:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch6/tensorboard_losses.png)

#### Losses in TensorBoard

If the losses are oscillating too much (as they will in the next lab), you may smooth the plot using the slider shown in the bottom-right corner:
![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch6/smooth_slider.png)

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch6/tensorboard_losses_smooth.png)

#### Smooth Losses in TensorBoard
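
To get a feeling for what smoothing does to an oscillating curve, here is a plain-Python sketch of an exponential moving average. It is only an illustration of the general idea: the `smooth` helper and the loss values are hypothetical, and TensorBoard's own smoothing may differ in its details.

```python
def smooth(values, weight=0.6):
    """Exponential moving average: each smoothed point blends the previous
    smoothed value (scaled by `weight`) with the new raw value."""
    smoothed = []
    last = values[0]  # start from the first raw value
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed

noisy_losses = [1.0, 0.2, 0.9, 0.3, 0.8, 0.2]  # made-up, oscillating losses
smoothed_losses = smooth(noisy_losses, weight=0.6)

for raw, s in zip(noisy_losses, smoothed_losses):
    print(f"raw: {raw:.2f}  smoothed: {s:.2f}")
```

The closer the weight is to one, the smoother (and the more delayed) the resulting curve.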

This is just a quick overview, so you can use TensorBoard in the next lab, and visualize the losses in real time while your model is training. If you want to know more about running TensorBoard inside notebooks, check out this official [guide](https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks).

## 8.6 HuggingFace Pipelines

There are pipelines available for many different tasks in the Hugging Face Hub:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch6/hf_nlp_tasks.png)

#### Natural Language Processing Pipelines in Hugging Face

Sentiment analysis belongs in the "Text Classification" bucket, so let's check the default model used by text classification pipelines using the SUPPORTED_TASKS dictionary once again:

In [None]:
from transformers.pipelines import SUPPORTED_TASKS
SUPPORTED_TASKS['text-classification']['default']

{'model': {'pt': ('distilbert/distilbert-base-uncased-finetuned-sst-2-english',
   'af0f99b'),
  'tf': ('distilbert/distilbert-base-uncased-finetuned-sst-2-english',
   'af0f99b')}}

The model is a [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) model fine-tuned on the ["Stanford Sentiment Treebank (SST-2)"](https://huggingface.co/datasets/stanfordnlp/sst2) dataset to perform binary classification. The DistilBERT model is a distilled (that is, more compact with little loss of performance) version of BERT, the famous encoder-based model that spawned a whole family of models, RoBERTa included. We'll dive deeper into these models in "Contextual Word Embeddings with Transformers".

Now, let's create a text classification pipeline and specify its default model:

In [None]:
from transformers import pipeline

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
classifier = pipeline('text-classification', model=model_name)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step5.png)

Just like before, we can feed our input batch directly to the pipeline and get instant predictions:

In [None]:
input_batch = ["I am really liking this course!", "This course is too complicated!"]

classifier(input_batch)

[{'label': 'POSITIVE', 'score': 0.9997199177742004},
 {'label': 'NEGATIVE', 'score': 0.9996912479400635}]

Both sentences are easily classified as positive and negative, respectively. Easy, right?

We can also take a peek under the hood of our pipeline.

### 8.6.1 Transforms / Tokenizer

#### Hugging Face Transforms / Tokenizer

In the computer vision pipeline, the transformation was an instance of an ImageProcessor. In an NLP pipeline, the equivalent preprocessing steps are performed by a tokenizer. We can easily access the tokenizer that matches our model using the tokenizer attribute of our pipeline:

In [None]:
classifier.tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

You probably can recognize some familiar elements:

- vocab_size: the size of the vocabulary used in the tokenizer;
- model_max_length: maximum length of the input sequence the model can handle; longer inputs must be truncated (for example, by passing truncation=True to the tokenizer);
- special_tokens: there are tokens for unknown words, for separating sentences, and for padding, which we've already discussed, as well as tokens for classification and masking. We'll get back to those two very special tokens in "Contextual Word Embeddings with Transformers".

Let's tokenize our input batch and decode the result:

In [None]:
tokenized_dict = classifier.tokenizer(input_batch)
tokenized_dict

{'input_ids': [[101, 1045, 2572, 2428, 16663, 2023, 2607, 999, 102], [101, 2023, 2607, 2003, 2205, 8552, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}

In [None]:
classifier.tokenizer.decode(tokenized_dict['input_ids'][0])

'[CLS] i am really liking this course! [SEP]'

This tokenizer prepended a classification token and appended a separation token to the sequence. The separation token marks the boundary between two sequences or, as in our case, the end of a sequence. The classification token is a very special token whose purpose is to generate embeddings that will be used by a classifier head. Don't worry if this sounds too esoteric right now; we'll dig deeper into it in "Contextual Word Embeddings with Transformers".
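
As a rough illustration of what happened to the first sentence, the sketch below wraps a list of token ids with the ids of the two special tokens (101 for [CLS] and 102 for [SEP], as listed in the tokenizer's output). The `wrap_with_special_tokens` helper is hypothetical and only mimics the end result for a single sequence; it is not how the tokenizer is implemented.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] in this tokenizer's vocabulary

def wrap_with_special_tokens(token_ids):
    """Mimic the single-sequence layout: [CLS] tokens... [SEP]."""
    return [CLS_ID] + token_ids + [SEP_ID]

# ids for "i am really liking this course!" without the special tokens
plain_ids = [1045, 2572, 2428, 16663, 2023, 2607, 999]
wrapped_ids = wrap_with_special_tokens(plain_ids)
print(wrapped_ids)
```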

We can also load a pretrained tokenizer using the corresponding model name and the AutoTokenizer class:

In [None]:
from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

tokenized_output = hf_tokenizer(input_batch, add_special_tokens=True, padding=True, return_tensors='pt')
tokenized_output

{'input_ids': tensor([[  101,  1045,  2572,  2428, 16663,  2023,  2607,   999,   102],
        [  101,  2023,  2607,  2003,  2205,  8552,   999,   102,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0]])}
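
Notice that, with padding=True, the shorter sentence got an extra 0 at the end, and its attention mask flags that position as one to be ignored. The plain-Python sketch below reproduces this by hand. It assumes right-side padding with the [PAD] token at index 0, as reported by the tokenizer, and `pad_batch` is a hypothetical helper, not part of the Transformers API.

```python
PAD_ID = 0  # id of [PAD] in this tokenizer's vocabulary

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one and
    build the matching attention masks (1 = real token, 0 = padding)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# The unpadded ids produced by the pipeline's tokenizer for our two sentences
unpadded = [[101, 1045, 2572, 2428, 16663, 2023, 2607, 999, 102],
            [101, 2023, 2607, 2003, 2205, 8552, 999, 102]]
padded = pad_batch(unpadded)
print(padded['input_ids'][1])
print(padded['attention_mask'][1])
```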

### 8.6.2 Model

#### Hugging Face Model

No surprises here; that's the model we loaded into our pipeline:

In [None]:
classifier.model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

But we can also take a closer look at its configuration:

In [None]:
classifier.model.config

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.44.0",
  "vocab_size": 30522
}

It tells us the architecture, dimensions, dropout probabilities, the task used to fine-tune it (sentiment analysis on the SST-2 dataset), the associated labels, and more.
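
The id2label mapping is what turns the model's raw outputs into the labels returned by the pipeline. The sketch below illustrates the idea in plain Python, assuming the scores are obtained by applying a softmax to the two logits. The logits are hypothetical values, not actual outputs of the model, and `to_prediction` is a helper made up for this example.

```python
import math

# Mirrors the id2label entry in the configuration above
id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}

def softmax(logits):
    """Turn raw scores into probabilities that add up to one."""
    largest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits):
    """Pick the most likely class and report its label and probability."""
    probs = softmax(logits)
    best = probs.index(max(probs))
    return {'label': id2label[best], 'score': probs[best]}

hypothetical_logits = [-4.0, 4.2]  # made-up logits for a clearly positive sentence
prediction = to_prediction(hypothetical_logits)
print(prediction)
```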

Just like with the tokenizer, we can load a pretrained model from Hugging Face using the corresponding model name and the AutoModel class from Transformers. Instead of loading the distilled BERT for sequence classification, let's load its plain, encoder-only version:

In [None]:
from transformers import AutoModel
headless_model = AutoModel.from_pretrained('distilbert-base-uncased')

In [None]:
headless_model

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Li

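See? It has the same architecture as the distilbert part of our former DistilBertForSequenceClassification, only without the classification layers on top (and with weights from the base checkpoint rather than the fine-tuned one). It is a headless model, and it can be used to produce contextual embeddings: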

In [None]:
import torch
headless_model.eval()

with torch.inference_mode():
    # Pass the attention mask too, so the padding in the second sentence is ignored
    output = headless_model(**tokenized_output)

output['last_hidden_state'].shape

torch.Size([2, 9, 768])

Two sentences, nine positions each (the second sentence was padded to match the first), each position represented by a vector of 768 numerical values: contextual embeddings by DistilBERT instead of RoBERTa.

## 8.7 Generative Models

Generative language models such as GPT are decoder-based models. They're used to predict the next word in a sequence of words, thus generating text, a task often referred to as causal language modeling. The most popular of all generative models is the Generative Pretrained Transformer, or GPT for short, developed by OpenAI and, at the time of writing, in its fourth generation (GPT-4).

In this section, we'll briefly use GPT-2 to illustrate a generative pipeline in Hugging Face. If you have already tried its newer versions (GPT-3, ChatGPT, or GPT-4) directly from OpenAI, what follows is going to be, unfortunately, quite underwhelming, but nonetheless useful to give you a glimpse of the inner workings of such models.

First, let's load a pretrained GPT-2 using both AutoModelForCausalLM and AutoTokenizer:

In [None]:
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')

Then, let's start with a very simple and straightforward sentence: "Hello, how are you..."

In [None]:
sentence = "Hello, how are you"

You probably completed the sentence in your head: "doing". Right? Let's see if the model does the same. First, though, we need to tokenize the sentence to make it an appropriate input for the model:

In [None]:
tokenized = tokenizer(sentence, return_tensors="pt")
tokenized

{'input_ids': tensor([[15496,    11,   703,   389,   345]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

The model will produce a LOT of logits: we have a mini-batch of one sentence, the sentence having five tokens ("Hello", ",", "how", "are", and "you"), and 50,257 logits for each token, one logit for each token in the vocabulary. At each position, the model scores every token it knows; once converted to probabilities, these scores tell us how likely each token is to follow the sequence observed so far.

In [None]:
outputs = model(**tokenized)
outputs['logits'].shape

torch.Size([1, 5, 50257])

Let's apply softmax to the last dimension to get the probabilities, and then take the most likely words chosen by GPT-2:

In [None]:
probabilities = torch.nn.functional.softmax(outputs['logits'][0], dim=1)
values, indices = torch.topk(probabilities, 1)
values, indices

(tensor([[0.0960],
         [0.1005],
         [0.0908],
         [0.6630],
         [0.2651]], grad_fn=<TopkBackward0>),
 tensor([[  11],
         [ 314],
         [ 546],
         [ 345],
         [1804]]))

Each row corresponds to one position in the input. After our first token, "Hello", the most likely token to follow is the one with index 11; after the first two tokens, it is the one with index 314; and so on. Let's decode all of them:

In [None]:
predictions = tokenizer.decode(indices[:, 0])
predictions

', I about you doing'

What does this mean? It means that, as the model receives more words in a sequence, it adjusts its predictions.

In [None]:
# Decode each token individually so we can rebuild the sequence piece by piece
tokens = [tokenizer.decode(t) for t in tokenized['input_ids'][0]]
predicted_tokens = [tokenizer.decode(t) for t in indices[:, 0]]

for i, p in enumerate(predicted_tokens):
    # GPT-2 tokens carry their own leading space, so we join them without a separator
    print(f"{i+1}. Tokens so far: {''.join(tokens[:i+1])}\n   Predicted token to follow: {p.strip()}")

1. Tokens so far: Hello
   Predicted token to follow: ,
2. Tokens so far: Hello,
   Predicted token to follow: I
3. Tokens so far: Hello, how
   Predicted token to follow: about
4. Tokens so far: Hello, how are
   Predicted token to follow: you
5. Tokens so far: Hello, how are you
   Predicted token to follow: doing

"Hello, how are you doing", says GPT-2. Mid-sentence, it got the first (",") and fourth ("you") predictions right (as the most likely word, that is). A few years ago, that would have been pretty cool, and people would have been excited about it. Nowadays, the astounding performance of ChatGPT and GPT-4 makes the example above look like child's play.

Still, the idea of "predicting the most likely words that follow" is at the base of every generative language model, and that's what's being illustrated here. In the final chapter, we'll revisit and explore these concepts further.
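
To make the "predict the most likely next word, append it, and repeat" loop concrete, here is a toy sketch in plain Python. Everything in it is hypothetical: the score table is made up, and it looks only at the last token, whereas a real model such as GPT-2 conditions on the whole sequence and scores every token in its vocabulary.

```python
# Made-up scores for the next token, keyed by the last token only.
# A real generative model would compute these scores from the whole prefix.
toy_next_token_scores = {
    'Hello': {',': 2.0, '!': 1.0},
    ',': {' how': 1.5, ' I': 1.2},
    ' how': {' are': 2.5, ' about': 1.0},
    ' are': {' you': 3.0, ' we': 0.5},
    ' you': {' doing': 1.8, '?': 1.1},
}

def greedy_generate(prompt_tokens, max_new_tokens=5):
    """Repeatedly append the highest-scoring next token (greedy choice)."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores.get(tokens[-1])
        if scores is None:  # the toy table has no entry for this token: stop
            break
        tokens.append(max(scores, key=scores.get))
    return ''.join(tokens)

generated = greedy_generate(['Hello'])
print(generated)
```

Always picking the single most likely token is only one possible strategy; it is used here because it is the simplest one to follow.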