<a href="https://colab.research.google.com/github/ruthgn/HF/blob/main/01_Reproducing_HF_Sentiment_Analysis_Pipeline_Function.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we'll replicate the entire process that happens under the hood when we run some text through Hugging Face's `sentiment-analysis` pipeline.

In [None]:
# Install Transformers and Datasets libraries
!pip install datasets transformers[sentencepiece]


In [None]:
from transformers import pipeline

In [None]:
sentiment_classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



In [None]:
sentiment_classifier(["Wow, what a day.", "Excuse you!", "All alone in this house again"])

[{'label': 'POSITIVE', 'score': 0.9992961883544922},
 {'label': 'NEGATIVE', 'score': 0.9990440011024475},
 {'label': 'POSITIVE', 'score': 0.7301211953163147}]

The pipeline groups together 3 different steps:
1. Preprocessing
2. Passing the inputs through the model
3. Postprocessing

Let's go over each step.

# Preprocessing with a tokenizer

Like other neural networks, Transformer models can't process raw text directly, so the first step we need to take is to convert the text inputs into numbers that the model can understand. To achieve this, we use a **tokenizer**, which will be responsible for:
- Splitting the input into words, subwords, or symbols (like punctuation) that are called **tokens**.
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model
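
The three responsibilities above can be sketched with a toy tokenizer in plain Python. This is only an illustration under stated assumptions: `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical names, the vocabulary holds just a few entries whose IDs are copied from the tokenizer output printed later in this notebook (plus an assumed `[UNK]` entry), and the real tokenizer splits rare words into subwords, which this sketch does not attempt.

```python
import re

# Toy vocabulary (an assumption for illustration). The IDs for the words and
# the [CLS]/[SEP] markers are copied from the tokenizer output printed later
# in this notebook; the [UNK] entry is an assumed fallback for unknown tokens.
TOY_VOCAB = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "wow": 10166, ",": 1010, "what": 2054, "a": 1037, "day": 2154, ".": 1012,
}

def toy_tokenize(text):
    """Step 1: lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(text):
    tokens = toy_tokenize(text)
    # Step 2: map each token to an integer ID (unknown tokens fall back to [UNK]).
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # Step 3: add the extra inputs the model expects, here start and end markers.
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]

print(toy_tokenize("Wow, what a day."))
print(toy_encode("Wow, what a day."))
```

For the first example sentence, the IDs produced by this sketch line up with the first row of `input_ids` printed further down.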

Note that all this preprocessing needs to be done in exactly the same way as when the model was pretrained, so we first need to download that information from the HF Model Hub. To do this, we use the `AutoTokenizer` class and its `from_pretrained()` method. Given the checkpoint name of our model, it will automatically fetch the data associated with the model's tokenizer and cache it (so it's only downloaded the first time you run the code below).

Since the default checkpoint of the `sentiment-analysis` pipeline is `distilbert-base-uncased-finetuned-sst-2-english`, we run the following:

In [None]:
from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Once we have the tokenizer, we can directly pass our sentences to it. We will get back a dictionary that's ready to feed to our model! The only thing left to do is to convert the list of input IDs to tensors.

_Note: Transformer models only accept **tensors** as inputs. To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), use the `return_tensors` argument (if no type is passed, the result will be a list of lists)._

In [None]:
raw_inputs = ["Wow, what a day.", "Excuse you!", "All alone in this house again"]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors='pt')

The main thing to remember here is that you can pass one sentence (string) or a list of sentences, and get tensors in return.

Here's what the results look like as PyTorch tensors:

In [None]:
print(inputs)

{'input_ids': tensor([[  101, 10166,  1010,  2054,  1037,  2154,  1012,   102],
        [  101,  8016,  2017,   999,   102,     0,     0,     0],
        [  101,  2035,  2894,  1999,  2023,  2160,  2153,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])}


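The output is a dictionary containing two keys, `input_ids` and `attention_mask`. `input_ids` contains three rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence. `attention_mask` has the same shape and marks which positions hold real tokens (1) and which hold padding (0).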
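
In the printed tensors above, the second row ends in zeros because that sentence is shorter than the other two and was padded to the same length. A minimal sketch of that padding step in plain Python follows. The helper name `pad_batch` is hypothetical, the unpadded ID lists are read off the printed output, and the pad ID of 0 is taken from the zeros in the second row; the library's own implementation is not shown here.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep their IDs; the tail is filled with the pad ID.
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token IDs for the three example sentences, before padding.
unpadded = [
    [101, 10166, 1010, 2054, 1037, 2154, 1012, 102],
    [101, 8016, 2017, 999, 102],
    [101, 2035, 2894, 1999, 2023, 2160, 2153, 102],
]

batch = pad_batch(unpadded)
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```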

# Going through the model

We can download our pretrained model the same way we did with our tokenizer. HF Transformers provides an `AutoModel` class which also has a `from_pretrained()` method:


In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


This architecture contains only the base Transformer module: given some inputs, it outputs what we'll call _hidden states_, also known as _features_. Essentially, for each model input, we'll retrieve a high-dimensional vector representing the **contextual understanding of that input by the Transformer model**.

While these hidden states can be useful on their own, they're usually inputs to another part of the model, known as the _head_. (The different tasks shown in Chapter 1 of the HF Course could all be performed with the same architecture, but each task has a different head associated with it.)

## A high-dimensional vector?

The vector output by the Transformer module is usually large. It generally has three dimensions:
- Batch size: The number of sequences processed at a time (3 in our example).
- Sequence length: The length of the numerical representation of the sequence (8 in our example).
- Hidden size: The vector dimension of each model input.

It is said to be "high dimensional" because of the last value. The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).

We can see this if we feed the inputs we preprocessed to our model:



In [None]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([3, 8, 768])


## Model heads: Making sense out of numbers

The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers. There are many different architectures available in Transformers, each designed around tackling a specific task. For our example, **we will need a model with a sequence classification head** (to be able to classify the sentences as positive or negative). So, we won't actually use the `AutoModel` class, but `AutoModelForSequenceClassification`:

In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

Now if we look at the shape of our outputs, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label):

In [None]:
print(outputs.logits.shape)

torch.Size([3, 2])


Since we have three sentences and two labels (positive vs. negative sentiment), the result we get from our model has shape 3 x 2.
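
To make the projection concrete, here is a rough sketch of a head reduced to a single linear layer, written in plain Python. Everything in it is an assumption for illustration: the `linear` helper, the toy hidden size of 4 (instead of 768), one hidden vector per sentence, and the random weights. The warning printed earlier lists both `pre_classifier` and `classifier` weights, so the real head of this checkpoint has more than one layer.

```python
import random

def linear(vector, weights, bias):
    """Apply y = W x + b, with `weights` given as one row per output value."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

random.seed(0)
batch_size, hidden_size, num_labels = 3, 4, 2  # toy sizes, not the real 768

# One hypothetical hidden vector per sentence.
hidden = [[random.uniform(-1, 1) for _ in range(hidden_size)]
          for _ in range(batch_size)]
# Random stand-ins for the learned head parameters.
weights = [[random.uniform(-1, 1) for _ in range(hidden_size)]
           for _ in range(num_labels)]
bias = [0.0] * num_labels

# Each sentence vector of length hidden_size becomes num_labels raw scores.
logits = [linear(vec, weights, bias) for vec in hidden]
print(len(logits), len(logits[0]))
```

The printed shape is `3 2`: whatever the hidden size, the head leaves one row per sentence and one score per label.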

# Postprocessing the output

The values we get as output from our model don't necessarily make sense by themselves. Let's take a look:

In [None]:
print(outputs.logits)

tensor([[-3.4878,  3.7706],
        [ 3.8694, -3.0824],
        [-0.3870,  0.6082]], grad_fn=<AddmmBackward0>)


Our model predicted [-3.4878, 3.7706] for the first sentence ("Wow, what a day."), [3.8694, -3.0824] for the second sentence ("Excuse you!"), and [-0.3870, 0.6082] for the third sentence ("All alone in this house again").

Those are not probabilities but _logits_, the raw, unnormalized scores output by the last layer of the model. To be converted to probabilities, they need to go through a **SoftMax layer**. (All HF Transformers models output the logits, because the loss function used for training generally fuses the last activation function, such as SoftMax, with the actual loss function, such as cross entropy.)

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[7.0375e-04, 9.9930e-01],
        [9.9904e-01, 9.5594e-04],
        [2.6988e-01, 7.3012e-01]], grad_fn=<SoftmaxBackward0>)


Now we can see that the model predicted [0.0007, 0.9993] for the first sentence ("Wow, what a day."), [0.9990, 0.0010] for the second sentence ("Excuse you!"), and [0.2699, 0.7301] for the third sentence ("All alone in this house again"). These are recognizable probability scores.
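
To see what the softmax step does numerically, it can be written out by hand. The sketch below is a plain-Python illustration, not the PyTorch implementation: it exponentiates each logit and divides by the row sum, subtracting the row maximum first (a common numerical-stability trick that does not change the result). The logits are copied from the output printed above.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Logits printed by the model for the three example sentences.
logits = [[-3.4878, 3.7706], [3.8694, -3.0824], [-0.3870, 0.6082]]

for row in logits:
    print([round(p, 4) for p in softmax(row)])
```

Up to rounding, the values agree with the tensor returned by `torch.nn.functional.softmax`.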

To get the labels corresponding to each position, we can inspect the `id2label` attribute of the model config:

In [None]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
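
With this mapping, the last postprocessing step is to pick the most probable position in each row and look up its label. The sketch below shows one way to do that in plain Python. The function name `to_pipeline_format` is hypothetical, the probabilities are copied from the softmax output above, and the pipeline's own postprocessing code is not shown in this notebook, so treat this as an approximation of its behavior.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output above.
probabilities = [
    [7.0375e-04, 9.9930e-01],
    [9.9904e-01, 9.5594e-04],
    [2.6988e-01, 7.3012e-01],
]

def to_pipeline_format(probs, id2label):
    """Keep only the top label and its score for each row of probabilities."""
    results = []
    for row in probs:
        # Index of the largest probability in this row.
        best = max(range(len(row)), key=row.__getitem__)
        results.append({"label": id2label[best], "score": row[best]})
    return results

for result in to_pipeline_format(probabilities, id2label):
    print(result)
```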

In conclusion:
- First sentence: NEGATIVE: 0.0007, POSITIVE: 0.9993
- Second sentence: NEGATIVE: 0.9990, POSITIVE: 0.0010
- Third sentence: NEGATIVE: 0.2699, POSITIVE: 0.7301

(all of which should match the output of the `sentiment-analysis` pipeline below):

In [None]:
sentiment_classifier(raw_inputs)

[{'label': 'POSITIVE', 'score': 0.9992961883544922},
 {'label': 'NEGATIVE', 'score': 0.9990440011024475},
 {'label': 'POSITIVE', 'score': 0.7301211953163147}]

We have successfully reproduced the three steps of the pipeline: preprocessing with tokenizers, passing the inputs through the model, and postprocessing!