<a href="https://colab.research.google.com/github/tiennguyen2310/NLP/blob/main/How_Pipeline_Works.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Preprocessing - Tokenizer Step**

This pre-processing step initializes a tokenizer that can:

*   Split input into words/subwords (tokens)
*   Map each token to an integer
*   Add additional inputs that can be useful to the model

**Example of tokenization:**

`"Today the weather is amazing!"`

-> Tokens: `["Today", "the", "weather", "is", "amazing", "!"]`

-> With special tokens: `["[CLS]", "Today", "the", "weather", "is", "amazing", "!", "[SEP]"]`
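
To make the three steps concrete, here is a minimal sketch of a toy tokenizer in plain Python. The vocabulary, the ids, and the helper names `toy_tokenize` and `toy_encode` are hypothetical and exist only for this illustration; a real tokenizer loads its vocabulary from files such as `vocab.txt` and usually splits rare words into subwords. The sketch lowercases the text, which is an assumption that mirrors what "uncased" checkpoints do.

```python
import re

# Hypothetical toy vocabulary, invented for this illustration only.
SPECIAL = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]
WORDS = ["today", "the", "weather", "is", "amazing", "!"]
vocab = {token: idx for idx, token in enumerate(SPECIAL + WORDS)}

def toy_tokenize(text):
    # Lowercase, then split into words and standalone punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(text):
    # Wrap the tokens in the special start and end markers.
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Map each token to an integer; unknown tokens fall back to [UNK].
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return tokens, ids

tokens, ids = toy_encode("Today the weather is amazing!")
print(tokens)  # ['[CLS]', 'today', 'the', 'weather', 'is', 'amazing', '!', '[SEP]']
print(ids)     # [2, 4, 5, 6, 7, 8, 9, 3]
```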

In [1]:
from transformers import AutoTokenizer

# Default checkpoint for sentiment analysis
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

# from_pretrained: downloads (and caches) the tokenizer files for this checkpoint
# and returns a ready-to-use tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


Now we pass in 2 sentences and see what the tokenizer does with them.

**Note:**
- `padding = True`: The 2 sentences have different lengths, so the shorter one is padded to the length of the longer one. This lets both be stacked into a single rectangular tensor.
- `truncation = True`: Any input longer than the maximum length the model can handle is cut down to that length.
- `return_tensors = "pt"`: The tokenizer returns `PyTorch (pt)` tensors.

In [2]:
raw_inputs = [
    "I love the environment of this place so much.",
    "I hate my haircut!",
]

inputs = tokenizer(raw_inputs, padding = True, truncation = True, return_tensors = "pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  2293,  1996,  4044,  1997,  2023,  2173,  2061,  2172,
          1012,   102],
        [  101,  1045,  5223,  2026,  2606, 12690,   999,   102,     0,     0,
             0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}


The result is a dictionary with **2 keys**, each holding a tensor:
- `input_ids`: the token ids of both sentences; the `0s` mark where padding was applied.
- `attention_mask`: `1s` mark real tokens and `0s` mark padding, so that the model ***does not pay attention*** to the padded positions.
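
The padding and masking described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the helper `pad_and_truncate` and its `max_length` parameter are hypothetical names, and the token ids are copied from the printed output above.

```python
PAD_ID = 0  # the id used for padding in the output above

def pad_and_truncate(batch, max_length=None):
    # Truncation: cut every sequence down to max_length, if one is given.
    # Simplification: this naive slice can drop the final [SEP] token,
    # which a real tokenizer keeps at the end.
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]
    # Padding: extend every sequence to the length of the longest one.
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in batch]
    # Mask: 1 for real tokens, 0 for padded positions.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask

batch = [
    [101, 1045, 2293, 1996, 4044, 1997, 2023, 2173, 2061, 2172, 1012, 102],
    [101, 1045, 5223, 2026, 2606, 12690, 999, 102],
]
input_ids, attention_mask = pad_and_truncate(batch)
print(input_ids[1])       # [101, 1045, 5223, 2026, 2606, 12690, 999, 102, 0, 0, 0, 0]
print(attention_mask[1])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```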

# **The Model**

In [3]:
from transformers import AutoModel

# Same checkpoint as the tokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

# Instantiate the base model (AutoModel loads it without the classification head)
model = AutoModel.from_pretrained(checkpoint)


Now, let's pass the inputs we pre-processed to the model.

In [4]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 12, 768])


The output, `last_hidden_state`, is a 3D tensor with the following dimensions:
1. `Batch size: 2` (the number of sequences processed at once)
2. `Sequence length: 12` (the number of tokens in each sequence, including padding positions)
3. `Hidden size: 768` (the dimension of the vector the model produces for each token)

## **The Model with a *Sequence Classification* head**

**Def:** A head added on top of the base model that classifies each sentence as Positive/Negative

In [5]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [6]:
print(outputs.logits.shape)

torch.Size([2, 2])


**Explanation:** `[2, 2]`

- `2 sentences` of the raw inputs
- `2 labels` (Positive/Negative)
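
To see how hidden states of shape `[2, 12, 768]` can turn into logits of shape `[2, 2]`, here is a rough sketch in plain Python. It assumes a head that takes the vector at the first (`[CLS]`) position of each sequence and projects it to one score per label. The sizes are shrunk, and the values of `hidden_states`, `weights`, and `bias` are made up for illustration; the real head is larger, has extra layers, and uses learned parameters.

```python
# Toy sizes standing in for the real ones (batch 2, sequence length 12, hidden size 768).
BATCH, SEQ_LEN, HIDDEN, NUM_LABELS = 2, 3, 4, 2

# Hypothetical hidden states: one HIDDEN-sized vector per token of each sequence.
hidden_states = [
    [[0.1 * (b + t + h) for h in range(HIDDEN)] for t in range(SEQ_LEN)]
    for b in range(BATCH)
]

# Hypothetical head parameters: one weight row and one bias per label.
weights = [[0.5, -0.5, 0.25, 0.0], [-0.5, 0.5, 0.0, 0.25]]
bias = [0.0, 0.1]

def classify(sequence):
    # Use only the vector at the first position of the sequence.
    first_token = sequence[0]
    # Linear projection: one dot product plus bias per label.
    return [sum(w * x for w, x in zip(row, first_token)) + b
            for row, b in zip(weights, bias)]

logits = [classify(sequence) for sequence in hidden_states]
print(len(logits), len(logits[0]))  # 2 2
```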

# **Postprocessing the Output**

In [7]:
print(outputs.logits)

tensor([[-4.2248,  4.5701],
        [ 4.5276, -3.6426]], grad_fn=<AddmmBackward0>)


These numbers are **hard to interpret** on their own, because they are **logits** - `unnormalized scores`, not probabilities.

**Solution:** Pass them through a ***SoftMax layer***, as below.

In [8]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim = -1)
print(predictions)

tensor([[1.5149e-04, 9.9985e-01],
        [9.9972e-01, 2.8289e-04]], grad_fn=<SoftmaxBackward0>)
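
For intuition, softmax can also be written by hand. The sketch below is a plain-Python version (not the PyTorch implementation) applied to the logits printed earlier: it exponentiates each score and divides by the row total, so every row becomes a set of positive values that sum to 1.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the printed output above.
logits = [[-4.2248, 4.5701], [4.5276, -3.6426]]
probs = [softmax(row) for row in logits]

for row in probs:
    print([f"{p:.4e}" for p in row], "sum =", round(sum(row), 6))
```

Up to rounding, the values should match the tensor printed by `torch.nn.functional.softmax`.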


Now the numbers make sense:
- For the first sentence, we have `[1.5149e-04, 9.9985e-01]`
- The second sentence gives us `[9.9972e-01, 2.8289e-04]`

The values for each sentence lie between 0 and 1 and add up to 1, so they can be read as probabilities. But which label does each position correspond to? We can check using `id2label`.

In [9]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

- **First sentence** - `I love the environment of this place so much.` - is **POSITIVE** with a probability of `9.9985e-01`.

- **Second one** - `I hate my haircut!` - is **NEGATIVE** with a probability of `9.9972e-01`.
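
The lookup done by eye above can be sketched in a few lines of plain Python. The helper name `label_predictions` is hypothetical, and the probabilities are copied from the softmax output; the idea is simply to pick the highest-probability position in each row and translate it with `id2label`.

```python
# Mapping and probabilities copied from the outputs above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [
    [1.5149e-04, 9.9985e-01],
    [9.9972e-01, 2.8289e-04],
]

def label_predictions(rows):
    results = []
    for row in rows:
        # Index of the largest probability in this row.
        best = max(range(len(row)), key=lambda i: row[i])
        results.append({"label": id2label[best], "score": row[best]})
    return results

labeled = label_predictions(probabilities)
print(labeled)
# [{'label': 'POSITIVE', 'score': 0.99985}, {'label': 'NEGATIVE', 'score': 0.99972}]
```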

Now the results make full sense! As a last step, we can check whether the `sentiment-analysis` pipeline gives the same result.

In [10]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier(
    ["I love the environment of this place so much.",
     "I hate my haircut!"])

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.




[{'label': 'POSITIVE', 'score': 0.999848484992981},
 {'label': 'NEGATIVE', 'score': 0.9997170567512512}]