In [1]:
from transformers import pipeline

clf = pipeline('sentiment-analysis')
clf('This is terrible')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'NEGATIVE', 'score': 0.9996459484100342}]

## Stages in pipeline

- Tokenizer => Raw Text turned into numbers
- Model => Input from Tokenizer turned into logits
- Postprocessing => Logits turned into predictions

Tokenization happens in several steps.
First, the text is split into tokens: words, parts of words, or punctuation marks.
Then a special [CLS] token is added at the front and a [SEP] token at the end.
Finally, each token is mapped to its numeric ID in the vocabulary the model was trained with.
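
The steps above can be sketched in plain Python. This is only a toy illustration, not the real tokenizer: it splits on whitespace instead of using the model's subword algorithm, and `toy_vocab` and `toy_encode` are hypothetical names. The IDs in `toy_vocab` are copied from the tokenizer output for 'This is good' shown in this notebook.

```python
# Hypothetical mini-vocabulary; the IDs are copied from the real tokenizer
# output for 'This is good' printed in this notebook.
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "this": 2023, "is": 2003, "good": 2204}


def toy_encode(text):
    # Step 1: split the text into tokens (simplified: lowercase + whitespace split)
    tokens = text.lower().split()
    # Step 2: add the special tokens at the front and at the end
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: map every token to its ID in the vocabulary
    ids = [toy_vocab[token] for token in tokens]
    return tokens, ids


tokens, ids = toy_encode("This is good")
print(tokens)  # ['[CLS]', 'this', 'is', 'good', '[SEP]']
print(ids)     # [101, 2023, 2003, 2204, 102]
```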


In [10]:
#Tokenizer

from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'  # default checkpoint of the sentiment-analysis pipeline
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_data = ['This is good', 'I do not like this thing man', 'How ya doing']

# Tokenizer inputs:
#   raw data       => the list of sentences to tokenize
#   padding        => True pads shorter sentences (with 0s here) up to the longest sentence in the batch
#   truncation     => True truncates sentences that are longer than the max length allowed by the model
#   return_tensors => type of tensor to receive back, 'pt' is for PyTorch
inputs = tokenizer(raw_data, padding=True, truncation=True, return_tensors='pt')
print(inputs)

{'input_ids': tensor([[ 101, 2023, 2003, 2204,  102,    0,    0,    0,    0],
        [ 101, 1045, 2079, 2025, 2066, 2023, 2518, 2158,  102],
        [ 101, 2129, 8038, 2725,  102,    0,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0, 0, 0, 0]])}
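
To see where the zeros in `input_ids` and `attention_mask` come from, here is a minimal sketch of padding in plain Python. It is not the tokenizer's implementation: `pad_batch` is a hypothetical helper, and the unpadded ID lists are the rows of the output above with the trailing zeros removed.

```python
# Unpadded token IDs, taken from the output above with the padding removed.
encoded = [
    [101, 2023, 2003, 2204, 102],
    [101, 1045, 2079, 2025, 2066, 2023, 2518, 2158, 102],
    [101, 2129, 8038, 2725, 102],
]

PAD_ID = 0  # padding value seen in the output above


def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(encoded)
for ids_row, mask_row in zip(input_ids, attention_mask):
    print(ids_row, mask_row)
```

The attention mask has the same shape as `input_ids`, so the model can tell real tokens apart from padding.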


In [11]:
#Model
from transformers import AutoModel

model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([3, 9, 768])


The returned tensor has batch size 3 (the number of input sentences), sequence length 9 (the length of the longest tokenized sentence, since the shorter ones are padded to match), and hidden dimension 768.
The model heads take this tensor as input and project it to a different dimension, depending on the task.
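
As a rough sketch of what such a head does, the snippet below projects a random tensor of the same shape down to 2 values per sentence. The layer names `pre_classifier` and `classifier` come from the warning message above. Everything else is an assumption made for illustration: the weights are random, and the head is assumed to read only the hidden state of the first token, so the numbers are meaningless and only the shapes matter.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for outputs.last_hidden_state: (batch, sequence length, hidden dim)
hidden = torch.randn(3, 9, 768)

# Assumption: the head only looks at the first token of each sentence
first_token = hidden[:, 0]              # shape (3, 768)

# Randomly initialised layers, named after the weights listed in the warning above
pre_classifier = nn.Linear(768, 768)
classifier = nn.Linear(768, 2)          # 2 = number of labels

logits = classifier(torch.relu(pre_classifier(first_token)))
print(logits.shape)                     # torch.Size([3, 2])
```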

<img src='pics/ModelHeads.png'>


Depending on the task, there are different architectures available:

- *Model (retrieve the hidden states)
- *ForCausalLM
- *ForMaskedLM
- *ForMultipleChoice
- *ForQuestionAnswering
- *ForSequenceClassification
- *ForTokenClassification

and more

For a text classification task, we can use the *ForSequenceClassification architecture.



In [15]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits)

tensor([[-4.1668,  4.5490],
        [ 2.7439, -2.3426],
        [-2.2669,  2.3813]], grad_fn=<AddmmBackward0>)


The output is a 3x2 tensor of logits: one row per input sentence and one column per label. These logits need to be post-processed to obtain predictions.


In [16]:
#PostProcessing

import torch

pred = torch.nn.functional.softmax(outputs.logits, dim=-1)
pred

tensor([[1.6393e-04, 9.9984e-01],
        [9.9386e-01, 6.1418e-03],
        [9.4882e-03, 9.9051e-01]], grad_fn=<SoftmaxBackward0>)

The logits are now converted into probabilities between 0 and 1, and each row sums to 1.<br>
Let's check the model configuration to see what labels 0 and 1 mean.<br>
In this case, the first sentence is classified as POSITIVE: label 1 (POSITIVE) has a probability of 0.99984, while label 0 (NEGATIVE) has a probability of 0.000164.

In [17]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
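
Putting the post-processing together, here is a minimal sketch in plain Python that turns the logits into labelled predictions similar to what the pipeline returned at the start. The logits and `id2label` are copied from the outputs above. The `softmax` helper is written by hand for illustration and is not the library's implementation.

```python
import math

# Logits and label mapping copied from the cell outputs above
logits = [[-4.1668, 4.5490], [2.7439, -2.3426], [-2.2669, 2.3813]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Turn a list of logits into probabilities that sum to 1."""
    largest = max(row)
    exps = [math.exp(x - largest) for x in row]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


probs = [softmax(row) for row in logits]

predictions = []
for row in probs:
    best = max(range(len(row)), key=row.__getitem__)  # index of the highest probability
    predictions.append({"label": id2label[best], "score": row[best]})

for prediction in predictions:
    print(prediction)
```

The first and third sentences come out as POSITIVE and the second as NEGATIVE, matching the probabilities computed with `torch.nn.functional.softmax` above.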