![image](en_chapter2_full_nlp_pipeline.svg)

## Preprocessing with a tokenizer
* Splits the input into tokens (words, subwords, or symbols)
* Maps each token to an integer ID
* Adds additional inputs that may be useful to the model
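
The three steps above can be sketched with a toy tokenizer. This is only an illustration: `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are made-up names, the vocabulary is invented, and splitting on whitespace is a simplification (the real tokenizer uses a learned subword vocabulary).

```python
# Toy illustration only: hypothetical names and a made-up vocabulary.
# Real tokenizers use a learned vocabulary and split rare words into subwords.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "love": 5, "pizza": 6, "!": 7}

def toy_tokenize(text):
    """Step 1: split the input into tokens (lowercased words and '!')."""
    return text.lower().replace("!", " !").split()

def toy_encode(text):
    """Steps 2 and 3: map tokens to integers and add extra inputs."""
    # Wrap the sentence in start/end markers, as many BERT-style tokenizers do.
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Unknown words fall back to the [UNK] ID.
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # The attention mask is an additional input: 1 marks a real token.
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

print(toy_tokenize("I love pizza!"))  # ['i', 'love', 'pizza', '!']
print(toy_encode("I love pizza!"))
```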

<code>AutoTokenizer</code> and its <code>from_pretrained</code> method fetch the tokenizer's data for a given checkpoint and cache it
* Here, the tokenizer output consists of two keys: <code>input_ids</code> and <code>attention_mask</code>

In [1]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [2]:
raw_inputs = ["I've been waiting for a delicious pizza my whole life",
              "Italians dislike pineapples on pizza!"]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 12090, 10733,
          2026,  2878,  2166,   102],
        [  101, 16773, 18959,  7222, 23804,  2015,  2006, 10733,   999,   102,
             0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}
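
The padding and attention mask in the output above can be reproduced by hand. The sketch below is plain Python, not the library's implementation: `pad_batch` is a hypothetical helper, and it assumes a padding ID of 0 and right-side padding, which is what the printed tensors show.

```python
# Token IDs copied from the printed output above, without the padding zeros.
seq_a = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 12090, 10733,
         2026, 2878, 2166, 102]
seq_b = [101, 16773, 18959, 7222, 23804, 2015, 2006, 10733, 999, 102]

def pad_batch(sequences, pad_id=0):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([seq_a, seq_b])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The second sentence is four tokens shorter than the first, so it receives four padding IDs and four zeros in its mask.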


## Model
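* The hidden-state tensor returned by the model generally has 3 dimensions:
    *  **Batch size**: number of sequences processed at a time (2 here)
    *  **Sequence length**: length of the numerical representation of each sequence (14 here)
    *  **Hidden size**: dimension of the vector computed for each token (768 here)

* The outputs of Transformers models behave like namedtuples or dictionaries: elements can be accessed by attribute, by key, or by index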
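
To make the "namedtuple or dictionary" behavior concrete, here is a toy stand-in written in plain Python. `ToyModelOutput` is a hypothetical class invented for this sketch; the library's real output classes are more elaborate, and this only mimics the three access styles.

```python
class ToyModelOutput(dict):
    """Hypothetical stand-in: a dict that also allows attribute and index access."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so fall back to keys.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __getitem__(self, key):
        # Integer keys act like tuple positions; other keys act like dict keys.
        if isinstance(key, int):
            return list(self.values())[key]
        return super().__getitem__(key)

# A tiny fake "hidden state" stands in for the real tensor.
toy_outputs = ToyModelOutput(last_hidden_state=[[0.1, 0.2], [0.3, 0.4]])

print(toy_outputs.last_hidden_state)     # attribute access
print(toy_outputs["last_hidden_state"])  # dictionary-style access
print(toy_outputs[0])                    # tuple-style access
```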

In [5]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 14, 768])


## Model head

In [6]:
from transformers import AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)

torch.Size([2, 2])


## Postprocessing the output
* Transformers models output logits: the raw, unnormalized scores produced by the last layer of the model
* Pass the logits through a softmax to convert them to probabilities

In [7]:
print(outputs.logits)

tensor([[-3.3514,  3.5402],
        [ 2.7452, -2.3027]], grad_fn=<AddmmBackward0>)


In [8]:
import torch
import torch.nn.functional as F

predictions = F.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[0.0010, 0.9990],
        [0.9936, 0.0064]], grad_fn=<SoftmaxBackward0>)
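
The softmax step can be checked by hand with the logits printed above. This is a minimal pure-Python sketch, not what `F.softmax` does internally; the helper name `softmax` and the max-subtraction trick for numerical stability are choices made for this illustration.

```python
import math

def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow.
    shifted = [x - max(row) for x in row]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the printed output above.
logits = [[-3.3514, 3.5402],
          [2.7452, -2.3027]]

probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
# [0.001, 0.999]
# [0.9936, 0.0064]
```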


In [9]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [18]:
print("{} : {}".format(raw_inputs[0],model.config.id2label[torch.argmax(predictions[0]).item()]))
print("{} : {}".format(raw_inputs[1],model.config.id2label[torch.argmax(predictions[1]).item()]))

I've been waiting for a delicious pizza my whole life : POSITIVE
Italians dislike pineapples on pizza! : NEGATIVE
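
The two `print` calls above repeat the same argmax-then-lookup step for each row. A loop-based version might look like the sketch below. It uses plain Python lists with the probabilities and label mapping copied from the earlier outputs, and `label_rows` is a hypothetical helper name.

```python
# Values copied from the outputs printed earlier in this notebook.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
sentences = ["I've been waiting for a delicious pizza my whole life",
             "Italians dislike pineapples on pizza!"]
probabilities = [[0.0010, 0.9990],
                 [0.9936, 0.0064]]

def label_rows(rows, id2label):
    """Return (label, probability) for the highest-scoring class of each row."""
    results = []
    for row in rows:
        # Index of the largest probability, i.e. the argmax of the row.
        best = max(range(len(row)), key=row.__getitem__)
        results.append((id2label[best], row[best]))
    return results

for sentence, (label, score) in zip(sentences, label_rows(probabilities, id2label)):
    print(f"{sentence} : {label} ({score:.4f})")
```

Printing the probability next to the label makes it easier to spot low-confidence predictions.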


In [23]:
raw_inputs2 = ["Some might be appalled by durian's strong smell.",
              "Durian is the best fruit in the world."]

inputs2 = tokenizer(raw_inputs2, padding=True, truncation=True, return_tensors="pt")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs2 = model(**inputs2)
predictions2 = F.softmax(outputs2.logits, dim=-1)
print(predictions2)
print(predictions2.shape)
print("{} : {}".format(raw_inputs2[0],model.config.id2label[torch.argmax(predictions2[0]).item()]))
print("{} : {}".format(raw_inputs2[1],model.config.id2label[torch.argmax(predictions2[1]).item()]))

tensor([[9.7303e-01, 2.6972e-02],
        [1.6480e-04, 9.9984e-01]], grad_fn=<SoftmaxBackward0>)
torch.Size([2, 2])
Some might be appalled by durian's strong smell. : NEGATIVE
Durian is the best fruit in the world. : POSITIVE


In [25]:
raw_inputs2 = ["I take a taxi to the airport happily.",
              "Genocide is a bad thing."]

inputs2 = tokenizer(raw_inputs2, padding=True, truncation=True, return_tensors="pt")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs2 = model(**inputs2)
predictions2 = F.softmax(outputs2.logits, dim=-1)
print(predictions2)
print(predictions2.shape)
print("{} : {}".format(raw_inputs2[0],model.config.id2label[torch.argmax(predictions2[0]).item()]))
print("{} : {}".format(raw_inputs2[1],model.config.id2label[torch.argmax(predictions2[1]).item()]))

tensor([[2.4188e-04, 9.9976e-01],
        [9.9967e-01, 3.3493e-04]], grad_fn=<SoftmaxBackward0>)
torch.Size([2, 2])
I take a taxi to the airport happily. : POSITIVE
Genocide is a bad thing. : NEGATIVE
