### This notebook is based on the [course](https://huggingface.co/course/en/chapter2/1) from Hugging Face.

In [1]:
import torch
from transformers import pipeline, AutoModel, AutoTokenizer, AutoModelForSequenceClassification

### Model to be used, task, and raw input

In [2]:
__model__ = "distilbert-base-uncased-finetuned-sst-2-english"  # uncased: text is lowercased first, so "english" == "English"
task = "sentiment-analysis"
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/transformer_and_head.svg"></img> 

### Step-by-step sentiment analysis of the raw inputs
1. Tokenization: raw input to tokens
2. Encoding: tokens to model input (token IDs)
3. Embedding and the rest of the model: model input to logits

In [3]:
tokenizer = AutoTokenizer.from_pretrained(__model__)
model = AutoModelForSequenceClassification.from_pretrained(__model__)

* Tokenization

In [4]:
tokens = [tokenizer.tokenize(i) for i in raw_inputs]
print(tokens)

[['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.'], ['i', 'hate', 'this', 'so', 'much', '!']]
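The split of "huggingface" into `hugging` and `##face` comes from WordPiece, which matches the longest vocabulary entry first and marks continuation pieces with `##`. A minimal sketch of that greedy matching, assuming a tiny made-up vocabulary (`TOY_VOCAB` and `wordpiece` are hypothetical names, not part of `transformers`, and the real tokenizer also handles lowercasing and punctuation):

```python
# Toy vocabulary, invented for illustration; the real one has about 30k entries.
TOY_VOCAB = {"hugging", "##face", "hate", "##s", "[UNK]"}

def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece("huggingface"))  # ['hugging', '##face']
print(wordpiece("hates"))        # ['hate', '##s']
print(wordpiece("xyz"))          # ['[UNK]']
```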


* Encoding

In [5]:
_model_input = [tokenizer.convert_tokens_to_ids(i) for i in tokens] 
print(_model_input)

[[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012], [1045, 5223, 2023, 2061, 2172, 999]]
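Conceptually, `convert_tokens_to_ids` is a dictionary lookup in the tokenizer's vocabulary. A minimal sketch, using a hand-copied slice of the IDs printed above (`TOY_IDS` and `to_ids` are hypothetical names, and the `[UNK]` ID of 100 is an assumption about this vocabulary):

```python
# IDs copied from the output above; "[UNK]": 100 is an assumption.
TOY_IDS = {
    "i": 1045, "hate": 5223, "this": 2023,
    "so": 2061, "much": 2172, "!": 999, "[UNK]": 100,
}

def to_ids(tokens, vocab=TOY_IDS):
    """Map each token to its ID, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

print(to_ids(["i", "hate", "this", "so", "much", "!"]))
# [1045, 5223, 2023, 2061, 2172, 999]
```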


* Tokenization and encoding can be done in one step. Calling the tokenizer directly also adds the special tokens and pads the shorter input.

In [6]:
model_input = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt") 
print(model_input.input_ids.tolist())

[[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0]]
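Compared with the manual encoding, each row now starts with 101, ends its content with 102, and the shorter row is filled with 0. A minimal sketch of that wrapping and padding in plain Python (`add_specials_and_pad` is a hypothetical helper; the three IDs are read off the output above). It also builds the attention mask the tokenizer returns alongside `input_ids`, which tells the model which positions are padding:

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs visible in the output above

def add_specials_and_pad(batch):
    """Wrap each sequence in [CLS]/[SEP], then right-pad to the longest one."""
    wrapped = [[CLS_ID] + ids + [SEP_ID] for ids in batch]
    longest = max(len(ids) for ids in wrapped)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in wrapped]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in wrapped]
    return input_ids, attention_mask

raw_ids = [
    [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
    [1045, 5223, 2023, 2061, 2172, 999],
]
input_ids, attention_mask = add_specials_and_pad(raw_ids)
print(input_ids[1])       # [101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0]
print(attention_mask[1])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```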


* Run the model (the embedding layer is part of the model, so it takes the token IDs directly)

In [7]:
outputs = model(**model_input) 

In [8]:
probs = torch.softmax(outputs.logits, dim=1)  # softmax turns logits into probabilities
scores, indices = torch.max(probs, dim=1)     # highest probability and its class index per input

In [9]:
for idx, score in zip(indices.tolist(), scores.tolist()):
    print(model.config.id2label[idx], score)

POSITIVE 0.9598050713539124
NEGATIVE 0.9994558691978455
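The scores above are softmax probabilities, not raw model outputs. A minimal sketch of the same post-processing in plain Python, assuming made-up logits for a single input and assuming the label mapping `{0: "NEGATIVE", 1: "POSITIVE"}` (the notebook reads the real mapping from `model.config.id2label`):

```python
import math

def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed mapping, for illustration

logits = [-1.5, 1.5]  # hypothetical logits for one input
probs = softmax(logits)
best = max(range(len(probs)), key=lambda i: probs[i])  # argmax
print(ID2LABEL[best], round(probs[best], 4))  # POSITIVE 0.9526
```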


### All in one

In [10]:
pipeline(task, model=__model__)(raw_inputs)

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

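#### The tokenizer adds special tokens
* [CLS]: prepended to every sequence; its final hidden state is used as the sequence representation for classification
* [SEP]: separator, marks the end of a sequence
* [PAD]: padding, fills shorter sequences up to the length of the longest one in the batch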

In [11]:
tokenizer.batch_decode(tokenizer(raw_inputs, 
                                 padding=True,  
                                 return_tensors="pt").input_ids, 
                       skip_special_tokens=False)

["[CLS] i've been waiting for a huggingface course my whole life. [SEP]",
 '[CLS] i hate this so much! [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']

#### Let's play with the tokenizer

In [12]:
_ = tokenizer("a " * 1234 + "a.",  # 1235 words in one sentence
              padding=True,        # pad to the longest input in the batch (no effect for a single input)
              truncation=True,     # cut the input to max_length; without it the tokenizer only warns, and the model fails on inputs over 512 tokens
              return_tensors="pt", # return PyTorch tensors; without this argument, plain Python lists are returned
              max_length=6,        # keep at most 6 tokens, including [CLS] and [SEP]
             )
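With `max_length=6`, two of the six slots go to `[CLS]` and `[SEP]`, so only the first four content tokens survive. A minimal sketch of that truncation in plain Python (`truncate_with_specials` is a hypothetical helper that mirrors the behaviour rather than calling the tokenizer; 1037 and 1012 are the IDs of `a` and `.` seen in the earlier outputs):

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs seen in the earlier outputs

def truncate_with_specials(ids, max_length):
    """Keep as many content IDs as fit once [CLS] and [SEP] are counted."""
    room = max_length - 2  # two slots are reserved for the special tokens
    return [CLS_ID] + ids[:room] + [SEP_ID]

content = [1037] * 1235 + [1012]  # "a" repeated 1235 times, then "."
print(len(content))                        # 1236
print(truncate_with_specials(content, 6))  # [101, 1037, 1037, 1037, 1037, 102]
```

The final `.` is dropped because truncation removes tokens from the right.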