# What does the pipeline do?

In [None]:
from transformers import pipeline
import torch

Preprocessing with a tokenizer

In [None]:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

Why can `tokenizer` accept arguments?
`tokenizer` is a callable object. In Python, a class can define a special method called `__call__`; instances of such a class can then be called like a function.
`AutoTokenizer` (or, more precisely, the tokenizer class that `AutoTokenizer.from_pretrained` returns) defines a `__call__` method, which lets you pass arguments directly to the tokenizer as if it were a function.
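
A minimal sketch of the idea, using a hypothetical `ToyTokenizer` class (not part of Transformers) whose `__call__` method simply lowercases and splits text on whitespace:

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer; not the Transformers implementation."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase

    def __call__(self, text, max_length=None):
        # Runs whenever the instance is used like a function.
        if self.lowercase:
            text = text.lower()
        tokens = text.split()
        return tokens[:max_length] if max_length is not None else tokens


toy_tokenizer = ToyTokenizer()
# The two calls below are equivalent: calling the instance invokes __call__.
print(toy_tokenizer("I hate this so much!", max_length=3))
print(toy_tokenizer.__call__("I hate this so much!", max_length=3))
```

The real tokenizer does far more work inside `__call__` (subword splitting, padding, truncation, tensor conversion), but the calling mechanism is the same.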

The following model contains only the base Transformer module: given some inputs, it outputs what we’ll call hidden states, also known as features. 

For each model input, we’ll retrieve a high-dimensional vector representing the contextual understanding of that input by the Transformer model.

A high-dimensional vector?
The vector output by the Transformer module is usually large. It generally has three dimensions:

1. Batch size: The **number of sequences** processed at a time (2 in our example).

2. Sequence length: The length of the numerical representation of the sequence (16 in our example).

3. Hidden size: The dimension of the vector that represents each token (768 in our example), i.e., the number of features or components in each hidden-state vector.
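
To make the three dimensions concrete, here is a sketch that builds a plain nested list with the same shape as the hidden states in our example. The zeros are placeholders, not real model output.

```python
batch_size, sequence_length, hidden_size = 2, 16, 768

# One vector of `hidden_size` features per token, for every sequence in the batch.
hidden_states = [
    [[0.0] * hidden_size for _ in range(sequence_length)]
    for _ in range(batch_size)
]

print(len(hidden_states))        # batch size
print(len(hidden_states[0]))     # sequence length
print(len(hidden_states[0][0]))  # hidden size
```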

In [None]:
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
print(type(outputs['last_hidden_state']))

The following model adds a sequence classification head on top of the base model, so that it can classify the sentences as positive or negative. For this we don’t use the `AutoModel` class but `AutoModelForSequenceClassification`:

In [None]:
from transformers import AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)
print(outputs.logits)

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

In [None]:
model.config.id2label

The model predicted [0.0402 negative, 0.9598 positive] for the first sentence and [0.9995 negative, 0.0005 positive] for the second one. The order of the labels comes from `model.config.id2label`.
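
To see what the softmax step does, here is a sketch in plain Python. The logits are illustrative values chosen to be close to what this checkpoint produces, not output copied from a run, and the `id2label` dictionary is assumed to mirror `model.config.id2label`.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability; the probabilities are unchanged.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label mapping and illustrative logits (one row per sentence).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
example_logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]

for row in example_logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)
    print([round(p, 4) for p in probs], "->", id2label[best])
```

Logits are unnormalized scores, so they only become comparable probabilities after the softmax; each row then sums to 1.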

# Model
## AutoModel
- The `AutoModel` class is handy when you want to instantiate any model from a checkpoint.
- It is a wrapper over the model classes available in the library.
- It automatically guesses the appropriate model architecture for your checkpoint.

Randomly initializing BERT: a model built from a config alone has random weights and needs to be trained before it is useful.

In [None]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

Loading a Transformer model that is already trained is simple: we can do this using the `from_pretrained()` method.


In [None]:
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-cased")

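`save_pretrained()` saves two files:

- `config.json`: the attributes needed to build the model architecture
- `pytorch_model.bin`: the model weights (recent versions of Transformers may write `model.safetensors` instead)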

In [None]:
import os

# pathlib.Path(__file__) and inspect.getfile(lambda: None) do not work here:
# in a Jupyter notebook they do not point to the notebook's directory (the latter
# returns a path tied to the Jupyter environment), so use the working directory.
cur_path = os.getcwd()
print(cur_path)
model.save_pretrained(cur_path)

# Tokenizer

Loading the BERT tokenizer trained with the same checkpoint as BERT works the same way as loading the model, except that we use the `BertTokenizer` class:

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Similar to `AutoModel`, the `AutoTokenizer` class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
tokenizer("Using a Transformer network is simple")

In [None]:
tokenizer.save_pretrained(cur_path)

Tokenization

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

Encoding: from tokens to input IDs

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

Decoding

In [None]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

# Handling Multiple Sequences

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)

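Why does this raise `IndexError: too many indices for tensor of dimension 1`?
Transformers models expect multiple sequences by default, so the input must be a 2D tensor of shape (batch size, sequence length), but here we passed a single 1D sequence. The tokenizer doesn’t just convert the list of input IDs into a tensor; it also adds a dimension on top of it: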

In [None]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)  # ['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.']
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])  #add a dimension
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

**Batching** is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can just build a batch with a single sequence.
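
The only difference between a single sequence and a batch is the extra outer dimension. A sketch with plain nested lists, using a hypothetical `shape` helper and made-up token IDs:

```python
def shape(nested):
    # Hypothetical helper: report the dimensions of a rectangular nested list.
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)


single_ids = [7, 8, 9, 10]               # made-up token IDs for one sequence
batch_of_one = [single_ids]              # wrap in a list to add the batch dimension
batch_of_two = [single_ids, single_ids]  # the same sequence twice

print(shape(single_ids))    # 1 dimension: (sequence_length,)
print(shape(batch_of_one))  # 2 dimensions: (batch_size, sequence_length)
print(shape(batch_of_two))
```

Passing the 2D version to `torch.tensor` gives the model the layout it expects.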

In [None]:
batched_ids = [ids, ids]
batched_ids=torch.tensor(batched_ids)
output = model(batched_ids)
print("Logits:", output.logits)

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

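Why are the logits for the second sequence inconsistent?
In the batched run, the second row's logits differ from those obtained when the same sequence is passed on its own, because the attention layers also attend to the padding token. Use an attention mask to make the model ignore the padding tokens.
Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to.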
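
A sketch of how padded IDs and the matching attention mask can be built for sequences of different lengths. The helper name `pad_batch` and the default pad ID `0` are assumptions for illustration; in practice use `tokenizer.pad_token_id`, or let the tokenizer pad for you with `padding=True`.

```python
def pad_batch(sequences, pad_token_id=0):
    # Pad every sequence to the length of the longest one and build an
    # attention mask with 1 for real tokens and 0 for padding.
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_token_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


padded_ids, padding_mask = pad_batch([[200, 200, 200], [200, 200]])
print(padded_ids)    # [[200, 200, 200], [200, 200, 0]]
print(padding_mask)  # [[1, 1, 1], [1, 1, 0]]
```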

In [None]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

In [None]:
sequence3 = "I hate this so much!"
sequence4 = "I have been waiting for a HuggingFace course my whole life."

sequence3_tokens = tokenizer.tokenize(sequence3)
sequence4_tokens = tokenizer.tokenize(sequence4)
sequence3_ids = tokenizer.convert_tokens_to_ids(sequence3_tokens)
print("Length of sequence3_ids is:", len(sequence3_ids))
sequence4_ids = tokenizer.convert_tokens_to_ids(sequence4_tokens)
print("Length of sequence4_ids is:", len(sequence4_ids))

# Run each sequence on its own, as a batch of one
sequence3_output = model(torch.tensor([sequence3_ids]))
print("sequence3's Logits:", sequence3_output.logits)
sequence4_output = model(torch.tensor([sequence4_ids]))
print("sequence4's Logits:", sequence4_output.logits)

In [None]:
# Pad sequence3_ids to the length of sequence4_ids
print("length of sequence3_ids", len(sequence3_ids))
pad_length = len(sequence4_ids) - len(sequence3_ids)
padding_sequence3_ids = sequence3_ids + [tokenizer.pad_token_id] * pad_length
batch34_ids = torch.tensor([padding_sequence3_ids, sequence4_ids])

# Without an attention mask, the padding tokens change the first row's logits
batch34_output_noAttentionMask = model(batch34_ids)
print("batch34_output_noAttentionMask's logits are", batch34_output_noAttentionMask.logits)

# Attention mask: 1 for real tokens, 0 for padding
attention_mask = [
    [1] * len(sequence3_ids) + [0] * pad_length,  # e.g. [1]*6 + [0]*7 when the lengths are 6 and 13
    [1] * len(sequence4_ids),  # all 1s: the longest sequence has no padding
]
print("attention mask is: ", attention_mask)
batch34_output_withAttentionMask = model(batch34_ids, attention_mask=torch.tensor(attention_mask))
print("batch34_output_withAttentionMask's logits are", batch34_output_withAttentionMask.logits)


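Longer sequences

Most Transformer models have a limit on the length of the sequences they can handle (512 or 1024 tokens for most models). There are two options:

- Use a model that supports longer sequences, such as Longformer or LED.
- Truncate the sequences.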
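
Truncation applies to the token IDs rather than to the raw string. Below is a sketch with made-up IDs and an assumed limit of 8 tokens. The helper `truncate_ids` is hypothetical; it optionally keeps a closing special token in the last slot, which is roughly what the tokenizer's `truncation=True` option takes care of for you.

```python
def truncate_ids(ids, max_length, sep_token_id=None):
    # Keep at most max_length IDs. If a closing special token ID is given,
    # reserve the last slot for it so the sequence still ends correctly.
    if len(ids) <= max_length:
        return ids
    if sep_token_id is None:
        return ids[:max_length]
    return ids[: max_length - 1] + [sep_token_id]


made_up_ids = [101, 11, 12, 13, 14, 15, 16, 17, 18, 19, 102]
print(truncate_ids(made_up_ids, max_length=8))
print(truncate_ids(made_up_ids, max_length=8, sep_token_id=102))
```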

In [None]:
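max_sequence_length = tokenizer.model_max_length  # typically 512 for BERT or DistilBERT
ids = ids[:max_sequence_length]  # truncate the token IDs, not the raw string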

# Put it all together

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequence)

In [None]:
# Padding the sequences
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

In [None]:
# Truncating the sequences
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

In [None]:
# Choosing the type of tensors returned
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
# Returns TensorFlow tensors (requires TensorFlow to be installed)
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

## Special tokens
The tokenizer added the special word [CLS] at the beginning and the special word [SEP] at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well. Note that some models don’t add special words, or add different ones; models may also add these special words only at the beginning, or only at the end. In any case, the tokenizer knows which ones are expected and will deal with this for you.
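
What the tokenizer does with special tokens can be sketched by hand. The IDs 101 and 102 are assumed to be `[CLS]` and `[SEP]` (as in the uncased BERT vocabulary), and the other IDs are made up; the reliable way to get the real values is `tokenizer.cls_token_id` and `tokenizer.sep_token_id`.

```python
CLS_ID, SEP_ID = 101, 102  # assumed special-token IDs


def add_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    # Wrap a single sequence as [CLS] ... [SEP], the layout BERT-style models expect.
    return [cls_id] + ids + [sep_id]


plain_ids = [1045, 5223, 2023]  # made-up IDs for three tokens
with_special = add_special_tokens(plain_ids)
print(with_special)
print(len(with_special) - len(plain_ids))  # 2 extra IDs
```

This is why the list returned by `tokenizer(sequence)` is two IDs longer than the one built with `tokenize` followed by `convert_tokens_to_ids`.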

In [None]:
#special tokens
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

#decode
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

# Wrapping up: From tokenizer to model

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
print(output.logits)