
# Introduction
https://huggingface.co/learn/nlp-course/chapter2/1?fw=pt

# Inside the pipeline

https://www.youtube.com/watch?v=1pedAIvTWXk

* raw text -> **tokenizer** -> input IDs
* input IDs -> **model** -> logits
* logits -> **postprocessing with softmax and id2label** -> predictions

In [1]:
%pip install transformers[sentencepiece]
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier([
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
])

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

## preprocessing with a tokenizer:

1. split the text into tokens (words and subwords)
2. map each token to an integer using the vocabulary the model was pretrained with
3. add additional inputs that might be useful to the model (padding, attention masking, etc)
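
The three steps above can be sketched in plain Python. This is only a toy: `TOY_VOCAB`, its IDs, and `toy_encode` are made up for illustration, and a real tokenizer splits into subwords rather than on whitespace.

```python
# Toy illustration of the three preprocessing steps.
# TOY_VOCAB and its IDs are hypothetical, not a real model vocabulary.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "hate": 5, "this": 6, "love": 7, "course": 8}

def toy_encode(sentences):
    # 1. split each sentence into tokens (lowercase + whitespace split)
    tokenized = [s.lower().split() for s in sentences]
    # 2. map each token to an integer ID, falling back to [UNK],
    #    and wrap the sequence in [CLS] ... [SEP]
    ids = [[TOY_VOCAB["[CLS]"]]
           + [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in toks]
           + [TOY_VOCAB["[SEP]"]]
           for toks in tokenized]
    # 3. pad to the longest sequence and build the attention mask
    max_len = max(len(seq) for seq in ids)
    input_ids = [seq + [TOY_VOCAB["[PAD]"]] * (max_len - len(seq)) for seq in ids]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encoded = toy_encode(["I love this course", "I hate pizza"])
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

"pizza" is not in the toy vocabulary, so it maps to `[UNK]`; the shorter sentence gets one `[PAD]` and a matching `0` in its attention mask.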

In [2]:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [4]:
# return_tensors selects the tensor type to return ('pt' = PyTorch)
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors='pt')
inputs

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

## going through the model

inputs -> AutoModel -> hidden states / features -> model head -> output

The model head defines the specific NLP task (QA, NER, sentiment analysis, etc.)


The hidden states (or features) are high-dimensional vectors, one per input token, representing the model's contextual understanding of the inputs. The output tensor has dimensions:

    batchsize x sequence_length x hidden_size
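
A quick way to predict that shape before running the model, using only the `input_ids` printed earlier. The hidden size of 768 is an assumption here (it is the value commonly used by base-sized BERT-style models); the other two dimensions come straight from the tokenizer output.

```python
# input_ids copied from the tokenizer output above (two sentences, padded)
input_ids = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0,
     0, 0, 0, 0, 0, 0],
]
HIDDEN_SIZE = 768  # assumption: hidden size of the checkpoint being used

batch_size = len(input_ids)          # number of sentences
sequence_length = len(input_ids[0])  # padded length shared by every row
expected_shape = (batch_size, sequence_length, HIDDEN_SIZE)
print(expected_shape)
```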

In [7]:
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

### a high-dimensional vector

In [9]:
outputs = model(**inputs)
outputs.last_hidden_state.shape

torch.Size([2, 16, 768])

### model heads
high-dimensional vector -> head -> model output

* Instead of loading `AutoModel` and attaching a head ourselves, we can use the `AutoModelFor...` class that matches the task, e.g. `AutoModelForSequenceClassification`, `AutoModelForQuestionAnswering`, etc.

In [11]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.logits.shape)

torch.Size([2, 2])


## post processing the output

logits -> postprocessing -> meaningful output

In [16]:
outputs.logits

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

In [17]:
# get predictions
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)
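
To see what softmax is doing, here is a plain-Python sketch applied to the logits printed above. The `softmax` helper is written for illustration and is not the PyTorch implementation; it exponentiates each logit and normalizes each row so it sums to 1.

```python
import math

def softmax(row):
    # subtracting the row maximum keeps exp() numerically stable
    # and does not change the result
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]  # values printed above
probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```

Up to rounding, the rows should agree with the tensor returned by `torch.nn.functional.softmax`.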


In [18]:
# match predictions to labels
model.config.id2label  # this is where the label mapping is stored

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [20]:
predictions.abs()

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<AbsBackward0>)

In [26]:
predictions.argmax(dim=-1)

tensor([1, 0])

In [31]:
for pred in predictions.argmax(dim=-1):
    display(model.config.id2label[pred.item()])

'POSITIVE'

'NEGATIVE'
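
Combining the argmax with the probability at that index gives the same label/score pairs the pipeline returns. A small sketch in plain Python, with the probabilities and the `id2label` mapping copied from the outputs above:

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # from model.config.id2label
probabilities = [[4.0195e-02, 9.5980e-01],
                 [9.9946e-01, 5.4418e-04]]  # softmax output from above

results = []
for row in probabilities:
    # argmax: index of the class with the highest probability
    best = max(range(len(row)), key=row.__getitem__)
    results.append({"label": id2label[best], "score": row[best]})
print(results)
```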

In [32]:
# which corresponds with the original pipeline...
sentiment_analyzer = pipeline('sentiment-analysis')
sentiment_analyzer(raw_inputs)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

# Models
https://www.youtube.com/watch?v=AhChOFRegn4

* ->config file -> config class -> model config + model class -> model

* ->model file with weights

* `AutoConfig`

## Creating a transformer

If you know the model you want to use, you can use the class that defines its architecture directly.

* Instantiating a model from its config alone gives an untrained model with randomly initialized weights. To load trained weights, use the `from_pretrained` method.

* The weights are downloaded and cached (so future calls to the `from_pretrained()` method won’t re-download them). In recent versions of the library the cache folder defaults to `~/.cache/huggingface/hub` (older releases used `~/.cache/huggingface/transformers`). You can customize the cache location by setting the `HF_HOME` environment variable.

In [34]:
from transformers import BertConfig, BertModel

# building the config
config = BertConfig()
# building the model from the config
model = BertModel(config)
config

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.32.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

### Different loading methods

* from config
* from pretrained

In [36]:
# loading from config
from transformers import BertConfig, BertModel
config = BertConfig()
model = BertModel(config)
# this is not a trained model: its weights are randomly initialized

In [35]:
# loading a pretrained model
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-cased")

* Using the `AutoModel` class instead of the model-specific `...Model` class is better because it makes the code checkpoint-agnostic.

### Saving models
`save_pretrained` -> saves the config file and the model weights into the specified directory

In [37]:
model.save_pretrained('directory_on_working_folder')
%ls directory_on_working_folder

config.json  pytorch_model.bin


## Using a transformer model for inference

In [40]:
sequences = ['hello!', 'cool.', 'nice.']
tokenizer(sequences)

{'input_ids': [[101, 7592, 999, 102], [101, 4658, 1012, 102], [101, 3835, 1012, 102]], 'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]}

### using tensors as model inputs
Without `return_tensors`, the tokenizer output is a plain list of lists, so it has to be converted to a tensor before it is passed to the model.

In [43]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 1012, 102],
]
import torch
model_inputs = torch.tensor(encoded_sequences)

outputs = model(model_inputs)

# Tokenizers
https://www.youtube.com/watch?v=VFp38yj8h3A

- word-based
- character-based
- sub-word based

## Word-based

https://www.youtube.com/watch?v=nhJxYji1aho

In [44]:
tokenized_text = "Jim Henson was a puppeteer".split()
tokenized_text

['Jim', 'Henson', 'was', 'a', 'puppeteer']

## Character based
https://www.youtube.com/watch?v=ssLq_EK2jLE

In [46]:
tokenized_text = list("Jim Henson was a puppeteer")
tokenized_text

['J',
 'i',
 'm',
 ' ',
 'H',
 'e',
 'n',
 's',
 'o',
 'n',
 ' ',
 'w',
 'a',
 's',
 ' ',
 'a',
 ' ',
 'p',
 'u',
 'p',
 'p',
 'e',
 't',
 'e',
 'e',
 'r']
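
The trade-off between the two approaches shows up even on this one sentence: word-based tokenization gives a short sequence (but needs a very large vocabulary on a real corpus), while character-based tokenization needs only a small vocabulary but produces a much longer sequence. A quick count:

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()
char_tokens = list(text)

# sequence length vs. number of distinct tokens (vocabulary needed)
print("word-based:", len(word_tokens), "tokens,", len(set(word_tokens)), "unique")
print("char-based:", len(char_tokens), "tokens,", len(set(char_tokens)), "unique")
```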

## Sub-word based

https://www.youtube.com/watch?v=zHvTiHr506c
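
Sub-word tokenizers keep frequent words whole and split rarer words into smaller known pieces. Below is a toy sketch of the greedy longest-match-first splitting used by WordPiece-style tokenizers. `TOY_VOCAB` and the `wordpiece` helper are hypothetical and written for illustration; a real vocabulary has tens of thousands of entries and the real tokenizer also handles casing, punctuation and more.

```python
# Hypothetical mini-vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"Using", "a", "Trans", "##former", "network", "is", "simple", "[UNK]"}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

sentence = "Using a Transformer network is simple"
tokens = [piece for word in sentence.split() for piece in wordpiece(word, TOY_VOCAB)]
print(tokens)
```

"Transformer" is not in the toy vocabulary as a whole word, so it is split into `Trans` and `##former`.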

## loading and saving

loading:
1. the algorithm of the tokenizer and
2. its vocabulary

In [47]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# better to use AutoTokenizer to keep it checkpoint-agnostic
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

In [48]:
tokenizer("Using a Transformer network is simple.")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [49]:
# saving it with the save_pretrained
tokenizer.save_pretrained('directory_on_working_directory')

('directory_on_working_directory/tokenizer_config.json',
 'directory_on_working_directory/special_tokens_map.json',
 'directory_on_working_directory/vocab.txt',
 'directory_on_working_directory/added_tokens.json',
 'directory_on_working_directory/tokenizer.json')

## Encoding

https://www.youtube.com/watch?v=Yffk5aydLzg

Encoding is a two-step process: tokenization, then conversion to input IDs. Special tokens are added at the end.

raw text -> tokens -> input IDs -> add special tokens

```python
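# outline only: `tokenizer` and `raw_text` are defined in the next cell
# raw text -> tokens
tokens = tokenizer.tokenize(raw_text)
# tokens -> input IDs (integers from the vocabulary)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# add special tokens such as [CLS] and [SEP]
final_inputs = tokenizer.prepare_for_model(input_ids)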
```

In [53]:
raw_text = "Using a Transformer network is simple"
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# raw to tokens
tokens = tokenizer.tokenize(raw_text)
print("Tokens:\n",tokens, sep="")

# tokens to ids
ids = tokenizer.convert_tokens_to_ids(tokens)
print("IDS:",ids)

# ids to final inputs by adding special tokens
input_ids = tokenizer.prepare_for_model(ids)
print("input_ids:", input_ids)


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Tokens:
['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
IDS: [7993, 170, 13809, 23763, 2443, 1110, 3014]
input_ids: {'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


## Decoding

Use the `decode()` method to go from vocabulary indices back to a string.

In [55]:
decoded_string = tokenizer.decode([101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102])
decoded_string

'[CLS] Using a Transformer network is simple [SEP]'
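
Decoding does more than look up each ID: pieces that start with `##` are glued back onto the previous token. A simplified sketch of that merging step, using the ID/token pairs from the encoding example above. `toy_decode` is a hypothetical helper, not the library's implementation; its `skip_special_tokens` flag mirrors the argument of the same name on the real `decode()`.

```python
# ID -> token pairs taken from the encoding example above
ID_TO_TOKEN = {101: "[CLS]", 102: "[SEP]", 7993: "Using", 170: "a",
               13809: "Trans", 23763: "##former", 2443: "network",
               1110: "is", 3014: "simple"}

def toy_decode(ids, skip_special_tokens=False):
    text = ""
    for token_id in ids:
        token = ID_TO_TOKEN[token_id]
        if skip_special_tokens and token in ("[CLS]", "[SEP]"):
            continue
        if token.startswith("##"):
            text += token[2:]  # continuation piece: no space before it
        else:
            text += (" " if text else "") + token
    return text

ids = [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102]
print(toy_decode(ids))
print(toy_decode(ids, skip_special_tokens=True))
```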

# Handling multiple sequences / Batch
https://www.youtube.com/watch?v=M6adb1j2jPI


Tensors must be rectangular, so when we batch sentences of different lengths we pad the shorter ones until they all have the same length.

The `attention_mask` then tells the attention layers of the model which tokens to ignore (the padding).

## models expect inputs in batches

In [67]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a Hugging Face course my whole life"

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)

try:
    model(input_ids)
except Exception as error:
    print(type(error).__name__,": ",error,sep="")

IndexError: too many indices for tensor of dimension 1


In [77]:
input_ids.shape  # 1-D here; the model expects 2-D: (batch_size, sequence_length)
# adding a batch dimension makes the call work
model(input_ids.unsqueeze(0))

SequenceClassifierOutput(loss=None, logits=tensor([[-3.7595,  4.0442]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [78]:
# another method
input_ids = torch.tensor([ids])
model(input_ids)

SequenceClassifierOutput(loss=None, logits=tensor([[-3.7595,  4.0442]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

## padding and attention masks

* The padding token ID is available as `tokenizer.pad_token_id`.


In [80]:
tokenizer.pad_token_id

0

In [85]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)

print("\nResults Without Attention Masks:")
print("================================")

print(model(torch.tensor(batched_ids)).logits)

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

print("\nResults With Attention Masks:")
print("================================")
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)

Results Without Attention Masks:
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)

Results With Attention Masks:
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
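
Why does the mask bring back the unpadded result? A common way for attention layers to apply the mask is to push the scores of masked positions to a very large negative value before the softmax, so those positions receive zero weight. The sketch below shows the idea in plain Python; `masked_softmax` and the scores are made up for illustration, and this is a simplification rather than the exact code of this checkpoint.

```python
import math

def masked_softmax(scores, mask):
    # masked positions become -inf, so exp(-inf) = 0 and they get zero weight
    # (assumes at least one position is left unmasked)
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # made-up attention scores for three positions

print(masked_softmax(scores, [1, 1, 1]))   # padding takes part in attention
print(masked_softmax(scores, [1, 1, 0]))   # padding ignored
print(masked_softmax(scores[:2], [1, 1]))  # the unpadded sequence
```

With the mask, the weights on the real tokens are the same as for the unpadded sequence, which is why the second row of logits above goes back to the value obtained for `sequence2_ids` alone.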


# Putting it all together

Calling the tokenizer directly does everything we did above: splitting into tokens, converting to IDs, adding special tokens, padding, and building the attention mask.

In [87]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

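# `sequences` (plural) is the three-item list from the inference section above;
# it is what produced the three rows of logits shown below
sequences = ['hello!', 'cool.', 'nice.']

model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")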

model(**model_inputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-3.7235,  3.9691],
        [-4.2219,  4.5807],
        [-4.2943,  4.6340]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [102]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "So have I!"
]
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

tokens = tokenizer(sentences,
                   padding=True,
                   truncation=True,
                   return_tensors="pt")
output = model(**tokens)
output

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [-3.6183,  3.9137]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [103]:
# from model output to postprocessed output
predictions = torch.nn.functional.softmax(output.logits, dim=-1)
labels = predictions.argmax(dim=-1)
for label in labels:
    print(model.config.id2label[label.item()])

POSITIVE
POSITIVE
