
In [3]:
!pip install transformers
!pip install transformers[sentencepiece]

Sentiment Analysis with pipeline()

In [4]:
from transformers import pipeline

In [5]:
classifier = pipeline("sentiment-analysis")
classifier("The battery life of this camera is too short")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'NEGATIVE', 'score': 0.9994750618934631}]

The default checkpoint of the sentiment-analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english, as the message above reports.

Preprocessing with a tokenizer

In [6]:
import torch
# AdamW was imported here originally but is never used in this notebook
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModel

In [7]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

PyTorch tensors

In [8]:
raw_inputs = ["The battery life of this camera is too short",]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[ 101, 1996, 6046, 2166, 1997, 2023, 4950, 2003, 2205, 2460,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


Going through the model

Transformers provides an AutoModel class, which also has a from_pretrained() method.

In [9]:
model = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [10]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([1, 11, 768])


AutoModel only returns hidden states. To classify sentences as positive or negative, we need a model with a sequence classification head, so here we use AutoModelForSequenceClassification.

In [11]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [12]:
outputs = model(**inputs)
print(outputs.logits.shape)

torch.Size([1, 2])


Since we have just one sentence and two labels, the logits returned by the model have shape 1 x 2.

Postprocessing the output

The values we get as output from our model

In [13]:
print(outputs.logits)

tensor([[ 4.1389, -3.4127]], grad_fn=<AddmmBackward0>)


These are not probabilities but logits: the raw, unnormalized scores output by the last layer of the model. To convert them to probabilities, they need to go through a softmax layer.

In [14]:
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[9.9948e-01, 5.2498e-04]], grad_fn=<SoftmaxBackward0>)
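
To make the softmax step concrete, here is a minimal pure-Python sketch applied to the two logits printed above. The helper name softmax is our own, not a library function; torch.nn.functional.softmax does the same computation on tensors.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the resulting probabilities.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits printed by the model for the example sentence
logits = [4.1389, -3.4127]
probs = softmax(logits)
print(probs)
```

The two values should agree with the tensor printed by PyTorch up to rounding of the logits.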


To get the labels corresponding to each position, we can inspect the id2label attribute of the model config

In [15]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can conclude that the model predicted the following:

NEGATIVE: 0.99948, POSITIVE: 0.00052498
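
As a rough sketch of this last step, the probabilities can be paired with id2label and reduced to the single best label, which gives the same kind of result the pipeline returned at the start. The values are copied from the outputs above; the variable names are illustrative only.

```python
# Label mapping as printed by model.config.id2label
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities as printed after the softmax step (rounded)
probs = [0.99948, 0.00052498]

# Pick the index with the highest probability
best = max(range(len(probs)), key=probs.__getitem__)

prediction = {"label": id2label[best], "score": probs[best]}
print(prediction)
```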

Creating a Transformer

In [16]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
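
The configuration values fix the size of the model. The sketch below derives the per-head dimension and a parameter count from the numbers printed above, assuming the standard BERT layout (embeddings with a LayerNorm, encoder layers made of self-attention plus a feed-forward block, and a pooler). The helper linear and the plain dict are our own illustration, not part of the library.

```python
# Values copied from the printed BertConfig
config = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 30522,
}

h = config["hidden_size"]
inter = config["intermediate_size"]

# Each attention head works on an equal slice of the hidden vector
head_dim = h // config["num_attention_heads"]


def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * h  # scale and shift vectors

# Word, position and token-type embeddings, followed by a LayerNorm
embeddings = (
    config["vocab_size"]
    + config["max_position_embeddings"]
    + config["type_vocab_size"]
) * h + layer_norm

# Query, key, value and output projections, followed by a LayerNorm
attention = 4 * linear(h, h) + layer_norm

# Two dense layers (expand, then project back), followed by a LayerNorm
feed_forward = linear(h, inter) + linear(inter, h) + layer_norm

encoder = config["num_hidden_layers"] * (attention + feed_forward)
pooler = linear(h, h)

total = embeddings + encoder + pooler
print("head dimension:", head_dim)
print("parameters:", total)
```

Under these assumptions the count comes to roughly 109 million parameters, all of which are randomly initialized when the model is built from a bare config.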



Different loading methods

Creating a model from the default configuration initializes it with random values, so it has to be trained before it produces useful output

In [17]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

Loading a Transformer model that is already trained

In [18]:
model = BertModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Saving methods

In [19]:
model.save_pretrained("directory_on_my_computer")

In [20]:
ls directory_on_my_computer

config.json  pytorch_model.bin


Using a Transformer model for inference

In [21]:
sequences = ["Hello!", "Cool.", "Nice!"]
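# Named `encoded` rather than `input` to avoid shadowing Python's built-in input().
# Note: `tokenizer` is still the DistilBERT tokenizer loaded in In [7],
# while `model` is now bert-base-cased.
encoded = tokenizer(sequences)
print(encoded["input_ids"])

[[101, 7592, 999, 102], [101, 4658, 1012, 102], [101, 3835, 999, 102]]


In [22]:
encoded_sequences = encoded["input_ids"]
model_inputs = torch.tensor(encoded_sequences)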

Using the tensors as inputs to the model

In [23]:
output = model(model_inputs)

Tokenizers

Splitting the text into words with Python's split() function

In [24]:
tokenized_text = "I am Shamik Dhar".split()
print(tokenized_text)

['I', 'am', 'Shamik', 'Dhar']
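
One limitation of whitespace splitting is that punctuation stays attached to the neighbouring word. The sketch below contrasts split() with a simple regular expression that separates punctuation; the pattern is only an illustration and not what the pretrained tokenizers use.

```python
import re

text = "Hello! Cool. Nice!"

# Whitespace splitting keeps punctuation glued to the words
whitespace_tokens = text.split()
print(whitespace_tokens)

# A word is a run of word characters; any other non-space character
# becomes its own token
punct_tokens = re.findall(r"\w+|[^\w\s]", text)
print(punct_tokens)
```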


Loading and saving tokenizers

In [25]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer("I am Shamik Dhar")

{'input_ids': [101, 146, 1821, 156, 2522, 4847, 141, 7111, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Saving a tokenizer is identical to saving a model

In [26]:
tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json',
 'directory_on_my_computer/tokenizer.json')

Encoding


Tokenization

In [27]:
sequence = "Statistics and Mathematics are correlated"
tokens = tokenizer.tokenize(sequence)
print(tokens)

['Statistics', 'and', 'Mathematics', 'are', 'correlated']


From tokens to input IDs

In [28]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[10910, 1105, 9833, 1132, 27053]


Decoding

In [29]:
decoded_string = tokenizer.decode([10910, 1105, 9833, 1132, 27053])
print(decoded_string)

Statistics and Mathematics are correlated
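
Conceptually, the token-to-ID step is a vocabulary lookup and decoding is the reverse lookup. Here is a toy sketch using only the five token/ID pairs printed above; the function names are hypothetical, and the real decode() also merges subword pieces, which this toy version does not attempt.

```python
# Toy vocabulary built from the token/ID pairs printed above
vocab = {
    "Statistics": 10910,
    "and": 1105,
    "Mathematics": 9833,
    "are": 1132,
    "correlated": 27053,
}
inverse_vocab = {idx: token for token, idx in vocab.items()}


def tokens_to_ids(tokens):
    """Look up each token in the vocabulary."""
    return [vocab[token] for token in tokens]


def ids_to_text(ids):
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(inverse_vocab[idx] for idx in ids)


tokens = ["Statistics", "and", "Mathematics", "are", "correlated"]
ids = tokens_to_ids(tokens)
print(ids)
print(ids_to_text(ids))
```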


Handling multiple sequences

Models expect a batch of inputs

In [30]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [31]:
sequence = "Statistics and Mathematics are correlated"
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

In [32]:
input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 6747,  1998,  5597,  2024, 23900]])
Logits: tensor([[-0.6533,  0.6502]], grad_fn=<AddmmBackward0>)


Batching is the act of sending multiple sentences through the model, all at once

Padding the inputs

In [33]:
padding_id = 100
batched_ids = [[200, 200, 200], [200, 200, padding_id],]

In [34]:
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [[200, 200, 200], [200, 200, tokenizer.pad_token_id],]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)
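
Note that the second row of the batched logits differs from the logits of the shorter sequence on its own: the padding token is attended to unless it is masked out. The padding itself is simple, as this sketch shows. The helper pad_batch is our own, and pad_id=0 mirrors the pad_token_id in the BertConfig printed earlier; treat it as an assumption for other tokenizers.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


batch = [[200, 200, 200], [200, 200]]
padded = pad_batch(batch, pad_id=0)
print(padded)
```

After padding, all rows have the same length, so the nested list can be turned into a rectangular tensor.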


In [36]:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

In [37]:
# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

In [38]:
# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

Attention masks

In [35]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
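
With the mask applied, the second row matches the logits of the shorter sequence on its own. A mask like the one written by hand above can be derived from the padded IDs, as in this sketch. The helper name is hypothetical, and it assumes the padding ID never occurs as a real token in the input.

```python
def build_attention_mask(padded_batch, pad_id):
    """Return 1 for real tokens and 0 for padding positions."""
    return [[0 if tok == pad_id else 1 for tok in seq] for seq in padded_batch]


# Assumes pad_id=0, as in the BertConfig printed earlier
padded = [[200, 200, 200], [200, 200, 0]]
mask = build_attention_mask(padded, pad_id=0)
print(mask)
```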


Truncation of Sequences

In [39]:
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
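
The three example sequences are only four IDs long, so max_length=8 leaves them unchanged. To see what truncation does to a longer input, here is a rough sketch that cuts from the right while keeping the [CLS] and [SEP] IDs (101 and 102 in the outputs of this notebook). The function is illustrative and covers only the single-sequence case; it is not the library's implementation.

```python
def truncate(ids, max_length, cls_id=101, sep_id=102):
    """Truncate a sequence wrapped in [CLS] ... [SEP] to max_length IDs."""
    if max_length < 2:
        raise ValueError("max_length must leave room for both special tokens")
    if len(ids) <= max_length:
        return list(ids)
    body = ids[1:-1]  # drop the existing special tokens
    return [cls_id] + body[: max_length - 2] + [sep_id]


ids = [101, 6747, 1998, 5597, 2024, 23900, 102]
print(truncate(ids, max_length=5))
print(truncate(ids, max_length=8))
```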

In [40]:
# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

In [41]:
# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

In [42]:
# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

Special tokens

In [43]:
model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 6747, 1998, 5597, 2024, 23900, 102]
[6747, 1998, 5597, 2024, 23900]


Decoding both ID sequences shows that the tokenizer added [CLS] at the start and [SEP] at the end

In [44]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] statistics and mathematics are correlated [SEP]
statistics and mathematics are correlated
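
The difference between the two ID lists is just the two special tokens. A small sketch of adding and removing them, using the IDs 101 and 102 seen in the output above (the helper names are our own):

```python
CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP] in the outputs above


def add_special_tokens(ids):
    """Wrap a single sequence as [CLS] ... [SEP]."""
    return [CLS_ID] + list(ids) + [SEP_ID]


def strip_special_tokens(ids):
    """Remove [CLS] and [SEP] wherever they occur."""
    return [i for i in ids if i not in (CLS_ID, SEP_ID)]


ids = [6747, 1998, 5597, 2024, 23900]
with_special = add_special_tokens(ids)
print(with_special)
print(strip_special_tokens(with_special))
```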


Wrapping up: From tokenizer to model

In [45]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)