<a href="https://colab.research.google.com/github/somewhereovertherainbo/TRANSFORMERS/blob/main/FINE_TUNING_SENTIMENT_ANALYSIS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch
from pprint import pprint
import numpy as np


# Pipeline

1. Text preprocessing: convert text into numbers
2. Model: use the numbers to make predictions

# TOKENIZER

Each model family (BERT, DistilBERT, GPT-2, etc.) has its own tokenizer.

Always load the tokenizer and the model from the same checkpoint.

In [2]:
from transformers import AutoTokenizer

In [3]:
checkpoint = 'bert-base-cased' # Model checkpoint for BERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)



In [4]:
tokenizer('hello world')

{'input_ids': [101, 19082, 1362, 102], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}

*Some insight*

In [5]:
tokens = tokenizer.tokenize('hello world')
print(tokens)

['hello', 'world']


Note that there are only two tokens, yet the tokenizer call returned four ids. Let's see where the extra two come from.

In [6]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[19082, 1362]


The previous two steps can be done in one call with the `encode` method, which also adds the model's special tokens:

In [7]:
ids_ = tokenizer.encode('hello world')
print(ids_)

[101, 19082, 1362, 102]


In [8]:
tokenizer.convert_ids_to_tokens(ids_)

['[CLS]', 'hello', 'world', '[SEP]']

So the two extra ids are the special tokens `[CLS]` and `[SEP]`, which the tokenizer adds at the start and end of the sequence. These particular tokens are specific to BERT-style models.
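As a rough sketch of what `encode` does, the snippet below chains the three steps (tokenize, convert to ids, add special tokens) over a tiny hand-written vocabulary. The ids are copied from the outputs above; `TOY_VOCAB` and `toy_encode` are hypothetical names, and plain whitespace splitting stands in for the real tokenizer.

```python
# Toy vocabulary: the four ids are copied from the notebook outputs above.
# Assumption: a real vocabulary has tens of thousands of entries.
TOY_VOCAB = {'[CLS]': 101, '[SEP]': 102, 'hello': 19082, 'world': 1362}

def toy_encode(text, vocab=TOY_VOCAB, add_special_tokens=True):
    """Hypothetical stand-in for tokenizer.encode on in-vocabulary words."""
    tokens = text.split()                   # step 1: text -> tokens
    ids = [vocab[t] for t in tokens]        # step 2: tokens -> ids
    if add_special_tokens:                  # step 3: wrap with [CLS] ... [SEP]
        ids = [vocab['[CLS]']] + ids + [vocab['[SEP]']]
    return ids

print(toy_encode('hello world'))
print(toy_encode('hello world', add_special_tokens=False))
```

The real `tokenizer.encode` accepts the same `add_special_tokens=False` argument if only the raw token ids are needed.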

In [9]:
tokenizer.decode(ids_)

'[CLS] hello world [SEP]'

In [10]:
tokenizer('hello world', return_tensors = 'pt')

{'input_ids': tensor([[  101, 19082,  1362,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

When we pass several inputs of different lengths, they must be padded (and, if too long, truncated) to a common length before they can be stacked into one tensor.

In [11]:
data = ['I am Thanos, I am inevitable', 'And I am Ironman']

model_inputs = tokenizer(data, padding = True, truncation = True, return_tensors = 'pt')
pprint(model_inputs)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]),
 'input_ids': tensor([[  101,   146,  1821, 16062,  2155,   117,   146,  1821, 14014,   102],
        [  101,  1262,   146,  1821,  5621,  1399,   102,     0,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
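To make the padding and the attention mask concrete, here is a minimal pure-Python sketch that reproduces the `input_ids` and `attention_mask` shown above from the two unpadded id lists. `pad_batch` is a hypothetical helper, not a library function, and its truncation simply cuts the tail (the real tokenizer keeps the closing `[SEP]`).

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad id lists to the longest sequence; 1 = real token, 0 = padding."""
    if max_length is not None:
        # Simplified truncation (assumption): just drop ids past max_length
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Unpadded ids taken from the tokenizer output above
batch = pad_batch([
    [101, 146, 1821, 16062, 2155, 117, 146, 1821, 14014, 102],
    [101, 1262, 146, 1821, 5621, 1399, 102],
])
print(batch['input_ids'][1])
print(batch['attention_mask'][1])
```

The mask tells the model which positions to ignore, so the padding zeros do not influence the prediction.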


In [12]:
tokenizer.convert_ids_to_tokens(model_inputs['input_ids'][0])

['[CLS]', 'I', 'am', 'Than', '##os', ',', 'I', 'am', 'inevitable', '[SEP]']

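Note that 'Thanos' is split into `Than` and `##os`: the word is not in BERT's vocabulary, so the WordPiece tokenizer breaks it into known subword pieces, with `##` marking a piece that continues the previous one.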
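The split can be illustrated with a greedy longest-match-first sketch of WordPiece. This is a simplified version under stated assumptions: `toy_vocab` is a made-up vocabulary containing only the pieces seen above, and `wordpiece` is a hypothetical helper that skips the normalization the real tokenizer performs.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first subword split (simplified WordPiece)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Made-up vocabulary holding only the pieces from the example sentence
toy_vocab = {'I', 'am', 'Than', '##os', 'inevitable'}
print(wordpiece('Thanos', toy_vocab))
print(wordpiece('inevitable', toy_vocab))
```

Words that are in the vocabulary as a whole stay intact; only rarer words such as names get broken into pieces.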

# MODEL

In [13]:
# from transformers import AutoModelForSequenceClassification

In [14]:
# model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

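Loading a sequence-classification model from a base checkpoint prints a warning that some weights (the classification head) are newly initialized rather than loaded from the checkpoint. This is exactly what we want for transfer learning: the pretrained body is reused, and the new head is trained on our task.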

In [15]:
try:
  from datasets import load_dataset
except ImportError:
  # Install the library only if it is missing, then retry the import
  !pip install datasets -q
  from datasets import load_dataset

In [16]:
raw_datasets = load_dataset('glue','sst2')

In [17]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

In [18]:
raw_datasets['train']

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 67349
})

In [19]:
raw_datasets['train'][0]

{'sentence': 'hide new secretions from the parental units ',
 'label': 0,
 'idx': 0}

In [20]:
raw_datasets['train'][400:403]

{'sentence': ['gory as the scenes of torture and self-mutilation ',
  'just as the recent argentine film son of the bride reminded us that a feel-good movie can still show real heart ',
  "ub equally spoofs and celebrates the more outre aspects of ` black culture ' and the dorkier aspects of ` white culture , ' even as it points out how inseparable the two are . "],
 'label': [0, 1, 1],
 'idx': [400, 401, 402]}

In [21]:
raw_datasets['train'].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [22]:
dir(raw_datasets['train'])

['_TF_DATASET_REFS',
 '__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getitems__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_build_local_temp_path',
 '_check_index_is_initialized',
 '_data',
 '_estimate_nbytes',
 '_fingerprint',
 '_format_columns',
 '_format_kwargs',
 '_format_type',
 '_generate_tables_from_cache_file',
 '_generate_tables_from_shards',
 '_get_cache_file_path',
 '_get_output_signature',
 '_getitem',
 '_indexes',
 '_indices',
 '_info',
 '_map_single',
 '_new_dataset_with_indices',
 '_output_all_columns',
 '_push_parquet_shards_to_hub',
 '_save_to_disk_single',
 '_select_contiguous',
 '_select_wi

Note that the tokenization calls below pass neither `padding` nor `return_tensors`: the `Trainer` pads each batch and converts it to tensors later, at training time.


In [23]:
!pip install transformers[torch]



In [24]:
from transformers import AutoTokenizer

In [25]:
checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [26]:
tokenized_sentences = tokenizer(raw_datasets['train'][0:3]['sentence'])

pprint(tokenized_sentences)

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'input_ids': [[101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],
               [101,
                3397,
                2053,
                15966,
                1010,
                2069,
                4450,
                2098,
                18201,
                2015,
                102],
               [101,
                2008,
                7459,
                2049,
                3494,
                1998,
                10639,
                2015,
                2242,
                2738,
                3376,
                2055,
                2529,
                3267,
                102]]}


In [27]:
def tokenize_fn(batch):
  return tokenizer(batch['sentence'], truncation = True)

tokenized_datasets = raw_datasets.map(tokenize_fn, batched = True)


In [28]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 1821
    })
})
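Because `tokenize_fn` does not pad, the sequences in `tokenized_datasets` still have different lengths, and padding can be deferred to batch time. The sketch below compares how many token slots are needed when each batch is padded to its own longest sequence versus padding everything to the global maximum. `padded_tokens` is a hypothetical helper; the first three lengths come from the tokenized output above and the last three are made up.

```python
def padded_tokens(lengths, batch_size, dynamic=True):
    """Count token slots after padding, per batch or to the global maximum."""
    batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
    if dynamic:
        # Each batch is padded only up to its own longest sequence
        return sum(max(batch) * len(batch) for batch in batches)
    # Every sequence is padded up to the longest one in the dataset
    return max(lengths) * len(lengths)

lengths = [10, 11, 15, 4, 5, 6]  # 10, 11, 15 from above; 4, 5, 6 are assumed
print(padded_tokens(lengths, batch_size=3, dynamic=True))
print(padded_tokens(lengths, batch_size=3, dynamic=False))
```

As far as we know, passing a tokenizer to `Trainer` makes it fall back to a padding collator of this per-batch kind, which is why no explicit padding is needed here.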

In [29]:
from transformers import TrainingArguments

In [30]:
training_args = TrainingArguments(
    output_dir = 'my_trainer',
    evaluation_strategy = 'epoch',
    save_strategy = 'epoch',
    num_train_epochs = 1
)

In [31]:
from transformers import AutoModelForSequenceClassification

In [32]:
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels = 2
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
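The warning boils down to a comparison between the parameter names the model expects and the names stored in the checkpoint. Below is a minimal sketch of that comparison with plain sets. The four head names are taken from the warning itself; the remaining names are illustrative assumptions, not a full listing of either side.

```python
# Names stored in the pretrained checkpoint (illustrative subset, assumed)
checkpoint_keys = {
    'distilbert.embeddings.word_embeddings.weight',
    'distilbert.embeddings.position_embeddings.weight',
    'vocab_projector.weight',  # assumed pretraining-only weight, unused here
}

# Names the classification model expects (head names from the warning above)
model_keys = {
    'distilbert.embeddings.word_embeddings.weight',
    'distilbert.embeddings.position_embeddings.weight',
    'pre_classifier.weight',
    'pre_classifier.bias',
    'classifier.weight',
    'classifier.bias',
}

# Expected by the model but absent from the checkpoint: randomly initialized
newly_initialized = sorted(model_keys - checkpoint_keys)
# Present in the checkpoint but not used by this model: silently dropped
unused = sorted(checkpoint_keys - model_keys)

print(newly_initialized)
print(unused)
```

Only the head is new, so fine-tuning starts from a pretrained body plus a randomly initialized classifier.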


In [33]:
type(model)

transformers.models.distilbert.modeling_distilbert.DistilBertForSequenceClassification

In [34]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [35]:
!pip install torchinfo



In [36]:
from torchinfo import summary

In [37]:
summary(model)

Layer (type:depth-idx)                                  Param #
DistilBertForSequenceClassification                     --
├─DistilBertModel: 1-1                                  --
│    └─Embeddings: 2-1                                  --
│    │    └─Embedding: 3-1                              23,440,896
│    │    └─Embedding: 3-2                              393,216
│    │    └─LayerNorm: 3-3                              1,536
│    │    └─Dropout: 3-4                                --
│    └─Transformer: 2-2                                 --
│    │    └─ModuleList: 3-5                             42,527,232
├─Linear: 1-2                                           590,592
├─Linear: 1-3                                           1,538
├─Dropout: 1-4                                          --
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
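The counts in the summary can be checked by hand from the layer shapes. The sketch below recomputes them from the sizes printed above (`30522`, `512`, `768`) plus `hidden_dim = 3072` from the saved config shown later in the notebook. It assumes each transformer block ends with two feed-forward linear layers and a second layer norm, which the truncated model printout does not show in full.

```python
def linear(n_in, n_out):
    """Parameters of a linear layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

def layer_norm(dim):
    """Parameters of a layer norm: one scale and one shift per feature."""
    return 2 * dim

vocab_size, max_pos, dim, hidden_dim, n_layers, num_labels = 30522, 512, 768, 3072, 6, 2

embeddings = vocab_size * dim + max_pos * dim + layer_norm(dim)
block = (
    4 * linear(dim, dim)         # q_lin, k_lin, v_lin, out_lin
    + layer_norm(dim)            # sa_layer_norm
    + linear(dim, hidden_dim)    # first feed-forward layer (assumed shape)
    + linear(hidden_dim, dim)    # second feed-forward layer (assumed shape)
    + layer_norm(dim)            # layer norm after the feed-forward (assumed)
)
transformer = n_layers * block
head = linear(dim, dim) + linear(dim, num_labels)  # pre_classifier + classifier
total = embeddings + transformer + head

print(f'{transformer:,}')
print(f'{total:,}')
```

Both numbers agree with the `ModuleList` row and the `Total params` line of the summary, which supports the assumed block layout.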

In [38]:
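# Snapshot the parameters before training so we can compare them afterwards

params_before = []

for name, p in model.named_parameters():
  # .copy() matters: on CPU, .numpy() shares memory with the tensor,
  # so without it the snapshot would change as the model trains
  params_before.append(p.detach().cpu().numpy().copy())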

In [39]:
from transformers import Trainer

In [40]:
from datasets import load_metric

In [41]:
metric = load_metric('glue', 'sst2')



In [42]:
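def compute_metrics(eval_preds):
  # The Trainer passes an object that unpacks into (logits, labels)
  logits, labels = eval_preds
  # The predicted class is the index of the largest logit
  predictions = np.argmax(logits, axis = -1)
  return metric.compute(predictions = predictions, references = labels)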
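For SST-2 the metric is plain accuracy, so the computation can be sketched without any library: take the argmax of each row of logits and compare it with the label. `accuracy_from_logits` is a hypothetical helper, and the logits and labels below are made-up values for illustration.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {'accuracy': correct / len(labels)}

# Made-up logits for four sentences, two classes each
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 3.0]]
labels = [0, 1, 1, 1]
print(accuracy_from_logits(logits, labels))
```

Returning a dictionary mirrors what a metric function for the `Trainer` is expected to return, so each key can show up as a column in the evaluation log.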

In [43]:
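trainer = Trainer(
    model,
    training_args,
    train_dataset = tokenized_datasets['train'],
    eval_dataset = tokenized_datasets['validation'],
    tokenizer = tokenizer,
    # compute_metrics is left unset, so evaluation reports only the loss;
    # pass a metric function here to also log accuracy
)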

In [44]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.2226        | 0.352956        |


TrainOutput(global_step=8419, training_loss=0.26765904787360156, metrics={'train_runtime': 411.4573, 'train_samples_per_second': 163.684, 'train_steps_per_second': 20.461, 'total_flos': 518596929468840.0, 'train_loss': 0.26765904787360156, 'epoch': 1.0})
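The `global_step` value can be reproduced with a little arithmetic. The sketch below assumes a per-device batch size of 8 (the `TrainingArguments` default, since none was set above), a single device, and no gradient accumulation.

```python
import math

num_train_rows = 67349       # from the dataset summary above
per_device_batch_size = 8    # assumed: default, not set in training_args
num_epochs = 1               # from training_args
train_runtime = 411.4573     # seconds, from TrainOutput

# The last, smaller batch still counts as one optimizer step
steps_per_epoch = math.ceil(num_train_rows / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs
steps_per_second = total_steps / train_runtime

print(total_steps)
print(round(steps_per_second, 3))
```

Both results line up with `global_step` and `train_steps_per_second` in the output, which suggests the assumed batch size is the one actually used.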

In [45]:
trainer.save_model('my_saved_model')

In [46]:
!ls

my_saved_model	my_trainer  sample_data


In [47]:
!ls my_saved_model

config.json	   special_tokens_map.json  tokenizer.json     vocab.txt
model.safetensors  tokenizer_config.json    training_args.bin


In [49]:
from transformers import pipeline

In [50]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [51]:
newmodel = pipeline('text-classification', model = 'my_saved_model', device = device)

In [53]:
newmodel('This movie sucks')

[{'label': 'LABEL_0', 'score': 0.9972392320632935}]

In [54]:
newmodel('This movie is awesome')

[{'label': 'LABEL_1', 'score': 0.9994176626205444}]

In [55]:
newmodel('This movie is fine but boring at some places but the climax was interesting')

[{'label': 'LABEL_1', 'score': 0.9628579616546631}]
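The `score` is a probability obtained from the model's two output logits. Here is a minimal sketch of that last step, assuming a softmax over the logits followed by picking the most likely class. `classify` is a hypothetical helper and the logits are made up.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, id2label):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {'label': id2label[best], 'score': probs[best]}

# Made-up logits strongly favoring class 1
result = classify([-2.9, 3.0], {0: 'LABEL_0', 1: 'LABEL_1'})
print(result)
```

A gap of a few units between the logits is already enough to push the score above 0.99, so a high score signals a large logit gap rather than a guarantee of correctness, as the mixed review above suggests.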

In [57]:
!cat my_saved_model/config.json

{
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.35.2",
  "vocab_size": 30522
}


In [59]:
import json

In [61]:
config_path = 'my_saved_model/config.json'

with open(config_path) as f:
  j = json.load(f)

j['id2label'] = {0:'negative', 1:'positive'}

with open(config_path, 'w') as f:
  json.dump(j, f, indent = 2)


In [62]:
newmodel = pipeline('text-classification', model = 'my_saved_model')

In [63]:
newmodel('This movie is good')

[{'label': 'positive', 'score': 0.9994250535964966}]
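One detail of the config edit is worth spelling out: JSON object keys are always strings, so the integer keys of `id2label` are written as `"0"` and `"1"`. The sketch below shows the round trip on a stripped-down stand-in for the config and also adds the inverse `label2id` mapping, which is an optional extra not present in the cell above.

```python
import json

# Stripped-down stand-in for the saved config (assumption: only one field)
config = {'model_type': 'distilbert'}
config['id2label'] = {0: 'negative', 1: 'positive'}
# Optional inverse mapping from label name to class index
config['label2id'] = {name: idx for idx, name in config['id2label'].items()}

# Serialize and parse again, as writing and re-reading config.json would
round_tripped = json.loads(json.dumps(config, indent=2))

print(round_tripped['id2label'])
print(round_tripped['label2id'])
```

The pipeline output above shows that the string keys are handled when the model is reloaded, so the edit works as written.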

In [64]:
# keeping track of parameters

params_after = []

for name, p in model.named_parameters():
  params_after.append(p.detach().cpu().numpy())

In [65]:
for p1,p2 in zip(params_before, params_after):
  print(np.sum(np.abs(p1-p2)))

13451.526
89.644325
1.7472234
1.1174855
1305.1522
1.7431557
1299.8098
0.002956793
1204.2213
1.0763932
1138.3402
0.8473623
1.7168597
0.83087003
4964.3345
5.6985693
4574.072
0.71866715
1.5702083
0.69705504
1273.9503
1.3955863
1271.2676
0.0028607547
1124.4442
0.8037366
1077.0328
0.7245903
1.5661645
0.7197476
4904.2217
5.289961
4502.499
0.7193879
1.5121133
0.80978465
1270.1055
1.6722708
1276.6786
0.0026723715
1118.6473
0.81623226
1093.519
0.74095803
1.5708466
0.79569536
4928.175
5.498934
4401.745
0.7143285
1.4496495
0.6779174
1277.6794
1.4309702
1291.9348
0.0031636553
1143.3931
0.7003326
1092.811
0.73884284
1.4212782
0.7067465
4835.437
5.377535
4214.3604
0.81798553
1.3667474
0.80053586
1194.2725
1.6177548
1178.3024
0.0024505921
971.7678
0.79074264
981.9073
0.99157035
1.3642647
0.9321915
4283.7964
5.0876603
3341.6174
0.74552554
1.2668383
0.6873616
1070.7448
1.2676694
1089.944
0.0013765331
919.6099
0.8581296
912.6987
1.0156909
1.3472986
1.1652426
3442.6997
4.501417
3091.9087
0.94613624
1.242

Every value printed above is nonzero, so each parameter tensor changed during training: the model really did update its weights.
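The list of numbers is easier to read when each change is paired with its parameter name. The sketch below does this for plain Python lists. `total_abs_change` is a hypothetical helper, and the two snapshots contain made-up values rather than real weights.

```python
def total_abs_change(before, after):
    """Sum of absolute differences per named parameter."""
    return {
        name: sum(abs(a - b) for a, b in zip(before[name], after[name]))
        for name in before
    }

# Made-up snapshots: the weight moves, the bias stays the same
before = {'classifier.weight': [0.1, -0.2], 'classifier.bias': [0.0, 0.0]}
after = {'classifier.weight': [0.3, -0.1], 'classifier.bias': [0.0, 0.0]}

changes = total_abs_change(before, after)
unchanged = [name for name, change in changes.items() if change == 0.0]

for name, change in changes.items():
    print(f'{name}: {change:.4f}')
print('unchanged:', unchanged)
```

With real snapshots, an empty `unchanged` list would confirm at a glance that no layer was accidentally frozen.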