<a href="https://colab.research.google.com/github/tisRobin/BERT_Research_Project/blob/main/1_Model_Fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Reference**:

BERT for computational social scientists:

https://www.youtube.com/watch?v=UmyOhl9AciI


* Install the transformers and datasets libraries

In [None]:
!pip install transformers datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


* Import the DistilBERT tokenizer and model classes, plus Trainer and TrainingArguments
* Import the load_dataset function, numpy, and torch

In [None]:
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, Trainer, TrainingArguments

In [None]:
from datasets import load_dataset
import numpy as np
import torch

* Call the load_dataset function to load the "financial phrasebank" dataset
* The "financial phrasebank" dataset is divided into several subsets according to the rate of agreement between the annotators (16).
* I chose 'sentences_allagree' for my experiment, which contains only the instances with 100% annotator agreement: 2264 of the 4846 sentences.

In [None]:
raw_datasets = load_dataset("financial_phrasebank", "sentences_allagree")



  0%|          | 0/1 [00:00<?, ?it/s]

* The dataset has two columns, containing the sentence and the corresponding sentiment label, respectively.

* Unlike many other datasets on Hugging Face, this dataset has no train/validation/test split. Therefore, the split needs to be done manually.

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 2264
    })
})

In [None]:
dir(raw_datasets['train'])

['_TF_DATASET_REFS',
 '__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_build_local_temp_path',
 '_check_index_is_initialized',
 '_data',
 '_fingerprint',
 '_format_columns',
 '_format_kwargs',
 '_format_type',
 '_get_cache_file_path',
 '_get_output_signature',
 '_getitem',
 '_indexes',
 '_indices',
 '_info',
 '_iter',
 '_map_single',
 '_new_dataset_with_indices',
 '_output_all_columns',
 '_push_parquet_shards_to_hub',
 '_select_contiguous',
 '_select_with_indices_mapping',
 '_split',
 'add_column',
 'add_elasticsearch_index',
 'add_faiss_index',
 'add_faiss_index_from_external_arrays',
 'ad

In [None]:
raw_datasets['train'].data

MemoryMappedTable
sentence: string
label: int64
----
sentence: [["According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .","For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .","In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .","Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in the corresponding period in 2007 representing 7.7 % of net sales .","Operating profit totalled EUR 21.1 mn , up from EUR 18.6 mn in 2007 , representing 9.7 % of net sales .","Finnish Talentum reports its operating profit increased to EUR 20.5 mn in 2005 from EUR 9.3 mn in 2004 , and net sales totaled EUR 103.3 mn , up from EUR 96.4 mn .","Clothing retail chain Sepp+ñl+ñ 's sales increased by 8 % to EUR 155.2 mn , and ope

In [None]:
raw_datasets['train'][0]

{'label': 1,
 'sentence': 'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .'}

In [None]:
split_data = raw_datasets['train'].train_test_split(train_size=int(2264*0.8) , test_size = 2264 - int(2264*0.8))

In [None]:
split_data

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 1811
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 453
    })
})
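
The 80/20 split above can be reasoned about in plain Python. The sketch below is only an illustration of the idea, not the library's implementation: the helper name `split_indices` and the seed value are my own assumptions. It shuffles the row indices with a fixed seed and cuts them at `int(n_rows * 0.8)`, which reproduces the 1811/453 row counts.

```python
import random

def split_indices(n_rows, train_fraction=0.8, seed=42):
    """Shuffle row indices with a fixed seed and cut them into train/test parts.

    Hypothetical helper for illustration; `seed=42` is an arbitrary choice.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # seeded, so the split is repeatable
    n_train = int(n_rows * train_fraction)    # same rounding as int(2264*0.8)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(2264)
print(len(train_idx), len(test_idx))
```

Because the call above passes no seed, rerunning the notebook produces a different split each time. `train_test_split` also accepts a `seed` argument, which would make the reported scores easier to reproduce.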

In [None]:
split_data['train']

Dataset({
    features: ['sentence', 'label'],
    num_rows: 1811
})

In [None]:
split_data['test']

Dataset({
    features: ['sentence', 'label'],
    num_rows: 453
})

* Divide the data further into 4 lists: train texts, train labels, test texts, and test labels

In [None]:
train_texts = split_data['train']['sentence']
train_labels = split_data['train']['label']
test_texts = split_data['test']['sentence']
test_labels = split_data['test']['label']

In [None]:
len(train_texts), len(train_labels), len(test_texts), len(test_labels)

(1811, 1811, 453, 453)

In [None]:
# example
train_labels[0], train_texts[0]

(2,
 'Circulation revenue has increased by 5 % in Finland and 4 % in Sweden in 2008 .')

In [None]:
model_name = 'distilbert-base-cased'  
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

* Padding, truncation, special tokens, division into word pieces

In [None]:
# The maximum number of tokens in any document sent to BERT.
max_length = 512   

# tokenize the train and test text according to the BERT tokenizer 
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
test_encodings  = tokenizer(test_texts, truncation=True, padding=True, max_length=max_length)

# Since the labels were already encoded into integers, I simply passed along the same values.
train_labels_encoded = train_labels
test_labels_encoded  = test_labels

In [None]:
set(train_labels_encoded), set(test_labels_encoded)

({0, 1, 2}, {0, 1, 2})

* Convert the labels and encodings into a Torch dataset object (since this is what Hugging Face accepts)

https://huggingface.co/transformers/v3.4.0/custom_datasets.html


In [None]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    # The __getitem__ dunder method converts one example into torch tensors when the Trainer requests it.
    def __getitem__(self, idx):
        # convert each encoding field (e.g. input_ids, attention_mask) of example idx into a tensor
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # add the converted label of the corresponding text
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [None]:
# instantiate
train_dataset = MyDataset(train_encodings, train_labels_encoded)
test_dataset = MyDataset(test_encodings, test_labels_encoded)
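
To see what `__getitem__` does without needing torch, here is a minimal sketch under my own assumptions: `ToyDataset` is a hypothetical stand-in for `MyDataset`, and the token ids and labels are made up. The tokenizer returns its output column-wise (one list per field), and `__getitem__` picks out position `idx` from every field to build one example.

```python
class ToyDataset:
    """Hypothetical torch-free stand-in for MyDataset, for illustration only."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # take row idx from every column of the encodings
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

# made-up encodings for two short sentences
toy_encodings = {
    "input_ids": [[101, 7, 102], [101, 8, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy = ToyDataset(toy_encodings, labels=[0, 2])
print(toy[1])
```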

Special token indexes:

* 0   - [PAD] - padding token
* 101 - [CLS] - classification token
* 102 - [SEP] - separator token

https://huggingface.co/docs/transformers/glossary

In [None]:
# sample of tokenized instance
' '.join(train_dataset.encodings[0].tokens[0:100])

'[CLS] C ##ir ##cula ##tion revenue has increased by 5 % in Finland and 4 % in Sweden in 2008 . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'
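
The sample above shows all four steps at once: word pieces (`C ##ir ##cula ##tion`), special tokens, and padding. The sketch below imitates them with a greedy longest-match-first split. It is a simplified illustration, not the real tokenizer: `toy_vocab` is a tiny made-up vocabulary, and `wordpiece` and `encode` are hypothetical helpers that skip details such as punctuation handling.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]                   # no piece matched: unknown word
        pieces.append(piece)
        start = end
    return pieces

def encode(sentence, vocab, max_length):
    """Add special tokens, truncate, and pad to exactly max_length tokens."""
    tokens = ["[CLS]"]
    for word in sentence.split():
        tokens.extend(wordpiece(word, vocab))
    tokens = tokens[:max_length - 1] + ["[SEP]"]       # truncation, leaving room for [SEP]
    tokens += ["[PAD]"] * (max_length - len(tokens))   # padding
    return tokens

# tiny made-up vocabulary, just enough for this sentence
toy_vocab = {"C", "##ir", "##cula", "##tion", "revenue", "has", "increased"}
print(encode("Circulation revenue has increased", toy_vocab, max_length=10))
```

With a smaller `max_length`, the same function cuts the sentence short instead of padding it, which is what `truncation=True` does for texts longer than 512 tokens.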

* Employ DistilBERT, a smaller version of BERT


The classification head of DistilBertForSequenceClassification takes the final [CLS] embedding and outputs one score per class (these scores are called 'logits'). The index of each logit corresponds to the integer used to annotate that class. That is why sentiment labels stored as strings usually need to be encoded as integers first.


This can also be seen in the output of the code below. 

"id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
}


For example, when the highest logit is at index 0, the model outputs "LABEL_0", which we interpret as 'negative'. Fine-tuning therefore trains the model to produce the highest logit at the index that matches the encoded annotation.

This is why 'np.argmax()' appears in the custom metric definition below.
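
A small sketch of that index-to-label lookup, under stated assumptions: only 0 = 'negative' is given above, so the names for indexes 1 and 2 are my assumption about the dataset's label order, and the logits are made-up numbers rather than real model output.

```python
# Assumed label names; only index 0 ('negative') is stated in this notebook.
id2label = {0: "negative", 1: "neutral", 2: "positive"}

def predict_label(logits):
    """Return the label whose index holds the largest logit."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best_index]

example_logits = [-1.3, 0.2, 2.7]     # made-up scores for one sentence
print(predict_label(example_logits))  # positive
```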


In [None]:
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=3)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier.bia

In [None]:
training_args = TrainingArguments(
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    learning_rate=5e-5,              # initial learning rate for Adam optimizer
    warmup_steps=100,                # number of warmup steps for learning rate scheduler (set lower because of small dataset size)
    weight_decay=0.01,               # strength of weight decay
    output_dir='./results',          # output directory
    logging_dir='./logs',            # directory for storing logs
    logging_steps=100,               # number of steps to output logging (set lower because of small dataset size)
    evaluation_strategy='epoch',     # evaluate during fine-tuning so that we can see progress
)
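
To get a feel for these settings, here is a rough sketch of the arithmetic behind them. It assumes a single device, no gradient accumulation, and the Trainer's default linear learning rate schedule. The helper names `total_steps` and `linear_schedule_lr` are hypothetical, not part of transformers.

```python
import math

def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps when the last, smaller batch of each epoch is kept."""
    return math.ceil(num_examples / batch_size) * epochs

def linear_schedule_lr(step, base_lr, warmup_steps, total):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total - step) / (total - warmup_steps))

steps = total_steps(num_examples=1811, batch_size=16, epochs=3)
print(steps)  # 342

for step in (0, 50, 100, 221, 342):
    print(step, linear_schedule_lr(step, 5e-5, 100, steps))
```

With 1811 training sentences and a batch size of 16 there are 114 steps per epoch, so the 100 warmup steps take up most of the first epoch. A larger warmup value would leave little training time at the full learning rate on a dataset this small.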

In [None]:
from datasets import load_metric

More on logits and argmax for custom metric functions (Stack Overflow):


Logits are usually the output tensor of a classification network (here, the result of passing the [CLS] token embedding through a sequence classification head such as BertForSequenceClassification). They hold unnormalized scores, that is, values that have not been scaled into probabilities between 0 and 1.

np.argmax gives you the index of the maximum value along the specified axis, which corresponds to the predicted class.

For instance, suppose the logits look like this:

logits = [ [ 10, 500, -1, 0.5, 12 ] ]

The tensor shape is [1, 5]. Looking at the values, the class with the highest score is the one at position 1, with value 500.

How can you extract the position of the highest value? You can use argmax.

top = np.argmax(logits, -1)

Once executed, top is [1]: one index per row, here position 1.
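
To make the "unnormalized" part concrete, here is a minimal sketch in plain Python (no numpy, so the `softmax` and `argmax` helpers are my own hypothetical versions). It turns the example logits into probabilities and checks that the position of the largest value does not change, which is why the metric function can apply argmax to the raw logits directly.

```python
import math

def softmax(logits):
    """Scale logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value in a flat list."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [10, 500, -1, 0.5, 12]   # one row of the example above
probs = softmax(logits)

print(argmax(logits), argmax(probs))
print(sum(probs))
```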



In [None]:
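def compute_metrics(eval_pred):
    # Load the four metrics. Note: load_metric was deprecated in later releases
    # of `datasets` in favour of the separate `evaluate` library.
    metric1 = load_metric("precision")
    metric2 = load_metric("recall")
    metric3 = load_metric("accuracy")
    metric4 = load_metric("f1")

    # eval_pred holds the raw logits and the true labels of the evaluation set
    logits, labels = eval_pred
    # the predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    # "weighted" averages the per-class scores, weighting each class by its number of true instances
    precision = metric1.compute(predictions=predictions, references=labels, average="weighted")["precision"]
    recall = metric2.compute(predictions=predictions, references=labels, average="weighted")["recall"]
    accuracy = metric3.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = metric4.compute(predictions=predictions, references=labels, average="weighted")["f1"]
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}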
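
To show what average="weighted" computes, here is a small pure-Python sketch under my own assumptions: `weighted_scores` is a hypothetical helper, the labels are made up, and a class that is never predicted gets precision 0. Each per-class score is weighted by that class's share of the true labels.

```python
from collections import Counter

def weighted_scores(predictions, references):
    """Support-weighted precision, recall and F1, plus plain accuracy."""
    support = Counter(references)   # number of true instances per class
    n = len(references)
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(p == cls and r == cls for p, r in zip(predictions, references))
        predicted = sum(p == cls for p in predictions)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n          # class weight = share of true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(p == r for p, r in zip(predictions, references)) / n
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

# made-up predictions and labels for six sentences
scores = weighted_scores(predictions=[0, 1, 2, 2, 2, 1], references=[0, 1, 1, 2, 2, 2])
print(scores)
```

With these weights, recall always equals accuracy, because the per-class supports cancel out and what is left is the number of correct predictions divided by the total. So in a setup like this one, the recall column carries no information beyond the accuracy column.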

In [None]:
trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset,           # evaluation dataset 
    compute_metrics=compute_metrics      # our custom evaluation function 
)

When training for more than 3 epochs, the validation loss tends to rise again, which suggests overfitting. So I set the number of epochs to 3.

In [None]:
trainer.train()

***** Running training *****
  Num examples = 1811
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 342


| Epoch | Training Loss | Validation Loss | Precision | Recall | Accuracy | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.7048 | 0.122034 | 0.962093 | 0.962472 | 0.962472 | 0.961894 |
| 2 | 0.1607 | 0.076482 | 0.975665 | 0.975717 | 0.975717 | 0.975665 |
| 3 | 0.0531 | 0.075605 | 0.984522 | 0.984547 | 0.984547 | 0.984491 |


***** Running Evaluation *****
  Num examples = 453
  Batch size = 20
***** Running Evaluation *****
  Num examples = 453
  Batch size = 20
***** Running Evaluation *****
  Num examples = 453
  Batch size = 20


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=342, training_loss=0.27392478086795025, metrics={'train_runtime': 7172.7785, 'train_samples_per_second': 0.757, 'train_steps_per_second': 0.048, 'total_flos': 132133929481404.0, 'train_loss': 0.27392478086795025, 'epoch': 3.0})

* Save model for reuse 

In [None]:
trainer.save_model("distilbert_platform_growth")

Saving model checkpoint to distilbert_platform_growth
Configuration saved in distilbert_platform_growth/config.json
Model weights saved in distilbert_platform_growth/pytorch_model.bin
