# Fine-Tuning BERT Using Azure Machine Learning Notebooks and Hugging Face Tools

## Introduction

In this lab we will fine-tune a BERT model on an Azure Machine Learning (AML) GPU instance, using tools, models and datasets from Hugging Face.  In particular, we will add a _classification head_ on top of a pre-trained instance of BERT and fine-tune it for _sentiment analysis_ using an appropriate dataset.  We will use the [PyTorch](https://pytorch.org/) deep-learning framework on a small GPU such as the T4.

We will see two different ways to perform this task:

1.  Using a Hugging Face class that is specifically designed for sequence classification tasks
2.  Using a custom-written class


## Prerequisites
To run this lab you need to have the following:
* An AML GPU-based compute instance
* A conda environment named `bert_ft`, created using the enclosed `1_setup_conda.sh` file

## Tools Used
The Python tools used in this lab are the following open-source Hugging Face tools:

* [Transformers](https://huggingface.co/docs/transformers/v4.17.0/en/index) - Implementation of a number of deep-learning models using the Transformer architecture





## Imports & Definitions
In this section we will import the classes we need and set up some definitions to be used later on.

In [54]:
from transformers import BertModel, BertTokenizer, BertForSequenceClassification
import datasets
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import random
import pandas as pd

pd.options.display.max_colwidth = 100

# The pre-trained BERT model name
model_name      = "google-bert/bert-base-uncased"

# The sentiment analysis dataset used for fine-tuning
dataset_name    = "stanfordnlp/sentiment140"

# Num of instances from dataset to use for fine-tuning
num_data_samples = 5000

# GPU Device 
device = 'cuda:0'

# Random seed
seed_val = 42

# Evaluation texts for testing the sentiment predictions of the fine-tuned model
evaluation_texts = [
    'The critics praised this movie but I hated it!', 
    'I loved the new restaurant despite its bad service.',
]


# The following are taken from the BERT paper (https://arxiv.org/abs/1810.04805)

# Batch size for fine-tuning 
batch_size = 32

# Number of epochs for fine-tuning
epochs=2

# Learning rate 
learning_rate = 2e-5
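
These settings determine how many optimizer steps the fine-tuning runs will take. The following is a small arithmetic sketch in plain Python, assuming the values defined above and the `batch_idx % 40 == 0` logging check used in the training loops later in this lab; the variable names `log_every`, `batches_per_epoch`, `total_steps` and `logged_batches` are illustrative only.

```python
import math

# Values copied from the definitions above
num_data_samples = 5000
batch_size = 32
epochs = 2
log_every = 40  # assumption: mirrors the `batch_idx % 40 == 0` check in the training loops

# A PyTorch DataLoader keeps the last, smaller batch by default, hence the ceiling
batches_per_epoch = math.ceil(num_data_samples / batch_size)
total_steps = batches_per_epoch * epochs

# Batch indices at which a loss value gets printed
logged_batches = [i for i in range(batches_per_epoch) if i % log_every == 0]

print(batches_per_epoch)  # 157
print(total_steps)        # 314
print(logged_batches)     # [0, 40, 80, 120]
```

This is why each epoch in the training output reports losses for batches 0, 40, 80 and 120 only.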



## Load and Prepare the Data

In [55]:
# Download the tokenizer
tokenizer = BertTokenizer.from_pretrained(model_name)

# Download the data and sample
dataset = datasets.load_dataset(dataset_name)
df = dataset['train'].to_pandas().sample(num_data_samples)

# Sentiment140 stores polarity as 0 (negative) or 4 (positive); map it to class indices 0 and 1
labels = df['sentiment'].apply(lambda x: 0 if x == 0 else 1)

df[['text', 'sentiment']].head(10)

| | text | sentiment |
|---|---|---|
| 541200 | @chrishasboobs AHHH I HOPE YOUR OK!!! | 0 |
| 750 | @misstoriblack cool , i have no tweet apps for my razr 2 | 0 |
| 766711 | @TiannaChaos i know just family drama. its lame.hey next time u hang out with kim n u guys like... | 0 |
| 285055 | School email won't open and I have geography stuff on there to revise! \*Stupid School\* :'( | 0 |
| 705995 | upper airways problem | 0 |
| 379611 | Going to miss Pastor's sermon on Faith... | 0 |
| 1189018 | on lunch....dj should come eat with me | 4 |
| 667030 | @piginthepoke oh why are you feeling like that? | 0 |
| 93541 | gahh noo!peyton needs to live!this is horrible | 0 |
| 1097326 | @mrstessyman thank you glad you like it! There is a product review bit on the site Enjoy knitti... | 4 |
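
The `sentiment` column holds the raw polarity values (0 and 4 in the sample above), while the classifier needs consecutive class indices. A minimal sketch of the mapping applied to the `sentiment` column, written in plain Python; the function name `to_binary_label` and the input list are illustrative and not part of the lab code.

```python
# Illustrative raw polarity values: 0 = negative, 4 = positive
raw_sentiments = [0, 0, 4, 0, 4]

def to_binary_label(sentiment):
    """Map the raw polarity to a class index: 0 = negative, 1 = positive."""
    return 0 if sentiment == 0 else 1

binary_labels = [to_binary_label(s) for s in raw_sentiments]
num_labels = len(set(binary_labels))

print(binary_labels)  # [0, 0, 1, 0, 1]
print(num_labels)     # 2
```

The number of distinct labels is what is later passed as `num_labels` when the classification model is created.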


In [56]:
# Encoding our data in preparation for fine-tuning

encoded_texts = tokenizer.batch_encode_plus(
    df['text'].values,                  # Data is in the `text` column
    add_special_tokens=True,            # Make sure to wrap the text in [CLS] and [SEP] tokens
    max_length=64,                      # Maximum number of tokens for a single text
    truncation=True,                    # Truncate texts that exceed max_length
    padding='longest',                  # Padding strategy:  pad to the longest text in the batch
    return_attention_mask=True,         # Return attention mask:  Valid vs. padding tokens
    return_tensors='pt'                 # Return as PyTorch tensors
)

# Prepare PyTorch Dataset 

dataset = torch.utils.data.TensorDataset(
    encoded_texts['input_ids'],         # Encoded data
    encoded_texts['attention_mask'],    # Attention masks 
    torch.Tensor(labels.values).long()  # Sentiment values as long values
)
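
To make the role of padding and the attention mask concrete, here is a simplified plain-Python sketch of what the tokenizer does conceptually with `padding='longest'`. The helper `pad_batch` is hypothetical, and the token IDs are illustrative apart from 101 (`[CLS]`), 102 (`[SEP]`) and 0 (`[PAD]`) in the `bert-base-uncased` vocabulary. Unlike the real tokenizer, this naive truncation does not preserve the trailing `[SEP]` token.

```python
def pad_batch(sequences, max_length, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# Two already-tokenized texts of different lengths (illustrative IDs)
batch = [[101, 7592, 102], [101, 2023, 2003, 2307, 102]]
input_ids, attention_mask = pad_batch(batch, max_length=64)

print(input_ids)       # [[101, 7592, 102, 0, 0], [101, 2023, 2003, 2307, 102]]
print(attention_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding to the longest text in the batch rather than always to `max_length` keeps the tensors as small as the data allows.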

## Quick Introduction to BERT

[BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2018)](https://arxiv.org/abs/1810.04805) is a language model introduced in 2018 by Google researchers for generating text representations using the [Transformer architecture](https://arxiv.org/abs/1706.03762).  Because the representation of each token is conditioned on both its left and right context (hence the 'bidirectional' name), BERT is able to generate very powerful representations - known as 'embeddings' - which can then be used for downstream tasks such as classification or question answering.

Text is encoded into BERT in the following manner:

`[CLS] token_1 token_2 ... token_n [SEP] token_n+1 ... token_m [SEP]`

where `[CLS]` is the _classification token_ that enables BERT to be used for classification tasks and `[SEP]` is a _separator_ token that allows each input to optionally contain two sentences.  BERT accepts a maximum of 512 tokens per input.
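
The layout above can be sketched in a few lines of plain Python. This is an illustration only: the helper `build_input` is hypothetical, the texts are assumed to be already split into tokens, and the segment IDs correspond to what the tokenizer returns as `token_type_ids` (0 for the first sentence, 1 for the second).

```python
def build_input(tokens_a, tokens_b=None):
    """Wrap one or two token lists in [CLS]/[SEP] and build the matching segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

single_tokens, single_segments = build_input(["great", "movie"])
pair_tokens, pair_segments = build_input(["how", "are", "you"], ["fine", "thanks"])

print(single_tokens)  # ['[CLS]', 'great', 'movie', '[SEP]']
print(pair_tokens)    # ['[CLS]', 'how', 'are', 'you', '[SEP]', 'fine', 'thanks', '[SEP]']
print(pair_segments)  # [0, 0, 0, 0, 0, 1, 1, 1]
```

For sentiment analysis each input is a single text, so only the first form is needed.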

For sentiment analysis we will fine-tune BERT on a labelled dataset so that a classifier on top of the `[CLS]` token's representation produces a score between 0 (negative sentiment) and 1 (positive sentiment).
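
In this lab such a score is obtained by applying softmax to the two outputs of the classification head, one per class. A minimal plain-Python sketch of that step follows; the logit values are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for [negative, positive]
logits = [-1.5, 2.0]
probs = softmax(logits)

print(f"Negative: {probs[0]:.3f}, Positive: {probs[1]:.3f}")
```

The larger the gap between the two raw scores, the closer the positive probability gets to 0 or 1.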


## First Approach:  Fine-Tuning BERT with Hugging Face Classes

In this section we will fine-tune the pretrained BERT model using Hugging Face's `BertForSequenceClassification` class.  This class extends the plain BERT model class with an additional classification head on top of BERT, composed of a feed-forward layer and a dropout layer.  We will then try the model on the evaluation texts from above.

In [59]:
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)


# Prepare the pretrained model
model = BertForSequenceClassification.from_pretrained(
    model_name,                       # The pretrained model
    num_labels=len(labels.unique()),  # Number of labels for the classification task
    output_attentions=False,          
    output_hidden_states = False, 
    ).to(device)

# Set up optimizer and data loader
adam = torch.optim.Adam(model.parameters(), lr=learning_rate)
loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Training Loop
for epoch in range(epochs):
    print (f"Epoch {epoch}\n---------")
    model.train()
    losses = []

    for batch_idx, batch in enumerate(loader):
        # Use `batch_labels` so that the `labels` Series defined earlier is not overwritten
        input_ids, attention_masks, batch_labels = (t.to(device) for t in batch)
        
        model.zero_grad()

        # Note that the HF classes are able to calculate the loss themselves
        loss, logits = model(input_ids, token_type_ids=None, attention_mask=attention_masks, labels=batch_labels, return_dict=False)
        
        losses.append(loss.item())
        loss.backward()
        adam.step()

        if batch_idx % 40 == 0:
            print(f"Loss for batch {batch_idx}: ", end="")
            print(loss.item())
    print("\n")
    print(f"Average loss: {np.mean(losses)}\n")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 0
---------
Loss for batch 0: 0.6926805377006531
Loss for batch 40: 0.5924568176269531
Loss for batch 80: 0.4777229428291321
Loss for batch 120: 0.46647247672080994


Average loss: 0.5119854244077282

Epoch 1
---------
Loss for batch 0: 0.36755692958831787
Loss for batch 40: 0.3337683379650116
Loss for batch 80: 0.4166205823421478
Loss for batch 120: 0.29434990882873535


Average loss: 0.31982326730611216
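
The loss for the very first batch sits close to 0.693, which is what cross-entropy gives when a two-class model spreads its probability evenly, as expected from a freshly initialized classification head. A plain-Python check of that arithmetic (the helper `cross_entropy` is illustrative, not the PyTorch function):

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target])

num_labels = 2
uniform = [1.0 / num_labels] * num_labels  # a model that has learned nothing yet
chance_loss = cross_entropy(uniform, target=0)

print(round(chance_loss, 4))  # 0.6931
```

Losses clearly below this value indicate that the model has started to separate the two classes.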



In [60]:
model.eval()

eval_df = pd.DataFrame(data=evaluation_texts, columns=['text'])

with torch.no_grad():
    encoded_texts = tokenizer.batch_encode_plus(
        eval_df['text'],
        add_special_tokens=True, max_length=64, 
        padding='longest', 
        return_attention_mask=True, 
        return_tensors='pt')
    
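    # The Hugging Face classes return a number of different outputs - with no labels passed,
    # the first output holds the classification logits
    eval_results = model(
        encoded_texts['input_ids'].to(device),
        attention_mask=encoded_texts['attention_mask'].to(device),  # Ignore padding tokens
        return_dict=False)[0]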
    eval_results = F.softmax(eval_results, dim=1).cpu().numpy()

    eval_df['Negative'] = eval_results[:, 0]
    eval_df['Positive'] = eval_results[:, 1]

eval_df

| | text | Negative | Positive |
|---|---|---|---|
| 0 | The critics praised this movie but I hated it! | 0.977916 | 0.022084 |
| 1 | I loved the new restaurant despite its bad service. | 0.087739 | 0.912261 |
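
To turn the two probability columns into a single predicted label, one can pick the class with the highest probability. A small plain-Python sketch using the values from the table above; the names `rows`, `class_names` and `predictions` are illustrative.

```python
# (text, negative probability, positive probability) copied from the results above
rows = [
    ("The critics praised this movie but I hated it!", 0.977916, 0.022084),
    ("I loved the new restaurant despite its bad service.", 0.087739, 0.912261),
]
class_names = ["Negative", "Positive"]

predictions = []
for text, negative, positive in rows:
    probs = [negative, positive]
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    predictions.append(class_names[best])

print(predictions)  # ['Negative', 'Positive']
```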


## Second Approach:  Fine-Tuning BERT with Custom Classes

In this section we will fine-tune the pretrained BERT model using a custom-written class.  This gives us more flexibility, and we will also create our own feed-forward and dropout layers on top of the existing model.
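
The custom head defined in the next cell maps BERT's 768-dimensional pooled output to 10 hidden units and then to 2 classes. As a rough plain-Python sketch of how small that head is, the helper `linear_params` (a hypothetical name) counts the weights and biases of a fully connected layer; dropout adds no trainable parameters.

```python
def linear_params(in_features, out_features):
    """Trainable parameters of a fully connected layer: weight matrix plus bias vector."""
    return in_features * out_features + out_features

hidden_size = 768   # size of the pooled output of bert-base-uncased
head_params = linear_params(hidden_size, 10) + linear_params(10, 2)

print(head_params)  # 7712
```

Almost all trainable parameters therefore still belong to the pretrained BERT encoder, which is fine-tuned together with the head.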

---

In [62]:
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Our custom model - using 2 Linear layers and dropout
class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.linear1 = nn.Linear(768, 10)
        self.linear2 = nn.Linear(10, 2)
        self.dropout = nn.Dropout(0.3)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, return_dict=None, labels=None):
        ret = self.bert(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask, return_dict=return_dict)
        pooler_output = ret[1]
        ret = F.gelu(self.linear1(pooler_output))
        ret = self.dropout(ret)
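        # Return raw logits:  nn.CrossEntropyLoss applies log-softmax itself,
        # and softmax is applied explicitly at evaluation time
        return self.linear2(ret)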

model = CustomModel().to(device)

adam = torch.optim.Adam(model.parameters(), lr=learning_rate)
loss_fn = nn.CrossEntropyLoss()
loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)





for epoch in range(epochs):
    print (f"Epoch {epoch}\n---------")
    model.train()
    losses = []

    for batch_idx, batch in enumerate(loader):
        # Use `batch_labels` so that the `labels` Series defined earlier is not overwritten
        input_ids, attention_masks, batch_labels = (t.to(device) for t in batch)
        
        model.zero_grad()

        # With a custom class it is up to us to calculate the loss for each batch
        res = model(input_ids, token_type_ids=None, attention_mask=attention_masks, labels=batch_labels, return_dict=False)
        loss = loss_fn(res, batch_labels)
        
        losses.append(loss.item())
        loss.backward()
        adam.step()

        if batch_idx % 40 == 0:
            print(f"Loss for batch {batch_idx}: ", end="")
            print(loss.item())
    print("\n")
    print(f"Average loss: {np.mean(losses)}\n")

   

Epoch 0
---------
Loss for batch 0: 0.7031998634338379
Loss for batch 40: 0.5776113867759705
Loss for batch 80: 0.535115659236908
Loss for batch 120: 0.5173466205596924


Average loss: 0.5989916662501681

Epoch 1
---------
Loss for batch 0: 0.4463011622428894
Loss for batch 40: 0.5075926184654236
Loss for batch 80: 0.5965098142623901
Loss for batch 120: 0.5559391975402832


Average loss: 0.50429661391647
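
When computing the loss by hand it matters what is passed to `nn.CrossEntropyLoss`: it applies log-softmax internally, so it expects raw logits rather than probabilities. The plain-Python sketch below (not PyTorch code; all helper names are illustrative) shows that feeding already-normalized probabilities into such a loss puts a floor under it, no matter how confident the model is.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """Log-softmax followed by negative log-likelihood, as a logits-based loss does."""
    return -math.log(softmax(logits)[target])

# A very confident and correct prediction for class 1
confident_logits = [-10.0, 10.0]

loss_from_logits = cross_entropy_from_logits(confident_logits, target=1)
# Passing probabilities instead of logits applies softmax a second time
loss_from_probs = cross_entropy_from_logits(softmax(confident_logits), target=1)

# With two classes the doubly normalized loss cannot drop below ln(1 + e^-1)
floor = math.log(1 + math.exp(-1))

print(f"{loss_from_logits:.6f}")  # 0.000000
print(f"{loss_from_probs:.4f}")   # 0.3133
print(f"{floor:.4f}")             # 0.3133
```

For the same reason, applying softmax twice caps the largest probability a two-class model can report at about 0.73.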



In [75]:
model.eval()
eval_df = pd.DataFrame(data=evaluation_texts, columns=['text'])

with torch.no_grad():
    encoded_texts = tokenizer.batch_encode_plus(
        eval_df['text'],
        add_special_tokens=True, max_length=64, 
        padding='longest', 
        return_attention_mask=True, 
        return_tensors='pt')
    
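    # Our custom class returns a single tensor, so no indexing into the outputs is needed
    eval_results = model(
        encoded_texts['input_ids'].to(device),
        attention_mask=encoded_texts['attention_mask'].to(device),  # Ignore padding tokens
        return_dict=False)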
    eval_results = F.softmax(eval_results, dim=1).cpu().numpy()

    eval_df['Negative'] = eval_results[:, 0]
    eval_df['Positive'] = eval_results[:, 1]

eval_df



| | text | Negative | Positive |
|---|---|---|---|
| 0 | The critics praised this movie but I hated it! | 0.725574 | 0.274426 |
| 1 | I loved the new restaurant despite its bad service. | 0.448317 | 0.551683 |
