<a href="https://colab.research.google.com/github/rahiakela/transformers-for-natural-language-processing/blob/main/2-fine-tuning-BERT-models/BERT_fine_tuning_for_sentence_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## BERT Fine-Tuning for Sentence Classification

In this notebook, we will fine-tune a BERT model on the downstream task of acceptability judgments and evaluate its predictions with the Matthews Correlation Coefficient (MCC).


[Reference Article by Chris McCormick and Nick Ryan](https://mccormickml.com/2019/07/22/BERT-fine-tuning/)
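
For readers new to the metric, the sketch below computes MCC for binary labels directly from confusion-matrix counts. The function name `mcc_binary` and the toy labels are illustrative assumptions, not part of the notebook; in practice a library routine such as scikit-learn's `matthews_corrcoef` would normally be used.

```python
import math

def mcc_binary(y_true, y_pred):
    """Matthews Correlation Coefficient for 0/1 labels (illustrative helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when any marginal count is zero
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Toy labels (made up for illustration): 1 = acceptable, 0 = unacceptable
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 0, 0]
toy_mcc = mcc_binary(toy_true, toy_pred)
print(round(toy_mcc, 3))  # 0.577
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), which makes it more informative than accuracy when the two classes are unbalanced.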

## Setup

Training a multi-head attention transformer model, including fine-tuning it, relies on the parallel processing that GPUs provide.

The program starts by checking whether a GPU is available:

In [None]:
# Colab magic command selecting TensorFlow 2.x. It is kept on its own line
# because a line magic receives the rest of the line as its argument.
%tensorflow_version 2.x
import tensorflow as tf

device_name = tf.test.gpu_device_name()
if device_name != "/device:GPU:0":
  raise SystemError("GPU device not found")
print("Found GPU at: {}".format(device_name))

print(tf.__version__)

Hugging Face provides modules for both TensorFlow and PyTorch. I recommend that developers become comfortable with both environments; excellent AI research teams use either or both.

In [2]:
!pip install -q transformers



In [3]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertConfig
from transformers import AdamW, BertForSequenceClassification, get_linear_schedule_with_warmup

from tqdm import tqdm, trange

import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt

We now tell PyTorch to use the Compute Unified Device Architecture (CUDA), when
it is available, to put the parallel computing power of the NVIDIA card to work
for our multi-head attention model:

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

'Tesla T4'

In [None]:
%%shell

wget https://raw.githubusercontent.com/rahiakela/transformers-for-natural-language-processing/main/2-fine-tuning-BERT-models/in_domain_train.tsv
wget https://raw.githubusercontent.com/rahiakela/transformers-for-natural-language-processing/main/2-fine-tuning-BERT-models/out_of_domain_dev.tsv

## Loading the dataset

The General Language Understanding Evaluation (GLUE) benchmark includes linguistic
acceptability as one of its tasks. The files downloaded above come from the Corpus of Linguistic Acceptability (CoLA).

In [6]:
# load the datasets
df = pd.read_csv("in_domain_train.tsv", delimiter="\t", header=None, names=["sentence_source", "label", "label_notes", "sentence"])
df.shape

(8551, 4)

A random 10-row sample shows what the acceptability judgment task looks like: each sentence is labeled according to whether it is grammatically acceptable.

In [7]:
df.sample(10)

| | sentence_source | label | label_notes | sentence |
|---|---|---|---|---|
| 4540 | ks08 | 1 | | because john persuaded sally to , he did n't h... |
| 7228 | sks13 | 1 | | i sent it to you . |
| 6073 | c_13 | 1 | | the tuna had been being eaten . |
| 6688 | m_02 | 1 | | the cook saved no scraps for the dog . |
| 2315 | l-93 | 1 | | the oil separated from the vinegar . |
| 609 | bc01 | 1 | | john hit the stone against the wall . |
| 7529 | sks13 | 0 | * | himself likes john . |
| 3568 | ks08 | 0 | * | i am anxious for you should study english gram... |
| 7261 | sks13 | 1 | | the three sunbathers went swimming . |
| 5641 | c_13 | 1 | | the king loved peanut butter cookies . |


Each sample in the .tsv files contains four tab-separated columns:

- Column 1: the source of the sentence (code)
- Column 2: the label (0=unacceptable, 1=acceptable)
- Column 3: the label as originally annotated by the source's author (`*` marks an unacceptable sentence)
- Column 4: the sentence to be classified
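
As a minimal sketch of this four-column layout, the snippet below parses two rows with Python's standard `csv` module. The rows are copied from the sample shown earlier; the in-memory string and the variable names are assumptions made for illustration, whereas the notebook itself reads the real file with `pd.read_csv`.

```python
import csv
import io

# Two rows taken from the sample above, with tabs between the four columns
raw_tsv = (
    "sks13\t1\t\ti sent it to you .\n"
    "sks13\t0\t*\thimself likes john .\n"
)

columns = ["sentence_source", "label", "label_notes", "sentence"]
# QUOTE_NONE: treat any quote characters inside sentences as ordinary text
reader = csv.reader(io.StringIO(raw_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [dict(zip(columns, fields)) for fields in reader]

for row in rows:
    verdict = "acceptable" if row["label"] == "1" else "unacceptable"
    print(f'{row["sentence"]!r} -> {verdict}')
```

Note that the third column is empty for acceptable sentences, so it carries no information beyond what the numeric label already provides.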

## Preparing input for BERT

We will create the sentence and label lists and add the special BERT tokens.

In [10]:
# Creating sentence, label lists and adding Bert tokens
sentences = df.sentence.values

# Adding CLS and SEP tokens at the beginning and end of each sentence for BERT
sentences = ["[CLS]" + sentence + "[SEP]" for sentence in sentences]
labels = df.label.values

sentences[:5]

["[CLS]our friends wo n't buy this analysis , let alone the next one we propose .[SEP]",
 "[CLS]one more pseudo generalization and i 'm giving up .[SEP]",
 "[CLS]one more pseudo generalization or i 'm giving up .[SEP]",
 '[CLS]the more we study verbs , the crazier they get .[SEP]',
 '[CLS]day by day the facts are getting murkier .[SEP]']

## Activating the BERT tokenizer

We will initialize a pretrained BERT tokenizer. This will save the time
it would take to train it from scratch.

The program loads the uncased tokenizer and displays the first
tokenized sentence:

In [12]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

print("Tokenize the first sentence:")
print(tokenized_texts[0])

Tokenize the first sentence:
['[CLS]', 'our', 'friends', 'wo', 'n', "'", 't', 'buy', 'this', 'analysis', ',', 'let', 'alone', 'the', 'next', 'one', 'we', 'propose', '.', '[SEP]']


## Processing the data

We need to set a fixed maximum length and process the data for the model. The sentences in the datasets are short, but to be safe the program sets the maximum sequence length to 128. Longer sequences are truncated and shorter ones are padded:

In [13]:
# Set the maximum sequence length. The longest sequence in our training set is 47, but we'll leave room on the end anyway. 
# In the original paper, the authors used a length of 512.
MAX_LEN = 128

# Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

# Pad our input tokens
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
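
To make the padding step concrete, here is a pure-Python sketch of what post-truncation and post-padding do to a list of token IDs. The helper `pad_post` and the toy ID values are hypothetical; the notebook relies on Keras's `pad_sequences` for the real work.

```python
def pad_post(sequences, max_len, pad_id=0):
    """Truncate each sequence at the end, then right-pad it with pad_id."""
    padded = []
    for seq in sequences:
        clipped = list(seq[:max_len])                    # truncating="post"
        clipped += [pad_id] * (max_len - len(clipped))   # padding="post"
        padded.append(clipped)
    return padded

# Toy token IDs (made up), padded to a length of 6 instead of 128 for readability
toy_ids = [[11, 12, 13], [21, 22, 23, 24, 25, 26, 27, 28]]
toy_padded = pad_post(toy_ids, max_len=6)
print(toy_padded)
# [[11, 12, 13, 0, 0, 0], [21, 22, 23, 24, 25, 26]]
```

Every row ends up with the same length, which is what allows the sequences to be stacked into a single tensor later on.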

## Creating attention masks

Now comes a tricky part of the process. We padded the sequences in the previous cell. But we want to prevent the model from performing attention on those padded tokens!

The idea is to apply a mask with a value of 1 for each token, which will be followed by 0s for padding:

In [14]:
attention_masks = []

# Create a mask of 1s for each token followed by 0s for padding.
# Padding positions hold ID 0, so i > 0 is True only for real tokens.
for seq in input_ids:
  seq_mask = [float(i > 0) for i in seq]
  attention_masks.append(seq_mask)
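
The sketch below illustrates, in simplified form, why the mask matters: positions with a mask value of 0 receive no weight once a softmax is applied over the attention scores. The scores, the helper `masked_softmax`, and the toy IDs are assumptions for illustration only; the actual masking happens inside the BERT implementation.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, ignoring positions where mask is 0.

    Assumes at least one position is unmasked.
    """
    # Masked positions become -inf, and exp(-inf) == 0 contributes nothing
    masked = [s if m else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

toy_seq = [11, 12, 13, 0, 0]                 # made-up IDs; 0 is padding
toy_mask = [float(i > 0) for i in toy_seq]   # same rule as the cell above
toy_scores = [2.0, 1.0, 0.5, 3.0, 3.0]       # arbitrary attention scores

weights = masked_softmax(toy_scores, toy_mask)
print(toy_mask)   # [1.0, 1.0, 1.0, 0.0, 0.0]
print([round(w, 3) for w in weights])
```

Even though the two padded positions have the highest raw scores in this toy example, they end up with a weight of exactly zero, and the remaining weights still sum to 1.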

## Splitting data into training and validation sets