# Lab 10: Transfer Learning with BERT

In this lab we will fine-tune a BERT model to carry out a sequence classification task.

We will load text from a small number of books in the Gutenberg corpus and see whether we can train a classifier to predict which book a piece of text comes from.

First, let's pull up all of the filenames available.

In [1]:
import pandas as pd

In [2]:
try:
  from google.colab import drive
  drive.mount('/content/drive')
  IN_COLAB = True
except ImportError:  # google.colab is only importable inside Colab
  IN_COLAB = False
print('IN COLAB: ', IN_COLAB)

Mounted at /content/drive
IN COLAB:  True


In [3]:
import os,random,math
if IN_COLAB:
  TRAINING_DIR="/content/drive/MyDrive/MSc/modules/2.2/2.2-Language P-2/week4-NN_bigram_unigram/lab4resources_full/sentence-completion/Holmes_Training_Data" #this needs to be the parent directory for the training corpus
else:
  TRAINING_DIR="../../../week4/lab4/lab4resources/sentence-completion/Holmes_Training_Data" #this needs to be the parent directory for the training corpus

filenames=os.listdir(TRAINING_DIR)
n=len(filenames)
print("There are {} files in the training directory: {}".format(n,TRAINING_DIR))

#print(filenames)

There are 525 files in the training directory: /content/drive/MyDrive/MSc/modules/2.2/2.2-Language P-2/week4-NN_bigram_unigram/lab4resources_full/sentence-completion/Holmes_Training_Data


We are going to create a Book class to store the text from a book and do some very basic pre-processing.  We need to
* load in the text line-by-line
* get rid of header lines (the stuff in the file before the line which starts \*END\*THE SMALL PRINT!)
* make chunks of text which are longer than 1 line.  These should be easier to classify.  We will try to get sentences - but some chunks may contain multiple sentences.  We are not going to worry about this here.
* return some labelled data split randomly between training and testing
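
The chunking step in the list above can be tried out on a few in-memory lines before reading any files. This is a minimal sketch, assuming the rule that a chunk is closed whenever a line ends with a full stop; the helper name `chunk_lines` and the sample lines are made up for illustration.

```python
def chunk_lines(lines):
    """Join consecutive lines into chunks, closing a chunk at each line that ends with '.'."""
    chunks = []
    current = ""
    for line in lines:
        current += line + " "
        if line.endswith("."):
            chunks.append(current.rstrip())
            current = ""
    return chunks

# hypothetical sample lines, not taken from the corpus
sample_lines = [
    "The rain had stopped,",
    "and the street was quiet.",
    "Nobody spoke.",
    "A trailing fragment with no full stop",
]

chunks = chunk_lines(sample_lines)
for chunk in chunks:
    print(chunk)
```

Note that with this rule any text left over after the last full stop is discarded, and a line ending in `?` or `!` does not close a chunk, which is one reason some chunks contain several sentences.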

In [4]:
class Book():

    header_end="*END*THE SMALL PRINT!"
    seed=53

    def __init__(self, filename, training_dir, label=""):
        self.TRAINING_DIR = training_dir
        self.filename=filename
        self.loadfile()
        self.make_chunks()
        if label=="":
            self.label=self.filename
        else:
            self.label=label

    def loadfile(self):
        filepath=os.path.join(self.TRAINING_DIR,self.filename)
        self.lines=[]
        beyond_header=False
        try:
            with open(filepath) as instream:
                for line in instream:
                    line=line.rstrip()

                    if len(line)>0 and beyond_header:
                        self.lines.append(line)
                    if line.startswith(Book.header_end):
                        beyond_header=True
        except UnicodeDecodeError:
            print(f"UnicodeDecodeError processing {filepath}")

    def length(self):
        return len(self.chunks)

    def head(self,n=10):
        return self.chunks[:n]

    def make_chunks(self):
        self.chunks=[]

        current=""
        for line in self.lines:
            current+=line+" "
            if line.endswith("."):
                self.chunks.append(current.rstrip())
                current=""

    def get_labelled_data(self,split=0.8):
        labelled_data=[(chunk,self.label) for chunk in self.chunks]
        random.seed(Book.seed)
        random.shuffle(labelled_data)
        index=int(self.length()*split)
        return (labelled_data[:index],labelled_data[index:])


### Exercise 1
- Create an instance of a Book() and store it in the variable `emma`.  The filename is `EMMA10.TXT` and the label should be `Emma`
- Check that the number of sentences is 2028 and have a look at the first 10 sentences
- Repeat to create `ivanhoe` to store the text from `IVNHO12.TXT` with the label `Ivanhoe`.  
- The number of sentences in `ivanhoe` should be 1743

In [5]:
emma = Book('EMMA10.TXT', TRAINING_DIR, 'Emma')  # the constructor already calls loadfile() and make_chunks()

In [6]:
emma.length()

2028

In [7]:
ivanhoe = Book('IVNHO12.TXT', TRAINING_DIR, 'Ivanhoe')  # the constructor already calls loadfile() and make_chunks()

In [8]:
ivanhoe.length()

1743

### Exercise 2
- Use the `get_labelled_data()` method to get a training and testing portion from each book (split = 80%).  
- Create 2 pandas dataframes
    - 1 dataframe called `training_df` with all of the training data (from both books)
    - 1 dataframe called  `testing_df`  with all of the test data (from both books).  
    - The columns of both dataframes should be `text` and `label`

In [9]:
iv_train, iv_test = ivanhoe.get_labelled_data()
em_train, em_test = emma.get_labelled_data()

print(len(iv_train))
print(len(iv_test))
print(len(em_train))
print(len(em_test))


1394
349
1622
406


In [10]:
train = iv_train + em_train
test = iv_test + em_test

In [11]:
print(len(train))
print(len(test))

3016
755


In [12]:
train_X, train_y = zip(*train)
test_X, test_y = zip(*test)

In [13]:
training_df = pd.DataFrame({'text': train_X, 'label': train_y })
testing_df = pd.DataFrame({'text': test_X, 'label': test_y })

## Finetuning BERT to tell the difference between sentences from each book

Now we are going to build a classifier on top of BERT.  The first thing we need to do is map the informative label names we have (`Emma` and `Ivanhoe`) to integers, which is what the classifier works with.  In this simple case, we could just create a dictionary manually.  However, the code below makes a sorted list (without duplicates) of all of the label names in the two dataframes and then generates a dictionary which maps each label name to an integer.


In [14]:
#first we need a map for the labels

#make a list of all of the unique labels in the training and testing dataframes
#sorting it means that it will always be in the same (alphabetical) order rather than depending on the order in the training / testing data
labellist=sorted(list(set(training_df['label'].unique()).union(set(testing_df['label'].unique()))))

labels={label:i for i,label in enumerate(labellist)}
labels

{'Emma': 0, 'Ivanhoe': 1}

### Exercise 3
Write some code to create a reverse index for the labels which maps the numbers back to the more informative strings.

This should result in something which looks as follows:

```
reverse_index={0:'Emma',1:'Ivanhoe'}
```

But obviously, you should create it automatically from the labels dictionary rather than typing it in!

In [15]:
reverse_index = {num_label: st_label for (st_label, num_label) in labels.items()}
reverse_index


{0: 'Emma', 1: 'Ivanhoe'}

Now we need to store the data in a Dataset class.  This inherits from torch's `Dataset` class, which is what the torch DataLoader expects.  It is initialised with a dataframe, from which it extracts a list of labels and a list of texts.  It handles preprocessing, including lower-casing, tokenization, adding CLS and SEP tokens at the beginning and end, truncation and padding.  It also provides a `__getitem__()` method which allows the Dataset to be indexed like a list, i.e., `myDataset[3]` will return a pair consisting of the tokenized text and the label with index 3.
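
To make the truncation and padding step concrete, here is a small pure-Python sketch of what happens to a list of token ids. It is only an illustration under stated assumptions: the helper `pad_or_truncate` is hypothetical, the ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids in the `bert-base-uncased` vocabulary, and the real tokenizer does this work internally.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in bert-base-uncased

def pad_or_truncate(token_ids, max_length):
    """Wrap ids in [CLS] ... [SEP], truncate to max_length, then pad with [PAD]."""
    body = token_ids[: max_length - 2]        # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 marks a real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad             # 0 marks padding, which attention ignores
    return input_ids, attention_mask

# a short input is padded, a long input is truncated (ids here are arbitrary examples)
short_ids, short_mask = pad_or_truncate([2045, 23481, 1999], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every item ends up with the same length, which is what allows the DataLoader to stack items into a batch tensor.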

In [16]:
import torch
import numpy as np
from transformers import BertTokenizer, BertModel

tokenizer=BertTokenizer.from_pretrained('bert-base-uncased')

class Dataset(torch.utils.data.Dataset):

    def __init__(self,df,column='text'):
        self.labels=[labels[label] for label in df['label']]
        self.texts=[tokenizer(text.lower(),padding='max_length',max_length=512,truncation=True,return_tensors="pt") for text in df[column]]

    def classes(self):
        return self.labels

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self,idx):
        return np.array(self.labels[idx])

    def get_batch_texts(self,idx):
        return self.texts[idx]

    def __getitem__(self,idx):
        batch_texts=self.get_batch_texts(idx)
        batch_y=self.get_batch_labels(idx)

        return batch_texts,batch_y


train_data=Dataset(training_df)
test_data=Dataset(testing_df)



Let's have a look at one of the dataset items.

The `__getitem__` method allows us to index into the dataset and get a particular item.

In [17]:
train_data[0]

({'input_ids': tensor([[  101,  2045, 23481,  1999,  2009,  7132,  2791,  1998, 15003,  1025,
           1998,  2065,  1037, 28642,  2063,  1997,  1996,  2088,  1005,  1055,
           6620,  2030,  3158,  6447,  2089,  4666,  2007,  2019,  3670,  2061,
           8403,  1010,  2129,  2323,  2057,  9610,  3207,  2008,  2029,  2003,
           1997,  3011,  2005,  7682,  2070,  6120,  1997,  2049,  2434,  1029,
           2146,  1010,  2146,  2097,  1045,  3342,  2115,  2838,  1010,  1998,
          19994,  2643,  2008,  1045,  2681,  2026,  7015,  8116,  2121,  2142,
           2007,  1011,  1011,  1011,  1005,  1005,  2016,  3030,  2460,  1011,
           1011,  1011,  2014,  2159,  3561,  2007,  4000,  1012,   102,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,   

We can also just look at the text (or label) for an item as follows.  Note that the tokens have been replaced by their indices in the BERT wordpiece vocabulary.

In [18]:
train_data[0][0]['input_ids']

tensor([[  101,  2045, 23481,  1999,  2009,  7132,  2791,  1998, 15003,  1025,
          1998,  2065,  1037, 28642,  2063,  1997,  1996,  2088,  1005,  1055,
          6620,  2030,  3158,  6447,  2089,  4666,  2007,  2019,  3670,  2061,
          8403,  1010,  2129,  2323,  2057,  9610,  3207,  2008,  2029,  2003,
          1997,  3011,  2005,  7682,  2070,  6120,  1997,  2049,  2434,  1029,
          2146,  1010,  2146,  2097,  1045,  3342,  2115,  2838,  1010,  1998,
         19994,  2643,  2008,  1045,  2681,  2026,  7015,  8116,  2121,  2142,
          2007,  1011,  1011,  1011,  1005,  1005,  2016,  3030,  2460,  1011,
          1011,  1011,  2014,  2159,  3561,  2007,  4000,  1012,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,  

### Exercise 4
Can you turn the token ids back into subwords for one of the inputs?
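
As a hint: BERT's tokenizer splits rarer words into WordPiece subwords, where a piece that starts with `##` continues the previous word. `tokenizer.convert_ids_to_tokens(...)` returns these pieces as strings, whereas `tokenizer.decode(...)` joins them back into running text. The joining rule can be sketched without the tokenizer; the pieces below are hypothetical, since the real split depends on the BERT vocabulary, and `join_wordpieces` is a made-up helper.

```python
def join_wordpieces(pieces):
    """Merge WordPiece tokens into words: a piece starting with '##' continues the previous word."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words

# hypothetical pieces for illustration only
pieces = ["[CLS]", "a", "ting", "##e", "of", "pride", "[SEP]", "[PAD]"]
special = ("[CLS]", "[SEP]", "[PAD]")
words = [w for w in join_wordpieces(pieces) if w not in special]
print(words)
```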

In [48]:
decoded_text = tokenizer.decode(train_data[0][0]['input_ids'].squeeze(), skip_special_tokens=True)
decoded_text

"there reigns in it gentleness and goodness ; and if a tinge of the world's pride or vanities may mix with an expression so lovely, how should we chide that which is of earth for bearing some colour of its original? long, long will i remember your features, and bless god that i leave my noble deliverer united with - - -'' she stopped short - - - her eyes filled with tears."

Now we need to prepare the inputs for the particular device (GPU or CPU) that the model is going to be run on.  Let's just check first whether GPU / CUDA has been enabled.

In [24]:
use_cuda=torch.cuda.is_available()
if use_cuda:
  print("GPU acceleration enabled")
else:
  print("GPU acceleration NOT enabled.  If using Colab, have you changed the runtime type and selected GPU as the hardware accelerator?")
device=torch.device("cuda" if use_cuda else "cpu")

GPU acceleration enabled


In [25]:
def prepare_inputs(input1,label,device):
  label=label.to(device)
  mask=input1['attention_mask'].to(device)
  input_id=input1['input_ids'].squeeze(1).to(device)
  return (input_id,mask,label)

Let's try preparing some inputs and running them through BERT.  We will use the torch DataLoader to manage iterating over the datasets during training and testing.  Here we will just process the first item produced by the DataLoader to see the output from the pre-trained BERT model.

In [30]:
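from transformers import BertModel


bert=BertModel.from_pretrained('bert-base-uncased').to(device)

#the DataLoader batches (and here shuffles) the items in the Dataset
train_dataloader=torch.utils.data.DataLoader(train_data,batch_size=2,shuffle=True)

for train_input,train_label in train_dataloader:
    input_id,mask,label=prepare_inputs(train_input,train_label,device)
    output=bert(input_ids=input_id,attention_mask=mask,return_dict=False)
    break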

print(input_id,mask,label)

print(len(output))

output[1]

tensor([[ 101, 2720, 1012,  ...,    0,    0,    0],
        [ 101, 2021, 2002,  ...,    0,    0,    0]], device='cuda:0') tensor([[[1, 1, 1,  ..., 0, 0, 0]],

        [[1, 1, 1,  ..., 0, 0, 0]]], device='cuda:0') tensor([0, 1], device='cuda:0')
2


tensor([[-0.3721, -0.1482, -0.9126,  ..., -0.8406, -0.4007,  0.5815],
        [-0.1324, -0.0206, -0.2064,  ..., -0.5971, -0.0908, -0.2047]],
       device='cuda:0', grad_fn=<TanhBackward0>)

Now we need to construct our classification network.  Look at the code below and then answer Exercise 5.

In [38]:
#now we need to put a simple classification layer on top of BERT

from torch import nn
from transformers import BertModel

class BertClassifier(nn.Module):

    def __init__(self,dropout=0.5,num_classes=2):
        super(BertClassifier,self).__init__()

        self.bert=BertModel.from_pretrained('bert-base-uncased')
        self.dropout=nn.Dropout(dropout)
        self.linear=nn.Linear(768,num_classes)
        self.relu=nn.ReLU()

    def forward(self,input_id,mask):

        last_hidden_layer,pooled_output = self.bert(input_ids=input_id,attention_mask=mask,return_dict=False)
        dropout_output=self.dropout(pooled_output)
        linear_output=self.linear(dropout_output)
        final_layer=self.relu(linear_output)

        return final_layer

### Exercise 5

Use the definitions of the initialisation method and the forward method of the BertClassifier class to sketch out what the neural network architecture looks like.

What do you understand by the terms:
- pooled output
- dropout layer
- linear layer
- relu layer

Next, we define a function which will carry out the training of the network.  It will handle
- setting up the DataLoaders
- preparing the inputs for CPU / GPU
- carrying out training and validation for a number of epochs.  In each epoch:
    - iterate over the training data in batches
    - get the output for each input and compute the batch loss (Cross Entropy)
    - use the batch loss to carry out optimisation
    - compute the accuracy and total loss for the training data
    - iterate over the testing / validation data in batches
    - get the output for each input and compute the batch loss
    - compute the accuracy and total loss for the validation data
    - output stats
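
The loss and accuracy bookkeeping in the steps above can be checked by hand on a tiny batch. The sketch below is pure Python and only mirrors the arithmetic: it assumes the batch loss is the mean cross-entropy over the items in the batch (the default behaviour of `nn.CrossEntropyLoss`), and the logits and labels are made-up values.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# hypothetical batch of two examples with two classes (0 = Emma, 1 = Ivanhoe)
batch_logits = [[2.0, 0.0], [0.0, 0.0]]
batch_labels = [0, 1]

losses = [cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
batch_loss = sum(losses) / len(losses)  # mean over the batch

# the prediction is the index of the largest logit (ties go to the first index here)
predicted = [max(range(len(l)), key=l.__getitem__) for l in batch_logits]
correct = sum(p == y for p, y in zip(predicted, batch_labels))
print(f"batch loss = {batch_loss:.3f}, correct = {correct} of {len(batch_labels)}")
```

The second example has equal logits for both classes, so its loss is log 2 (about 0.693): this is the loss of a classifier that has learned nothing about a two-way decision, and is a useful reference point when reading the training output.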

In [39]:
#we now need a training loop

from torch.optim import Adam
from tqdm import tqdm  #useful library to report on progress through an iteration



def train(model, train_data,val_data,learning_rate,epochs):

    train_dataloader=torch.utils.data.DataLoader(train_data,batch_size=2,shuffle=True)
    val_dataloader=torch.utils.data.DataLoader(val_data,batch_size=2)

    use_cuda=torch.cuda.is_available()
    device=torch.device("cuda" if use_cuda else "cpu")

    criterion=nn.CrossEntropyLoss()
    optimizer=Adam(model.parameters(),lr=learning_rate)

    if use_cuda:
        model=model.cuda()
        criterion=criterion.cuda()

    for epoch_num in range(epochs):
        total_acc_train=0
        total_loss_train=0
        model.train()
        for train_input,train_label in tqdm(train_dataloader):

            input_id,mask, train_label=prepare_inputs(train_input,train_label,device)

            output=model(input_id,mask)

            batch_loss=criterion(output,train_label.long())
            total_loss_train +=batch_loss.item()

            acc=(output.argmax(dim=1)==train_label).sum().item()
            total_acc_train+=acc

            model.zero_grad()
            batch_loss.backward()
            optimizer.step()

        total_acc_val=0
        total_loss_val=0
        model.eval()
        with torch.no_grad():
            for val_input,val_label in val_dataloader:

                input_id,mask, val_label=prepare_inputs(val_input,val_label,device)

                output=model(input_id,mask)

                batch_loss=criterion(output,val_label.long())

                total_loss_val+=batch_loss.item()

                acc=(output.argmax(dim=1)==val_label).sum().item()
                total_acc_val+=acc

        print(f'Epochs: {epoch_num+1} | Train Loss: {total_loss_train / len(train_data):.3f} | Train Accuracy: {total_acc_train/len(train_data):.3f}')
        print(f'Val loss: {total_loss_val/len(val_data):.3f} | Val Accuracy: {total_acc_val / len(val_data):.3f}')


Here, we define the number of epochs we are going to train for, the learning rate and an instance of our BertClassifier network.

In [41]:
EPOCHS=1
model=BertClassifier(num_classes=len(labels.keys()))
LR=1e-6


Now we are actually going to train the model.  This might take some time - particularly if you are running on a CPU (1.5hrs per epoch on my laptop!)

In [42]:
train(model,train_data,test_data,LR,EPOCHS)

100%|██████████| 1508/1508 [05:45<00:00,  4.36it/s]


Epochs: 1 | Train Loss: 0.270 | Train Accuracy: 0.746
Val loss: 0.125 | Val Accuracy: 0.971


We need to be able to save the model to be able to use it elsewhere (without training again!)

In [43]:
output_dir="bert-base-uncased-bookclassifier"
torch.save(model,output_dir)

We can load it up like this.  This could be in another notebook.  If loading in another notebook, you should make sure the BertClassifier class is also defined in that notebook (along with other necessary imports).

My trained bert-base-uncased-bookclassifier is included in the resources directory as `julie-bert-base-uncased-bookclassifier`.  Use this if you don't have access to a GPU to train your own model.  Note that it is a large file (around 0.5 GB).

In [44]:
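input_dir="bert-base-uncased-bookclassifier"
#input_dir ="julie-bert-base-uncased-bookclassifier"
#note: from PyTorch 2.6, torch.load defaults to weights_only=True, so loading a whole pickled model like this needs torch.load(input_dir,weights_only=False)
complete_model=torch.load(input_dir)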

Here's an evaluation loop we can use to evaluate on some test data.  We also return the predictions, which can then be added to the dataframe.

In [45]:
batchsize=2
def evaluate(model,test_dataset):
    model.eval()
    test_dataloader=torch.utils.data.DataLoader(test_dataset,batch_size=batchsize)

    use_cuda=torch.cuda.is_available()
    device=torch.device("cuda" if use_cuda else "cpu")

    if use_cuda:
        model=model.cuda()

    total_acc_test=0
    with torch.no_grad():
        count=0
        predictions=[]
        for test_input,test_label in tqdm(test_dataloader):
            count+=batchsize
            test_label=test_label.to(device)
            mask=test_input['attention_mask'].to(device)
            input_id=test_input['input_ids'].squeeze(1).to(device)
            output=model(input_id,mask)
            #print(output.argmax(dim=1),test_label)
            predictions.append(output.argmax(dim=1))  #save the prediction for further analysis
            acc=(output.argmax(dim=1)==test_label).sum().item()

            total_acc_test+=acc
            if count%100==0:
                print(f'Accuracy so far = {total_acc_test/count: .3f}')

    print(f'Test accuracy: {total_acc_test/len(test_dataset): .3f}')
    return predictions

This takes around 6 minutes to run on my laptop.  I have the evaluation loop printing out "Accuracy so far" as I get bored waiting until the end to see the results - it also gives a very rough indication of how stable the method is.

In [46]:
predictions=evaluate(model, test_data)

 14%|█▍        | 52/378 [00:03<00:25, 12.91it/s]

Accuracy so far =  0.960


 27%|██▋       | 102/378 [00:07<00:21, 12.70it/s]

Accuracy so far =  0.965


 40%|████      | 152/378 [00:11<00:18, 12.42it/s]

Accuracy so far =  0.943


 53%|█████▎    | 202/378 [00:15<00:14, 12.34it/s]

Accuracy so far =  0.958


 67%|██████▋   | 252/378 [00:20<00:10, 12.23it/s]

Accuracy so far =  0.966


 80%|███████▉  | 302/378 [00:24<00:06, 11.96it/s]

Accuracy so far =  0.972


 93%|█████████▎| 352/378 [00:28<00:02, 11.93it/s]

Accuracy so far =  0.970


100%|██████████| 378/378 [00:30<00:00, 12.39it/s]

Test accuracy:  0.971





### Exercise 6
Add the predicted label for each test item to the dataframe with the test data.
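
One thing to watch for here is that `evaluate()` returns one entry per batch rather than one per item. A minimal sketch of the flattening step, using plain lists as stand-ins for the per-batch tensors (with real tensors, each batch could first be converted with `.tolist()`); the values and the variable names are made up for illustration:

```python
# stand-ins for the per-batch outputs of evaluate(); real entries would be tensors
batch_predictions = [[0, 1], [1, 1], [0]]  # the last batch may be smaller
reverse_index = {0: "Emma", 1: "Ivanhoe"}

# flatten the batches into one list, then map the integers back to label names
flat = [int(p) for batch in batch_predictions for p in batch]
predicted_labels = [reverse_index[p] for p in flat]
print(predicted_labels)
```

Because the evaluation DataLoader does not shuffle, the flattened list is in the same order as the rows of the test dataframe, so it can be assigned directly as a new column.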

### Exercise 7
Compute the confusion matrix / precision and recall scores for the different classes.  What does this analysis tell you about the errors?
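
If you want to check your numbers by hand, the definitions can be sketched in a few lines of pure Python. The gold and predicted labels below are hypothetical, and the helper names are made up; `sklearn.metrics` provides `confusion_matrix` and `classification_report` if you would rather use a library.

```python
from collections import Counter

def confusion_counts(gold, predicted):
    """Count how often each (gold label, predicted label) pair occurs."""
    return Counter(zip(gold, predicted))

def precision_recall(counts, label, all_labels):
    """Precision and recall for one class from the confusion counts."""
    tp = counts[(label, label)]
    fp = sum(counts[(g, label)] for g in all_labels if g != label)
    fn = sum(counts[(label, p)] for p in all_labels if p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# hypothetical gold and predicted labels
gold = ["Emma", "Emma", "Emma", "Ivanhoe", "Ivanhoe", "Ivanhoe"]
predicted = ["Emma", "Emma", "Ivanhoe", "Ivanhoe", "Ivanhoe", "Ivanhoe"]

label_names = sorted(set(gold) | set(predicted))
counts = confusion_counts(gold, predicted)
for label in label_names:
    p, r = precision_recall(counts, label, label_names)
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```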

### Extension 1
Carry out some more detailed error analysis.  Look at examples where the class is incorrectly predicted.  Do you notice anything that the examples have in common (e.g., length, vocabulary, structure)?  Can you quantify the number of errors that this effect might account for?

### Extension 2
Some might say that it is easier to distinguish these books due to particular vocabulary e.g., character names (Emma and Ivanhoe!)
Compare the performance of the BertClassifier with a Naïve Bayes classifier.
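
As a starting point, a multinomial Naïve Bayes classifier with add-one smoothing can be sketched in pure Python. This is only an illustrative baseline under simple assumptions (whitespace tokenization, lower-casing, words unseen in training ignored); the function names and the toy training sentences are made up, and you may prefer a library implementation for the real comparison.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (text, label). Returns class counts, per-class word counts and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(text, class_counts, word_counts, vocab):
    """Return the label with the highest log posterior under add-one smoothing."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# hypothetical toy training data, not taken from the books
toy_train = [
    ("the knight rode to the castle", "Ivanhoe"),
    ("the templar drew his sword", "Ivanhoe"),
    ("she thought the party was charming", "Emma"),
    ("her father disliked the dinner party", "Emma"),
]
nb_model = train_nb(toy_train)
print(predict_nb("the knight drew his sword", *nb_model))
print(predict_nb("a charming dinner party", *nb_model))
```

The lists of `(text, label)` pairs returned by `get_labelled_data()` have the same shape as `toy_train`, so the same split can be used for both classifiers.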

### Extension 3
Look at some more pairs of books (e.g., 4 pairs of books).  Are you able to identify pairs of books which are easier or harder than others to distinguish?

### Extension 4
Choose 5 books and build a 5-way classifier to distinguish between them.  Evaluate your classifier and carry out some basic error analysis as before.