# Framework Mapping with Deep Learning using Bidirectional Encoder Representations from Transformers (BERT)

Previously I designed a number of models using Naive Bayes (NB) and Support Vector Machine (SVM) algorithms, which gave 40-60% accuracy. The 60% accuracy of the SVM model would be acceptable and would cut the resource required for FW Mapping in half. However, recent research has turned up a number of methods which should, in theory, significantly outperform SVM and NB.

The first of these methods is using BERT.
[(Research paper.)](https://)

BERT is an open-source machine learning framework for Natural Language Processing (NLP). It uses the surrounding text to help computers grasp the meaning of ambiguous words. BERT was pre-trained on English Wikipedia and the BooksCorpus, and can be fine-tuned on question-and-answer datasets or text classification datasets.

BERT (Bidirectional Encoder Representations from Transformers) is based on Transformers, a deep learning model in which each output element is connected to each input element, and the weightings between them are dynamically calculated based on their connection. (This is referred to as attention in NLP.)
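To make the idea of attention concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration only: the function names and toy vectors are made up for this example, and the learned query, key and value projections used in a real Transformer layer are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = inputs.

    Real Transformer layers first apply learned projections to obtain the
    queries, keys and values; they are omitted to keep the sketch short.
    """
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    for query in vectors:
        # One score per input position: every output looks at every input.
        weights = softmax([dot(query, key) / scale for key in vectors])
        # The output is the attention-weighted average of the input vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(len(query))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional "token" vectors (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` holds the dynamically calculated weightings between one output position and all input positions.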

BERT is conceptually simple and empirically powerful. At publication it obtained new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7 percentage point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

## Import Dependencies & Load Data

In [None]:
!pip install transformers # install the transformers package

In [None]:
# import dependencies
import torch
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

from transformers import BertTokenizer
from torch.utils.data import TensorDataset

from transformers import BertForSequenceClassification


In [None]:
# load data
df = pd.read_csv("traindata.csv", encoding="Latin 1")
df.head()

## Prepare Data

In [3]:
df["fwnum"].value_counts()

RM6068    470
RM3821    272
RM3830    265
RM6088    167
RM3822    151
         ... 
RM3858      1
RM3857      1
RM3856      1
RM3855      1
RM6100      1
Name: fwnum, Length: 156, dtype: int64

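As we can see, the classes (framework numbers) are heavily imbalanced: a few frameworks have hundreds of titles while many have only one. This needs to be addressed.

The model expects each class as an integer id rather than a string, so the framework numbers are encoded as integer labels.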
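As an illustration of what the encoding step does, the sketch below mimics the behaviour of a label encoder in plain Python: each distinct framework number gets an integer id in sorted order, and the ids can be mapped back to the original strings. The helper name `fit_label_ids` is hypothetical and the sample values are taken from the first rows of the data.

```python
def fit_label_ids(labels):
    """Map each distinct label to an integer id, assigned in sorted order."""
    classes = sorted(set(labels))
    return {name: idx for idx, name in enumerate(classes)}, classes

fwnums = ["RM6145", "RM6157", "RM6068", "RM6068", "RM6088"]
label_to_id, classes = fit_label_ids(fwnums)

encoded = [label_to_id[name] for name in fwnums]  # strings -> integer ids
decoded = [classes[idx] for idx in encoded]       # integer ids -> strings
```

Keeping the list of classes around is what later allows predictions to be reported as framework numbers rather than bare integers.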

In [8]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df["label"] = le.fit_transform(df["fwnum"])

df


Unnamed: 0,title,fwnum,label
0,Action Learning Set Courses,RM6145,126
1,Additions Plant & Machinry,RM6157,129
2,Additions Software,RM6068,102
3,Additns-Info Techno,RM6068,102
4,Adds - Assets Under Construc,RM6088,105
...,...,...,...
2373,Security and Information Management,RM1557,4
2374,Cloud Infrastructure Consultancy,RM1557,4
2375,Acorn and Acorn Profiler software with Paychec...,RM1557,4
2376,"Support, Maintenance and upgrade costs for the...",RM1557,4


In [22]:
df = df.groupby("fwnum").filter(lambda x: len(x) >= 3)

df

Unnamed: 0,title,fwnum,label,data_type
0,Action Learning Set Courses,RM6145,126,not_set
1,Additions Plant & Machinry,RM6157,129,not_set
2,Additions Software,RM6068,102,not_set
3,Additns-Info Techno,RM6068,102,not_set
4,Adds - Assets Under Construc,RM6088,105,not_set
...,...,...,...,...
2373,Security and Information Management,RM1557,4,not_set
2374,Cloud Infrastructure Consultancy,RM1557,4,not_set
2375,Acorn and Acorn Profiler software with Paychec...,RM1557,4,not_set
2376,"Support, Maintenance and upgrade costs for the...",RM1557,4,not_set


## Split the Data & Stratify the Labels (resampling)

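Because the dataset is imbalanced, the train/validation split is stratified on the framework number so that each class keeps roughly the same proportion in both sets. `train_test_split` raises an error when stratifying if any class has fewer than two rows, so the filter above, which drops the rarest frameworks, is needed first.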
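The idea behind a stratified split can be sketched in plain Python as follows. This is a simplified illustration, not scikit-learn's actual allocation algorithm: the function `stratified_split` and the toy label counts are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, val_idx) with roughly `test_size` of every class in val."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                      # random choice within the class
        n_val = round(len(indices) * test_size)   # share of this class sent to validation
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy imbalanced labels: 10 titles for one framework, 5 for another.
labels = ["RM6068"] * 10 + ["RM3821"] * 5
train_idx, val_idx = stratified_split(labels, test_size=0.20, seed=42)

train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
```

Both classes keep a 2:1 ratio in the training and validation sets, which a purely random split would not guarantee.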

In [24]:
from sklearn.model_selection import train_test_split

X = df.index.values
y = df.label.values

X_train, X_val, y_train, y_val = train_test_split(X, 
                                                  y, 
                                                  test_size=0.20, 
                                                  random_state=42, 
                                                  stratify=y)

df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

df.groupby(['title',"fwnum", 'label', 'data_type']).count()

title,fwnum,label,data_type
Criminal Justice Inspectorate Website Hosting and Support,RM1557,4,train
HMP & YOI Swinfen Hall - Conflict Resolution Training January 2020,RM3822,31,train
HMP Bristol Music Workshops,RM3822,31,train
HMP Ford Bricklaying,RM3822,31,train
HMP Highpoint Warehousing & Distributions 2020 / 2021,RM6074,104,train
...,...,...,...
new: 3D modelling software @ c. £5k,RM6068,102,val
prj_2888 - Provision of Data Links,RM3821,30,train
prj_4509 - HMP Whatton-Preperaton for release course- 2020,RM3840,49,train
prj_5114 - Grant Agreement - Stop it Now Helpline 2020-2022,RM3815,27,train


## Tokenization

Tokenization takes raw text and splits it into tokens (words or sub-word pieces), each of which is then mapped to an integer id from the model's vocabulary.
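The sub-word splitting can be sketched with a greedy longest-match-first routine in the style of WordPiece. The vocabulary below is a hypothetical toy one (the real `bert-base-uncased` vocabulary has roughly 30,000 entries and different ids), and the helper names are made up for this example.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary mapping tokens to ids.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "cloud": 4, "host": 5, "##ing": 6, "support": 7}

def encode(title):
    """Lowercase, split on whitespace, apply WordPiece, add special tokens."""
    tokens = ["[CLS]"]
    for word in title.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("Cloud Hosting Support")
```

Here "hosting" is not in the toy vocabulary, so it is split into `host` and `##ing`, which is how rare or misspelt words in the titles can still be represented.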



* Construct a [BERT tokenizer](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer), which is based on WordPiece, from the pre-trained `bert-base-uncased` checkpoint.
* To convert all the titles from text into encoded form, we use `batch_encode_plus`, and we process the training and validation data separately.
* The first argument to `batch_encode_plus` is the title text.
* `add_special_tokens=True` means the sequences will be encoded with the special tokens relative to their model.
* `return_attention_mask=True` returns an attention mask, which marks which positions hold real tokens and which are padding.
* We also pad all the titles to the same maximum length.
* The titles are short, so `max_length=256` is more than we need, but it plays it safe. `return_tensors='pt'` returns PyTorch tensors.
* We then need to split the data into `input_ids`, `attention_masks` and `labels`.
* Finally, once we have the encoded data, we can create the training and validation datasets.
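To show what padding and the attention mask look like, here is a small sketch in plain Python. The helper `pad_and_mask` and the token ids are made up for illustration, and the truncation is deliberately naive: a real tokenizer keeps the closing special token when it truncates.

```python
def pad_and_mask(sequences, max_length, pad_id=0):
    """Truncate or right-pad each id sequence to max_length and build its mask."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]                               # cut over-long sequences
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Made-up token ids for two titles of different lengths.
batch = [[101, 7, 8, 102], [101, 9, 102]]
input_ids, attention_mask = pad_and_mask(batch, max_length=6)
```

After this step every row has the same length, so the batch can be stacked into a single tensor, and the mask tells the model to ignore the padded positions.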

In [27]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)
                                          
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].title.values.tolist(),  # plain list of title strings
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,       # explicitly truncate anything longer than max_length
    max_length=256, 
    return_tensors='pt')

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].title.values.tolist(),  # plain list of title strings
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,       # explicitly truncate anything longer than max_length
    max_length=256, 
    return_tensors='pt')


input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)

dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)


## Model Building

We treat each "title" or "description" as its own sequence, so each sequence is classified into one of the FW Number labels.

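* `bert-base-uncased` is the smaller pre-trained model (BERT-base, trained on lowercased text).
* `num_labels` sets the number of output labels. Note that `len(df["label"])` is the number of rows, not the number of distinct frameworks, so the classification head is larger than necessary; `len(le.classes_)` would give the class count. Whichever value is used must be the same when the saved weights are reloaded later.
* We don’t really care about `output_attentions`.
* We also don’t need `output_hidden_states`.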

In [29]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(df["label"]),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

### Data Loaders
* `DataLoader` combines a dataset and a sampler, and provides an iterable over the given dataset.
* We use `RandomSampler` for training and `SequentialSampler` for validation.
* Given the limited memory in my environment, I set `batch_size=3`.
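The difference between random and sequential sampling can be sketched without PyTorch. The generator below is a hypothetical stand-in that only yields dataset indices; it is meant to show the batching behaviour, not to reproduce the `DataLoader` API.

```python
import random

def iter_batches(n_items, batch_size, shuffle, seed=None):
    """Yield lists of dataset indices, batch_size at a time."""
    order = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(order)  # random order, as a random sampler would give
    for start in range(0, n_items, batch_size):
        yield order[start:start + batch_size]

# Sequential order for validation, shuffled order for training.
val_batches = list(iter_batches(7, batch_size=3, shuffle=False))
train_batches = list(iter_batches(7, batch_size=3, shuffle=True, seed=17))
```

With 7 items and a batch size of 3 there are 3 batches, the last one partial. The number of batches per epoch is also what determines the total number of optimisation steps during training.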

In [30]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 3

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val, 
                                   sampler=SequentialSampler(dataset_val), 
                                   batch_size=batch_size)

### Optimiser & Scheduler
Before we can build an optimiser, we must give it an iterable containing the parameters to optimise. The learning rate, epsilon, and other optimiser-specific options can then be specified.

* I found `epochs=5` works well for this data set.
* Create a schedule with a learning rate that decreases linearly from the initial learning rate set in the optimiser to 0, after a warmup period during which it increases linearly from 0 to the initial learning rate set in the optimizer.
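The shape of this schedule can be sketched as a function that returns the multiplier applied to the initial learning rate at each step. The function name `linear_warmup_decay` and the step counts are illustrative assumptions; the sketch follows the description above rather than any particular library's source.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the initial learning rate at a given step."""
    if step < num_warmup_steps:
        # Warmup: rise linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
total_steps = 10

# No warmup: the rate starts at base_lr and decays to 0.
no_warmup = [base_lr * linear_warmup_decay(s, 0, total_steps)
             for s in range(total_steps + 1)]

# Two warmup steps: the rate climbs to base_lr first, then decays.
with_warmup = [base_lr * linear_warmup_decay(s, 2, total_steps)
               for s in range(total_steps + 1)]
```

With zero warmup steps, the rate is at its maximum on the very first update and reaches 0 on the last one.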

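The `AdamW` class in transformers is deprecated in favour of `torch.optim.AdamW`, so the optimiser is imported from PyTorch instead. Setting `weight_decay=0.0` matches the default of the transformers implementation.

In [31]:
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(),
                  lr=1e-5, 
                  eps=1e-8,
                  weight_decay=0.0)  # torch defaults to 0.01; 0.0 matches the transformers default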
                  
epochs = 5

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)



### Performance Metrics
I will use the F1 score and per-class accuracy as performance metrics. The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. Because the classes are imbalanced, the per-class F1 scores are averaged with `average='weighted'`, i.e. weighted by the number of true examples in each class.
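As a worked example of the weighted F1 score, the sketch below computes it by hand in plain Python on a toy set of labels. The function `weighted_f1` and the sample labels are made up for illustration; the notebook itself uses scikit-learn for this.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 scores averaged with weights equal to each class's support."""
    support = Counter(y_true)  # number of true examples per class
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Harmonic mean of precision and recall (0 when both are 0).
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += n * f1
    return total / len(y_true)

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
score = weighted_f1(y_true, y_pred)
```

Classes 0 and 1 each score 0.8 and class 2 scores 1.0, so the support-weighted average is (3 × 0.8 + 2 × 0.8 + 1 × 1.0) / 6 = 5/6 ≈ 0.833.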

In [33]:
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    # Map each encoded label back to its framework number
    # (the index into le.classes_ is the encoded label).
    label_dict_inverse = dict(enumerate(le.classes_))
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')

### Training Loop



In [None]:
import random

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)


seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals
    
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
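        # The loss is already averaged over the batch, and len(batch) counts the
        # tensors in the tuple rather than the examples, so report it as-is.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})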
         
        
    torch.save(model.state_dict(), f'data_volume/finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

### Loading and Evaluating the Model

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(df["label"]),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

model.load_state_dict(torch.load("data_volume/finetuned_BERT_epoch_1.model", map_location=torch.device('cpu')))

_, predictions, true_vals = evaluate(dataloader_validation)
accuracy_per_class(predictions, true_vals)