# Fine-tuning a BERT language model for PII labeling

## Table of Contents
* Introduction
* Load training dataset with cudf
* Transform labels into pytorch tensor using dlpack
* Transform text using cudf subword tokenizer
* Split into train and test sets
* Load pretrained model from huggingface repo
* Fine-tune model
* Model evaluation
* Save model file
* Conclusion

## Introduction

Detecting personally identifiable information (PII) in text data is an arduous task that often requires complex regular expressions and heuristics. This notebook shows how to train a language model on a dataset of 1000 API responses that have been labeled as containing up to ten different types of PII. We fine-tune a pretrained BERT model from [HuggingFace](https://github.com/huggingface) with a multi-label classification layer.

In [None]:
from os import path
import s3fs
import torch
from torch.nn import BCEWithLogitsLoss
from transformers import AutoModelForSequenceClassification, AdamW
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.utils.data.dataset import random_split
from torch.utils.dlpack import from_dlpack
from sklearn.metrics import f1_score, accuracy_score, multilabel_confusion_matrix
from tqdm import trange
import cudf
import cupy

## Load training dataset with cudf

To train our model, we begin with a dataframe containing a `text` field with the text samples and one column for each of the ten PII labels. Each label column holds 1 if the text contains that PII type and 0 otherwise.
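
As a rough illustration of that layout, the plain-Python sketch below builds a tiny table with a `text` field and one 0/1 column per PII type. The rows and the three label names (`email`, `phone_num`, `ssn`) are invented for the example; the real dataset has ten label columns whose names come from the CSV.

```python
# Toy rows in the same shape as the training data: one text field plus
# one 0/1 column per PII type. Label names and rows are hypothetical.
rows = [
    {"text": "contact: jane@example.com", "email": 1, "phone_num": 0, "ssn": 0},
    {"text": "call 555-0100 or mail bob@example.com", "email": 1, "phone_num": 1, "ssn": 0},
    {"text": "status: ok", "email": 0, "phone_num": 0, "ssn": 0},
]

# Every column except `text` is a label column.
label_names = [name for name in rows[0] if name != "text"]

# Multi-hot label matrix: one row per sample, one column per label.
label_matrix = [[row[name] for name in label_names] for row in rows]

for row, multi_hot in zip(rows, label_matrix):
    print(multi_hot, row["text"])
```

A sample can carry several labels at once (the second row has two), which is why the task is multi-label rather than multi-class.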

In [None]:
# download sample data
PII_SAMPLE_CSV = "pii_training_sample.csv"
S3_BASE_PATH = "rapidsai-data/cyber/pii"

if not path.exists(PII_SAMPLE_CSV):
    fs = s3fs.S3FileSystem(anon=True)
    fs.get(S3_BASE_PATH + "/" + PII_SAMPLE_CSV, PII_SAMPLE_CSV)

In [None]:
df = cudf.read_csv(PII_SAMPLE_CSV)

## Transform labels into pytorch tensor using dlpack

Every column in `df` other than `text` is a label. We collect those column names and convert the label columns into a PyTorch tensor through DLPack, so the data stays on the GPU.

In [None]:
label_names = list(df.columns)
label_names.remove('text')
label_names

In [None]:
labels = from_dlpack(df[label_names].to_dlpack()).type(torch.long)

## Transform text using cudf subword tokenizer

We define two tokenizers, one for each of two models: `bert-base-cased`, which uses the pre-made vocabulary hash file `bert-base-cased-hash.txt`, and `mini-bert`, which uses the hash file `bert-base-uncased-hash.txt`. We then use one of these functions to transform the `text` column into the two padded tensors needed for training: `input_ids` and `attention_mask`.

In [None]:
# define tokenizer for bert-base-cased model

def bert_cased_tokenizer(strings, seq_length):
    """
    converts a cudf.Series of strings to two torch tensors: token ids and attention mask, both padded
    """    
    num_strings = len(strings)
    token_ids, mask = strings.str.subword_tokenize("resources/bert-base-cased-hash.txt", seq_length, seq_length,
                                                            max_rows_tensor=num_strings,
                                                            do_lower=False, do_truncate=True)[:2]
    # convert from cupy to torch tensors using dlpack
    input_ids = from_dlpack(token_ids.reshape(num_strings, seq_length).astype(cupy.float64).toDlpack())
    attention_mask = from_dlpack(mask.reshape(num_strings, seq_length).astype(cupy.float64).toDlpack())
    return input_ids.type(torch.long), attention_mask.type(torch.long)

In [None]:
# define tokenizer for use with mini-bert or bert-base-uncased models

def bert_uncased_tokenizer(strings, seq_length):
    """
    converts a cudf.Series of strings to two torch tensors: token ids and attention mask, both padded
    """    
    num_strings = len(strings)
    token_ids, mask = strings.str.subword_tokenize("resources/bert-base-uncased-hash.txt", seq_length, seq_length,
                                                            max_rows_tensor=num_strings,
                                                            do_lower=True, do_truncate=True)[:2]
    # convert from cupy to torch tensors using dlpack
    input_ids = from_dlpack(token_ids.reshape(num_strings, seq_length).astype(cupy.float64).toDlpack())
    attention_mask = from_dlpack(mask.reshape(num_strings, seq_length).astype(cupy.float64).toDlpack())
    return input_ids.type(torch.long), attention_mask.type(torch.long)

In [None]:
# pick model and tokenizer

MODEL_NAME = "google/bert_uncased_L-4_H-256_A-4"
TOKENIZER = bert_uncased_tokenizer

# or choose bert-base-cased
# MODEL_NAME = "bert-base-cased"
# TOKENIZER = bert_cased_tokenizer

In [None]:
# get input_ids and attention_masks tensors
input_ids, attention_masks = TOKENIZER(df.text, 256) # using 256 for our model sequence length
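
To make the shape of the tokenizer output concrete, here is a minimal plain-Python sketch of truncating or padding one sequence to a fixed length and building the matching attention mask. The helper `pad_and_mask` and the token ids are hypothetical, and a padding id of 0 is an assumption; the cudf subword tokenizer does this work on the GPU and may handle truncation details differently.

```python
def pad_and_mask(token_ids, seq_length, pad_id=0):
    """Truncate or right-pad a list of token ids to seq_length and build the mask."""
    kept = token_ids[:seq_length]
    padding = [pad_id] * (seq_length - len(kept))
    input_ids = kept + padding
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return input_ids, attention_mask

# Hypothetical token ids; real ids come from the vocabulary hash file.
short_ids, short_mask = pad_and_mask([101, 2023, 102], seq_length=6)
long_ids, long_mask = pad_and_mask([101, 7, 8, 9, 10, 11, 12, 102], seq_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every row ends up with the same length, which is what allows the whole column to be stacked into a single `(num_strings, seq_length)` tensor.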

## Split into train and test sets

Create a PyTorch dataset, split it into training and validation subsets, and load them into PyTorch dataloaders.

In [None]:
# create dataset
dataset = TensorDataset(input_ids, attention_masks, labels)

# use pytorch random_split to create training and validation data subsets
dataset_size = len(input_ids)
train_size = int(dataset_size * .8) # 80/20 split
training_dataset, validation_dataset = random_split(dataset, (train_size, (dataset_size-train_size)))

# create dataloaders
train_dataloader = DataLoader(dataset=training_dataset, shuffle=True, batch_size=32)
val_dataloader = DataLoader(dataset=validation_dataset, shuffle=False, batch_size=64)
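
The sizes that result from this split can be checked with a little arithmetic. The sketch below assumes the CSV holds exactly the 1000 samples mentioned in the introduction and reuses the batch sizes from the dataloaders above; it uses no GPU libraries.

```python
import math

dataset_size = 1000  # assumed: the sample count stated in the introduction
train_size = int(dataset_size * 0.8)
val_size = dataset_size - train_size

# The last batch of each loader may be smaller than batch_size.
train_batches = math.ceil(train_size / 32)
val_batches = math.ceil(val_size / 64)

print(train_size, val_size, train_batches, val_batches)
```

Under that assumption each epoch runs 25 optimizer steps, and validation takes 4 forward passes.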

## Load pretrained model from huggingface repo

In [None]:
num_labels = len(label_names)

In [None]:
# load the pretrained model selected above (MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, 
                                                           num_labels=num_labels)

In [None]:
model.train()
model.cuda(); # move model to GPU

In [None]:
# find number of gpus
n_gpu = torch.cuda.device_count()

# use DataParallel if you have more than one GPU
if n_gpu > 1:
    model = torch.nn.DataParallel(model)

## Fine-tune model

In [None]:
# using hyperparameters recommended in the original BERT paper
# the optimizer allows us to apply different hyperparameters to specific parameter groups
# apply weight decay to all parameters other than bias, gamma, and beta
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
# AdamW reads the per-group setting from the 'weight_decay' key
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]

optimizer = AdamW(optimizer_grouped_parameters,lr=2e-5)
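
The grouping above relies on substring matching against parameter names. As a small sketch of how that plays out, the example below applies the same test to a handful of hypothetical names written in the style Hugging Face BERT models use; the helper `split_by_decay` is made up for illustration, and the names on a real model should be checked with `model.named_parameters()`.

```python
# Hypothetical parameter names in the style used by Hugging Face BERT models.
param_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.embeddings.LayerNorm.weight",
    "bert.embeddings.LayerNorm.bias",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "classifier.weight",
    "classifier.bias",
]

def split_by_decay(names, no_decay):
    """Return (decayed, not_decayed) names using substring matching."""
    decayed = [n for n in names if not any(nd in n for nd in no_decay)]
    not_decayed = [n for n in names if any(nd in n for nd in no_decay)]
    return decayed, not_decayed

decayed, not_decayed = split_by_decay(param_names, ["bias", "gamma", "beta"])
print("decayed:", decayed)
print("not decayed:", not_decayed)
```

With these example names, only the bias terms are exempt from weight decay. A name such as `LayerNorm.weight` matches none of the three substrings, so it would need its own entry in `no_decay` if layer-norm weights should also be excluded.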

In [None]:
# number of training epochs
epochs = 4

# train loop
for _ in trange(epochs, desc="Epoch"):
    # tracking variables
    tr_loss = 0  # running loss
    nb_tr_examples, nb_tr_steps = 0, 0

    # train on the data for one epoch
    for batch in train_dataloader:
        # unpack the inputs from dataloader
        b_input_ids, b_input_mask, b_labels = batch
        
        # clear out the gradients
        optimizer.zero_grad()

        # forward pass
        outputs = model(b_input_ids, attention_mask=b_input_mask)
        logits = outputs[0]
        
        # using binary cross-entropy with logits as loss function
        # assigns independent probabilities to each label
        loss_func = BCEWithLogitsLoss() 
        loss = loss_func(logits.view(-1,num_labels),b_labels.type_as(logits).view(-1,num_labels)) #convert labels to float for calculation 
        if n_gpu > 1:
            loss = loss.mean() # mean() to average on multi-gpu parallel training
        # backward pass
        loss.backward()
        
        # update parameters and take a step using the computed gradient
        optimizer.step()
        
        # update tracking variables
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1

    print("Train loss: {}".format(tr_loss/nb_tr_steps))
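
The loss used in the training loop treats each label as its own yes/no problem. As a minimal sketch of what `BCEWithLogitsLoss` computes with its default mean reduction, the plain-Python version below applies a sigmoid to each logit and averages the binary cross-entropy over every (sample, label) pair. The logits and targets are made-up values, and PyTorch's implementation uses a numerically more stable formulation than this direct one.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over every (sample, label) pair."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for x, y in zip(row_logits, row_targets):
            p = sigmoid(x)  # independent probability for this label
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            count += 1
    return total / count

# Two samples, three labels each (illustrative values only).
logits = [[2.0, -1.0, 0.0], [-3.0, 0.5, 1.5]]
targets = [[1, 0, 0], [0, 1, 1]]
print(bce_with_logits(logits, targets))
```

Because each label gets its own sigmoid, the predicted probabilities for one sample do not have to sum to 1, unlike a softmax over classes.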

## Model evaluation

We evaluate the model on the 20% of the data held out in the validation set and report two metrics. The `F1 macro` score computes F1 (the harmonic mean of precision and recall) for each label separately and averages the results with equal weight per label. The `flat accuracy` is the fraction of validation samples for which every label is predicted correctly, which is how scikit-learn's `accuracy_score` treats multi-label input.

In [None]:
# switch the model to eval mode (disables dropout) for validation
model.eval()

# variables to gather full output
logit_preds,true_labels,pred_labels = [],[],[]

# predict
for batch in val_dataloader:
    # unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    with torch.no_grad():
        # forward pass
        output = model(b_input_ids, attention_mask=b_input_mask)
        b_logit_pred = output[0]
        b_pred_label = torch.sigmoid(b_logit_pred)
        b_logit_pred = b_logit_pred.detach().cpu().numpy()
        b_pred_label = b_pred_label.detach().cpu().numpy()
        b_labels = b_labels.detach().cpu().numpy()
    
    logit_preds.extend(b_logit_pred)
    true_labels.extend(b_labels)
    pred_labels.extend(b_pred_label)

# calculate metrics, using a 0.50 decision threshold on the sigmoid outputs
threshold = 0.50
pred_bools = [pl>threshold for pl in pred_labels]
true_bools = [tl==1 for tl in true_labels]
val_f1_accuracy = f1_score(true_bools,pred_bools,average='macro')*100
val_flat_accuracy = accuracy_score(true_bools, pred_bools)*100

print('F1 Macro Validation Score: ', val_f1_accuracy)
print('Flat Validation Accuracy: ', val_flat_accuracy)
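
The two numbers printed above can differ considerably. The sketch below reimplements both in plain Python on a made-up set of four samples and two labels, to show what each one counts. The helper names are hypothetical; scikit-learn's `f1_score(average='macro')` and `accuracy_score` are what the notebook actually calls.

```python
def f1_for_label(true_col, pred_col):
    """F1 for a single label: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    tp = sum(t and p for t, p in zip(true_col, pred_col))
    fp = sum((not t) and p for t, p in zip(true_col, pred_col))
    fn = sum(t and (not p) for t, p in zip(true_col, pred_col))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(true_rows, pred_rows):
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(true_rows[0])
    per_label = [
        f1_for_label([r[j] for r in true_rows], [r[j] for r in pred_rows])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels

def subset_accuracy(true_rows, pred_rows):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(t == p for t, p in zip(true_rows, pred_rows))
    return exact / len(true_rows)

# Illustrative data: 4 samples, 2 labels.
true_rows = [[True, False], [True, True], [False, True], [False, False]]
pred_rows = [[True, False], [True, False], [False, True], [True, False]]

print(macro_f1(true_rows, pred_rows))
print(subset_accuracy(true_rows, pred_rows))
```

In this toy case two of the four samples have one wrong label each, so the exact-match accuracy is 0.5 even though most individual label decisions are right.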

In [None]:
# confusion matrix for each label

for label, cf in zip(label_names, multilabel_confusion_matrix(true_bools, pred_bools)):
    print(label)
    print(cf)
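
Each matrix printed by this cell is a 2x2 array laid out as `[[TN, FP], [FN, TP]]` for one label. The following sketch, with a hypothetical helper and made-up predictions, reproduces that layout in plain Python so the four cells are easy to read.

```python
def label_confusion(true_col, pred_col):
    """Return [[TN, FP], [FN, TP]] for one label."""
    pairs = list(zip(true_col, pred_col))
    tn = sum((not t) and (not p) for t, p in pairs)
    fp = sum((not t) and p for t, p in pairs)
    fn = sum(t and (not p) for t, p in pairs)
    tp = sum(t and p for t, p in pairs)
    return [[tn, fp], [fn, tp]]

# Illustrative ground truth and predictions for a single label.
true_col = [True, True, False, False, True]
pred_col = [True, False, False, True, True]

matrix = label_confusion(true_col, pred_col)
print(matrix)
```

For a PII detector the bottom-left cell (false negatives, PII that was missed) is often the one to watch most closely.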

## Save model file

If the model is wrapped in `DataParallel`, save the state dict of the underlying `model.module` rather than the wrapper, so the saved file can be loaded later either inside or outside a multi-GPU environment.

In [None]:
#if n_gpu > 1:
#    torch.save(model.module.state_dict(), "path/to/your-model-name.pth")
#else:
#    torch.save(model.state_dict(), "path/to/your-model-name.pth")        
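
The reason for saving `model.module` is that `DataParallel` stores the wrapped model under an attribute named `module`, so the wrapper's state dict keys all start with `module.`. The sketch below shows, on a plain dictionary with hypothetical keys, what that prefix looks like and how a helper (also hypothetical) could strip it from a checkpoint that was saved from the wrapper by mistake.

```python
def strip_module_prefix(state_dict):
    """Drop a leading 'module.' from every key, leaving other keys unchanged."""
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Stand-in for a state dict saved from a DataParallel-wrapped model;
# the keys are hypothetical and the values would normally be tensors.
wrapped = {
    "module.bert.embeddings.word_embeddings.weight": "w0",
    "module.classifier.bias": "b0",
}

print(strip_module_prefix(wrapped))
```

Saving `model.module.state_dict()` in the first place avoids the need for this step.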

## Conclusion

Using a pretrained BERT model (`bert-base-cased` or `mini-bert`) from the HuggingFace repo and custom training for multi-label classification, we are able to train a PII detector from our training dataset.