# Initialize BERT Tokenizer

We import the `BertTokenizer` class from the Hugging Face `transformers` library and load the vocabulary that ships with the `bert-base-uncased` checkpoint. With `do_lower_case=True`, the tokenizer lowercases all text before splitting it into tokens, which matches how the uncased model was pre-trained. We use this tokenizer to convert text into the input format BERT expects.
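To make the conversion concrete, here is a minimal sketch of greedy longest-match sub-word splitting, the idea behind BERT's WordPiece vocabulary. It is not the library's implementation: `TOY_VOCAB`, `wordpiece`, and `tokenize` are hypothetical names, the vocabulary is invented, and punctuation splitting and accent stripping are left out.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Invented toy vocabulary (assumption); the real checkpoint ships about 30,000 entries.
TOY_VOCAB = {"[CLS]", "[SEP]", "[UNK]", "one", "does", "not", "simply", "meme", "##s"}

print(tokenize("One does not simply MEMES", TOY_VOCAB))
```

Because the text is lowercased first, `MEMES` and `memes` produce the same pieces, which is the effect of `do_lower_case=True`.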


In [None]:
from transformers import BertTokenizer
from torch.utils.data import Dataset, TensorDataset
import torch 

class_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)



# Prepare Data for BERT

To prepare the text data for the BERT model, we encode the captions with the BERT tokenizer. This involves:
- Converting the text to input IDs and attention masks. The attention mask marks which positions hold real tokens and which hold padding, so the model can ignore the padding.
- Padding or truncating every sequence to a fixed length of 256 tokens so that all inputs have the same size.
- Returning the result as PyTorch tensors, ready for model training.

We perform these steps separately for both the training and validation datasets based on the 'data_type' column in the sampled data. After encoding, we organize the input IDs, attention masks, and labels into `TensorDataset` objects. These datasets will facilitate efficient loading and batching during the model training and evaluation phases.
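The padding and masking step can be sketched in plain Python. This is a simplified illustration, not the tokenizer's own code: `pad_or_truncate` is a hypothetical helper, the middle token IDs are arbitrary placeholders, and unlike the real tokenizer it does not keep the closing `[SEP]` token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]           # truncate sequences that are too long
    mask = [1] * len(ids)                  # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# 101 and 102 stand for [CLS] and [SEP]; 7 and 8 are placeholder word IDs.
input_ids, attention_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
print(input_ids)       # [101, 7, 8, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

With `max_length=256`, every caption ends up as two lists of 256 entries, which is what allows the encoded captions to be stacked into a single tensor.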


In [None]:
                                        
encoded_data_train = class_tokenizer.batch_encode_plus(
    sampled_data[sampled_data.data_type=='train'].Caption.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length',  # pad every caption to max_length
    truncation=True,
    max_length=256, 
    return_tensors='pt',
)

encoded_data_val = class_tokenizer.batch_encode_plus(
    sampled_data[sampled_data.data_type=='val'].Caption.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length',  # pad every caption to max_length
    truncation=True, 
    max_length=256, 
    return_tensors='pt'
)


input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(sampled_data[sampled_data.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(sampled_data[sampled_data.data_type=='val'].label.values)

dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)




# Load BERT Model for Sequence Classification

We load a pre-trained BERT model specifically configured for sequence classification tasks from the Hugging Face `transformers` library. This model is initialized with the 'bert-base-uncased' version of BERT, which has been pre-trained on a large corpus of English data and does not differentiate between uppercase and lowercase text.

The model is customized for our specific task by setting the `num_labels` parameter to the number of unique labels in our dataset, which corresponds to the different meme templates. We disable the output of attentions and hidden states to streamline the model's output, focusing solely on the final classification results.
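`label_dict` is defined outside this section. As a hedged sketch, assuming it maps each template name to a consecutive integer index, it could be built as follows. The `build_label_dict` helper and the template names are hypothetical.

```python
def build_label_dict(labels):
    """Assign each distinct label the next free integer index, in order of first appearance."""
    label_dict = {}
    for name in labels:
        if name not in label_dict:
            label_dict[name] = len(label_dict)
    return label_dict


# Hypothetical template names, for illustration only.
templates = ["drake", "doge", "drake", "distracted_boyfriend"]
label_dict_example = build_label_dict(templates)

print(label_dict_example)       # {'drake': 0, 'doge': 1, 'distracted_boyfriend': 2}
print(len(label_dict_example))  # 3, the value that would be passed as num_labels
```

The classification head then produces one logit per index, so `num_labels` has to equal the number of entries in the dictionary.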

This setup is tailored to efficiently handle our classification needs, leveraging BERT's powerful contextual understanding of text.


In [None]:
# Model loading 
from transformers import BertForSequenceClassification

classification_model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Create DataLoaders

To feed the training and validation datasets to the model, we set up PyTorch `DataLoader` objects, which handle batching and shuffling. For the training data, we use a `RandomSampler`, which draws the examples in a new random order each epoch so that consecutive batches do not follow the order of the dataset. For the validation data, we use a `SequentialSampler`, which iterates over the data in a fixed order; the evaluation metrics do not depend on the order, so shuffling is unnecessary.

Both loaders use a batch size of 12, a compromise between memory use and throughput.
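The difference between the two samplers comes down to the order in which example indices are grouped into batches. Here is a minimal sketch in plain Python; the helper names are hypothetical and this is not PyTorch's implementation.

```python
import random


def batches(indices, batch_size):
    """Group a list of indices into consecutive batches; the last one may be smaller."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]


def sequential_order(n):
    """Fixed order, as used for validation."""
    return list(range(n))


def random_order(n, rng):
    """A fresh permutation of the indices, as drawn for each training epoch."""
    order = list(range(n))
    rng.shuffle(order)
    return order


print(batches(sequential_order(7), 3))   # [[0, 1, 2], [3, 4, 5], [6]]
print(batches(random_order(7, random.Random(0)), 3))
```

Every index still appears exactly once per epoch in the shuffled case; only the grouping into batches changes.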


In [None]:
# Dataloader 
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 12

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val, 
                                   sampler=SequentialSampler(dataset_val), 
                                   batch_size=batch_size)

# Configure Optimizer and Scheduler

To train our BERT model, we use the `AdamW` optimizer. AdamW is a variant of Adam that decouples weight decay from the gradient update. We import it from `torch.optim`, because the `AdamW` class in `transformers` is deprecated, and pass `weight_decay=0.0` to keep that class's default. We set the learning rate to `1e-5` and epsilon to `1e-8`, values commonly used for fine-tuning BERT models.

Additionally, we configure a learning rate scheduler to adjust the learning rate during training. `get_linear_schedule_with_warmup` raises the learning rate linearly from zero to its peak over the warmup steps and then decays it linearly to zero. Here we set no warmup steps, so the learning rate starts at `1e-5` and decays linearly over the total number of training steps, computed as the number of epochs times the number of batches in the training DataLoader.


In [None]:
# Optimisers
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(classification_model.parameters(),
                  lr=1e-5, 
                  eps=1e-8,
                  weight_decay=0.0)  # torch defaults to 0.01; 0.0 matches transformers.AdamW
                  
epochs = 5

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
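The shape of the schedule can be sketched as a plain function of the step count. This is a sketch that assumes the usual linear-warmup, linear-decay multiplier; `linear_schedule_lr` is a hypothetical helper, not the library function.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given step: linear ramp-up during warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor


# With no warmup, the rate starts at its peak and falls linearly to zero.
for step in (0, 25, 50, 75, 100):
    print(step, linear_schedule_lr(step, 1e-5, num_warmup_steps=0, num_training_steps=100))
```

With `num_warmup_steps=0`, as in this notebook, the first branch never runs, so the schedule is a pure linear decay from `1e-5`.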



# Performance Metrics

To evaluate our model's performance, we implement two key metrics: the F1 score and accuracy per class.

### F1 Score Calculation
We define `f1_score_func` to compute the weighted F1 score. The F1 score of a class is the harmonic mean of its precision and recall; the weighted variant averages the per-class scores, weighting each by the number of true instances of that class, so that larger classes contribute proportionally. The function takes the index of the largest logit in each row as the predicted label, flattens the prediction and label arrays, and then calculates the F1 score.
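As a worked illustration of the weighting, the following sketch computes the same quantity by hand on four toy labels. `weighted_f1` is a hypothetical helper written for this example; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        support = tp + fn                  # number of true instances of this class
        score += f1 * support / total
    return score


# Class 0: precision 1, recall 2/3, F1 = 0.8 (support 3).
# Class 1: precision 1/2, recall 1, F1 = 2/3 (support 1).
print(weighted_f1([0, 0, 0, 1], [0, 0, 1, 1]))  # (0.8 * 3 + 2/3 * 1) / 4, about 0.767
```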

### Accuracy Per Class
The `accuracy_per_class` function provides a breakdown for each class. It inverts our label dictionary to decode label indices back to their original names, compares predictions to true labels, and prints, for each class, the number of correct predictions out of the number of examples in that class. This is particularly useful for seeing which classes the model handles well and which it confuses.

Both functions operate on NumPy arrays on the CPU. The cell also defines `device`, which selects a GPU when one is available and is used later to place the model and the batches.


In [None]:
from sklearn.metrics import f1_score
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')

# Model Evaluation and Reproducibility Setup

### Reproducibility
To ensure consistent results across multiple runs, we set a fixed seed for the random number generators in `random`, `numpy`, and `torch`. The model is also transferred to the appropriate device (GPU or CPU) based on availability.
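The effect of a fixed seed can be checked with the standard library alone. This is a small sketch using a hypothetical `draw` helper; the same principle applies to the `numpy` and `torch` generators seeded in the cell below.

```python
import random


def draw(seed, n=3):
    """Return the first n pseudo-random numbers produced from a given seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# The same seed reproduces the same sequence on every run.
print(draw(17) == draw(17))  # True
```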

### Evaluation Function
The `evaluate` function assesses the model's performance on the validation dataset. It sets the model to `eval` mode, which disables dropout (BERT uses layer normalization, which behaves the same in both modes), and runs each batch inside `torch.no_grad()` so that no gradients are tracked. For every batch it records the loss and collects the logits and true labels. It returns the average validation loss together with the concatenated predictions and labels, from which metrics such as the F1 score are computed outside the function.


In [None]:
import random
import numpy as np
import torch
from tqdm import tqdm
from sklearn.metrics import f1_score

seed_val = 17
classification_model = classification_model.to(device)

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

def to_device(data):
    if isinstance(data, (list, tuple)):
        return [to_device(x) for x in data]
    return data.to(device)

def evaluate(dataloader_val):
    classification_model.eval()
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        batch = to_device(batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2]}
        
        with torch.no_grad():
            outputs = classification_model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total / len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals
    


# Training Loop

The training loop fine-tunes the BERT model over several epochs. Each epoch consists of:
- Clearing the GPU cache to free memory.
- Iterating over batches of training data, moving each batch to the appropriate device.
- Resetting the gradients, performing a forward pass to compute the loss, and performing a backward pass to compute the gradients.
- Clipping the gradient norm to 1.0 to prevent exploding gradients, which can destabilize training.
- Stepping the optimizer to update the weights, then stepping the scheduler to update the learning rate.

Training progress is displayed via a progress bar showing the loss of the current batch. After each epoch, the model's state is saved. We also print the average training loss and evaluate the model on the validation set, computing the average validation loss and weighted F1 score. Comparing the training and validation losses across epochs lets us monitor progress and detect overfitting.
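Gradient-norm clipping rescales all gradients by a common factor whenever their combined L2 norm exceeds the threshold. Here is a minimal sketch on a flat list of numbers; `clip_grad_norm` is a hypothetical pure-Python stand-in, not the PyTorch function, and the small constant in the denominator is an assumption made for numerical safety.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale grads so their L2 norm is at most max_norm; return (grads, original norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:                        # only ever shrink, never enlarge
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)                                 # 5.0
print(math.sqrt(sum(g * g for g in clipped)))      # close to 1.0
```

The direction of the gradient is preserved; only its length is capped, which is why clipping limits the size of a single update without changing where it points.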


In [None]:
for epoch in tqdm(range(1, epochs + 1)):
    classification_model.train()
    loss_train_total = 0
    torch.cuda.empty_cache()
    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:
        classification_model.zero_grad()
        batch = to_device(batch)
        
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2]}       

        outputs = classification_model(**inputs)
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(classification_model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        
        # loss is already averaged over the examples in the batch
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
    
    torch.save(classification_model.state_dict(), f'models/finetuned_BERT2_epoch_{epoch}.model')
    
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total / len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}') 


# Model Evaluation on Validation Data

We evaluate the model on the validation dataset using the previously defined `evaluate` function, which returns the validation loss, predictions, and true labels. The cell below evaluates the model currently in memory; to evaluate a specific checkpoint instead, first load its saved `state_dict` into `classification_model`.

Following the evaluation, we use the `accuracy_per_class` function to compute and print the accuracy for each class individually. This detailed breakdown helps identify how well the model performs across different categories, highlighting strengths and areas for improvement in class-specific performance.


In [None]:
_, predictions, true_vals = evaluate(dataloader_validation)
accuracy_per_class(predictions, true_vals)