# Transformer-based baseline (BERT)

A transformer-based sentiment classification model is evaluated using a pre-trained BERT-family architecture (DistilBERT, a smaller distilled variant of BERT). This notebook establishes a deep learning baseline to compare against classical machine learning approaches based on TF-IDF features.

## Load dataset and create train–test split

The same cleaned and balanced review dataset used for the classical machine learning baselines is loaded. The train–test split is reproduced to ensure a fair comparison between models.

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the cleaned dataset
df = pd.read_csv("../data/balanced_reviews.csv")

In [5]:
# Double check column names
df.columns

Index(['Text', 'Sentiment'], dtype='object')

In [6]:
# Extract text and labels
X = df["Text"]
y = df["Sentiment"]

# Create a reproducible train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
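
As a quick sanity check on what `stratify=y` does, the sketch below runs the same split settings on a small synthetic label array and compares class shares in each partition. The toy data and the `class_shares` helper are hypothetical and are not part of the project dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: three balanced classes, 100 examples each
toy_labels = np.array(["negative", "neutral", "positive"] * 100)
toy_texts = np.array([f"review {i}" for i in range(len(toy_labels))])

# Same split settings as the notebook cell above
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,
    random_state=42,
    stratify=toy_labels,
)

def class_shares(labels):
    """Return the fraction of examples belonging to each class."""
    classes, counts = np.unique(labels, return_counts=True)
    return {str(c): float(n / len(labels)) for c, n in zip(classes, counts)}

train_shares = class_shares(toy_y_train)
test_shares = class_shares(toy_y_test)
print(train_shares)
print(test_shares)
```

With stratification, both partitions should keep the class proportions of the full label array, which is what makes the comparison with the classical baselines like-for-like.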

## Tokenisation using a pre-trained BERT tokenizer
Review text is converted into numerical token IDs that the model can take as input. Sequences longer than 128 tokens are truncated, and shorter sequences are padded so that all inputs have the same length.

### Configure local Hugging Face cache directory

The Hugging Face Transformers library downloads pre-trained model files to a local cache directory. On some systems, the default cache location may not be writable, which can cause permission errors during model download. A project-local cache directory is configured to ensure reliable and reproducible access to pre-trained model files.

In [7]:
import os

# Set a local Hugging Face cache directory inside the project
os.environ["HF_HOME"] = os.path.join(os.getcwd(), ".hf_cache")
os.environ["TRANSFORMERS_CACHE"] = os.path.join(os.getcwd(), ".hf_cache")

# Ensure the cache directory exists
os.makedirs(os.environ["HF_HOME"], exist_ok=True)

os.environ["HF_HOME"]

'/Users/tommorton/Library/CloudStorage/OneDrive-Personal/Masters/Comp Sci/Modules/Masters Project/AmazonSentinmentAnalysis/notebooks/.hf_cache'

In [20]:
from transformers import DistilBertTokenizerFast
import torch

# Load the pre-trained DistilBERT tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained(
    "distilbert-base-uncased",
    cache_dir=os.path.join(os.getcwd(), ".hf_cache")
)

# Tokenise the training and test text
train_encodings = tokenizer(
    X_train.tolist(),
    truncation=True,
    padding=True,
    max_length=128
)

test_encodings = tokenizer(
    X_test.tolist(),
    truncation=True,
    padding=True,
    max_length=128
)
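
To make the effect of `truncation=True`, `padding=True` and `max_length` more concrete, here is a toy sketch using plain Python lists. It is a simplified illustration, not the tokenizer's implementation: the `pad_and_truncate` helper and the token IDs are hypothetical, the padding ID of 0 is an assumption, and the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)

    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ID sequences of different lengths
toy_batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

The attention mask is what lets the model distinguish real tokens from padding, which is why the encodings contain both `input_ids` and `attention_mask`.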



## Encode labels
Sentiment labels are converted into integer class IDs so that they can be used by transformer models.

In [9]:
from sklearn.preprocessing import LabelEncoder

# Encode the sentiment labels as integers
label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)
y_test_enc = label_encoder.transform(y_test)

# Display mapping of sentiment labels to integers
dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

{'negative': np.int64(0), 'neutral': np.int64(1), 'positive': np.int64(2)}
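
The same encoder can map integer predictions back to sentiment names when results are reported. The sketch below uses a separate toy encoder and made-up predictions to show the round trip; `toy_encoder` and `toy_predictions` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit a toy encoder on the same three label names
toy_encoder = LabelEncoder()
toy_encoder.fit(["positive", "negative", "neutral", "positive"])

# Classes are stored in sorted order, which determines the integer IDs
print(list(toy_encoder.classes_))

# Hypothetical integer predictions, decoded back to label names
toy_predictions = np.array([2, 0, 1, 2])
decoded = toy_encoder.inverse_transform(toy_predictions)
print(list(decoded))
```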

## Create PyTorch dataset objects
The tokenised text and encoded labels need to be wrapped in a PyTorch Dataset object for use with the DataLoader.

In [10]:
# Define a dataset wrapper for tokenised inputs
class ReviewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, index):
        item = {key: torch.tensor(val[index]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[index])
        return item

# Create dataset objects for training and test sets
train_dataset = ReviewsDataset(train_encodings, y_train_enc)
test_dataset = ReviewsDataset(test_encodings, y_test_enc)

# Check split size
len(train_dataset), len(test_dataset)

(96000, 24000)

## Load pre-trained BERT model for sequence classification
The pre-trained DistilBERT model is loaded with a classification head suited to the three-class sentiment analysis task.

In [21]:
from transformers import DistilBertForSequenceClassification

# Load a pre-trained DistilBERT model with a classification head
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(label_encoder.classes_), # Derived from the encoder so the head always matches the number of classes
    cache_dir=os.path.join(os.getcwd(), ".hf_cache")
);

Loading weights: 100%|██████████| 100/100 [00:00<00:00, 1490.42it/s, Materializing param=distilbert.transformer.layer.5.sa_layer_norm.weight]   
DistilBertForSequenceClassification LOAD REPORT from: distilbert-base-uncased
Key                     | Status     | 
------------------------+------------+-
vocab_transform.bias    | UNEXPECTED | 
vocab_layer_norm.bias   | UNEXPECTED | 
vocab_projector.bias    | UNEXPECTED | 
vocab_layer_norm.weight | UNEXPECTED | 
vocab_transform.weight  | UNEXPECTED | 
pre_classifier.bias     | MISSING    | 
classifier.bias         | MISSING    | 
pre_classifier.weight   | MISSING    | 
classifier.weight       | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


## WIP - Define training parameters
Training arguments are defined to control how the transformer model is fine-tuned. These settings include the number of epochs, batch sizes, random seed, and evaluation strategy; the learning rate is left at the library default.

In [25]:
from transformers import TrainingArguments

# Define training arguments for fine-tuning and evaluation
training_args = TrainingArguments(
    output_dir="../results/bert_baseline",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    eval_strategy="epoch",
    save_strategy="no",
    logging_strategy="epoch",
    seed=42,
    dataloader_pin_memory=False # Silence the pin-memory warning when running on Apple Silicon Macs
)

training_args

TrainingArguments(
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=True,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=False,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
enable_jit_checkpoint=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.EPOCH,
eval_use_gather_object=Fal
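
To get a feel for the cost of these settings, the number of optimisation steps can be estimated from the dataset sizes reported earlier (96,000 training and 24,000 test examples). This is a rough sketch that assumes a single device and no gradient accumulation.

```python
import math

# Dataset sizes from the split above and batch sizes from the training arguments
train_examples = 96_000
test_examples = 24_000
train_batch_size = 8
eval_batch_size = 16
epochs = 2

# Assumption: one device and gradient_accumulation_steps left at 1
steps_per_epoch = math.ceil(train_examples / train_batch_size)
total_train_steps = steps_per_epoch * epochs
eval_steps = math.ceil(test_examples / eval_batch_size)

print(steps_per_epoch, total_train_steps, eval_steps)
```

A small training batch size keeps memory use low but leads to many optimisation steps per epoch, which is relevant when training on limited hardware.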

## Create a trainer and define the evaluation metrics
A `Trainer` object is created to manage fine-tuning and evaluation. A metric function is also defined to calculate accuracy and macro-averaged F1 score.

In [26]:
from transformers import Trainer
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

# Define evaluation metrics for model performance
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)

    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average="macro")

    return {
        "accuracy": accuracy,
        "macro_f1": f1
    }

# Create Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

trainer

<transformers.trainer.Trainer at 0x3d15e0e10>
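
The metric calculation can be checked in isolation on a handful of made-up logits. The sketch below repeats the same argmax, accuracy, and macro F1 steps on hypothetical values (`toy_logits` and `toy_labels` are not model outputs) and also prints the per-class F1 scores that the macro average is built from.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical logits for six examples and three classes
toy_logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.3, 0.2, 0.9],
    [1.2, 0.4, 0.8],
    [0.1, 0.3, 2.2],
    [0.9, 0.7, 0.2],
])
toy_labels = np.array([0, 1, 2, 2, 2, 1])

# The predicted class is the index of the largest logit in each row
toy_preds = np.argmax(toy_logits, axis=1)

toy_accuracy = accuracy_score(toy_labels, toy_preds)
per_class_f1 = f1_score(toy_labels, toy_preds, average=None)
toy_macro_f1 = f1_score(toy_labels, toy_preds, average="macro")

print(toy_preds)
print(toy_accuracy)
print(per_class_f1, toy_macro_f1)
```

Macro averaging gives each class equal weight regardless of how many examples it has, so a model that neglects one class is penalised even when overall accuracy looks reasonable.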

In [27]:
# Evaluate the pre-trained BERT model before fine-tuning
trainer.evaluate()

{'eval_loss': 1.0984047651290894,
 'eval_model_preparation_time': 0.0042,
 'eval_accuracy': 0.3385416666666667,
 'eval_macro_f1': 0.2006771513132997,
 'eval_runtime': 193.6398,
 'eval_samples_per_second': 123.941,
 'eval_steps_per_second': 7.746}
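
These pre-fine-tuning numbers can be compared with what a model with no useful signal would score on three balanced classes. The sketch below computes those reference values. It is a back-of-the-envelope check under the assumption of a perfectly balanced test set, not a statement about what the model is actually predicting.

```python
import math

num_classes = 3

# Cross-entropy loss when every class is assigned probability 1/3
chance_loss = math.log(num_classes)

# Expected accuracy when guessing on balanced classes
chance_accuracy = 1 / num_classes

# Macro F1 if the model always predicted one single class:
# that class has precision 1/3 and recall 1, the others score 0
single_class_f1 = 2 * (1 / num_classes) * 1 / ((1 / num_classes) + 1)
collapsed_macro_f1 = single_class_f1 / num_classes

print(chance_loss, chance_accuracy, collapsed_macro_f1)
```

The observed loss (about 1.098) and accuracy (about 0.339) sit very close to the chance-level values, and the low macro F1 is in the range expected when predictions are concentrated on one class. This is consistent with the classification head being newly initialised, as the load report indicates.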

In [28]:
# Fine-tune the BERT model on the training dataset
trainer.train()
# Started training BERT model. Switching to DistilBERT due to resource constraints.

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 