# Transformer-based baseline (BERT)

A transformer-based sentiment classification model is evaluated using a pre-trained BERT architecture. This notebook establishes a deep learning baseline to compare against classical machine learning approaches based on TF-IDF features.

## Load dataset and create train–test split

The same cleaned and balanced review dataset used for the classical machine learning baselines is loaded. The train–test split is reproduced to ensure a fair comparison between models.

In [25]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the cleaned dataset
df = pd.read_csv("../data/balanced_reviews.csv")

In [26]:
# Double check column names
df.columns

Index(['Text', 'Sentiment'], dtype='object')

In [27]:
# Extract text and labels
X = df["Text"]
y = df["Sentiment"]

# Create a reproducible train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
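
To make the effect of `stratify=y` concrete, the following is a minimal pure-Python sketch of a stratified split on toy labels. The function `stratified_split` and the toy data are hypothetical illustrations, not scikit-learn's actual implementation; the sketch only shows that each class keeps roughly the same proportion in both parts.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with similar class proportions in each part."""
    rng = random.Random(seed)

    # Group example indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test part
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels (hypothetical, not the review data): 10 examples per class
toy_labels = ["negative"] * 10 + ["neutral"] * 10 + ["positive"] * 10
train_idx, test_idx = stratified_split(toy_labels)

train_counts = Counter(toy_labels[i] for i in train_idx)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```

Because the split is seeded, rerunning it with the same labels yields the same partition, which is what allows the classical and BERT baselines to be compared on identical test examples.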

## Tokenisation using a pre-trained BERT tokenizer
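Review text is split into WordPiece subword tokens and mapped to the integer IDs of the `bert-base-uncased` vocabulary, which is the input format the BERT model expects. Sequences longer than 128 tokens are truncated, and shorter sequences are padded to the length of the longest sequence in the set.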

### Configure local Hugging Face cache directory

The Hugging Face Transformers library downloads pre-trained model files to a local cache directory. On some systems, the default cache location may not be writable, which can cause permission errors during model download. A project-local cache directory is configured to ensure reliable and reproducible access to pre-trained model files.

In [28]:
import os

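# Set a local Hugging Face cache directory inside the project.
# This should run before transformers is imported so that the setting is picked up.
os.environ["HF_HOME"] = os.path.join(os.getcwd(), ".hf_cache")
# TRANSFORMERS_CACHE is deprecated in recent transformers releases in favour of HF_HOME;
# it is kept here for compatibility with older versions.
os.environ["TRANSFORMERS_CACHE"] = os.path.join(os.getcwd(), ".hf_cache")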

# Ensure the cache directory exists
os.makedirs(os.environ["HF_HOME"], exist_ok=True)

os.environ["HF_HOME"]

'/Users/tommorton/Library/CloudStorage/OneDrive-Personal/Masters/Comp Sci/Modules/Masters Project/AmazonSentinmentAnalysis/notebooks/.hf_cache'

In [29]:
from transformers import BertTokenizer
import torch

# Load the pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased",
    cache_dir=os.path.join(os.getcwd(), ".hf_cache")
)

# Tokenise the training and test sets
train_encodings = tokenizer(
    X_train.tolist(), 
    truncation=True, 
    padding=True, 
    max_length=128
)
test_encodings = tokenizer(
    X_test.tolist(),
    truncation=True, 
    padding=True, 
    max_length=128
)
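
The interaction of `truncation=True`, `padding=True`, and `max_length` can be sketched with a toy function. This is a simplified illustration, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the token IDs are made up, and special tokens such as `[CLS]` and `[SEP]` are ignored.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each ID list to max_length, then pad to the longest remaining one."""
    truncated = [seq[:max_length] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)

    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for three "reviews" of different lengths
toy_ids = [[5, 6, 7], [5, 6, 7, 8, 9], [5]]
toy_encodings = pad_and_truncate(toy_ids, max_length=4)

print(toy_encodings["input_ids"])
print(toy_encodings["attention_mask"])
```

The attention mask is what lets the model distinguish real tokens from padding, which is why the tokenizer output contains it alongside the token IDs.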

## Encode labels
Sentiment labels are converted to integer class indices so that they can be used as targets by the transformer model.

In [32]:
from sklearn.preprocessing import LabelEncoder

# Encode the sentiment labels as integers
label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)
y_test_enc = label_encoder.transform(y_test)

# Display mapping of sentiment labels to integers
dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

{'negative': np.int64(0), 'neutral': np.int64(1), 'positive': np.int64(2)}
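
The mapping above follows from `LabelEncoder` assigning indices to the sorted unique class names. A minimal pure-Python sketch of the same idea is shown below; `fit_label_mapping` and the toy labels are hypothetical, and in the notebook itself `label_encoder.inverse_transform` can be used to map predicted indices back to label names.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an index, in sorted order of the label names."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


# Toy labels (hypothetical, not the review data)
toy_labels = ["positive", "negative", "neutral", "positive"]

label_to_id = fit_label_mapping(toy_labels)
id_to_label = {index: label for label, index in label_to_id.items()}

encoded = [label_to_id[label] for label in toy_labels]
decoded = [id_to_label[index] for index in encoded]

print(label_to_id)
print(encoded)
print(decoded)
```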

## Create PyTorch dataset objects
The tokenised text and encoded labels need to be wrapped in a PyTorch Dataset object for use with the DataLoader.

In [35]:
# Define a dataset wrapper for tokenised inputs
class ReviewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, index):
        item = {key: torch.tensor(val[index]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[index])
        return item

# Create dataset objects for training and test sets
train_dataset = ReviewsDataset(train_encodings, y_train_enc)
test_dataset = ReviewsDataset(test_encodings, y_test_enc)

# Check split size
len(train_dataset), len(test_dataset)

(96000, 24000)

## Load pre-trained BERT model for sequence classification
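The pre-trained BERT model is loaded with a classification head suited to the three-class sentiment analysis task. The classification head is newly initialised, so `classifier.weight` and `classifier.bias` are reported as MISSING from the checkpoint in the load report and are learned during fine-tuning. The UNEXPECTED keys belong to the pre-training heads, which this architecture does not use.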

In [39]:
from transformers import BertForSequenceClassification

# Load the pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_encoder.classes_), # Not hard coding allows for change in number of classes
    cache_dir=os.path.join(os.getcwd(), ".hf_cache")
)

Loading weights: 100%|██████████| 199/199 [00:00<00:00, 778.02it/s, Materializing param=bert.pooler.dense.weight]                               
BertForSequenceClassification LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
classifier.bias                            | MISSING    | 
classifier.weight                          | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those pa

## WIP - Define training parameters
Training arguments are defined to control how the transformer model is fine-tuned. These settings include the number of epochs, batch sizes, learning rate, and evaluation strategy.
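
As a rough planning aid, the sketch below shows how such settings translate into the number of optimisation steps for the 96,000 training examples. All values in `training_config` are placeholder assumptions, not the settings chosen for this project. The key names are picked to resemble common Hugging Face fine-tuning options, and the arithmetic assumes a single device with no gradient accumulation.

```python
import math

train_size = 96_000  # training set size from the split check earlier in the notebook

# Placeholder values (assumptions for illustration only)
training_config = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "learning_rate": 2e-5,
}

# One optimisation step processes one batch; the last batch may be smaller
steps_per_epoch = math.ceil(train_size / training_config["per_device_train_batch_size"])
total_steps = steps_per_epoch * training_config["num_train_epochs"]

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total optimisation steps: {total_steps}")
```

Halving the batch size doubles the number of steps per epoch, so batch size and epoch count are best chosen together with the available compute in mind.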