# BERT

For our final model, we'll use RoBERTa, a pretrained transformer model from the BERT family, loaded through the Hugging Face transformers library. BERT is an encoder-only model designed to produce contextual numerical representations of text input.

Key features:
1. Bidirectional: each word's representation is conditioned on the words both before and after it
2. Self-attention: each token weighs every other token by relevance, using query/key/value vectors
3. Pretrained embeddings: representations are learned on a large unlabeled corpus before fine-tuning
4. Masked language modeling: during pretraining, BERT masks out a fraction of the tokens and learns to predict them from their surroundings
5. Next sentence prediction: during pretraining, BERT also predicts whether two given sentences are consecutive, with the aim of learning relationships between sentences
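
The self-attention step from item 2 can be sketched in a few lines. This is a minimal single-head illustration in NumPy, not BERT's actual implementation: the projection matrices w_q, w_k and w_v are random stand-ins for learned weights, and the sizes (4 tokens, 8 dimensions) are arbitrary assumptions.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention; x has shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Similarity of every token's query with every token's key, scaled by sqrt(d_k)
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Row-wise softmax (subtracting the row max for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mix of all value vectors
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # arbitrary illustrative sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(weights.shape, output.shape)
```

No causal mask is applied, so every token attends to tokens on both its left and its right, which is what makes the encoder bidirectional.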

RoBERTa extends and improves upon BERT:
1. No next sentence prediction: the RoBERTa authors found that dropping it matched or slightly improved downstream performance
2. Dynamic masking: BERT's original implementation fixed the masked positions during preprocessing, so the same patterns were reused across training epochs. RoBERTa samples new masked positions each time a sequence is fed to the model, which exposes it to more varied masks and helps generalization.
3. Much larger batch sizes and a much larger training corpus
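
The difference between static and dynamic masking can be illustrated with a toy sketch. It is deliberately simplified and is not how either model is implemented: it masks whole words rather than subword tokens, always substitutes the mask token (the real procedure sometimes keeps or randomly replaces the selected token), and the sentence and the mask_tokens helper are made up for illustration.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace roughly mask_prob of the tokens (at least one) with the mask token."""
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

tokens = "the movie was surprisingly good and i would watch it again".split()

# Static masking: the pattern is chosen once during preprocessing and reused every epoch
static_example = mask_tokens(tokens, random.Random(0))
static_epochs = [static_example for _ in range(3)]

# Dynamic masking: a new pattern is sampled each time the sequence is served
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(s)}")
    print(f"epoch {epoch} dynamic: {' '.join(d)}")
```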

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'

import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import AutoTokenizer, RobertaForSequenceClassification, Trainer, TrainingArguments
import optuna

df = pd.read_csv('sentiment140_cleaned.csv')
texts = df['text'].astype(str)
labels = df['sentiment']

# Hold out 30% of the data, then split it evenly: 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(texts, labels, test_size=0.3, random_state=734)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=734)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

Let's create our dataset class:

In [2]:
class TwitterDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=64):
        self.encodings = tokenizer(list(texts), truncation=True, padding='max_length', max_length=max_len)
        self.labels = torch.tensor(labels.values, dtype=torch.long)
        
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

train_dataset = TwitterDataset(X_train, y_train, tokenizer)
val_dataset = TwitterDataset(X_val, y_val, tokenizer)
test_dataset = TwitterDataset(X_test, y_test, tokenizer)
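
To make the truncation=True and padding='max_length' arguments concrete, here is a rough pure-Python sketch of what they do to a single example. It is an illustration only, not the tokenizer's implementation: the token IDs are made up, the pad ID of 1 is an assumption for the example, and the special start/end tokens that the real tokenizer adds are ignored.

```python
def pad_or_truncate(token_ids, max_len=64, pad_id=1):
    """Return fixed-length input_ids plus an attention mask (1 = real token, 0 = padding)."""
    ids = token_ids[:max_len]                      # truncation: drop anything past max_len
    n_pad = max_len - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # padded positions are ignored by attention
    input_ids = ids + [pad_id] * n_pad             # padding: fill up to max_len
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short_enc = pad_or_truncate(list(range(100, 110)), max_len=16)  # 10 tokens, padded to 16
long_enc = pad_or_truncate(list(range(100, 140)), max_len=16)   # 40 tokens, cut to 16

print(short_enc["input_ids"])
print(short_enc["attention_mask"])
print(len(long_enc["input_ids"]), sum(long_enc["attention_mask"]))
```

Padding every example to max_length up front keeps the dataset class simple, at the cost of some wasted computation on short tweets. Padding each batch only to its longest example (for instance with DataCollatorWithPadding from transformers) is a common alternative.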

For this model, let's evaluate with accuracy, precision, recall, and F1 score.

In [3]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='binary', zero_division=0
    )
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc, 'precision': precision, 'recall': recall, 'f1': f1}
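
For reference, the four metrics can be written out by hand from the confusion-matrix counts. The sketch below assumes labels are encoded as 0/1 with 1 as the positive class (the default for average='binary'); the binary_metrics helper and the toy labels are hypothetical, not part of the notebook.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives

    # Fall back to 0.0 when a denominator is empty, mirroring zero_division=0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(binary_metrics(labels, preds))
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean, so it only scores high when both do.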

To streamline the training/fine-tuning loop, we'll use Trainer, which automatically uses a GPU when one is available. We'll also use Optuna to try various hyperparameter configurations (learning rate, batch size) and then train the final model on the best configuration.

Note that each trial saves its checkpoints to ./results_trial_N, where N is the trial number, so you can easily retrieve previous runs.
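
As a rough picture of the search space, the sketch below draws a few configurations by hand. It is a stand-in for illustration, not Optuna's sampler: it samples purely at random, whereas Optuna's default sampler can also use the results of earlier trials. The sample_log_uniform helper is hypothetical. The ranges match the ones used in the notebook, with the learning rate drawn uniformly on a log scale.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
for trial_number in range(4):
    config = {
        "learning_rate": sample_log_uniform(rng, 1e-5, 5e-5),
        "batch_size": rng.choice([16, 32]),
    }
    print(trial_number, config)
```

Sampling on a log scale spreads trials evenly across orders of magnitude, which suits learning rates better than a linear scale.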

In [4]:
def model_init():
    return RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32])
    
    training_args = TrainingArguments(
        output_dir=f"./results_trial_{trial.number}",
        eval_strategy='epoch',
        save_strategy='epoch',
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=3,
        logging_steps=1000,
        disable_tqdm=False,
        load_best_model_at_end=True,
        metric_for_best_model='f1',
        greater_is_better=True,
        save_total_limit=2
    )
    
    trainer = Trainer(
        model_init=model_init,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics
    )
    
    trainer.train()
    eval_result = trainer.evaluate()
    return eval_result['eval_f1']

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=8)

best_params = study.best_params
print("Best hyperparameters:", best_params)

[I 2025-09-24 14:26:02,700] A new study created in memory with name: no-name-a8d63494-553e-42d2-a332-3bb3d483f50f
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.409484,0.834667,0.899506,0.744884,0.814925
2,No log,0.436475,0.836667,0.793269,0.900409,0.84345
3,No log,0.43483,0.86,0.846358,0.87176,0.858871


[I 2025-09-24 14:28:14,504] Trial 0 finished with value: 0.8588709677419355 and parameters: {'learning_rate': 4.324490720841345e-05, 'batch_size': 32}. Best is trial 0 with value: 0.8588709677419355.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.38621,0.835333,0.859467,0.792633,0.824698
2,No log,0.405488,0.846667,0.822023,0.875853,0.848085
3,No log,0.412742,0.862667,0.850866,0.87176,0.861186


[I 2025-09-24 14:30:28,136] Trial 1 finished with value: 0.8611859838274932 and parameters: {'learning_rate': 3.9498710996025097e-05, 'batch_size': 32}. Best is trial 1 with value: 0.8611859838274932.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.433348,0.813333,0.771257,0.878581,0.821429
2,No log,0.42329,0.828667,0.805128,0.856753,0.830139
3,0.412100,0.56325,0.842667,0.831776,0.849932,0.840756


[I 2025-09-24 14:33:10,034] Trial 2 finished with value: 0.8407557354925776 and parameters: {'learning_rate': 4.8241335103818346e-05, 'batch_size': 16}. Best is trial 1 with value: 0.8611859838274932.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.374405,0.839333,0.852436,0.811733,0.831586
2,No log,0.373368,0.855333,0.829082,0.886767,0.856955
3,0.385600,0.410181,0.854,0.840849,0.864939,0.852724


[I 2025-09-24 14:35:51,541] Trial 3 finished with value: 0.8569545154911009 and parameters: {'learning_rate': 1.3195120281189997e-05, 'batch_size': 16}. Best is trial 1 with value: 0.8611859838274932.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.410996,0.833333,0.825034,0.836289,0.830623
2,No log,0.414985,0.838,0.795181,0.900409,0.84453
3,0.397200,0.557326,0.85,0.828165,0.874488,0.850697


[I 2025-09-24 14:38:32,301] Trial 4 finished with value: 0.8506967485069675 and parameters: {'learning_rate': 4.919992495862197e-05, 'batch_size': 16}. Best is trial 1 with value: 0.8611859838274932.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.415761,0.818,0.841246,0.773533,0.80597
2,No log,0.407399,0.829333,0.789793,0.886767,0.835476
3,No log,0.404558,0.836,0.827725,0.839018,0.833333


[I 2025-09-24 14:40:44,922] Trial 5 finished with value: 0.8354755784061697 and parameters: {'learning_rate': 1.0556635030717683e-05, 'batch_size': 32}. Best is trial 1 with value: 0.8611859838274932.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.381871,0.835333,0.855263,0.79809,0.825688
2,No log,0.417264,0.851333,0.829457,0.875853,0.852024
3,No log,0.426156,0.852667,0.838624,0.864939,0.851578


[I 2025-09-24 14:42:59,190] Trial 6 finished with value: 0.8520238885202389 and parameters: {'learning_rate': 3.1980790436482464e-05, 'batch_size': 32}. Best is trial 1 with value: 0.8611859838274932.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.350541,0.842,0.82376,0.860846,0.841895
2,No log,0.417794,0.848,0.802395,0.914052,0.854592
3,0.368200,0.428487,0.854,0.83727,0.870396,0.853512


[I 2025-09-24 14:45:40,591] Trial 7 finished with value: 0.8545918367346939 and parameters: {'learning_rate': 1.7125673645151786e-05, 'batch_size': 16}. Best is trial 1 with value: 0.8611859838274932.


Best hyperparameters: {'learning_rate': 3.9498710996025097e-05, 'batch_size': 32}


Now let's train our final model with the best hyperparameters. Its checkpoints will be saved under ./best_model for future reference.

In [5]:
final_training_args = TrainingArguments(
    output_dir="./best_model",
    eval_strategy='epoch',
    save_strategy='epoch',
    learning_rate=best_params['learning_rate'],
    per_device_train_batch_size=best_params['batch_size'],
    per_device_eval_batch_size=best_params['batch_size'],
    num_train_epochs=3,
    logging_steps=1000,
    disable_tqdm=False,
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    greater_is_better=True,
    save_total_limit=2
)

final_trainer = Trainer(
    model_init=model_init,
    args=final_training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

final_trainer.train()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.38621,0.835333,0.859467,0.792633,0.824698
2,No log,0.405488,0.846667,0.822023,0.875853,0.848085
3,No log,0.412742,0.862667,0.850866,0.87176,0.861186


TrainOutput(global_step=657, training_loss=0.3365602072333999, metrics={'train_runtime': 129.4155, 'train_samples_per_second': 162.268, 'train_steps_per_second': 5.077, 'total_flos': 690666520320000.0, 'train_loss': 0.3365602072333999, 'epoch': 3.0})
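
The "No log" entries in the Training Loss column follow from logging_steps=1000: the training loss is only reported every 1,000 optimizer steps, and the runs with batch size 32 finish before reaching that point. A back-of-the-envelope check, assuming a single device and no gradient accumulation (neither is stated in the notebook), with the training-set size inferred from the reported throughput:

```python
import math

def steps_per_run(n_train, batch_size, epochs=3):
    """Optimizer steps in one run: ceil(n_train / batch_size) per epoch."""
    return math.ceil(n_train / batch_size) * epochs

# Training-set size implied by the TrainOutput: samples/s * runtime / epochs
n_train = round(162.268 * 129.4155 / 3)

logging_steps = 1000
for batch_size in (16, 32):
    total = steps_per_run(n_train, batch_size)
    print(f"batch_size={batch_size}: {total} steps, "
          f"{total // logging_steps} training-loss log(s)")
# batch_size=16: 1314 steps, 1 training-loss log(s)
# batch_size=32: 657 steps, 0 training-loss log(s)
```

The 657 steps for batch size 32 match the reported global_step, and the single log for batch size 16 lands in the third epoch, which is where a training loss appears in those trials. Lowering logging_steps would surface the training loss for every epoch.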

Our final model reaches a test accuracy of 86%, far outstripping the logistic regression/BOW baseline of 74%. Precision (84%) and recall (88%) are also relatively high. This shows how well modern transformer models can pick up and encode contextual patterns within text.

In [6]:
metrics = final_trainer.evaluate(test_dataset)
print("Test metrics:", metrics)

Test metrics: {'eval_loss': 0.4049210548400879, 'eval_accuracy': 0.862, 'eval_precision': 0.8449612403100775, 'eval_recall': 0.8825910931174089, 'eval_f1': 0.8633663366336634, 'eval_runtime': 2.6093, 'eval_samples_per_second': 574.875, 'eval_steps_per_second': 18.013, 'epoch': 3.0}
