In this notebook, we fine-tune a pretrained Transformer model on the
Yelp Review Full dataset for multiclass sentiment classification (1–5 stars).

Objectives:

Tokenize preprocessed text

Load a pretrained Transformer model

Fine-tune on Yelp reviews

Track training time

Save the trained model

In [None]:
import os

# Environment safety (important for Python 3.13 + CPU)
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ["TORCHDYNAMO_DISABLE"] = "1"

import time
import numpy as np
import torch

from datasets import load_from_disk
from transformers import (
    AutoTokenizer,
    RobertaForSequenceClassification,
    TrainingArguments,
    Trainer
)

print("Environment ready")
print("Torch version:", torch.__version__)




Environment ready ✅
Torch version: 2.9.1+cpu


Load Preprocessed Dataset

We load the cleaned dataset saved in 02_preprocessing.ipynb.

In [17]:
train_ds = load_from_disk("data/processed/train_clean")
test_ds = load_from_disk("data/processed/test_clean")

print("Train size:", len(train_ds))
print("Test size:", len(test_ds))


Train size: 650000
Test size: 50000


Model Selection & Tokenizer

In [18]:
MODEL_NAME = "roberta-base"
NUM_LABELS = 5
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)



Tokenization 

We cap sequences at 256 tokens, half of RoBERTa's 512-token limit, to speed up training. Longer reviews are truncated and shorter ones are padded to that length.

In [None]:
def tokenize_function(batch):
    return tokenizer(
        batch["clean_text"],
        padding="max_length",
        truncation=True,
        max_length=256
    )

train_tokenized = train_ds.map(tokenize_function, batched=True)
test_tokenized = test_ds.map(tokenize_function, batched=True)

train_tokenized = train_tokenized.remove_columns(["text", "clean_text"])
test_tokenized = test_tokenized.remove_columns(["text", "clean_text"])

train_tokenized.set_format("torch")
test_tokenized.set_format("torch")

print("Tokenization completed and formatted")


Tokenization completed ✅
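
To make the effect of padding="max_length" and truncation=True concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: the helper pad_or_truncate and the token ids are made up for illustration, and the real tokenizer also keeps the closing special token when it truncates. The pad id of 1 matches RoBERTa's <pad> token.

```python
# Toy illustration of padding="max_length" with truncation=True.
# PAD_ID = 1 matches RoBERTa's <pad> token id; the other ids are made up.
PAD_ID = 1

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop tokens past the cap
    n_pad = max_length - len(kept)           # padding: fill up to the cap
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 0 marks padding positions
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10, 20)), max_length=6)

print(short_ids, short_mask)  # [0, 713, 16, 2, 1, 1] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [10, 11, 12, 13, 14, 15] [1, 1, 1, 1, 1, 1]
```

Every example ends up with the same length, so short reviews still cost a full 256-token forward pass. That is the trade-off of fixed-length padding.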


Subsample Dataset (Critical for CPU Training)

We train on 1,000 samples (~0.15% of the training set) and evaluate on 2,000 test samples.

In [20]:
train_small = train_tokenized.shuffle(seed=42).select(range(1000))
test_small = test_tokenized.shuffle(seed=42).select(range(2000))

print("Training samples used:", len(train_small))
print("Test samples used:", len(test_small))


Training samples used: 1000
Test samples used: 2000
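
With a subsample this small, the five classes may not be evenly represented. The sketch below shows one way to check the class shares. The helper label_distribution is hypothetical, and we assume the label column is named "label" with class ids 0 to 4; neither is confirmed by the notebook.

```python
from collections import Counter

def label_distribution(labels, num_labels=5):
    """Return the share of each class id 0..num_labels-1 in `labels`."""
    counts = Counter(int(label) for label in labels)
    total = len(labels)
    return {k: counts.get(k, 0) / total for k in range(num_labels)}

# Hypothetical labels; in this notebook the input would be
# train_small["label"] (assumed column name).
example_labels = [0, 1, 1, 2, 3, 4, 4, 4]
print(label_distribution(example_labels))
```

If one class falls far below 20%, a larger or stratified subsample may be worth considering.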


Load Model

In [None]:
model = RobertaForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS
)

print("Model loaded successfully")


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded successfully ✅


Training Arguments (CPU-SAFE)

In [22]:
training_args = TrainingArguments(
    output_dir="models/roberta_yelp",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_steps=200,
    save_total_limit=1,
    report_to="none",
    use_cpu=True  # replaces the deprecated no_cuda flag
)
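
As a sanity check on these settings, we can work out how many optimizer steps the run takes. This is a rough sketch assuming a single device, no gradient accumulation, and the 1,000-sample subset selected earlier. The helper optimizer_steps is ours, not part of the Trainer API.

```python
import math

def optimizer_steps(num_samples, batch_size, epochs):
    """Optimizer steps per epoch and in total (single device, no accumulation)."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

per_epoch, total = optimizer_steps(num_samples=1000, batch_size=4, epochs=2)
print(per_epoch, total)                   # 250 500

# With logging_steps=200, the training loss is logged at these steps
print(list(range(200, total + 1, 200)))   # [200, 400]
```

So the run logs the training loss only twice. A smaller logging_steps would give a finer-grained loss curve on a run this short.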




Initialize Trainer

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_small,
    eval_dataset=test_small,
    processing_class=tokenizer  # replaces the deprecated tokenizer argument
)

print("Trainer initialized successfully")




Trainer initialized ✅
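
As configured, the Trainer reports only the validation loss at each evaluation. A minimal sketch of an accuracy metric is shown below. It assumes the evaluation output unpacks into a (logits, labels) pair of arrays, and the logits and labels used here are made up.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)   # highest-scoring class per row
    return {"accuracy": float((predictions == labels).mean())}

# Made-up logits for three reviews and five classes
fake_logits = np.array([[2.0, 0.1, 0.0, 0.0, 0.0],
                        [0.0, 0.0, 0.0, 0.1, 3.0],
                        [0.0, 1.5, 0.2, 0.0, 0.0]])
fake_labels = np.array([0, 4, 2])
print(compute_metrics((fake_logits, fake_labels)))  # two of three correct
```

To use it, pass compute_metrics=compute_metrics when constructing the Trainer.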


Train Model & Measure Training Time

In [24]:
start_time = time.time()

trainer.train()

end_time = time.time()
training_time_minutes = (end_time - start_time) / 60

print(f"Total training time: {training_time_minutes:.2f} minutes")


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 
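
If the run is interrupted, the timing code after trainer.train() never executes. One possible alternative is a small context manager that records the elapsed time in a finally block. The helper timed is hypothetical, and the sleep call stands in for trainer.train().

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store elapsed minutes in results[label], even if the block raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = (time.perf_counter() - start) / 60

timings = {}
try:
    with timed("training", timings):
        time.sleep(0.05)           # stand-in for trainer.train()
        raise KeyboardInterrupt    # simulate a manual interrupt
except KeyboardInterrupt:
    pass

print(f"Elapsed before interrupt: {timings['training']:.4f} minutes")
```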

Save Final Model

In [None]:
trainer.save_model("models/roberta_yelp/final")
tokenizer.save_pretrained("models/roberta_yelp/final")

print("Final model saved")


Final model saved ✅
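
Once the model is saved, its raw outputs still need to be turned into star ratings. The sketch below assumes class id k corresponds to k + 1 stars; the notebook does not confirm this, so check it against the dataset's label names. The helper logits_to_stars and the example logits are made up.

```python
import numpy as np

def logits_to_stars(logits):
    """Map (n, 5) logits to star ratings 1..5 and the top class probability."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    class_ids = probs.argmax(axis=-1)
    return class_ids + 1, probs.max(axis=-1)   # assumed mapping: id k -> k + 1 stars

stars, confidence = logits_to_stars([[0.1, 0.2, 0.3, 2.5, 0.4],
                                     [3.0, 0.5, 0.1, 0.0, 0.0]])
print(stars)  # [4 1]
```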
