To train our rating prediction model, we fine-tune a pretrained transformer using the Hugging Face datasets and transformers libraries. We first load the Yelp Review Full dataset directly from the Hugging Face Hub, which provides the raw review text and its associated star labels. The DistilBERT tokenizer then converts each review into the token IDs and attention masks required by the model. Tokenization is performed by a preprocessing function mapped across the entire dataset in batches, with reviews truncated to the model's maximum input length. Padding is deferred to batching time: each batch is padded only to the length of its longest review (dynamic padding), which avoids spending computation on unnecessary padding tokens.

We fine-tune a DistilBERT-based sequence classification model with a five-class output layer, one class per star rating. Training is managed through the Hugging Face Trainer API, which handles data batching, optimization, evaluation, and checkpointing under a unified interface. We specify the learning rate, batch size, weight decay, and number of epochs, and we evaluate on the test split at the end of each epoch, monitoring both accuracy and macro-F1. Macro-F1 averages the per-class F1 scores with equal weight, so it shows whether performance is uniform across the five ratings rather than concentrated in a few of them. This pipeline enables end-to-end fine-tuning of a transformer model on the Yelp dataset with little boilerplate.
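The dynamic padding mentioned above can be illustrated with a small, library-free sketch. It only mimics the idea (pad each batch to its own longest sequence) and is not the internals of DataCollatorWithPadding; the function name `pad_batch`, the padding id, and the token-id sequences are all made up for this example.

```python
# Illustrative sketch of dynamic padding; not the Hugging Face implementation.
PAD_ID = 0  # assumed padding id for this toy example


def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest sequence in this batch."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token-id sequences of different lengths.
short_batch = [[101, 2307, 102], [101, 2919, 2833, 102]]
long_batch = [[101] + [2023] * 10 + [102], [101, 102]]

padded_short = pad_batch(short_batch)
padded_long = pad_batch(long_batch)

print(len(padded_short["input_ids"][0]))  # 4: padded to this batch's longest review
print(len(padded_long["input_ids"][0]))   # 12: a different batch, a different width
```

Because the padded width is chosen per batch, a batch of short reviews stays narrow instead of being padded to the longest review in the whole dataset.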

In [3]:
from datasets import load_dataset
from transformers import (
    AutoTokenizer, 
    DataCollatorWithPadding, 
    AutoModelForSequenceClassification, 
    TrainingArguments, 
    Trainer
)
import numpy as np
import evaluate

# 1. Load dataset
dataset = load_dataset("yelp_review_full")

# 2. Load tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 3. Tokenize function
def preprocess(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding=False  # use dynamic padding
    )

encoded_dataset = dataset.map(preprocess, batched=True)

# 4. Data collator (dynamic padding)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# 5. Load model (5 classes)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=5
)

# 6. Metrics
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1_macro": f1.compute(predictions=preds, references=labels, average="macro")["f1"]
    }

# 7. Training arguments
training_args = TrainingArguments(
    output_dir="./yelp_distilbert",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,        
    weight_decay=0.01,
    logging_steps=200,
    load_best_model_at_end=True,
)

# 8. Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument (transformers >= 4.46)
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# 9. Train
trainer.train()

# 10. Evaluate
trainer.evaluate()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
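The two metrics reported by compute_metrics can be checked by hand on a toy example. The sketch below is a plain-Python illustration, not the code behind the evaluate library; the helper names and the six predictions are invented for illustration, and it assumes that label index i corresponds to a rating of i + 1 stars.

```python
def accuracy_score(preds, labels):
    """Fraction of predictions that match the reference labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def macro_f1(preds, labels, num_classes=5):
    """Unweighted mean of per-class F1 over all num_classes classes."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for p, y in zip(preds, labels) if p == c and y == c)
        fp = sum(1 for p, y in zip(preds, labels) if p == c and y != c)
        fn = sum(1 for p, y in zip(preds, labels) if p != c and y == c)
        denom = 2 * tp + fp + fn
        # A class that never appears and is never predicted scores 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


# Hypothetical labels and predictions (class indices 0..4).
labels = [0, 1, 2, 3, 4, 4]
preds = [0, 1, 2, 3, 4, 3]

acc = accuracy_score(preds, labels)
f1_macro = macro_f1(preds, labels)
stars = [p + 1 for p in preds]  # assumed mapping: index i means i + 1 stars

print(round(acc, 4))       # 0.8333
print(round(f1_macro, 4))  # 0.8667
print(stars)               # [1, 2, 3, 4, 5, 4]
```

Here a single confusion between 4 and 5 stars lowers the F1 of both classes involved, so macro-F1 reflects which ratings are affected while accuracy only counts the one wrong prediction.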
