# Overview 🚀

This notebook demonstrates fine-tuning and evaluating both BERT-based and GPT-based sequence classification models on the Yelp Review Full dataset using Hugging Face Transformers and Datasets libraries. The workflow includes:

- 📦 Loading and preprocessing the Yelp Review Full dataset, including tokenization.
- ✂️ Selecting subsets for training and evaluation.
- 🤖 Initializing BERT and GPT models for sequence classification.
- ⚙️ Setting up training arguments and metrics (accuracy).
- 🏋️‍♂️ Training the models using the Trainer API.
- 🔍 Making predictions on evaluation samples and decoding results.
- ✨ Comparing the performance of BERT and GPT architectures on sentiment classification.

Key variables include the dataset, model, tokenizer, training arguments, trainer, and evaluation metrics. The notebook leverages GPU acceleration for efficient computation and provides insights into the strengths of both BERT and GPT for NLP tasks.

## Environment Setup ⚙️🐍

The environment setup ensures all necessary libraries are installed for fine-tuning and evaluating a BERT-based model. Key steps include:

- 🤗 **transformers**: Provides state-of-the-art pre-trained models (like BERT) and tokenizers for NLP tasks. It enables easy loading, fine-tuning, and inference of models from Hugging Face's model hub.
- 📚 **datasets**: Offers efficient access to large-scale datasets (such as Yelp Review Full) with built-in preprocessing, shuffling, and slicing. It supports streaming, mapping, and integration with PyTorch/TensorFlow for seamless ML workflows.
- 📏 **evaluate**: Supplies a wide range of metrics (e.g., accuracy, F1, BLEU) for model evaluation. It integrates with datasets and transformers, allowing easy computation of metrics during training and testing.
- 🧮 **scikit-learn**: Delivers classic machine learning utilities for data preprocessing, metrics, and model selection, complementing deep learning workflows.
- 🚀 **accelerate**: Simplifies multi-GPU, TPU, and mixed-precision training. It abstracts device placement and distributed training, making it easy to scale experiments from CPU to GPU/TPU without code changes.

The workflow includes:
- 🛠️ Installing essential packages for model training and evaluation.
- 📥 Loading the Yelp Review Full dataset using `datasets`.
- 🏷️ Initializing the BERT tokenizer from Hugging Face's model hub via `transformers`.
- ✂️ Tokenizing the dataset for compatibility with BERT.
- 🔀 Selecting subsets for training and evaluation to speed up experimentation.

This setup provides a reproducible, scalable, and efficient environment for NLP model fine-tuning and evaluation, leveraging modern libraries for data handling, model management, metric computation, and hardware acceleration.
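
Because the install cell pins `transformers` to a specific version, it can be useful to confirm what was actually resolved before running the rest of the notebook. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of the libraries above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Package names taken from the install command in this notebook.
report = installed_versions(
    ["transformers", "datasets", "evaluate", "scikit-learn", "accelerate"]
)
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

A `None` entry signals that the install step should be rerun before continuing.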


In [1]:
!pip install 'transformers[torch]==4.50.3' 'datasets' 'evaluate' 'scikit-learn' 'accelerate'

Collecting transformers==4.50.3 (from transformers[torch]==4.50.3)
  Downloading transformers-4.50.3-py3-none-any.whl.metadata (39 kB)
Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Downloading transformers-4.50.3-py3-none-any.whl (10.2 MB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
Installing collected packages: transformers, evaluate
  Attempting uninstall: transformers
    Found existing installation: transformers 4.55.2
    Uninstalling transformers-4.55.2:
      Successfully uninstalled transformers-4.55.2
Successfully installed evaluate-0.4.5 transformers-4.50.3


## Download Dataset and Pre-trained Model 📥🤗📝

This section covers the steps to download the Yelp Review Full dataset and initialize a pre-trained BERT model for sequence classification.

- **Dataset Download** 📦  
    The `datasets` library is used to load the Yelp Review Full dataset, which contains user reviews and their corresponding star ratings (1-5). The dataset is split into `train` (650,000 reviews) and `test` (50,000 reviews) sets. Each raw example has a `text` and a `label` field; the BERT-compatible inputs (`input_ids`, `token_type_ids`, `attention_mask`) are added by the tokenization step.

- **Tokenizer Initialization** 🔤  
    The BERT tokenizer (`google-bert/bert-base-cased`) is loaded using Hugging Face's `transformers` library. It converts raw review text into token IDs and attention masks required by the BERT model.

- **Tokenization & Preprocessing** ✂️  
    The dataset is tokenized in batches, ensuring each review is padded and truncated to fit BERT's input requirements. This produces additional columns (`input_ids`, `token_type_ids`, `attention_mask`) in the dataset.

- **Subset Selection** ⚡  
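    For faster experimentation, small shuffled subsets of the training and evaluation data are selected (100 samples each). In the cells that follow, the `Trainer` is still given the full splits; only the evaluation subset is used, for the sample predictions.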

- **Model Initialization** 🤖  
    A pre-trained BERT model for sequence classification is loaded and configured for 5 output labels (matching the star ratings).

- **Device Setup** 🖥️🚀  
    The model and data tensors are moved to GPU (`cuda`) if available for accelerated training and inference.

These steps ensure the data and model are ready for fine-tuning and evaluation on the Yelp Review Full sentiment classification task.
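
To make the padding and truncation step concrete, here is a framework-free sketch of what `padding="max_length"` and `truncation=True` do to a single sequence of token IDs. It is an illustration under simplifying assumptions, not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, the two middle token IDs are made up, and a real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]                    # cut reviews that are too long
    n_pad = max_length - len(kept)                   # padding needed for short reviews
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# 101 and 102 are the [CLS] and [SEP] ids of this BERT tokenizer; 1188 and 1110 are made up.
short_ids, short_mask = pad_or_truncate([101, 1188, 1110, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, which is what allows reviews of different sizes to be stacked into one batch tensor.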


### What is Yelp Review? 📝⭐

**Yelp Review Full** is a large-scale text classification dataset derived from Yelp, a popular platform for user-generated reviews of local businesses (restaurants, shops, services, etc.). The dataset contains:

- **Text reviews** 🗣️: Written by real users, describing their experiences with businesses.
- **Labels** 🏷️: Each review is assigned a star rating from 1 to 5, representing the sentiment (1 = very negative, 5 = very positive). The `label` column stores these as class indices 0 to 4.
- **Purpose** 🎯: Used for benchmarking machine learning models on sentiment analysis and text classification tasks.

Researchers use this dataset to train and evaluate models that can automatically predict the sentiment or rating of a review based on its text.
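
When reading predictions later in the notebook, it helps to convert class indices back to star ratings. The sketch below assumes the `label` column holds zero-based class indices (0 to 4), which is consistent with `num_labels=5` and with the printed predictions; the helper `label_to_stars` is hypothetical.

```python
def label_to_stars(label):
    """Convert a zero-based class index (0-4) to a star rating (1-5)."""
    if not 0 <= label <= 4:
        raise ValueError(f"expected a class index in 0..4, got {label}")
    return label + 1

# The first five true labels printed by the evaluation cell in this notebook.
labels = [2, 4, 1, 4, 3]
stars = [label_to_stars(label) for label in labels]
print(stars)
```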

---

### What is BERT? 🤖📚

**BERT (Bidirectional Encoder Representations from Transformers)** is a state-of-the-art language model developed by Google. Key features:

- **Transformer architecture** 🏗️: Uses self-attention mechanisms to understand the context of words in a sentence, both from the left and right (bidirectional).
- **Pre-trained on large corpora** 🌐: BERT is trained on massive datasets (like Wikipedia and BooksCorpus) to learn general language representations.
- **Fine-tuning** 🛠️: After pre-training, BERT can be adapted (fine-tuned) for specific tasks such as sentiment analysis, question answering, or text classification.
- **Sequence classification** 🧩: BERT can take a sequence of text and output a prediction (e.g., the sentiment label for a review).

BERT’s deep understanding of language context makes it highly effective for NLP tasks, including classifying Yelp reviews by sentiment.
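
The word "bidirectional" can be illustrated by writing down which positions each token is allowed to attend to. The sketch below compares the full visibility pattern of a bidirectional encoder with the lower-triangular pattern of a left-to-right model. It ignores padding and special tokens, and the helper `attention_pattern` is hypothetical.

```python
def attention_pattern(n_tokens, causal):
    """Row i marks the positions token i may attend to (1 = visible, 0 = hidden)."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(n_tokens)]
        for i in range(n_tokens)
    ]

bidirectional = attention_pattern(4, causal=False)  # every token sees every token
left_to_right = attention_pattern(4, causal=True)   # token i sees tokens 0..i only

for row in bidirectional:
    print(row)
print()
for row in left_to_right:
    print(row)
```

In the bidirectional pattern the first token already has access to the end of the review, which is what lets BERT use context from both sides.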

In [2]:
# Import the Yelp Review Full dataset using Hugging Face Datasets
from datasets import load_dataset
# Import the BERT tokenizer from Hugging Face Transformers
from transformers import AutoTokenizer

# Load the Yelp Review Full dataset, which contains train and test splits
dataset = load_dataset("yelp_review_full")

# Initialize the BERT tokenizer (cased, English) for converting text to model inputs
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

# Define a function to tokenize batches of examples from the dataset
def tokenize(examples):
    # Tokenize the "text" field, pad to max length, and truncate longer reviews
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Apply the tokenization function to the entire dataset in batches
# This adds 'input_ids', 'token_type_ids', and 'attention_mask' columns required by BERT
dataset = dataset.map(tokenize, batched=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/299M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/23.5M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/650000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/50000 [00:00<?, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Map:   0%|          | 0/650000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [3]:
# Select small, shuffled subsets from the Yelp Review Full dataset for quick experimentation

# Shuffle the training set with a fixed seed for reproducibility, then select the first 100 samples
train = dataset["train"].shuffle(seed=42).select(range(100))

# Shuffle the test set with the same seed, then select the first 100 samples for evaluation
eval = dataset["test"].shuffle(seed=42).select(range(100))

In [4]:
from transformers import AutoModelForSequenceClassification

# Initialize a pre-trained BERT model for sequence classification.
# - "google-bert/bert-base-cased" specifies the model checkpoint from Hugging Face's model hub.
# - num_labels=5 sets the output layer to predict 5 classes (matching Yelp star ratings: 1-5).
# The model is loaded with weights pre-trained on general language tasks, ready for fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased",  # Model name (cased English BERT)
    num_labels=5                    # Number of output classes for classification
)

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Fine Tuning 🏋️‍♂️🤖

The training process fine-tunes a pre-trained BERT model for sentiment classification on the Yelp Review Full dataset. Key steps include:

- ⚙️ **Trainer Initialization**: The Hugging Face `Trainer` API is used to manage the training loop, evaluation, and checkpointing. It is configured with the BERT model, training arguments, and datasets.
- 📝 **Training Arguments**: Parameters such as learning rate, number of epochs, evaluation strategy, and output directory are set using `TrainingArguments`. These control how the model is trained and evaluated.
- 📦 **Dataset Preparation**: The training and evaluation datasets are tokenized and formatted for BERT, ensuring compatibility with the model's input requirements.
- 📏 **Metric Computation**: Accuracy is computed after each evaluation step using the `evaluate` library, providing feedback on model performance.
- 🚀 **GPU Acceleration**: Training is performed on GPU if available, significantly speeding up computation.
- 🔧 **Fine-tuning**: The model's weights are updated based on the training data, adapting BERT's general language understanding to the specific task of Yelp review sentiment classification.
- 🧪 **Evaluation**: After each epoch, the model is evaluated on the test set to monitor accuracy and prevent overfitting.
- 💾 **Checkpointing**: Model checkpoints and logs are saved to the specified output directory for later analysis or reuse.

This workflow leverages modern NLP libraries to streamline fine-tuning, making it efficient and reproducible for large-scale text classification tasks.


### Training Arguments: What They Are & Why They Matter ⚙️📝

**Training arguments** are configuration settings that control how the model is fine-tuned. They define the training process, evaluation strategy, resource usage, and output management. In this notebook, the `TrainingArguments` object includes:

- **eval_strategy="epoch"** 🕰️  
    Evaluates the model at the end of each training epoch, helping monitor performance and detect overfitting.

- **max_steps=10** ⏱️  
    Stops training after 10 optimizer steps, enabling quick experiments and controlling computational cost. When set to a positive value, it overrides `num_train_epochs`.

- **eval_steps=10** 📊  
    Sets the number of steps between evaluations when `eval_strategy="steps"`. With `eval_strategy="epoch"`, as used here, it has no effect.

- **learning_rate=2e-5** 🚀  
    Sets the optimizer's learning rate, balancing how quickly the model adapts to new data.

- **num_train_epochs=2** 🔄  
    Requests two epochs of training. In this notebook the value is superseded by `max_steps`, so training ends long before one epoch completes.

- **output_dir=SAVE_DIR** 💾  
    Specifies where to save model checkpoints and logs (here the `artifacts` directory), ensuring reproducibility and easy access to results.

- **report_to="none"** 🚫  
    Disables external logging integrations for a simpler workflow.

**Why are training arguments important?**

- They **control the pace and duration** of training, affecting both speed and model quality.
- They **define evaluation frequency**, enabling early stopping and performance tracking.
- They **manage resource usage** (e.g., GPU, memory), making experiments scalable and efficient.
- They **ensure reproducibility** by saving checkpoints and logs.
- They **customize the workflow** for different tasks, datasets, and hardware setups.

Properly setting training arguments is crucial for efficient, reliable, and interpretable model fine-tuning. 🌟
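
The interaction between `max_steps` and the dataset size can be checked with simple arithmetic. The sketch below estimates how many reviews a run sees and what fraction of an epoch that is. It assumes a single device, no gradient accumulation, and the library's default per-device training batch size of 8 when none is set; the helper `fraction_of_epoch` is hypothetical.

```python
TRAIN_SIZE = 650_000  # size of the Yelp Review Full train split

def fraction_of_epoch(max_steps, batch_size, train_size=TRAIN_SIZE):
    """Return (examples_seen, fraction of one epoch) for a step-limited run."""
    examples_seen = max_steps * batch_size
    return examples_seen, examples_seen / train_size

# BERT run: max_steps=10 with the default batch size of 8.
bert_seen, bert_fraction = fraction_of_epoch(max_steps=10, batch_size=8)
# GPT run: max_steps=10 with per_device_train_batch_size=1.
gpt_seen, gpt_fraction = fraction_of_epoch(max_steps=10, batch_size=1)

print(f"BERT: {bert_seen} reviews, {bert_fraction:.6%} of an epoch")
print(f"GPT:  {gpt_seen} reviews, {gpt_fraction:.6%} of an epoch")
```

These fractions correspond to the `epoch` values reported in the `TrainOutput` of each run, which is a quick way to confirm that only a tiny slice of the data was used.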

In [5]:
import numpy as np
import evaluate

# Load the accuracy metric from the evaluate library.
# This metric will be used to assess the model's performance during evaluation.
metric = evaluate.load("accuracy")

# Define a function to compute metrics for model evaluation.
# This function will be passed to the Trainer and called after each evaluation step.
def compute_metrics(eval_pred):
    logits, labels = eval_pred  # Unpack model outputs (logits) and true labels.
    # Convert logits (raw model outputs) to predicted class indices using argmax.
    predictions = np.argmax(logits, axis=-1)
    # Compute and return the accuracy metric by comparing predictions to true labels.
    return metric.compute(predictions=predictions, references=labels)

Downloading builder script: 0.00B [00:00, ?B/s]

In [6]:
from transformers import TrainingArguments, Trainer
import os

SAVE_DIR = os.path.join("artifacts")  # Directory to save model checkpoints and logs

# Define training arguments for the Trainer API.
# These arguments control the training/evaluation process, logging, and output.
training_args = TrainingArguments(
    eval_strategy="epoch",      # Evaluate the model at the end of each epoch.
    max_steps=10,               # Stop after 10 optimizer steps (overrides num_train_epochs when positive).
    eval_steps=10,              # Evaluation interval in steps; only used when eval_strategy="steps".
    learning_rate=2e-5,         # Set the learning rate for the optimizer.
    num_train_epochs=2,         # Requested epochs; superseded by max_steps above.
    output_dir=SAVE_DIR,        # Directory to save model checkpoints and logs.
    report_to="none",           # Disable reporting to external logging tools (e.g., WandB, TensorBoard).
)


### Start Training 🚦🤖
This section initiates the fine-tuning process for the BERT sequence classification model on the Yelp Review Full dataset. 🚦🤖

- The Hugging Face `Trainer` is configured with:
    - The pre-trained BERT model (`model`) 🧠
    - Training arguments (`training_args`) that control epochs, learning rate, evaluation strategy, and output directory ⚙️📊
    - The full training dataset (`dataset["train"]`) and test dataset (`dataset["test"]`) 📚
    - The `compute_metrics` function for evaluating accuracy after each epoch 📏

**Training Workflow:**
1. The model is fine-tuned on batches drawn from the 650,000-review training split, learning to predict star ratings from review text. With `max_steps=10`, only the first 10 batches are used in this run. ⭐📝
2. Evaluation is performed on 50,000 test reviews at the end of each epoch to monitor accuracy. 🕵️‍♂️📈
3. Model checkpoints and logs are saved to the `artifacts` directory (`SAVE_DIR`) for reproducibility. 💾🗂️
4. GPU acceleration is used if available, speeding up training. 🚀🖥️

This step adapts BERT’s language understanding to the specific task of Yelp review sentiment classification, enabling accurate predictions of user ratings from text. 🌟

#### Callbacks: Customizing Training Behavior 🛎️

**Callbacks** are special hooks that allow you to inject custom logic at key points during training (e.g., after each epoch, step, or evaluation). In Hugging Face's Trainer, callbacks can be used for:

- Early stopping to prevent overfitting
- Custom logging or metric tracking
- Saving model checkpoints based on custom criteria
- Modifying training schedules or parameters dynamically

To use callbacks, simply pass a list of callback objects to the `Trainer` via the `callbacks` argument. This makes your training workflow more flexible and responsive to specific needs.
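
Early stopping is the most common use of callbacks, and its core is a simple patience rule. The sketch below shows that rule without any framework: it is not the library's `EarlyStoppingCallback`, only an illustration of the decision such a callback makes after each evaluation. The class `PatienceTracker` is hypothetical and the accuracy values are made up.

```python
class PatienceTracker:
    """Signal a stop after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = None
        self.bad_evals = 0

    def should_stop(self, metric):
        """Record a new metric value (higher is better); return True to stop."""
        if self.best is None or metric > self.best:
            self.best = metric      # new best score: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1     # no improvement at this evaluation
        return self.bad_evals >= self.patience

tracker = PatienceTracker(patience=2)
history = [0.21, 0.25, 0.24, 0.25, 0.23]  # made-up accuracies per evaluation
decisions = [tracker.should_stop(acc) for acc in history]
print(decisions)
```

A tie with the best score counts as no improvement here; whether that is desirable depends on how noisy the metric is.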



In [7]:
# Initialize the Hugging Face Trainer for fine-tuning the BERT model on Yelp Review Full

trainer = Trainer(
    model=model,                # The BERT sequence classification model to be trained
    args=training_args,         # Training arguments (epochs, learning rate, output dir, etc.)
    train_dataset=dataset["train"],  # Full training dataset (650,000 Yelp reviews)
    eval_dataset=dataset["test"],    # Full test dataset (50,000 Yelp reviews) for evaluation
    compute_metrics=compute_metrics, # Function to compute accuracy after each evaluation
)
# Start the training process. This will:
# - Fine-tune the BERT model on the training data
# - Evaluate accuracy on the test set at the end of each epoch
# - Save model checkpoints and logs to the specified output directory
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
0,No log,1.624714,0.21952


TrainOutput(global_step=10, training_loss=1.541056251525879, metrics={'train_runtime': 1511.1006, 'train_samples_per_second': 0.053, 'train_steps_per_second': 0.007, 'total_flos': 21049451397120.0, 'train_loss': 1.541056251525879, 'epoch': 0.00012307692307692307})

## Evaluation 📊🎉

Model evaluation measures how well the fine-tuned BERT classifier predicts Yelp review ratings on unseen data. This section covers:

- **Evaluation Dataset** 🗂️🔍  
    The model is evaluated on the `test` split of the Yelp Review Full dataset, containing 50,000 reviews. For quick inspection, a subset of 100 samples (`eval`) and a further sample of 10 reviews (`sample_eval`) are used for prediction and analysis.

- **Accuracy Metric** ✅📏  
    The `accuracy` metric from the `evaluate` library is used to quantify performance. Accuracy is the proportion of correctly predicted labels out of all predictions.

- **Prediction Workflow** 🤖➡️🔢  
    - Inputs are built from the columns produced earlier by the BERT tokenizer (`input_ids`, `token_type_ids`, `attention_mask`) and converted to tensors.
    - The model runs with gradient tracking disabled (`torch.no_grad()`), outputting logits for each class.
    - Predicted labels are obtained by selecting the class with the highest logit for each review.
    - Predictions are compared to true labels to compute accuracy.

- **Results Interpretation** 🏆🔬  
    - The accuracy score provides a direct measure of how well the model generalizes to new reviews.
    - Sample predictions and true labels can be printed for qualitative inspection.
    - Evaluation helps identify strengths and weaknesses, guiding further fine-tuning or error analysis.
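
The prediction workflow above reduces to two small operations: an argmax over the logits and a comparison with the true labels. The sketch below performs both in plain Python and adds a chance-level reference point for a five-class problem, assuming roughly balanced classes. The logits and labels are made up, and the helpers `argmax` and `accuracy` are hypothetical stand-ins for the `np.argmax` and `evaluate` calls used in the notebook.

```python
import math

def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return correct / len(references)

# Made-up logits for three reviews and five classes.
logits = [
    [0.1, 0.2, 1.5, 0.0, 0.3],
    [0.0, 0.1, 0.2, 0.3, 2.0],
    [0.4, 0.1, 0.0, 0.2, 0.9],
]
references = [2, 4, 1]

predictions = [argmax(row) for row in logits]
acc = accuracy(predictions, references)

# Reference points for a model that guesses uniformly among 5 classes.
chance_accuracy = 1 / 5
chance_loss = math.log(5)

print(predictions, round(acc, 3))
print(chance_accuracy, round(chance_loss, 3))
```

An accuracy near 0.2 and a cross-entropy loss near 1.61 indicate a model that has not yet learned much beyond guessing.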

**Example Evaluation Code:** 🚀


In [8]:
import torch

# Set the device to GPU ("cuda") if available, otherwise use CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Select the first 10 samples from the evaluation dataset for prediction.
sample_eval = eval.select(range(10))

# Prepare input features for the model:
# - For each required input (input_ids, token_type_ids, attention_mask),
#   convert the list of values to a PyTorch tensor and move it to the selected device (GPU/CPU).
inputs = {k: torch.tensor(sample_eval[k]).to(device) for k in tokenizer.model_input_names}

# Disable gradient calculation since we're only doing inference (not training).
with torch.no_grad():
    # Pass the inputs through the model to get output logits (raw scores for each class).
    outputs = model(**inputs)
    # For each sample, select the class with the highest logit as the predicted label.
    predictions = outputs.logits.argmax(dim=-1).cpu().numpy()

# Print the predicted class indices for the 10 samples.
print("Predictions:", predictions)
# Print the true labels for these samples for comparison.
print("True labels:", sample_eval["label"])

Predictions: [2 4 4 2 4 4 4 2 4 4]
True labels: Column([2, 4, 1, 4, 3])


## GPT Architecture Instead of BERT 🤖✨

While BERT is a bidirectional transformer model designed for deep language understanding, GPT (Generative Pre-trained Transformer) uses a unidirectional (left-to-right) transformer architecture. Here’s how GPT differs and how it can be applied to sequence classification tasks like Yelp review sentiment analysis:

- **Architecture** 🏗️  
    - **GPT** is based on the transformer decoder stack, processing text in a left-to-right fashion.  
    - Unlike BERT, GPT does not use token type embeddings or segment embeddings, focusing solely on predicting the next token in a sequence.
    - GPT’s attention mechanism only attends to previous tokens, making it naturally suited for generative tasks but also effective for classification when fine-tuned.

- **Pre-training Objective** 🎯  
    - GPT is pre-trained using a language modeling objective (predicting the next word), while BERT uses masked language modeling and next sentence prediction.
    - This difference means GPT learns strong generative capabilities, while BERT excels at understanding context from both directions.

- **Fine-tuning for Classification** 🏷️  
    - For sentiment classification, GPT can be adapted by adding a classification head on top of the final hidden state of the last token.
    - The model is fine-tuned on labeled review data, learning to map review text to star ratings (1-5).

- **Workflow in This Notebook** 📚  
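    - The GPT tokenizer and model are loaded from Hugging Face’s model hub (`openai-gpt`).
    - The same `Trainer`-based pipeline is reused. As written, however, `gpt_trainer` receives the `dataset` that was tokenized with the BERT tokenizer, and the batch size is reduced to 1, so the two runs are not a like-for-like comparison. A fair comparison would re-tokenize the text with `gpt_tokenizer`.
    - Accuracy is computed using the same metric, allowing the scores of BERT and GPT on Yelp review classification to be placed side by side.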

- **Key Differences in Usage** 🔑  
    - GPT does not require `token_type_ids` for input.
    - Tokenization and padding are handled with the GPT tokenizer, which may have different vocabulary and max length settings compared to BERT.

By fine-tuning GPT for sequence classification, we can explore how generative transformer models perform on sentiment analysis tasks and compare their strengths and weaknesses to BERT-based approaches. This provides valuable insights into model selection for NLP applications. 🚀📊
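
In a left-to-right model, only the last real token has seen the whole review, which is why the classification head reads the hidden state at that position. With right-padded batches, finding that position requires knowing the padding token ID. The sketch below illustrates the idea only; it is not the library's implementation, the helper `last_token_index` is hypothetical, and the token IDs are made up.

```python
def last_token_index(input_ids, pad_token_id):
    """Index of the last non-padding token in a right-padded sequence."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_token_id:
            return i
    raise ValueError("sequence contains only padding")

PAD = 0  # assumed padding id for this illustration
batch = [
    [7, 12, 5, PAD, PAD],  # three real tokens, two padding tokens
    [9, 3, 8, 4, 6],       # no padding
]
indices = [last_token_index(seq, PAD) for seq in batch]
print(indices)
```

If the padding ID configured on the model does not match the one used by the tokenizer, the head may read the state of a padding position instead of the last word of the review.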


In [9]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch
import os
from datasets import load_dataset

SAVE_DIR = os.path.join("artifacts")  # Directory to save model checkpoints and logs

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the GPT tokenizer and model for sequence classification.
# - "openai-gpt" specifies the GPT model checkpoint from Hugging Face's model hub.
# - num_labels=5 sets the output layer to predict 5 classes (matching Yelp star ratings: 1-5).
# The model is moved to the selected device (GPU or CPU) for training/inference.
gpt_tokenizer = AutoTokenizer.from_pretrained("openai-gpt")



gpt_model = AutoModelForSequenceClassification.from_pretrained("openai-gpt", num_labels=5).to(device)
training_args = TrainingArguments(
    eval_strategy="epoch",      # Evaluate the model at the end of each epoch.
    max_steps=10,               # Stop after 10 optimizer steps (overrides num_train_epochs when positive).
    eval_steps=10,              # Evaluation interval in steps; only used when eval_strategy="steps".
    learning_rate=2e-5,         # Set the learning rate for the optimizer.
    num_train_epochs=2,         # Requested epochs; superseded by max_steps above.
    output_dir=SAVE_DIR,        # Directory to save model checkpoints and logs.
    report_to="none",           # Disable reporting to external logging tools (e.g., WandB, TensorBoard).
    per_device_eval_batch_size=1,
    per_device_train_batch_size=1,  # Set batch size for training and evaluation.
)

# Initialize a Trainer for fine-tuning the GPT model on the Yelp Review Full dataset.
# - Uses the same training arguments as the BERT run, except for a batch size of 1.
# - Trains on the full training dataset and evaluates on the full test dataset.
#   Note: `dataset` was tokenized with the BERT tokenizer above, not with `gpt_tokenizer`.
# - Computes accuracy after each evaluation using the provided metric function.
gpt_trainer = Trainer(
    model=gpt_model,                  # GPT sequence classification model
    args=training_args,               # Training arguments (epochs, learning rate, etc.)
    train_dataset=dataset["train"],   # Full training dataset (Yelp reviews)
    eval_dataset=dataset["test"],     # Full test dataset for evaluation
    compute_metrics=compute_metrics,  # Function to compute accuracy
)
# Start the training process for the GPT model.
gpt_trainer.train()


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/656 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/816k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/458k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.27M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/479M [00:00<?, ?B/s]

Some weights of OpenAIGPTForSequenceClassification were not initialized from the model checkpoint at openai-gpt and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
0,No log,1.796783,0.2081


TrainOutput(global_step=10, training_loss=1.8305299758911133, metrics={'train_runtime': 2061.5721, 'train_samples_per_second': 0.005, 'train_steps_per_second': 0.005, 'total_flos': 2612991098880.0, 'train_loss': 1.8305299758911133, 'epoch': 1.5384615384615384e-05})

In [10]:
# Prepare 10 evaluation samples for the GPT model.
# - The GPT tokenizer needs a padding token to be defined before it can pad.
# - The tokenizer call below converts the review text into PyTorch tensors
#   (input_ids and attention_mask), padded/truncated to the model's maximum length.
gpt_tokenizer.pad_token = gpt_tokenizer.eos_token
gpt_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# Note: this takes the pad id from the BERT `tokenizer` (0), not from `gpt_tokenizer`.
gpt_model.config.pad_token_id = tokenizer.pad_token_id

# Resizing the embeddings would be needed before feeding the new '[PAD]' id to the model.
# gpt_model.resize_token_embeddings(len(gpt_tokenizer))
sample_eval = eval.select(range(10))
sample_eval = eval.select(range(10))

gpt_sample_eval = gpt_tokenizer(
    list(sample_eval["text"]),
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)
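# Build the model inputs and move them to the selected device (GPU or CPU).
# Note: the active line reads the BERT-tokenized columns of `sample_eval`;
# the commented-out line would use the GPT-tokenized `gpt_sample_eval` instead.
# gpt_inputs = {k: v.to(device) for k, v in gpt_sample_eval.items()}
gpt_inputs = {k: torch.tensor(sample_eval[k]).to(device) for k in gpt_tokenizer.model_input_names}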

# Run inference with the fine-tuned GPT model on the evaluation samples.
# - Disables gradient calculation for efficiency (since we're not training).
# - Passes the inputs through the model to get output logits (raw scores for each class).
# - Selects the class with the highest logit as the predicted label for each sample.
with torch.no_grad():
    gpt_outputs = gpt_model(**gpt_inputs)
    gpt_predictions = gpt_outputs.logits.argmax(dim=-1).cpu().numpy()

# Print the predicted class indices for the 10 evaluation samples.
print("GPT Predictions:", gpt_predictions)
# Print the true labels for these samples for comparison.
print("True labels:", sample_eval["label"])

GPT Predictions: [4 4 4 4 4 4 4 4 4 4]
True labels: Column([2, 4, 1, 4, 3])
