# Summary for Transformer-Based Emotion Classification

The pipeline implemented in this notebook consists of the following stages:

    Environment Setup
       - Configures essential Python libraries, Hugging Face Transformers, and NLP resources (e.g., NLTK).
       - Establishes computational environment with GPU/CPU allocation for optimized training performance.
       
    Data Preparation
       - Loads and cleans the raw dataset, removing duplicates and handling missing values.
       - Downsamples the dataset while preserving the original class proportions, and optionally augments minority classes with GPT-2-based text generation.
       - Applies preprocessing (e.g., lemmatization, stopword removal) and stratified splitting into training, validation, and test sets.
       
    Model Fine-Tuning
       - Initializes BERT, XLNet, and GPT-2 models with appropriate tokenizers and padding strategies.
       - Uses Optuna for automated hyperparameter tuning (learning rate, weight decay, warmup ratio).
       - Trains each model on the training set while monitoring validation performance with early stopping.

    Model Evaluation
       - Evaluates trained models on a held-out test set using accuracy and macro-averaged F1 score.
       - Generates detailed classification reports and visualizes confusion matrices for interpretability.

    Reporting and Visualization
       - Saves evaluation results, training time, and the best hyperparameters to structured CSV and JSON files.
       - Produces comparative bar plots of F1 scores across models for clear visual assessment.
       - Outputs comprehensive logs and formatted summaries suitable for thesis documentation and presentations.
       - This structured flow ensures a reproducible, transparent, and empirically sound methodology for benchmarking transformer models in multi-class emotion classification.

# Environment Setup and Imports

In [None]:
# Install required libraries
# pip install --upgrade transformers datasets accelerate optuna

## Hugging Face

In [None]:
from huggingface_hub import login
import os
# Authenticate with Hugging Face Hub
login(token="hf_Nj******")

os.environ['HF_TOKEN']="hf_Nj******"
os.environ['HUGGINGFACEHUB_API_TOKEN']="hf_Nj******"

## Library Imports 

### Data manipulation and preprocessing

In [None]:
import pandas as pd                    # For handling dataframes and reading/writing CSVs
import re                              # Text cleaning with regular expressions
import nltk                            # Text preparation with the Natural Language Toolkit
from nltk.corpus import stopwords       # Common stopwords for filtering
from nltk.stem import WordNetLemmatizer  # Lemmatization to normalize words

### Data splitting and evaluation metrics

In [None]:
from sklearn.model_selection import train_test_split                        # Stratified train/val/test splitting
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score  # Evaluation metrics
from sklearn.utils.class_weight import compute_class_weight              # Compute class weights to address imbalance

### Hugging Face Transformers for tokenization, modeling, and training

In [None]:
from transformers import (
    BertTokenizer, BertForSequenceClassification,         # BERT model and tokenizer for sequence classification
    XLNetTokenizer, XLNetForSequenceClassification,       # XLNet model and tokenizer
    GPT2Tokenizer, GPT2ForSequenceClassification,         # GPT-2 model adapted for classification
    Trainer, DataCollatorWithPadding,                     # Training interface and dynamic padding
    AutoTokenizer, AutoModelForCausalLM,                  # For generative text augmentation using GPT-2
    pipeline, get_scheduler,                              # Utilities for text generation and learning rate scheduling
    TrainingArguments,                                    # Training configuration container
    EarlyStoppingCallback                                 # Callback for early stopping during training
)

### Deep learning framework

In [None]:
import torch  # PyTorch for tensor operations and GPU acceleration
import numpy as np  # Numerical operations

### Visualization

In [None]:
import matplotlib.pyplot as plt  # For plotting performance metrics and confusion matrices
import seaborn as sns            # Enhanced data visualization (used for confusion matrix heatmaps)

### Hyperparameter tuning

In [None]:

import optuna   # Framework for automated hyperparameter optimization

### Utility modules

In [None]:

import time  # Runtime profiling
import json  # Exporting results and configurations to JSON files

### Hardware Configuration 

In [None]:
# Configure CUDA device (for multi-GPU environments, "0" specifies the first GPU)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Dynamically select device: use GPU if available, else fallback to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# In case a GPU is not available, configure PyTorch to use a high number of CPU threads for performance
if not torch.cuda.is_available():
    print("Using CPU with 96 threads")  # This assumes a high-core-count CPU (e.g., in HPC environments)
    torch.set_num_threads(96)
else:
    # Print the name of the available GPU
    print("We have a GPU:", torch.cuda.get_device_name(0))


### Natural Language Toolkit (NLTK) Resources

In [None]:
# Download NLTK stopwords list for text cleaning
nltk.download('stopwords')

# Download WordNet corpus for lemmatization
nltk.download('wordnet')

# Preprocessing and Data Augmentation

Objective :-

    The helper functions in this section support the rest of the pipeline:
    - convert_ndarray: converts NumPy arrays to Python lists so that evaluation results (e.g., a confusion matrix) can be exported to JSON, which does not support NumPy arrays directly.
    - get_label_info: returns the class names and the number of classes, which is needed to map integer label codes back to their string labels.
    - preprocess_text and augment_minority_classes (defined further below): clean the input text before tokenization and generate additional samples for underrepresented classes.

In [None]:
# Converts a NumPy array to a Python list for JSON serialization.
def convert_ndarray(obj):
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")
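
As a usage sketch, `json.dumps` accepts a `default=` callable that it invokes for any object it cannot serialize on its own, which is the hook `convert_ndarray` is written for. To keep the example free of third-party dependencies, the standard-library `array.array` stands in for `np.ndarray` here (both expose `.tolist()`). The `to_serializable` helper and the `results` dictionary are illustrative assumptions, not part of the pipeline.

```python
import json
from array import array

def to_serializable(obj):
    """Fallback for json.dumps: turn array-like objects into plain lists."""
    # array.array is only a stand-in for np.ndarray in this sketch
    if isinstance(obj, array):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Hypothetical result record containing one field json cannot handle natively
results = {"model": "demo", "cm_row": array("i", [3, 1, 0])}

# json.dumps calls to_serializable only for values it cannot encode itself
encoded = json.dumps(results, default=to_serializable)
decoded = json.loads(encoded)
print(encoded)
```

With the real helper, the call would presumably look like `json.dumps(eval_results, default=convert_ndarray)`.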

In [None]:
# Extracts label information from the dataset
def get_label_info(df, label_column='label'):
    label_names = df[label_column].astype('category').cat.categories.tolist()
    num_labels = len(label_names)
    return label_names, num_labels

## Preprocesses text data

Objective :-

    Cleans and preprocesses raw text for NLP tasks:
    - Removes URLs, mentions, hashtags, and all non-alphabetic characters, then lowercases the text.
    - Splits the text into tokens, removes English stopwords, and lemmatizes the remaining tokens.

In [None]:
def preprocess_text(text):
    """Preprocesses text data."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)
    text = re.sub(r'#[A-Za-z0-9_]+', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower()
    tokens = text.split()
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
    return ' '.join(lemmatized_tokens)
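
To make the effect of the cleaning steps concrete, the sketch below applies the same regular expressions to a sample sentence. It deliberately skips NLTK: the tiny `DEMO_STOPWORDS` set and the `clean_text_demo` name are assumptions made for illustration, and lemmatization is omitted, so the output only approximates what `preprocess_text` would return.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
DEMO_STOPWORDS = {"i", "am", "so", "the", "at", "a"}

def clean_text_demo(text):
    text = re.sub(r'http\S+', '', text)          # drop URLs
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)   # drop @mentions
    text = re.sub(r'#[A-Za-z0-9_]+', '', text)   # drop #hashtags
    text = re.sub(r'[^a-zA-Z\s]', '', text)      # keep letters and whitespace only
    tokens = text.lower().split()
    return ' '.join(t for t in tokens if t not in DEMO_STOPWORDS)

sample = "I am SO happy at the concert!!! @friend #blessed https://t.co/xyz"
cleaned = clean_text_demo(sample)
print(cleaned)  # happy concert
```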

## augment_minority_classes

Objective :-

  This function is essential for addressing class imbalance in supervised emotion classification by generating synthetic samples for underrepresented classes using GPT-style models. This enhances model generalization and improves performance on minority labels.

Augments minority classes using a causal language model (e.g., DistilGPT-2).

    Parameters:
    - df (DataFrame): Original dataset.
    - text_column (str): Column containing input text data.
    - label_column (str): Column containing emotion labels.
    - num_augmentations (int): Number of augmentation prompts per sample.
    - output_csv_path (str): File path to save/load the augmented dataset.
    - model_name (str): Name of the Hugging Face model used for generation.
    - device (str): 'cuda' or 'cpu'.
    - max_new_tokens (int): Maximum tokens to generate per sample.
    - batch_size (int): Number of prompts processed per batch.

    Returns:
    - DataFrame: Original data combined with generated synthetic samples.
    
   

In [None]:
def augment_minority_classes(df, text_column='text', label_column='label', num_augmentations=1,
                             output_csv_path='augmented_dataset.csv', model_name='distilgpt2',
                             device='cuda', max_new_tokens=80, batch_size=32):
    if os.path.exists(output_csv_path):
        print(f"Loading existing augmented dataset from: {output_csv_path}")
        return pd.read_csv(output_csv_path)

    print("Generating new augmented dataset (batched)...")
    label_counts = df[label_column].value_counts()
    max_count = label_counts.max()
    augmented_rows = []

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = 'left'

    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    #tokenizer.pad_token = tokenizer.eos_token

    for label, count in label_counts.items():
        if count >= max_count:
            continue

        minority_df = df[df[label_column] == label]
        required = max_count - count
        per_sample = max(1, required // len(minority_df))

        prompts = []
        for _, row in minority_df.iterrows():
            for _ in range(min(num_augmentations, per_sample)):
                prompt = f"{row[text_column]} Emotion: {label}."
                prompts.append(prompt)

        for i in range(0, len(prompts), batch_size):
            batch_prompts = prompts[i:i + batch_size]
            inputs = tokenizer(batch_prompts, return_tensors='pt', padding=True, truncation=True).to(device)
            with torch.no_grad():
                outputs = model.generate(
                    **inputs,
                    #max_length=max_length,
                    max_new_tokens=max_new_tokens,
                    do_sample=True,
                    top_k=50,
                    top_p=0.95,
                    num_return_sequences=1,
                    pad_token_id=tokenizer.eos_token_id
                )
            decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

            for prompt, gen_text in zip(batch_prompts, decoded):
                continuation = gen_text.replace(prompt, "").strip()
                augmented_rows.append({text_column: continuation, label_column: label})

    augmented_df = pd.concat([df, pd.DataFrame(augmented_rows)], ignore_index=True)
    augmented_df.to_csv(output_csv_path, index=False)
    print(f"Saved augmented dataset to {output_csv_path}")
    return augmented_df
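
The number of prompts built per class follows from the `required`, `per_sample`, and `num_augmentations` values in the function above. The sketch below reproduces only that arithmetic, without loading a model. The `augmentation_plan` helper and the sample counts are hypothetical.

```python
def augmentation_plan(label_counts, num_augmentations=1):
    """Return how many generation prompts would be built for each minority label."""
    max_count = max(label_counts.values())
    plan = {}
    for label, count in label_counts.items():
        if count >= max_count:
            continue  # majority class: nothing to generate
        required = max_count - count            # samples missing to reach the majority size
        per_sample = max(1, required // count)  # prompts needed per original row
        # each original row contributes min(num_augmentations, per_sample) prompts
        plan[label] = count * min(num_augmentations, per_sample)
    return plan

# Hypothetical class counts
counts = {"joy": 100, "anger": 40, "fear": 10}
plan = augmentation_plan(counts, num_augmentations=2)
print(plan)  # {'anger': 40, 'fear': 20}
```

In this example "fear" grows from 10 to 30 rows rather than to 100, so a small `num_augmentations` narrows the imbalance without removing it.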

# Dataset Splitting

Objective :-

    The text is cleaned and preprocessed before tokenization.
    The dataset is stratified, maintaining proportional label distributions across training, validation, and test sets.
    Augmentation is integrated as an optional preprocessing step to address class imbalance.

Divides the dataset into training, validation, and test sets and preprocesses the text.

    Parameters:
    - df : The initial data set containing text and labels.
    - text_column : Column name containing textual input data.
    - label_column : Column name containing emotion labels.
    - train_size : Proportion of data used for training.
    - val_size : Proportion of data used for validation.
    - test_size : Proportion of data used for testing.
    - augment : Whether to perform augmentation on minority classes.
    - num_augmentations : Number of synthetic samples per instance if augmentation is applied.
    - random_state : Random seed for reproducibility.


   

In [None]:

def create_datasets(df, text_column='text', label_column='label', train_size=0.7, val_size=0.15, test_size=0.15, augment=False, num_augmentations=2, random_state=42):
    
    if augment:
        # Pass the globally selected device so augmentation also works on CPU-only machines
        df = augment_minority_classes(df, text_column, label_column, num_augmentations, device=device)

    X = df[text_column].apply(preprocess_text).tolist()
    y = df[label_column].astype('category').cat.codes.tolist()
    label_names = df[label_column].astype('category').cat.categories.tolist()
    #num_labels = len(label_names)

    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=1 - train_size, random_state=random_state, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=test_size / (1 - train_size), random_state=random_state, stratify=y_temp)

    return X_train, X_val, X_test, y_train, y_val, y_test, label_names
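
The two-stage split first holds out `1 - train_size` of the data and then divides that remainder between validation and test. The sketch below works through the fractions for the default arguments. The `split_fractions` helper is hypothetical. Note that, as the function is written, `val_size` does not enter the computation: the validation set is whatever remains after the test share is taken.

```python
import math

def split_fractions(train_size=0.7, test_size=0.15):
    """Fractions of the full dataset produced by the two train_test_split calls."""
    temp_fraction = 1 - train_size                   # held out by the first split
    second_test_size = test_size / (1 - train_size)  # test_size passed to the second split
    test_fraction = temp_fraction * second_test_size
    val_fraction = temp_fraction - test_fraction     # remainder becomes validation
    return temp_fraction, second_test_size, val_fraction, test_fraction

temp, second, val, test = split_fractions()
print(f"held out: {temp:.2f}, second split test_size: {second:.2f}, "
      f"val: {val:.2f}, test: {test:.2f}")
```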

# Tokenization and Dataset Wrapping

Objective :-

    Facilitates compatibility with the Hugging Face Trainer API by structuring the data in a format compliant with PyTorch’s Dataset class. This approach ensures that text inputs are efficiently tokenized and converted into tensors, while corresponding labels are integrated to support supervised learning workflows during model fine-tuning.

In [None]:
#  Custom dataset class for emotion classification tasks.
# Wraps tokenized inputs and labels into a PyTorch-compatible Dataset structure for use with Hugging Face's Trainer API.

class EmotionDataset(torch.utils.data.Dataset):
    """Custom dataset for emotion data."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

## tokenize_data

Tokenizes and encodes raw text data for use with Transformer models.

    Parameters:
    - tokenizer: Hugging Face tokenizer corresponding to the model (e.g., BERT, XLNet, GPT-2).
    - X_train, X_val, X_test: Preprocessed textual data splits.
    - y_train, y_val, y_test: Integer-encoded label splits.
    - max_length: Maximum sequence length after padding/truncation.

In [None]:
def tokenize_data(tokenizer, X_train, X_val, X_test, y_train, y_val, y_test, max_length=128):
    """Tokenizes and encodes text data."""
    # Ensure tokenizer has pad token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token if hasattr(tokenizer, "eos_token") else '[PAD]'
        tokenizer.pad_token_id = tokenizer.eos_token_id if hasattr(tokenizer, "eos_token_id") else 0

    tokenizer.padding_side = 'right'  # Ensures consistency

    train_encodings = tokenizer(X_train, truncation=True, padding='max_length', max_length=max_length, return_tensors='pt')
    val_encodings = tokenizer(X_val, truncation=True, padding='max_length', max_length=max_length, return_tensors='pt')
    test_encodings = tokenizer(X_test, truncation=True, padding='max_length', max_length=max_length, return_tensors='pt')
    # Add labels to encodings
    train_encodings['labels'] = y_train
    val_encodings['labels'] = y_val
    test_encodings['labels'] = y_test
    return train_encodings, val_encodings, test_encodings
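
To illustrate what `truncation=True` with `padding='max_length'` does to each sequence, the sketch below pads or cuts a list of token IDs to a fixed length and builds the matching attention mask. It is a simplified stand-in, not the tokenizer's implementation: real tokenizers also insert special tokens and account for them when truncating. The `pad_or_truncate` name and the sample IDs are assumptions.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])      # truncation: keep the first max_length tokens
    mask = [1] * len(ids)                   # 1 marks real tokens
    n_pad = max_length - len(ids)
    # right padding, matching tokenizer.padding_side = 'right'
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)
print(short_ids, short_mask)  # [5, 6, 7, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4] [1, 1, 1, 1]
```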

# Trainer Configuration and Training

Objective :-

        This setup enables robust and fair model training, particularly in the presence of class imbalance. It incorporates:
        Custom weighted loss to mitigate label skew.
        Early stopping to prevent overfitting.
        F1-macro optimization, suitable for multi-class emotion classification.
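
The early-stopping rule can be sketched as a patience counter over the per-epoch validation scores. This is a simplified illustration under stated assumptions, not the `EarlyStoppingCallback` source: it ignores any improvement threshold and checkpoint handling, and the `epochs_until_stop` helper and the score list are hypothetical.

```python
def epochs_until_stop(scores, patience=2):
    """Return (last_epoch_run, best_score) for a higher-is-better metric."""
    best = None
    bad_evals = 0
    for epoch, score in enumerate(scores, start=1):
        if best is None or score > best:
            best = score       # improvement: remember it and reset the counter
            bad_evals = 0
        else:
            bad_evals += 1     # no improvement at this evaluation
            if bad_evals >= patience:
                return epoch, best
    return len(scores), best

# Hypothetical macro-F1 values, one per epoch
stopped_at, best_f1 = epochs_until_stop([0.61, 0.68, 0.67, 0.66, 0.70], patience=2)
print(stopped_at, best_f1)  # 4 0.68
```

In this example training stops after epoch 4, so the later score of 0.70 is never reached. That is the usual trade-off of a short patience.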

## create_trainer

Configures and returns a Hugging Face `Trainer` instance tailored for emotion classification.

    Features:
    - Handles class imbalance using weighted loss.
    - Applies learning rate scheduling and early stopping.
    - Uses DataCollatorWithPadding to collate batches (inputs are padded to a fixed length of 128 tokens).

    Parameters:
    - model: Pre-trained transformer model (BERT, XLNet, or GPT-2).
    - tokenizer: Tokenizer corresponding to the selected model.
    - train_encodings, val_encodings: Tokenized datasets.
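    - y_train: List of training labels (used for class weight calculation).
    - batch_size: Per-device batch size for training and evaluation.
    - epochs: Maximum number of training epochs.
    - weight_decay: Weight decay coefficient for regularization.
    - learning_rate: Peak learning rate of the optimizer.
    - warmup_ratio: Fraction of training steps used for learning rate warm-up.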
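
To show how `warmup_ratio` and `learning_rate` interact, the sketch below computes the learning rate at a given step, assuming a linear warm-up followed by linear decay to zero. Both the schedule shape and the rounding of the warm-up step count are assumptions of this sketch, so the configured scheduler should be checked before relying on the exact values. The `lr_at_step` helper is hypothetical.

```python
import math

def lr_at_step(step, total_steps, base_lr=5e-5, warmup_ratio=0.06):
    """Learning rate under linear warm-up and linear decay (assumed schedule)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # ramp up from 0 to base_lr during warm-up
        return base_lr * step / max(1, warmup_steps)
    # decay linearly from base_lr to 0 over the remaining steps
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)

for s in (0, 5, 10, 55, 100):
    print(s, lr_at_step(s, total_steps=100, base_lr=5e-5, warmup_ratio=0.1))
```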


In [None]:
def create_trainer(model, tokenizer, train_encodings, val_encodings, y_train, batch_size=16, epochs=3, weight_decay=0.01, learning_rate=5e-5, warmup_ratio=0.06):
    """Creates and configures a Trainer with class weights and learning rate scheduler."""
    train_dataset = EmotionDataset(train_encodings, train_encodings.pop('labels'))
    val_dataset = EmotionDataset(val_encodings, val_encodings.pop('labels'))
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding='max_length', max_length=128)


    class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
    class_weights = torch.tensor(class_weights, dtype=torch.float).to(device)

    
    def compute_weighted_loss(model, inputs, return_outputs=False, **kwargs):
        """Cross-entropy loss weighted by the balanced class weights computed above."""
        labels = inputs.get("labels")
        outputs = model(**inputs, return_dict=True)
        logits = outputs.get("logits")
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

    # Define training arguments
    training_args = TrainingArguments(
        output_dir='./results',                        # Directory where checkpoints are saved
        num_train_epochs=epochs,                       # Number of training epochs
        per_device_train_batch_size=batch_size,        # Training batch size per device
        per_device_eval_batch_size=batch_size,         # Evaluation batch size per device
        warmup_ratio=warmup_ratio,                     # Fraction of training steps used for LR warm-up
        weight_decay=weight_decay,                     # Regularization to prevent overfitting
        learning_rate=learning_rate,                   # Optimizer learning rate
        logging_dir='./logs',                          # Directory to store logs
        logging_steps=10,                              # Log every 10 steps
        eval_strategy='epoch',                         # Run evaluation at the end of every epoch
        save_strategy='epoch',                         # Save a checkpoint at the end of every epoch
        # save_steps / eval_steps are omitted: they only apply when the strategies are 'steps'
        load_best_model_at_end=True,                   # Reload the best checkpoint when training ends
        metric_for_best_model='eval_f1_macro',         # Metric used to select the best checkpoint
        greater_is_better=True,                        # A higher F1 score is better
        fp16=torch.cuda.is_available()                 # Mixed precision only when a CUDA GPU is available
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        #compute_loss=compute_weighted_loss,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
    )
    trainer.compute_loss = compute_weighted_loss
    return trainer
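
The `'balanced'` option of `compute_class_weight` assigns each class the weight `n_samples / (n_classes * count_of_class)`, so rarer classes contribute more to the loss. The sketch below reproduces that formula in plain Python on a hypothetical label list. The `balanced_class_weights` helper is an illustration, not the scikit-learn implementation.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights n_samples / (n_classes * count) for each class, in sorted class order."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Hypothetical imbalanced labels: 6 x class 0, 3 x class 1, 1 x class 2
y_demo = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(y_demo)
for cls, w in weights.items():
    print(cls, round(w, 4))
```

In this example the rarest class receives a weight six times larger than the most frequent one (10/3 versus 10/18).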

## compute_metrics

 Computes key evaluation metrics for classification:
 
    - Accuracy
    - Macro-averaged F1 Score

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_score(labels, predictions)
    f1_macro = f1_score(labels, predictions, average='macro')
    return {'accuracy': accuracy, 'f1_macro': f1_macro}
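
Macro-averaged F1 is the unweighted mean of the per-class F1 scores, which is why it is sensitive to performance on minority classes. The sketch below computes it by hand on a small hypothetical example. The `macro_f1` helper is illustrative and treats a class with no true or predicted samples as F1 = 0.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical labels and predictions for three classes
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]
accuracy_demo = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
f1_demo = macro_f1(y_true_demo, y_pred_demo)
print(round(accuracy_demo, 4), round(f1_demo, 4))  # 0.6667 0.6556
```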

# Evaluation and Saving

Objective :-

    This function facilitates a comprehensive performance assessment by generating quantitative metrics and visualizing classification outcomes, thereby enabling both statistical evaluation and diagnostic error analysis. Additionally, it supports reproducibility and deployment by systematically exporting the trained model and tokenizer in a standardized format compatible with the Hugging Face ecosystem.

## evaluate_model

 Evaluates a fine-tuned transformer model on a held-out test set, computes classification metrics, and generates a confusion matrix plot.

    Parameters:
    - model: Trained transformer-based classification model.
    - tokenizer: Tokenizer used for encoding the dataset.
    - test_encodings : Tokenized test data with input tensors.
    - y_test : Ground truth labels for the test set.
    - label_names : Original class names for report readability.
    - model_name : Descriptive model identifier (e.g., 'BERT', 'GPT-2').
    - save_folder : Destination folder for saving visual outputs.

In [None]:
def evaluate_model(model, tokenizer, test_encodings, y_test, label_names, model_name, save_folder):
    
    # Convert encoded test inputs into a PyTorch-compatible dataset
    test_dataset = EmotionDataset(test_encodings, test_encodings.pop('labels'))
    
    # Reuse Hugging Face Trainer for efficient prediction
    trainer = Trainer(model=model, tokenizer=tokenizer)
    predictions = trainer.predict(test_dataset)
    y_pred = np.argmax(predictions.predictions, axis=1)

    accuracy = accuracy_score(y_test, y_pred)
    f1_macro = f1_score(y_test, y_pred, average='macro')
    report = classification_report(y_test, y_pred, target_names=label_names, zero_division=1)
    cm = confusion_matrix(y_test, y_pred)

    print(f"\n{model_name} Results:")
    print(f"  Accuracy: {accuracy}")
    print(f"  F1 Macro: {f1_macro}")
    print(f"  Classification Report:\n{report}")

    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=label_names, yticklabels=label_names)
    plt.title(f'{model_name} Confusion Matrix')
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
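    os.makedirs(save_folder, exist_ok=True)  # Make sure the output folder exists before saving the plot
    plt.savefig(os.path.join(save_folder, f'{model_name}_confusion_matrix.png'))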
    plt.close()

    return {'accuracy': accuracy, 'f1_macro': f1_macro, 'report': report, 'cm': cm}
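
For reading the heatmap, it helps to recall the layout used by `confusion_matrix`: rows are true labels and columns are predicted labels, so the diagonal holds the correct predictions. The sketch below builds such a matrix by hand and derives per-class recall from it. The helper name and the sample labels are hypothetical.

```python
def confusion_matrix_demo(y_true, y_pred, n_classes):
    """cm[i][j] counts samples with true label i that were predicted as j."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]
cm_demo = confusion_matrix_demo(y_true_demo, y_pred_demo, n_classes=3)

# Per-class recall: diagonal entry divided by the row total
recall_demo = [row[i] / sum(row) for i, row in enumerate(cm_demo)]
print(cm_demo)      # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(recall_demo)  # [0.5, 1.0, 0.5]
```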

## save_model

In [None]:
#Saves the fine-tuned model and its tokenizer to disk for reuse or deployment.
def save_model(model, tokenizer, model_save_path):
    """Saves the trained model and tokenizer."""
    model.save_pretrained(model_save_path)
    tokenizer.save_pretrained(model_save_path)
    print(f"\nModel and tokenizer saved to: {model_save_path}")

# Hyperparameter Tuning with Optuna

 Objective :-    
 
     This function enables automated hyperparameter search via Optuna, targeting optimal training performance by maximizing the macro-averaged F1 score. It dynamically evaluates multiple model configurations across learning rate, weight decay, and warmup ratio, thereby enhancing model generalizability and robustness—particularly critical in imbalanced multi-class emotion classification tasks.
 
Defines the objective function that Optuna uses to optimize hyperparameters of transformer-based emotion classification models.

    Parameters:
    - trial: Optuna trial object for suggesting hyperparameters.
    - model_name : Identifier of the model type ('BERT', 'XLNet', 'GPT-2').
    - tokenizer: Tokenizer instance corresponding to the selected model.
    - X_train_encodings : Tokenized training input data.
    - X_val_encodings : Tokenized validation input data.
    - y_train : Encoded training labels.
    - batch_size : Batch size for training.
    - epochs : Number of epochs for model training.
    - num_labels : Number of output classes.


## objective

In [None]:
def objective(trial, model_name, tokenizer, X_train_encodings, X_val_encodings, y_train, batch_size, epochs,num_labels):
    """Objective function for Optuna hyperparameter optimization."""

    # Work on shallow copies: create_trainer pops 'labels', and the originals are reused across trials
    train_encodings_copy = X_train_encodings.copy()
    val_encodings_copy = X_val_encodings.copy()

    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-5, 1e-2, log=True)
    warmup_ratio = trial.suggest_float("warmup_ratio", 0.0, 0.2)

    if model_name == 'BERT':
        model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)
    elif model_name == 'XLNet':
        model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=num_labels)
        model.config.pad_token_id = tokenizer.pad_token_id
    elif model_name == 'GPT-2':
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
        tokenizer.padding_side = 'right'
        model = GPT2ForSequenceClassification.from_pretrained('gpt2', num_labels=num_labels)
        model.config.pad_token_id = tokenizer.pad_token_id

    else:
        raise ValueError(f"Invalid model name: {model_name}")
    
    model.to(device)

    trainer = create_trainer(
        model=model,
        tokenizer=tokenizer,
        train_encodings=train_encodings_copy,
        val_encodings=val_encodings_copy,
        y_train=y_train,
        batch_size=batch_size,
        epochs=epochs,
        weight_decay=weight_decay,
        learning_rate=learning_rate,
        warmup_ratio=warmup_ratio
    )

    trainer.train()
    eval_results = trainer.evaluate()
    return eval_results['eval_f1_macro']  # Optimize for F1 macro
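
The `log=True` flag on the learning rate and weight decay ranges means values are spread evenly across orders of magnitude rather than evenly across the raw interval. The sketch below illustrates that idea with plain random sampling. It does not reproduce how Optuna's samplers choose trial values, and the `sample_log_uniform` helper is hypothetical.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
lrs = [sample_log_uniform(rng, 1e-5, 5e-5) for _ in range(1000)]

# On a log scale the midpoint of the range is the geometric mean, not the arithmetic mean
geometric_mid = math.sqrt(1e-5 * 5e-5)
share_below = sum(lr < geometric_mid for lr in lrs) / len(lrs)
print(f"geometric midpoint: {geometric_mid:.3e}, share below it: {share_below:.2f}")
```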

# Unified Fine-tuning and Evaluation Routine



## fine_tune_and_evaluate

Objective :-


        This function encapsulates the end-to-end training workflow:
        Optionally augments minority classes to reduce class imbalance.
        Uses Optuna to search for well-performing hyperparameters.
        Fine-tunes and evaluates pre-trained transformer models for multi-class emotion detection.
        Saves the trained model and tokenizer for reuse and deployment.

Orchestrates the full workflow for fine-tuning and evaluating a transformer-based emotion classification model.

    Steps included:
    - Data preparation (including optional augmentation)
    - Tokenization
    - Optuna-based hyperparameter tuning
    - Model training with best parameters
    - Evaluation on a held-out test set
    - Saving of trained model and tokenizer

    Parameters:
    - df : Dataset with 'text' and 'label' columns.
    - model_name : One of {'BERT', 'XLNet', 'GPT-2'}.
    - tokenizer: Corresponding tokenizer for the model.
    - model_save_path : Path to save the trained model and tokenizer.
    - augment : Whether to apply data augmentation to balance classes.
    - num_augmentations : Number of augmentations per minority instance.
    - epochs : Number of training epochs.
    - batch_size : Batch size during training and evaluation.
    - n_trials : Number of Optuna trials for hyperparameter optimization.


In [None]:
def fine_tune_and_evaluate(df, model_name, tokenizer, model_save_path, augment=False, num_augmentations=2, epochs=3, batch_size=16, n_trials=10):
    """Fine-tunes and evaluates a given model."""
    if model_name == 'GPT-2':
        tokenizer.pad_token = tokenizer.eos_token

    X_train, X_val, X_test, y_train, y_val, y_test, label_names = create_datasets(df, augment=augment, num_augmentations=num_augmentations)

    X_train_encodings, X_val_encodings, X_test_encodings = tokenize_data(tokenizer, X_train, X_val, X_test, y_train, y_val, y_test)

    # Recover the class names and the number of classes from the dataset
    label_names, num_labels = get_label_info(df, label_column='label')
    print("Label names:", label_names)
    print("Number of classes (num_labels):", num_labels)

    # Optuna optimization
    study = optuna.create_study(direction="maximize")
    study.optimize(lambda trial: objective(trial, model_name, tokenizer, X_train_encodings, X_val_encodings, y_train, batch_size, epochs,num_labels), n_trials=n_trials)

    print(f"\nBest hyperparameters for {model_name}:")
    print(study.best_params)

    # Train with best hyperparameters
    best_model = None
    if model_name == 'BERT':
        best_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)#len(np.unique(y_train))
    elif model_name == 'XLNet':
        best_model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=num_labels)
        best_model.config.pad_token_id = tokenizer.pad_token_id
    elif model_name == 'GPT-2':
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
        tokenizer.padding_side = 'right'

        best_model = GPT2ForSequenceClassification.from_pretrained('gpt2', num_labels=num_labels)
        best_model.config.pad_token_id = tokenizer.pad_token_id

    else:
        raise ValueError(f"Invalid model name: {model_name}")
    
    best_model.to(device)

    best_trainer = create_trainer(
        model=best_model,
        tokenizer=tokenizer,
        train_encodings=X_train_encodings,
        val_encodings=X_val_encodings,
        y_train=y_train,
        batch_size=batch_size,
        epochs=epochs,
        learning_rate=study.best_params['learning_rate'],
        weight_decay=study.best_params['weight_decay'],
        warmup_ratio=study.best_params['warmup_ratio']
    )

    start_time = time.time()
    best_trainer.train()
    training_time = time.time() - start_time
    print(f"Training time for {model_name}: {training_time:.2f} seconds")

    # Evaluate on test set
    eval_results = evaluate_model(best_model, tokenizer, X_test_encodings, y_test, label_names, model_name, model_save_path)
    eval_results['optuna_params'] = study.best_params
    eval_results['training_time'] = training_time

    save_model(best_model, tokenizer, model_save_path)

    return eval_results

# Pipeline Orchestration

## load_and_evaluate_all_models

Objective :-

            This function automates the end-to-end benchmarking of transformer models (BERT, XLNet, GPT-2) for emotion classification by:
            Downsampling the dataset to a fixed number of rows while preserving the original class proportions.
            Running consistent training and evaluation routines with hyperparameter optimization.
            Producing reproducible results, model artifacts, and visualizations for reporting.

Loads the data, downsamples it while preserving class proportions, fine-tunes three transformer models (BERT, XLNet, GPT-2), evaluates their performance, and stores the results for comparison.

    Parameters:
    - file_path : Path to the input CSV file (currently unused; the dataset file name is hardcoded in the function body).
    - augment : Whether to apply data augmentation for class balancing.
    - num_augmentations : Number of augmentations per minority sample.
    - epochs : Number of training epochs for each model.
    - batch_size : Batch size used for training and evaluation.
    - n_trials : Number of Optuna trials for hyperparameter search.
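
The proportional downsampling step can be sketched as follows: each class receives a share of the target row count equal to its share of the original data, fractions are truncated to integers, and the rounding shortfall is added to the most frequent class. The helper name and the sample labels are hypothetical, and the actual function operates on a pandas DataFrame rather than a list.

```python
from collections import Counter

def proportional_counts(labels, target_rows):
    """Rows to keep per class so the subset mirrors the original distribution."""
    ordered = Counter(labels).most_common()  # most frequent class first
    total = len(labels)
    counts = {}
    for label, n in ordered:
        share_pct = n / total * 100                        # class share in percent
        counts[label] = int(share_pct * target_rows / 100)  # truncate to whole rows
    # Give the rows lost to truncation to the most frequent class
    most_frequent = ordered[0][0]
    counts[most_frequent] += target_rows - sum(counts.values())
    return counts

# Hypothetical dataset of 900 labels reduced to 100 rows
labels_demo = ["joy"] * 500 + ["sadness"] * 300 + ["anger"] * 100
kept = proportional_counts(labels_demo, target_rows=100)
print(kept)  # {'joy': 56, 'sadness': 33, 'anger': 11}
```

Because the class proportions are preserved, this step reduces the dataset size but does not by itself remove class imbalance.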

   

In [None]:
def load_and_evaluate_all_models(file_path='emotion.csv', augment=False, num_augmentations=2, epochs=3, batch_size=16, n_trials=10):
    """Loads, preprocesses, fine-tunes, evaluates, and saves BERT, XLNet, and GPT-2 models."""

    # NOTE: the dataset file name is hardcoded; the `file_path` argument is currently ignored.
    csv_path = 'emotion_sentimen_dataset.csv'
    try:
        # from google.colab import drive
        # drive.mount('/content/drive')
        df_org = pd.read_csv(csv_path)
        df_org = df_org[['text', 'Emotion']].drop_duplicates()
    except FileNotFoundError:
        print(f"Error: '{csv_path}' not found.")
        return

    print("Original label distribution:\n", df_org['Emotion'].value_counts(normalize=True))

    target_rows = 300000
    emotion_distribution = df_org['Emotion'].value_counts(normalize=True) * 100

    # Number of rows to keep per emotion, proportional to the original distribution
    new_emotion_counts = (emotion_distribution * target_rows / 100).astype(int)
    # new_emotion_counts = new_emotion_counts.apply(lambda x: max(x, 2))  # would ensure at least 2 instances per emotion

    # Truncation can leave the total short of target_rows; assign the remainder
    # to the most frequent emotion (value_counts sorts in descending order)
    difference = target_rows - new_emotion_counts.sum()
    new_emotion_counts.iloc[0] += difference

    # Downsample each emotion without replacement, keeping the original distribution.
    # Despite its name, new_df_balanced is not class-balanced.
    # Sampling without replacement requires at least target_rows rows in df_org.
    subsets = [
        df_org[df_org["Emotion"] == emotion].sample(n=count, replace=False, random_state=42)
        for emotion, count in new_emotion_counts.items()
    ]
    new_df_balanced = pd.concat(subsets)

    # Shuffle the new dataset to mix emotions
    new_df_balanced = new_df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)
    print("Downsampled label distribution:\n", new_df_balanced['Emotion'].value_counts(normalize=True))
    
    df = new_df_balanced[['text', 'Emotion']].rename(columns={'Emotion': 'label'}).dropna()

    # Original distribution (normalized)
    original_dist = df_org['Emotion'].value_counts(normalize=True).sort_index()
    # Downsampled distribution (normalized)
    downsampled_dist = new_df_balanced['Emotion'].value_counts(normalize=True).sort_index()

    # Combine into a single DataFrame
    comparison_df = pd.DataFrame({
        'Original (%)': (original_dist * 100).round(2),
        'Downsampled (%)': (downsampled_dist * 100).round(2),
    })
    print("\nClass Distribution Comparison (Original vs Downsampled):\n", comparison_df)
    

    # Create folders
    bert_folder = './bert_emotion_model'
    xlnet_folder = './xlnet_emotion_model'
    gpt2_folder = './gpt2_emotion_model'

    os.makedirs(bert_folder, exist_ok=True)
    os.makedirs(xlnet_folder, exist_ok=True)
    os.makedirs(gpt2_folder, exist_ok=True)

    results = {}

    # BERT
    bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    results['BERT'] = fine_tune_and_evaluate(df.copy(), 'BERT', bert_tokenizer, bert_folder, augment, num_augmentations, epochs, batch_size, n_trials)

    # XLNet
    xlnet_tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
    # Proper padding setup
    xlnet_tokenizer.pad_token = xlnet_tokenizer.eos_token if xlnet_tokenizer.pad_token is None else xlnet_tokenizer.pad_token
    xlnet_tokenizer.pad_token_id = xlnet_tokenizer.convert_tokens_to_ids(xlnet_tokenizer.pad_token)
    xlnet_tokenizer.padding_side = 'right'
    results['XLNet'] = fine_tune_and_evaluate(df.copy(), 'XLNet', xlnet_tokenizer, xlnet_folder, augment, num_augmentations, epochs, batch_size, n_trials)

    # GPT-2
    gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    results['GPT-2'] = fine_tune_and_evaluate(df.copy(), 'GPT-2', gpt2_tokenizer, gpt2_folder, augment, num_augmentations, epochs, batch_size, n_trials)

    # Find best model
    best_model_name = max(results, key=lambda k: results[k]['f1_macro'])
    print(f"\nBest model: {best_model_name} with F1 Macro: {results[best_model_name]['f1_macro']}")

    with open("results_summary.json", "w") as f:
        json.dump(results, f, indent=4, default=convert_ndarray)

    with open("results_summary.txt", "w") as f:
        for model, metrics in results.items():
            f.write(f"\nModel: {model}\n")
            f.write(f"  Accuracy: {metrics.get('accuracy')}\n")
            f.write(f"  F1 Macro: {metrics.get('f1_macro')}\n")
            f.write(f"  Training Time (s): {metrics.get('training_time')}\n")
            
            f.write("  Optuna Params:\n")
            for k, v in metrics.get("optuna_params", {}).items():
                f.write(f"    - {k}: {v}\n")
            
            f.write("  Confusion Matrix:\n")
            cm = np.array(metrics.get("cm"))
            for row in cm:
                f.write("    " + " ".join(map(str, row)) + "\n")


    for model in results:
        # Save individual model params to JSON
        with open(f"{model.lower()}_optuna_best_params.json", "w") as f:
            json.dump(results[model].get('optuna_params', {}), f, indent=4, default=convert_ndarray)

        # Save individual model params to TXT (inside the loop, so every model gets a file)
        with open(f"{model.lower()}_optuna_best_params.txt", "w") as f:
            f.write(f"{model} Best Optuna Parameters:\n")
            for k, v in results[model].get('optuna_params', {}).items():
                f.write(f"- {k}: {v}\n")



    save_results_to_csv(results)
    plot_model_comparison(results)

    return results

# Run the training and evaluation

load_and_evaluate_all_models(augment=True, n_trials=2) 


# Visualization and Reporting

Objective:

    The visualization function supports a comparative analysis of model performance by plotting the macro-averaged F1 score of each model as a bar chart. This makes it easy to identify the best-performing model and suits academic presentations, reports, or publications.

    The CSV export function supports reproducibility and transparency by storing key evaluation metrics (F1 score, accuracy, training time, and the best hyperparameters) in a structured, machine-readable format. This output is suited to thesis appendices, tabulated result sections, or supplementary materials.
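
The export flattens each model's nested metrics into one row, with one extra column per hyperparameter. The sketch below shows that flattening with the standard-library `csv` module in place of pandas, so it runs without the rest of the pipeline. The metric values are copied from the BERT and GPT-2 runs reported in the Output section; the hyperparameter names and values are placeholders, not the tuned ones.

```python
import csv
import io

# Metric values come from the reported runs; the optuna_params entries
# are hypothetical placeholders used only to show the column layout.
results = {
    "BERT": {
        "f1_macro": 0.622697837450507,
        "accuracy": 0.8360849622017293,
        "training_time": 2598.51,
        "optuna_params": {"learning_rate": 3e-5, "weight_decay": 0.01},
    },
    "GPT-2": {
        "f1_macro": 0.6146183418210642,
        "accuracy": 0.8367242564209113,
        "training_time": 2791.45,
        "optuna_params": {"learning_rate": 5e-5, "warmup_ratio": 0.1},
    },
}

# One flat row per model: fixed metric columns plus one column per hyperparameter.
rows = [
    {
        "Model": name,
        "F1 Macro": metrics.get("f1_macro"),
        "Accuracy": metrics.get("accuracy"),
        "Training Time (s)": metrics.get("training_time"),
        **metrics.get("optuna_params", {}),
    }
    for name, metrics in results.items()
]

# Union of all keys in first-seen order, so models with different
# hyperparameters still share one header row.
fieldnames = list(dict.fromkeys(key for row in rows for key in row))

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

A hyperparameter that was tuned for only some models leaves an empty cell for the others here; `pd.DataFrame(rows)` handles the same case by filling the cell with `NaN`.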

## plot_model_comparison

In [None]:
# results (dict): Dictionary containing performance metrics for each model
def plot_model_comparison(results):
    model_names = list(results.keys())
    f1_scores = [results[m]['f1_macro'] for m in model_names]

    plt.figure(figsize=(6, 4))
    plt.bar(model_names, f1_scores)
    plt.title("F1 Macro Score by Model")
    plt.ylabel("F1 Macro")
    plt.xlabel("Model")
    plt.ylim(0, 1)
    plt.tight_layout()
    plt.savefig("f1_score_comparison.png")
    plt.close()

## save_results_to_csv

In [None]:
# Exports model evaluation metrics and hyperparameter configurations to a CSV file.
def save_results_to_csv(results, file_path='final_results.csv'):
    rows = []
    for model_name, metrics in results.items():
        rows.append({
            "Model": model_name,
            "F1 Macro": metrics.get("f1_macro"),
            "Accuracy": metrics.get("accuracy"),
            "Training Time (s)": metrics.get("training_time"),
            **metrics.get("optuna_params", {})
        })
    df = pd.DataFrame(rows)
    df.to_csv(file_path, index=False)
    print(f"Saved CSV to {file_path}")

# Output

## BERT

/home2/sk23aib/ml.py:254:
  trainer = Trainer(model=model, tokenizer=tokenizer)
  
    {'eval_loss': 1.1640475988388062, 'eval_accuracy': 0.8382905272579073, 'eval_f1_macro': 0.6192348587152997, 'eval_runtime': 37.898, 'eval_samples_per_second': 1650.986, 'eval_steps_per_second': 103.198, 'epoch': 3.0}
    {'train_runtime': 2598.2224, 'train_samples_per_second': 337.14, 'train_steps_per_second': 21.072, 'train_loss': 1.2009653114249172, 'epoch': 3.0}
Training time for BERT: 2598.51 seconds


/home2/sk23aib/ml.py:228: 
  trainer = Trainer(

BERT Results:
  Accuracy: 0.8360849622017293
  F1 Macro: 0.622697837450507
  Classification Report:
                 precision    recall  f1-score   support

           anger      0.77      0.58      0.66      2040
         boredom      0.71      0.50      0.59        20
           empty      0.50      0.55      0.52       925
      enthusiasm      0.47      0.60      0.53      1541
             fun      0.60      0.53      0.56      1664
       happiness      0.51      0.73      0.60      4469
            hate      0.66      0.59      0.62      2112
            love      0.83      0.66      0.73      6044
         neutral      1.00      0.99      0.99     36216
          relief      0.63      0.53      0.57      2744
         sadness      0.55      0.69      0.61      2908
        surprise      0.61      0.52      0.56      1176
           worry      0.62      0.46      0.53       710

        accuracy                          0.84     62569
       macro avg      0.65      0.61      0.62     62569
    weighted avg      0.85      0.84      0.84     62569


Model and tokenizer saved to: ./bert_emotion_model

## XLNet


Loading existing augmented dataset from: augmented_dataset.csv
        
Label names: 'anger', 'boredom', 'relief', 'sadness', 'surprise', 'worry', 'empty', 'enthusiasm', 'fun', 'happiness', 'hate', 'love', 'neutral'
Num of classes (num_labels): 13


/home2/sk23aib/ml.py:254: 
  trainer = Trainer(model=model, tokenizer=tokenizer)
  
    {'eval_loss': 1.1928361654281616, 'eval_accuracy': 0.8404161805366875, 'eval_f1_macro': 0.6154764789644133, 'eval_runtime': 78.825, 'eval_samples_per_second': 793.771, 'eval_steps_per_second': 49.616, 'epoch': 3.0}
    {'train_runtime': 4709.4429, 'train_samples_per_second': 186.002, 'train_steps_per_second': 11.626, 'train_loss': 1.2257617823544158, 'epoch': 3.0}
Training time for XLNet: 4709.73 seconds


/home2/sk23aib/ml.py:228: 
  trainer = Trainer(

XLNet Results:
  Accuracy: 0.8386261567229778
  F1 Macro: 0.6049235095074379
  Classification Report:
                 precision    recall  f1-score   support

           anger      0.61      0.63      0.62      2040
         boredom      0.62      0.25      0.36        20
           empty      0.63      0.52      0.57       925
      enthusiasm      0.56      0.57      0.57      1541
             fun      0.50      0.55      0.52      1664
       happiness      0.55      0.70      0.62      4469
            hate      0.68      0.58      0.62      2112
            love      0.80      0.68      0.74      6044
         neutral      0.99      1.00      1.00     36216
          relief      0.59      0.54      0.57      2744
         sadness      0.64      0.64      0.64      2908
        surprise      0.53      0.55      0.54      1176
           worry      0.52      0.52      0.52       710

        accuracy                          0.84     62569
       macro avg      0.63      0.59      0.60     62569
    weighted avg      0.84      0.84      0.84     62569


Model and tokenizer saved to: ./xlnet_emotion_model

## GPT-2


Loading existing augmented dataset from: augmented_dataset.csv

Label names: 'anger', 'boredom', 'relief', 'sadness', 'surprise', 'worry', 'empty', 'enthusiasm', 'fun', 'happiness', 'hate', 'love', 'neutral'
Num of classes (num_labels): 13

/home2/sk23aib/ml.py:254: 
  trainer = Trainer(model=model, tokenizer=tokenizer)
    {'eval_loss': 1.1552095413208008, 'eval_accuracy': 0.8376831977496844, 'eval_f1_macro': 0.6186894375669489, 'eval_runtime': 41.7142, 'eval_samples_per_second': 1499.944, 'eval_steps_per_second': 93.757, 'epoch': 3.0}
    {'train_runtime': 2791.202, 'train_samples_per_second': 313.83, 'train_steps_per_second': 19.615, 'train_loss': 1.2230425001170537, 'epoch': 3.0}
Training time for GPT-2: 2791.45 seconds



GPT-2 Results:
  Accuracy: 0.8367242564209113
  F1 Macro: 0.6146183418210642
  Classification Report:
                 precision    recall  f1-score   support

           anger      0.63      0.62      0.63      2040
         boredom      0.89      0.40      0.55        20
           empty      0.56      0.53      0.54       925
      enthusiasm      0.51      0.58      0.55      1541
             fun      0.55      0.53      0.54      1664
       happiness      0.56      0.69      0.62      4469
            hate      0.68      0.58      0.62      2112
            love      0.80      0.69      0.74      6044
         neutral      0.99      0.99      0.99     36216
          relief      0.58      0.55      0.56      2744
         sadness      0.64      0.65      0.64      2908
        surprise      0.50      0.55      0.53      1176
           worry      0.46      0.51      0.48       710

        accuracy                          0.84     62569
       macro avg      0.64      0.60      0.61     62569
    weighted avg      0.84      0.84      0.84     62569


Model and tokenizer saved to: ./gpt2_emotion_model

Best model: BERT with F1 Macro: 0.622697837450507
Saved CSV to final_results.csv
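
All three models reach an accuracy of about 0.84 but a macro F1 of only about 0.61 to 0.62. One way to read this gap is to recompute both averages from the per-class rows of a report. The sketch below does so for BERT, using the rounded F1 scores and supports copied from its classification report; because the inputs are rounded to two decimals, the recomputed averages can differ from the reported ones in the last digit.

```python
# (f1-score, support) per class, copied from the BERT classification report.
bert_report = {
    "anger": (0.66, 2040),
    "boredom": (0.59, 20),
    "empty": (0.52, 925),
    "enthusiasm": (0.53, 1541),
    "fun": (0.56, 1664),
    "happiness": (0.60, 4469),
    "hate": (0.62, 2112),
    "love": (0.73, 6044),
    "neutral": (0.99, 36216),
    "relief": (0.57, 2744),
    "sadness": (0.61, 2908),
    "surprise": (0.56, 1176),
    "worry": (0.53, 710),
}

f1_scores = [f1 for f1, _ in bert_report.values()]
supports = [n for _, n in bert_report.values()]
total = sum(supports)

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * n for f1, n in bert_report.values()) / total

# Share of the test set taken up by the largest class.
neutral_share = bert_report["neutral"][1] / total

print(f"macro F1      = {macro_f1:.3f}")
print(f"weighted F1   = {weighted_f1:.3f}")
print(f"neutral share = {neutral_share:.1%}")
```

The `neutral` class makes up more than half of the test set and is classified almost perfectly, so it lifts accuracy and the weighted average, while the macro average gives the twelve smaller and harder classes equal weight.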