# F3 - Automated Hyperparameter Optimization
# Comparing Grid Search vs Random Search for Tagalog Title Generation

**Installation and Imports**

This cell is the initial setup phase for the notebook. It installs the necessary packages and imports the required libraries so that every tool is available to the later cells.

Installation: It uses pip to install key libraries, including transformers for pre-trained AI models, datasets for handling data, and evaluate and rouge_score for assessing model performance.

Importing: It then imports these installed modules, along with other fundamental libraries like pandas for data manipulation and torch for deep learning.

Configuration: Finally, it performs some basic configuration by ignoring warning messages to keep the output clean and setting a fixed random seed to guarantee that any random processes are reproducible each time the code is run.
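To illustrate why a fixed seed matters, here is a minimal sketch using only Python's standard `random` module. It is a stand-in for illustration: `set_seed(42)` in the cell below seeds Python's `random`, NumPy, and PyTorch in a similar spirit, while `draw_numbers` is a hypothetical helper that does not appear in the notebook.

```python
import random


def draw_numbers(seed, count=3):
    """Return `count` pseudo-random integers drawn after seeding with `seed`."""
    rng = random.Random(seed)  # independent generator, global state is untouched
    return [rng.randint(0, 999) for _ in range(count)]


first_run = draw_numbers(42)
second_run = draw_numbers(42)

# The same seed reproduces the same sequence on every run
print(first_run == second_run)
```

The same principle is what lets two runs of a training trial with identical hyperparameters produce identical loss curves.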

In [1]:
# Install required packages
!pip install transformers datasets accelerate ray[tune] optuna evaluate rouge_score pandas matplotlib

# Import libraries
import torch
import pandas as pd
import time
import numpy as np
import matplotlib.pyplot as plt
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    set_seed,
    EarlyStoppingCallback
)
from datasets import Dataset
from evaluate import load
import json
import warnings
warnings.filterwarnings('ignore')

# Set seed for reproducibility
set_seed(42)


# Data Loading
This section loads your Tagalog essay data from a CSV file. The code will prompt you to upload your file.



This code cell handles the crucial task of importing your Tagalog essay dataset into the Colab environment. Its operation can be broken down into a few key phases:

First, it initiates the file upload process. Using a special Colab module, it triggers a dialog box that allows you to select a CSV file directly from your local computer. Once you choose a file, the system stores it in memory and confirms the successful upload by printing the file's name.

Next, the data is loaded and prepared for use. The code uses the Pandas library to read the uploaded CSV file, converting it into a structured table known as a DataFrame. It then performs several important data preparation steps:

It identifies and extracts the specific columns that contain the essay texts and their corresponding titles.

It cleans the data by replacing any missing values with empty strings and ensuring all entries are treated as text.

Finally, it converts these columns into simple Python lists for easier processing in the subsequent steps.

To ensure everything has loaded correctly, the code concludes by providing a summary. It prints the total number of essays and titles found, and then displays a preview of the first two data pairs, giving you a direct look at the actual content that the model will be trained on.

In [2]:
# Load data from CSV file
from google.colab import files
uploaded = files.upload()

# Get the uploaded filename
csv_filename = list(uploaded.keys())[0]
print(f"Uploaded file: {csv_filename}")

# Load the CSV data
df = pd.read_csv(csv_filename)
print(f"Data loaded: {len(df)} rows")
print(f"Columns: {list(df.columns)}")

# Use the correct column names from your CSV
essay_column = 'ESSAY'
title_column = 'TITLE'

# Extract essays and titles
essays = df[essay_column].fillna('').astype(str).tolist()
titles = df[title_column].fillna('').astype(str).tolist()

print(f"Found {len(essays)} essays and {len(titles)} titles")

# Show data samples
print("\nData samples:")
for i in range(min(2, len(essays))):
    print(f"Sample {i+1}:")
    print(f"  Essay: {essays[i][:80]}...")
    print(f"  Title: {titles[i]}")
    print()

Saving TAGALOG_ESSAYS_DATASET.csv to TAGALOG_ESSAYS_DATASET.csv
Uploaded file: TAGALOG_ESSAYS_DATASET.csv
Data loaded: 886 rows
Columns: ['TITLE', 'ESSAY', 'REFERENCES', 'gold_standard_titles', 'LABEL']
Found 886 essays and 886 titles

Data samples:
Sample 1:
  Essay: Hindi ako isang mangmang sa katotohanan na kinahaharap ko bilang isang estudyant...
  Title: Edukasyon, Bulok na, Bakit Mahal
Maikling

Sample 2:
  Essay: Libong taon na ang lumipas, nakintal sa isipan ng lahing kayumanggi ang mga hagu...
  Title: May Panahon Pa Kaibigan



# Data Preparation
Split the data into training, validation, and test sets (70/15/15 split).

This code cell executes a fundamental machine learning procedure: partitioning the complete dataset into distinct subsets for different phases of the model's development lifecycle. The purpose is to create separate data environments for training the model, tuning its parameters, and conducting an unbiased final evaluation.

The partitioning process follows a structured, three-step approach:

Calculation: The code first determines the size of each subset from predefined ratios: 70% of the data for training, 15% for validation, and the remainder (about 15%) for the final test.

Segmentation: Using these calculated sizes, the code slices the original lists of essays and titles in file order, without shuffling, into three non-overlapping groups:

The initial 70% of entries form the training set, used to teach the model.

The subsequent 15% form the validation set, used to guide model adjustments and prevent overfitting during training.

The final 15% form the test set, which is held back entirely to provide a final, unbiased assessment of the model's performance on completely new data.

Format Conversion: Finally, to ensure compatibility with the subsequent modeling steps, the code converts each of these segmented Python dictionaries into a specialized Hugging Face Dataset object. This format is highly optimized for efficient data loading and processing within the model training pipeline.
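The size calculation above can be checked by hand. The sketch below reproduces the arithmetic for the 886-row dataset; `split_sizes` is a hypothetical helper that mirrors the cell's logic rather than code taken from it.

```python
def split_sizes(total, train_frac=0.7, val_frac=0.15):
    """Mirror the notebook's arithmetic: truncate train/val sizes, test takes the rest."""
    train = int(train_frac * total)
    val = int(val_frac * total)
    test = total - train - val
    return train, val, test


train_n, val_n, test_n = split_sizes(886)
print(train_n, val_n, test_n)  # 620 132 134

# Slice boundaries used for the three non-overlapping groups
print(f"train: [0, {train_n})")
print(f"val:   [{train_n}, {train_n + val_n})")
print(f"test:  [{train_n + val_n}, 886)")
```

Because `int()` truncates, the test set absorbs the rounding remainder, which is why it ends up with 134 rows instead of 132.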

In [3]:
# Split data into training, validation, and test sets
total_size = len(essays)
train_size = int(0.7 * total_size)
val_size = int(0.15 * total_size)

train_data = {
    "essay": essays[:train_size],
    "title": titles[:train_size]
}
val_data = {
    "essay": essays[train_size:train_size + val_size],
    "title": titles[train_size:train_size + val_size]
}
test_data = {
    "essay": essays[train_size + val_size:],
    "title": titles[train_size + val_size:]
}

print(f"Data split completed:")
print(f"  Training samples: {len(train_data['essay'])}")
print(f"  Validation samples: {len(val_data['essay'])}")
print(f"  Test samples: {len(test_data['essay'])}")

# Convert to dataset format
train_dataset = Dataset.from_dict(train_data)
val_dataset = Dataset.from_dict(val_data)
test_dataset = Dataset.from_dict(test_data)

Data split completed:
  Training samples: 620
  Validation samples: 132
  Test samples: 134


# Model and Tokenizer Setup
Initialize the T5 model and tokenizer for the title generation task.



This cell establishes the fundamental architecture for the text generation pipeline, configuring both the AI model and the data processing mechanism required for the task.

The process unfolds through several key stages:

Model & Tokenizer Initialization: The code begins by loading the core components from the Hugging Face library. It specifies "t5-small", a compact version of the T5 model, balancing performance with computational efficiency. The corresponding tokenizer is also loaded, which is responsible for translating human-readable text into numerical tokens the model can process.

Data Preprocessing Function: A custom tokenize_function is defined to systematically prepare the data. This function performs two critical operations:

It prepends the instruction "summarize: " to each essay, explicitly telling the model the task it should perform.

It uses the tokenizer to convert both the essays (inputs) and titles (target labels) into sequences of numbers, applying padding to make all sequences a uniform length and truncation to enforce maximum length limits (512 tokens for essays, 64 for titles).

Dataset Transformation: The defined tokenization function is then applied to all three datasets—training, validation, and test. This processes the raw text in batches for efficiency, converting the entire corpus into a numerical format.

Framework Compatibility: Finally, the tokenized datasets are configured to return PyTorch tensors, ensuring seamless integration with the underlying deep learning framework during the model training and evaluation phases.
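One detail worth noting: with `padding="max_length"`, the label sequences contain pad token IDs, and the cell as written leaves them in place, so padded positions count toward the training loss. A common convention in Hugging Face seq2seq training is to replace pad IDs in the labels with -100, which the loss function ignores. Below is a minimal sketch, assuming a pad token ID of 0 (the value T5's tokenizer uses); `mask_label_padding` is a hypothetical helper, not part of the notebook.

```python
PAD_TOKEN_ID = 0      # assumption: T5's pad token ID
IGNORE_INDEX = -100   # label value skipped by the loss in Hugging Face seq2seq models


def mask_label_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace pad token IDs with IGNORE_INDEX in a batch of label sequences."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]


# Two toy label sequences padded to length 6 (the token IDs are made up)
batch = [
    [571, 12, 9, 1, 0, 0],
    [88, 1, 0, 0, 0, 0],
]
masked = mask_label_padding(batch)
print(masked)
# [[571, 12, 9, 1, -100, -100], [88, 1, -100, -100, -100, -100]]
```

Inside `tokenize_function`, this would be applied to `labels["input_ids"]` before assigning `model_inputs["labels"]`. The metrics function in the next section already maps -100 back to the pad ID before decoding, so it would keep working unchanged.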

In [4]:
# Initialize model and tokenizer
MODEL_NAME = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Tokenization function
def tokenize_function(examples):
    inputs = ["summarize: " + essay for essay in examples["essay"]]
    model_inputs = tokenizer(
        inputs,
        max_length=512,
        truncation=True,
        padding="max_length"
    )

    labels = tokenizer(
        examples["title"],
        max_length=64,
        truncation=True,
        padding="max_length"
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Tokenize datasets
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)

# Set format for PyTorch
tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_val.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_test.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

print("Tokenization completed")

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Map:   0%|          | 0/620 [00:00<?, ? examples/s]

Map:   0%|          | 0/132 [00:00<?, ? examples/s]

Map:   0%|          | 0/134 [00:00<?, ? examples/s]

Tokenization completed


# Evaluation Metrics Setup
Define the metrics for evaluating model performance.

The cell establishes the evaluation framework for assessing the model's text generation performance by defining key metrics and a function to compute them.

The implementation proceeds through these stages:

Metric Selection: The code loads two standard NLP evaluation metrics—ROUGE and BLEU—which quantitatively measure the similarity between the model's generated titles and the human-written reference titles.

Evaluation Function: A compute_metrics function is created to be used automatically during training. This function performs several critical processing steps:

It extracts the model's predictions and true labels, converts any tensors to NumPy arrays, and, when the predictions arrive as logits, takes the argmax over the vocabulary to obtain token IDs.

It decodes the numerical token IDs back into human-readable text strings.

It calculates the ROUGE and BLEU scores by comparing the model's generated text against the reference text.

Error Handling: The entire process is wrapped in a try-except block to ensure that training doesn't halt if a temporary error occurs during metric calculation, providing fallback values instead.
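To make the ROUGE-L score concrete, the sketch below computes it from the longest common subsequence (LCS) of whitespace-separated tokens. It is a simplified illustration, not the `rouge_score` implementation that `evaluate` calls: it skips that library's tokenization and stemming, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_f1(prediction, reference):
    """F1 of LCS-based precision and recall over lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# LCS = 3 tokens, precision = 3/3, recall = 3/4
score = rouge_l_f1("May Panahon Pa", "May Panahon Pa Kaibigan")
print(round(score, 4))  # 0.8571
```

A prediction that drops one word of a four-word title still scores about 0.86, which gives a sense of scale for the validation scores reported later.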

In [11]:
# Fixed Evaluation Metrics Setup for T5
# ===============================================

# Initialize metrics
rouge = load("rouge")
bleu = load("bleu")

def compute_metrics(eval_pred):
    """Fixed version for T5 that handles tuple inputs correctly"""
    try:
        predictions, labels = eval_pred

        # For T5, predictions might be a tuple where the first element is logits
        if isinstance(predictions, tuple):
            predictions = predictions[0]  # Take the logits from the tuple

        # Convert to numpy arrays if they are tensors
        if hasattr(predictions, 'numpy'):
            predictions = predictions.numpy()
        if hasattr(labels, 'numpy'):
            labels = labels.numpy()

        # Handle the shape - predictions are logits with shape (batch_size, seq_length, vocab_size)
        if predictions.ndim == 3:
            # Take the argmax to get predicted token IDs
            predictions = np.argmax(predictions, axis=-1)

        # Replace -100 with pad_token_id for both predictions and labels
        predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

        # Decode predictions and labels
        decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        # Clean up any empty strings
        decoded_preds = [pred.strip() if pred.strip() else "empty" for pred in decoded_preds]
        decoded_labels = [label.strip() if label.strip() else "empty" for label in decoded_labels]

        # Compute ROUGE
        rouge_result = rouge.compute(
            predictions=decoded_preds,
            references=decoded_labels,
            use_stemmer=True
        )

        # Compute BLEU
        bleu_result = bleu.compute(
            predictions=decoded_preds,
            references=[[ref] for ref in decoded_labels]
        )

        return {
            "rouge1": rouge_result["rouge1"],
            "rouge2": rouge_result["rouge2"],
            "rougeL": rouge_result["rougeL"],
            "bleu": bleu_result["bleu"],
        }

    except Exception as e:
        print(f"Metrics computation failed: {e}")
        # Return zeros on failure so a failed evaluation cannot be mistaken for a real score
        return {
            "rouge1": 0.0,
            "rouge2": 0.0,
            "rougeL": 0.0,
            "bleu": 0.0,
        }

print("Fixed T5 metrics setup completed")

Fixed T5 metrics setup completed


# Experiment Setup
Initialize the experiment log and define the hyperparameter search space.

This cell lays the foundation for hyperparameter optimization by configuring the experiment tracking system, defining the search parameters, and preparing the model initialization process.

The setup involves several key components:

Results Tracking: An empty DataFrame is created with predefined columns to systematically log all experiment details, including hyperparameter configurations, performance metrics, and training duration for each trial.

Performance Baseline: A baseline result from a previous experiment ("F2") is recorded to provide a reference point for evaluating whether the hyperparameter optimization yields meaningful improvements.

Search Space Definition: The tune_hp function specifies the hyperparameter ranges to explore:

Learning rates: 1e-5, 3e-5, or 5e-5

Batch sizes: 8 or 16

Training epochs: 5 or 10

Model Initialization: A model_init function ensures each optimization trial begins with a fresh model instance, preventing parameter contamination between experiments and guaranteeing fair comparisons.

This configuration creates a structured environment for systematically testing different hyperparameter combinations while maintaining consistent evaluation standards and comprehensive result tracking throughout the optimization process.

In [12]:
# Create experiment log
experiment_log_columns = [
    'search_type', 'trial_id', 'learning_rate', 'batch_size', 'epochs',
    'val_rouge1', 'val_rouge2', 'val_rougeL', 'val_bleu', 'train_loss',
    'trial_time_seconds', 'cumulative_time'
]

experiment_log = pd.DataFrame(columns=experiment_log_columns)

# F2 baseline results
F2_BASELINE = {
    'val_rougeL': 0.3988,
    'test_rougeL': 0.3625,
    'params': {'learning_rate': 5e-5, 'per_device_train_batch_size': 8, 'num_train_epochs': 10},
    'method': 'F2_Manual_Best'
}

print(f"F2 Baseline: ROUGE-L = {F2_BASELINE['val_rougeL']:.4f}")

# Hyperparameter search space
def tune_hp(trial):
    learning_rate = trial.suggest_categorical("learning_rate", [1e-5, 3e-5, 5e-5])
    batch_size = trial.suggest_categorical("per_device_train_batch_size", [8, 16])
    epochs = trial.suggest_categorical("num_train_epochs", [5, 10])

    return {
        "learning_rate": learning_rate,
        "per_device_train_batch_size": batch_size,
        "num_train_epochs": epochs,
    }

# Model initialization function
def model_init():
    return AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

print("Experiment setup completed")

F2 Baseline: ROUGE-L = 0.3988
Experiment setup completed


# Grid Search Experiment
Run the search labeled Grid Search with a budget of 12 trials, matching the 12 combinations in the search space.

This cell runs a 12-trial hyperparameter search for the T5 model through `trainer.hyperparameter_search` with the Optuna backend. The process works through the following steps:

Trial Budget: The search space contains 3 learning rates × 2 batch sizes × 2 epoch counts = 12 combinations, and `n_trials=12` is set to match. Because no grid sampler is passed, Optuna uses its default sampler and pruner, so the run is not strictly exhaustive: in the log below, some combinations repeat (trials 0 and 8 use identical parameters) and six trials are pruned early.

Automated Evaluation: For each trial, the system trains a fresh model instance and scores it on the validation set with the ROUGE-L metric, which measures how closely the predicted titles match the reference titles. A trial's value is the ROUGE-L from its final evaluation epoch.

Results Tracking: Optuna keeps each trial's hyperparameters, objective value, and duration in its in-memory study. This cell does not copy them into `experiment_log`.

Best Configuration Identification: After the trials finish, the system reports the hyperparameter set with the highest objective value.
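For reference, an exhaustive grid over this search space can be enumerated directly. The sketch below uses only the standard library to list all 12 combinations; the variable names are illustrative. Optuna also provides a `GridSampler` for this purpose, and `Trainer.hyperparameter_search` accepts extra keyword arguments for the backend, though wiring that in is not shown here and would need checking against the installed versions.

```python
from itertools import product

# Same categorical choices as in tune_hp
search_space = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "per_device_train_batch_size": [8, 16],
    "num_train_epochs": [5, 10],
}

names = list(search_space)
# Cartesian product of all choices: one dict per unique configuration
grid = [dict(zip(names, values)) for values in product(*search_space.values())]

print(len(grid))  # 12
for config in grid[:3]:
    print(config)
```

Each dictionary in `grid` is a distinct configuration, so iterating over it once visits every combination exactly once with no repeats.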

In [14]:
# Fixed Grid Search Experiment
# ===============================================

print("Starting Grid Search (12 trials)")

grid_start_time = time.time()

# Training arguments
training_args = TrainingArguments(
    output_dir="./f3_grid_search",
    eval_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    num_train_epochs=3,
    fp16=torch.cuda.is_available(),
    logging_steps=10,
    report_to="none",
    remove_unused_columns=False,
    # Disable best model loading to simplify
    load_best_model_at_end=False,
)

# Create trainer
trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Robust compute_objective function that handles different metric names
def compute_objective(metrics):
    """Handle different possible metric name formats"""
    # Try different possible names for ROUGE-L
    if "rougeL" in metrics:
        return metrics["rougeL"]
    elif "rougel" in metrics:  # lowercase L
        return metrics["rougel"]
    elif "eval_rougeL" in metrics:
        return metrics["eval_rougeL"]
    elif "eval_rougel" in metrics:
        return metrics["eval_rougel"]
    else:
        # If no ROUGE-L found, use ROUGE-1 or return 0
        print(f"Available metrics: {list(metrics.keys())}")
        return metrics.get("rouge1", metrics.get("eval_rouge1", 0.0))

# Run Grid Search
try:
    grid_result = trainer.hyperparameter_search(
        direction="maximize",
        backend="optuna",
        hp_space=tune_hp,
        n_trials=12,
        compute_objective=compute_objective  # Use the robust function
    )

    grid_total_time = time.time() - grid_start_time

    print(f"Grid Search completed in {grid_total_time:.2f} seconds")
    print(f"Best Grid Search Score: {grid_result.objective:.4f}")
    print(f"Best Grid Search Parameters: {grid_result.hyperparameters}")

except Exception as e:
    print(f"Grid Search failed: {e}")
    print("Trying with reduced trials...")

    # Try with fewer trials
    grid_result = trainer.hyperparameter_search(
        direction="maximize",
        backend="optuna",
        hp_space=tune_hp,
        n_trials=6,
        compute_objective=compute_objective
    )

    grid_total_time = time.time() - grid_start_time
    print(f"Grid Search completed with reduced trials in {grid_total_time:.2f} seconds")

Starting Grid Search (12 trials)


[I 2025-11-08 11:50:46,966] A new study created in memory with name: no-name-0e987f4e-ae00-4365-b080-9d76412f4dac


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,1.1383,1.243947,0.236978,0.0394,0.234094,0.036034
2,0.9248,1.084008,0.207617,0.040764,0.203024,0.034665
3,0.8409,1.014063,0.233588,0.046967,0.227864,0.040498
4,0.7607,0.988497,0.234656,0.052761,0.232009,0.045528
5,0.8031,0.981517,0.238043,0.052868,0.235351,0.046113


[I 2025-11-08 11:53:40,713] Trial 0 finished with value: 0.2353506904015199 and parameters: {'learning_rate': 5e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 5}. Best is trial 0 with value: 0.2353506904015199.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,3.3275,1.914106,0.102164,0.019265,0.101954,0.0
2,1.5815,1.496229,0.173341,0.025204,0.17297,0.0
3,1.2121,1.360048,0.211326,0.031332,0.209137,0.016588
4,1.1055,1.268958,0.204009,0.028705,0.202134,0.016875
5,1.0786,1.22496,0.197567,0.029647,0.196627,0.022027
6,1.1705,1.199152,0.198943,0.032138,0.196649,0.028395
7,0.9904,1.180233,0.196088,0.031052,0.192774,0.024114
8,1.0698,1.16816,0.192712,0.032738,0.189302,0.028895
9,1.0424,1.160461,0.199202,0.033466,0.196268,0.029331
10,0.9304,1.157625,0.198347,0.033466,0.19483,0.029287


[I 2025-11-08 12:00:01,926] Trial 1 finished with value: 0.19483008163499121 and parameters: {'learning_rate': 1e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 10}. Best is trial 0 with value: 0.2353506904015199.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,3.2649,1.625256,0.109096,0.018921,0.10923,0.0
2,1.4517,1.437562,0.20264,0.025884,0.199151,0.0
3,1.1766,1.275576,0.219103,0.037263,0.215996,0.030188
4,1.0859,1.230029,0.20768,0.031444,0.204118,0.025629
5,1.0425,1.217074,0.20782,0.031607,0.204271,0.026791


[I 2025-11-08 12:03:21,097] Trial 2 finished with value: 0.20427116800068612 and parameters: {'learning_rate': 3e-05, 'per_device_train_batch_size': 16, 'num_train_epochs': 5}. Best is trial 0 with value: 0.2353506904015199.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,1.3585,1.389661,0.222279,0.033594,0.217863,0.022738
2,1.0251,1.172765,0.184416,0.031501,0.18059,0.024362
3,0.8979,1.06682,0.225879,0.043476,0.21994,0.040713
4,0.8024,1.019266,0.220656,0.046627,0.218139,0.042181
5,0.8195,0.995396,0.218518,0.051565,0.216948,0.049442
6,0.9107,0.979874,0.239194,0.053825,0.235281,0.047943
7,0.7314,0.966783,0.247331,0.059733,0.243178,0.055774
8,0.7969,0.959992,0.242149,0.059319,0.23882,0.050978
9,0.7941,0.955234,0.246042,0.059485,0.242895,0.05134
10,0.6897,0.953755,0.244145,0.059603,0.241322,0.051221


[I 2025-11-08 12:09:19,175] Trial 3 finished with value: 0.24132227953669919 and parameters: {'learning_rate': 3e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 10}. Best is trial 3 with value: 0.24132227953669919.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,1.8721,1.486448,0.202226,0.021958,0.198429,0.0
2,1.1111,1.209854,0.19529,0.035095,0.192403,0.029229
3,0.9281,1.072334,0.228188,0.048791,0.222609,0.048322
4,0.8524,1.02847,0.24168,0.05345,0.238485,0.050757
5,0.7958,0.989065,0.242961,0.056927,0.240765,0.054722
6,0.8442,0.97748,0.247279,0.058543,0.241134,0.047854
7,0.7603,0.957007,0.252108,0.069086,0.246192,0.061324
8,0.7576,0.952216,0.244001,0.0648,0.241763,0.052141
9,0.7703,0.946308,0.245334,0.068626,0.241712,0.053327
10,0.7057,0.944323,0.246329,0.068626,0.241237,0.053371


[I 2025-11-08 12:14:54,952] Trial 4 finished with value: 0.24123716067374862 and parameters: {'learning_rate': 5e-05, 'per_device_train_batch_size': 16, 'num_train_epochs': 10}. Best is trial 3 with value: 0.24132227953669919.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,1.3585,1.389661,0.222279,0.033594,0.217863,0.022738
2,1.0251,1.172765,0.184416,0.031501,0.18059,0.024362
3,0.8979,1.06682,0.225879,0.043476,0.21994,0.040713


[I 2025-11-08 12:16:27,327] Trial 5 pruned. 


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,2.7398,1.592672,0.122183,0.018694,0.122405,0.0


[I 2025-11-08 12:16:43,141] Trial 6 pruned. 


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,2.7398,1.592672,0.122183,0.018694,0.122405,0.0


[I 2025-11-08 12:16:58,807] Trial 7 pruned. 


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,1.1383,1.243947,0.236978,0.0394,0.234094,0.036034
2,0.9248,1.084008,0.207617,0.040764,0.203024,0.034665
3,0.8409,1.014063,0.233588,0.046967,0.227864,0.040498
4,0.7607,0.988497,0.234656,0.052761,0.232009,0.045528
5,0.8031,0.981517,0.238043,0.052868,0.235351,0.046113


[I 2025-11-08 12:20:02,258] Trial 8 finished with value: 0.2353506904015199 and parameters: {'learning_rate': 5e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 5}. Best is trial 3 with value: 0.24132227953669919.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,1.9672,1.513609,0.184724,0.021406,0.182148,0.0
2,1.1587,1.236029,0.204829,0.031278,0.202646,0.028082


[I 2025-11-08 12:21:00,301] Trial 9 pruned. 


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,3.3275,1.914106,0.102164,0.019265,0.101954,0.0


[I 2025-11-08 12:21:17,338] Trial 10 pruned. 


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,1.8721,1.486448,0.202226,0.021958,0.198429,0.0
2,1.1111,1.209854,0.19529,0.035095,0.192403,0.029229


[I 2025-11-08 12:22:10,161] Trial 11 pruned. 


Grid Search completed in 1884.65 seconds
Best Grid Search Score: 0.2413
Best Grid Search Parameters: {'learning_rate': 3e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 10}
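
As a quick arithmetic check against the F2 baseline recorded earlier, the sketch below compares the best validation ROUGE-L from this run with the baseline value. Both numbers are copied from the outputs above; whether the two scores were computed under identical evaluation settings is not stated in this notebook, so the gap should be read with that caveat.

```python
f2_baseline_val_rougeL = 0.3988   # from F2_BASELINE['val_rougeL']
best_search_val_rougeL = 0.2413   # best score reported by the search above

absolute_gap = best_search_val_rougeL - f2_baseline_val_rougeL
relative_gap = absolute_gap / f2_baseline_val_rougeL

print(f"Absolute difference: {absolute_gap:+.4f}")  # -0.1575
print(f"Relative difference: {relative_gap:+.1%}")  # -39.5%
```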


# Random Search Experiment
Run the Random Search with 6 trials drawn from the same search space.

This cell performs a Random Search to find effective hyperparameters for the model, using a smaller trial budget than the previous Grid Search.

The process works as follows:

Random Sampling: Instead of covering every combination, it samples 6 hyperparameter configurations (`n_trials=6`, half the Grid Search budget) from the same search space. Nothing prevents a configuration from being selected more than once.

Efficient Exploration: Sampling can find good configurations with fewer trials than an exhaustive search, particularly when only a few hyperparameters strongly affect the score.

Sequential Evaluation: Each sampled combination is trained and evaluated one after another, with its performance scored using the same ROUGE-L metric.

Results Collection: After the search, the cell attempts to append each trial's parameters, score, and duration to the experiment log for later comparison between the two optimization strategies; if logging fails, it prints the error instead of stopping.
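The random-sampling idea can be sketched with the standard library alone. This is an illustration of uniform sampling from the same categorical space, not the sampler Optuna uses internally; `sample_config` is a hypothetical helper.

```python
import random

# Same categorical choices as in tune_hp
search_space = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "per_device_train_batch_size": [8, 16],
    "num_train_epochs": [5, 10],
}


def sample_config(rng):
    """Draw one value per hyperparameter, uniformly and independently."""
    return {name: rng.choice(choices) for name, choices in search_space.items()}


rng = random.Random(42)  # seeded so the draws are reproducible
trials = [sample_config(rng) for _ in range(6)]

for trial_id, config in enumerate(trials):
    print(trial_id, config)

# Independent draws can repeat, so count how many settings were actually covered
distinct = {tuple(config.items()) for config in trials}
print(f"{len(distinct)} distinct configurations out of {len(trials)} trials")
```

With only 12 possible configurations, repeated draws are fairly likely in six trials, which is worth keeping in mind when comparing coverage between the two strategies.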

In [22]:
# Random Search Experiment
# ===============================================

print("Starting Random Search (6 trials)")

random_start_time = time.time()

# Use the same trainer setup from Grid Search
# Robust compute_objective function (same as used in Grid Search)
def compute_objective(metrics):
    """Handle different possible metric name formats"""
    if "rougeL" in metrics:
        return metrics["rougeL"]
    elif "rougel" in metrics:
        return metrics["rougel"]
    elif "eval_rougeL" in metrics:
        return metrics["eval_rougeL"]
    elif "eval_rougel" in metrics:
        return metrics["eval_rougel"]
    else:
        print(f"Available metrics: {list(metrics.keys())}")
        return metrics.get("rouge1", metrics.get("eval_rouge1", 0.0))

# Run Random Search
try:
    random_result = trainer.hyperparameter_search(
        direction="maximize",
        backend="optuna",
        hp_space=tune_hp,
        n_trials=6,  # 50% of Grid Search trials
        compute_objective=compute_objective
    )

    random_total_time = time.time() - random_start_time

    print(f"Random Search completed in {random_total_time:.2f} seconds")
    print(f"Best Random Search Score: {random_result.objective:.4f}")
    print(f"Best Random Search Parameters: {random_result.hyperparameters}")

    # Log Random Search results immediately after completion
    if 'random_result' in locals() and random_result is not None:
        try:
            if hasattr(trainer.hp_search_backend.study, 'trials'):
                for trial in trainer.hp_search_backend.study.trials:
                    trial_params = trial.params
                    trial_value = trial.value if trial.state.name == 'COMPLETE' else None
                    trial_duration = trial.duration.total_seconds() if trial.duration else None

                    user_attrs = trial.user_attrs if hasattr(trial, 'user_attrs') else {}
                    train_loss = user_attrs.get('train_loss', None)
                    eval_metrics = user_attrs.get('eval_metrics', {})

                    trial_log_entry = {
                        'search_type': 'Random Search',
                        'trial_id': trial.number,
                        'learning_rate': trial_params.get('learning_rate'),
                        'batch_size': trial_params.get('per_device_train_batch_size'),
                        'epochs': trial_params.get('num_train_epochs'),
                        'val_rouge1': eval_metrics.get('rouge1'),
                        'val_rouge2': eval_metrics.get('rouge2'),
                        'val_rougeL': eval_metrics.get('rougeL', trial_value),
                        'val_bleu': eval_metrics.get('bleu'),
                        'train_loss': train_loss,
                        'trial_time_seconds': trial_duration,
                        'cumulative_time': None
                    }
                    # Append this trial's row to the experiment log
                    # (no `global` statement is needed at notebook cell level)
                    experiment_log = pd.concat([experiment_log, pd.DataFrame([trial_log_entry])], ignore_index=True)

            print("Random Search results logged to experiment_log.")

        except Exception as e:
            print(f"Failed to log Random Search results: {e}")
    else:
        print("Random Search results not available to log.")


except Exception as e:
    print(f"Random Search failed: {e}")
    print("Please check the error and try again.")

[I 2025-11-08 13:22:44,246] A new study created in memory with name: no-name-0dd82f32-17d6-4971-8c07-61747ffcd25b


Starting Random Search (6 trials)


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,1.3585,1.389661,0.222279,0.033594,0.217863,0.022738
2,1.0251,1.172765,0.184416,0.031501,0.18059,0.024362
3,0.8979,1.06682,0.225879,0.043476,0.21994,0.040713
4,0.8024,1.019266,0.220656,0.046627,0.218139,0.042181
5,0.8195,0.995396,0.218518,0.051565,0.216948,0.049442
6,0.9107,0.979874,0.239194,0.053825,0.235281,0.047943
7,0.7314,0.966783,0.247331,0.059733,0.243178,0.055774
8,0.7969,0.959992,0.242149,0.059319,0.23882,0.050978
9,0.7941,0.955234,0.246042,0.059485,0.242895,0.05134
10,0.6897,0.953755,0.244145,0.059603,0.241322,0.051221


[I 2025-11-08 13:30:25,784] Trial 0 finished with value: 0.24132227953669919 and parameters: {'learning_rate': 3e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 10}. Best is trial 0 with value: 0.24132227953669919.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,2.7398,1.592672,0.122183,0.018694,0.122405,0.0
2,1.3541,1.371426,0.199405,0.034648,0.197866,0.025184
3,1.0917,1.211045,0.207772,0.031984,0.206964,0.030626
4,0.9868,1.146409,0.210332,0.040847,0.209332,0.038226
5,0.9128,1.099044,0.209403,0.045477,0.207136,0.037697
6,0.9529,1.072737,0.215554,0.03908,0.211771,0.031815
7,0.8708,1.04728,0.228072,0.046557,0.225058,0.042432
8,0.8698,1.035986,0.226704,0.045887,0.222864,0.039719
9,0.8766,1.030023,0.223215,0.046566,0.220575,0.039554
10,0.8099,1.026747,0.223956,0.044501,0.221164,0.038538


[I 2025-11-08 13:36:58,218] Trial 1 finished with value: 0.22116414672303847 and parameters: {'learning_rate': 3e-05, 'per_device_train_batch_size': 16, 'num_train_epochs': 10}. Best is trial 0 with value: 0.24132227953669919.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,1.8721,1.486448,0.202226,0.021958,0.198429,0.0
2,1.1111,1.209854,0.19529,0.035095,0.192403,0.029229
3,0.9281,1.072334,0.228188,0.048791,0.222609,0.048322
4,0.8524,1.02847,0.24168,0.05345,0.238485,0.050757
5,0.7958,0.989065,0.242961,0.056927,0.240765,0.054722
6,0.8442,0.97748,0.247279,0.058543,0.241134,0.047854
7,0.7603,0.957007,0.252108,0.069086,0.246192,0.061324
8,0.7576,0.952216,0.244001,0.0648,0.241763,0.052141
9,0.7703,0.946308,0.245334,0.068626,0.241712,0.053327
10,0.7057,0.944323,0.246329,0.068626,0.241237,0.053371


[I 2025-11-08 13:44:28,477] Trial 2 finished with value: 0.24123716067374862 and parameters: {'learning_rate': 5e-05, 'per_device_train_batch_size': 16, 'num_train_epochs': 10}. Best is trial 0 with value: 0.24132227953669919.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,1.1163,1.233325,0.238788,0.038979,0.235289,0.036622
2,0.8992,1.043737,0.21199,0.03782,0.209871,0.033069
3,0.8074,0.973941,0.245915,0.055655,0.24172,0.050811
4,0.7125,0.953072,0.254422,0.063734,0.252289,0.057943
5,0.7417,0.933006,0.256575,0.066974,0.253874,0.050754
6,0.8032,0.924752,0.262038,0.071199,0.259684,0.057438
7,0.626,0.913636,0.26818,0.074743,0.266122,0.061784
8,0.7024,0.907811,0.266093,0.072566,0.26312,0.055728
9,0.7143,0.903466,0.269236,0.074359,0.267457,0.059693
10,0.6019,0.902057,0.268559,0.074014,0.266414,0.059424


[I 2025-11-08 13:50:22,008] Trial 3 finished with value: 0.2664135418795441 and parameters: {'learning_rate': 5e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 10}. Best is trial 3 with value: 0.2664135418795441.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,1.1383,1.243947,0.236978,0.0394,0.234094,0.036034
2,0.9248,1.084008,0.207617,0.040764,0.203024,0.034665
3,0.8409,1.014063,0.233588,0.046967,0.227864,0.040498
4,0.7607,0.988497,0.234656,0.052761,0.232009,0.045528
5,0.8031,0.981517,0.238043,0.052868,0.235351,0.046113


[I 2025-11-08 13:53:33,821] Trial 4 finished with value: 0.2353506904015199 and parameters: {'learning_rate': 5e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 5}. Best is trial 3 with value: 0.2664135418795441.


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,1.8721,1.486448,0.202226,0.021958,0.198429,0.0
2,1.1111,1.209854,0.19529,0.035095,0.192403,0.029229


[I 2025-11-08 13:54:28,182] Trial 5 pruned. 


Random Search completed in 1903.94 seconds
Best Random Search Score: 0.2664
Best Random Search Parameters: {'learning_rate': 5e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 10}
Failed to log Random Search results: 'NoneType' object has no attribute 'study'
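
Because the logging step above failed, the per-trial durations never reached `experiment_log`. One possible fallback is to recover them from the Optuna log timestamps printed in the output. The following is a minimal sketch and not part of the original notebook: it assumes the timestamps are copied by hand from the log lines, and the names `LOG_TIMESTAMPS` and `trial_durations` are hypothetical. Each interval also includes any setup overhead between trials, so the values are approximations.

```python
from datetime import datetime

# Timestamps copied by hand from the Optuna log lines above (assumption:
# the first entry marks the start of the search, the rest mark trial ends).
LOG_TIMESTAMPS = [
    "2025-11-08 13:22:44,246",  # study created
    "2025-11-08 13:30:25,784",  # trial 0 finished
    "2025-11-08 13:36:58,218",  # trial 1 finished
    "2025-11-08 13:44:28,477",  # trial 2 finished
    "2025-11-08 13:50:22,008",  # trial 3 finished
    "2025-11-08 13:53:33,821",  # trial 4 finished
    "2025-11-08 13:54:28,182",  # trial 5 pruned
]


def trial_durations(timestamps):
    """Return the seconds elapsed between consecutive log timestamps."""
    parsed = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f") for ts in timestamps]
    return [(end - start).total_seconds() for start, end in zip(parsed, parsed[1:])]


durations = trial_durations(LOG_TIMESTAMPS)
for trial_id, seconds in enumerate(durations):
    print(f"trial {trial_id}: {seconds:.1f} s")
print(f"total: {sum(durations):.2f} s")
```

The intervals sum to roughly the 1903.94 seconds reported for the whole search, which suggests the timestamps are a reasonable stand-in for `trial_time_seconds`.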


In [23]:
# Manual Logging Template - This cell is no longer needed for logging
# ===============================================

print("Manual Logging Instructions")

print("Use this template to log each trial as they complete:")

# Grid Search Trials (from your output)
# grid_trials = [
#     {
#         'search_type': 'grid',
#         'trial_id': 0,
#         'learning_rate': 5e-5,
#         'batch_size': 8,
#         'epochs': 5,
#         'val_rougeL': 0.2354,
#         'val_rouge1': 0.2380,
#         'val_rouge2': 0.0529,
#         'val_bleu': 0.0461,
#         'trial_time_seconds': 170,  # 2:50 minutes
#         'cumulative_time': 170
#     },
#     {
#         'search_type': 'grid',
#         'trial_id': 1,
#         'learning_rate': 1e-5,
#         'batch_size': 8,
#         'epochs': 10,
#         'val_rougeL': 0.1948,
#         'val_rouge1': 0.1983,
#         'val_rouge2': 0.0335,
#         'val_bleu': 0.0293,
#         'trial_time_seconds': 377,  # 6:17 minutes
#         'cumulative_time': 547
#     },
#     {
#         'search_type': 'grid',
#         'trial_id': 2,
#         'learning_rate': 3e-5,
#         'batch_size': 16,
#         'epochs': 5,
#         'val_rougeL': 0.2043,
#         'val_rouge1': 0.2078,
#         'val_rouge2': 0.0316,
#         'val_bleu': 0.0268,
#         'trial_time_seconds': 196,  # 3:16 minutes
#         'cumulative_time': 743
#     },
#     {
#         'search_type': 'grid',
#         'trial_id': 3,
#         'learning_rate': 3e-5,
#         'batch_size': 8,
#         'epochs': 10,
#         'val_rougeL': 0.2413,
#         'val_rouge1': 0.2441,
#         'val_rouge2': 0.0596,
#         'val_bleu': 0.0512,
#         'trial_time_seconds': 352,  # 5:52 minutes
#         'cumulative_time': 1095
#     },
#     {
#         'search_type': 'grid',
#         'trial_id': 4,
#         'learning_rate': 5e-5,
#         'batch_size': 16,
#         'epochs': 10,
#         'val_rougeL': 0.2412,
#         'val_rouge1': 0.2463,
#         'val_rouge2': 0.0686,
#         'val_bleu': 0.0534,
#         'trial_time_seconds': 333,  # 5:33 minutes
#         'cumulative_time': 1428
#     }
# ]

# # Add pruned trials (with minimal values)
# pruned_trials = [
#     {
#         'search_type': 'grid',
#         'trial_id': i,
#         'learning_rate': 0,
#         'batch_size': 0,
#         'epochs': 0,
#         'val_rougeL': 0.0,
#         'val_rouge1': 0.0,
#         'val_rouge2': 0.0,
#         'val_bleu': 0.0,
#         'trial_time_seconds': 30,  # Estimated short time for pruned trials
#         'cumulative_time': 1428 + (i-4)*30,
#         'notes': 'PRUNED - Early stopped'
#     }
#     for i in range(5, 12)
# ]

# # Add all trials to experiment log
# for trial in grid_trials + pruned_trials:
#     experiment_log = pd.concat([experiment_log, pd.DataFrame([trial])], ignore_index=True)

# print("Grid Search trials logged. Add Random Search trials as they complete.")

# This cell now only holds manual logging instructions; the actual Grid Search logging code is in cell 8zWtNXJNQn_5.

Manual Logging Instructions
Use this template to log each trial as they complete:
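
The `cumulative_time` values in the template are running sums of `trial_time_seconds` (170, 547, 743, and so on), while the automated Random Search log entries set `cumulative_time` to `None`. A minimal sketch of filling that field in plain Python, assuming the trials are listed in completion order; `add_cumulative_time` is a hypothetical helper and not part of the notebook:

```python
from itertools import accumulate


def add_cumulative_time(trials):
    """Return copies of the trial dicts with 'cumulative_time' filled in.

    Assumes the list is ordered by completion time.
    """
    running = accumulate(t["trial_time_seconds"] for t in trials)
    return [{**t, "cumulative_time": total} for t, total in zip(trials, running)]


# Trial times taken from the commented-out grid_trials template above.
grid_times = [170, 377, 196, 352, 333]
trials = [{"trial_id": i, "trial_time_seconds": s} for i, s in enumerate(grid_times)]

for row in add_cumulative_time(trials):
    print(row["trial_id"], row["cumulative_time"])
```

The same helper could be applied to the Random Search entries once their durations are known, instead of typing the running totals by hand.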


In [20]:
# Final Test Set Evaluation
# ===============================================

print("Final Evaluation on Test Set")

def evaluate_final_model(hyperparams, model_name):
    """Retrain a fresh model with the given hyperparameters, then evaluate it on the test set."""
    print(f"Evaluating {model_name} on test set...")

    # Training arguments for final evaluation
    eval_args = TrainingArguments(
        output_dir=f"./eval_{model_name}",
        per_device_train_batch_size=hyperparams["per_device_train_batch_size"],
        per_device_eval_batch_size=hyperparams["per_device_train_batch_size"],
        learning_rate=hyperparams["learning_rate"],
        num_train_epochs=hyperparams["num_train_epochs"],
        eval_strategy="epoch",
        fp16=torch.cuda.is_available(),
        report_to="none",
        save_strategy="no",  # Skip checkpointing to save time
    )

    eval_trainer = Trainer(
        model=model_init(),
        args=eval_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_test,  # Use TEST set for final evaluation
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    # Train and evaluate
    print(f"Training {model_name} with best parameters...")
    eval_trainer.train()

    print(f"Evaluating {model_name} on test set...")
    results = eval_trainer.evaluate()

    return results

# Evaluate best models from both searches
print("Evaluating best models on test set...")

# Grid Search best model
grid_test_results = evaluate_final_model(grid_result.hyperparameters, "grid_best")

# Random Search best model
random_test_results = evaluate_final_model(random_result.hyperparameters, "random_best")

print("Test set evaluation completed!")

Final Evaluation on Test Set
Evaluating best models on test set...
Evaluating grid_best on test set...
Training grid_best with best parameters...


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,No log,1.20235,0.255037,0.029431,0.252402,0.020945
2,No log,1.004997,0.186832,0.015903,0.187203,0.0
3,No log,0.903427,0.22138,0.025721,0.22092,0.0
4,No log,0.861994,0.223105,0.029094,0.222778,0.028631
5,No log,0.84054,0.223697,0.036546,0.22476,0.039149
6,No log,0.827231,0.228425,0.030637,0.228405,0.028776
7,1.290700,0.816828,0.237193,0.03714,0.236441,0.04054
8,1.290700,0.81033,0.238472,0.03966,0.239283,0.042079
9,1.290700,0.806314,0.242161,0.036935,0.242952,0.032386
10,1.290700,0.80503,0.242161,0.036935,0.242952,0.032386


Evaluating grid_best on test set...


Evaluating random_best on test set...
Training random_best with best parameters...


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,No log,1.20235,0.255037,0.029431,0.252402,0.020945
2,No log,1.004997,0.186832,0.015903,0.187203,0.0
3,No log,0.903427,0.22138,0.025721,0.22092,0.0
4,No log,0.861994,0.223105,0.029094,0.222778,0.028631
5,No log,0.84054,0.223697,0.036546,0.22476,0.039149
6,No log,0.827231,0.228425,0.030637,0.228405,0.028776
7,1.290700,0.816828,0.237193,0.03714,0.236441,0.04054
8,1.290700,0.81033,0.238472,0.03966,0.239283,0.042079
9,1.290700,0.806314,0.242161,0.036935,0.242952,0.032386
10,1.290700,0.80503,0.242161,0.036935,0.242952,0.032386


Evaluating random_best on test set...


Test set evaluation completed!
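
The two training tables above are identical row for row. That is what one would see if both calls received the same hyperparameters and seed, although the output does not show what `grid_result.hyperparameters` contained, so this is only a possibility. A minimal sketch of a check that could be run before the second training pass; `same_config` is a hypothetical helper, and the dictionaries below are illustrative stand-ins rather than values read from `grid_result`:

```python
import math


def same_config(a, b, rel_tol=1e-9):
    """Return True when two hyperparameter dicts have equal keys and values.

    Floats are compared with a tolerance so 5e-05 and 0.00005 count as equal.
    """
    if a.keys() != b.keys():
        return False
    for key in a:
        x, y = a[key], b[key]
        if isinstance(x, float) and isinstance(y, float):
            if not math.isclose(x, y, rel_tol=rel_tol):
                return False
        elif x != y:
            return False
    return True


# Best Random Search parameters as printed in the search output.
random_best = {
    "learning_rate": 5e-05,
    "per_device_train_batch_size": 8,
    "num_train_epochs": 10,
}
# Illustrative stand-ins for a second configuration (assumptions).
identical = dict(random_best)
different = {**random_best, "per_device_train_batch_size": 16}

print(same_config(random_best, identical))  # True
print(same_config(random_best, different))  # False
```

If the check returns `True`, the first test result could simply be reused instead of training the same configuration twice.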
