# Installation

**Use and Importance:** This cell is crucial for setting up the environment by installing all the necessary Python libraries required for the project. Without these libraries, the subsequent code cells that rely on them would fail to execute.

**Observation:** The output shows the progress of the installation and confirms that the required packages are being downloaded and installed. The `[?25l` and `[90m` sequences are ANSI escape codes used for formatting the output in the terminal, indicating progress bars and colors.

In [1]:
# ==================== INSTALLATION ====================
!pip install -q sentence-transformers scikit-learn pandas optuna torch

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/400.9 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m399.4/400.9 kB[0m [31m16.3 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m400.9/400.9 kB[0m [31m9.4 MB/s[0m eta [36m0:00:00[0m
[?25h

# Imports and Configuration

**Use and Importance:** This cell imports all the necessary modules and libraries from the installed packages. It also sets up the `CONFIG` dictionary, which centralizes important parameters like the dataset file path, output directory, and device to be used (CPU or GPU). Mounting Google Drive is essential for accessing the dataset and saving results directly to your Drive. Disabling WANDB prevents logging to that service if it's not desired.

**Observation:** The output confirms that Google Drive was mounted and prints the study banner together with the detected device (`cuda` here, meaning a GPU is available). The random state (42) is stored in `CONFIG` for reproducibility but is not printed.

In [2]:
# ==================== IMPORTS ====================
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                            f1_score, confusion_matrix, classification_report)
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import optuna
import time
import os
import shutil
import re
import torch
from datetime import datetime
from itertools import product
import random
import json

# ==================== DISABLE WANDB ====================
os.environ['WANDB_DISABLED'] = 'true'

# ==================== MOUNT GOOGLE DRIVE ====================
from google.colab import drive
drive.mount('/content/drive')

# ==================== CONFIGURATION ====================
CONFIG = {
    'file_path': '/content/drive/MyDrive/TAGALOG-ESSAYS.csv',
    'output_dir': '/content/drive/MyDrive/tagalog_mismatch_optimization',
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',
    'random_state': 42
}

print("="*80)
print("MiniLM for Title-Content Mismatch Detection in Tagalog Essays")
print("Hyperparameter Optimization Study")
print("="*80)
print(f"Using device: {CONFIG['device']}")
print(f"GPU Available: {torch.cuda.is_available()}")
print("="*80)

Mounted at /content/drive
MiniLM for Title-Content Mismatch Detection in Tagalog Essays
Hyperparameter Optimization Study
Using device: cuda
GPU Available: True
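
`CONFIG['random_state']` is used for the data splits later in the notebook. If the training runs should also be repeatable, the other random number generators can be seeded from the same value. The helper below is a minimal sketch: `set_seed` is a hypothetical name that does not appear in the notebook, and NumPy and PyTorch are seeded only when they can be imported. GPU training may still show small run-to-run differences.

```python
import random


def set_seed(seed: int) -> None:
    """Seed Python's RNG and, when installed, NumPy and PyTorch (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


set_seed(42)
first = random.random()
set_seed(42)
second = random.random()
print(first == second)  # True: the same seed reproduces the same draw
```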


# Load Dataset

**Use and Importance:** This cell is responsible for loading the raw data from the specified CSV file into a pandas DataFrame. This is the first step in any data-driven project, making the data accessible for manipulation and analysis in Python.

**Observation:** The output confirms that the dataset was loaded successfully, shows the total number of samples (886), lists the column names ('TITLE', 'ESSAY', 'LABEL'), provides a sample of the raw data from the first row, and shows the initial distribution of the 'LABEL' column, indicating a relatively balanced dataset.

In [3]:
# ==================== LOAD DATASET ====================
print("\n" + "="*80)
print("LOADING DATASET")
print("="*80)

df = pd.read_csv(CONFIG['file_path'])

print("Dataset loaded successfully.")
print(f"Total samples: {len(df)}")
print(f"Columns: {list(df.columns)}")

# Show sample before preprocessing
print("\n" + "="*80)
print("SAMPLE DATA (Before Preprocessing)")
print("="*80)
print(f"Title sample: {df['TITLE'].iloc[0][:100]}")
print(f"Essay sample: {df['ESSAY'].iloc[0][:100]}...")
print(f"\nLabel distribution:\n{df['LABEL'].value_counts()}")


LOADING DATASET
Dataset loaded successfully.
Total samples: 886
Columns: ['TITLE', 'ESSAY', 'LABEL']

SAMPLE DATA (Before Preprocessing)
Title sample: Edukasyon, Bulok na, Bakit Mahal
Maikling
Essay sample: Hindi ako isang mangmang sa katotohanan na kinahaharap ko bilang isang estudyante, at hindi isang bu...

Label distribution:
LABEL
1    477
0    409
Name: count, dtype: int64
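
To put a number on "relatively balanced", the counts above can be turned into class shares. The share of the larger class is also the accuracy that a classifier would get by always predicting that class, which is a useful floor for later results. This is a small arithmetic sketch using only the two counts printed above.

```python
label_counts = {1: 477, 0: 409}  # counts copied from the output above

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}
majority_label = max(label_counts, key=label_counts.get)
majority_baseline = shares[majority_label]

print(total)                        # 886
print(round(shares[1], 3))          # 0.538
print(round(shares[0], 3))          # 0.462
print(round(majority_baseline, 3))  # 0.538
```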


# Data Preprocessing

**Use and Importance:** The `preprocess_text` function defined and applied in this cell is vital for cleaning and standardizing the text data. Raw text often contains noise like extra spaces, inconsistent punctuation, URLs, and numbers that can negatively impact model performance. This preprocessing step prepares the text for effective tokenization and embedding.

**Observation:** The output confirms that preprocessing was applied to both 'TITLE' and 'ESSAY' columns. It then shows a sample of the data before and after preprocessing for comparison, demonstrating the cleaning effect (e.g., removal of newline characters and normalization of spaces and quotes).

In [4]:
# ==================== DATA PREPROCESSING ====================
print("\n" + "="*80)
print("DATA PREPROCESSING")
print("="*80)

def preprocess_text(text):
    """
    Clean and normalize Tagalog text for better model performance
    """
    if pd.isna(text) or text is None:
        return ""

    text = str(text)

    # 1. Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)

    # 2. Remove leading/trailing whitespace
    text = text.strip()

    # 3. Normalize curly quotes to straight quotes
    text = text.replace('\u201c', '"').replace('\u201d', '"')
    text = text.replace('\u2018', "'").replace('\u2019', "'")

    # 4. Remove URLs (runs before step 6, which strips '/')
    text = re.sub(r'http\S+|www\.\S+', '', text)

    # 5. Remove email addresses (runs before step 6, which strips '@')
    text = re.sub(r'\S+@\S+', '', text)

    # 6. Keep Tagalog letters, digits, and basic punctuation
    text = re.sub(r'[^\w\s.,!?;:\-áéíóúàèìòùäëïöüñÁÉÍÓÚÀÈÌÒÙÄËÏÖÜÑ]', '', text)

    # 7. Collapse repeated sentence punctuation (e.g. "!!!" -> "!")
    text = re.sub(r'([.!?])\1+', r'\1', text)

    # 8. Remove numbers-only sequences
    text = re.sub(r'\b\d+\b', '', text)

    # 9. Final cleanup
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply preprocessing
print("Preprocessing TITLE column...")
df['TITLE_CLEAN'] = df['TITLE'].apply(preprocess_text)

print("Preprocessing ESSAY column...")
df['ESSAY_CLEAN'] = df['ESSAY'].apply(preprocess_text)

# Show sample after preprocessing
print("\n" + "="*80)
print("SAMPLE DATA (After Preprocessing)")
print("="*80)
print(f"Title before: {df['TITLE'].iloc[0][:100]}")
print(f"Title after:  {df['TITLE_CLEAN'].iloc[0][:100]}")
print(f"\nEssay before: {df['ESSAY'].iloc[0][:100]}...")
print(f"Essay after:  {df['ESSAY_CLEAN'].iloc[0][:100]}...")


DATA PREPROCESSING
Preprocessing TITLE column...
Preprocessing ESSAY column...

SAMPLE DATA (After Preprocessing)
Title before: Edukasyon, Bulok na, Bakit Mahal
Maikling
Title after:  Edukasyon, Bulok na, Bakit Mahal Maikling

Essay before: Hindi ako isang mangmang sa katotohanan na kinahaharap ko bilang isang estudyante, at hindi isang bu...
Essay after:  Hindi ako isang mangmang sa katotohanan na kinahaharap ko bilang isang estudyante, at hindi isang bu...
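
The order of the regex steps matters in a pipeline like this one. The character filter removes `@` and `/`, so a pattern that depends on those characters only matches if it runs before the filter. The sketch below isolates that effect with two hypothetical helper functions and a made-up sample sentence; it is not the notebook's `preprocess_text`.

```python
import re

# Simplified character filter: keep word characters, whitespace, basic punctuation.
CHAR_FILTER = re.compile(r'[^\w\s.,!?;:\-]')
EMAIL = re.compile(r'\S+@\S+')


def collapse(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()


def filter_then_email(text):
    # The filter deletes '@' first, so the email pattern no longer matches.
    return collapse(EMAIL.sub('', CHAR_FILTER.sub('', text)))


def email_then_filter(text):
    # The email is removed while '@' is still present.
    return collapse(CHAR_FILTER.sub('', EMAIL.sub('', text)))


sample = "Sumulat kay juan@example.com ngayon"  # made-up sentence
print(filter_then_email(sample))  # Sumulat kay juanexample.com ngayon
print(email_then_filter(sample))  # Sumulat kay ngayon
```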


# Data Quality Checks

**Use and Importance:** This cell performs essential data quality checks after preprocessing. It helps identify potential issues like empty strings resulting from cleaning or unusually short texts that might not be meaningful for the task. Understanding the length statistics provides insights into the nature of the text data.

**Observation:** The output indicates that after preprocessing, there is 1 empty title and 1 empty essay. It also provides length statistics, showing the average, minimum, and maximum character counts for the cleaned titles and essays. The maximum essay length is quite high (18285), which might require consideration for truncation or handling in the model.

In [5]:
# ==================== DATA QUALITY CHECKS ====================
print("\n" + "="*80)
print("DATA QUALITY CHECKS")
print("="*80)

empty_titles = df[df['TITLE_CLEAN'].str.len() == 0]
empty_essays = df[df['ESSAY_CLEAN'].str.len() == 0]

print(f"Empty titles after preprocessing: {len(empty_titles)}")
print(f"Empty essays after preprocessing: {len(empty_essays)}")

print("\nLength Statistics:")
print(f"Title length (avg): {df['TITLE_CLEAN'].str.len().mean():.1f} characters")
print(f"Title length (min): {df['TITLE_CLEAN'].str.len().min()} characters")
print(f"Title length (max): {df['TITLE_CLEAN'].str.len().max()} characters")
print(f"\nEssay length (avg): {df['ESSAY_CLEAN'].str.len().mean():.1f} characters")
print(f"Essay length (min): {df['ESSAY_CLEAN'].str.len().min()} characters")
print(f"Essay length (max): {df['ESSAY_CLEAN'].str.len().max()} characters")


DATA QUALITY CHECKS
Empty titles after preprocessing: 1
Empty essays after preprocessing: 1

Length Statistics:
Title length (avg): 43.6 characters
Title length (min): 0 characters
Title length (max): 123 characters

Essay length (avg): 1441.9 characters
Essay length (min): 0 characters
Essay length (max): 18285 characters
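
Essays of this length are far longer than a sentence-embedding model reads in one pass; text beyond the model's maximum sequence length is truncated during encoding. One possible way to handle this, which the notebook itself does not use, is to split long essays into overlapping word windows and embed each window separately. The sketch below shows only the splitting step. `chunk_words` and its default sizes are hypothetical, and word counts are only a rough proxy for model tokens.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping windows of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


demo = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(demo, max_words=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```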


# Data Cleaning

**Use and Importance:** This cell further cleans the dataset based on the quality checks. It removes rows with missing original data, empty preprocessed data, and duplicates to ensure the model is trained on clean and unique examples. It also filters out very short titles and essays based on predefined minimum lengths, assuming these are not informative for the task.

**Observation:** The output details the number of rows removed at each cleaning step (1 row with missing data, 0 empty preprocessed rows, 11 duplicate rows, 0 very short titles/essays). It shows the final dataset size (874 samples) and the final label distribution, which remains relatively balanced after cleaning.

In [6]:
# ==================== DATA CLEANING ====================
print("\n" + "="*80)
print("DATA CLEANING")
print("="*80)

df_before = len(df)
df = df.dropna(subset=['TITLE', 'ESSAY', 'LABEL'])
print(f"Removed {df_before - len(df)} rows with missing original data")

df_before = len(df)
df = df[df['TITLE_CLEAN'].str.len() > 0]
df = df[df['ESSAY_CLEAN'].str.len() > 0]
print(f"Removed {df_before - len(df)} rows with empty preprocessed data")

df_before = len(df)
df = df.drop_duplicates(subset=['TITLE_CLEAN', 'ESSAY_CLEAN', 'LABEL'])
print(f"Removed {df_before - len(df)} duplicate rows")

df_before = len(df)
df = df[df['TITLE_CLEAN'].str.len() >= 3]
print(f"Removed {df_before - len(df)} rows with very short titles (<3 chars)")

df_before = len(df)
df = df[df['ESSAY_CLEAN'].str.len() >= 50]
print(f"Removed {df_before - len(df)} rows with very short essays (<50 chars)")

print(f"\nFinal dataset size: {len(df)} samples")
print(f"\nFinal label distribution:\n{df['LABEL'].value_counts()}")
print(f"Label balance: {df['LABEL'].value_counts(normalize=True).round(3)}")


DATA CLEANING
Removed 1 rows with missing original data
Removed 0 rows with empty preprocessed data
Removed 11 duplicate rows
Removed 0 rows with very short titles (<3 chars)
Removed 0 rows with very short essays (<50 chars)

Final dataset size: 874 samples

Final label distribution:
LABEL
1    468
0    406
Name: count, dtype: int64
Label balance: LABEL
1    0.535
0    0.465
Name: proportion, dtype: float64
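
Because `LABEL` is part of the `drop_duplicates` subset, a title and essay pair that appears twice with different labels is kept both times. The output does not show whether this dataset contains such pairs. The sketch below shows one way to look for them; `find_label_conflicts` is a hypothetical helper and the rows are toy data, not taken from the dataset.

```python
from collections import defaultdict


def find_label_conflicts(rows):
    """Return (title, essay) pairs that occur with more than one label."""
    labels_by_pair = defaultdict(set)
    for row in rows:
        pair = (row["TITLE_CLEAN"], row["ESSAY_CLEAN"])
        labels_by_pair[pair].add(row["LABEL"])
    return sorted(pair for pair, labels in labels_by_pair.items() if len(labels) > 1)


# Toy rows for illustration only.
rows = [
    {"TITLE_CLEAN": "Pamagat A", "ESSAY_CLEAN": "Teksto A", "LABEL": 1},
    {"TITLE_CLEAN": "Pamagat A", "ESSAY_CLEAN": "Teksto A", "LABEL": 0},
    {"TITLE_CLEAN": "Pamagat B", "ESSAY_CLEAN": "Teksto B", "LABEL": 1},
]
print(find_label_conflicts(rows))  # [('Pamagat A', 'Teksto A')]
```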


# Split Data (3-way: Train/Val/Test)

**Use and Importance:** This cell splits the cleaned dataset into training, validation, and testing sets. This is a standard practice in machine learning to train the model on the training set, tune hyperparameters on the validation set, and evaluate the final performance on a completely unseen test set. Using `stratify=df['LABEL']` is important for maintaining the proportion of each label in all splits, which is crucial for classification tasks, especially with imbalanced datasets (though this dataset is relatively balanced).

**Observation:** The output shows the number of samples allocated to each split (611 for training, 88 for validation, 175 for testing) and their respective percentages of the total dataset size. It also confirms that the label distribution is preserved across all three splits.

In [7]:
# ==================== SPLIT DATA (3-way: Train/Val/Test) ====================
print("\n" + "="*80)
print("SPLITTING DATA")
print("="*80)

# First split: train+val vs test (80/20)
train_val_df, test_df = train_test_split(
    df, test_size=0.2, random_state=CONFIG['random_state'], stratify=df['LABEL']
)

# Second split: train vs val (from train_val: 87.5%/12.5% = 70%/10% of total)
train_df, val_df = train_test_split(
    train_val_df, test_size=0.125, random_state=CONFIG['random_state'], stratify=train_val_df['LABEL']
)

print(f"Training samples: {len(train_df)} ({len(train_df)/len(df)*100:.1f}%)")
print(f"Validation samples: {len(val_df)} ({len(val_df)/len(df)*100:.1f}%)")
print(f"Testing samples: {len(test_df)} ({len(test_df)/len(df)*100:.1f}%)")
print(f"\nTrain label distribution:\n{train_df['LABEL'].value_counts()}")
print(f"Val label distribution:\n{val_df['LABEL'].value_counts()}")
print(f"Test label distribution:\n{test_df['LABEL'].value_counts()}")


SPLITTING DATA
Training samples: 611 (69.9%)
Validation samples: 88 (10.1%)
Testing samples: 175 (20.0%)

Train label distribution:
LABEL
1    327
0    284
Name: count, dtype: int64
Val label distribution:
LABEL
1    47
0    41
Name: count, dtype: int64
Test label distribution:
LABEL
1    94
0    81
Name: count, dtype: int64
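
The split sizes follow from the two `test_size` values: 20% is held out first, then 12.5% of the remaining 80% (10% of the total) becomes the validation set. The arithmetic below reproduces the reported counts under the assumption that the held-out share is rounded up and the remainder goes to the other split. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Assumed rounding: held-out share rounded up, remainder kept."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


n_total = 874  # final dataset size from the cleaning step
n_train_val, n_test = split_sizes(n_total, 0.2)
n_train, n_val = split_sizes(n_train_val, 0.125)

print(n_train, n_val, n_test)  # 611 88 175
```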


# Baseline Evaluation

**Use and Importance:** This cell establishes a baseline performance measure before any fine-tuning is done. It uses a pre-trained SentenceTransformer model to encode the titles and essays in the test set and calculates the cosine similarity between the corresponding title and essay embeddings. A predefined threshold is used to classify pairs as 'match' or 'not match'. This baseline provides a point of comparison to assess the effectiveness of fine-tuning.

**Observation:** The output shows the baseline model being downloaded and loaded, followed by the progress bars for generating embeddings. At a threshold of 0.4 the baseline reaches 50.29% accuracy and a 66.15% F1-score. The confusion matrix shows that the model predicts 'Match' for almost every pair: 85 of the 94 true matches are recovered (recall 90.43%), but 78 of the 81 'Not Match' pairs are also predicted as matches, so precision is only 52.15% and accuracy falls below the 53.7% that always predicting the majority class would give. The `HF_TOKEN` warning concerns Hugging Face authentication, which is optional for public models, so it does not affect this run.

In [8]:
# ==================== BASELINE EVALUATION ====================
print("\n" + "="*80)
print("BASELINE EVALUATION (Pre-trained Model)")
print("="*80)

baseline_model_name = 'all-MiniLM-L6-v2'
baseline_threshold = 0.4

baseline_model = SentenceTransformer(baseline_model_name)
print(f"✓ Loaded baseline model: {baseline_model_name}")

print("\nGenerating baseline embeddings on test set...")
baseline_title_emb = baseline_model.encode(
    test_df['TITLE_CLEAN'].tolist(),
    convert_to_tensor=False,
    show_progress_bar=True
)
baseline_essay_emb = baseline_model.encode(
    test_df['ESSAY_CLEAN'].tolist(),
    convert_to_tensor=False,
    show_progress_bar=True
)

baseline_similarities = np.diag(cosine_similarity(baseline_title_emb, baseline_essay_emb))
baseline_predictions = [1 if sim >= baseline_threshold else 0 for sim in baseline_similarities]

true_labels = test_df['LABEL'].tolist()

baseline_accuracy = accuracy_score(true_labels, baseline_predictions)
baseline_precision = precision_score(true_labels, baseline_predictions, zero_division=0)
baseline_recall = recall_score(true_labels, baseline_predictions, zero_division=0)
baseline_f1 = f1_score(true_labels, baseline_predictions, zero_division=0)

print("\n" + "="*80)
print("BASELINE RESULTS")
print("="*80)
print(f"Accuracy:  {baseline_accuracy:.4f} ({baseline_accuracy*100:.2f}%)")
print(f"Precision: {baseline_precision:.4f} ({baseline_precision*100:.2f}%)")
print(f"Recall:    {baseline_recall:.4f} ({baseline_recall*100:.2f}%)")
print(f"F1-Score:  {baseline_f1:.4f} ({baseline_f1*100:.2f}%)")
print(f"Threshold: {baseline_threshold}")
print("="*80)

print("\nBaseline Confusion Matrix:")
print(confusion_matrix(true_labels, baseline_predictions))

# Clean up baseline model
del baseline_model
if torch.cuda.is_available():
    torch.cuda.empty_cache()


BASELINE EVALUATION (Pre-trained Model)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

✓ Loaded baseline model: all-MiniLM-L6-v2

Generating baseline embeddings on test set...


Batches:   0%|          | 0/6 [00:00<?, ?it/s]

Batches:   0%|          | 0/6 [00:00<?, ?it/s]


BASELINE RESULTS
Accuracy:  0.5029 (50.29%)
Precision: 0.5215 (52.15%)
Recall:    0.9043 (90.43%)
F1-Score:  0.6615 (66.15%)
Threshold: 0.4

Baseline Confusion Matrix:
[[ 3 78]
 [ 9 85]]
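
The four reported metrics can be recomputed by hand from the confusion matrix, which is a quick way to check how they relate. The sketch assumes scikit-learn's layout for binary labels 0 and 1: rows are true labels, columns are predictions, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]]
tn, fp = 3, 78
fn, tp = 9, 85

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f}")   # 0.5029
print(f"{precision:.4f}")  # 0.5215
print(f"{recall:.4f}")     # 0.9043
print(f"{f1:.4f}")         # 0.6615
```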


# Model Training Function

**Use and Importance:** This cell defines a reusable function `train_model` that encapsulates the fine-tuning process for a SentenceTransformer model. It takes hyperparameters as input, prepares the data using `InputExample` and `DataLoader`, sets up the `CosineSimilarityLoss`, and trains the model using the `model.fit` method. This function is crucial for the hyperparameter optimization process, allowing different parameter combinations to be easily tested.

**Observation:** This cell only defines a function, so executing it produces no output. The `try`/`except` block sits inside the function body: "Training error" is printed, and `None` is returned, only if an exception is raised while the function is running. The function's output will be observed when it is called later in the notebook.

In [9]:
# ==================== MODEL TRAINING FUNCTION ====================
def train_model(params, train_df, verbose=False):
    """Train a model with given hyperparameters"""
    try:
        model = SentenceTransformer(params['model_name'])
        model.to(CONFIG['device'])

        train_examples = [
            InputExample(
                texts=[str(row['TITLE_CLEAN']), str(row['ESSAY_CLEAN'])],
                label=float(row['LABEL'])
            )
            for _, row in train_df.iterrows()
        ]

        train_dataloader = DataLoader(
            train_examples,
            shuffle=True,
            batch_size=params['batch_size']
        )

        train_loss = losses.CosineSimilarityLoss(model)

        warmup_steps = params.get('warmup_steps', int(len(train_dataloader) * params['num_epochs'] * 0.1))

        model.fit(
            train_objectives=[(train_dataloader, train_loss)],
            epochs=params['num_epochs'],
            warmup_steps=warmup_steps,
            weight_decay=params.get('weight_decay', 0.01),
            show_progress_bar=verbose,
            use_amp=False
        )

        return model
    except Exception as e:
        print(f"  Training error: {e}")
        return None
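
When `warmup_steps` is not supplied, the function derives it as 10% of the total number of training steps. The arithmetic below mirrors that default. The training-set size of 611 comes from the split above; the batch size and epoch count are hypothetical values chosen for illustration, and the helper name `default_warmup_steps` does not appear in the notebook. A `DataLoader` that keeps the last partial batch yields `ceil(n / batch_size)` batches per epoch.

```python
import math


def default_warmup_steps(n_examples, batch_size, num_epochs, fraction=0.1):
    """10% of total optimizer steps, truncated to an integer."""
    steps_per_epoch = math.ceil(n_examples / batch_size)  # last partial batch is kept
    return int(steps_per_epoch * num_epochs * fraction)


# 611 training samples; batch_size=16 and num_epochs=3 are illustrative only.
print(math.ceil(611 / 16))               # 39 steps per epoch
print(default_warmup_steps(611, 16, 3))  # 11
```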

# Model Evaluation Function

**Use and Importance:** This cell defines the `evaluate_model` function, which is essential for assessing the performance of a trained model. It takes a model and a dataset (typically the validation or test set) as input, generates embeddings, calculates cosine similarity, and computes various evaluation metrics (accuracy, precision, recall, F1-score) based on a given threshold. This function is used repeatedly during hyperparameter tuning and for the final evaluation.

**Observation:** Similar to the training function, this cell defines a function and does not produce direct output upon execution. Its output will be seen when it is called in the hyperparameter search and main execution cells.

In [10]:
# ==================== MODEL EVALUATION FUNCTION ====================
def evaluate_model(model, df, threshold=0.4, return_details=False):
    """Evaluate model on a dataset"""
    title_emb = model.encode(
        df['TITLE_CLEAN'].tolist(),
        convert_to_tensor=False,
        show_progress_bar=False,
        batch_size=32
    )
    essay_emb = model.encode(
        df['ESSAY_CLEAN'].tolist(),
        convert_to_tensor=False,
        show_progress_bar=False,
        batch_size=32
    )

    similarities = np.diag(cosine_similarity(title_emb, essay_emb))
    predictions = (similarities >= threshold).astype(int)
    true_labels = df['LABEL'].tolist()

    metrics = {
        "accuracy": accuracy_score(true_labels, predictions),
        "precision": precision_score(true_labels, predictions, zero_division=0),
        "recall": recall_score(true_labels, predictions, zero_division=0),
        "f1": f1_score(true_labels, predictions, zero_division=0),
        "threshold": threshold
    }

    if return_details:
        metrics['predictions'] = predictions
        metrics['similarities'] = similarities
        metrics['true_labels'] = true_labels
        metrics['confusion_matrix'] = confusion_matrix(true_labels, predictions).tolist()

    return metrics
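
`np.diag(cosine_similarity(title_emb, essay_emb))` builds the full N×N similarity matrix and then keeps only its diagonal. For paired data the same values can be computed row by row, which avoids the N×N intermediate. The function below is a sketch of that alternative, not the notebook's code; `rowwise_cosine` and its `eps` guard against zero-length vectors are assumptions.

```python
import numpy as np


def rowwise_cosine(a, b, eps=1e-12):
    """Cosine similarity between a[i] and b[i] for every row i."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / np.maximum(norms, eps)  # zero vectors give similarity 0


titles = [[1.0, 0.0], [1.0, 1.0]]
essays = [[1.0, 0.0], [1.0, 0.0]]
sims = rowwise_cosine(titles, essays)
print(np.round(sims, 4))
```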

# Grid Search

**Use and Importance:** This cell defines the `grid_search` function, which implements an exhaustive search over a predefined grid of hyperparameters. It systematically tries every combination of parameters, trains a model, and evaluates its performance on the validation set. The goal is to find the combination that yields the best performance according to a chosen metric (F1-score in this case).

**Observation:** This cell only defines the function, so nothing is printed until it is called later in the notebook. When it runs, the output reports the total number of combinations (24) and the parameter space being explored, then prints one entry per trial with the parameters used, the validation metrics (accuracy, precision, recall, F1-score), and the elapsed time, flagging each new best F1-score. The closing summary gives the total time, the average time per trial, the counts of successful and failed trials, the best F1-score with its parameters, and the top 3 configurations.

In [11]:
# ==================== GRID SEARCH ====================
def grid_search(train_df, val_df, param_grid, metric='f1'):
    """Exhaustive Grid Search with improved efficiency"""
    print("\n" + "="*80)
    print("METHOD 1: GRID SEARCH (EXHAUSTIVE)")
    print("="*80)

    keys = list(param_grid.keys())
    values = [param_grid[k] for k in keys]
    combinations = [dict(zip(keys, v)) for v in product(*values)]

    total = len(combinations)
    print(f"Total combinations: {total}")
    print(f"Optimizing for: {metric.upper()}")
    print(f"Parameter space:")
    for k, v in param_grid.items():
        print(f"  * {k}: {v}")

    results = []
    start_time = time.time()
    best_score = -1
    best_params = None
    failed_count = 0

    for i, params in enumerate(combinations, 1):
        try:
            print(f"\n[Trial {i}/{total}] ", end="")
            param_str = ", ".join([f"{k}={v}" for k, v in params.items()])
            print(param_str)

            trial_start = time.time()
            model = train_model(params, train_df, verbose=False)

            if model is None:
                failed_count += 1
                results.append({'params': params.copy(), metric: 0.0, 'error': 'Training failed'})
                print(f"  Training failed")
                continue

            val_metrics = evaluate_model(model, val_df, params['threshold'])
            trial_time = time.time() - trial_start

            result = {
                'params': params.copy(),
                'f1': val_metrics['f1'],
                'accuracy': val_metrics['accuracy'],
                'precision': val_metrics['precision'],
                'recall': val_metrics['recall'],
                'training_time': trial_time
            }
            results.append(result)

            # Display all metrics for better tracking
            print(f"  Acc={val_metrics['accuracy']:.4f} | P={val_metrics['precision']:.4f} | "
                  f"R={val_metrics['recall']:.4f} | F1={val_metrics['f1']:.4f} | Time={trial_time:.1f}s")

            if val_metrics[metric] > best_score:
                best_score = val_metrics[metric]
                best_params = params.copy()
                print(f"  NEW BEST! {metric.upper()}: {val_metrics[metric]:.4f} ")

            del model
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

        except Exception as e:
            failed_count += 1
            print(f"  ERROR: {str(e)[:100]}")
            results.append({'params': params.copy(), metric: 0.0, 'error': str(e)})

    elapsed = time.time() - start_time
    valid_results = [r for r in results if 'error' not in r]

    print(f"\n{'='*80}")
    print(f"GRID SEARCH COMPLETED")
    print(f"{'='*80}")
    print(f"Total time: {elapsed/60:.2f} minutes")
    print(f"Avg time/trial: {elapsed/total:.2f} seconds")
    print(f"Successful trials: {len(valid_results)}/{total}")
    print(f"Failed trials: {failed_count}/{total}")

    if not valid_results:
        print("\nAll trials failed!")
        return None, results

    best_result = max(valid_results, key=lambda x: x[metric])

    print(f"\nBest {metric.upper()}: {best_result[metric]:.4f}")
    print(f"\nBest Parameters:")
    for k, v in best_result['params'].items():
        print(f"  * {k}: {v}")

    # Show top 3 configurations
    print(f"\nTop 3 Configurations:")
    sorted_results = sorted(valid_results, key=lambda x: x[metric], reverse=True)[:3]
    for idx, res in enumerate(sorted_results, 1):
        print(f"  {idx}. {metric.upper()}={res[metric]:.4f}: {res['params']}")

    return best_result, results
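
The list of combinations is built with `itertools.product`, so its length is the product of the number of values per parameter. The sketch below repeats that expansion on a small grid. The parameter names are ones the notebook's functions read, but the values are hypothetical and are not the grid used in the study.

```python
from itertools import product

# Hypothetical values for illustration; not the study's actual grid.
param_grid = {
    "batch_size": [16, 32],
    "num_epochs": [2, 3],
    "threshold": [0.4, 0.5],
}

keys = list(param_grid.keys())
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]

print(len(combinations))  # 8 = 2 * 2 * 2
print(combinations[0])    # {'batch_size': 16, 'num_epochs': 2, 'threshold': 0.4}
```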

# Random Search

**Use and Importance:** This cell defines the `random_search` function, which explores the hyperparameter space by randomly sampling parameter combinations for a fixed number of iterations. Unlike Grid Search, it does not try every combination but aims to find good parameters more efficiently, especially when the search space is large. It trains and evaluates models similarly to Grid Search and tracks the best performing parameters.

**Observation:** This cell only defines the function, so nothing is printed until it is called later in the notebook. When it runs, the output reports the number of iterations (20) and the parameter distributions being sampled, then prints one entry per trial with the sampled parameters, the validation metrics, and the elapsed time, flagging each new best F1-score. The summary gives the total time, the average time per trial, the counts of successful and failed trials, the number of unique configurations explored, the best F1-score with its parameters, and the top 3 configurations. The "Could not find unique configuration" message is printed when all 100 sampling attempts for a trial return configurations that were already tried; the search then proceeds with the last sampled (duplicate) configuration.

In [12]:
# ==================== RANDOM SEARCH ====================
def random_search(train_df, val_df, param_distributions, n_iter=20, metric='f1', seed=None):
    """Random Search with improved sampling and tracking"""
    print("\n" + "="*80)
    print("METHOD 2: RANDOM SEARCH")
    print("="*80)
    print(f"Number of iterations: {n_iter}")
    print(f"Optimizing for: {metric.upper()}")

    if seed is not None:
        random.seed(seed)
        np.random.seed(seed)

    print(f"\nParameter distributions:")
    for key, values in param_distributions.items():
        if isinstance(values, list):
            print(f"  * {key}: {values}")
        elif isinstance(values, tuple):
            print(f"  * {key}: [{values[0]}, {values[1]}]")

    results = []
    start_time = time.time()
    best_score = -1
    best_params = None
    failed_count = 0
    seen_configs = set()

    for i in range(n_iter):
        try:
            # Sample parameters with duplicate checking
            max_attempts = 100
            for attempt in range(max_attempts):
                params = {}
                for key, values in param_distributions.items():
                    if isinstance(values, list):
                        params[key] = random.choice(values)
                    elif isinstance(values, tuple) and len(values) == 2:
                        if isinstance(values[0], float):
                            params[key] = round(random.uniform(values[0], values[1]), 4)
                        else:
                            params[key] = random.randint(values[0], values[1])

                # Check if configuration is unique
                config_key = tuple(sorted(params.items()))
                if config_key not in seen_configs:
                    seen_configs.add(config_key)
                    break
            else:
                print(f"\n[Trial {i+1}/{n_iter}] Could not find unique configuration, using duplicate")

            print(f"\n[Trial {i+1}/{n_iter}] ", end="")
            param_str = ", ".join([f"{k}={v}" for k, v in params.items()])
            print(param_str)

            trial_start = time.time()
            model = train_model(params, train_df, verbose=False)

            if model is None:
                failed_count += 1
                results.append({'params': params.copy(), metric: 0.0, 'error': 'Training failed'})
                print(f"  Training failed")
                continue

            val_metrics = evaluate_model(model, val_df, params['threshold'])
            trial_time = time.time() - trial_start

            result = {
                'params': params.copy(),
                'f1': val_metrics['f1'],
                'accuracy': val_metrics['accuracy'],
                'precision': val_metrics['precision'],
                'recall': val_metrics['recall'],
                'training_time': trial_time
            }
            results.append(result)

            # Display all metrics
            print(f"  Acc={val_metrics['accuracy']:.4f} | P={val_metrics['precision']:.4f} | "
                  f"R={val_metrics['recall']:.4f} | F1={val_metrics['f1']:.4f} | Time={trial_time:.1f}s")

            if val_metrics[metric] > best_score:
                best_score = val_metrics[metric]
                best_params = params.copy()
                print(f"  NEW BEST! {metric.upper()}: {val_metrics[metric]:.4f} ")

            del model
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

        except Exception as e:
            failed_count += 1
            print(f"  ERROR: {str(e)[:100]}")
            results.append({'params': params.copy(), metric: 0.0, 'error': str(e)})

    elapsed = time.time() - start_time
    valid_results = [r for r in results if 'error' not in r]

    print(f"\n{'='*80}")
    print(f"RANDOM SEARCH COMPLETED")
    print(f"{'='*80}")
    print(f"Total time: {elapsed/60:.2f} minutes")
    print(f"Avg time/trial: {elapsed/n_iter:.2f} seconds")
    print(f"Successful trials: {len(valid_results)}/{n_iter}")
    print(f"Failed trials: {failed_count}/{n_iter}")
    print(f"Unique configurations: {len(seen_configs)}/{n_iter}")

    if not valid_results:
        print("\nAll trials failed!")
        return None, results

    best_result = max(valid_results, key=lambda x: x[metric])

    print(f"\nBest {metric.upper()}: {best_result[metric]:.4f}")
    print(f"\nBest Parameters:")
    for k, v in best_result['params'].items():
        print(f"  * {k}: {v}")

    # Show top 3 configurations
    print(f"\nTop 3 Configurations:")
    sorted_results = sorted(valid_results, key=lambda x: x[metric], reverse=True)[:3]
    for idx, res in enumerate(sorted_results, 1):
        print(f"  {idx}. {metric.upper()}={res[metric]:.4f}: {res['params']}")

    return best_result, results
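
The sampling step of Random Search can be illustrated with a minimal sketch. It assumes, as the comments in the Random Search parameter space indicate, that list entries are discrete choices and `(low, high)` tuples are continuous ranges. The helper names `sample_params` and `sample_unique`, the rounding to three decimals, and the `max_attempts` limit are hypothetical and are not taken from the notebook's own `random_search`.

```python
import random

def sample_params(space, rng):
    """Draw one configuration from a mixed discrete/continuous space."""
    params = {}
    for name, spec in space.items():
        if isinstance(spec, tuple):
            # (low, high) tuple: sample uniformly from a continuous range
            low, high = spec
            params[name] = round(rng.uniform(low, high), 3)
        else:
            # list: pick one of the discrete options
            params[name] = rng.choice(spec)
    return params

def sample_unique(space, n_iter, seed, max_attempts=50):
    """Draw n_iter configurations, retrying to avoid duplicates."""
    rng = random.Random(seed)  # a local generator keeps the draws reproducible
    seen, configs = set(), []
    for _ in range(n_iter):
        for _ in range(max_attempts):
            params = sample_params(space, rng)
            key = tuple(sorted(params.items()))
            if key not in seen:
                seen.add(key)
                break
        # If every attempt was a duplicate, the last draw is used anyway
        configs.append(params)
    return configs

random_param_space = {
    'batch_size': [8, 16, 32],
    'num_epochs': [2, 3, 4, 5],
    'threshold': (0.3, 0.6),
    'weight_decay': (0.0, 0.05),
}
for cfg in sample_unique(random_param_space, n_iter=5, seed=42):
    print(cfg)
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions means other code that touches the global random state cannot change which configurations are drawn.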

# Comparison Function

**Use and Importance:** This cell defines `compare_search_methods`, the main orchestrator of the hyperparameter optimization study. It defines separate parameter spaces for Grid Search and Random Search, runs both searches, and prints a summary of each method's best results on the validation set. The method with the higher validation F1-score is declared the winner, and its hyperparameters are used to train a final model on the combined training and validation data. The function then evaluates that model on the unseen test set and compares the resulting metrics with the initial baseline.
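
To make the size of the Grid Search space concrete, the following sketch enumerates the grid with `itertools.product`. The dictionary is copied from `compare_search_methods`; the enumeration itself is an assumption about how an exhaustive search can be built and is not necessarily how `grid_search` is implemented.

```python
from itertools import product

grid_param_space = {
    'batch_size': [16, 32],
    'num_epochs': [3, 4],
    'threshold': [0.35, 0.40, 0.45],
    'model_name': ['all-MiniLM-L6-v2', 'paraphrase-multilingual-MiniLM-L12-v2'],
    'warmup_steps': [100],
    'weight_decay': [0.01],
}

# Cartesian product of all value lists; the last keys vary fastest
keys = list(grid_param_space)
grid = [dict(zip(keys, combo)) for combo in product(*grid_param_space.values())]

print(len(grid))  # 2 * 2 * 3 * 2 * 1 * 1 = 24
print(grid[0])
print(grid[1])
```

The count of 24 matches the "Total combinations: 24" line in the logged output, and the first two configurations differ only in `model_name`, which is the same order the logged trials follow.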

**Observation:** The output begins with the detailed logs of the Grid Search and Random Search runs. Once both searches finish, a performance comparison table lists the best F1-score, Accuracy, Precision, and Recall that each method reached on the validation set. Random Search is declared the winner because of its higher validation F1-score, and its optimal hyperparameters are listed. The final model is then trained on the combined training and validation data and evaluated on the test set (Accuracy: 75.43%, Precision: 71.43%, Recall: 90.43%, F1 Score: 79.81%), followed by the confusion matrix and classification report. A closing table shows the improvement of the fine-tuned model over the baseline across all metrics.
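
As a quick consistency check on the reported test metrics, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall quoted above; `f1_from` is a hypothetical helper, not a function from the notebook.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded test-set values quoted in the observation above
precision, recall = 0.7143, 0.9043
f1 = f1_from(precision, recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.7981
```

The result agrees with the reported F1 Score of 79.81%, so the four headline numbers are internally consistent.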

In [13]:
# ==================== COMPARISON FUNCTION ====================
def compare_search_methods(train_df, val_df, test_df):
    """Compare selected hyperparameter optimization methods"""

    comparison_results = {}

    # Define search spaces
    grid_param_space = {
        'batch_size': [16, 32],
        'num_epochs': [3, 4],
        'threshold': [0.35, 0.40, 0.45],
        'model_name': ['all-MiniLM-L6-v2', 'paraphrase-multilingual-MiniLM-L12-v2'],
        'warmup_steps': [100],
        'weight_decay': [0.01]
    }

    random_param_space = {
        'batch_size': [8, 16, 32],
        'num_epochs': [2, 3, 4, 5],
        'threshold': (0.3, 0.6),  # continuous range
        'model_name': ['all-MiniLM-L6-v2', 'paraphrase-multilingual-MiniLM-L12-v2'],
        'warmup_steps': [50, 100, 150, 200],
        'weight_decay': (0.0, 0.05)  # continuous range
    }

    # Run selected methods
    print("\n" + "="*80)
    print("Running selected hyperparameter optimization methods...")
    print("This may take a while depending on your hardware and dataset size.\n")

    # 1. Grid Search
    grid_best, grid_all = grid_search(train_df, val_df, grid_param_space, metric='f1')
    if grid_best:
        comparison_results['grid_search'] = {
            'best': grid_best,
            'all_results': grid_all
        }

    # 2. Random Search (with seed for reproducibility)
    random_best, random_all = random_search(
        train_df, val_df, random_param_space,
        n_iter=20,
        metric='f1',
        seed=CONFIG['random_state']
    )
    if random_best:
        comparison_results['random_search'] = {
            'best': random_best,
            'all_results': random_all
        }

    # 3. Optuna search is intentionally excluded from this comparison.


    # ==================== COMPARISON SUMMARY ====================
    print("\n" + "="*80)
    print("FINAL COMPARISON SUMMARY")
    print("="*80)

    methods = ['grid_search', 'random_search'] # Removed 'optuna_search'
    method_names = ['Grid Search', 'Random Search'] # Removed 'Optuna (Bayesian)'

    comparison_table = []
    for method, name in zip(methods, method_names):
        if method in comparison_results:
            best = comparison_results[method]['best']
            comparison_table.append({
                'Method': name,
                'F1 Score': f"{best['f1']:.4f}",
                'Accuracy': f"{best.get('accuracy', 0):.4f}",
                'Precision': f"{best.get('precision', 0):.4f}",
                'Recall': f"{best.get('recall', 0):.4f}"
            })

    print("\nPerformance Comparison:")
    df_comparison = pd.DataFrame(comparison_table)
    print(df_comparison.to_string(index=False))

    # Find winner
    if comparison_results:
        best_method = max(
            [m for m in methods if m in comparison_results],
            key=lambda m: comparison_results[m]['best']['f1']
        )
        best_result = comparison_results[best_method]['best']
        winner_name = method_names[methods.index(best_method)]

        print(f"\n{'='*80}")
        print(f"WINNER: {winner_name}")
        print(f"{'='*80}")
        print(f"Best Validation F1: {best_result['f1']:.4f}")
        print(f"\nOptimal Hyperparameters:")
        for k, v in best_result['params'].items():
            print(f"  * {k}: {v}")
    else:
        print("\nNo successful hyperparameter search methods were run.")
        return None, None, None, None


    # ==================== TRAIN FINAL MODEL ====================
    print("\n" + "="*80)
    print("TRAINING FINAL MODEL WITH OPTIMAL PARAMETERS")
    print("="*80)

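    # Reset the index so the combined frame has no duplicate row labels
    full_train_df = pd.concat([train_df, val_df], ignore_index=True)
    print(f"Training on {len(full_train_df)} samples (train + validation)")

    final_training_start = time.time()
    final_model = train_model(best_result['params'], full_train_df, verbose=True)
    final_training_time = time.time() - final_training_start

    # train_model returns None on failure (as handled in the search loops);
    # stop here instead of crashing in evaluate_model
    if final_model is None:
        print("\nFinal model training failed.")
        return None, comparison_results, None, None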

    # ==================== TEST SET EVALUATION ====================
    print("\n" + "="*80)
    print("FINAL TEST SET EVALUATION")
    print("="*80)

    test_metrics = evaluate_model(
        final_model, test_df, best_result['params']['threshold'], return_details=True
    )

    print(f"\nFine-Tuned Model Performance:")
    print(f"  Accuracy:  {test_metrics['accuracy']:.4f} ({test_metrics['accuracy']*100:.2f}%)")
    print(f"  Precision: {test_metrics['precision']:.4f} ({test_metrics['precision']*100:.2f}%)")
    print(f"  Recall:    {test_metrics['recall']:.4f} ({test_metrics['recall']*100:.2f}%)")
    print(f"  F1 Score:  {test_metrics['f1']:.4f} ({test_metrics['f1']*100:.2f}%)")
    print(f"  Training Time: {final_training_time:.2f}s")

    print(f"\nConfusion Matrix:")
    cm = np.array(test_metrics['confusion_matrix'])
    print(f"                    Predicted")
    print(f"               Not Match  Match")
    print(f"  Actual Not Match  {cm[0,0]:4d}    {cm[0,1]:4d}")
    print(f"         Match      {cm[1,0]:4d}    {cm[1,1]:4d}")

    print("\nClassification Report:")
    report = classification_report(
        test_metrics['true_labels'],
        test_metrics['predictions'],
        target_names=['Not Match', 'Match']
    )
    print(report)

    # ==================== BASELINE vs FINE-TUNED COMPARISON ====================
    print("\n" + "="*80)
    print("BASELINE vs FINE-TUNED COMPARISON")
    print("="*80)
    print(f"{'Metric':<12} {'Baseline':<12} {'Fine-tuned':<12} {'Improvement':<15}")
    print("-"*80)

    acc_improvement = test_metrics['accuracy'] - baseline_accuracy
    acc_improvement_pct = (acc_improvement / baseline_accuracy * 100) if baseline_accuracy > 0 else 0
    print(f"{'Accuracy':<12} {baseline_accuracy:.4f}       {test_metrics['accuracy']:.4f}       {acc_improvement:+.4f} ({acc_improvement_pct:+.1f}%)")

    prec_improvement = test_metrics['precision'] - baseline_precision
    prec_improvement_pct = (prec_improvement / baseline_precision * 100) if baseline_precision > 0 else 0
    print(f"{'Precision':<12} {baseline_precision:.4f}       {test_metrics['precision']:.4f}       {prec_improvement:+.4f} ({prec_improvement_pct:+.1f}%)")

    rec_improvement = test_metrics['recall'] - baseline_recall
    rec_improvement_pct = (rec_improvement / baseline_recall * 100) if baseline_recall > 0 else 0
    print(f"{'Recall':<12} {baseline_recall:.4f}       {test_metrics['recall']:.4f}       {rec_improvement:+.4f} ({rec_improvement_pct:+.1f}%)")

    f1_improvement = test_metrics['f1'] - baseline_f1
    f1_improvement_pct = (f1_improvement / baseline_f1 * 100) if baseline_f1 > 0 else 0
    print(f"{'F1-Score':<12} {baseline_f1:.4f}       {test_metrics['f1']:.4f}       {f1_improvement:+.4f} ({f1_improvement_pct:+.1f}%)")

    print("="*80)

    # ==================== SAMPLE PREDICTIONS ====================
    print("\n" + "="*80)
    print("SAMPLE PREDICTIONS: BASELINE vs FINE-TUNED")
    print("="*80)

    # Get baseline predictions for comparison
    baseline_model_temp = SentenceTransformer(baseline_model_name)
    baseline_title_temp = baseline_model_temp.encode(test_df['TITLE_CLEAN'].tolist()[:5], show_progress_bar=False)
    baseline_essay_temp = baseline_model_temp.encode(test_df['ESSAY_CLEAN'].tolist()[:5], show_progress_bar=False)
    baseline_sim_temp = np.diag(cosine_similarity(baseline_title_temp, baseline_essay_temp))
    baseline_pred_temp = (baseline_sim_temp >= baseline_threshold).astype(int)
    del baseline_model_temp

    for i in range(min(5, len(test_df))):
        title = test_df.iloc[i]['TITLE_CLEAN'][:60]
        true_label = test_metrics['true_labels'][i]

        baseline_pred = baseline_pred_temp[i]
        baseline_sim = baseline_sim_temp[i]

        finetuned_pred = test_metrics['predictions'][i]
        finetuned_sim = test_metrics['similarities'][i]

        print(f"\n Sample {i+1}: {title}...")
        print(f"True Label: {true_label}")
        print(f"Baseline   - Pred: {baseline_pred} | Sim: {baseline_sim:.4f} | {'✓' if baseline_pred == true_label else '✗'}")
        print(f"Fine-tuned - Pred: {finetuned_pred} | Sim: {finetuned_sim:.4f} | {'✓' if finetuned_pred == true_label else '✗'}")


    return final_model, comparison_results, test_metrics, final_training_time
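
One possible refinement, sketched under an assumption: the search loops pass `threshold` to `evaluate_model`, which suggests it only affects how similarity scores are turned into predictions. If it is not used during training (`train_model` is not shown in this section), several thresholds could be scored against one trained model's similarities instead of retraining per threshold. In the grid above that would reduce 24 training runs to 8. The function `sweep_thresholds` and the toy scores are hypothetical.

```python
import numpy as np

def sweep_thresholds(similarities, labels, thresholds):
    """Return (threshold, f1) pairs computed from precomputed similarity scores."""
    sims = np.asarray(similarities, dtype=float)
    y = np.asarray(labels, dtype=int)
    scores = []
    for t in thresholds:
        pred = (sims >= t).astype(int)  # same decision rule as similarity >= threshold
        tp = int(((pred == 1) & (y == 1)).sum())
        fp = int(((pred == 1) & (y == 0)).sum())
        fn = int(((pred == 0) & (y == 1)).sum())
        denom = 2 * tp + fp + fn
        scores.append((t, 2 * tp / denom if denom else 0.0))
    return scores

# Toy similarity scores and labels, for illustration only (not notebook data)
sims = [0.20, 0.38, 0.42, 0.50, 0.33, 0.70]
labels = [0, 0, 1, 1, 1, 1]
results = sweep_thresholds(sims, labels, [0.35, 0.40, 0.45])
for t, f1 in results:
    print(f"threshold={t:.2f}  F1={f1:.3f}")
best_threshold, best_f1 = max(results, key=lambda pair: pair[1])
```

The sweep should still be run on the validation split only, so the test set stays untouched until the final evaluation.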

# Save Results

**Use and Importance:** This cell defines the `save_all_results` function, which is responsible for persistently storing the outputs of the experiment. This includes saving the fine-tuned SentenceTransformer model so it can be reused later without retraining, saving a comprehensive JSON file containing all the key results and parameters from the optimization study, and saving the preprocessed data as a CSV file. Saving these artifacts is crucial for reproducibility, sharing results, and deploying the trained model.
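
When results like these are written with `json.dump`, NumPy scalar or array values in the dictionary raise a `TypeError` unless they are converted first. Whether the notebook's parameters actually contain NumPy types cannot be told from this section, so the following is only a defensive sketch; `to_jsonable` and the sample payload values are hypothetical.

```python
import json
import numpy as np

def to_jsonable(value):
    """Fallback for json.dump(s): convert NumPy scalars and arrays to built-in types."""
    if isinstance(value, np.generic):
        return value.item()
    if isinstance(value, np.ndarray):
        return value.tolist()
    raise TypeError(f"Not JSON serializable: {type(value).__name__}")

# Placeholder values for illustration only
payload = {
    'best_params': {'batch_size': np.int64(16), 'threshold': np.float32(0.5)},
    'confusion_matrix': np.array([[30, 11], [4, 43]]),
}
text = json.dumps(payload, indent=2, default=to_jsonable)
print(text)
```

The `default` hook is only called for objects the encoder cannot handle itself, so plain Python values pass through unchanged.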

**Observation:** The output confirms that the fine-tuned model, the optimization results JSON file, and the preprocessed data CSV file were successfully saved to the specified output directory in Google Drive. The model directory and the results JSON include a timestamp in their names to distinguish between experiment runs; the preprocessed CSV uses a fixed filename and is overwritten on each run.

In [14]:
# ==================== SAVE RESULTS ====================
def save_all_results(model, comparison_results, test_metrics, training_time, baseline_metrics):
    """Save model and all results"""

    os.makedirs(CONFIG['output_dir'], exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Save fine-tuned model
    model_path = os.path.join(CONFIG['output_dir'], f'fine_tuned_minilm_{timestamp}')
    model.save(model_path)
    print(f"\nFine-tuned model saved to: {model_path}")

    # Find winner
    methods = ['grid_search', 'random_search'] # Removed 'optuna_search'
    method_names = ['Grid Search', 'Random Search'] # Removed 'Optuna (Bayesian)'
    if comparison_results:
        best_method = max(
            [m for m in methods if m in comparison_results],
            key=lambda m: comparison_results[m]['best']['f1']
        )
        winner_name = method_names[methods.index(best_method)]
        best_result = comparison_results[best_method]['best']
    else:
        winner_name = 'None'
        best_method = 'None'
        best_result = {'params': {}, 'f1': 0.0}


    # Save comprehensive results
    results = {
        'timestamp': timestamp,
        'study_title': 'MiniLM for Title-Content Mismatch Detection in Tagalog Essays',
        'preprocessing': 'enabled (preprocess_text function)',
        'dataset_stats': {
            'total_samples': len(df),
            'train_samples': len(train_df),
            'val_samples': len(val_df),
            'test_samples': len(test_df),
            'avg_title_length': float(df['TITLE_CLEAN'].str.len().mean()),
            'avg_essay_length': float(df['ESSAY_CLEAN'].str.len().mean())
        },
        'baseline': baseline_metrics,
        'hyperparameter_optimization': {
            'winner': winner_name,
            'winning_method': best_method,
            'best_params': best_result['params'],
            'validation_f1': best_result['f1']
        },
        'finetuned': {
            'accuracy': float(test_metrics['accuracy']),
            'precision': float(test_metrics['precision']),
            'recall': float(test_metrics['recall']),
            'f1': float(test_metrics['f1']),
            'threshold': float(test_metrics['threshold']),
            'training_time_seconds': float(training_time)
        },
        'improvements': {
            'accuracy': float(test_metrics['accuracy'] - baseline_metrics['accuracy']),
            'precision': float(test_metrics['precision'] - baseline_metrics['precision']),
            'recall': float(test_metrics['recall'] - baseline_metrics['recall']),
            'f1': float(test_metrics['f1'] - baseline_metrics['f1'])
        },
        'comparison': {
            method: {
                'best_f1': comparison_results[method]['best']['f1'],
                'best_params': comparison_results[method]['best']['params']
            }
            for method in methods if method in comparison_results
        }
    }

    results_path = os.path.join(CONFIG['output_dir'], f'optimization_results_{timestamp}.json')
    with open(results_path, 'w') as f:
        json.dump(results, f, indent=2)
    print(f"Results saved to: {results_path}")

    # Save preprocessed data
    preprocessed_path = os.path.join(CONFIG['output_dir'], 'TAGALOG-ESSAYS-PREPROCESSED.csv')
    df.to_csv(preprocessed_path, index=False)
    print(f"Preprocessed data saved to: {preprocessed_path}")


    return results_path, model_path

# Main Execution

**Use and Importance:** This cell is the entry point for the entire experiment pipeline. It calls `compare_search_methods`, which runs Grid Search and Random Search, trains the final model with the best parameters found, and evaluates it on the test set. The cell then collects the baseline metrics, saves all results with `save_all_results`, and prints a summary of the experiment together with instructions for loading the saved model and using it for future predictions.
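
The prediction steps that the cell prints as usage instructions can be wrapped in one function. This is a minimal sketch: `predict_match` and `toy_encode` are hypothetical names, and `toy_encode` is a letter-count stand-in used only so the example runs without downloading a model. In practice `encode` would be the loaded model's `encode` method and `preprocess` would be the notebook's `preprocess_text`.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D vectors (0.0 if either is all zeros)."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / norm) if norm else 0.0

def predict_match(title, essay, encode, threshold, preprocess=lambda s: s):
    """Return (is_match, similarity) for one title/essay pair."""
    title_emb = np.asarray(encode([preprocess(title)]))[0]
    essay_emb = np.asarray(encode([preprocess(essay)]))[0]
    similarity = cosine(title_emb, essay_emb)
    return bool(similarity >= threshold), similarity

def toy_encode(texts):
    """Stand-in encoder: letter-count vectors, NOT a real sentence embedding."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return np.array([[t.lower().count(c) for c in alphabet] for t in texts], dtype=float)

is_match, similarity = predict_match(
    "Ang aking pamilya", "Ang aking pamilya", toy_encode, threshold=0.4
)
print(is_match, round(similarity, 4))
```

The threshold passed in should be the one stored with the saved results, since the decision rule depends on the value selected during the search.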

**Observation:** The output starts with "Starting Hyperparameter Optimization Study..." and continues with the detailed logs from `compare_search_methods` and the search and evaluation functions it calls. Once the comparison, final training, and evaluation are complete, it prints "EXPERIMENT COMPLETE!" followed by a concise summary of the key steps and results, including the F1 improvement over the baseline. It ends with instructions for loading the saved model and using it for inference.

In [15]:
# ==================== MAIN EXECUTION ====================
if __name__ == "__main__":
    print("\nStarting Hyperparameter Optimization Study...")

    # Run comparison
    final_model, comparison_results, test_metrics, training_time = compare_search_methods(
        train_df, val_df, test_df
    )

    # Prepare baseline metrics for saving
    baseline_metrics = {
        'accuracy': float(baseline_accuracy),
        'precision': float(baseline_precision),
        'recall': float(baseline_recall),
        'f1': float(baseline_f1),
        'threshold': float(baseline_threshold)
    }

    # Save all results
    if final_model is not None:
        results_path, model_path = save_all_results(
            final_model, comparison_results, test_metrics, training_time, baseline_metrics
        )

        # ==================== EXPERIMENT SUMMARY ====================
        print("\n" + "="*80)
        print("EXPERIMENT COMPLETE!")
        print("="*80)
        print(f"Data preprocessed and cleaned ({len(df)} samples)")
        print(f"Baseline evaluated (F1: {baseline_f1:.4f})")
        print(f"Selected optimization methods compared:")
        print(f"    * Grid Search")
        print(f"    * Random Search")
        print(f"Final model fine-tuned (F1: {test_metrics['f1']:.4f})")
        print(f"Performance improvement (F1): {(test_metrics['f1'] - baseline_f1):+.4f}")
        print(f"\nSaved Files:")
        print(f"  * Model: {model_path}")
        print(f"  * Results: {results_path}")
        print(f"  * Preprocessed Data: {CONFIG['output_dir']}/TAGALOG-ESSAYS-PREPROCESSED.csv")
        print("\n" + "="*80)
        print("USAGE INSTRUCTIONS")
        print("="*80)
        print("\n1. To load the fine-tuned model:")
        print(f"   model = SentenceTransformer('{model_path}')")
        print("\n2. To use for prediction (remember to preprocess):")
        print("   title_clean = preprocess_text(title)")
        print("   essay_clean = preprocess_text(essay)")
        print("   title_emb = model.encode([title_clean])")
        print("   essay_emb = model.encode([essay_clean])")
        print("   similarity = cosine_similarity(title_emb, essay_emb)[0][0]")
        print(f"   is_match = similarity >= {test_metrics['threshold']}")
        print("\n3. Winner Method Details:")

        methods = ['grid_search', 'random_search'] # Removed 'optuna_search'
        method_names = ['Grid Search', 'Random Search'] # Removed 'Optuna (Bayesian)'
        best_method = max(
            [m for m in comparison_results],
            key=lambda m: comparison_results[m]['best']['f1']
        )
        winner_name = method_names[methods.index(best_method)]
        best_result = comparison_results[best_method]['best']

        print(f"   Winning Method: {winner_name}")
        print(f"   Best Validation F1: {best_result['f1']:.4f}")
        for k, v in best_result['params'].items():
            print(f"     * {k}: {v}")
        print("\n" + "="*80)
        print("Study Complete: MiniLM Fine-Tuning with Hyperparameter Optimization")
        print("="*80)
    else:
        print("\nHyperparameter optimization failed. No final model trained or saved.")


Starting Hyperparameter Optimization Study...

Running selected hyperparameter optimization methods...
This may take a while depending on your hardware and dataset size.


METHOD 1: GRID SEARCH (EXHAUSTIVE)
Total combinations: 24
Optimizing for: F1
Parameter space:
  * batch_size: [16, 32]
  * num_epochs: [3, 4]
  * threshold: [0.35, 0.4, 0.45]
  * model_name: ['all-MiniLM-L6-v2', 'paraphrase-multilingual-MiniLM-L12-v2']
  * warmup_steps: [100]
  * weight_decay: [0.01]

[Trial 1/24] batch_size=16, num_epochs=3, threshold=0.35, model_name=all-MiniLM-L6-v2, warmup_steps=100, weight_decay=0.01


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
{'train_runtime': 24.8814, 'train_samples_per_second': 73.669, 'train_steps_per_second': 4.702, 'train_loss': 0.2059932610927484, 'epoch': 3.0}
  Acc=0.7273 | P=0.6769 | R=0.9362 | F1=0.7857 | Time=30.2s
  NEW BEST! F1: 0.7857 

[Trial 2/24] batch_size=16, num_epochs=3, threshold=0.35, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 28.4183, 'train_samples_per_second': 64.501, 'train_steps_per_second': 4.117, 'train_loss': 0.19753063234508547, 'epoch': 3.0}
  Acc=0.6818 | P=0.6377 | R=0.9362 | F1=0.7586 | Time=38.1s

[Trial 3/24] batch_size=16, num_epochs=3, threshold=0.4, model_name=all-MiniLM-L6-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 18.4319, 'train_samples_per_second': 99.447, 'train_steps_per_second': 6.348, 'train_loss': 0.2098206577138004, 'epoch': 3.0}
  Acc=0.7159 | P=0.6774 | R=0.8936 | F1=0.7706 | Time=21.6s

[Trial 4/24] batch_size=16, num_epochs=3, threshold=0.4, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 25.0183, 'train_samples_per_second': 73.266, 'train_steps_per_second': 4.677, 'train_loss': 0.19753063234508547, 'epoch': 3.0}
  Acc=0.6818 | P=0.6462 | R=0.8936 | F1=0.7500 | Time=28.6s

[Trial 5/24] batch_size=16, num_epochs=3, threshold=0.45, model_name=all-MiniLM-L6-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 18.3785, 'train_samples_per_second': 99.736, 'train_steps_per_second': 6.366, 'train_loss': 0.2098206577138004, 'epoch': 3.0}
  Acc=0.7045 | P=0.6909 | R=0.8085 | F1=0.7451 | Time=21.5s

[Trial 6/24] batch_size=16, num_epochs=3, threshold=0.45, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 23.843, 'train_samples_per_second': 76.878, 'train_steps_per_second': 4.907, 'train_loss': 0.19753063234508547, 'epoch': 3.0}
  Acc=0.6818 | P=0.6557 | R=0.8511 | F1=0.7407 | Time=27.8s

[Trial 7/24] batch_size=16, num_epochs=4, threshold=0.35, model_name=all-MiniLM-L6-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 24.7315, 'train_samples_per_second': 98.821, 'train_steps_per_second': 6.308, 'train_loss': 0.19428144357143304, 'epoch': 4.0}
  Acc=0.7273 | P=0.6825 | R=0.9149 | F1=0.7818 | Time=27.8s

[Trial 8/24] batch_size=16, num_epochs=4, threshold=0.35, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 32.501, 'train_samples_per_second': 75.198, 'train_steps_per_second': 4.8, 'train_loss': 0.18006324768066406, 'epoch': 4.0}
  Acc=0.7045 | P=0.6615 | R=0.9149 | F1=0.7679 | Time=36.3s

[Trial 9/24] batch_size=16, num_epochs=4, threshold=0.4, model_name=all-MiniLM-L6-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 24.6127, 'train_samples_per_second': 99.298, 'train_steps_per_second': 6.338, 'train_loss': 0.19031132184542143, 'epoch': 4.0}
  Acc=0.7500 | P=0.7193 | R=0.8723 | F1=0.7885 | Time=27.7s
  NEW BEST! F1: 0.7885 

[Trial 10/24] batch_size=16, num_epochs=4, threshold=0.4, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 31.7084, 'train_samples_per_second': 77.077, 'train_steps_per_second': 4.92, 'train_loss': 0.18006324768066406, 'epoch': 4.0}
  Acc=0.7159 | P=0.6774 | R=0.8936 | F1=0.7706 | Time=35.5s

[Trial 11/24] batch_size=16, num_epochs=4, threshold=0.45, model_name=all-MiniLM-L6-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 24.5172, 'train_samples_per_second': 99.685, 'train_steps_per_second': 6.363, 'train_loss': 0.19031132184542143, 'epoch': 4.0}
  Acc=0.7159 | P=0.7115 | R=0.7872 | F1=0.7475 | Time=27.9s

[Trial 12/24] batch_size=16, num_epochs=4, threshold=0.45, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 31.2572, 'train_samples_per_second': 78.19, 'train_steps_per_second': 4.991, 'train_loss': 0.18006324768066406, 'epoch': 4.0}
  Acc=0.7500 | P=0.7193 | R=0.8723 | F1=0.7885 | Time=35.0s

[Trial 13/24] batch_size=32, num_epochs=3, threshold=0.35, model_name=all-MiniLM-L6-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 17.2, 'train_samples_per_second': 106.57, 'train_steps_per_second': 3.488, 'train_loss': 0.22003564834594727, 'epoch': 3.0}
  Acc=0.6591 | P=0.6133 | R=0.9787 | F1=0.7541 | Time=20.9s

[Trial 14/24] batch_size=32, num_epochs=3, threshold=0.35, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 19.5007, 'train_samples_per_second': 93.997, 'train_steps_per_second': 3.077, 'train_loss': 0.21392677625020345, 'epoch': 3.0}
  Acc=0.6364 | P=0.6027 | R=0.9362 | F1=0.7333 | Time=23.0s

[Trial 15/24] batch_size=32, num_epochs=3, threshold=0.4, model_name=all-MiniLM-L6-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 17.0539, 'train_samples_per_second': 107.483, 'train_steps_per_second': 3.518, 'train_loss': 0.22470617294311523, 'epoch': 3.0}
  Acc=0.7159 | P=0.6719 | R=0.9149 | F1=0.7748 | Time=20.0s

[Trial 16/24] batch_size=32, num_epochs=3, threshold=0.4, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 19.3477, 'train_samples_per_second': 94.74, 'train_steps_per_second': 3.101, 'train_loss': 0.21392677625020345, 'epoch': 3.0}
  Acc=0.6477 | P=0.6176 | R=0.8936 | F1=0.7304 | Time=23.5s

[Trial 17/24] batch_size=32, num_epochs=3, threshold=0.45, model_name=all-MiniLM-L6-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 17.3346, 'train_samples_per_second': 105.742, 'train_steps_per_second': 3.461, 'train_loss': 0.22470617294311523, 'epoch': 3.0}
  Acc=0.6932 | P=0.6786 | R=0.8085 | F1=0.7379 | Time=20.4s

[Trial 18/24] batch_size=32, num_epochs=3, threshold=0.45, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 19.1568, 'train_samples_per_second': 95.684, 'train_steps_per_second': 3.132, 'train_loss': 0.21392677625020345, 'epoch': 3.0}
  Acc=0.6705 | P=0.6500 | R=0.8298 | F1=0.7290 | Time=22.7s

[Trial 19/24] batch_size=32, num_epochs=4, threshold=0.35, model_name=all-MiniLM-L6-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 22.5856, 'train_samples_per_second': 108.21, 'train_steps_per_second': 3.542, 'train_loss': 0.21446948051452636, 'epoch': 4.0}
  Acc=0.7159 | P=0.6719 | R=0.9149 | F1=0.7748 | Time=26.4s

[Trial 20/24] batch_size=32, num_epochs=4, threshold=0.35, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 25.6795, 'train_samples_per_second': 95.173, 'train_steps_per_second': 3.115, 'train_loss': 0.2009979248046875, 'epoch': 4.0}
  Acc=0.6591 | P=0.6197 | R=0.9362 | F1=0.7458 | Time=29.5s

[Trial 21/24] batch_size=32, num_epochs=4, threshold=0.4, model_name=all-MiniLM-L6-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 22.9811, 'train_samples_per_second': 106.348, 'train_steps_per_second': 3.481, 'train_loss': 0.20998086929321289, 'epoch': 4.0}
  Acc=0.7386 | P=0.7000 | R=0.8936 | F1=0.7850 | Time=25.9s

[Trial 22/24] batch_size=32, num_epochs=4, threshold=0.4, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 25.7796, 'train_samples_per_second': 94.804, 'train_steps_per_second': 3.103, 'train_loss': 0.2009979248046875, 'epoch': 4.0}
  Acc=0.6477 | P=0.6176 | R=0.8936 | F1=0.7304 | Time=29.3s

[Trial 23/24] batch_size=32, num_epochs=4, threshold=0.45, model_name=all-MiniLM-L6-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 23.0055, 'train_samples_per_second': 106.235, 'train_steps_per_second': 3.477, 'train_loss': 0.20998086929321289, 'epoch': 4.0}
  Acc=0.6932 | P=0.6852 | R=0.7872 | F1=0.7327 | Time=26.0s

[Trial 24/24] batch_size=32, num_epochs=4, threshold=0.45, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=100, weight_decay=0.01


{'train_runtime': 25.7221, 'train_samples_per_second': 95.015, 'train_steps_per_second': 3.11, 'train_loss': 0.2009979248046875, 'epoch': 4.0}
  Acc=0.6818 | P=0.6557 | R=0.8511 | F1=0.7407 | Time=29.2s

GRID SEARCH COMPLETED
Total time: 10.94 minutes
Avg time/trial: 27.36 seconds
Successful trials: 24/24
Failed trials: 0/24

Best F1: 0.7885

Best Parameters:
  * batch_size: 16
  * num_epochs: 4
  * threshold: 0.4
  * model_name: all-MiniLM-L6-v2
  * warmup_steps: 100
  * weight_decay: 0.01

Top 3 Configurations:
  1. F1=0.7885: {'batch_size': 16, 'num_epochs': 4, 'threshold': 0.4, 'model_name': 'all-MiniLM-L6-v2', 'warmup_steps': 100, 'weight_decay': 0.01}
  2. F1=0.7885: {'batch_size': 16, 'num_epochs': 4, 'threshold': 0.45, 'model_name': 'paraphrase-multilingual-MiniLM-L12-v2', 'warmup_steps': 100, 'weight_decay': 0.01}
  3. F1=0.7857: {'batch_size': 16, 'num_epochs': 3, 'threshold': 0.35, 'model_name': 'all-MiniLM-L6-v2', 'warmup_steps': 100, 'weight_decay': 0.01}
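
The 24 grid-search trials are consistent with a full Cartesian product over four varied hyperparameters (2 batch sizes × 2 epoch counts × 3 thresholds × 2 models), with `warmup_steps` and `weight_decay` held fixed. The sketch below shows one way such a grid could be enumerated. The value lists are inferred from the trial log, not copied from the notebook's code, and `SEARCH_SPACE` and `build_grid` are hypothetical names used only for illustration.

```python
from itertools import product

# Assumed search space, inferred from the trial log above
# (not taken from the notebook's actual configuration).
SEARCH_SPACE = {
    "batch_size": [16, 32],
    "num_epochs": [3, 4],
    "threshold": [0.35, 0.4, 0.45],
    "model_name": [
        "all-MiniLM-L6-v2",
        "paraphrase-multilingual-MiniLM-L12-v2",
    ],
    "warmup_steps": [100],
    "weight_decay": [0.01],
}


def build_grid(space):
    """Return every combination of values as a list of config dicts.

    itertools.product varies the last key fastest, so with this key order
    model_name changes on every trial and batch_size changes last.
    """
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*space.values())]


grid = build_grid(SEARCH_SPACE)
print(len(grid))   # number of trials in the grid
print(grid[12])    # 13th configuration (zero-based index 12)
```

Under these assumptions the 13th configuration is `batch_size=32, num_epochs=3, threshold=0.35, model_name=all-MiniLM-L6-v2`, which matches the `[Trial 13/24]` line in the log.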

METHOD 2: RANDOM

{'train_runtime': 13.0345, 'train_samples_per_second': 93.751, 'train_steps_per_second': 3.069, 'train_loss': 0.22485849857330323, 'epoch': 2.0}
  Acc=0.5909 | P=0.5679 | R=0.9787 | F1=0.7188 | Time=17.8s
  NEW BEST! F1: 0.7188 

[Trial 2/20] batch_size=8, num_epochs=3, threshold=0.5499, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=50, weight_decay=0.0428


{'train_runtime': 30.699, 'train_samples_per_second': 59.709, 'train_steps_per_second': 7.525, 'train_loss': 0.1688884718593581, 'epoch': 3.0}
  Acc=0.7045 | P=0.7234 | R=0.7234 | F1=0.7234 | Time=34.5s
  NEW BEST! F1: 0.7234 

[Trial 3/20] batch_size=32, num_epochs=5, threshold=0.4431, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=150, weight_decay=0.0377


{'train_runtime': 32.0332, 'train_samples_per_second': 95.37, 'train_steps_per_second': 3.122, 'train_loss': 0.19892139434814454, 'epoch': 5.0}
  Acc=0.7045 | P=0.6909 | R=0.8085 | F1=0.7451 | Time=36.8s
  NEW BEST! F1: 0.7451 

[Trial 4/20] batch_size=16, num_epochs=3, threshold=0.4465, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=200, weight_decay=0.0243


{'train_runtime': 23.5152, 'train_samples_per_second': 77.95, 'train_steps_per_second': 4.976, 'train_loss': 0.20741725171733105, 'epoch': 3.0}
  Acc=0.6591 | P=0.6308 | R=0.8723 | F1=0.7321 | Time=27.5s

[Trial 5/20] batch_size=16, num_epochs=4, threshold=0.5007, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=100, weight_decay=0.042


{'train_runtime': 31.3842, 'train_samples_per_second': 77.874, 'train_steps_per_second': 4.971, 'train_loss': 0.179810670705942, 'epoch': 4.0}
  Acc=0.6818 | P=0.6863 | R=0.7447 | F1=0.7143 | Time=35.9s

[Trial 6/20] batch_size=8, num_epochs=2, threshold=0.5721, model_name=all-MiniLM-L6-v2, warmup_steps=50, weight_decay=0.0321


{'train_runtime': 13.3985, 'train_samples_per_second': 91.205, 'train_steps_per_second': 11.494, 'train_loss': 0.1976575108317586, 'epoch': 2.0}
  Acc=0.6705 | P=0.7368 | R=0.5957 | F1=0.6588 | Time=16.4s

[Trial 7/20] batch_size=8, num_epochs=5, threshold=0.3545, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=200, weight_decay=0.0164


{'train_runtime': 51.1985, 'train_samples_per_second': 59.67, 'train_steps_per_second': 7.52, 'train_loss': 0.1553700335614093, 'epoch': 5.0}
  Acc=0.7045 | P=0.6667 | R=0.8936 | F1=0.7636 | Time=55.7s
  NEW BEST! F1: 0.7636 

[Trial 8/20] batch_size=16, num_epochs=4, threshold=0.4013, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=150, weight_decay=0.0376


{'train_runtime': 31.1872, 'train_samples_per_second': 78.366, 'train_steps_per_second': 5.002, 'train_loss': 0.18801326018113357, 'epoch': 4.0}
  Acc=0.7386 | P=0.7069 | R=0.8723 | F1=0.7810 | Time=35.1s
  NEW BEST! F1: 0.7810 

[Trial 9/20] batch_size=16, num_epochs=4, threshold=0.5505, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=50, weight_decay=0.0375


{'train_runtime': 31.293, 'train_samples_per_second': 78.101, 'train_steps_per_second': 4.985, 'train_loss': 0.1695286432902018, 'epoch': 4.0}
  Acc=0.7500 | P=0.7451 | R=0.8085 | F1=0.7755 | Time=36.2s

[Trial 10/20] batch_size=8, num_epochs=4, threshold=0.3673, model_name=all-MiniLM-L6-v2, warmup_steps=50, weight_decay=0.0377


{'train_runtime': 26.8518, 'train_samples_per_second': 91.018, 'train_steps_per_second': 11.47, 'train_loss': 0.16752228179535308, 'epoch': 4.0}
  Acc=0.7614 | P=0.7241 | R=0.8936 | F1=0.8000 | Time=30.3s
  NEW BEST! F1: 0.8000 

[Trial 11/20] batch_size=8, num_epochs=3, threshold=0.5518, model_name=all-MiniLM-L6-v2, warmup_steps=100, weight_decay=0.0063


{'train_runtime': 19.9308, 'train_samples_per_second': 91.968, 'train_steps_per_second': 11.59, 'train_loss': 0.18756057276870264, 'epoch': 3.0}
  Acc=0.7386 | P=0.7609 | R=0.7447 | F1=0.7527 | Time=23.1s

[Trial 12/20] batch_size=32, num_epochs=2, threshold=0.4692, model_name=all-MiniLM-L6-v2, warmup_steps=200, weight_decay=0.035


{'train_runtime': 11.54, 'train_samples_per_second': 105.892, 'train_steps_per_second': 3.466, 'train_loss': 0.23953239917755126, 'epoch': 2.0}
  Acc=0.6023 | P=0.6111 | R=0.7021 | F1=0.6535 | Time=14.4s

[Trial 13/20] batch_size=16, num_epochs=3, threshold=0.4818, model_name=all-MiniLM-L6-v2, warmup_steps=100, weight_decay=0.0482


{'train_runtime': 18.0938, 'train_samples_per_second': 101.305, 'train_steps_per_second': 6.466, 'train_loss': 0.20745414342635718, 'epoch': 3.0}
  Acc=0.6932 | P=0.7000 | R=0.7447 | F1=0.7216 | Time=21.8s

[Trial 14/20] batch_size=8, num_epochs=2, threshold=0.5787, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=200, weight_decay=0.0471


{'train_runtime': 20.912, 'train_samples_per_second': 58.435, 'train_steps_per_second': 7.364, 'train_loss': 0.21351593810242492, 'epoch': 2.0}
  Acc=0.6818 | P=0.6727 | R=0.7872 | F1=0.7255 | Time=24.5s

[Trial 15/20] batch_size=8, num_epochs=2, threshold=0.4776, model_name=all-MiniLM-L6-v2, warmup_steps=50, weight_decay=0.0349


{'train_runtime': 13.5954, 'train_samples_per_second': 89.883, 'train_steps_per_second': 11.327, 'train_loss': 0.20097228458949498, 'epoch': 2.0}
  Acc=0.7159 | P=0.7200 | R=0.7660 | F1=0.7423 | Time=16.6s

[Trial 16/20] batch_size=16, num_epochs=2, threshold=0.5389, model_name=all-MiniLM-L6-v2, warmup_steps=150, weight_decay=0.0266


{'train_runtime': 12.2615, 'train_samples_per_second': 99.662, 'train_steps_per_second': 6.361, 'train_loss': 0.22670983045529097, 'epoch': 2.0}
  Acc=0.6250 | P=0.6667 | R=0.5957 | F1=0.6292 | Time=15.3s

[Trial 17/20] batch_size=32, num_epochs=4, threshold=0.3207, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=50, weight_decay=0.0425


{'train_runtime': 25.9559, 'train_samples_per_second': 94.16, 'train_steps_per_second': 3.082, 'train_loss': 0.18913331031799316, 'epoch': 4.0}
  Acc=0.6477 | P=0.6081 | R=0.9574 | F1=0.7438 | Time=29.9s

[Trial 18/20] batch_size=16, num_epochs=2, threshold=0.4301, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=200, weight_decay=0.0354


{'train_runtime': 15.8513, 'train_samples_per_second': 77.092, 'train_steps_per_second': 4.921, 'train_loss': 0.2191323989476913, 'epoch': 2.0}
  Acc=0.6250 | P=0.6000 | R=0.8936 | F1=0.7179 | Time=19.7s

[Trial 19/20] batch_size=16, num_epochs=3, threshold=0.5201, model_name=paraphrase-multilingual-MiniLM-L12-v2, warmup_steps=200, weight_decay=0.0232


{'train_runtime': 23.4504, 'train_samples_per_second': 78.165, 'train_steps_per_second': 4.989, 'train_loss': 0.20713589334080362, 'epoch': 3.0}
  Acc=0.7500 | P=0.7451 | R=0.8085 | F1=0.7755 | Time=28.4s

[Trial 20/20] batch_size=32, num_epochs=4, threshold=0.3967, model_name=all-MiniLM-L6-v2, warmup_steps=50, weight_decay=0.0139


{'train_runtime': 22.7236, 'train_samples_per_second': 107.553, 'train_steps_per_second': 3.521, 'train_loss': 0.2031114101409912, 'epoch': 4.0}
  Acc=0.7273 | P=0.6949 | R=0.8723 | F1=0.7736 | Time=26.0s

RANDOM SEARCH COMPLETED
Total time: 9.12 minutes
Avg time/trial: 27.36 seconds
Successful trials: 20/20
Failed trials: 0/20
Unique configurations: 20/20

Best F1: 0.8000

Best Parameters:
  * batch_size: 8
  * num_epochs: 4
  * threshold: 0.3673
  * model_name: all-MiniLM-L6-v2
  * warmup_steps: 50
  * weight_decay: 0.0377

Top 3 Configurations:
  1. F1=0.8000: {'batch_size': 8, 'num_epochs': 4, 'threshold': 0.3673, 'model_name': 'all-MiniLM-L6-v2', 'warmup_steps': 50, 'weight_decay': 0.0377}
  2. F1=0.7810: {'batch_size': 16, 'num_epochs': 4, 'threshold': 0.4013, 'model_name': 'paraphrase-multilingual-MiniLM-L12-v2', 'warmup_steps': 150, 'weight_decay': 0.0376}
  3. F1=0.7755: {'batch_size': 16, 'num_epochs': 4, 'threshold': 0.5505, 'model_name': 'paraphrase-multilingual-MiniLM-L12-
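
Unlike the grid, the random search draws each trial independently, with `threshold` and `weight_decay` taking continuous values, and the summary reports 20 unique configurations out of 20. The following is a minimal sketch of how such sampling with de-duplication might look. The ranges, the seed, and the names `sample_config` and `sample_unique_configs` are all assumptions chosen to cover the values seen in the log; they are not the notebook's actual settings.

```python
import random

# Assumed candidate values and ranges, chosen to cover what the log shows.
BATCH_SIZES = [8, 16, 32]
EPOCHS = [2, 3, 4, 5]
WARMUP_STEPS = [50, 100, 150, 200]
MODELS = ["all-MiniLM-L6-v2", "paraphrase-multilingual-MiniLM-L12-v2"]
THRESHOLD_RANGE = (0.3, 0.6)      # assumption
WEIGHT_DECAY_RANGE = (0.0, 0.05)  # assumption


def sample_config(rng):
    """Draw one configuration: discrete choices plus two uniform floats."""
    return {
        "batch_size": rng.choice(BATCH_SIZES),
        "num_epochs": rng.choice(EPOCHS),
        "threshold": round(rng.uniform(*THRESHOLD_RANGE), 4),
        "model_name": rng.choice(MODELS),
        "warmup_steps": rng.choice(WARMUP_STEPS),
        "weight_decay": round(rng.uniform(*WEIGHT_DECAY_RANGE), 4),
    }


def sample_unique_configs(n_trials, seed=42):
    """Keep sampling until n_trials distinct configurations are collected."""
    rng = random.Random(seed)  # the seed value is an assumption
    seen, configs = set(), []
    while len(configs) < n_trials:
        cfg = sample_config(rng)
        key = tuple(sorted(cfg.items()))  # hashable fingerprint of the config
        if key not in seen:
            seen.add(key)
            configs.append(cfg)
    return configs


configs = sample_unique_configs(20)
print(len(configs))
```

Because two parameters are continuous, duplicate draws are very unlikely, which is consistent with all 20 configurations in the log being unique.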



FINAL TEST SET EVALUATION

Fine-Tuned Model Performance:
  Accuracy:  0.7543 (75.43%)
  Precision: 0.7143 (71.43%)
  Recall:    0.9043 (90.43%)
  F1 Score:  0.7981 (79.81%)
  Training Time: 33.39s

Confusion Matrix:
                    Predicted
               Not Match  Match
  Actual Not Match    47      34
         Match         9      85

Classification Report:
              precision    recall  f1-score   support

   Not Match       0.84      0.58      0.69        81
       Match       0.71      0.90      0.80        94

    accuracy                           0.75       175
   macro avg       0.78      0.74      0.74       175
weighted avg       0.77      0.75      0.75       175
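
As a consistency check, the headline metrics can be recomputed from the four confusion-matrix counts. The sketch below uses only the numbers printed above and treats "Match" as the positive class; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above ("Match" is the positive class).
tn, fp = 47, 34  # actual Not Match: predicted Not Match, predicted Match
fn, tp = 9, 85   # actual Match:     predicted Not Match, predicted Match

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

The recomputed values agree with the reported 0.7543, 0.7143, 0.9043, and 0.7981. The counts also show where the errors sit: 34 false positives against 9 false negatives, which is why recall for "Not Match" is only 0.58.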


BASELINE vs FINE-TUNED COMPARISON
Metric       Baseline     Fine-tuned   Improvement    
--------------------------------------------------------------------------------
Accuracy     0.5029       0.7543       +0.2514 (+50.0%)
Precision    0.5215       0.7143       +0.1928 (+37.0%)
Recall       0.9043  