This command installs or upgrades the Python libraries used here for natural language processing and hyperparameter optimization. The leading exclamation mark ! runs a shell command from a notebook environment such as Jupyter or Colab. The pip install command fetches and installs the packages: transformers provides pre-trained models and training utilities from Hugging Face, datasets handles efficient dataset loading and preprocessing, accelerate handles device placement, mixed precision, and multi-GPU or distributed training, ray[tune] enables scalable hyperparameter tuning, and optuna offers an alternative framework for automated hyperparameter optimization (it is the backend used by the script below). The -U flag upgrades the listed packages to the newest versions pip can resolve against their dependency constraints.

In [1]:
!pip install transformers datasets accelerate ray[tune] optuna -U

Collecting datasets
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting optuna
  Downloading optuna-4.5.0-py3-none-any.whl.metadata (17 kB)
Collecting ray[tune]
  Downloading ray-2.51.1-cp312-cp312-manylinux2014_x86_64.whl.metadata (21 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Collecting click!=8.3.0,>=7.0 (from ray[tune])
  Downloading click-8.2.1-py3-none-any.whl.metadata (2.5 kB)
Collecting tensorboardX>=1.9 (from ray[tune])
  Downloading tensorboardx-2.6.4-py3-none-any.whl.metadata (6.2 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.10.1-py3-none-any.whl.metadata (11 kB)
Downloading datasets-4.4.1-py3-none-any.whl (511 kB)
Downloading optuna-4.5.0-py3-none-any.whl (400 kB)

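Once the install cell has finished, it can help to confirm which versions were actually resolved, since -U may pull releases newer than the ones this notebook was written against. The sketch below uses only the standard library; the helper name installed_versions is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as passed to pip (the ray[tune] extra installs the "ray" distribution).
PACKAGES = ["transformers", "datasets", "accelerate", "ray", "optuna"]

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Recording these versions alongside the results makes it easier to explain behavior that changes between library releases.
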
**Function Description**

This script performs end-to-end hyperparameter optimization and fine-tuning for a binary text classification model on a mental-health Twitter dataset, starting from a RoBERTa-based pretrained model. It prepares and tokenizes the dataset, defines a compact hyperparameter search space sized for a 5 GB GPU, runs a 16-trial Optuna-backed search through the Hugging Face Trainer interface to maximize F1 score, and then retrains a final model with the best hyperparameters before evaluating and saving it. The trial budget equals the number of combinations in the search space, but Optuna's default sampler does not enumerate a grid, so the search is not guaranteed to visit every combination.

**Syntax Explanation**

The code uses Hugging Face transformers and datasets together with scikit-learn for metrics, and Optuna through the Trainer hyperparameter_search backend. AutoTokenizer and AutoModelForSequenceClassification load the tokenizer and model weights in a model-agnostic way. The dataset is converted from pandas to a Hugging Face Dataset and tokenized with truncation=True, padding=True, and max_length=128. The Trainer is instantiated with model_init so that each trial starts from a fresh model, compute_metrics to return accuracy, F1, precision, and recall, and TrainingArguments to control training and evaluation settings. The hyperparameter search space is defined inside tune_hp using Optuna trial suggestions (suggest_categorical with two options per parameter), and n_trials is set to the total number of combinations. That budget alone does not make the search exhaustive: Optuna's default sampler chooses each trial's values on its own, so some combinations can repeat while others are never tried, and the recorded run shows both repeated configurations and pruned trials.

**Inputs**

The primary input is the CSV file Mental-Health-Twitter.csv, which must contain a post_text column for tweet content and a label column that is renamed to labels for Trainer compatibility. The script drops missing or empty post_text entries and performs a stratified 90/10 split to produce training and evaluation subsets, then randomly down-samples them to 10,000 training rows and 2,000 evaluation rows for faster experimentation. The pretrained model identifier margotwagner/roberta-psychotherapy-eval is supplied to initialize both tokenizer and model.
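One detail of the split is worth a sketch: the stratified train_test_split preserves label proportions, but the random down-sampling that follows does not enforce them (on a large, roughly balanced set the result is usually close anyway). A minimal pure-Python illustration of per-class down-sampling follows; stratified_subsample and the toy rows are hypothetical and stand in for the pandas DataFrame used in the script.

```python
import random
from collections import defaultdict


def stratified_subsample(rows, label_key, n, seed=42):
    """Pick about n rows while keeping each label's share of the data."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    total = len(rows)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        # Allocate slots in proportion to this label's share of the data.
        k = min(len(group), round(n * len(group) / total))
        sample.extend(rng.sample(group, k))
    rng.shuffle(sample)
    return sample


# Toy data: 70% label 0, 30% label 1.
rows = [{"post_text": f"tweet {i}", "labels": 0 if i % 10 < 7 else 1} for i in range(1000)]
subset = stratified_subsample(rows, "labels", n=100)
share_of_ones = sum(r["labels"] for r in subset) / len(subset)
print(len(subset), share_of_ones)
```

With pandas, a comparable effect can be obtained by sampling within each label group (for example with groupby on the labels column followed by sample), rather than sampling the whole frame at once.
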

**Outputs**

The script prints the chosen device, sample sizes, label distributions, and the number of hyperparameter combinations. During hyperparameter tuning, per-epoch evaluation results are reported by the Trainer and per-trial results by Optuna. After the search finishes, the script prints the best trial and its hyperparameters; it is then meant to retrain a final model with those hyperparameters, evaluate it, and save checkpoints under ./final_model_mental_health, printing loss, accuracy, F1-score, precision, and recall. In the recorded run, execution stopped with AttributeError: 'BestRun' object has no attribute 'metrics' right after the best hyperparameters were printed, so no final-model metrics appear in the output below. The best objective value is available as best_trial.objective.
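To make the reported metrics concrete, the sketch below recomputes them from a confusion matrix. The four counts are not printed by the run; they are inferred here from the values logged for trial 0, epoch 3 (accuracy 0.8965, precision 0.903357, recall 0.888) together with the 2,000-sample, evenly split evaluation set, so treat them as an assumption. The helper name binary_metrics is likewise hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# Inferred counts: 1,000 positives (tp + fn) and 1,000 negatives (fp + tn).
metrics = binary_metrics(tp=888, fp=95, fn=112, tn=905)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Because F1 is the harmonic mean of precision and recall, it sits between the two and is pulled toward the lower one; on an evenly split evaluation set like this one it stays close to accuracy.
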

**Code Flow**

The script first imports required libraries, sets a reproducible seed, and configures the compute device. It then loads the CSV file, removes missing or empty texts, renames the label column to labels, and performs a stratified split followed by sampling to create training and evaluation DataFrames. These DataFrames are converted to Hugging Face Dataset objects and tokenized with truncation, padding, and a maximum sequence length of 128. The model_init function is defined to instantiate a fresh model for each trial and compute_metrics is defined to compute accuracy, F1, precision, and recall from predictions. The tune_hp function describes a constrained hyperparameter space tailored for a 5 GB GPU, limiting learning rates to two options, batch sizes to two options, weight decay to two options, and epochs to two options, resulting in 16 total combinations. A Trainer is created with fixed TrainingArguments and the tokenized datasets, and trainer.hyperparameter_search is invoked with the Optuna backend to maximize eval_f1. When the best trial is found, the script reconstructs TrainingArguments using the best hyperparameters, reinitializes a new Trainer, trains the final model, evaluates it, and saves the checkpoint.
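The 16 combinations can be listed explicitly, which also shows why running 16 trials is not the same as visiting each combination once: a sampler that draws every trial independently may repeat some configurations and miss others (in the log further down, for instance, trials 3 and 8 have identical parameters). The sketch below uses only the standard library; uniform random draws are a stand-in for illustration and are not how Optuna's default sampler actually chooses values.

```python
import itertools
import random

# Candidate values copied from tune_hp.
SEARCH_SPACE = {
    "learning_rate": [2e-5, 3e-5],
    "per_device_train_batch_size": [8, 16],
    "weight_decay": [0.01, 0.03],
    "num_train_epochs": [2, 3],
}

# Full grid: the Cartesian product of all candidate lists.
names = list(SEARCH_SPACE)
grid = [dict(zip(names, values)) for values in itertools.product(*SEARCH_SPACE.values())]
print(f"Grid size: {len(grid)}")

# Draw as many independent random configurations as the grid has cells
# and count how many distinct ones were actually visited.
rng = random.Random(0)
draws = [tuple(rng.choice(options) for options in SEARCH_SPACE.values()) for _ in range(len(grid))]
print(f"Distinct configurations in {len(draws)} random draws: {len(set(draws))}")
```

If every combination must run exactly once, Optuna provides optuna.samplers.GridSampler, which takes a search-space dictionary like the one above. Whether and how a sampler can be passed through trainer.hyperparameter_search depends on the installed transformers version, so check its documentation before relying on it.
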

**Comments and Observations**

The hyperparameter space is deliberately small to suit a 5 GB GPU. Only the batch size affects memory use; limiting learning rates, weight decay values, and epoch counts to two options each mainly shortens the search while still exploring meaningful combinations. The use of model_init ensures that each trial starts from the same pretrained weights and is not biased by previous weight updates. F1 is a reasonable objective for binary labels and is most informative when classes are imbalanced; the sampled splits in the recorded run are close to 50/50, so accuracy and F1 track each other closely there. The max_length=128 choice is generally appropriate for tweets but should be validated against the actual token length distribution in the dataset. The removal of __index_level_0__ assumes that Dataset.from_pandas produced that column, which happens when the DataFrame index is not a default range index (as after sample()); confirm its presence to avoid errors. The recorded log shows several trials being pruned, so some pruning is already active through the Optuna backend; consider configuring the pruner explicitly or adding early stopping to control how aggressively low-performing trials are cut, and consider gradient accumulation if a larger effective batch size is desired but GPU memory is limited. Finally, confirm that the labels column contains exactly 0 and 1 and review class balance closely so that reported metrics are interpretable.
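The gradient-accumulation suggestion comes down to simple arithmetic: the effective batch size is the per-device batch size times the number of accumulation steps (times the number of devices), while memory use stays tied to the per-device batch. A small sketch follows, assuming a single GPU and the 10,000-sample training set; training_schedule is a hypothetical helper, and the accumulation value of 4 is only an example.

```python
import math


def training_schedule(num_samples, per_device_batch, grad_accum_steps, epochs, num_devices=1):
    """Return the effective batch size and approximate optimizer-update counts."""
    effective_batch = per_device_batch * grad_accum_steps * num_devices
    # Batches each device processes per epoch; the last batch may be partial.
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
    # One optimizer update per grad_accum_steps batches.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return {
        "effective_batch_size": effective_batch,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * epochs,
    }


# Batch size 8 for 3 epochs, without accumulation.
print(training_schedule(10_000, per_device_batch=8, grad_accum_steps=1, epochs=3))
# Same per-step memory footprint, but an effective batch of 32.
print(training_schedule(10_000, per_device_batch=8, grad_accum_steps=4, epochs=3))
```

The Trainer's own step count may differ slightly in how it treats the final partial accumulation window. Note also that with warmup_steps=100, warm-up covers a much larger share of the schedule once accumulation reduces the number of updates, so the warm-up length may need revisiting.
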

In [2]:
# 1. SETUP AND INSTALLATION
# Run this command first in your Colab notebook:
# !pip install transformers datasets accelerate ray[tune] optuna pandas -U

import torch
import os
import numpy as np
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification, # Use AutoModel for RoBERTa
    AutoTokenizer,                     # Use AutoTokenizer for RoBERTa
    TrainingArguments,
    Trainer,
    set_seed
)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Set a consistent seed for reproducibility across runs
set_seed(42)

# Ensure GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("Using CPU. For faster training, consider enabling a GPU runtime.")

# --- 2. DATA PREPARATION (Using your Mental-Health-Twitter.csv) ---

# Load your dataset
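try:
    df = pd.read_csv("Mental-Health-Twitter.csv")
    print("Dataset loaded successfully.")
except FileNotFoundError as err:
    # Stop this cell with a clear message; calling exit() in a notebook can shut down the kernel
    raise SystemExit("Error: 'Mental-Health-Twitter.csv' not found. Please upload it to your Colab environment.") from err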

# Filter out rows where 'post_text' is NaN or empty
df = df.dropna(subset=['post_text'])
df = df[df['post_text'].str.strip() != '']

# Rename 'label' to 'labels' for Hugging Face Trainer compatibility
df = df.rename(columns={"label": "labels"})

# Split data into training and validation sets (stratified, 10% held out),
# then down-sample both splits for faster experiments:
# up to 10,000 rows for training and up to 2,000 rows for evaluation.
# You can adjust these numbers based on initial run times.
train_df, eval_df = train_test_split(df, test_size=0.1, stratify=df['labels'], random_state=42) # 10% for evaluation
train_df = train_df.sample(n=min(10000, len(train_df)), random_state=42) # Limit to 10k training samples
eval_df = eval_df.sample(n=min(2000, len(eval_df)), random_state=42)   # Limit to 2k evaluation samples

print(f"Using {len(train_df)} training samples and {len(eval_df)} evaluation samples.")
print(f"Train label distribution:\n{train_df['labels'].value_counts(normalize=True)}")
print(f"Eval label distribution:\n{eval_df['labels'].value_counts(normalize=True)}")

# Convert pandas DataFrames to Hugging Face Dataset objects
train_dataset = Dataset.from_pandas(train_df[['post_text', 'labels']])
eval_dataset = Dataset.from_pandas(eval_df[['post_text', 'labels']])

# Initialize Tokenizer for your specific model
MODEL_NAME = "margotwagner/roberta-psychotherapy-eval"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_function(examples):
    # Truncate to at most 128 tokens; padding=True pads each mapped batch to its longest sequence
    return tokenizer(examples["post_text"], truncation=True, padding=True, max_length=128)

# Apply tokenization
tokenized_train = train_dataset.map(tokenize_function, batched=True, remove_columns=["post_text", "__index_level_0__"])
tokenized_eval = eval_dataset.map(tokenize_function, batched=True, remove_columns=["post_text", "__index_level_0__"])

# Set format to PyTorch tensors
tokenized_train.set_format("torch", columns=['input_ids', 'attention_mask', 'labels'])
tokenized_eval.set_format("torch", columns=['input_ids', 'attention_mask', 'labels'])


# --- 3. MODEL, METRICS, AND HYPERPARAMETER DEFINITION ---

# Function to initialize a fresh model for each search trial
def model_init():
    # Model must be re-initialized for every trial to ensure independence
    return AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2).to(device)

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    acc = accuracy_score(p.label_ids, preds)
    f1 = f1_score(p.label_ids, preds, average="binary") # 'binary' for 0/1 labels
    precision = precision_score(p.label_ids, preds, average="binary")
    recall = recall_score(p.label_ids, preds, average="binary")
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

# --- HYPERPARAMETER GRID DEFINITION (REVISED FOR 5GB GPU) ---
def tune_hp(trial):
    """
    This function defines the hyperparameter space to be explored.
    Adjusted for a 5GB GPU to limit combinations and memory usage.
    """
    # Reduced search space for quicker convergence and to fit within 5GB GPU
    learning_rate = trial.suggest_categorical("learning_rate", [2e-5, 3e-5]) # Reduced to 2 options
    per_device_train_batch_size = trial.suggest_categorical("per_device_train_batch_size", [8, 16]) # Significantly reduced batch sizes
    weight_decay = trial.suggest_categorical("weight_decay", [0.01, 0.03]) # Reduced to 2 options
    num_train_epochs = trial.suggest_categorical("num_train_epochs", [2, 3]) # 2 options

    # Distinct combinations in this space: 2 * 2 * 2 * 2 = 16.
    # Only the batch size affects GPU memory; the other three mainly affect run time.

    return {
        "learning_rate": learning_rate,
        "per_device_train_batch_size": per_device_train_batch_size,
        "weight_decay": weight_decay,
        "num_train_epochs": num_train_epochs,
    }


# --- 4. TRAINING ARGUMENTS (Fixed for all runs) ---
# Most arguments are fixed, only the chosen HPs vary per run.
training_args = TrainingArguments(
    output_dir="./grid_search_results_mental_health",
    # Evaluation settings (fixed)
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1", # Optimize for F1-Score, good for imbalanced classes
    fp16=torch.cuda.is_available(), # Enable mixed precision for T4 GPU
    report_to="none", # Don't report to any external service
    # Fixed parameters (will be overridden by tune_hp where applicable)
    num_train_epochs=3, # Placeholder, will be suggested by tune_hp
    warmup_steps=100, # Reduced warmup steps for smaller datasets/epochs
    logging_dir="./logs",
    logging_steps=500,
    dataloader_num_workers=os.cpu_count() // 2 if os.cpu_count() else 0, # Use half CPU cores for data loading
)

# Initialize the Trainer
trainer = Trainer(
    model_init=model_init, # We pass the function, not the object, for fresh initialization
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    processing_class=tokenizer, # `tokenizer=` is deprecated in recent transformers releases
)

# --- Define the objective function for Optuna ---
def optuna_hp_objective(metrics):
    """
    Optuna objective function that returns the F1 score for maximization.
    `metrics` is the dictionary returned by trainer.evaluate().
    """
    # The keys in the metrics dictionary will be prefixed with 'eval_' during evaluation
    # e.g., 'eval_loss', 'eval_accuracy', 'eval_f1', 'eval_precision', 'eval_recall'
    return metrics["eval_f1"]


# --- 5. EXECUTION OF HYPERPARAMETER SEARCH ---
# We use the Optuna backend; 'hp_space' provides the search definition.
# Note: Optuna's default sampler does not enumerate a grid, so combinations can repeat or be skipped.
print("\n--- Starting Hyperparameter Search (using Optuna backend) ---")
print(f"Optimizing for '{training_args.metric_for_best_model}' score...")

# Calculate the total number of trials based on the REVISED hp_space function
num_lr = len([2e-5, 3e-5])
num_batch = len([8, 16])
num_wd = len([0.01, 0.03])
num_epochs = len([2, 3])
total_trials = num_lr * num_batch * num_wd * num_epochs
print(f"Total experiment combinations: {total_trials}") # This will print 16

best_trial = trainer.hyperparameter_search(
    backend="optuna",
    hp_space=tune_hp,
    direction="maximize", # Maximize the F1 score
    n_trials=total_trials, # Trial budget equal to the grid size (no guarantee that every combination runs)
    compute_objective=optuna_hp_objective, # Explicitly tell Optuna to use eval_f1
)

print("\n--- Hyperparameter Search Complete ---")
print("\nBEST HYPERPARAMETERS FOUND:")

# Extract and print the best configuration
if best_trial:
    print(best_trial)
    best_hps = best_trial.hyperparameters
    print("\nBest Hyperparameters:")
    for key, value in best_hps.items():
        print(f"  {key}: {value}")
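    # BestRun exposes run_id, objective, hyperparameters, and run_summary; it has no `metrics` attribute
    print(f"\nBest objective (eval_f1 on evaluation set): {best_trial.objective}")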
else:
    print("Search failed or no best trial found.")

print("\n--- Final Step: Train a model with the best hyperparameters ---")
if best_trial:
    # Re-initialize TrainingArguments with the best hyperparameters for the final training run
    final_training_args = TrainingArguments(
        output_dir="./final_model_mental_health",
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        fp16=torch.cuda.is_available(),
        report_to="none",
        # Use the best hyperparameters found
        num_train_epochs=best_hps["num_train_epochs"],
        per_device_train_batch_size=best_hps["per_device_train_batch_size"],
        learning_rate=best_hps["learning_rate"],
        weight_decay=best_hps["weight_decay"],
        warmup_steps=100, # Can be adjusted based on number of epochs
        logging_dir="./final_logs",
        logging_steps=500,
        dataloader_num_workers=os.cpu_count() // 2 if os.cpu_count() else 0,
    )

    # Re-initialize the Trainer with the best HPs
    final_trainer = Trainer(
        model_init=model_init, # Re-initialize the model
        args=final_training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        compute_metrics=compute_metrics,
        processing_class=tokenizer, # `tokenizer=` is deprecated in recent transformers releases
    )

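    print("\nTraining final model with best hyperparameters...")
    final_trainer.train()

    # Explicitly write the best model (reloaded at the end of training) to the output directory
    final_trainer.save_model("./final_model_mental_health")
    print("\nFinal model training complete. Best model saved to './final_model_mental_health'.")
    metrics = final_trainer.evaluate()
    print(f"Evaluation metrics of the final model: {metrics}")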
else:
    print("No best hyperparameters found, skipping final model training.")

Using GPU: Tesla T4
Dataset loaded successfully.
Using 10000 training samples and 2000 evaluation samples.
Train label distribution:
labels
1    0.5004
0    0.4996
Name: proportion, dtype: float64
Eval label distribution:
labels
1    0.5
0    0.5
Name: proportion, dtype: float64


[I 2025-11-08 07:21:52,071] A new study created in memory with name: no-name-22cbf267-9648-4ada-b77e-20653798ec5f



--- Starting Hyperparameter Search (using Optuna backend) ---
Optimizing for 'f1' score...
Total experiment combinations: 16


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.5057,0.270499,0.883,0.885406,0.867562,0.904
2,0.2511,0.294374,0.89,0.895038,0.855839,0.938
3,0.1875,0.315393,0.8965,0.895613,0.903357,0.888


[I 2025-11-08 07:26:38,145] Trial 0 finished with value: 0.8956127080181543 and parameters: {'learning_rate': 2e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.03, 'num_train_epochs': 3}. Best is trial 0 with value: 0.8956127080181543.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3808,0.301565,0.887,0.891242,0.858998,0.926
2,0.2417,0.393985,0.891,0.891434,0.887897,0.895


[I 2025-11-08 07:31:22,801] Trial 1 finished with value: 0.8914342629482072 and parameters: {'learning_rate': 2e-05, 'per_device_train_batch_size': 8, 'weight_decay': 0.01, 'num_train_epochs': 2}. Best is trial 0 with value: 0.8956127080181543.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3883,0.284165,0.888,0.8921,0.860595,0.926
2,0.2444,0.385804,0.9045,0.903972,0.908999,0.899


[I 2025-11-08 07:35:53,801] Trial 2 finished with value: 0.9039718451483157 and parameters: {'learning_rate': 3e-05, 'per_device_train_batch_size': 8, 'weight_decay': 0.03, 'num_train_epochs': 2}. Best is trial 2 with value: 0.9039718451483157.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.4818,0.283387,0.887,0.887113,0.886228,0.888
2,0.2361,0.288584,0.9025,0.901664,0.909461,0.894


[I 2025-11-08 07:39:09,670] Trial 3 finished with value: 0.9016641452344932 and parameters: {'learning_rate': 3e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.01, 'num_train_epochs': 2}. Best is trial 2 with value: 0.9039718451483157.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3917,0.330289,0.8815,0.888575,0.838509,0.945
2,0.2661,0.412603,0.902,0.903353,0.891051,0.916
3,0.1832,0.463786,0.9075,0.908642,0.897561,0.92


[I 2025-11-08 07:45:40,797] Trial 4 finished with value: 0.908641975308642 and parameters: {'learning_rate': 3e-05, 'per_device_train_batch_size': 8, 'weight_decay': 0.01, 'num_train_epochs': 3}. Best is trial 4 with value: 0.908641975308642.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.5051,0.273356,0.88,0.884837,0.850554,0.922


[I 2025-11-08 07:47:01,217] Trial 5 pruned. 


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3888,0.339052,0.8785,0.887448,0.826575,0.958


[I 2025-11-08 07:48:53,199] Trial 6 pruned. 


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.5057,0.270499,0.883,0.885406,0.867562,0.904


[I 2025-11-08 07:50:13,433] Trial 7 pruned. 


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.4818,0.283387,0.887,0.887113,0.886228,0.888
2,0.2361,0.288584,0.9025,0.901664,0.909461,0.894


[I 2025-11-08 07:55:11,108] Trial 8 finished with value: 0.9016641452344932 and parameters: {'learning_rate': 3e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.01, 'num_train_epochs': 2}. Best is trial 4 with value: 0.908641975308642.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3886,0.316314,0.8915,0.896716,0.855586,0.942
2,0.2421,0.400453,0.892,0.892108,0.891218,0.893


[I 2025-11-08 07:59:45,760] Trial 9 pruned. 


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3917,0.330289,0.8815,0.888575,0.838509,0.945


[I 2025-11-08 08:01:37,476] Trial 10 pruned. 


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3883,0.284165,0.888,0.8921,0.860595,0.926


[I 2025-11-08 08:03:29,230] Trial 11 pruned. 


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.388,0.309399,0.882,0.888046,0.844765,0.936


[I 2025-11-08 08:05:21,691] Trial 12 pruned. 


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3888,0.322937,0.8895,0.891826,0.873442,0.911


[I 2025-11-08 08:07:13,513] Trial 13 pruned. 


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3883,0.284165,0.888,0.8921,0.860595,0.926


[I 2025-11-08 08:09:05,279] Trial 14 pruned. 


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3917,0.330289,0.8815,0.888575,0.838509,0.945


[I 2025-11-08 08:10:56,859] Trial 15 pruned. 



--- Hyperparameter Search Complete ---

BEST HYPERPARAMETERS FOUND:
BestRun(run_id='4', objective=0.908641975308642, hyperparameters={'learning_rate': 3e-05, 'per_device_train_batch_size': 8, 'weight_decay': 0.01, 'num_train_epochs': 3}, run_summary=None)

Best Hyperparameters:
  learning_rate: 3e-05
  per_device_train_batch_size: 8
  weight_decay: 0.01
  num_train_epochs: 3


AttributeError: 'BestRun' object has no attribute 'metrics'