# Mental Health Crisis Detection Model Training

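This notebook fine-tunes the crisis detection model. It holds out 500 samples from the training data as a validation set and evaluates recall after every epoch to monitor overfitting. Training runs for a fixed three epochs; the code below does not configure an early-stopping callback.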

## Model Architecture
- **Base Model**: BERT-base-uncased (12-layer transformer)
- **Customizations**:
  - 30% Hidden dropout
  - 10% Attention dropout
  - Frozen embeddings
- **Training**:
  - 3 epochs (early convergence observed)
  - Learning rate 2e-5 (a common BERT fine-tuning value)
  - Batch size 32 with 4-step gradient accumulation (effective batch size 128)
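
To make the batch-size bullet concrete, here is a minimal sketch of the arithmetic. The training-set size of 10,000 is a made-up figure (the notebook does not state it), and the step count assumes a single device with a partial final accumulation group counted as one update, so treat the result as an estimate rather than what `Trainer` will report.

```python
import math

def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch * accumulation_steps * num_devices

def approx_optimizer_steps(num_samples, per_device_batch, accumulation_steps, epochs):
    """Rough count of optimizer updates over the whole run (single device)."""
    batches_per_epoch = math.ceil(num_samples / per_device_batch)
    updates_per_epoch = math.ceil(batches_per_epoch / accumulation_steps)
    return updates_per_epoch * epochs

# Values from the list above; 10_000 samples is a hypothetical dataset size.
effective = effective_batch_size(32, 4)
steps = approx_optimizer_steps(10_000, 32, 4, 3)
print(effective, steps)
```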

## Training Philosophy
- Recall-focused: Prioritize detecting all potential crisis cases, accepting more false alarms
- Overfitting mitigation: Validation-set monitoring, dropout, weight decay
- Resource efficiency: Mixed-precision (FP16) training
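
The recall-first priority can be illustrated with a small pure-Python sketch. The labels and the two sets of predictions are invented for illustration; the helper `recall_and_precision` is hypothetical and not part of this notebook, which uses `sklearn.metrics.recall_score` instead.

```python
def recall_and_precision(y_true, y_pred, positive=1):
    """Compute recall and precision for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Hypothetical data: 3 crisis posts (1) and 5 non-crisis posts (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
cautious = [1, 0, 0, 0, 0, 0, 0, 0]   # flags few posts, misses two crises
sensitive = [1, 1, 1, 1, 1, 0, 0, 0]  # flags more posts, misses none

cautious_scores = recall_and_precision(y_true, cautious)
sensitive_scores = recall_and_precision(y_true, sensitive)
print(cautious_scores, sensitive_scores)
```

In this toy case the sensitive classifier catches every crisis post at the cost of two false alarms, which is the trade-off a recall-focused setup accepts.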

In [1]:
from datasets import Dataset
from transformers import (
    BertForSequenceClassification,
    BertTokenizerFast,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
)
import numpy as np
from sklearn.metrics import recall_score
from sklearn.preprocessing import LabelEncoder


In [2]:
# Use BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

In [3]:
# Define tokenize function
def tokenize_function(examples):
    """Convert text to BERT-compatible tokens.

    Returns:
        Dictionary with 'input_ids', 'token_type_ids', and 'attention_mask'.
    """
    # No padding here: DataCollatorWithPadding pads each batch at training time.
    return tokenizer(
        examples['text'],
        truncation=True,  # Cut sequences longer than max_length
        max_length=512,  # BERT's maximum sequence length
    )
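
The `DataCollatorWithPadding` passed to the trainer later pads each batch to its longest sequence. As a rough illustration of why that matters, the sketch below counts padded tokens for fixed-length versus per-batch padding. The sequence lengths and the helper `padded_token_count` are hypothetical, and the batches are taken in order without the shuffling a real data loader would apply.

```python
def padded_token_count(lengths, batch_size, pad_to=None):
    """Total tokens processed when every batch is padded to a common width.

    If pad_to is None, each batch is padded to its own longest sequence.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        width = pad_to if pad_to is not None else max(batch)
        total += width * len(batch)
    return total

# Hypothetical token counts for eight posts.
lengths = [12, 40, 25, 100, 8, 60, 33, 512]

fixed = padded_token_count(lengths, batch_size=4, pad_to=512)
dynamic = padded_token_count(lengths, batch_size=4)
print(fixed, dynamic)
```

A single long post still forces its whole batch to full width, so the saving depends on how sequence lengths are distributed.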

In [4]:
# Define compute_metrics function to calculate recall
def compute_metrics(eval_pred):
    """Custom evaluation metric (recall-focused)."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)  # Convert logits to class IDs
    # Recall of the positive class (label 1). This is the SUICIDAL class only if
    # LabelEncoder maps it to 1, which depends on the sorted order of the label strings.
    recall = recall_score(labels, predictions, average='binary', pos_label=1)
    return {'recall': recall}  # Primary success metric

In [5]:
def train_model(X_train, y_train):
    """Complete training workflow.

    Returns:
        model: Fine-tuned BERT
        tokenizer: Pretrained tokenizer (not modified by training)
        label_encoder: Class-label mapper
    """
    # Encode text labels to integers. LabelEncoder assigns IDs in sorted order of
    # the label strings, e.g. NON-SUICIDAL=0, SUICIDAL=1.
    label_encoder = LabelEncoder()
    y_train_encoded = label_encoder.fit_transform(y_train)  # Numeric labels

    # Create Hugging Face dataset
    dataset = Dataset.from_dict({
        'text': X_train.tolist(),  # Input posts
        'labels': y_train_encoded.tolist()  # Encoded labels
    })
    
    # Tokenization (CPU-bound)
    tokenized_datasets = dataset.map(tokenize_function, batched=True)

    # Model initialization
    # Dropout rates are passed as config overrides at load time. Setting them on
    # model.config after loading would not change the already-built dropout layers.
    model = BertForSequenceClassification.from_pretrained(
        'bert-base-uncased',
        num_labels=len(label_encoder.classes_),  # Binary classification
        hidden_dropout_prob=0.3,  # Stronger dropout to reduce overfitting
        attention_probs_dropout_prob=0.1,  # Attention dropout (BERT default)
    )

    # Freeze embeddings (transfer learning)
    for param in model.bert.embeddings.parameters():
        param.requires_grad = False  # Keep pretrained embeddings static

    # Training configuration
    training_args = TrainingArguments(
        output_dir='./results',  # Checkpoints and trainer state
        evaluation_strategy='epoch',  # Validate after every epoch
        learning_rate=2e-5,  # Common choice for BERT fine-tuning
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=3,  # Fixed; no early-stopping callback is configured
        weight_decay=0.01,  # Weight decay regularization
        logging_dir='./logs',  # TensorBoard logs
        fp16=True,  # Mixed-precision training (requires a CUDA GPU)
        gradient_accumulation_steps=4,  # Effective batch size = 32 * 4 = 128
    )

    # Hold out 500 shuffled samples for validation (the dataset must have more than 500 rows)
    split = tokenized_datasets.train_test_split(test_size=500, seed=42)
    train_dataset = split['train']  # Training data
    eval_dataset = split['test']  # 500-sample validation set

    # Trainer initialization
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer), # Dynamic padding
        compute_metrics=compute_metrics,  # Recall tracking
    )

    # Model training
    trainer.train()

    return model, tokenizer, label_encoder
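
Early stopping is mentioned above, while `train_model` trains for a fixed `num_train_epochs`. The sketch below shows the patience rule that early stopping usually follows, applied to a made-up list of per-epoch validation recalls. The function `epochs_until_stop` is hypothetical and only illustrates the logic. In `transformers`, the corresponding piece is `EarlyStoppingCallback(early_stopping_patience=...)`, which also needs `load_best_model_at_end=True` and a `metric_for_best_model` in `TrainingArguments`; wiring that in is left as an assumption here.

```python
def epochs_until_stop(recalls, patience=1, min_delta=0.0):
    """Return (stop_epoch, best_recall) under a patience-based stopping rule.

    Training stops once the metric has failed to improve by more than
    min_delta for `patience` consecutive evaluations.
    """
    best = float('-inf')
    bad_epochs = 0
    for epoch, recall in enumerate(recalls, start=1):
        if recall > best + min_delta:
            best = recall
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best
    return len(recalls), best

# Hypothetical validation recall after each epoch.
history = [0.81, 0.88, 0.87, 0.86]
stop_epoch, best_recall = epochs_until_stop(history, patience=1)
print(stop_epoch, best_recall)
```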