<!-- PS4E11 | CatBoost Optuna -->
<div style="font-family: 'Poppins'; font-weight: bold; letter-spacing: 0px; color: #FFFFFF; font-size: 300%; text-align: left; padding: 15px; background: #0A0F29; border: 8px solid #00FFFF; border-radius: 15px; box-shadow: 5px 5px 20px rgba(0, 0, 0, 0.5);">
    Depression Prediction with CatBoost & Optuna<br>
</div>

<div style="text-align: center;">

  <img src="https://i.imgur.com/EhXAhl1.jpg" alt="Centered Image" style="max-width: 60%; height: auto;">

</div>

Photo by <a href="https://unsplash.com/fr/@dmey503?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Dan Meyers</a> on <a href="https://unsplash.com/fr/photos/nabandonnez-pas-vous-netes-pas-seul-vous-comptez-sur-la-signalisation-sur-la-cloture-metallique-hluOJZjLVXc?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Unsplash</a>

# <div style="background-color:#0A0F29; font-family:'Poppins', cursive; color:#E0F7FA; font-size:140%; text-align:center; border: 2px solid #00FFFF; border-radius:15px; padding: 15px; box-shadow: 5px 5px 20px rgba(0, 0, 0, 0.5); font-weight: bold; letter-spacing: 1px; text-transform: uppercase;">Challenge Overview</div>

## <div style="background-color:#0A0F29; font-family:'Poppins', cursive; color:#E0F7FA; font-size:100%; text-align:center; border: 2px solid #0A0F29; border-radius:10px; padding: 10px; box-shadow: 5px 5px 20px rgba(0, 0, 0, 0.5); font-weight: bold; letter-spacing: 1px; text-transform: uppercase;">Competition</div>

- **Objective:** Participants analyze a synthetic dataset derived from a mental health survey to identify factors linked to depression and predict whether each individual is likely to experience it.

- **Evaluation:** Submissions are scored by accuracy, with predictions required for each row in the test set as either 0 (no depression) or 1 (depression).
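
As a minimal sketch of the metric described above, accuracy is the fraction of rows whose predicted label matches the true label. The helper name `accuracy` and the toy labels are made up for illustration; the notebook itself relies on `sklearn.metrics.accuracy_score`.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of rows where the predicted label equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Toy labels, made up for illustration only
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
toy_score = accuracy(y_true, y_pred)  # 4 of the 5 rows match
```

Because every row counts equally and only the hard 0/1 label matters, small changes in the decision threshold can move the score without any change in the underlying probabilities.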

## <div style="background-color:#0A0F29; font-family:'Poppins', cursive; color:#E0F7FA; font-size:100%; text-align:center; border: 2px solid #0A0F29; border-radius:10px; padding: 10px; box-shadow: 5px 5px 20px rgba(0, 0, 0, 0.5); font-weight: bold; letter-spacing: 1px; text-transform: uppercase;">Notebook aim</div>

- In this notebook I train a CatBoost model and save out-of-fold predictions for ensembling.

- The model choice is based on this notebook: [PS4E11 | Explore Best Models](https://www.kaggle.com/code/wguesdon/ps4e11-explore-best-models)

# <div style="background-color:#0A0F29; font-family:'Poppins', cursive; color:#E0F7FA; font-size:140%; text-align:center; border: 2px solid #00FFFF; border-radius:15px; padding: 15px; box-shadow: 5px 5px 20px rgba(0, 0, 0, 0.5); font-weight: bold; letter-spacing: 1px; text-transform: uppercase;">Automated EDA</div>

- See [PS4E11 | AutoML Baseline](https://www.kaggle.com/code/wguesdon/ps4e11-automl-baseline)

# <div style="background-color:#0A0F29; font-family:'Poppins', cursive; color:#E0F7FA; font-size:140%; text-align:center; border: 2px solid #00FFFF; border-radius:15px; padding: 15px; box-shadow: 5px 5px 20px rgba(0, 0, 0, 0.5); font-weight: bold; letter-spacing: 1px; text-transform: uppercase;">Model training</div>

In [1]:
# Constants

ENV = 'Kaggle'  # Set to 'Colab', 'Kaggle', or 'Sagemaker'
DEV = False  # Set to True to enable subsetting, False for full training data
SUBSET_SIZE = 1000  # Number of samples for the subset during development
TRIALS = 50 # Number of trials for Optuna
GPU = False
ID_COL = 'id'
TARGET_COL = 'Depression'
MODEL_TYPE = 'CatBoost'  # Set to 'CatBoost', 'XGBoost', or 'LightGBM'

In [2]:
!python --version

Python 3.10.14


In [3]:
if ENV == 'Colab':
    from google.colab import drive
    drive.mount('/content/drive')

In [4]:
if ENV == 'Kaggle':
    print('configure Kaggle env')
    !pip install autogluon.tabular  > /dev/null 2>&1
    !pip install optuna-integration[sklearn] > /dev/null 2>&1
    !pip install langchain-core > /dev/null 2>&1
    !pip install langchain-openai  > /dev/null 2>&1
    !pip install sweetviz > /dev/null 2>&1
    !pip install numba==0.58.1 visions==0.7.5 pandas==1.5.3 ydata-profiling==4.7.0 > /dev/null 2>&1

if ENV == 'Colab':
    print('configure Colab env')
    !pip install -r /content/drive/MyDrive/Kaggle_analysis/PS4E11/requirements.txt > /dev/null 2>&1
    !pip install scikit-learn==1.3.0
    !pip install shap

configure Kaggle env


In [5]:
import pandas as pd
import numpy as np
import optuna
import joblib
import logging
import json
from datetime import datetime
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

In [6]:
# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

In [7]:
def load_data():

    """Load the dataset based on the environment."""

    paths = {
        'Kaggle': '/kaggle/input/playground-series-s4e11/',
        'Sagemaker': '/home/ec2-user/SageMaker/data/PS4E11/',
        'Colab': '/content/drive/MyDrive/Kaggle_analysis/PS4E11/data/'
    }

    base_path = paths.get(ENV)

    if not base_path:
        raise ValueError("Invalid environment specified")

    

    train_data = pd.read_csv(base_path + 'train.csv')
    test_data = pd.read_csv(base_path + 'test.csv')
    sample_submission = pd.read_csv(base_path + 'sample_submission.csv')

    return train_data, test_data, sample_submission

def preprocess_data(train_data, test_data):

    """Preprocess training and testing data."""

    # Separate features and target
    X = train_data.drop(columns=[ID_COL, TARGET_COL])
    y = train_data[TARGET_COL]
    X_test = test_data.drop(columns=[ID_COL])

    # Subset the training data if DEV is True.
    # The test set is left untouched: its index is unrelated to the training index,
    # and the submission needs a prediction for every test row.
    if DEV:
        subset_indices = np.random.choice(X.index, size=min(SUBSET_SIZE, len(X)), replace=False)
        X = X.loc[subset_indices]
        y = y.loc[subset_indices]

    # Detect categorical columns (assuming all non-numeric columns are categorical)
    cat_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

    # Fill missing values
    X[cat_features] = X[cat_features].fillna('missing')
    X_test[cat_features] = X_test[cat_features].fillna('missing')

    return X, y, X_test, cat_features

from sklearn.model_selection import train_test_split

def objective(trial, X, y, cat_features):

    """Objective function for Optuna hyperparameter optimization."""

    if MODEL_TYPE == 'CatBoost':
        params = {
            'iterations': trial.suggest_int('iterations', 500, 3000),
            'depth': trial.suggest_int('depth', 4, 10),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
            'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-3, 10, log=True),
            'random_strength': trial.suggest_float('random_strength', 0, 10),
            'bagging_temperature': trial.suggest_float('bagging_temperature', 0, 10),
            'border_count': trial.suggest_int('border_count', 32, 255),
            'task_type': 'GPU' if GPU else 'CPU',
            'verbose': 0,
        }

    elif MODEL_TYPE == 'XGBoost':
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 500, 3000),
            'max_depth': trial.suggest_int('max_depth', 3, 10),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
            'subsample': trial.suggest_float('subsample', 0.5, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
            'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10, log=True),
            'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10, log=True),
            'gamma': trial.suggest_float('gamma', 0, 10),
            'tree_method': 'gpu_hist' if GPU else 'auto',
        }

    elif MODEL_TYPE == 'LightGBM':
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 500, 3000),
            'max_depth': trial.suggest_int('max_depth', -1, 10),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
            'num_leaves': trial.suggest_int('num_leaves', 20, 300),
            'feature_fraction': trial.suggest_float('feature_fraction', 0.5, 1.0),
            'bagging_fraction': trial.suggest_float('bagging_fraction', 0.5, 1.0),
            'bagging_freq': trial.suggest_int('bagging_freq', 1, 10),
            'lambda_l1': trial.suggest_float('lambda_l1', 1e-3, 10, log=True),
            'lambda_l2': trial.suggest_float('lambda_l2', 1e-3, 10, log=True),
            'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
            'device': 'gpu' if GPU else 'cpu',
        }

    else:
        raise ValueError(f"Invalid model type: {MODEL_TYPE}")

    # Single train-validation split
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    model = get_model(params)
    model.fit(
        X_train, y_train, 
        cat_features=cat_features if MODEL_TYPE == 'CatBoost' else None,
        eval_set=(X_valid, y_valid),
        early_stopping_rounds=100,
        verbose=0
    )

    # Evaluate model on validation set
    valid_preds = model.predict(X_valid)
    score = accuracy_score(y_valid, valid_preds.round())

    return score

def save_best_params(best_params):
    """Save the best parameters to JSON and joblib files."""
    
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    file_prefix = f"{MODEL_TYPE}_best_params_{timestamp}"

    # Save to JSON
    with open(f"{file_prefix}.json", 'w') as json_file:
        json.dump(best_params, json_file)
    
    # Save to joblib
    joblib.dump(best_params, f"{file_prefix}.joblib")

def get_model(params):

    """Returns the appropriate model based on MODEL_TYPE."""

    if MODEL_TYPE == 'CatBoost':
        return CatBoostClassifier(**params)

    elif MODEL_TYPE == 'XGBoost':
        return XGBClassifier(**params)

    elif MODEL_TYPE == 'LightGBM':
        return LGBMClassifier(**params)

    else:
        raise ValueError(f"Invalid model type: {MODEL_TYPE}")

def run_hyperparameter_optimization(X, y, cat_features, trials=TRIALS):

    """Run Optuna hyperparameter optimization."""

    study = optuna.create_study(direction='maximize')
    study.optimize(lambda trial: objective(trial, X, y, cat_features), n_trials=trials)
    logging.info(f"Best parameters: {study.best_params}")
    
    return study.best_params

def train_and_predict(X, y, X_test, best_params, cat_features):

    """Train final model using the best parameters and make OOF and test predictions."""

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    oof_preds = np.zeros(len(X))
    test_preds = np.zeros(len(X_test))

    for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
        X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

        model = get_model(best_params)
        model.fit(X_train, y_train, cat_features=cat_features if MODEL_TYPE == 'CatBoost' else None, 
                  eval_set=(X_valid, y_valid), early_stopping_rounds=100, verbose=100)

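        # Save out-of-fold and test predictions as positive-class probabilities,
        # so they can feed a meta-model and be thresholded at 0.5 later
        oof_preds[valid_idx] = model.predict_proba(X_valid)[:, 1]
        test_preds += model.predict_proba(X_test)[:, 1] / skf.n_splits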

    return oof_preds, test_preds

def save_predictions(oof_preds, test_preds, train_data, test_data):

    """Save out-of-fold predictions and test predictions for meta-model."""

    oof_df = pd.DataFrame({ID_COL: train_data[ID_COL], 'oof_preds': oof_preds})
    oof_df.to_csv('oof_predictions.csv', index=False)

    # Convert test predictions to 0 or 1 integers
    test_preds = (test_preds >= 0.5).astype(int)

    # Prepare the submission DataFrame
    submission = pd.DataFrame({ID_COL: test_data[ID_COL]})
    submission['Depression'] = test_preds
    submission.to_csv('submission.csv', index=False)

    logging.info("Training complete. OOF predictions and submission file are saved.")

    # Ensemble with additional external submissions

    try:
        sub2 = pd.read_csv('/kaggle/input/ps4e11-submissions-for-ensemble/catboost_optuna_new_top5_features.csv').sort_values(by='id').reset_index(drop=True)
        sub3 = pd.read_csv('/kaggle/input/ps4e11-submissions-for-ensemble/mental_health_automl_baseline.csv').sort_values(by='id').reset_index(drop=True)
        sub4 = pd.read_csv('/kaggle/input/ps4e11-submissions-for-ensemble/mental_health_catboost_edited.csv').sort_values(by='id').reset_index(drop=True)
        sub5 = pd.read_csv('/kaggle/input/ps4e11-submissions-for-ensemble/this_code_fixed_my_depression.csv').sort_values(by='id').reset_index(drop=True)

        # List of submission dataframes

        submissions = [submission, sub2, sub3, sub4, sub5]

        # Ensuring all values are integers

        for sub in submissions:
            sub['Depression'] = sub['Depression'].astype(int)

        # Creating a copy of the 'id' column from the first submission
        ensemble = submissions[0][['id']].copy()

        # Stack all predictions into a 2D numpy array and calculate the majority vote
        predictions = np.array([sub['Depression'].values for sub in submissions])
        ensemble['Depression'] = (np.sum(predictions, axis=0) >= len(submissions) / 2).astype(int)  # Majority vote logic

        # Saving the output to a CSV file

        ensemble.to_csv('final_ensemble.csv', index=False)
        logging.info("Ensemble submission file is saved.")

    except FileNotFoundError as e:
        logging.warning(f"Ensembling skipped due to missing files: {e}")

In [8]:
train_data, test_data, _ = load_data()
X, y, X_test, cat_features = preprocess_data(train_data, test_data)
best_params = run_hyperparameter_optimization(X, y, cat_features, trials=TRIALS)

# Save the best parameters
save_best_params(best_params)

# training and prediction
oof_preds, test_preds = train_and_predict(X, y, X_test, best_params, cat_features)

[I 2024-11-14 08:30:08,240] A new study created in memory with name: no-name-b36f837d-a3f9-4fea-bc64-836cfb720c1b
[I 2024-11-14 08:30:48,293] Trial 0 finished with value: 0.939410092395167 and parameters: {'iterations': 1896, 'depth': 8, 'learning_rate': 0.15866364733217117, 'l2_leaf_reg': 0.019191484725770352, 'random_strength': 9.122320001210516, 'bagging_temperature': 4.374612226312661, 'border_count': 132}. Best is trial 0 with value: 0.939410092395167.
[I 2024-11-14 08:33:33,441] Trial 1 finished with value: 0.9400852878464819 and parameters: {'iterations': 2794, 'depth': 6, 'learning_rate': 0.02794385618534179, 'l2_leaf_reg': 3.806243007424324, 'random_strength': 7.468772883334623, 'bagging_temperature': 7.904540125455123, 'border_count': 175}. Best is trial 1 with value: 0.9400852878464819.
[I 2024-11-14 08:34:50,828] Trial 2 finished with value: 0.9399431414356787 and parameters: {'iterations': 1917, 'depth': 7, 'learning_rate': 0.09158029325651064, 'l2_leaf_reg': 0.52839897835

0:	learn: 0.4967540	test: 0.4973229	best: 0.4973229 (0)	total: 143ms	remaining: 6m 33s
100:	learn: 0.1469359	test: 0.1510172	best: 0.1510172 (100)	total: 11.7s	remaining: 5m 7s
200:	learn: 0.1399991	test: 0.1498412	best: 0.1496234 (191)	total: 24.6s	remaining: 5m 13s
300:	learn: 0.1347966	test: 0.1495291	best: 0.1494977 (293)	total: 37.4s	remaining: 5m 5s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.1494976536
bestIteration = 293

Shrink model to first 294 iterations.
0:	learn: 0.4947772	test: 0.4962570	best: 0.4962570 (0)	total: 144ms	remaining: 6m 35s
100:	learn: 0.1449935	test: 0.1540889	best: 0.1540889 (100)	total: 12.1s	remaining: 5m 17s
200:	learn: 0.1375747	test: 0.1528607	best: 0.1526687 (180)	total: 24.5s	remaining: 5m 12s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.1526686865
bestIteration = 180

Shrink model to first 181 iterations.
0:	learn: 0.4963095	test: 0.4961536	best: 0.4961536 (0)	total: 142ms	remaining: 6m 32s
100:	lea

# <div style="background-color:#0A0F29; font-family:'Poppins', cursive; color:#E0F7FA; font-size:140%; text-align:center; border: 2px solid #00FFFF; border-radius:15px; padding: 15px; box-shadow: 5px 5px 20px rgba(0, 0, 0, 0.5); font-weight: bold; letter-spacing: 1px; text-transform: uppercase;">Final Ensemble</div>

## Submission Generation

This notebook generates two submissions for scoring:

- `submission.csv`
- `final_ensemble.csv`

## Final Ensemble Submission

For the final ensemble submission, I combine the predictions from this notebook with the results of four public notebooks, each using a different method, by majority vote. Importantly, none of these public notebooks is itself an ensemble of predictions from other sources. The ensemble submission scores higher on the public leaderboard, but it carries a greater risk of overfitting to it, particularly because this competition is scored on accuracy.
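
The voting step can be sketched in isolation as follows. This is an illustrative version, not the exact code in `save_predictions`: the helper name `majority_vote` and the toy votes are assumptions, and it uses a strict majority, which matches the notebook's rule for an odd number of submissions such as the five used here.

```python
import numpy as np

def majority_vote(label_arrays):
    """Return 1 where strictly more than half of the voters predict 1."""
    votes = np.asarray(label_arrays)  # shape: (n_models, n_rows)
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)

# Five hypothetical submissions (rows) voting on four test rows (columns)
toy_votes = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
]
toy_ensemble = majority_vote(toy_votes)
```

With an even number of voters the two rules differ: the `>=` comparison in `save_predictions` resolves a tie to 1, whereas this strict version resolves it to 0.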

## Best Ensembling Approach

The most effective ensembling approach is to combine out-of-fold predictions from models that were all trained with cross-validation and scored under the same cross-validation strategy. A simple meta-model, such as logistic regression, or a voting method can then be fitted on those predictions to improve performance.
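
A minimal sketch of that idea, assuming each base model has already produced out-of-fold probabilities on the same folds. The data here is synthetic and the names `oof_matrix`, `y_toy`, and `meta_model` are hypothetical; in practice each column would come from a file such as the `oof_predictions.csv` saved by this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: each column plays the role of one base model's
# out-of-fold probability for the training rows.
n_rows = 1000
y_toy = rng.integers(0, 2, size=n_rows)
oof_matrix = np.column_stack([
    np.clip(y_toy * 0.6 + rng.normal(0.2, 0.25, size=n_rows), 0, 1)
    for _ in range(3)
])

# Score the meta-model with its own cross-validation so the estimate
# is not computed on rows it was fitted on.
meta_model = LogisticRegression()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
meta_scores = cross_val_score(meta_model, oof_matrix, y_toy, cv=cv, scoring="accuracy")
meta_cv_accuracy = float(meta_scores.mean())

# Fit on all OOF rows; at inference time the matching test-set
# probabilities (same column order) would go to meta_model.predict.
meta_model.fit(oof_matrix, y_toy)
```

The meta-model only sees predictions made on rows each base model did not train on, which is what keeps the stacked score comparable to the base models' cross-validation scores.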

## Notebooks used for the Ensemble Submission

- [This code fixed my depression](https://www.kaggle.com/code/adyiemaz/this-code-fixed-my-depression/notebook)
- [Mental health | Catboost | Edited](https://www.kaggle.com/code/abdmental01/mental-health-catboost-edited)
- [[0.94317] 🚀 Catboost + Optuna | New Top5 Features](https://www.kaggle.com/code/harshg97/0-94317-catboost-optuna-new-top5-features)
- [Mental health - automl baseline](https://www.kaggle.com/code/thomasmeiner/mental-health-automl-baseline)

In [9]:
save_predictions(oof_preds, test_preds, train_data, test_data)