<a href="https://www.kaggle.com/code/zjzhao1002/mental-health-xgb-lgbm-cb-ensemble?scriptVersionId=204923466" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 1. Previous work

I have already tuned each of the following models in a separate notebook:
* [Fine tuning for XGBoost](https://www.kaggle.com/code/zjzhao1002/mental-health-xgboost-optuna)
* [Fine tuning for LightGBM](https://www.kaggle.com/code/zjzhao1002/mental-health-lgbm-optuna)
* [Fine tuning for CatBoost](https://www.kaggle.com/code/zjzhao1002/mental-health-catboost-optuna)

This notebook combines these three models into an ensemble.

# 2. Data Loading and Cleaning

This section follows the cleaning strategy of the notebooks above.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:
df_train = pd.read_csv("/kaggle/input/playground-series-s4e11/train.csv")
df_test = pd.read_csv("/kaggle/input/playground-series-s4e11/test.csv")

In [3]:
def fillna_num(df):
    columns = ['Academic Pressure', 'Work Pressure', 'CGPA', 'Study Satisfaction', 'Job Satisfaction']
    for column in columns:
        df[column] = df[column].fillna(-1)
    df['Financial Stress'] = df['Financial Stress'].fillna(float(df['Financial Stress'].median()))
    return df

In [4]:
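def fillna_cat(df):
    columns = ['Profession', 'Sleep Duration', 'Dietary Habits', 'Degree']
    for column in columns:
        # Assign the result back instead of using inplace=True on a column
        # selection, which pandas deprecates as chained assignment.
        df[column] = df[column].fillna('other')
    return df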
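
The per-column fills in `fillna_num` and `fillna_cat` can also be written as one call by passing a dict to `DataFrame.fillna`. The following is a minimal sketch on a toy frame: the column names come from this notebook, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the notebook.
toy = pd.DataFrame({
    "Profession": ["Teacher", None, "Chef"],
    "Degree": [None, "BSc", "MBA"],
    "CGPA": [8.1, np.nan, 7.4],
})

# One fill value per column, applied in a single call that returns a new frame.
fill_values = {"Profession": "other", "Degree": "other", "CGPA": -1}
filled = toy.fillna(fill_values)

print(filled)
```

Because `fillna` returns a new frame here, the original `toy` frame keeps its missing values, which avoids the chained-assignment pitfall altogether.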

In [5]:
def fill_less_frequent_value(df):
    columns = ['Profession', 'Sleep Duration', 'Dietary Habits', 'Degree']
    for column in columns:
        count = df[column].value_counts()
        less_freq = count[count<20].index
        df[column] = df[column].apply(lambda x: 'other' if x in less_freq else x)
    return df
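
To show what the rare-category step does, here is a small sketch on a toy series. It uses `Series.isin` and `Series.where` rather than `apply`, which should give the same result for this purpose. The helper name `collapse_rare` and the sample values are hypothetical.

```python
import pandas as pd

def collapse_rare(series, min_count=20, placeholder="other"):
    """Replace values seen fewer than min_count times with placeholder."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent values; swap rare ones for the placeholder.
    return series.where(~series.isin(rare), placeholder)

# Hypothetical data: "typo-degree" appears only once.
toy = pd.Series(["MBA"] * 3 + ["BSc"] * 2 + ["typo-degree"])
collapsed = collapse_rare(toy, min_count=2)

print(collapsed.value_counts())
```

Note that the notebook applies the threshold of 20 to the train and test frames separately, so a value can be rare in one frame and frequent in the other.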

In [6]:
df_train = fillna_num(df_train)
df_train = fillna_cat(df_train)
df_train = fill_less_frequent_value(df_train)

df_test = fillna_num(df_test)
df_test = fillna_cat(df_test)
df_test = fill_less_frequent_value(df_test)


# 3. Encoding

In [7]:
from sklearn.preprocessing import OrdinalEncoder

def encoding(df):
    cat_columns = df.select_dtypes(include='object').columns
    
    encoder = OrdinalEncoder()
    df[cat_columns] = encoder.fit_transform(df[cat_columns].astype(str))
    
    return df

In [8]:
df_train = encoding(df_train)
df_test = encoding(df_test)
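
One caveat: calling `encoding` once per frame fits two independent encoders, so the same category can receive different integer codes in train and test whenever their category sets differ. A possible alternative, sketched below on toy data, fits the encoder on the training frame only and reuses it for the test frame. The frames and the choice of `-1` for unseen categories are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames with hypothetical values; "Degree" is a column name from the notebook.
train = pd.DataFrame({"Degree": ["BSc", "MBA", "PhD", "MBA"]})
test = pd.DataFrame({"Degree": ["PhD", "MBA", "LLB"]})  # "LLB" never appears in train

cat_columns = ["Degree"]

# Fit once on the training data; unseen test categories map to -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train[cat_columns] = encoder.fit_transform(train[cat_columns].astype(str))
test[cat_columns] = encoder.transform(test[cat_columns].astype(str))

print(train["Degree"].tolist())
print(test["Degree"].tolist())
```

With a shared encoder, "MBA" and "PhD" get the same code in both frames, which is what tree models trained on the encoded values rely on.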

# 4. Ensemble

In [9]:
X = df_train.drop(['id', 'Name', 'City', 'Depression'], axis=1)
X_test = df_test.drop(['id', 'Name', 'City'], axis=1)
y = df_train['Depression']

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=1
)
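
The split above is purely random. If the target classes are imbalanced, passing `stratify` keeps the class ratio the same in both parts. The sketch below uses synthetic data with an assumed 80/20 class ratio, which is not the competition's actual ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (hypothetical 80/20 ratio, not the competition's).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([0] * 800 + [1] * 200)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy,
    y_toy,
    test_size=0.20,
    random_state=1,
    stratify=y_toy,  # preserve the class ratio in both splits
)

print(y_tr.mean(), y_va.mean())
```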

In [11]:
from xgboost import XGBClassifier
from catboost import Pool, CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

In [12]:
xgb_params = dict(
    objective='binary:logistic', 
    device = "cuda",
    max_depth = 9,
    learning_rate = 0.026531959129846603,
    n_estimators = 1000,
    min_child_weight = 2,
    colsample_bytree = 0.20638223134169686,
    subsample = 0.7475335211758346,
    reg_alpha = 16.29659995080457,
    reg_lambda= 0.07389990358057898,
)

xgb_model = XGBClassifier(**xgb_params)

In [13]:
lgbm_params = dict(
    device = 'gpu',
    objective = 'binary',
    n_estimators = 1000,
    learning_rate = 0.037562161788030325,
    max_depth = 4,
    min_child_samples = 175,  
    subsample = 0.9799593439834279,  
    colsample_bytree = 0.6990355331799533,  
    reg_alpha = 0.00037180614413337545,  
    reg_lambda = 4.086951612354482e-05,  
    num_leaves = 19,  
    min_gain_to_split = 0.398564686475442, 
    verbose = -1
)

lgbm_model = LGBMClassifier(**lgbm_params)

In [14]:
cat_params = dict(
    iterations = 1000,
    learning_rate = 0.0383447294477305,
    depth = 4,
    l2_leaf_reg = 0.000177236244927708,
    bagging_temperature = 0.03805530103941712,
    random_strength=4.818788402165575e-07,
    task_type='GPU',
    early_stopping_rounds=200,
    verbose=False
)

cat_model = CatBoostClassifier(**cat_params)

In [15]:
from sklearn.ensemble import VotingClassifier

estimator = VotingClassifier([
        ('lgbm', lgbm_model),
        ('cat', cat_model),
        ('xgb', xgb_model),
    ], voting = 'soft')
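
With `voting = 'soft'` and no weights, the ensemble averages the class probabilities of its members and predicts the class with the highest average. The NumPy sketch below illustrates the idea with made-up probabilities for three models and contrasts it with hard (majority) voting. It is only an illustration of the rule, not the `VotingClassifier` internals.

```python
import numpy as np

# Hypothetical class probabilities from three models for four samples.
# Columns are P(class 0) and P(class 1).
proba_lgbm = np.array([[0.9, 0.1], [0.4, 0.6], [0.55, 0.45], [0.2, 0.8]])
proba_cat = np.array([[0.8, 0.2], [0.6, 0.4], [0.05, 0.95], [0.3, 0.7]])
proba_xgb = np.array([[0.7, 0.3], [0.3, 0.7], [0.55, 0.45], [0.1, 0.9]])

# Soft voting: average the probabilities, then take the most likely class.
mean_proba = np.mean([proba_lgbm, proba_cat, proba_xgb], axis=0)
soft_pred = mean_proba.argmax(axis=1)

# Hard voting, for contrast: each model votes with its own predicted class.
votes = np.stack([p.argmax(axis=1) for p in (proba_lgbm, proba_cat, proba_xgb)])
hard_pred = (votes.sum(axis=0) >= 2).astype(int)

print(soft_pred)  # [0 1 1 1]
print(hard_pred)  # [0 1 0 1]
```

The third sample shows the difference: two models lean slightly towards class 0, but one model is very confident in class 1, so the averaged probability favours class 1 while the majority vote does not.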

In [17]:
estimator.fit(X_train, y_train)
    
y_pred = estimator.predict(X_val)
score = accuracy_score(y_val, y_pred)
print(score)

0.940724946695096



In [18]:
results = estimator.predict(X_test)
submission = pd.read_csv("/kaggle/input/playground-series-s4e11/sample_submission.csv")
submission['Depression'] = results
submission.to_csv('submission.csv', index=False)

In [19]:
submission.head()

|   | id | Depression |
|---|--------|---|
| 0 | 140700 | 0 |
| 1 | 140701 | 0 |
| 2 | 140702 | 0 |
| 3 | 140703 | 1 |
| 4 | 140704 | 0 |
