<p style="background-color:#33FFAA ; font-family:'Times New Romans'; color:#000000; font-size:200%; text-align:center; border: 3px solid #00EEEE; border-radius:10px; padding: 10px;">Child Mind Institute | Single LightGBM Regressor</p>

### Predicting the Severity Impairment Index (sii) using the HBN (Healthy Brain Network) data

The aim of this [competition](http://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/data) is to predict the Severity Impairment Index (sii), a standard measure for the level of problematic internet use among children and adolescents, based on physical activity data and other features. 

The sii values are derived from `PCIAT-PCIAT_Total`, the sum of scores from the Parent-Child Internet Addiction Test (PCIAT: 20 questions, scored 0-5), which makes sii an ordinal categorical variable with four levels, where the order of categories is meaningful. It is defined as:
- 0: None (PCIAT-PCIAT_Total from 0 to 30)
- 1: Mild (PCIAT-PCIAT_Total from 31 to 49)
- 2: Moderate (PCIAT-PCIAT_Total from 50 to 79)
- 3: Severe (PCIAT-PCIAT_Total 80 and more) 
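
The binning above can be written down directly. The following is a minimal sketch, assuming the bin edges listed above and a maximum total of 100 (20 questions scored 0-5); the names `SII_EDGES` and `pciat_total_to_sii` are illustrative and not part of the competition code.

```python
import numpy as np
import pandas as pd

# Upper-inclusive bin edges taken from the list above: 0-30, 31-49, 50-79, 80+.
# The top edge is 100 because 20 questions scored 0-5 sum to at most 100.
SII_EDGES = [-1, 30, 49, 79, 100]
SII_LABELS = [0, 1, 2, 3]

def pciat_total_to_sii(total: pd.Series) -> pd.Series:
    """Map PCIAT totals to sii levels; missing totals stay missing."""
    binned = pd.cut(total, bins=SII_EDGES, labels=SII_LABELS)
    return binned.astype("float")  # float dtype keeps NaN for missing totals

# Values on both sides of every boundary, plus one missing total.
totals = pd.Series([0, 30, 31, 49, 50, 79, 80, 100, np.nan])
sii = pciat_total_to_sii(totals)
print(sii.tolist())
```

Because sii is a deterministic function of `PCIAT-PCIAT_Total`, any PCIAT column left among the features would reveal the target.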

The test.csv file contains 20 test samples in the correct format to help with developing solutions. The full test set comprises about 3800 instances.

The dataset is divided into two sources:
 * **parquet** files containing the accelerometer (actigraphy) series, and
 * **csv** files containing the remaining tabular data.

Most measures are missing for most participants. In particular, **the target sii is missing for a portion of the participants in the training set**. You may wish to apply unsupervised learning techniques to this data. The sii value is present for all instances in the test set.

For more information about the data, read the data page of the challenge [here](https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/data).

# Loading Libraries

In [1]:
import os
import numpy as np
import pandas as pd

from colorama import Fore, Style
from IPython.display import clear_output
import matplotlib.pyplot as plt
import seaborn as sns

from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor
from scipy.optimize import minimize
import optuna


from sklearn.base import clone
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import cohen_kappa_score

import lightgbm as lgb
from catboost import CatBoostRegressor, CatBoostClassifier
from xgboost import XGBRegressor

import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns = None

SEED = 42
n_splits = 5

<p style="background-color:#33FFAA ; font-family:'Times New Romans'; color:#000000; font-size:170%; text-align:center; border: 3px solid #00EEEE; border-radius:10px; padding: 10px;">Reading Data Files</p>

# Reading Data files

In [2]:
%%time

def process_file(filename, dirname):
    df = pd.read_parquet(os.path.join(dirname, filename, 'part-0.parquet'))
    df.drop('step', axis=1, inplace=True)
    return df.describe().values.reshape(-1), filename.split('=')[1]

def load_time_series(dirname) -> pd.DataFrame:

    ids = os.listdir(dirname)
    
    with ThreadPoolExecutor() as executor:
        results = list(tqdm(
            executor.map(lambda fname: process_file(fname, dirname), ids),
            total=len(ids))
        )
    
    stats, indexes = zip(*results)
    
    df = pd.DataFrame(stats, columns=[f"Stat_{i}" for i in range(len(stats[0]))])
    df['id'] = indexes
    
    return df
    
# Reading data files
data_path = '/kaggle/input/child-mind-institute-problematic-internet-use'
train = pd.read_csv(f'{data_path}/train.csv')
test = pd.read_csv(f'{data_path}/test.csv')
sample = pd.read_csv(f'{data_path}/sample_submission.csv')

train_ts = load_time_series(f'{data_path}/series_train.parquet')
test_ts = load_time_series(f'{data_path}/series_test.parquet')

train_orig = pd.merge(train, train_ts, how="left", on='id')
test_orig = pd.merge(test, test_ts, how="left", on='id')


100%|██████████| 996/996 [01:40<00:00,  9.88it/s]
100%|██████████| 2/2 [00:00<00:00,  7.46it/s]

CPU times: user 4min 25s, sys: 41.6 s, total: 5min 7s
Wall time: 1min 41s
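
The `process_file` function above reduces each participant's series to a single vector by flattening `df.describe()`, which is where the `Stat_i` columns come from. The sketch below reproduces that step on a toy frame to show the ordering of the flattened values. The column names `signal_a` and `signal_b` are illustrative and are not taken from the real parquet schema.

```python
import pandas as pd

# Toy stand-in for one participant's actigraphy file (illustrative columns).
toy = pd.DataFrame({
    "step": [0, 1, 2, 3],
    "signal_a": [1.0, 2.0, 3.0, 4.0],
    "signal_b": [10.0, 10.0, 20.0, 20.0],
})

# Same reduction as process_file: drop 'step', summarise, flatten.
summary = toy.drop(columns="step").describe()
flat = summary.values.reshape(-1)  # row-major: all counts, then all means, ...

# Readable names for the flattened vector, in the same row-major order.
names = [f"{stat}_{col}" for stat in summary.index for col in summary.columns]

print(summary.shape)
print(dict(zip(names[:4], flat[:4])))
```

With this ordering, `Stat_0` is the count of the first column, `Stat_1` the count of the second, and so on through the eight `describe()` statistics.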





In [3]:
# save original data
train = train_orig.copy()
test = test_orig.copy()

<p style="background-color:#33FFAA ; font-family:'Times New Romans'; color:#000000; font-size:170%; text-align:center; border: 3px solid #00EEEE; border-radius:10px; padding: 10px;">Basic Preprocess</p>

# Basic preprocessing

In [4]:
pciat_Cols = [col for col in train.columns if 'PCIAT' in col]
train = train.drop(pciat_Cols, axis=1)

train.shape, test.shape

((3960, 156), (20, 155))

In [5]:
# Prepare feature values

train = train.dropna(subset='sii')

cat_Cols = [col for col in train.columns if 'Season' in col]

def update(df):
    for c in cat_Cols:
        if df[c].dtype.name == 'category':
            # Add 'Missing' to the categories if it's not already present
            if 'Missing' not in df[c].cat.categories:
                df[c] = df[c].cat.add_categories('Missing')

        # Fill missing values with 'Missing'
        df[c] = df[c].fillna('Missing')

        # Ensure the column is of 'category' dtype
        df[c] = df[c].astype('category')
    return df


train = update(train)
test = update(test)

# Encode each season column as integers. One mapping per column is built from
# the train values (in order of first appearance) and extended with any value
# seen only in test, so a given season gets the same code in both frames.

def create_mapping(column, *datasets):
    unique_values = pd.concat([d[column].astype(str) for d in datasets]).unique()
    return {value: idx for idx, value in enumerate(unique_values)}
    
for col in cat_Cols:
    mapping = create_mapping(col, train, test)

    train[col] = train[col].astype(str).map(mapping).astype(int)
    test[col] = test[col].astype(str).map(mapping).astype(int)

print(f'Train Shape : {train.shape} || Test Shape : {test.shape}')

Train Shape : (2736, 156) || Test Shape : (20, 155)
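
Another way to integer-encode categorical columns is to give train and test one shared category list and read the codes off a pandas `CategoricalDtype`. The following is a minimal sketch on toy frames, not the notebook's own method; the helper name `encode_shared` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy frames; the season values are illustrative.
toy_train = pd.DataFrame({"Season": ["Spring", None, "Fall", "Spring"]})
toy_test = pd.DataFrame({"Season": ["Fall", "Winter", None]})

def encode_shared(train_col: pd.Series, test_col: pd.Series):
    """Encode two columns with one category list so the codes line up."""
    train_filled = train_col.fillna("Missing")
    test_filled = test_col.fillna("Missing")
    # The sorted union of observed values fixes the code of each category.
    categories = sorted(set(train_filled) | set(test_filled))
    dtype = pd.CategoricalDtype(categories=categories)
    return (
        train_filled.astype(dtype).cat.codes,
        test_filled.astype(dtype).cat.codes,
        categories,
    )

train_codes, test_codes, categories = encode_shared(
    toy_train["Season"], toy_test["Season"]
)
print(categories)
print(train_codes.tolist(), test_codes.tolist())
```

Since both columns share one dtype, a value such as `"Fall"` receives the same code wherever it appears, regardless of the order in which values occur in each frame.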


<p style="background-color:#33FFAA ; font-family:'Times New Romans'; color:#000000; font-size:170%; text-align:center; border: 3px solid #00EEEE; border-radius:10px; padding: 10px;">Modeling | Single LightGBM Regressor</p>

# Modeling and Training

In [6]:
train = train.drop('id', axis=1)
test_id = test['id'].copy()
test = test.drop('id', axis=1)

In [7]:
%%time
# Functions for training and evaluating the selected model

def quadratic_weighted_kappa(y_true, y_pred):
    return cohen_kappa_score(y_true, y_pred, weights='quadratic')

def threshold_Rounder(oof_non_rounded, thresholds):
    return np.where(oof_non_rounded < thresholds[0], 0,
                    np.where(oof_non_rounded < thresholds[1], 1,
                             np.where(oof_non_rounded < thresholds[2], 2, 3)))

def evaluate_predictions(thresholds, y_true, oof_non_rounded):
    rounded_p = threshold_Rounder(oof_non_rounded, thresholds)
    return -quadratic_weighted_kappa(y_true, rounded_p)

def TrainML(model_class, test_data):
    
    X = train.drop(['sii'], axis=1)
    y = train['sii']

    SKF = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
    
    train_S = []
    test_S = []
    
    oof_non_rounded = np.zeros(len(y), dtype=float) 
    oof_rounded = np.zeros(len(y), dtype=int) 
    test_preds = np.zeros((len(test_data), n_splits))

    for fold, (train_idx, test_idx) in enumerate(tqdm(SKF.split(X, y), desc="Training Folds", total=n_splits)):
        X_train, X_val = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[test_idx]

        model = clone(model_class)
        model.fit(X_train, y_train)

        y_train_pred = model.predict(X_train)
        y_val_pred = model.predict(X_val)

        oof_non_rounded[test_idx] = y_val_pred
        y_val_pred_rounded = y_val_pred.round(0).astype(int)
        oof_rounded[test_idx] = y_val_pred_rounded

        train_kappa = quadratic_weighted_kappa(y_train, y_train_pred.round(0).astype(int))
        val_kappa = quadratic_weighted_kappa(y_val, y_val_pred_rounded)

        train_S.append(train_kappa)
        test_S.append(val_kappa)
        
        test_preds[:, fold] = model.predict(test_data)
        
        print(f"Fold {fold+1} - Train QWK: {train_kappa:.4f}, Validation QWK: {val_kappa:.4f}")
        clear_output(wait=True)

    print(f"Mean Train QWK --> {np.mean(train_S):.4f}")
    print(f"Mean Validation QWK ---> {np.mean(test_S):.4f}")

    KappaOPtimizer = minimize(evaluate_predictions,
                              x0=[0.5, 1.5, 2.5], args=(y, oof_non_rounded), 
                              method='Nelder-Mead') # Nelder-Mead | # Powell
    assert KappaOPtimizer.success, "Optimization did not converge."
    
    oof_tuned = threshold_Rounder(oof_non_rounded, KappaOPtimizer.x)
    tKappa = quadratic_weighted_kappa(y, oof_tuned)

    print(f"----> || Optimized QWK SCORE :: {Fore.CYAN}{Style.BRIGHT} {tKappa:.3f}{Style.RESET_ALL}")

    tpm = test_preds.mean(axis=1)
    tpTuned = threshold_Rounder(tpm, KappaOPtimizer.x)
    
    submission = pd.DataFrame({
        'id': test_id,     #sample['id'],
        'sii': tpTuned
    })

    return submission, model

CPU times: user 7 µs, sys: 1e+03 ns, total: 8 µs
Wall time: 14.1 µs
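
The threshold search used in `TrainML` can be tried in isolation. The sketch below runs the same idea (Nelder-Mead over three cut points, scored by quadratic weighted kappa) on synthetic predictions. The data is randomly generated for illustration only, and `round_with_thresholds` uses `np.digitize` on sorted thresholds, which is a substitute for the nested `np.where` above and matches it whenever the thresholds are increasing.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import cohen_kappa_score

def round_with_thresholds(pred, thresholds):
    # digitize counts how many thresholds each prediction reaches or exceeds,
    # which is the class index; sorting keeps the bins monotonic.
    return np.digitize(pred, np.sort(thresholds))

def neg_qwk(thresholds, y_true, pred):
    rounded = round_with_thresholds(pred, thresholds)
    return -cohen_kappa_score(y_true, rounded, weights="quadratic")

rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=400)
# Synthetic "regressor output": labels shrunk toward the middle plus noise,
# mimicking a regressor that rarely predicts the extreme classes.
pred = 0.6 * y_true + 0.6 + rng.normal(0, 0.4, size=400)

default_t = np.array([0.5, 1.5, 2.5])  # equivalent to plain rounding on [0, 3]
result = minimize(neg_qwk, x0=default_t, args=(y_true, pred), method="Nelder-Mead")

qwk_default = -neg_qwk(default_t, y_true, pred)
qwk_tuned = -neg_qwk(result.x, y_true, pred)
print(f"default thresholds: {qwk_default:.3f}  tuned thresholds: {qwk_tuned:.3f}")
```

The objective is piecewise constant in the thresholds, so a derivative-free method such as Nelder-Mead is a natural fit. Thresholds tuned on out-of-fold predictions can still overfit them, so the tuned score is an optimistic estimate.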


In [8]:
# test_id == sample['id']

In [9]:
%%time

#Train and predict sii for test data 

LGB_Params = {
    'learning_rate': 0.01, 
    'random_state': SEED, 
    'n_estimators': 200,
    'max_depth': 15, 
    'num_leaves': 300, 
    'min_data_in_leaf': 30,
    'feature_fraction': 0.7689, 
    'bagging_fraction': 0.6879, 
    'bagging_freq': 2, 
    'lambda_l1': 4.74, 
    'lambda_l2': 4.743e-06,
    'verbose': -1,
    # CV : 0.4094 | LB : 0.471
}

Model = lgb.LGBMRegressor(**LGB_Params)

Submission, model = TrainML(Model,test)

Training Folds: 100%|██████████| 5/5 [00:11<00:00,  2.37s/it]

Mean Train QWK --> 0.4945
Mean Validation QWK ---> 0.3537





----> || Optimized QWK SCORE ::  0.464
CPU times: user 10.1 s, sys: 1.2 s, total: 11.3 s
Wall time: 12.6 s


In [10]:
%%time

#feature_importance_df = pd.DataFrame({
#     'Feature': model.booster_.feature_name(),
#     'Importance': model.booster_.feature_importance(importance_type='gain')
#})

# feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# plt.figure(figsize=(20, 40))
# sns.barplot(x='Importance', y='Feature', data=feature_importance_df.head(100)) 
# plt.title("Top Feature Importance")
# plt.show()

CPU times: user 7 µs, sys: 2 µs, total: 9 µs
Wall time: 15.7 µs


In [11]:
%%time

Submission.to_csv('submission.csv', index=False)
print(Submission['sii'].value_counts())

sii
0    12
1     8
Name: count, dtype: int64
CPU times: user 3.86 ms, sys: 2.99 ms, total: 6.85 ms
Wall time: 10.8 ms
