# **Exploring Factors of Problematic Internet Use in Children**

In today's digital age, problematic internet use among children and adolescents is a growing concern, often linked to mental health issues like depression and anxiety. Current methods for measuring problematic internet use are complex and require professional assessments, creating barriers for many families.

Physical fitness indicators, such as posture, diet, and activity levels, are more accessible and can serve as proxies for detecting problematic internet use. Changes in these habits are commonly observed in excessive technology users.

This competition challenges us to build a predictive model that uses children's physical activity data to detect early signs of problematic internet use, enabling timely interventions and promoting healthier digital habits. The aim is to help children navigate the digital landscape responsibly and lead healthier lives.

# **Importing important libraries**

In [None]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.7


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
#from ydata_profiling import ProfileReport
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
import warnings
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
#from sklearn.preprocessing import FunctionTransformer
import dask.dataframe as dd
import polars as pl
from pathlib import Path
from tqdm import tqdm
import pyarrow as pa
import pyarrow.parquet as pq
from scipy import stats
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import cohen_kappa_score
pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')

# **Loading the train, test and submission CSV files**

In [None]:
submission= pd.read_csv("/content/drive/MyDrive/comptetion/child-mind-institute-problematic-internet-use/sample_submission.csv")
submission.head(2)

Unnamed: 0,id,prediction
0,00008ff9,2.0
1,000fd460,0.0


In [None]:
X_test = pd.read_csv('/content/drive/MyDrive/comptetion/child-mind-institute-problematic-internet-use/test.csv')
print(X_test.head(2))
print(X_test.shape)
print(X_test.columns)

         id Basic_Demos-Enroll_Season  Basic_Demos-Age  Basic_Demos-Sex  \
0  00008ff9                      Fall                5                0   
1  000fd460                    Summer                9                0   

  CGAS-Season  CGAS-CGAS_Score Physical-Season  Physical-BMI  Physical-Height  \
0      Winter             51.0            Fall     16.877316             46.0   
1         NaN              NaN            Fall     14.035590             48.0   

   Physical-Weight  Physical-Waist_Circumference  Physical-Diastolic_BP  \
0             50.8                           NaN                    NaN   
1             46.0                          22.0                   75.0   

   Physical-HeartRate  Physical-Systolic_BP Fitness_Endurance-Season  \
0                 NaN                   NaN                      NaN   
1                70.0                 122.0                      NaN   

   Fitness_Endurance-Max_Stage  Fitness_Endurance-Time_Mins  \
0                       

In [None]:
missing_values = X_test.isnull().sum()

In [None]:
X_test.describe().style.background_gradient(cmap='viridis')

Unnamed: 0,Basic_Demos-Age,Basic_Demos-Sex,CGAS-CGAS_Score,Physical-BMI,Physical-Height,Physical-Weight,Physical-Waist_Circumference,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,Fitness_Endurance-Max_Stage,Fitness_Endurance-Time_Mins,Fitness_Endurance-Time_Sec,FGC-FGC_CU,FGC-FGC_CU_Zone,FGC-FGC_GSND,FGC-FGC_GSND_Zone,FGC-FGC_GSD,FGC-FGC_GSD_Zone,FGC-FGC_PU,FGC-FGC_PU_Zone,FGC-FGC_SRL,FGC-FGC_SRL_Zone,FGC-FGC_SRR,FGC-FGC_SRR_Zone,FGC-FGC_TL,FGC-FGC_TL_Zone,BIA-BIA_Activity_Level_num,BIA-BIA_BMC,BIA-BIA_BMI,BIA-BIA_BMR,BIA-BIA_DEE,BIA-BIA_ECW,BIA-BIA_FFM,BIA-BIA_FFMI,BIA-BIA_FMI,BIA-BIA_Fat,BIA-BIA_Frame_num,BIA-BIA_ICW,BIA-BIA_LDM,BIA-BIA_LST,BIA-BIA_SMM,BIA-BIA_TBW,PAQ_A-PAQ_A_Total,PAQ_C-PAQ_C_Total,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-computerinternet_hoursday
count,20.0,20.0,8.0,13.0,13.0,13.0,5.0,11.0,12.0,11.0,3.0,3.0,3.0,13.0,13.0,5.0,5.0,5.0,5.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,1.0,9.0,10.0,10.0,16.0
mean,10.75,0.4,62.5,19.835939,52.961538,79.2,25.4,70.545455,81.666667,117.545455,5.0,7.0,34.0,8.692308,0.461538,16.16,1.6,16.74,1.6,4.0,0.153846,7.5,0.538462,7.961538,0.615385,7.961538,0.692308,2.625,3.63636,19.284788,1111.248,1886.9125,16.681051,60.625612,14.432937,4.851857,21.79939,1.625,28.48675,15.457795,56.989275,25.985962,45.167825,1.04,2.372333,36.8,52.3,1.4375
std,3.725799,0.502625,11.275764,4.927625,6.942357,23.632181,3.130495,18.806189,9.316001,21.262002,1.0,2.0,2.645751,7.899205,0.518875,4.879857,0.547723,3.990363,0.547723,5.627314,0.375534,4.0,0.518875,4.436879,0.50637,3.152126,0.480384,1.06066,0.898087,4.876077,143.724879,486.140935,7.651128,15.308597,1.227543,3.728203,19.920902,0.517549,5.099449,4.021153,14.490362,7.479799,11.94,,1.080099,5.533735,7.02456,1.152895
min,5.0,0.0,50.0,14.03559,37.5,46.0,22.0,57.0,70.0,95.0,4.0,5.0,32.0,0.0,0.0,10.2,1.0,11.1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,2.0,2.57949,14.0371,932.498,1492.0,6.01993,41.5862,12.8254,1.21172,3.97085,1.0,21.0352,8.89536,38.9177,15.4107,27.0552,1.04,1.1,27.0,40.0,0.0
25%,9.0,0.0,51.0,16.861286,48.0,60.2,24.0,60.5,74.5,102.5,4.5,6.0,32.5,3.0,0.0,12.6,1.0,14.7,1.0,0.0,0.0,7.0,0.0,6.0,0.0,6.0,0.0,2.0,2.7299,16.875175,986.4665,1503.12,13.423195,47.334825,13.765575,3.15341,10.625893,1.0,24.230725,13.8154,44.62725,19.801775,37.245575,1.04,1.27,33.5,47.75,0.0
50%,10.0,0.0,63.0,18.292347,55.0,81.6,24.0,63.0,80.0,116.0,5.0,7.0,33.0,6.0,0.0,16.5,2.0,17.9,2.0,2.0,0.0,8.0,1.0,9.5,1.0,7.0,1.0,2.0,3.81231,17.78405,1133.645,1852.72,15.96,63.01135,14.0819,3.73714,17.53585,2.0,29.4704,16.40245,59.19905,26.33775,46.60885,1.04,2.34,37.5,53.5,2.0
75%,12.25,1.0,71.0,21.079065,57.75,85.6,27.0,73.0,90.25,119.5,5.5,8.0,35.0,12.0,1.0,19.2,2.0,18.4,2.0,6.0,0.0,10.5,1.0,11.0,1.0,11.0,1.0,3.0,4.125535,20.017525,1194.895,1941.6925,20.450875,69.5351,14.939925,5.077595,22.444175,2.0,31.398725,17.674625,65.22205,30.4211,51.860475,1.04,3.02,39.75,55.75,2.0
max,19.0,1.0,80.0,30.094649,60.0,121.6,30.0,123.0,97.0,163.0,6.0,9.0,37.0,24.0,1.0,22.3,2.0,21.6,2.0,20.0,1.0,12.0,1.0,15.0,1.0,12.5,1.0,5.0,5.08025,30.1865,1330.97,2974.71,30.2124,84.0285,16.6877,13.4988,67.9715,2.0,36.0572,20.902,79.6982,36.2232,63.1265,1.04,4.11,46.0,64.0,3.0


In [None]:
X_train= pd.read_csv('/content/drive/MyDrive/comptetion/child-mind-institute-problematic-internet-use/train.csv')
X_train.head(2)
print(X_train.shape)
print(X_train.columns)


(3960, 82)
Index(['id', 'Basic_Demos-Enroll_Season', 'Basic_Demos-Age', 'Basic_Demos-Sex',
       'CGAS-Season', 'CGAS-CGAS_Score', 'Physical-Season', 'Physical-BMI',
       'Physical-Height', 'Physical-Weight', 'Physical-Waist_Circumference',
       'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP',
       'Fitness_Endurance-Season', 'Fitness_Endurance-Max_Stage',
       'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec',
       'FGC-Season', 'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND',
       'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_PU',
       'FGC-FGC_PU_Zone', 'FGC-FGC_SRL', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR',
       'FGC-FGC_SRR_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone', 'BIA-Season',
       'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI',
       'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW', 'BIA-BIA_FFM',
       'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num',
       'BIA-BIA_ICW', 'BIA-BIA_LDM', 'B

**Since the target `sii` is derived from the PCIAT questionnaire, I am dropping all PCIAT-related features from the training dataset: keeping them would leak the target. This also keeps the training columns consistent with the provided test.csv file, which does not contain them.**



In [None]:
# Drop the PCIAT questionnaire columns, which are the source of the target.
X_train = X_train.drop(
    columns=[
        'PCIAT-Season', 'PCIAT-PCIAT_01', 'PCIAT-PCIAT_02', 'PCIAT-PCIAT_03',
        'PCIAT-PCIAT_04', 'PCIAT-PCIAT_05', 'PCIAT-PCIAT_06', 'PCIAT-PCIAT_07',
        'PCIAT-PCIAT_08', 'PCIAT-PCIAT_09', 'PCIAT-PCIAT_10', 'PCIAT-PCIAT_11',
        'PCIAT-PCIAT_12', 'PCIAT-PCIAT_13', 'PCIAT-PCIAT_14', 'PCIAT-PCIAT_15',
        'PCIAT-PCIAT_16', 'PCIAT-PCIAT_17', 'PCIAT-PCIAT_18', 'PCIAT-PCIAT_19',
        'PCIAT-PCIAT_20', 'PCIAT-PCIAT_Total'
    ]
)

X_train.shape


(3960, 60)
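
The explicit column list above can also be built from the shared `PCIAT-` prefix. The following is a minimal sketch on a toy frame: the values are made up for illustration, and only the column-name prefix is taken from the notebook.

```python
import pandas as pd

# Toy frame with made-up values; only the "PCIAT-" prefix mirrors the real data.
toy = pd.DataFrame({
    "id": ["a", "b"],
    "Basic_Demos-Age": [5, 9],
    "PCIAT-Season": ["Fall", "Summer"],
    "PCIAT-PCIAT_01": [1.0, 0.0],
    "PCIAT-PCIAT_Total": [55.0, 0.0],
    "sii": [2.0, 0.0],
})

# Collect every questionnaire column by prefix instead of typing the names out.
pciat_cols = [col for col in toy.columns if col.startswith("PCIAT-")]
toy_reduced = toy.drop(columns=pciat_cols)

print(pciat_cols)
print(toy_reduced.columns.tolist())
```

Selecting by prefix avoids silently missing a column if the questionnaire block changes, at the cost of being less explicit than the hand-written list.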

In [None]:
X_train.columns

Index(['id', 'Basic_Demos-Enroll_Season', 'Basic_Demos-Age', 'Basic_Demos-Sex',
       'CGAS-Season', 'CGAS-CGAS_Score', 'Physical-Season', 'Physical-BMI',
       'Physical-Height', 'Physical-Weight', 'Physical-Waist_Circumference',
       'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP',
       'Fitness_Endurance-Season', 'Fitness_Endurance-Max_Stage',
       'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec',
       'FGC-Season', 'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND',
       'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_PU',
       'FGC-FGC_PU_Zone', 'FGC-FGC_SRL', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR',
       'FGC-FGC_SRR_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone', 'BIA-Season',
       'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI',
       'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW', 'BIA-BIA_FFM',
       'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num',
       'BIA-BIA_ICW', 'BIA-BIA_LDM', 'BIA-BIA_LST'

In [None]:
missing_values = X_train.isnull().sum()

missing_values

Unnamed: 0,0
id,0
Basic_Demos-Enroll_Season,0
Basic_Demos-Age,0
Basic_Demos-Sex,0
CGAS-Season,1405
CGAS-CGAS_Score,1539
Physical-Season,650
Physical-BMI,938
Physical-Height,933
Physical-Weight,884


In [None]:
X_train.describe().style.background_gradient(cmap='viridis')

Unnamed: 0,Basic_Demos-Age,Basic_Demos-Sex,CGAS-CGAS_Score,Physical-BMI,Physical-Height,Physical-Weight,Physical-Waist_Circumference,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,Fitness_Endurance-Max_Stage,Fitness_Endurance-Time_Mins,Fitness_Endurance-Time_Sec,FGC-FGC_CU,FGC-FGC_CU_Zone,FGC-FGC_GSND,FGC-FGC_GSND_Zone,FGC-FGC_GSD,FGC-FGC_GSD_Zone,FGC-FGC_PU,FGC-FGC_PU_Zone,FGC-FGC_SRL,FGC-FGC_SRL_Zone,FGC-FGC_SRR,FGC-FGC_SRR_Zone,FGC-FGC_TL,FGC-FGC_TL_Zone,BIA-BIA_Activity_Level_num,BIA-BIA_BMC,BIA-BIA_BMI,BIA-BIA_BMR,BIA-BIA_DEE,BIA-BIA_ECW,BIA-BIA_FFM,BIA-BIA_FFMI,BIA-BIA_FMI,BIA-BIA_Fat,BIA-BIA_Frame_num,BIA-BIA_ICW,BIA-BIA_LDM,BIA-BIA_LST,BIA-BIA_SMM,BIA-BIA_TBW,PAQ_A-PAQ_A_Total,PAQ_C-PAQ_C_Total,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-computerinternet_hoursday,sii
count,3960.0,3960.0,2421.0,3022.0,3027.0,3076.0,898.0,2954.0,2967.0,2954.0,743.0,740.0,740.0,2322.0,2282.0,1074.0,1062.0,1074.0,1063.0,2310.0,2271.0,2305.0,2267.0,2307.0,2269.0,2324.0,2285.0,1991.0,1991.0,1991.0,1991.0,1991.0,1991.0,1991.0,1991.0,1991.0,1991.0,1991.0,1991.0,1991.0,1991.0,1991.0,1991.0,475.0,1721.0,2609.0,2606.0,3301.0,2736.0
mean,10.433586,0.372727,65.454771,19.331929,55.946713,89.038615,27.278508,69.648951,81.597236,116.983074,4.989233,7.37027,27.581081,11.25969,0.476337,22.420438,1.829567,23.518622,1.904045,5.579654,0.330251,8.694924,0.61888,8.805635,0.620097,9.252775,0.785558,2.651431,6.719826,19.367048,1237.018187,2064.693747,20.825346,74.021708,15.030554,4.336495,16.85502,1.745354,33.17338,20.02299,67.301883,34.389466,53.998726,2.178853,2.58955,41.088923,57.763622,1.060588,0.580409
std,3.574648,0.483591,22.341862,5.113934,7.473764,44.56904,5.567287,13.611226,13.665196,17.061225,2.014072,3.189662,17.707751,11.807781,0.499549,10.833995,0.612585,11.148951,0.612344,7.390161,0.470407,3.429301,0.485769,3.422167,0.485469,2.988863,0.410525,1.028267,92.586325,5.047848,1872.383246,2836.246272,73.266287,199.433753,5.792505,6.356402,199.372119,0.680635,56.272346,70.21561,108.705918,84.050607,129.362539,0.849476,0.783937,10.427433,13.196091,1.094875,0.771122
min,5.0,0.0,25.0,0.0,33.0,0.0,18.0,0.0,27.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-7.78961,0.048267,813.397,1073.45,1.78945,28.9004,7.86485,-194.163,-8745.08,1.0,14.489,4.63581,23.6201,4.65573,20.5892,0.66,0.58,17.0,38.0,0.0,0.0
25%,8.0,0.0,59.0,15.86935,50.0,57.2,23.0,61.0,72.0,107.0,4.0,6.0,12.75,3.0,0.0,15.1,1.0,16.2,2.0,0.0,0.0,7.0,0.0,7.0,0.0,7.0,1.0,2.0,2.966905,15.9136,1004.71,1605.785,11.10955,49.2781,13.408,2.306915,8.602395,1.0,24.4635,12.98315,45.2041,21.14155,35.887,1.49,2.02,33.0,47.0,0.0,0.0
50%,10.0,0.0,65.0,17.937682,55.0,77.0,26.0,68.0,81.0,114.0,5.0,7.0,28.0,9.0,0.0,20.05,2.0,21.2,2.0,3.0,0.0,9.0,1.0,9.0,1.0,10.0,1.0,3.0,3.92272,17.9665,1115.38,1863.98,15.928,61.0662,14.0925,3.69863,16.1746,2.0,28.8558,16.4388,56.9964,27.4151,44.987,2.01,2.54,39.0,55.0,1.0,0.0
75%,13.0,1.0,75.0,21.571244,62.0,113.8,30.0,76.0,90.5,125.0,6.0,9.0,43.0,15.75,1.0,26.6,2.0,28.175,2.0,9.0,1.0,11.0,1.0,11.0,1.0,12.0,1.0,3.0,5.460925,21.4611,1310.36,2218.145,25.1622,81.8338,15.43095,5.98769,30.2731,2.0,35.4757,22.1676,77.10565,38.1794,60.27105,2.78,3.16,46.0,64.0,2.0,1.0
max,22.0,1.0,999.0,59.132048,78.5,315.0,50.0,179.0,138.0,203.0,28.0,20.0,59.0,115.0,1.0,124.0,3.0,123.8,3.0,51.0,1.0,21.7,1.0,21.0,1.0,22.0,1.0,5.0,4115.36,53.9243,83152.2,124728.0,3233.0,8799.08,217.771,28.2515,153.82,3.0,2457.91,3108.17,4683.71,3607.69,5690.91,4.71,4.79,96.0,100.0,3.0,3.0


In [None]:
X= X_train.drop(columns=['sii','id'])
y= X_train['sii']

**I load the per-participant time series parquet files with Polars and summarise each file with its `describe()` statistics, which keeps memory usage manageable for the large raw series. After building these features, I merge them with the main CSV data on the ID column so that each participant's tabular and time series features are aligned for further analysis.**



In [None]:
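def get_ts_feature(id_path):
    """Summarise one participant's time series as a flat vector of describe() statistics."""
    df = pl.read_parquet(id_path / 'part-0.parquet')
    # Keep every summary row except the counts, for every column except `step`.
    ts_feature = df.describe().filter(
        ~pl.col("statistic").is_in(["count", "null_count"])
    ).select(
        pl.all().exclude(["statistic", "step"])
    ).to_numpy().reshape(-1)

    # The participant id is the part of the directory name after "=".
    patient_id = id_path.name.split("=")[1]

    return ts_feature, patient_id


def get_all_ts_feature(parquet_dir) -> pd.DataFrame:
    """Build one row of time series features per participant directory."""
    items = list(Path(parquet_dir).iterdir())
    features = []
    ids = []

    for id_path in tqdm(items):
        feature, patient_id = get_ts_feature(id_path)
        features.append(feature)
        ids.append(patient_id)

    # Columns are named by position only: stat_0, stat_1, ...
    columns = [f"stat_{i}" for i in range(len(features[0]))]
    df = pd.DataFrame(features, columns=columns, index=ids)

    return df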

train_df = get_all_ts_feature("/content/drive/MyDrive/comptetion/child-mind-institute-problematic-internet-use/series_train.parquet")
print(train_df.head(1))

test_df = get_all_ts_feature("/content/drive/MyDrive/comptetion/child-mind-institute-problematic-internet-use/series_test.parquet")
print(test_df.head(1))








100%|██████████| 996/996 [03:09<00:00,  5.27it/s]


            stat_0    stat_1    stat_2    stat_3     stat_4    stat_5  \
cefdb7fe  0.047866  0.003234 -0.249981  0.023465 -18.722479  0.216525   

             stat_6       stat_7        stat_8    stat_9   stat_10    stat_11  \
cefdb7fe  68.818008  3841.463867  4.323937e+13  3.809581  2.475539  24.915834   

           stat_12   stat_13   stat_14   stat_15    stat_16   stat_17  \
cefdb7fe  0.523529  0.441043  0.646242  0.052369  49.600792  0.407972   

             stat_18     stat_19       stat_20   stat_21   stat_22   stat_23  \
cefdb7fe  278.520935  165.178589  2.500622e+13  1.971711  0.499402  6.899615   

           stat_24   stat_25   stat_26  stat_27    stat_28  stat_29  stat_30  \
cefdb7fe -1.777734 -2.433394 -1.005808      0.0 -89.819664      0.0      0.0   

              stat_31  stat_32  stat_33  stat_34  stat_35   stat_36   stat_37  \
cefdb7fe  3098.166748      0.0      1.0      2.0     13.0 -0.266008 -0.277724   

           stat_38   stat_39   stat_40  stat_41  stat_42  

100%|██████████| 2/2 [00:00<00:00,  7.74it/s]

            stat_0    stat_1   stat_2    stat_3     stat_4  stat_5    stat_6  \
00115b9f -0.316384  0.016009 -0.16789  0.047388 -10.580416     0.0  42.29631   

               stat_7        stat_8    stat_9  stat_10    stat_11   stat_12  \
00115b9f  4053.578857  5.046215e+13  4.470182      3.0  53.201683  0.453665   

           stat_13  stat_14   stat_15   stat_16  stat_17     stat_18  \
00115b9f  0.502702  0.58571  0.106351  42.94717      0.0  208.168976   

             stat_19       stat_20   stat_21  stat_22    stat_23   stat_24  \
00115b9f  112.404037  1.942842e+13  1.931421      0.0  14.244915 -1.746094   

           stat_25   stat_26  stat_27    stat_28  stat_29  stat_30  stat_31  \
00115b9f -2.905339 -1.048372      0.0 -89.833092      0.0      0.0   3824.0   

               stat_32  stat_33  stat_34  stat_35   stat_36  stat_37  \
00115b9f  5.500000e+10      1.0      3.0     41.0 -0.684193 -0.30987   

           stat_38   stat_39    stat_40  stat_41   stat_42      stat_43  \
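
The `stat_i` names carry no meaning by themselves, so it helps to see how the flattening in `get_ts_feature` orders them. The sketch below is a pandas analogue, not the Polars code used above, and the columns `X` and `Y` are hypothetical stand-ins for the real channels. Because the statistics table is flattened row by row, all means come first, then all standard deviations, and so on.

```python
import pandas as pd

# Hypothetical two-channel series standing in for one participant's parquet file.
toy_series = pd.DataFrame({
    "step": [0, 1, 2, 3],
    "X": [0.0, 1.0, 2.0, 3.0],
    "Y": [10.0, 10.0, 20.0, 20.0],
})

# Summary statistics per column, without the `step` column and the count row.
summary = toy_series.drop(columns=["step"]).describe().drop(index=["count"])

# Row-major flattening: one block per statistic, one entry per column in each block.
flat = summary.to_numpy().reshape(-1)
names = [f"{stat}_{col}" for stat in summary.index for col in summary.columns]

features = pd.Series(flat, index=names)
print(features)
```

Naming features as `<statistic>_<column>` instead of `stat_i` would make later feature-importance plots easier to read, under the assumption that every file has the same columns in the same order.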




In [None]:
# Merge the tabular data with the time series features on the participant id

df_train = pd.merge(X_train, train_df, how='left', left_on='id', right_index=True)
df_test = pd.merge(X_test, test_df, how='left', left_on='id', right_index=True)
df_train.shape, df_test.shape


((3960, 144), (20, 143))
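
Only 996 of the 3,960 training participants have a time series directory, so the left merge leaves the `stat_*` columns empty for the other 2,964 rows. A small sketch of the same merge pattern on toy data (the frames and values here are made up) shows how `indicator=True` can be used to check how many rows found a match:

```python
import pandas as pd

# Toy tabular data and toy time series features (values are made up).
tabular = pd.DataFrame({"id": ["a", "b", "c"], "age": [5, 9, 12]})
ts_features = pd.DataFrame({"stat_0": [0.1, 0.3]}, index=["a", "c"])

# Left merge on the id column against the feature frame's index;
# indicator=True adds a `_merge` column recording which rows matched.
merged = pd.merge(
    tabular, ts_features, how="left",
    left_on="id", right_index=True, indicator=True,
)

print(merged)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are participants without a time series; their `stat_*` values stay NaN until the imputation step.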

In [None]:
df_train.head(1)

Unnamed: 0,id,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,Physical-Waist_Circumference,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,Fitness_Endurance-Season,Fitness_Endurance-Max_Stage,Fitness_Endurance-Time_Mins,Fitness_Endurance-Time_Sec,FGC-Season,FGC-FGC_CU,FGC-FGC_CU_Zone,FGC-FGC_GSND,FGC-FGC_GSND_Zone,FGC-FGC_GSD,FGC-FGC_GSD_Zone,FGC-FGC_PU,FGC-FGC_PU_Zone,FGC-FGC_SRL,FGC-FGC_SRL_Zone,FGC-FGC_SRR,FGC-FGC_SRR_Zone,FGC-FGC_TL,FGC-FGC_TL_Zone,BIA-Season,BIA-BIA_Activity_Level_num,BIA-BIA_BMC,BIA-BIA_BMI,BIA-BIA_BMR,BIA-BIA_DEE,BIA-BIA_ECW,BIA-BIA_FFM,BIA-BIA_FFMI,BIA-BIA_FMI,BIA-BIA_Fat,BIA-BIA_Frame_num,BIA-BIA_ICW,BIA-BIA_LDM,BIA-BIA_LST,BIA-BIA_SMM,BIA-BIA_TBW,PAQ_A-Season,PAQ_A-PAQ_A_Total,PAQ_C-Season,PAQ_C-PAQ_C_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday,sii,stat_0,stat_1,stat_2,stat_3,stat_4,stat_5,stat_6,stat_7,stat_8,stat_9,stat_10,stat_11,stat_12,stat_13,stat_14,stat_15,stat_16,stat_17,stat_18,stat_19,stat_20,stat_21,stat_22,stat_23,stat_24,stat_25,stat_26,stat_27,stat_28,stat_29,stat_30,stat_31,stat_32,stat_33,stat_34,stat_35,stat_36,stat_37,stat_38,stat_39,stat_40,stat_41,stat_42,stat_43,stat_44,stat_45,stat_46,stat_47,stat_48,stat_49,stat_50,stat_51,stat_52,stat_53,stat_54,stat_55,stat_56,stat_57,stat_58,stat_59,stat_60,stat_61,stat_62,stat_63,stat_64,stat_65,stat_66,stat_67,stat_68,stat_69,stat_70,stat_71,stat_72,stat_73,stat_74,stat_75,stat_76,stat_77,stat_78,stat_79,stat_80,stat_81,stat_82,stat_83
0,00008ff9,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,,,,,,,,,Fall,0.0,0.0,,,,,0.0,0.0,7.0,0.0,6.0,0.0,6.0,1.0,Fall,2.0,2.66855,16.8792,932.498,1492.0,8.25598,41.5862,13.8177,3.06143,9.21377,1.0,24.4349,8.89536,38.9177,19.5413,32.6909,,,,,,,,Fall,3.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
df_train.columns

Index(['id', 'Basic_Demos-Enroll_Season', 'Basic_Demos-Age', 'Basic_Demos-Sex',
       'CGAS-Season', 'CGAS-CGAS_Score', 'Physical-Season', 'Physical-BMI',
       'Physical-Height', 'Physical-Weight',
       ...
       'stat_74', 'stat_75', 'stat_76', 'stat_77', 'stat_78', 'stat_79',
       'stat_80', 'stat_81', 'stat_82', 'stat_83'],
      dtype='object', length=144)

In [None]:
df_train.head(2)

Unnamed: 0,id,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,Physical-Waist_Circumference,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,Fitness_Endurance-Season,Fitness_Endurance-Max_Stage,Fitness_Endurance-Time_Mins,Fitness_Endurance-Time_Sec,FGC-Season,FGC-FGC_CU,FGC-FGC_CU_Zone,FGC-FGC_GSND,FGC-FGC_GSND_Zone,FGC-FGC_GSD,FGC-FGC_GSD_Zone,FGC-FGC_PU,FGC-FGC_PU_Zone,FGC-FGC_SRL,FGC-FGC_SRL_Zone,FGC-FGC_SRR,FGC-FGC_SRR_Zone,FGC-FGC_TL,FGC-FGC_TL_Zone,BIA-Season,BIA-BIA_Activity_Level_num,BIA-BIA_BMC,BIA-BIA_BMI,BIA-BIA_BMR,BIA-BIA_DEE,BIA-BIA_ECW,BIA-BIA_FFM,BIA-BIA_FFMI,BIA-BIA_FMI,BIA-BIA_Fat,BIA-BIA_Frame_num,BIA-BIA_ICW,BIA-BIA_LDM,BIA-BIA_LST,BIA-BIA_SMM,BIA-BIA_TBW,PAQ_A-Season,PAQ_A-PAQ_A_Total,PAQ_C-Season,PAQ_C-PAQ_C_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday,sii,stat_0,stat_1,stat_2,stat_3,stat_4,stat_5,stat_6,stat_7,stat_8,stat_9,stat_10,stat_11,stat_12,stat_13,stat_14,stat_15,stat_16,stat_17,stat_18,stat_19,stat_20,stat_21,stat_22,stat_23,stat_24,stat_25,stat_26,stat_27,stat_28,stat_29,stat_30,stat_31,stat_32,stat_33,stat_34,stat_35,stat_36,stat_37,stat_38,stat_39,stat_40,stat_41,stat_42,stat_43,stat_44,stat_45,stat_46,stat_47,stat_48,stat_49,stat_50,stat_51,stat_52,stat_53,stat_54,stat_55,stat_56,stat_57,stat_58,stat_59,stat_60,stat_61,stat_62,stat_63,stat_64,stat_65,stat_66,stat_67,stat_68,stat_69,stat_70,stat_71,stat_72,stat_73,stat_74,stat_75,stat_76,stat_77,stat_78,stat_79,stat_80,stat_81,stat_82,stat_83
0,00008ff9,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,,,,,,,,,Fall,0.0,0.0,,,,,0.0,0.0,7.0,0.0,6.0,0.0,6.0,1.0,Fall,2.0,2.66855,16.8792,932.498,1492.0,8.25598,41.5862,13.8177,3.06143,9.21377,1.0,24.4349,8.89536,38.9177,19.5413,32.6909,,,,,,,,Fall,3.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,000fd460,Summer,9,0,,,Fall,14.03559,48.0,46.0,22.0,75.0,70.0,122.0,,,,,Fall,3.0,0.0,,,,,5.0,0.0,11.0,1.0,11.0,1.0,3.0,0.0,Winter,2.0,2.57949,14.0371,936.656,1498.65,6.01993,42.0291,12.8254,1.21172,3.97085,1.0,21.0352,14.974,39.4497,15.4107,27.0552,,,Fall,2.34,Fall,46.0,64.0,Summer,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
df_test.head(2)

Unnamed: 0,id,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,Physical-Waist_Circumference,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,Fitness_Endurance-Season,Fitness_Endurance-Max_Stage,Fitness_Endurance-Time_Mins,Fitness_Endurance-Time_Sec,FGC-Season,FGC-FGC_CU,FGC-FGC_CU_Zone,FGC-FGC_GSND,FGC-FGC_GSND_Zone,FGC-FGC_GSD,FGC-FGC_GSD_Zone,FGC-FGC_PU,FGC-FGC_PU_Zone,FGC-FGC_SRL,FGC-FGC_SRL_Zone,FGC-FGC_SRR,FGC-FGC_SRR_Zone,FGC-FGC_TL,FGC-FGC_TL_Zone,BIA-Season,BIA-BIA_Activity_Level_num,BIA-BIA_BMC,BIA-BIA_BMI,BIA-BIA_BMR,BIA-BIA_DEE,BIA-BIA_ECW,BIA-BIA_FFM,BIA-BIA_FFMI,BIA-BIA_FMI,BIA-BIA_Fat,BIA-BIA_Frame_num,BIA-BIA_ICW,BIA-BIA_LDM,BIA-BIA_LST,BIA-BIA_SMM,BIA-BIA_TBW,PAQ_A-Season,PAQ_A-PAQ_A_Total,PAQ_C-Season,PAQ_C-PAQ_C_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday,stat_0,stat_1,stat_2,stat_3,stat_4,stat_5,stat_6,stat_7,stat_8,stat_9,stat_10,stat_11,stat_12,stat_13,stat_14,stat_15,stat_16,stat_17,stat_18,stat_19,stat_20,stat_21,stat_22,stat_23,stat_24,stat_25,stat_26,stat_27,stat_28,stat_29,stat_30,stat_31,stat_32,stat_33,stat_34,stat_35,stat_36,stat_37,stat_38,stat_39,stat_40,stat_41,stat_42,stat_43,stat_44,stat_45,stat_46,stat_47,stat_48,stat_49,stat_50,stat_51,stat_52,stat_53,stat_54,stat_55,stat_56,stat_57,stat_58,stat_59,stat_60,stat_61,stat_62,stat_63,stat_64,stat_65,stat_66,stat_67,stat_68,stat_69,stat_70,stat_71,stat_72,stat_73,stat_74,stat_75,stat_76,stat_77,stat_78,stat_79,stat_80,stat_81,stat_82,stat_83
0,00008ff9,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,,,,,,,,,Fall,0.0,0.0,,,,,0.0,0.0,7.0,0.0,6.0,0.0,6.0,1.0,Fall,2.0,2.66855,16.8792,932.498,1492.0,8.25598,41.5862,13.8177,3.06143,9.21377,1.0,24.4349,8.89536,38.9177,19.5413,32.6909,,,,,,,,Fall,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,000fd460,Summer,9,0,,,Fall,14.03559,48.0,46.0,22.0,75.0,70.0,122.0,,,,,Fall,3.0,0.0,,,,,5.0,0.0,11.0,1.0,11.0,1.0,3.0,0.0,Winter,2.0,2.57949,14.0371,936.656,1498.65,6.01993,42.0291,12.8254,1.21172,3.97085,1.0,21.0352,14.974,39.4497,15.4107,27.0552,,,Fall,2.34,Fall,46.0,64.0,Summer,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
missing_values = df_train.isnull().sum()
missing_values



Unnamed: 0,0
id,0
Basic_Demos-Enroll_Season,0
Basic_Demos-Age,0
Basic_Demos-Sex,0
CGAS-Season,1405
...,...
stat_79,2964
stat_80,2964
stat_81,2964
stat_82,2964


In [None]:
df_train.duplicated().sum()

0

In [None]:
df_test.columns

Index(['id', 'Basic_Demos-Enroll_Season', 'Basic_Demos-Age', 'Basic_Demos-Sex',
       'CGAS-Season', 'CGAS-CGAS_Score', 'Physical-Season', 'Physical-BMI',
       'Physical-Height', 'Physical-Weight',
       ...
       'stat_74', 'stat_75', 'stat_76', 'stat_77', 'stat_78', 'stat_79',
       'stat_80', 'stat_81', 'stat_82', 'stat_83'],
      dtype='object', length=143)

In [None]:
y.isnull().sum()

1224

In [None]:
X= df_train.drop(columns=['sii','id'])
y= df_train['sii']
df_test= df_test.drop(columns=['id'])

 **Separating the numerical and categorical columns based on their data types**

In [None]:
numerical_cols =  X.select_dtypes(include=['int', 'float']).columns
categorical_cols = X.select_dtypes(include=['object']).columns
print(numerical_cols)
print(len(numerical_cols))
print(categorical_cols)
print(len(categorical_cols))

Index(['Basic_Demos-Age', 'Basic_Demos-Sex', 'CGAS-CGAS_Score', 'Physical-BMI',
       'Physical-Height', 'Physical-Weight', 'Physical-Waist_Circumference',
       'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP',
       ...
       'stat_74', 'stat_75', 'stat_76', 'stat_77', 'stat_78', 'stat_79',
       'stat_80', 'stat_81', 'stat_82', 'stat_83'],
      dtype='object', length=132)
132
Index(['Basic_Demos-Enroll_Season', 'CGAS-Season', 'Physical-Season',
       'Fitness_Endurance-Season', 'FGC-Season', 'BIA-Season', 'PAQ_A-Season',
       'PAQ_C-Season', 'SDS-Season', 'PreInt_EduHx-Season'],
      dtype='object')
10


In [None]:
print(len(X.columns))
print(len(df_test.columns))

142
142


**Missing values are imputed with statistics computed on the training data (X) and applied to both X and df_test. For each numerical feature, a Shapiro-Wilk test decides the strategy: if normality is not rejected (p > 0.05), the mean is used; otherwise the median is used, since it is more robust to skew. For categorical features, missing values are filled with the mode, i.e. the most frequent category. The goal is to keep imputed values consistent with each feature's distribution and to avoid using test-set statistics during preprocessing.**
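
As an illustration of the strategy described above, here is a minimal sketch on synthetic data. The helper name `pick_fill_value` and the toy series are assumptions made for this example, not part of the notebook. Note that with a few thousand rows the Shapiro-Wilk test tends to reject normality for even mild deviations, so most real columns will likely end up median-imputed.

```python
import numpy as np
import pandas as pd
from scipy import stats

def pick_fill_value(train_col: pd.Series, alpha: float = 0.05) -> float:
    """Return the mean if Shapiro-Wilk does not reject normality, else the median."""
    observed = train_col.dropna()
    _, p_value = stats.shapiro(observed)
    return observed.mean() if p_value > alpha else observed.median()

rng = np.random.default_rng(0)
# Strongly right-skewed toy feature with a few missing entries.
train = pd.Series(rng.exponential(scale=1.0, size=200))
train.iloc[:10] = np.nan
test = pd.Series([0.5, np.nan, 2.0])

# The fill value is computed once, from the training data only,
# and then reused for the test data.
fill_value = pick_fill_value(train)
train_filled = train.fillna(fill_value)
test_filled = test.fillna(fill_value)

print(fill_value, train.median())
```

Computing the fill value once and reusing it makes it explicit that the test set never contributes to the imputation statistics.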

In [None]:
def is_normal_distributed(data):
    if data.isnull().all():
        return False
    _, p_value = stats.shapiro(data.dropna())  # Shapiro-Wilk test for normality
    return p_value > 0.05  # If p-value > 0.05, data is considered normal

# Loop through numerical columns and fill missing values in both training and test data
for column in numerical_cols:
    if is_normal_distributed(X[column]):
        # Impute with the mean from the training data
        X[column] = X[column].fillna(X[column].mean())
        df_test[column] = df_test[column].fillna(X[column].mean())  # Use mean from training for test data
    else:
        # Impute with the median from the training data
        X[column] = X[column].fillna(X[column].median())
        df_test[column] = df_test[column].fillna(X[column].median())  # Use median from training for test data

# Loop through categorical columns and fill missing values in both training and test data
for column in categorical_cols:
    mode_value = X[column].mode()[0]  # Get the most frequent value from the training data
    X[column] = X[column].fillna(mode_value)  # Impute with mode in training data
    df_test[column] = df_test[column].fillna(mode_value)  # Impute with mode in test data


**The function below flags outliers in both the training dataset (X) and the test dataset (df_test) using the Interquartile Range (IQR) rule: any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, with the quartiles taken from the training data, is replaced with NaN. It then fills those NaNs with the mean for columns that pass the normality test and with the median otherwise. Columns that are entirely NaN after outlier removal are skipped.**
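
The IQR rule can be checked by hand on a tiny example. In this sketch the column name `weight` and all values are made up; the bounds come from the toy training frame and are then applied to the toy test frame as well.

```python
import pandas as pd

train = pd.DataFrame({"weight": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]})
test = pd.DataFrame({"weight": [3.0, -50.0]})

# Quartiles and bounds are computed on the training data only.
q1 = train.quantile(0.25)   # 2.25
q3 = train.quantile(0.75)   # 4.75
iqr = q3 - q1               # 2.5
lower = q1 - 1.5 * iqr      # -1.5
upper = q3 + 1.5 * iqr      # 8.5

# Values outside [lower, upper] become NaN in both frames.
train_masked = train.mask((train < lower) | (train > upper))
test_masked = test.mask((test < lower) | (test > upper))

print(train_masked)
print(test_masked)
```

Here 100.0 in the training frame and -50.0 in the test frame are replaced with NaN, while every other value is kept.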

In [None]:
def remove_outliers_and_impute(X, df_test, numerical_cols):
    # Calculate IQR for outlier detection
    q1 = X[numerical_cols].quantile(0.25)
    q3 = X[numerical_cols].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr


    X[numerical_cols] = X[numerical_cols].mask((X[numerical_cols] < lower_bound) | (X[numerical_cols] > upper_bound), np.nan)
    df_test[numerical_cols] = df_test[numerical_cols].mask((df_test[numerical_cols] < lower_bound) | (df_test[numerical_cols] > upper_bound), np.nan)

    for column in numerical_cols:
        if X[column].isnull().all():
            continue
        # Compute the fill value once from the training data, then assign back
        # (avoids chained `fillna(..., inplace=True)`, which newer pandas versions warn about).
        if is_normal_distributed(X[column]):
            fill_value = X[column].mean()
        else:
            fill_value = X[column].median()
        X[column] = X[column].fillna(fill_value)
        df_test[column] = df_test[column].fillna(fill_value)


numerical_cols = X.select_dtypes(include=[np.number]).columns.tolist()
remove_outliers_and_impute(X, df_test, numerical_cols)

In [None]:
X.columns



Index(['Basic_Demos-Enroll_Season', 'Basic_Demos-Age', 'Basic_Demos-Sex',
       'CGAS-Season', 'CGAS-CGAS_Score', 'Physical-Season', 'Physical-BMI',
       'Physical-Height', 'Physical-Weight', 'Physical-Waist_Circumference',
       ...
       'stat_74', 'stat_75', 'stat_76', 'stat_77', 'stat_78', 'stat_79',
       'stat_80', 'stat_81', 'stat_82', 'stat_83'],
      dtype='object', length=142)

In [None]:
print(X.columns.shape)
print(df_test.columns.shape)

(142,)
(142,)


**For preprocessing, I apply standard scaling to the continuous numerical columns (including all `stat_*` time series features) and one-hot encoding to the categorical season columns. The remaining columns are already numerically coded (for example, sex and the fitness zone indicators) and are passed through unchanged.**

In [None]:
# Existing list of columns for standard scaling
standard_scaler = ['Physical-Diastolic_BP', 'BIA-BIA_BMI', 'FGC-FGC_GSND', 'Physical-Weight', 'FGC-FGC_CU',
                   'BIA-BIA_LDM', 'Fitness_Endurance-Time_Sec', 'Basic_Demos-Age', 'BIA-BIA_Fat',
                   'BIA-BIA_LST', 'Physical-HeartRate', 'BIA-BIA_DEE', 'BIA-BIA_FFMI', 'PAQ_A-PAQ_A_Total',
                   'FGC-FGC_SRR', 'FGC-FGC_PU', 'FGC-FGC_SRL', 'SDS-SDS_Total_T', 'Physical-Systolic_BP',
                   'BIA-BIA_SMM', 'Physical-Waist_Circumference', 'Fitness_Endurance-Time_Mins',
                   'Physical-Height', 'BIA-BIA_FMI', 'SDS-SDS_Total_Raw', 'PAQ_C-PAQ_C_Total',
                   'CGAS-CGAS_Score', 'BIA-BIA_ICW', 'BIA-BIA_TBW', 'BIA-BIA_FFM',
                   'Fitness_Endurance-Max_Stage', 'BIA-BIA_BMR', 'Physical-BMI', 'BIA-BIA_ECW',
                   'BIA-BIA_BMC', 'FGC-FGC_GSD', 'FGC-FGC_TL']


standard_scaler += [col for col in X.columns if col.startswith('stat')]
print(len(standard_scaler))



121


In [None]:
one_hot_encoder = [
    'Basic_Demos-Enroll_Season', 'CGAS-Season', 'Physical-Season',
    'Fitness_Endurance-Season', 'FGC-Season', 'BIA-Season', 'PAQ_A-Season',
    'PAQ_C-Season', 'SDS-Season', 'PreInt_EduHx-Season'
]

passthrough = [
    'Basic_Demos-Sex', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD_Zone',
    'FGC-FGC_PU_Zone', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR_Zone', 'FGC-FGC_TL_Zone',
    'BIA-BIA_Activity_Level_num', 'BIA-BIA_Frame_num', 'PreInt_EduHx-computerinternet_hoursday'
]

**I used a ColumnTransformer to preprocess all features in one step. It applies StandardScaler to the numerical features, OneHotEncoder to the categorical features, and passes the remaining columns through unchanged. The preprocessor is fitted on the training data only and then used to transform both the training and test sets.**

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ("standard_scaler", StandardScaler(), standard_scaler),
        ("onehot_encoder", OneHotEncoder(), one_hot_encoder),
        ("passthrough", "passthrough", passthrough)
    ]
)


In [None]:
preprocessor.fit(X)

X_transformed_train = preprocessor.transform(X)
X_transformed_test = preprocessor.transform(df_test)
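
As a quick sanity check on the output width, the sketch below runs the same three-branch ColumnTransformer layout on a toy frame. The column names (`age`, `season`, `sex`) and values are made up for illustration, and `handle_unknown="ignore"` is an optional setting that is not used in the cell above; it only makes the encoder tolerate categories that show up at transform time but were absent at fit time.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (hypothetical columns): one numeric, one categorical, one already-coded column
toy = pd.DataFrame({
    "age": [8.0, 10.0, 12.0, 14.0],
    "season": ["Fall", "Spring", "Summer", "Winter"],
    "sex": [0, 1, 0, 1],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["age"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["season"]),
        ("keep", "passthrough", ["sex"]),
    ]
)

toy_out = toy_preprocessor.fit_transform(toy)
# The result can be sparse or dense depending on its density; normalise to a dense array
toy_out = toy_out.toarray() if hasattr(toy_out, "toarray") else np.asarray(toy_out)

# 1 scaled column + 4 one-hot columns (one per season) + 1 passthrough column
print(toy_out.shape)
```

By the same arithmetic, the 142 input columns here (121 scaled, 10 one-hot encoded, 11 passed through) expand to 121 + 4 × 10 + 11 = 172 if each season column has four categories, which matches the width of 172 reported after the train/test split.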

In [None]:
# Fill in missing target labels with a KNN classifier trained on the labelled rows
known_mask = y.notnull().to_numpy()

X_transformed_known = X_transformed_train[known_mask]
y_known = y[known_mask]  # Known target values
X_transformed_missing = X_transformed_train[~known_mask]

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_transformed_known, y_known)
y_missing_pred = knn.predict(X_transformed_missing)

# Write the predicted labels back into the rows that had no target
y.loc[~known_mask] = y_missing_pred

print("Missing values in completed target:", y.isnull().sum())




Missing values in completed target: 0
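
The cell above fills the missing `sii` labels with a 5-nearest-neighbour classifier trained on the labelled rows. One way to gauge how trustworthy such imputed labels are is to cross-validate the same classifier on rows whose labels are known. The sketch below does this on synthetic data from `make_classification`, which only stands in for the real features; the number it prints says nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data (an assumption for illustration, not the competition data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

# Same classifier settings as the imputation cell
knn_demo = KNeighborsClassifier(n_neighbors=5)

# Out-of-fold predictions: every row is predicted by a model that never saw its label
y_cv = cross_val_predict(knn_demo, X_demo, y_demo, cv=5)

knn_accuracy = accuracy_score(y_demo, y_cv)
print(f"Out-of-fold accuracy of the label imputer: {knn_accuracy:.3f}")
```

A low out-of-fold score on the real labelled rows would suggest that the imputed targets add noise, since they are later used for both training and evaluation.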


In [None]:
# Target value counts after imputation
y.value_counts()

sii,count
0.0,2395
1.0,1014
2.0,516
3.0,35


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_transformed_train, y, test_size=0.2, random_state=42)

In [None]:
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

X_train: (3168, 172)
X_test: (792, 172)
y_train: (3168,)
y_test: (792,)
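
The split above is purely random, and with only 35 rows in class 3 the share of that class in the hold-out set can drift from one seed to another. A stratified split keeps the class proportions fixed. The sketch below rebuilds a label vector from the class counts printed earlier and passes it as `stratify`; the feature matrix is a placeholder, and using `stratify` is a suggestion rather than what this notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts taken from the value_counts output above
class_counts = {0: 2395, 1: 1014, 2: 516, 3: 35}
y_demo = np.repeat(list(class_counts.keys()), list(class_counts.values()))
X_demo = np.zeros((len(y_demo), 1))  # placeholder features, only the labels matter here

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Per-class share in each split; with stratification the two rows nearly coincide
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train:", np.round(train_share, 3))
print("test: ", np.round(test_share, 3))
print("class 3 rows in the hold-out set:", np.bincount(y_te)[3])
```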


**To address class imbalance, I combined SMOTE (Synthetic Minority Over-sampling Technique) and random undersampling in a single pipeline. The pipeline first oversamples the minority classes with SMOTE and then undersamples the majority class, producing a balanced training set. Resampling is applied only to the training split, so the held-out split keeps its original class distribution.**

In [None]:
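from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline  # imblearn's Pipeline is needed for fit_resample

smote = SMOTE(random_state=42)
rus = RandomUnderSampler(random_state=42)

# Oversample first, then undersample
pipeline = Pipeline(steps=[('smote', smote), ('rus', rus)])

X_resampled, y_resampled = pipeline.fit_resample(X_train, y_train)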
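
SMOTE creates each synthetic sample by picking a minority-class point, choosing one of its nearest minority-class neighbours, and interpolating between the two. The sketch below illustrates that interpolation step with plain NumPy on made-up 2-D points. It is a simplified illustration of the idea, not imbalanced-learn's implementation, and `smote_like_samples` is a hypothetical helper name.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority points
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is never its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per point

    base = rng.integers(0, len(X_min), size=n_new)            # starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Made-up minority-class points
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_like_samples(X_minority, n_new=10)
print(X_new.shape)
```

Every synthetic point lies on a segment between two real minority points, so no values outside the observed range are created. With imbalanced-learn's default `sampling_strategy`, SMOTE should already bring every non-majority class up to the majority count, which leaves the undersampling step with little or nothing to remove.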


**For prediction on the Child Mind Institute dataset, I tried three classifiers: Gradient Boosting, LightGBM (LGBM), and CatBoost. CatBoost performed best of the three, so it is the model used for the final predictions.**








In [None]:
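from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# 2 x 2 x 2 = 8 possible combinations
param_dist = {
    'learning_rate': [0.01, 0.05],
    'n_estimators': [50, 100],
    'max_depth': [3, 5],
}
gbm = GradientBoostingClassifier()

random_search = RandomizedSearchCV(
    estimator=gbm,
    param_distributions=param_dist,
    n_iter=8,  # the grid has only 8 combinations, so a larger value is capped at 8 anyway
    cv=3,
    scoring='accuracy',
    verbose=1,
    n_jobs=-1,
    random_state=42
)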


random_search.fit(X_resampled, y_resampled)



Fitting 3 folds for each of 8 candidates, totalling 24 fits
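
The search above selects hyperparameters by accuracy, while the model is scored with quadratic weighted kappa afterwards. If the two should agree, scikit-learn's `make_scorer` can wrap `cohen_kappa_score` so that the search optimizes the same metric. The sketch below shows this on a small synthetic problem; the data, the grid, and the name `qwk_scorer` are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Extra keyword arguments to make_scorer are forwarded to the metric function
qwk_scorer = make_scorer(cohen_kappa_score, weights="quadratic")

# Small synthetic 4-class problem (stand-in data, kept tiny so the search runs quickly)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

demo_search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=20, random_state=42),
    param_distributions={"learning_rate": [0.05, 0.1], "max_depth": [2, 3]},
    n_iter=4,
    cv=3,
    scoring=qwk_scorer,  # candidates are now ranked by quadratic weighted kappa
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(f"Best cross-validated QWK: {demo_search.best_score_:.3f}")
```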


In [None]:
y_pred=random_search.predict(X_test)

print(len(y_test))
print(len(y_pred))

792
792


In [None]:
from sklearn.metrics import cohen_kappa_score
qwk = cohen_kappa_score(y_test, y_pred, weights='quadratic')
print(f'Quadratic Weighted Kappa: {qwk}')


Quadratic Weighted Kappa: 0.4522163412902901
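
Quadratic weighted kappa compares the observed confusion matrix with the one expected by chance, weighting each disagreement by the squared distance between the two class labels. Predicting 3 for a true 0 therefore costs more than predicting 1. The sketch below spells out that computation in NumPy for integer labels; `quadratic_weighted_kappa` is a hypothetical helper written for illustration, and `cohen_kappa_score(..., weights='quadratic')` remains the function used in this notebook.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """QWK for integer labels in the range 0 .. n_classes - 1."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # Observed confusion matrix: rows are true labels, columns are predictions
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (y_true, y_pred), 1)

    # Confusion matrix expected if predictions were independent of the truth
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic penalty: 0 on the diagonal, growing with the squared label distance
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

truth = [0, 1, 2, 3]
perfect = quadratic_weighted_kappa(truth, [0, 1, 2, 3], n_classes=4)
near_miss = quadratic_weighted_kappa(truth, [1, 1, 2, 3], n_classes=4)  # off by one class
far_miss = quadratic_weighted_kappa(truth, [3, 1, 2, 3], n_classes=4)   # off by three classes
print(perfect, near_miss, far_miss)
```

On this toy input the three scores come out at 1.0, about 0.875, and about 0.1: a single prediction that is three classes off hurts far more than one that is off by a single class.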


In [None]:
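from catboost import CatBoostClassifier
from sklearn.metrics import cohen_kappa_score

catboost_params = {
    'iterations': 1000,
    'learning_rate': 0.05,
    'depth': 7,
    'loss_function': 'MultiClass',
    'random_state': 42
}

# Trained on the original (non-resampled) training split;
# every feature is numeric after preprocessing, so no categorical features are declared
catboost_model = CatBoostClassifier(**catboost_params)
catboost_model.fit(X_train, y_train, cat_features=[])

# MultiClass predictions come back with shape (n, 1), so flatten them before scoring
y_pred = catboost_model.predict(X_test).ravel()

qwk = cohen_kappa_score(y_test, y_pred, weights='quadratic')
print(f'Quadratic Weighted Kappa: {qwk}')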

0:	learn: 1.3410596	total: 20.5ms	remaining: 20.5s
1:	learn: 1.3019073	total: 37.6ms	remaining: 18.8s
2:	learn: 1.2659071	total: 53.5ms	remaining: 17.8s
3:	learn: 1.2353060	total: 73.9ms	remaining: 18.4s
4:	learn: 1.2069750	total: 105ms	remaining: 21s
5:	learn: 1.1794842	total: 122ms	remaining: 20.2s
6:	learn: 1.1554934	total: 137ms	remaining: 19.5s
7:	learn: 1.1337015	total: 154ms	remaining: 19s
8:	learn: 1.1150404	total: 169ms	remaining: 18.6s
9:	learn: 1.0960913	total: 185ms	remaining: 18.4s
10:	learn: 1.0802770	total: 201ms	remaining: 18.1s
11:	learn: 1.0656081	total: 221ms	remaining: 18.2s
12:	learn: 1.0519480	total: 238ms	remaining: 18.1s
13:	learn: 1.0392676	total: 254ms	remaining: 17.9s
14:	learn: 1.0256394	total: 270ms	remaining: 17.7s
15:	learn: 1.0140714	total: 287ms	remaining: 17.6s
16:	learn: 1.0031113	total: 303ms	remaining: 17.5s
17:	learn: 0.9924784	total: 319ms	remaining: 17.4s
18:	learn: 0.9817579	total: 335ms	remaining: 17.3s
19:	learn: 0.9724211	total: 351ms	remaini

In [None]:
y_pred = catboost_model.predict(X_transformed_test)



In [None]:
y_pred.shape

(20, 1)

**Final submission file:**

In [None]:
sample_submission = pd.read_csv('/kaggle/input/child-mind-institute-problematic-internet-use/sample_submission.csv')

# Reuse the ids from the sample submission; predictions are flattened from (n, 1) to (n,)
submission = pd.DataFrame({
    'id': sample_submission['id'],
    'sii': y_pred.ravel()
})

submission.to_csv('submission.csv', index=False)

print("Submission file created successfully.")

In [None]:
submission

Unnamed: 0,id,prediction
0,00008ff9,2.0
1,000fd460,0.0
2,00105258,0.0
3,00115b9f,1.0
4,0016bb22,2.0
5,001f3379,1.0
6,0038ba98,0.0
7,0068a485,0.0
8,0069fbed,2.0
9,0083e397,2.0
