This notebook demonstrates a complete machine learning pipeline, including:
- Preprocessing CSV and time-series data.
- Using Optuna to tune hyperparameters for LightGBM, XGBoost, and CatBoost.
- Training multiple models with optimized parameters.
- Combining predictions using an ensemble method.
- Generating a Kaggle submission.
    

# **Table of Contents**
1. [Introduction](#introduction)
2. [Libraries and Utilities](#libraries-and-utilities)
3. [Preprocessing](#preprocessing)
    - [Preprocessing CSV Data](#preprocessing-csv-data)
    - [Preprocessing Time-Series Data](#preprocessing-time-series-data)
    - [Merging Preprocessed Data](#merging-preprocessed-data)
4. [Hyperparameter Tuning with Optuna](#hyperparameter-tuning)
    - [LightGBM](#lightgbm)
    - [XGBoost](#xgboost)
    - [CatBoost](#catboost)
5. [Model Training](#model-training)
6. [Ensemble Predictions](#ensemble-predictions)
7. [Submission](#submission)
8. [Main Pipeline](#main-pipeline)
    

<a id="libraries-and-utilities"></a>
# **Libraries and Utilities**
Import the required libraries, including Optuna for hyperparameter optimization.
    

## **Install packages and import libraries**

In [3]:
!pip uninstall protobuf -y
!pip install /kaggle/input/ft-transformer/protobuf-3.19.6-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!pip install /kaggle/input/ft-transformer/tabtransformertf-0.0.8-py3-none-any.whl
!pip install /kaggle/input/ft-transformer/tensorflow_addons-0.19.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Found existing installation: protobuf 3.20.3
Uninstalling protobuf-3.20.3:
  Successfully uninstalled protobuf-3.20.3
[0mProcessing /kaggle/input/ft-transformer/protobuf-3.19.6-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Installing collected packages: protobuf
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 21.12.2 requires cupy-cuda115, which is not installed.
tfx-bsl 1.12.0 requires google-api-python-client<2,>=1.7.11, but you have google-api-python-client 2.79.0 which is incompatible.
tfx-bsl 1.12.0 requires pyarrow<7,>=6, but you have pyarrow 5.0.0 which is incompatible.
tensorflow-transform 1.12.0 requires pyarrow<7,>=6, but you have pyarrow 5.0.0 which is incompatible.
onnx 1.13.1 requires protobuf<4,>=3.20.2, but you have protobuf 3.19.6 which is incompatible.
apache-beam 2.44.0 requires dill<0.3.2,>=0.3.1.1, but you h

In [4]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf
import tensorflow_addons as tfa
from tabtransformertf.utils.preprocessing import df_to_dataset, build_categorical_prep
from tabtransformertf.models.fttransformer import FTTransformerEncoder, FTTransformer
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [5]:
# Import necessary libraries
import numpy as np  # Linear algebra
import pandas as pd  # Data processing, CSV file I/O
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import cohen_kappa_score
from sklearn.ensemble import VotingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression

from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

import tensorflow as tf
import tensorflow_addons as tfa
from tabtransformertf.utils.preprocessing import df_to_dataset, build_categorical_prep
from tabtransformertf.models.fttransformer import FTTransformerEncoder, FTTransformer

from keras.models import Model
from keras.layers import Input, Dense
from keras.optimizers import Adam

import torch
import torch.nn as nn
import torch.optim as optim

from scipy.optimize import minimize
import optuna
from tqdm import tqdm
import warnings
import os

from concurrent.futures import ThreadPoolExecutor

warnings.filterwarnings("ignore")

## **Utilities**

In [6]:
def quadratic_weighted_kappa(y_true, y_pred):
    return cohen_kappa_score(y_true, y_pred, weights='quadratic')

def threshold_Rounder(oof_non_rounded, thresholds):
    return np.where(oof_non_rounded < thresholds[0], 0,
                    np.where(oof_non_rounded < thresholds[1], 1,
                             np.where(oof_non_rounded < thresholds[2], 2, 3)))

def evaluate_predictions(thresholds, y_true, oof_non_rounded):
    rounded_p = threshold_Rounder(oof_non_rounded, thresholds)
    return -quadratic_weighted_kappa(y_true, rounded_p)

<a id="preprocessing"></a>
# **Preprocessing**
This section contains the preprocessing functions for CSV and time-series data, as well as merging them.
    

## **Imputer**

In [7]:
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer

def impute(df, method="knn", n_neighbors=5):
    # impute categorical columns
    if method == "cat":
        categorical_cols = df.select_dtypes(include=["object", "category"]).columns
        cat_imputer = SimpleImputer(strategy="most_frequent")
        df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])
        return df

    # get numeric columns instead of 'sii'
    numeric_cols = df.select_dtypes(include=['number', 'float64', 'int64']).columns
    numeric_cols = list(numeric_cols)
    if "sii" in numeric_cols:
        numeric_cols.remove("sii")
    numeric_cols = pd.Index(numeric_cols)
    
    if method == "knn":
        imputer = KNNImputer(n_neighbors=n_neighbors)
        # impute_data = imputer.fit_transform(df[numeric_cols])
    if method == "mean":
        imputer = SimpleImputer(strategy="mean")
    if method == "median":
        imputer = SimpleImputer(strategy="median")
    
    imputed_data = imputer.fit_transform(df[numeric_cols])
    df_imputed = pd.DataFrame(imputed_data, columns=numeric_cols)
    for col in df.columns:
        if col not in numeric_cols:
            df_imputed[col] = df[col]          
    df = df_imputed
    
    return df

## **Preprocess CSV data**

### **Feature engineering for CSV data**

In [8]:
# Paths
train_path = '/kaggle/input/child-mind-institute-problematic-internet-use/train.csv'
test_path = '/kaggle/input/child-mind-institute-problematic-internet-use/test.csv'
sample_path = '/kaggle/input/child-mind-institute-problematic-internet-use/sample_submission.csv'
time_series_train = "/kaggle/input/child-mind-institute-problematic-internet-use/series_train.parquet"
time_series_test = "/kaggle/input/child-mind-institute-problematic-internet-use/series_test.parquet"

In [9]:
def csv_feature_engineering(df):
    season_cols = [col for col in df.columns if 'Season' in col]
    df = df.drop(season_cols, axis=1) 
    df['BMI_Age'] = df['Physical-BMI'] * df['Basic_Demos-Age']
    df['Internet_Hours_Age'] = df['PreInt_EduHx-computerinternet_hoursday'] * df['Basic_Demos-Age']
    df['BMI_Internet_Hours'] = df['Physical-BMI'] * df['PreInt_EduHx-computerinternet_hoursday']
    df['BFP_BMI'] = df['BIA-BIA_Fat'] / df['BIA-BIA_BMI']
    df['FFMI_BFP'] = df['BIA-BIA_FFMI'] / df['BIA-BIA_Fat']
    df['FMI_BFP'] = df['BIA-BIA_FMI'] / df['BIA-BIA_Fat']
    df['LST_TBW'] = df['BIA-BIA_LST'] / df['BIA-BIA_TBW']
    df['BFP_BMR'] = df['BIA-BIA_Fat'] * df['BIA-BIA_BMR']
    df['BFP_DEE'] = df['BIA-BIA_Fat'] * df['BIA-BIA_DEE']
    df['BMR_Weight'] = df['BIA-BIA_BMR'] / df['Physical-Weight']
    df['DEE_Weight'] = df['BIA-BIA_DEE'] / df['Physical-Weight']
    df['SMM_Height'] = df['BIA-BIA_SMM'] / df['Physical-Height']
    df['Muscle_to_Fat'] = df['BIA-BIA_SMM'] / df['BIA-BIA_FMI']
    df['Hydration_Status'] = df['BIA-BIA_TBW'] / df['Physical-Weight']
    df['ICW_TBW'] = df['BIA-BIA_ICW'] / df['BIA-BIA_TBW']

    df['Age_Weight'] = df['Basic_Demos-Age'] * df['Physical-Weight']
    df['Sex_BMI'] = df['Basic_Demos-Sex'] * df['Physical-BMI']
    df['Sex_HeartRate'] = df['Basic_Demos-Sex'] * df['Physical-HeartRate']
    df['Age_WaistCirc'] = df['Basic_Demos-Age'] * df['Physical-Waist_Circumference']
    df['BMI_FitnessMaxStage'] = df['Physical-BMI'] * df['Fitness_Endurance-Max_Stage']
    df['Weight_GripStrengthDominant'] = df['Physical-Weight'] * df['FGC-FGC_GSD']
    df['Weight_GripStrengthNonDominant'] = df['Physical-Weight'] * df['FGC-FGC_GSND']
    df['HeartRate_FitnessTime'] = df['Physical-HeartRate'] * (df['Fitness_Endurance-Time_Mins'] + df['Fitness_Endurance-Time_Sec'])
    df['Age_PushUp'] = df['Basic_Demos-Age'] * df['FGC-FGC_PU']
    df['FFMI_Age'] = df['BIA-BIA_FFMI'] * df['Basic_Demos-Age']
    df['InternetUse_SleepDisturbance'] = df['PreInt_EduHx-computerinternet_hoursday'] * df['SDS-SDS_Total_Raw']
    df['CGAS_BMI'] = df['CGAS-CGAS_Score'] * df['Physical-BMI']
    df['CGAS_FitnessMaxStage'] = df['CGAS-CGAS_Score'] * df['Fitness_Endurance-Max_Stage']
    
    return df

In [10]:
train_df = pd.read_csv(train_path)

In [11]:
len(train_df.columns)

82

In [12]:
fe_train_df = csv_feature_engineering(train_df)

In [13]:
len(fe_train_df.columns)

99

In [14]:
CATEGORICAL_FEATURES = fe_train_df.select_dtypes(include=['object', 'category']).columns
print(CATEGORICAL_FEATURES)

Index(['id'], dtype='object')


After we analyzed the dataset, we observed that there are lot of unreasonable values in the dataset. Therefore, we decided to prune some anomaly samples.



In [15]:
def remove_outliers(df):
    input_length = len(df)
    df = df.drop(df[df['Physical-BMI'] <= 0].index)
    df = df.drop(df[df['Physical-Diastolic_BP'] <= 0].index)
    df = df.drop(df[df['Physical-Systolic_BP'] <= 0].index)
    df = df.drop(df[df['Physical-Diastolic_BP'] > 160].index)

    children = df[df['Basic_Demos-Age'] <= 12]
    df = df.drop(children[children['FGC-FGC_CU'] > 80].index)
    df = df.drop(children[children['FGC-FGC_GSND'] > 80].index)

    df = df.drop(df[df['BIA-BIA_BMI'] <= 0].index)
    df = df.drop(df[df['BIA-BIA_BMC'] > 1000].index)
    df = df.drop(df[df['BIA-BIA_BMR'] > 40000].index)
    df = df.drop(df[df['BIA-BIA_DEE'] > 60000].index)
    df = df.drop(df[df['BIA-BIA_ECW'] > 2000].index)
    df = df.drop(df[df['BIA-BIA_FFM'] > 2000].index)
    df = df.drop(df[df['BIA-BIA_ICW'] > 2000].index)
    df = df.drop(df[df['BIA-BIA_LDM'] > 2000].index)
    df = df.drop(df[df['BIA-BIA_LST'] > 2000].index)
    df = df.drop(df[df['BIA-BIA_SMM'] > 2000].index)
    df = df.drop(df[df['BIA-BIA_TBW'] > 2000].index)
    output_length = len(df)
    print (input_length, output_length)
    
    return df

In [16]:
removed_train_df = remove_outliers(fe_train_df)

3960 3947


In [17]:
test_df = pd.read_csv(test_path)
test_cols = test_df.columns.to_list()

In [18]:
featuresCols = ['Basic_Demos-Enroll_Season', 'Basic_Demos-Age', 'Basic_Demos-Sex',
                'CGAS-Season', 'CGAS-CGAS_Score',
                'Physical-Season', 'Physical-Height', 'Physical-Weight', 'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP',
                'Fitness_Endurance-Season', 'Fitness_Endurance-Max_Stage', 'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec',
                'FGC-Season', 'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_PU',
                'FGC-FGC_PU_Zone', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR', 'FGC-FGC_SRR_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone',
                'BIA-Season', 'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI', 'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num',
                'PAQ_A-Season','PAQ_A-PAQ_A_Total', 'PAQ_C-Season', 'PAQ_C-PAQ_C_Total',
                'SDS-Season', 'SDS-SDS_Total_Raw', 'SDS-SDS_Total_T',
                'PreInt_EduHx-Season', 'PreInt_EduHx-computerinternet_hoursday',
                'sii',
                'BMI_Age', 'Internet_Hours_Age', 'BMI_Internet_Hours', 'BFP_BMI',
                'FFMI_BFP', 'FMI_BFP', 'LST_TBW', 'BFP_BMR', 'BFP_DEE', 'BMR_Weight',
                'DEE_Weight', 'SMM_Height', 'Muscle_to_Fat', 'Hydration_Status', 'ICW_TBW'
               ]

removedCols = ['Basic_Demos-Enroll_Season', 'CGAS-Season', 'Physical-Season', 'Fitness_Endurance-Season', 'FGC-Season', 'BIA-Season', 'PAQ_A-Season', 'PAQ_C-Season', 'SDS-Season', 'PreInt_EduHx-Season']

featuresCols = [feature for feature in featuresCols if feature not in removedCols]

print(len(featuresCols))
removed_train_df_cols = removed_train_df.columns.to_list()
removed_train_df_cols = [col for col in removed_train_df_cols if col in test_cols]

print(set(removed_train_df_cols)-set(featuresCols))



51
{'FGC-FGC_SRL', 'BIA-BIA_LST', 'BIA-BIA_FFM', 'BIA-BIA_ECW', 'BIA-BIA_LDM', 'FGC-FGC_GSND', 'BIA-BIA_TBW', 'Physical-Waist_Circumference', 'BIA-BIA_DEE', 'BIA-BIA_SMM', 'Physical-BMI', 'id', 'BIA-BIA_BMR', 'BIA-BIA_ICW'}


### **Preprocess Function**

In [19]:
def preprocess_csv_data(train_path, test_path, sample_path):
    """
    Preprocess CSV data with proper handling of missing columns.
    Args:
        train_path (str): Path to the training CSV file.
        test_path (str): Path to the test CSV file.
        sample_path (str): Path to the sample submission CSV file
    Returns:
        tuple: Preprocessed training DataFrame, test DataFrame, and sample submission.
    """
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    sample = pd.read_csv(sample_path)

    # # remove outlier from train
    # train = remove_outliers(train)
    # print("Train shape after removing outlier: ", train.shape)

    # feature engineering for both train and test
    train = csv_feature_engineering(train)
    test = csv_feature_engineering(test)
    print("Train shape after feature engineering: ", train.shape)
    print("Test shape after feature engineering: ", test.shape)

    # Only use columns in both train and test
    # Ensure that the columns in `train` and `test` match
    # Remove some columns
    common_columns = test.columns.to_list() 
    remove_columns = ['BIA-BIA_LDM', 'Physical-Waist_Circumference', 'FGC-FGC_SRL', 'FGC-FGC_GSND', 'BIA-BIA_ECW', 'Physical-BMI', 'BIA-BIA_LST', 'BIA-BIA_SMM', 'BIA-BIA_FFM', 'BIA-BIA_TBW', 'BIA-BIA_BMR', 'BIA-BIA_ICW', 'BIA-BIA_DEE']
    common_columns = [col for col in common_columns if col not in remove_columns]
    
    train = train[["sii"] + common_columns]
    test = test[common_columns]
    print("Train shape after removing columns: ", train.shape)
    print("Test shape after removing columns: ", test.shape)

    return train, test, sample

In [20]:
train, test, sample = preprocess_csv_data(train_path, test_path, sample_path)

Train shape after feature engineering:  (3960, 99)
Test shape after feature engineering:  (20, 77)
Train shape after removing columns:  (3960, 65)
Test shape after removing columns:  (20, 64)


## **Preprocess time series data**

In [22]:
# New version, nhưng chạy tận 10p??????
def process_file(filename, dirname):
    df= pd.read_parquet(os.path.join(dirname, filename, 'part-0.parquet'))
    data=extract_advanced_features(df)
    array_1=data.values[0]
    array_2=df.describe().values.reshape(-1), filename.split('=')[1]
    # Combine the two arrays
    combined_array = np.concatenate((array_1, array_2[0]))
    combined_tuple=(array_1,array_2[1])
    return combined_tuple

def load_time_series(dirname) -> pd.DataFrame:
    ids = os.listdir(dirname)
    
    with ThreadPoolExecutor() as executor:
        results = list(tqdm(executor.map(lambda fname: process_file(fname, dirname), ids), total=len(ids)))
    
    stats, indexes = zip(*results)
    
    df = pd.DataFrame(stats, columns=[f"stat_{i}" for i in range(len(stats[0]))])
    df['id'] = indexes
    return df

### **Feature engineering for time series data**

In [23]:
GLOBAL_TS_LENGTH=[]
import pandas as pd
import numpy as np
from scipy import stats

def extract_advanced_features(data):
    """
    Extract advanced features from actigraphy data for SII prediction.
    
    Parameters:
    data (pd.DataFrame): Input DataFrame with actigraphy measurements
    
    Returns:
    pd.DataFrame: Single row DataFrame with extracted features
    """
    # Initial data preprocessing
    data = data.copy()
    data['timestamp'] = pd.to_datetime(data['relative_date_PCIAT'], unit='D') + pd.to_timedelta(data['time_of_day'])
    data = data[data['non-wear_flag'] == 0]
    
    # Calculate basic metrics
    data['magnitude'] = np.sqrt(data['X']**2 + data['Y']**2 + data['Z']**2)
    data['velocity'] = data['magnitude']
    data['distance'] = data['velocity'] * 5  # 5 seconds per observation
    data['date'] = data['timestamp'].dt.date
    hour = pd.to_datetime(data['time_of_day']).dt.hour
    
    # Calculate aggregated distances
    distances = {
        'daily': data.groupby('date')['distance'].sum(),
        'monthly': data.groupby(data['timestamp'].dt.to_period('M'))['distance'].sum(),
        'quarterly': data.groupby('quarter')['distance'].sum()
    }
    
    # Initialize features dictionary
    features = {}
    
    # Time masks for different periods
    time_masks = {
        'morning': (hour >= 6) & (hour < 12),
        'afternoon': (hour >= 12) & (hour < 18),
        'evening': (hour >= 18) & (hour < 22),
        'night': (hour >= 22) | (hour < 6)
    }
    
    # 1. Activity Pattern Features
    for period, mask in time_masks.items():
        features.update({
            f'{period}_activity_mean': data.loc[mask, 'magnitude'].mean(),
            f'{period}_activity_std': data.loc[mask, 'magnitude'].std(),
            f'{period}_enmo_mean': data.loc[mask, 'enmo'].mean()
        })
    
    # 2. Sleep Quality Features
    sleep_hours = time_masks['night']
    magnitude_threshold = data['magnitude'].mean() + data['magnitude'].std()
    
    features.update({
        'sleep_movement_mean': data.loc[sleep_hours, 'magnitude'].mean(),
        'sleep_movement_std': data.loc[sleep_hours, 'magnitude'].std(),
        'sleep_disruption_count': len(data.loc[sleep_hours & (data['magnitude'] > 
            data['magnitude'].mean() + 2 * data['magnitude'].std())]),
        'light_exposure_during_sleep': data.loc[sleep_hours, 'light'].mean(),
        'sleep_position_changes': len(data.loc[sleep_hours & 
            (abs(data['anglez'].diff()) > 45)]),
        'good_sleep_cycle': int(data.loc[sleep_hours, 'light'].mean() < 50)
    })
    
    # 3. Activity Intensity Features
    features.update({
        'sedentary_time_ratio': (data['magnitude'] < magnitude_threshold * 0.5).mean(),
        'moderate_activity_ratio': ((data['magnitude'] >= magnitude_threshold * 0.5) & 
            (data['magnitude'] < magnitude_threshold * 1.5)).mean(),
        'vigorous_activity_ratio': (data['magnitude'] >= magnitude_threshold * 1.5).mean(),
        'activity_peaks_per_day': len(data[data['magnitude'] > 
            data['magnitude'].quantile(0.95)]) / len(data.groupby('relative_date_PCIAT'))
    })
    
    # 4. Circadian Rhythm Features
    hourly_activity = data.groupby(hour)['magnitude'].mean()
    features.update({
        'circadian_regularity': hourly_activity.std() / hourly_activity.mean(),
        'peak_activity_hour': hourly_activity.idxmax(),
        'trough_activity_hour': hourly_activity.idxmin(),
        'activity_range': hourly_activity.max() - hourly_activity.min()
    })
    
    # 5-11. Additional Feature Groups
    weekend_mask = data['weekday'].isin([6, 7])
    
    features.update({
        # Movement Patterns
        'movement_entropy': stats.entropy(pd.qcut(data['magnitude'], q=10, duplicates='drop').value_counts()),
        'direction_changes': len(data[abs(data['anglez'].diff()) > 30]) / len(data),
        'sustained_activity_periods': len(data[data['magnitude'].rolling(12).mean() > 
            magnitude_threshold]) / len(data),
        
        # Weekend vs Weekday
        'weekend_activity_ratio': data.loc[weekend_mask, 'magnitude'].mean() / 
            data.loc[~weekend_mask, 'magnitude'].mean(),
        'weekend_sleep_difference': data.loc[weekend_mask & sleep_hours, 'magnitude'].mean() - 
            data.loc[~weekend_mask & sleep_hours, 'magnitude'].mean(),
        
        # Non-wear Time
        'wear_time_ratio': (data['non-wear_flag'] == 0).mean(),
        'wear_consistency': len(data['non-wear_flag'].value_counts()),
        'longest_wear_streak': data['non-wear_flag'].eq(0).astype(int).groupby(
            data['non-wear_flag'].ne(0).cumsum()).sum().max(),
        
        # Device Usage
        'screen_time_proxy': (data['light'] > data['light'].quantile(0.75)).mean(),
        'dark_environment_ratio': (data['light'] < data['light'].quantile(0.25)).mean(),
        'light_variation': data['light'].std() / data['light'].mean() if data['light'].mean() != 0 else 0,
        
        # Battery Usage
        'battery_drain_rate': -np.polyfit(range(len(data)), data['battery_voltage'], 1)[0],
        'battery_variability': data['battery_voltage'].std(),
        'low_battery_time': (data['battery_voltage'] < data['battery_voltage'].quantile(0.1)).mean(),
        
        # Time-based
        'days_monitored': data['relative_date_PCIAT'].nunique(),
        'total_active_hours': len(data[data['magnitude'] > magnitude_threshold * 0.5]) * 5 / 3600,
        'activity_regularity': data.groupby('weekday')['magnitude'].mean().std()
    })
    
    # Variability Features for multiple columns
    for col in ['X', 'Y', 'Z', 'enmo', 'anglez']:
        features.update({
            f'{col}_skewness': data[col].skew(),
            f'{col}_kurtosis': data[col].kurtosis(),
            f'{col}_trend': np.polyfit(range(len(data)), data[col], 1)[0]
        })
    
    return pd.DataFrame([features])


### **Autoencoder**

In [24]:
class AutoEncoder(nn.Module):
    def __init__(self, input_dim, encoding_dim):
        super(AutoEncoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, encoding_dim*3),
            nn.ReLU(),
            nn.Linear(encoding_dim*3, encoding_dim*2),
            nn.ReLU(),
            nn.Linear(encoding_dim*2, encoding_dim),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, input_dim*2),
            nn.ReLU(),
            nn.Linear(input_dim*2, input_dim*3),
            nn.ReLU(),
            nn.Linear(input_dim*3, input_dim),
            nn.Sigmoid()
        )
        
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded


def perform_autoencoder(df, encoding_dim=50, epochs=50, batch_size=32):
    scaler = StandardScaler()
    df_scaled = scaler.fit_transform(df)
    
    data_tensor = torch.FloatTensor(df_scaled)
    
    input_dim = data_tensor.shape[1]
    autoencoder = AutoEncoder(input_dim, encoding_dim)
    
    criterion = nn.MSELoss()
    optimizer = optim.Adam(autoencoder.parameters())
    
    for epoch in range(epochs):
        for i in range(0, len(data_tensor), batch_size):
            batch = data_tensor[i : i + batch_size]
            optimizer.zero_grad()
            reconstructed = autoencoder(batch)
            loss = criterion(reconstructed, batch)
            loss.backward()
            optimizer.step()
            
        if (epoch + 1) % 10 == 0:
            print(f'Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}]')
                 
    with torch.no_grad():
        encoded_data = autoencoder.encoder(data_tensor).numpy()
        
    df_encoded = pd.DataFrame(encoded_data, columns=[f'Enc_{i + 1}' for i in range(encoded_data.shape[1])])
    
    return df_encoded

### **Preprocess function**

In [25]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

def preprocess_time_series_data(train_ts_path, test_ts_path, use_autoencoder=False, use_imputer=False, impute_method="mean"):
    train_ts = load_time_series(train_ts_path)
    test_ts = load_time_series(test_ts_path)
    
    if use_imputer:
        train_ts = impute(train_ts, method="mean")
        test_ts = impute(test_ts, method="mean")

    if use_autoencoder:
        df_train = train_ts.drop('id', axis=1)
        df_test = test_ts.drop('id', axis=1)
        
        train_ts_encoded = perform_autoencoder(df_train, encoding_dim=60, epochs=100, batch_size=32)
        test_ts_encoded = perform_autoencoder(df_test, encoding_dim=60, epochs=100, batch_size=32)

        train_ts_encoded['id']=train_ts["id"]
        test_ts_encoded['id']=test_ts["id"]

        train_ts = train_ts_encoded
        test_ts = test_ts_encoded

    return train_ts, test_ts

## **Merge CSV and time series data**

In [26]:
def merge_csv_and_time_series(train, test, train_ts, test_ts, use_time_series=False, use_numeric_imputation=False, numeric_impute_method="knn"):
    """
    Merge CSV and time-series data into a unified dataset and fill missing values using KNNImputer.

    Args:
        train (pd.DataFrame): Training data from CSV.
        test (pd.DataFrame): Test data from CSV.
        train_ts (pd.DataFrame): Aggregated training time-series data.
        test_ts (pd.DataFrame): Aggregated test time-series data.
        use_time_series (bool): Whether to include time-series data in the merged dataset.

    Returns:
        tuple: Merged training and test DataFrames.
    """
    if use_time_series:
        # Merge time-series data with train and test data on 'id'
        train = pd.merge(train, train_ts, how="inner", on='id')
        test = pd.merge(test, test_ts, how="inner", on='id')

    # Drop 'id' column after merging
    train = train.drop('id', axis=1)
    test = test.drop('id', axis=1)

    if use_time_series:
        # Feature selection
        time_series_cols = train_ts.columns.tolist()
        time_series_cols.remove("id")

    if np.any(np.isinf(train)):
        train = train.replace([np.inf, -np.inf], np.nan)

    if np.any(np.isinf(test)):
        test = test.replace([np.inf, -np.inf], np.nan)

    if use_numeric_imputation:
        train = impute(train, method=numeric_impute_method)
        test = impute(test, method=numeric_impute_method)

    featuresCols = train.columns.to_list()

    if use_time_series:
        featuresCols += time_series_cols

    # Dynamically filter features based on available columns
    featuresCols = [col for col in featuresCols if col in train.columns]
    print("Final features included in train:", featuresCols)

    # Filter features and drop rows with missing target 'sii'
    train = train[featuresCols]
    train = train.dropna(subset=['sii'])

    featuresCols.remove('sii')
    test = test[featuresCols]

    return train, test

<a id="hyperparameter-tuning"></a>
# **Hyperparameter Tuning with Optuna**
This section includes functions to optimize hyperparameters for LightGBM, XGBoost, and CatBoost.
    

<a id="lightgbm"></a>
### **LightGBM Hyperparameter Optimization**
Use Optuna to tune hyperparameters for LightGBM.

In [27]:
def optimize_lightgbm(train, target, n_trials=50):
    def objective(trial):
        params = {
            "learning_rate": trial.suggest_loguniform("learning_rate", 0.01, 0.3),
            "max_depth": trial.suggest_int("max_depth", 3, 12),
            "num_leaves": trial.suggest_int("num_leaves", 20, 3000),
            "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
            "feature_fraction": trial.suggest_uniform("feature_fraction", 0.4, 0.9),
            "bagging_fraction": trial.suggest_uniform("bagging_fraction", 0.4, 0.9),
            "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
            "lambda_l1": trial.suggest_loguniform("lambda_l1", 1e-8, 10.0),
            "lambda_l2": trial.suggest_loguniform("lambda_l2", 1e-8, 10.0),
            "random_state": 42,
            "n_estimators": 200
        }

        X_train, X_valid, y_train, y_valid = train_test_split(
            train, target, test_size=0.2, random_state=42, stratify=target
        )

        model = LGBMRegressor(**params)
        model.fit(
            X_train, y_train,
            eval_set=[(X_valid, y_valid)],
            eval_metric="rmse"
            # early_stopping_rounds=50
        )

        preds = model.predict(X_valid)
        score = quadratic_weighted_kappa(y_valid, preds.round(0).astype(int))
        return score

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)

    print("Best Score for LightGBM:", study.best_value)
    print("Best Params for LightGBM:", study.best_params)
    return study.best_params

<a id="xgboost"></a>
### **XGBoost Hyperparameter Optimization**
Use Optuna to tune hyperparameters for XGBoost.

In [28]:
def optimize_xgboost(train, target, n_trials=50):
    import optuna
    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor

    def is_gpu_available():
        try:
            import torch
            return torch.cuda.is_available()
        except ImportError:
            return False

    use_gpu = is_gpu_available()
    tree_method = "gpu_hist" if use_gpu else "hist"

    def objective(trial):
        params = {
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
            "reg_lambda": trial.suggest_float("lambda", 1e-5, 10.0),
            "reg_alpha": trial.suggest_float("alpha", 1e-5, 10.0),
            "tree_method": tree_method,
        }

        # Preprocess the training data to remove non-numeric columns
        X_train, X_valid, y_train, y_valid = train_test_split(
            train, target, test_size=0.2, random_state=42, stratify=target
        )

        model = XGBRegressor(**params)
        model.fit(
            X_train,
            y_train,
            eval_set=[(X_valid, y_valid)],
            eval_metric="rmse",
            verbose=0,
            early_stopping_rounds=50,
        )

        # Predict and evaluate
        preds = model.predict(X_valid)
        rounded_preds = np.round(preds).astype(int)  # Ensure preds are rounded here
        score = quadratic_weighted_kappa(y_valid, rounded_preds)
        return score

    # Run Optuna optimization
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)

    print("Best Score for XGBoost:", study.best_value)
    print("Best Params for XGBoost:", study.best_params)

    return study.best_params


<a id="catboost"></a>
### **CatBoost Hyperparameter Optimization**
Use Optuna to tune hyperparameters for CatBoost.

In [29]:
def optimize_catboost(train, target, n_trials=50):
    def objective(trial):
        params = {
            "learning_rate": trial.suggest_loguniform("learning_rate", 0.01, 0.3),
            "depth": trial.suggest_int("depth", 4, 12),
            "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 10.0),
            "iterations": 200,
            "task_type": "GPU",
            "random_state": 42
        }

        CatBoost_Params = {
    'learning_rate': 0.05,
    'depth': 6,
    'iterations': 200,
    'random_seed': SEED,
    'verbose': 0,
    'l2_leaf_reg': 10,  # Increase this value
    'task_type': 'GPU'

}

        X_train, X_valid, y_train, y_valid = train_test_split(
            train, target, test_size=0.2, random_state=42, stratify=target
        )

        model = CatBoostRegressor(**params, verbose=0)
        model.fit(X_train, y_train, eval_set=(X_valid, y_valid))

        preds = model.predict(X_valid)
        score = quadratic_weighted_kappa(y_valid, preds.round(0).astype(int))
        return score

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)

    print("Best Score for CatBoost:", study.best_value)
    print("Best Params for CatBoost:", study.best_params)
    return study.best_params

### **Prepare Model Optimized Hyperparameter**
Create a dictionary of models' best hyperparameters

In [30]:
def models_best_params(use_lightgbm=False, use_xgboost=False, use_catboost= False):
    best_params_dict = {}
    if use_lightgbm:
        best_params_lgbm = optimize_lightgbm(train, target, n_trials=50)
        best_params_dict['LightGBM'] = best_params_lgbm
    if use_xgboost:
        best_params_xgb = optimize_xgboost(train, target, n_trials=50)
        best_params_dict['XGBoost'] = best_params_xgb
    if use_catboost:
        best_params_catboost = optimize_catboost(train, target, n_trials=50)
        best_params_dict['CatBoost'] = best_params_catboost
    return best_params_dict
        

<a id="model-training"></a>
# **Model Training**
Train multiple models with the optimized hyperparameters.
    

In [31]:
def train_all_models_with_cv(train, test, target, best_params_dict, ensemble_method=None, n_splits=5):
    """
    Train models using cross-validation, optionally using Voting or Stacking.

    Args:
        train (pd.DataFrame): Training data.
        test (pd.DataFrame): Test data.
        target (pd.Series): Target variable.
        best_params_dict (dict): Optimized hyperparameters for each model.
        ensemble_method (str): 'voting', 'stacking', or None for individual models.
        n_splits (int): Number of CV splits.

    Returns:
        dict: Aggregated predictions for each model or ensemble.
    """
    predictions = {}
    # target_binned = np.digitize(target, bins=np.linspace(target.min(), target.max(), 5))
    trained_models = {}

    models = []
    for model_name, params in best_params_dict.items():
        if model_name == "LightGBM":
            model = LGBMRegressor(**params)
        elif model_name == "XGBoost":
            model = XGBRegressor(**params)
        elif model_name == "CatBoost":
            model = CatBoostRegressor(**params)
        else:
            raise ValueError(f"Unsupported model: {model_name}")
        models.append((model_name, model))

    if ensemble_method == 'voting':
        # Create a Voting Regressor
        ensemble_model = VotingRegressor(estimators=models)
    elif ensemble_method == 'stacking':
        # Create a Stacking Regressor
        ensemble_model = StackingRegressor(estimators=models, final_estimator=LinearRegression())

    for model_name, model in ([('Ensemble', ensemble_model)] if ensemble_method else models):
        print(f"Training model with CV: {model_name}")
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
        # fold_predictions = np.zeros(test.shape[0])

        train_S = []
        test_S = []
        
        oof_non_rounded = np.zeros(len(target), dtype=float) 
        oof_rounded = np.zeros(len(target), dtype=int) 
        test_preds = np.zeros((len(test), n_splits))

        for fold, (train_idx, test_idx) in enumerate(cv.split(train, target)):
            print(f"Training fold {fold + 1}/{n_splits}...")
            X_train, X_valid = train.iloc[train_idx], train.iloc[test_idx]
            y_train, y_valid = target.iloc[train_idx], target.iloc[test_idx]

            # Train the model
            model.fit(X_train, y_train)

            y_train_pred = model.predict(X_train)
            y_val_pred = model.predict(X_valid)

            oof_non_rounded[test_idx] = y_val_pred
            y_val_pred_rounded = y_val_pred.round(0).astype(int)
            oof_rounded[test_idx] = y_val_pred_rounded
    
            train_kappa = quadratic_weighted_kappa(y_train, y_train_pred.round(0).astype(int))
            val_kappa = quadratic_weighted_kappa(y_valid, y_val_pred_rounded)
    
            train_S.append(train_kappa)
            test_S.append(val_kappa)
            
            test_preds[:, fold] = model.predict(test)
            
            print(f"Fold {fold+1} - Train QWK: {train_kappa:.4f}, Validation QWK: {val_kappa:.4f}")
            # clear_output(wait=True)

        print(f"Mean Train QWK --> {np.mean(train_S):.4f}")
        print(f"Mean Validation QWK ---> {np.mean(test_S):.4f}")
    
        KappaOPtimizer = minimize(evaluate_predictions,
                                  x0=[0.5, 1.5, 2.5], args=(target, oof_non_rounded), 
                                  method='Nelder-Mead')

        print("KappaOPtimizer.x =",  KappaOPtimizer.x)
        assert KappaOPtimizer.success, "Optimization did not converge."
        
        oof_tuned = threshold_Rounder(oof_non_rounded, KappaOPtimizer.x)
        tKappa = quadratic_weighted_kappa(target, oof_tuned)
    
        print(f"----> || Optimized QWK SCORE :: {tKappa:.3f}")
    
        tpm = test_preds.mean(axis=1)
        tpTuned = threshold_Rounder(tpm, KappaOPtimizer.x)

        #     # Aggregate test predictions
        #     fold_predictions += model.predict(test) / n_splits

        predictions[model_name] = tpTuned
        trained_models[model_name] = model

    return predictions, trained_models

<a id="submission"></a>
# **Submission**
Generate and save the final submission file.
    

In [32]:
def generate_submission_file(predictions, sample):
    submission = pd.DataFrame({
        "id": sample["id"],
        "sii": np.round(predictions).astype(int)
    })
    submission.to_csv("submission.csv", index=False)
    print("Submission saved!")
    return submission

<a id="main-pipeline"></a>
# **Main Pipeline**
Run the entire pipeline.
    

In [33]:
import optuna
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

SEED = 42
n_folds = 5

# Set up logging for Optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)  # Limit verbosity

# Paths
train_path = '/kaggle/input/child-mind-institute-problematic-internet-use/train.csv'
test_path = '/kaggle/input/child-mind-institute-problematic-internet-use/test.csv'
sample_path = '/kaggle/input/child-mind-institute-problematic-internet-use/sample_submission.csv'
time_series_train = "/kaggle/input/child-mind-institute-problematic-internet-use/series_train.parquet"
time_series_test = "/kaggle/input/child-mind-institute-problematic-internet-use/series_test.parquet"

# Toggle for using time-series data
use_time_series = False  # Set to False to skip time-series data

# Preprocess data
train, test, sample = preprocess_csv_data(train_path, test_path, sample_path)

if use_time_series:
    train_ts, test_ts = preprocess_time_series_data(time_series_train, time_series_test, use_autoencoder=True, use_imputer=True, impute_method="mean")
else:
    train_ts, test_ts = None, None
train, test = merge_csv_and_time_series(train, test, train_ts, test_ts, use_time_series=use_time_series, use_numeric_imputation=True, numeric_impute_method="knn")

train = train.dropna(subset=['sii'])

# Target and features
target = train["sii"]
train = train.drop(columns=["sii"])  # Drop `sii` from features


# Ensure `sii` is not in test data
if "sii" in test.columns:
    test = test.drop(columns=["sii"])

if "id" in train.columns:
    train = train.drop(columns=["id"])
    test = test.drop(columns=["id"])

Train shape after feature engineering:  (3960, 99)
Test shape after feature engineering:  (20, 77)
Train shape after removing columns:  (3960, 65)
Test shape after removing columns:  (20, 64)
Final features included in train: ['Basic_Demos-Age', 'Basic_Demos-Sex', 'CGAS-CGAS_Score', 'Physical-Height', 'Physical-Weight', 'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP', 'Fitness_Endurance-Max_Stage', 'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec', 'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_PU', 'FGC-FGC_PU_Zone', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR', 'FGC-FGC_SRR_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone', 'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI', 'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num', 'PAQ_A-PAQ_A_Total', 'PAQ_C-PAQ_C_Total', 'SDS-SDS_Total_Raw', 'SDS-SDS_Total_T', 'PreInt_EduHx-computerinternet_hoursday', 'BMI_Age', 'Internet_Hours_Age', 'BMI_Internet_Hours',

In [34]:
# Model parameters for LightGBM
Params = {
    'learning_rate': 0.046,
    'max_depth': 12,
    'num_leaves': 478,
    'min_data_in_leaf': 13,
    'feature_fraction': 0.893,
    'bagging_fraction': 0.784,
    'bagging_freq': 4,
    'lambda_l1': 10,  # Increased from 6.59
    'lambda_l2': 0.01,  # Increased from 2.68e-06
    'device': 'cpu'

}


# XGBoost parameters
XGB_Params = {
    'learning_rate': 0.05,
    'max_depth': 6,
    'n_estimators': 200,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 1,  # Increased from 0.1
    'reg_lambda': 5,  # Increased from 1
    'random_state': SEED,
    'tree_method': 'gpu_hist',

}


CatBoost_Params = {
    'learning_rate': 0.05,
    'depth': 6,
    'iterations': 200,
    'random_seed': SEED,
    'verbose': 0,
    'l2_leaf_reg': 10,  # Increase this value
    'task_type': 'GPU'

}

In [35]:
best_params_dict = {'LightGBM': Params, 'XGBoost': XGB_Params, 'CatBoost': CatBoost_Params}

In [36]:
# Train model and make predictions using cross-validation
predictions, trained_models = train_all_models_with_cv(train, test, target, best_params_dict, ensemble_method='voting')
# predictions, trained_models = train_all_models_with_cv(train, test, target, {'XGBoost':xgb_params_v1})

# Generate submission
final_predictions = next(iter(predictions.values()))
submission = generate_submission_file(final_predictions, sample)

Training model with CV: Ensemble
Training fold 1/5...
Fold 1 - Train QWK: 0.7491, Validation QWK: 0.3590
Training fold 2/5...
Fold 2 - Train QWK: 0.7478, Validation QWK: 0.4480
Training fold 3/5...
Fold 3 - Train QWK: 0.7437, Validation QWK: 0.4093
Training fold 4/5...
Fold 4 - Train QWK: 0.7605, Validation QWK: 0.3314
Training fold 5/5...
Fold 5 - Train QWK: 0.7623, Validation QWK: 0.3262
Mean Train QWK --> 0.7527
Mean Validation QWK ---> 0.3748
KappaOPtimizer.x = [0.52370001 0.91649573 2.92793475]
----> || Optimized QWK SCORE :: 0.453
Submission saved!
