This notebook demonstrates a complete machine learning pipeline, including:
- Preprocessing CSV and time-series data.
- Using Optuna to tune hyperparameters for LightGBM, XGBoost, and CatBoost.
- Training multiple models with optimized parameters.
- Combining predictions using an ensemble method.
- Generating a Kaggle submission.
    

# **Table of Contents**
1. [Introduction](#introduction)
2. [Libraries and Utilities](#libraries-and-utilities)
3. [Preprocessing](#preprocessing)
    - [Preprocessing CSV Data](#preprocessing-csv-data)
    - [Preprocessing Time-Series Data](#preprocessing-time-series-data)
    - [Merging Preprocessed Data](#merging-preprocessed-data)
4. [Hyperparameter Tuning with Optuna](#hyperparameter-tuning)
    - [LightGBM](#lightgbm)
    - [XGBoost](#xgboost)
    - [CatBoost](#catboost)
5. [Model Training](#model-training)
6. [Ensemble Predictions](#ensemble-predictions)
7. [Submission](#submission)
8. [Main Pipeline](#main-pipeline)
    

<a id="libraries-and-utilities"></a>
# **Libraries and Utilities**
Import the required libraries, including Optuna for hyperparameter optimization.
    

## **Install packages and import libraries**

In [1]:
!pip -q install /kaggle/input/pytorchtabnet/pytorch_tabnet-4.1.0-py3-none-any.whl

[0m

In [2]:
!pip uninstall protobuf -y
!pip install /kaggle/input/ft-transformer/protobuf-3.19.6-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!pip install /kaggle/input/ft-transformer/tabtransformertf-0.0.8-py3-none-any.whl
!pip install /kaggle/input/ft-transformer/tensorflow_addons-0.19.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Found existing installation: protobuf 3.20.3
Uninstalling protobuf-3.20.3:
  Successfully uninstalled protobuf-3.20.3
[0mProcessing /kaggle/input/ft-transformer/protobuf-3.19.6-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Installing collected packages: protobuf
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 21.12.2 requires cupy-cuda115, which is not installed.
tfx-bsl 1.12.0 requires google-api-python-client<2,>=1.7.11, but you have google-api-python-client 2.79.0 which is incompatible.
tfx-bsl 1.12.0 requires pyarrow<7,>=6, but you have pyarrow 5.0.0 which is incompatible.
tensorflow-transform 1.12.0 requires pyarrow<7,>=6, but you have pyarrow 5.0.0 which is incompatible.
onnx 1.13.1 requires protobuf<4,>=3.20.2, but you have protobuf 3.19.6 which is incompatible.
apache-beam 2.44.0 requires dill<0.3.2,>=0.3.1.1, but you h

In [3]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf
import tensorflow_addons as tfa
from tabtransformertf.utils.preprocessing import df_to_dataset, build_categorical_prep
from tabtransformertf.models.fttransformer import FTTransformerEncoder, FTTransformer
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [4]:
# Import necessary libraries
import numpy as np  # Linear algebra
import pandas as pd  # Data processing, CSV file I/O
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import cohen_kappa_score
from sklearn.ensemble import VotingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression

from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

import tensorflow as tf
import tensorflow_addons as tfa
from tabtransformertf.utils.preprocessing import df_to_dataset, build_categorical_prep
from tabtransformertf.models.fttransformer import FTTransformerEncoder, FTTransformer

from keras.models import Model
from keras.layers import Input, Dense
from keras.optimizers import Adam

import torch
import torch.nn as nn
import torch.optim as optim

from scipy.optimize import minimize
import optuna
from tqdm import tqdm
import warnings
import os

from concurrent.futures import ThreadPoolExecutor

warnings.filterwarnings("ignore")

In [5]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
import optuna
from tqdm import tqdm
import warnings
import os
import shutil
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
import matplotlib.pyplot as plt
from keras.models import Model
from keras.layers import Input, Dense
from keras.optimizers import Adam
import torch
import torch.nn as nn
import torch.optim as optim

from sklearn.decomposition import PCA
from torch.utils.data import DataLoader, TensorDataset

from scipy.optimize import minimize

from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import VotingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression

import seaborn as sns



warnings.filterwarnings("ignore")


## **Utilities**

In [6]:
def quadratic_weighted_kappa(y_true, y_pred):
    return cohen_kappa_score(y_true, y_pred, weights='quadratic')

def threshold_Rounder(oof_non_rounded, thresholds):
    return np.where(oof_non_rounded < thresholds[0], 0,
                    np.where(oof_non_rounded < thresholds[1], 1,
                             np.where(oof_non_rounded < thresholds[2], 2, 3)))

def evaluate_predictions(thresholds, y_true, oof_non_rounded):
    rounded_p = threshold_Rounder(oof_non_rounded, thresholds)
    return -quadratic_weighted_kappa(y_true, rounded_p)

In [7]:
def clean_kaggle_working_directory(working_dir):
    """
    Cleans up all files and folders in the specified Kaggle working directory.

    Args:
        working_dir (str): Path to the Kaggle working directory.

    Returns:
        None
    """
    for item in os.listdir(working_dir):
        item_path = os.path.join(working_dir, item)
        try:
            if os.path.isfile(item_path) or os.path.islink(item_path):
                os.unlink(item_path)  # Remove file or symbolic link
            elif os.path.isdir(item_path):
                shutil.rmtree(item_path)  # Remove directory
            print(f"Deleted: {item_path}")
        except Exception as e:
            print(f"Failed to delete {item_path}: {e}")

<a id="preprocessing"></a>
# **Preprocessing**
This section contains the preprocessing functions for CSV and time-series data, as well as merging them.
    

## **Imputer**

In [8]:
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer

def impute(df, method="knn", n_neighbors=5):
    # impute categorical columns
    if method == "cat":
        categorical_cols = df.select_dtypes(include=["object", "category"]).columns
        cat_imputer = SimpleImputer(strategy="most_frequent")
        df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])
        return df

    # get numeric columns instead of 'sii'
    numeric_cols = df.select_dtypes(include=['number', 'float64', 'int64']).columns
    numeric_cols = list(numeric_cols)
    if "sii" in numeric_cols:
        numeric_cols.remove("sii")
    numeric_cols = pd.Index(numeric_cols)
    
    if method == "knn":
        imputer = KNNImputer(n_neighbors=n_neighbors)
        # impute_data = imputer.fit_transform(df[numeric_cols])
    if method == "mean":
        imputer = SimpleImputer(strategy="mean")
    if method == "median":
        imputer = SimpleImputer(strategy="median")
    
    imputed_data = imputer.fit_transform(df[numeric_cols])
    df_imputed = pd.DataFrame(imputed_data, columns=numeric_cols)
    for col in df.columns:
        if col not in numeric_cols:
            df_imputed[col] = df[col]          
    df = df_imputed
    
    return df

## **Preprocess CSV data**

### **Feature engineering for CSV data**

In [9]:
def csv_feature_engineering(df):
    season_cols = [col for col in df.columns if 'Season' in col]
    df = df.drop(season_cols, axis=1) 
    df['BMI_Age'] = df['Physical-BMI'] * df['Basic_Demos-Age']
    df['Internet_Hours_Age'] = df['PreInt_EduHx-computerinternet_hoursday'] * df['Basic_Demos-Age']
    df['BMI_Internet_Hours'] = df['Physical-BMI'] * df['PreInt_EduHx-computerinternet_hoursday']
    df['BFP_BMI'] = df['BIA-BIA_Fat'] / df['BIA-BIA_BMI']
    df['FFMI_BFP'] = df['BIA-BIA_FFMI'] / df['BIA-BIA_Fat']
    df['FMI_BFP'] = df['BIA-BIA_FMI'] / df['BIA-BIA_Fat']
    df['LST_TBW'] = df['BIA-BIA_LST'] / df['BIA-BIA_TBW']
    df['BFP_BMR'] = df['BIA-BIA_Fat'] * df['BIA-BIA_BMR']
    df['BFP_DEE'] = df['BIA-BIA_Fat'] * df['BIA-BIA_DEE']
    df['BMR_Weight'] = df['BIA-BIA_BMR'] / df['Physical-Weight']
    df['DEE_Weight'] = df['BIA-BIA_DEE'] / df['Physical-Weight']
    df['SMM_Height'] = df['BIA-BIA_SMM'] / df['Physical-Height']
    df['Muscle_to_Fat'] = df['BIA-BIA_SMM'] / df['BIA-BIA_FMI']
    df['Hydration_Status'] = df['BIA-BIA_TBW'] / df['Physical-Weight']
    df['ICW_TBW'] = df['BIA-BIA_ICW'] / df['BIA-BIA_TBW']

    df['Age_Weight'] = df['Basic_Demos-Age'] * df['Physical-Weight']
    df['Sex_BMI'] = df['Basic_Demos-Sex'] * df['Physical-BMI']
    df['Sex_HeartRate'] = df['Basic_Demos-Sex'] * df['Physical-HeartRate']
    df['Age_WaistCirc'] = df['Basic_Demos-Age'] * df['Physical-Waist_Circumference']
    df['BMI_FitnessMaxStage'] = df['Physical-BMI'] * df['Fitness_Endurance-Max_Stage']
    df['Weight_GripStrengthDominant'] = df['Physical-Weight'] * df['FGC-FGC_GSD']
    df['Weight_GripStrengthNonDominant'] = df['Physical-Weight'] * df['FGC-FGC_GSND']
    df['HeartRate_FitnessTime'] = df['Physical-HeartRate'] * (df['Fitness_Endurance-Time_Mins'] + df['Fitness_Endurance-Time_Sec'])
    df['Age_PushUp'] = df['Basic_Demos-Age'] * df['FGC-FGC_PU']
    df['FFMI_Age'] = df['BIA-BIA_FFMI'] * df['Basic_Demos-Age']
    df['InternetUse_SleepDisturbance'] = df['PreInt_EduHx-computerinternet_hoursday'] * df['SDS-SDS_Total_Raw']
    df['CGAS_BMI'] = df['CGAS-CGAS_Score'] * df['Physical-BMI']
    df['CGAS_FitnessMaxStage'] = df['CGAS-CGAS_Score'] * df['Fitness_Endurance-Max_Stage']
    
    return df

After we analyzed the dataset, we observed that there are lot of unreasonable values in the dataset. Therefore, we decided to prune some anomaly samples.



In [10]:
def remove_outliers(df):
    input_length = len(df)
    df = df.drop(df[df['Physical-BMI'] <= 0].index)
    df = df.drop(df[df['Physical-Diastolic_BP'] <= 0].index)
    df = df.drop(df[df['Physical-Systolic_BP'] <= 0].index)
    df = df.drop(df[df['Physical-Diastolic_BP'] > 160].index)

    children = df[df['Basic_Demos-Age'] <= 12]
    df = df.drop(children[children['FGC-FGC_CU'] > 80].index)
    df = df.drop(children[children['FGC-FGC_GSND'] > 80].index)

    df = df.drop(df[df['BIA-BIA_BMI'] <= 0].index)
    df = df.drop(df[df['BIA-BIA_BMC'] > 1000].index)
    df = df.drop(df[df['BIA-BIA_BMR'] > 40000].index)
    df = df.drop(df[df['BIA-BIA_DEE'] > 60000].index)
    df = df.drop(df[df['BIA-BIA_ECW'] > 2000].index)
    df = df.drop(df[df['BIA-BIA_FFM'] > 2000].index)
    df = df.drop(df[df['BIA-BIA_ICW'] > 2000].index)
    df = df.drop(df[df['BIA-BIA_LDM'] > 2000].index)
    df = df.drop(df[df['BIA-BIA_LST'] > 2000].index)
    df = df.drop(df[df['BIA-BIA_SMM'] > 2000].index)
    df = df.drop(df[df['BIA-BIA_TBW'] > 2000].index)
    output_length = len(df)
    print (input_length, output_length)
    
    return df

### **Preprocess Function**

In [11]:
def preprocess_csv_data(train_path, test_path, sample_path):
    """
    Preprocess CSV data with proper handling of missing columns.
    Args:
        train_path (str): Path to the training CSV file.
        test_path (str): Path to the test CSV file.
        sample_path (str): Path to the sample submission CSV file
    Returns:
        tuple: Preprocessed training DataFrame, test DataFrame, and sample submission.
    """
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    sample = pd.read_csv(sample_path)

    # # remove outlier from train
    # train = remove_outliers(train)
    # print("Train shape after removing outlier: ", train.shape)

    # feature engineering for both train and test
    train = csv_feature_engineering(train)
    test = csv_feature_engineering(test)
    print("Train shape after feature engineering: ", train.shape)
    print("Test shape after feature engineering: ", test.shape)

    # Only use columns in both train and test
    # Ensure that the columns in `train` and `test` match
    # Remove some columns
    common_columns = test.columns.to_list() 
    remove_columns = ['BIA-BIA_LDM', 'Physical-Waist_Circumference', 'FGC-FGC_SRL', 'FGC-FGC_GSND', 'BIA-BIA_ECW', 'Physical-BMI', 'BIA-BIA_LST', 'BIA-BIA_SMM', 'BIA-BIA_FFM', 'BIA-BIA_TBW', 'BIA-BIA_BMR', 'BIA-BIA_ICW', 'BIA-BIA_DEE']
    common_columns = [col for col in common_columns if col not in remove_columns]
    
    train = train[["sii"] + common_columns]
    test = test[common_columns]
    print("Train shape after removing columns: ", train.shape)
    print("Test shape after removing columns: ", test.shape)

    return train, test, sample

## **Preprocess time series data**

In [12]:
def feature_engineering_ts(worn_data):
    """
    Perform feature engineering on the time-series data for a single participant.
    Returns a DataFrame with all derived features for this participant.
    """
    mvpa_threshold = 0.1
    vig_threshold = 0.5
    window_size = 12

    # Filter worn data
    worn_data = worn_data[worn_data['non-wear_flag'] == 0].copy()

    # Time of day conversions
    worn_data['time_of_day_hours'] = worn_data['time_of_day'] / 1e9 / 3600
    worn_data['day_time'] = worn_data['relative_date_PCIAT'] + (worn_data['time_of_day_hours'] / 24)
    worn_data['day_period'] = np.where(
        (worn_data['time_of_day_hours'] >= 8) & (worn_data['time_of_day_hours'] < 21),
        'day', 'night'
    )

    # Time differences
    worn_data['time_diff'] = (worn_data['day_time'].diff() * 86400).round(0)
    worn_data['measurement_after_gap'] = worn_data['time_diff'] > 5

    # Classify activity levels
    worn_data['activity_type'] = pd.cut(
        worn_data['enmo'],
        bins=[-np.inf, mvpa_threshold, vig_threshold, np.inf],
        labels=['low', 'moderate', 'vigorous']
    )

    # Aggregate activity periods
    activity_group = (
        (worn_data['activity_type'] != worn_data['activity_type'].shift()) |
        (worn_data['measurement_after_gap'])
    ).cumsum()

    activity_periods = worn_data.groupby(activity_group).agg(
        min=('day_time', 'min'),
        max=('day_time', 'max'),
        activity_type=('activity_type', 'first')
    )
    activity_periods['duration_sec'] = (activity_periods['max'] - activity_periods['min']) * 86400 + 5
    activity_periods = activity_periods[activity_periods['duration_sec'] >= 60]

    # Add day and transition number
    activity_periods['day'] = activity_periods['min'].astype(int)
    activity_periods['transition_num'] = (
        activity_periods.groupby('day')['activity_type']
        .apply(lambda x: (x != x.shift()).cumsum())
        .reset_index(level=0, drop=True)
    )

    # Calculate activity-level summary statistics
    activity_summary = {}
    for act_type in ['low', 'moderate']:
        activity_data = activity_periods[activity_periods['activity_type'] == act_type]
        if activity_data.empty:
            for stat in ['median', 'max', 'std']:
                activity_summary[f'{act_type}_duration_{stat}'] = 0
                activity_summary[f'{act_type}_count_periods_{stat}'] = 0
            continue
        stats = activity_data.groupby('day').agg(
            total_duration=('duration_sec', 'sum'),
            count_periods=('duration_sec', 'size')
        ).agg(['median', 'max', 'std'])

        for stat in ['median', 'max', 'std']:
            activity_summary[f'{act_type}_duration_{stat}'] = stats.loc[stat, 'total_duration']
            activity_summary[f'{act_type}_count_periods_{stat}'] = stats.loc[stat, 'count_periods']

    # Add daily transition statistics
    daily_transitions = activity_periods.groupby('day')['transition_num'].max()
    trans_stats = daily_transitions.agg(['median', 'max', 'std'])
    for stat in ['median', 'max', 'std']:
        activity_summary[f'transitions_{stat}'] = trans_stats[stat]

    # Hourly activity features
    hourly_activity = worn_data.groupby(
        [worn_data['relative_date_PCIAT'].astype(int),
         worn_data['time_of_day_hours'].astype(int),
         worn_data['day_period']]
    )['enmo'].agg(['mean', 'max'])

    features = hourly_activity['mean'].groupby(
        ['relative_date_PCIAT', 'day_period']
    ).agg(
        std_across_hours='std',
        peak_hour=lambda x: x.idxmax()[1] if not x.empty else 0,
        entropy=lambda x: -(x / x.sum() * np.log(x / x.sum() + 1e-9)).sum() if not x.empty else 0
    )

    # Day and night features
    if 'day' in features.index.get_level_values('day_period'):
        day_features = features.xs('day', level='day_period').agg(['median', 'max', 'std']).stack().to_frame().T
        day_features.columns = [f'{stat}_{feature}_day' for stat, feature in day_features.columns]
    else:
        day_features = pd.DataFrame(columns=[f'{stat}_{feature}_day' for stat in ['median', 'max', 'std'] for feature in ['std_across_hours', 'peak_hour', 'entropy']])

    if 'night' in features.index.get_level_values('day_period'):
        night_features = features.xs('night', level='day_period').agg(['median', 'max', 'std']).stack().to_frame().T
        night_features.columns = [f'{stat}_{feature}_night' for stat, feature in night_features.columns]
    else:
        night_features = pd.DataFrame(columns=[f'{stat}_{feature}_night' for stat in ['median', 'max', 'std'] for feature in ['std_across_hours', 'peak_hour', 'entropy']])

    def merge_mvpa_groups(df, allowed_gap=60, merge_gap=60):
        last_mvpa_time = df['day_time'].where(df['is_mvpa']).ffill().shift()
        mvpa_time_diff = ((df['day_time'] - last_mvpa_time) * 86400).round(0)
        mvpa_group = (
            (df['is_mvpa'] != df['is_mvpa'].shift()) |
            (df['time_diff'] >= allowed_gap)
        ).cumsum()
        is_mvpa_start = (
            (mvpa_group != mvpa_group.shift()) &
            df['is_mvpa']
        )
        group_increment = is_mvpa_start & (
            (mvpa_time_diff >= merge_gap) | last_mvpa_time.isnull()
        )
        merged_group = group_increment.cumsum()
        merged_group.loc[~df['is_mvpa']] = np.nan
        return merged_group
    
    # Calculate daily MVPA statistics
    worn_data['is_mvpa'] = worn_data['enmo'] > mvpa_threshold
    worn_data['mvpa_merged_group'] = merge_mvpa_groups(worn_data)

    mvpa_periods = worn_data[worn_data['is_mvpa']].groupby('mvpa_merged_group')['day_time'].agg(['min', 'max'])
    mvpa_periods['duration_sec'] = (mvpa_periods['max'] - mvpa_periods['min']) * 86400
    mvpa_periods = mvpa_periods[mvpa_periods['duration_sec'] >= 60]

    mvpa_periods['day'] = mvpa_periods['min'].astype(int)
    daily_stats = mvpa_periods.groupby('day').agg(
        total_duration=('duration_sec', 'sum'),
        count_periods=('duration_sec', 'size')
    )

    # Extract daily stats features
    daily_stats_features = daily_stats.agg(
        ['median', 'max', 'std']
    ).unstack().to_frame().T

    daily_stats_features.columns = [
        f'{stat}_{feature}' for feature, stat in daily_stats_features.columns
    ]

    # Combine all features
    combined_features = pd.concat(
        [pd.DataFrame([activity_summary]), day_features, night_features, daily_stats_features],
        axis=1
    )
    # combined_features.fillna(0, inplace=True)  # Ensure no missing values

    return combined_features


In [13]:
def process_file_and_engineer_features(filename, dirname, output_dir):
    """
    Process a single participant's Parquet file, apply feature engineering, and save features to disk.
    """
    try:
        # Load the Parquet file
        file_path = os.path.join(dirname, filename, 'part-0.parquet')
        df = pd.read_parquet(file_path)

        # Add participant ID
        participant_id = filename.split('=')[1]
        df['id'] = participant_id

        # Apply feature engineering
        features = feature_engineering_ts(df)

        # Add participant ID to features
        features['id'] = participant_id

        # Save the engineered features to disk
        output_file = os.path.join(output_dir, f"{participant_id}_features.parquet")
        features.to_parquet(output_file, index=False)
    except Exception as e:
        print(f"Error processing file {filename}: {e}")


def load_time_series(dirname, output_dir) -> pd.DataFrame:
    """
    Process all Parquet files, save engineered features to disk, and return the combined features.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # Process participants one by one
    for filename in tqdm(os.listdir(dirname), desc="Processing participants"):
        process_file_and_engineer_features(filename, dirname, output_dir)

    # Combine all saved features into a single DataFrame
    feature_files = [os.path.join(output_dir, f) for f in os.listdir(output_dir) if f.endswith("_features.parquet")]
    combined_features = pd.concat([pd.read_parquet(f) for f in tqdm(feature_files, desc="Combining features")], ignore_index=True)

    return combined_features

### **Autoencoder**

In [14]:
# Sparse Autoencoder
class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim, sparsity_weight=1e-5):
        super(SparseAutoencoder, self).__init__()
        self.sparsity_weight = sparsity_weight
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(16, 32),
            nn.ReLU(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, input_dim),
            nn.Sigmoid()  # Outputs in the range [0, 1]
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded


# Data preparation function
def prepare_data(data, scaler_type="MinMaxScaler"):
    """
    Prepares data for model training, ensuring only numeric columns are scaled.
    Returns PyTorch tensor.
    """
    numeric_cols = data.select_dtypes(include=[np.number]).columns
    if scaler_type == "RobustScaler":
        scaler = RobustScaler()
    else:
        scaler = MinMaxScaler()

    # Scale numeric columns
    data_scaled = scaler.fit_transform(data[numeric_cols])
    return torch.tensor(data_scaled, dtype=torch.float32), scaler


# PCA function
def apply_pca(data, n_components=0.95):
    pca = PCA(n_components=n_components)
    data_pca = pca.fit_transform(data)
    return data_pca, pca


# Early stopping
class EarlyStopping:
    def __init__(self, patience=10):
        self.patience = patience
        self.counter = 0
        self.best_loss = float("inf")
        self.early_stop = False

    def __call__(self, loss):
        if loss < self.best_loss:
            self.best_loss = loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True


# Training function
def perform_sparse_autoencoder(data, epochs=100, batch_size=32, learning_rate=0.001, patience=10, scaler_type="MinMaxScaler", use_pca=False, sparsity_weight=1e-5):
    # Preprocess data
    if use_pca:
        data, pca = apply_pca(data)

    data_tensor, scaler = prepare_data(data, scaler_type=scaler_type)
    train_data, val_data = train_test_split(data_tensor, test_size=0.2, random_state=42)

    # Ensure data is PyTorch tensors
    assert isinstance(train_data, torch.Tensor), "train_data must be a PyTorch tensor"
    assert isinstance(val_data, torch.Tensor), "val_data must be a PyTorch tensor"

    # DataLoader setup
    train_loader = DataLoader(TensorDataset(train_data), batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(TensorDataset(val_data), batch_size=batch_size, shuffle=False)

    # Model and optimizer
    model = SparseAutoencoder(input_dim=data_tensor.shape[1], sparsity_weight=sparsity_weight)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    criterion = nn.SmoothL1Loss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    stopper = EarlyStopping(patience=patience)

    # Training loop
    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        for batch in train_loader:
            batch = batch[0].to(device)
            optimizer.zero_grad()
            encoded, outputs = model(batch)

            # Reconstruction loss
            loss = criterion(outputs, batch)

            # Sparsity penalty
            l1_penalty = torch.mean(torch.abs(encoded))
            loss += sparsity_weight * l1_penalty

            loss.backward()
            optimizer.step()
            train_loss += loss.item() * batch.size(0)

        train_loss /= len(train_loader.dataset)

        # Validation
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for batch in val_loader:
                batch = batch[0].to(device)
                _, outputs = model(batch)
                loss = criterion(outputs, batch)
                val_loss += loss.item() * batch.size(0)

        val_loss /= len(val_loader.dataset)
        print(f"Epoch {epoch + 1}, Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}")

        # Early stopping
        stopper(val_loss)
        if stopper.early_stop:
            print(f"Early stopping at epoch {epoch + 1}")
            break

    # Return encoded features
    model.eval()
    with torch.no_grad():
        encoded_features, _ = model(data_tensor.to(device))
    encoded_features = encoded_features.cpu().numpy()

    # Convert back to DataFrame
    encoded_df = pd.DataFrame(encoded_features, columns=[f"feature_{i}" for i in range(encoded_features.shape[1])])
    return encoded_df

In [15]:
class AutoEncoder(nn.Module):
    def __init__(self, input_dim, encoding_dim):
        super(AutoEncoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, encoding_dim*3),
            nn.ReLU(),
            nn.Linear(encoding_dim*3, encoding_dim*2),
            nn.ReLU(),
            nn.Linear(encoding_dim*2, encoding_dim),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, input_dim*2),
            nn.ReLU(),
            nn.Linear(input_dim*2, input_dim*3),
            nn.ReLU(),
            nn.Linear(input_dim*3, input_dim),
            nn.Sigmoid()
        )
        
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded


def perform_autoencoder(df, encoding_dim=50, epochs=50, batch_size=32):
    # Keep only numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    df_numeric = df[numeric_cols]
    
    # Scale the numeric data
    scaler = StandardScaler()
    df_scaled = scaler.fit_transform(df_numeric)
    
    # Convert to a PyTorch tensor
    data_tensor = torch.FloatTensor(df_scaled)
    
    # Define the autoencoder model
    input_dim = data_tensor.shape[1]
    autoencoder = AutoEncoder(input_dim, encoding_dim)
    
    # Set up the loss function and optimizer
    criterion = nn.MSELoss()
    optimizer = optim.Adam(autoencoder.parameters())
    
    # Train the autoencoder
    for epoch in range(epochs):
        for i in range(0, len(data_tensor), batch_size):
            batch = data_tensor[i : i + batch_size]
            optimizer.zero_grad()
            reconstructed = autoencoder(batch)
            loss = criterion(reconstructed, batch)
            loss.backward()
            optimizer.step()
            
        if (epoch + 1) % 10 == 0:
            print(f'Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}]')
                 
    # Extract encoded data
    with torch.no_grad():
        encoded_data = autoencoder.encoder(data_tensor).numpy()
        
    # Return the encoded data as a DataFrame
    df_encoded = pd.DataFrame(encoded_data, columns=[f'Enc_{i + 1}' for i in range(encoded_data.shape[1])])
    
    return df_encoded

### **Preprocess function**

In [16]:
def preprocess_time_series_data(train_ts_path, test_ts_path, use_autoencoder=False, use_imputer=False, impute_method="mean", outdir='/kaggle/working/intermediate_results'):
    """
    Preprocess time-series data, including feature engineering and optional autoencoder-based encoding.
    Args:
        train_ts_path (str): Path to training time-series data.
        test_ts_path (str): Path to test time-series data.
        use_autoencoder (bool): Whether to encode features with a sparse autoencoder.
        use_imputer (bool): Whether to impute missing values.
        impute_method (str): Imputation method ("mean", "knn", etc.).
    Returns:
        tuple: Preprocessed training and test time-series DataFrames.
    """
    
    train_ts = load_time_series(train_ts_path, '/kaggle/working/final')
    test_ts = load_time_series(test_ts_path, '/kaggle/working/final_test')

    # # Apply feature engineering
    # train_ts = feature_engineering_ts(train_ts)
    # test_ts = feature_engineering_ts(test_ts)

    # Impute missing values
    if use_imputer:
        train_ts = impute(train_ts, method=impute_method)
        test_ts = impute(test_ts, method=impute_method)

    # Encode features with a sparse autoencoder
    if use_autoencoder:
        train_ts_encoded = perform_autoencoder(
            train_ts, encoding_dim=60, epochs=100, batch_size=32
        )
        test_ts_encoded = perform_autoencoder(
            test_ts, 
            encoding_dim=60, epochs=100, batch_size=32
        )
        train_ts_encoded['id'] = train_ts["id"]
        test_ts_encoded['id'] = test_ts["id"]

        return train_ts_encoded, test_ts_encoded

    return train_ts, test_ts

## **Merge CSV and time series data**

In [17]:
def merge_csv_and_time_series(train, test, train_ts, test_ts, use_time_series=False, use_numeric_imputation=False, numeric_impute_method="knn"):
    """
    Merge CSV and time-series data into a unified dataset and fill missing values using KNNImputer.

    Args:
        train (pd.DataFrame): Training data from CSV.
        test (pd.DataFrame): Test data from CSV.
        train_ts (pd.DataFrame): Aggregated training time-series data.
        test_ts (pd.DataFrame): Aggregated test time-series data.
        use_time_series (bool): Whether to include time-series data in the merged dataset.

    Returns:
        tuple: Merged training and test DataFrames.
    """
    featuresCols = train.columns.to_list()
    if use_time_series:
        # Merge time-series data with train and test data on 'id'
        train = pd.merge(train, train_ts, how="left", on='id')
        test = pd.merge(test, test_ts, how="left", on='id')

    # Drop 'id' column after merging
    train = train.drop('id', axis=1)
    test = test.drop('id', axis=1)

    if use_time_series:
        # Feature selection
        time_series_cols = train_ts.columns.tolist()
        time_series_cols.remove("id")

    if np.any(np.isinf(train)):
        train = train.replace([np.inf, -np.inf], np.nan)

    if np.any(np.isinf(test)):
        test = test.replace([np.inf, -np.inf], np.nan)

    if use_numeric_imputation:
        train = impute(train, method=numeric_impute_method)
        test = impute(test, method=numeric_impute_method)

    if use_time_series:
        featuresCols += time_series_cols

    # Dynamically filter features based on available columns
    featuresCols = [col for col in featuresCols if col in train.columns]
    print("Final features included in train:", featuresCols)

    # Filter features and drop rows with missing target 'sii'
    train = train[featuresCols]
    train = train.dropna(subset=['sii'])

    featuresCols.remove('sii')
    test = test[featuresCols]

    return train, test

<a id="hyperparameter-tuning"></a>
# **Hyperparameter Tuning with Optuna**
This section includes functions to optimize hyperparameters for LightGBM, XGBoost, and CatBoost.
    

<a id="lightgbm"></a>
### **LightGBM Hyperparameter Optimization**
Use Optuna to tune hyperparameters for LightGBM.

In [18]:
def optimize_lightgbm(train, target, n_trials=50):
    def objective(trial):
        params = {
            "learning_rate": trial.suggest_loguniform("learning_rate", 0.01, 0.3),
            "max_depth": trial.suggest_int("max_depth", 3, 12),
            "num_leaves": trial.suggest_int("num_leaves", 20, 3000),
            "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
            "feature_fraction": trial.suggest_uniform("feature_fraction", 0.4, 0.9),
            "bagging_fraction": trial.suggest_uniform("bagging_fraction", 0.4, 0.9),
            "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
            "lambda_l1": trial.suggest_loguniform("lambda_l1", 1e-8, 10.0),
            "lambda_l2": trial.suggest_loguniform("lambda_l2", 1e-8, 10.0),
            "random_state": 42,
            "n_estimators": 200
        }

        X_train, X_valid, y_train, y_valid = train_test_split(
            train, target, test_size=0.2, random_state=42, stratify=target
        )

        model = LGBMRegressor(**params)
        model.fit(
            X_train, y_train,
            eval_set=[(X_valid, y_valid)],
            eval_metric="rmse"
            # early_stopping_rounds=50
        )

        preds = model.predict(X_valid)
        score = quadratic_weighted_kappa(y_valid, preds.round(0).astype(int))
        return score

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)

    print("Best Score for LightGBM:", study.best_value)
    print("Best Params for LightGBM:", study.best_params)
    return study.best_params

<a id="xgboost"></a>
### **XGBoost Hyperparameter Optimization**
Use Optuna to tune hyperparameters for XGBoost.

In [19]:
def optimize_xgboost(train, target, n_trials=50):
    import optuna
    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor

    def is_gpu_available():
        try:
            import torch
            return torch.cuda.is_available()
        except ImportError:
            return False

    use_gpu = is_gpu_available()
    tree_method = "gpu_hist" if use_gpu else "hist"

    def objective(trial):
        params = {
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
            "reg_lambda": trial.suggest_float("lambda", 1e-5, 10.0),
            "reg_alpha": trial.suggest_float("alpha", 1e-5, 10.0),
            "tree_method": tree_method,
        }

        # Preprocess the training data to remove non-numeric columns
        X_train, X_valid, y_train, y_valid = train_test_split(
            train, target, test_size=0.2, random_state=42, stratify=target
        )

        model = XGBRegressor(**params)
        model.fit(
            X_train,
            y_train,
            eval_set=[(X_valid, y_valid)],
            eval_metric="rmse",
            verbose=0,
            early_stopping_rounds=50,
        )

        # Predict and evaluate
        preds = model.predict(X_valid)
        rounded_preds = np.round(preds).astype(int)  # Ensure preds are rounded here
        score = quadratic_weighted_kappa(y_valid, rounded_preds)
        return score

    # Run Optuna optimization
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)

    print("Best Score for XGBoost:", study.best_value)
    print("Best Params for XGBoost:", study.best_params)

    return study.best_params


<a id="catboost"></a>
### **CatBoost Hyperparameter Optimization**
Use Optuna to tune hyperparameters for CatBoost.

In [20]:
def optimize_catboost(train, target, n_trials=50):
    def objective(trial):
        params = {
            "learning_rate": trial.suggest_loguniform("learning_rate", 0.01, 0.3),
            "depth": trial.suggest_int("depth", 4, 12),
            "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 10.0),
            "iterations": 200,
            "task_type": "GPU",
            "random_state": 42
        }

        CatBoost_Params = {
    'learning_rate': 0.05,
    'depth': 6,
    'iterations': 200,
    'random_seed': SEED,
    'verbose': 0,
    'l2_leaf_reg': 10,  # Increase this value
    'task_type': 'GPU'

}

        X_train, X_valid, y_train, y_valid = train_test_split(
            train, target, test_size=0.2, random_state=42, stratify=target
        )

        model = CatBoostRegressor(**params, verbose=0)
        model.fit(X_train, y_train, eval_set=(X_valid, y_valid))

        preds = model.predict(X_valid)
        score = quadratic_weighted_kappa(y_valid, preds.round(0).astype(int))
        return score

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)

    print("Best Score for CatBoost:", study.best_value)
    print("Best Params for CatBoost:", study.best_params)
    return study.best_params

### **Prepare Model Optimized Hyperparameter**
Create a dictionary of models' best hyperparameters

In [21]:
def models_best_params(use_lightgbm=False, use_xgboost=False, use_catboost= False):
    best_params_dict = {}
    if use_lightgbm:
        best_params_lgbm = optimize_lightgbm(train, target, n_trials=50)
        best_params_dict['LightGBM'] = best_params_lgbm
    if use_xgboost:
        best_params_xgb = optimize_xgboost(train, target, n_trials=50)
        best_params_dict['XGBoost'] = best_params_xgb
    if use_catboost:
        best_params_catboost = optimize_catboost(train, target, n_trials=50)
        best_params_dict['CatBoost'] = best_params_catboost
    return best_params_dict
        

### **FT-Transformer wrapper**

In [22]:
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
import random


class FTTransformerWrapper(BaseEstimator, RegressorMixin):
    def __init__(self, 
                 numerical_features=None,
                 categorical_features=None,
                 embedding_dim=64,
                 depth=4,
                 heads=8,
                 attn_dropout=0.3,
                 ff_dropout=0.3,
                 out_dim=1,
                 out_activation='relu',
                 lr=0.001,
                 weight_decay=0.0001,
                 loss=None,
                 metrics=None,
                 epochs=1000,
                 patience=10,
                 batch_size=1024,
                 shuffle=True,
                 seed=42):  # Add seed parameter
        # Model hyperparameters
        self.numerical_features = numerical_features
        self.categorical_features = categorical_features
        self.embedding_dim = embedding_dim
        self.depth = depth
        self.heads = heads
        self.attn_dropout = attn_dropout
        self.ff_dropout = ff_dropout
        self.out_dim = out_dim
        self.out_activation = out_activation
        
        # Training hyperparameters
        self.lr = lr
        self.weight_decay = weight_decay
        self.loss = loss or tf.keras.losses.MeanSquaredError()
        self.metrics = metrics or [tf.keras.metrics.RootMeanSquaredError()]
        self.epochs = epochs
        self.patience = patience
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed  # Save seed
        
        # Internal objects
        self.imputer = SimpleImputer(strategy='median')  # Handle missing values
        self.model = None
        self.history = None
        
        # Set random seeds for reproducibility
        self._set_random_seed(self.seed)
    
    def _set_random_seed(self, seed):
        """Set seeds for reproducibility."""
        np.random.seed(seed)
        tf.random.set_seed(seed)
        random.seed(seed)
        
    def _build_model(self):
        # Create the encoder
        encoder = FTTransformerEncoder(
            numerical_features=self.numerical_features,
            categorical_features=self.categorical_features,
            numerical_data=None,  # Placeholder, as data is processed separately
            categorical_data=None,  # Placeholder
            y=None,
            numerical_embedding_type='linear',
            embedding_dim=self.embedding_dim,
            depth=self.depth,
            heads=self.heads,
            attn_dropout=self.attn_dropout,
            ff_dropout=self.ff_dropout,
            explainable=True,
        )
        
        # Create the complete model
        model = FTTransformer(
            encoder=encoder,
            out_dim=self.out_dim,
            out_activation=self.out_activation,
        )
        
        # Compile the model
        optimizer = tfa.optimizers.AdamW(
            learning_rate=self.lr, 
            weight_decay=self.weight_decay
        )
        model.compile(
            optimizer=optimizer,
            loss=self.loss,
            metrics=self.metrics
        )
        
        return model
    
    def fit(self, X, y):
        self._set_random_seed(self.seed)
        
        # Handle missing values
        X_imputed = self.imputer.fit_transform(X)
        if hasattr(y, 'values'):
            y = y.values
        
        
        # Split data into train/validation sets with fixed seed
        train_data, val_data, train_target, val_target = train_test_split(
            X_imputed, y, test_size=0.2, random_state=self.seed
        )

        train_data = X_imputed
        train_target = y

        train_data_df = pd.DataFrame(train_data, columns = self.numerical_features)
        train_target_df = pd.DataFrame(train_target, columns=['sii'])
        
        train_data = pd.concat([train_data_df, train_target_df], axis=1)

        val_data_df = pd.DataFrame(val_data, columns = self.numerical_features)
        val_target_df = pd.DataFrame(val_target, columns=['sii'])
        
        val_data = pd.concat([val_data_df, val_target_df], axis=1)
           
        # Use df_to_dataset to create TensorFlow datasets
        train_dataset = df_to_dataset(train_data, 'sii', shuffle=self.shuffle, batch_size=self.batch_size)
        val_dataset = df_to_dataset(val_data, 'sii', shuffle=False, batch_size=self.batch_size)
        
        # Build and train the model
        self.model = self._build_model()
        early_stopping = tf.keras.callbacks.EarlyStopping(
            monitor="val_loss", mode="min", patience=self.patience
        )
        
        self.history = self.model.fit(
            train_dataset,
            epochs=self.epochs,
            validation_data=val_dataset,
            callbacks=[early_stopping],
            verbose=1
        )
        
        return self
    
    def predict(self, X):
        self._set_random_seed(self.seed)
        
        # Handle missing values
        X_imputed = self.imputer.transform(X)

        X_df = pd.DataFrame(X_imputed, columns = self.numerical_features)
        
        # Convert X to dataset using df_to_dataset
        predict_dataset = df_to_dataset(X_df, shuffle=False, batch_size=self.batch_size)
        
        # Predict using the trained model
        return self.model.predict(predict_dataset)['output'].flatten()


## **TabNet**

In [23]:
# New: TabNet
from pytorch_tabnet.tab_model import TabNetRegressor
import torch

import random
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = True
seed_everything(42)

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from pytorch_tabnet.callbacks import Callback
import os
import torch
from pytorch_tabnet.callbacks import Callback

class TabNetWrapper(BaseEstimator, RegressorMixin):
    def __init__(self, **kwargs):
        self.model = TabNetRegressor(**kwargs)
        self.kwargs = kwargs
        self.imputer = SimpleImputer(strategy='median')
        self.best_model_path = 'best_tabnet_model.pt'
        
    def fit(self, X, y):
        # Handle missing values
        X_imputed = self.imputer.fit_transform(X)
        
        if hasattr(y, 'values'):
            y = y.values
            
        # Create internal validation set
        X_train, X_valid, y_train, y_valid = train_test_split(
            X_imputed, 
            y, 
            test_size=0.2,
            random_state=42
        )
        
        # Train TabNet model
        history = self.model.fit(
            X_train=X_train,
            y_train=y_train.reshape(-1, 1),
            eval_set=[(X_valid, y_valid.reshape(-1, 1))],
            eval_name=['valid'],
            eval_metric=['mse'],
            max_epochs=200,
            patience=20,
            batch_size=1024,
            virtual_batch_size=128,
            num_workers=0,
            drop_last=False,
            callbacks=[
                TabNetPretrainedModelCheckpoint(
                    filepath=self.best_model_path,
                    monitor='valid_mse',
                    mode='min',
                    save_best_only=True,
                    verbose=True
                )
            ]
        )
        
        # Load the best model
        if os.path.exists(self.best_model_path):
            self.model.load_model(self.best_model_path)
            os.remove(self.best_model_path)  # Remove temporary file
        
        return self
    
    def predict(self, X):
        X_imputed = self.imputer.transform(X)
        return self.model.predict(X_imputed).flatten()
    
    def __deepcopy__(self, memo):
        # Add deepcopy support for scikit-learn
        cls = self.__class__
        result = cls.__new__(cls)
        memo[id(self)] = result
        for k, v in self.__dict__.items():
            setattr(result, k, deepcopy(v, memo))
        return result



class TabNetPretrainedModelCheckpoint(Callback):
    def __init__(self, filepath, monitor='val_loss', mode='min', 
                 save_best_only=True, verbose=1):
        super().__init__()  # Initialize parent class
        self.filepath = filepath
        self.monitor = monitor
        self.mode = mode
        self.save_best_only = save_best_only
        self.verbose = verbose
        self.best = float('inf') if mode == 'min' else -float('inf')
        
    def on_train_begin(self, logs=None):
        self.model = self.trainer  # Use trainer itself as model
        
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        current = logs.get(self.monitor)
        if current is None:
            return
        
        # Check if current metric is better than best
        if (self.mode == 'min' and current < self.best) or \
           (self.mode == 'max' and current > self.best):
            if self.verbose:
                print(f'\nEpoch {epoch}: {self.monitor} improved from {self.best:.4f} to {current:.4f}')
            self.best = current
            if self.save_best_only:
                self.model.save_model(self.filepath)  # Save the entire model

<a id="model-training"></a>
# **Model Training**
Train multiple models with the optimized hyperparameters.
    

In [24]:
def train_all_models_with_cv(train, test, target, best_params_dict, ensemble_method=None, weights=None, n_splits=5):
    """
    Train models using cross-validation, optionally using Voting or Stacking.

    Args:
        train (pd.DataFrame): Training data.
        test (pd.DataFrame): Test data.
        target (pd.Series): Target variable.
        best_params_dict (dict): Optimized hyperparameters for each model.
        ensemble_method (str): 'voting', 'stacking', or None for individual models.
        n_splits (int): Number of CV splits.

    Returns:
        dict: Aggregated predictions for each model or ensemble.
    """
    predictions = {}
    # target_binned = np.digitize(target, bins=np.linspace(target.min(), target.max(), 5))
    trained_models = {}

    models = []
    for model_name, params in best_params_dict.items():
        if model_name == "LightGBM":
            model = LGBMRegressor(**params, verbose=-1)
        elif model_name == "XGBoost":
            model = XGBRegressor(**params)
        elif model_name == "CatBoost":
            model = CatBoostRegressor(**params)
        elif model_name == "FTTransformer":
            model = FTTransformerWrapper(**params)
        elif model_name == "TabNet":
            model = TabNetWrapper(**params)
        else:
            raise ValueError(f"Unsupported model: {model_name}")
        models.append((model_name, model))

    if ensemble_method == 'voting':
        if weights:
            ensemble_model = VotingRegressor(estimators=models, weights=weights)
        # Create a Voting Regressor
        else:
            ensemble_model = VotingRegressor(estimators=models)
    elif ensemble_method == 'stacking':
        # Create a Stacking Regressor
        ensemble_model = StackingRegressor(estimators=models, final_estimator=LinearRegression())

    for model_name, model in ([('Ensemble', ensemble_model)] if ensemble_method else models):
        print(f"Training model with CV: {model_name}")
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
        # fold_predictions = np.zeros(test.shape[0])

        train_S = []
        test_S = []
        
        oof_non_rounded = np.zeros(len(target), dtype=float) 
        oof_rounded = np.zeros(len(target), dtype=int) 
        test_preds = np.zeros((len(test), n_splits))

        for fold, (train_idx, test_idx) in enumerate(cv.split(train, target)):
            print(f"Training fold {fold + 1}/{n_splits}...")
            X_train, X_valid = train.iloc[train_idx], train.iloc[test_idx]
            y_train, y_valid = target.iloc[train_idx], target.iloc[test_idx]

            # Train the model
            model.fit(X_train, y_train)

            y_train_pred = model.predict(X_train)
            y_val_pred = model.predict(X_valid)

            oof_non_rounded[test_idx] = y_val_pred
            y_val_pred_rounded = y_val_pred.round(0).astype(int)
            oof_rounded[test_idx] = y_val_pred_rounded
    
            train_kappa = quadratic_weighted_kappa(y_train, y_train_pred.round(0).astype(int))
            val_kappa = quadratic_weighted_kappa(y_valid, y_val_pred_rounded)
    
            train_S.append(train_kappa)
            test_S.append(val_kappa)
            
            test_preds[:, fold] = model.predict(test)
            
            print(f"Fold {fold+1} - Train QWK: {train_kappa:.4f}, Validation QWK: {val_kappa:.4f}")
            # clear_output(wait=True)

        print(f"Mean Train QWK --> {np.mean(train_S):.4f}")
        print(f"Mean Validation QWK ---> {np.mean(test_S):.4f}")
    
        KappaOPtimizer = minimize(evaluate_predictions,
                                  x0=[0.5, 1.5, 2.5], args=(target, oof_non_rounded), 
                                  method='Nelder-Mead')

        print("KappaOPtimizer.x =",  KappaOPtimizer.x)
        assert KappaOPtimizer.success, "Optimization did not converge."
        
        oof_tuned = threshold_Rounder(oof_non_rounded, KappaOPtimizer.x)
        tKappa = quadratic_weighted_kappa(target, oof_tuned)
    
        print(f"----> || Optimized QWK SCORE :: {tKappa:.3f}")
    
        tpm = test_preds.mean(axis=1)
        tpTuned = threshold_Rounder(tpm, KappaOPtimizer.x)

        predictions[model_name] = tpTuned
        trained_models[model_name] = model

    return predictions, trained_models

<a id="submission"></a>
# **Submission**
Generate and save the final submission file.
    

In [25]:
def generate_submission_file(predictions, sample):
    submission = pd.DataFrame({
        "id": sample["id"],
        "sii": np.round(predictions).astype(int)
    })
    submission.to_csv("submission.csv", index=False)
    print("Submission saved!")
    return submission

<a id="main-pipeline"></a>
# **Main Pipeline**
Run the entire pipeline.
    

### Model 1

In [26]:
import optuna
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

SEED = 42
n_folds = 5

# Set up logging for Optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)  # Limit verbosity

# Paths
train_path = '/kaggle/input/child-mind-institute-problematic-internet-use/train.csv'
test_path = '/kaggle/input/child-mind-institute-problematic-internet-use/test.csv'
sample_path = '/kaggle/input/child-mind-institute-problematic-internet-use/sample_submission.csv'
time_series_train = "/kaggle/input/child-mind-institute-problematic-internet-use/series_train.parquet"
time_series_test = "/kaggle/input/child-mind-institute-problematic-internet-use/series_test.parquet"

# Toggle for using time-series data
use_time_series = True  # Set to False to skip time-series data

# Preprocess data
train, test, sample = preprocess_csv_data(train_path, test_path, sample_path)

if use_time_series:
    train_ts, test_ts = preprocess_time_series_data(time_series_train, time_series_test, use_autoencoder=True, use_imputer=True, impute_method="mean")
else:
    train_ts, test_ts = None, None
train, test = merge_csv_and_time_series(train, test, train_ts, test_ts, use_time_series=use_time_series, use_numeric_imputation=True, numeric_impute_method="knn")

train = train.dropna(subset=['sii'])

# Target and features
target = train["sii"]
train = train.drop(columns=["sii"])  # Drop `sii` from features


# Ensure `sii` is not in test data
if "sii" in test.columns:
    test = test.drop(columns=["sii"])

if "id" in train.columns:
    train = train.drop(columns=["id"])
    test = test.drop(columns=["id"])

working_dir = '/kaggle/working'
clean_kaggle_working_directory(working_dir)

Train shape after feature engineering:  (3960, 99)
Test shape after feature engineering:  (20, 77)
Train shape after removing columns:  (3960, 65)
Test shape after removing columns:  (20, 64)


Processing participants: 100%|██████████| 996/996 [17:16<00:00,  1.04s/it]
Combining features: 100%|██████████| 996/996 [00:04<00:00, 216.41it/s]
Processing participants: 100%|██████████| 2/2 [00:00<00:00,  2.76it/s]
Combining features: 100%|██████████| 2/2 [00:00<00:00, 199.85it/s]


Epoch [10/100], Loss: 0.2930]
Epoch [20/100], Loss: 0.2688]
Epoch [30/100], Loss: 0.2607]
Epoch [40/100], Loss: 0.2589]
Epoch [50/100], Loss: 0.2568]
Epoch [60/100], Loss: 0.2552]
Epoch [70/100], Loss: 0.2538]
Epoch [80/100], Loss: 0.2535]
Epoch [90/100], Loss: 0.2520]
Epoch [100/100], Loss: 0.2533]
Epoch [10/100], Loss: 1.1297]
Epoch [20/100], Loss: 0.8606]
Epoch [30/100], Loss: 0.5252]
Epoch [40/100], Loss: 0.4616]
Epoch [50/100], Loss: 0.4615]
Epoch [60/100], Loss: 0.4615]
Epoch [70/100], Loss: 0.4615]
Epoch [80/100], Loss: 0.4615]
Epoch [90/100], Loss: 0.4615]
Epoch [100/100], Loss: 0.4615]
Final features included in train: ['sii', 'Basic_Demos-Age', 'Basic_Demos-Sex', 'CGAS-CGAS_Score', 'Physical-Height', 'Physical-Weight', 'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP', 'Fitness_Endurance-Max_Stage', 'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec', 'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_

In [27]:
train2, test2, target2 = train, test, target

In [28]:
train.shape

(2736, 123)

In [29]:
# Define the dataset
TARGET_FEATURE = 'sii'
ID_COLUMN = 'id'

# Automatically infer numeric and categorical features
NUMERIC_FEATURES = train.select_dtypes(include=['int64', 'float64']).columns.difference([TARGET_FEATURE, ID_COLUMN]).tolist()
CATEGORICAL_FEATURES = train.select_dtypes(include=['object', 'category']).columns.difference([TARGET_FEATURE, ID_COLUMN]).tolist()

In [30]:
CATEGORICAL_FEATURES

[]

In [31]:
# Old ver
FTTransformer_Params = {
    'numerical_features': NUMERIC_FEATURES,
    'categorical_features': CATEGORICAL_FEATURES,
    'embedding_dim': 64,
    'depth': 4,
    'heads': 8,
    'attn_dropout': 0.4,
    'ff_dropout': 0.4,
    'out_dim': 1,
    'out_activation': 'relu',
    'lr': 0.001,
    'weight_decay': 0.0001,
    'loss': tf.keras.losses.MeanSquaredError(),
    'metrics': [tf.keras.metrics.RootMeanSquaredError()],
    'epochs': 100,
    'patience': 5,
    'batch_size': 256,
    'shuffle': True,
    'seed': 64
}


In [32]:
# Model parameters for LightGBM
Params = {
    'learning_rate': 0.046,
    'max_depth': 12,
    'num_leaves': 478,
    'min_data_in_leaf': 13,
    'feature_fraction': 0.893,
    'bagging_fraction': 0.784,
    'bagging_freq': 4,
    'lambda_l1': 10,  # Increased from 6.59
    'lambda_l2': 0.01,  # Increased from 2.68e-06
    'device': 'cpu'

}


# XGBoost parameters
XGB_Params = {
    'learning_rate': 0.05,
    'max_depth': 6,
    'n_estimators': 200,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 1,  # Increased from 0.1
    'reg_lambda': 5,  # Increased from 1
    'random_state': SEED,
    'tree_method': 'auto',

}


CatBoost_Params = {
    'learning_rate': 0.05,
    'depth': 6,
    'iterations': 200,
    'random_seed': SEED,
    'verbose': 0,
    'l2_leaf_reg': 10,  # Increase this value
    'task_type': 'CPU'

}

In [33]:
weights=[4.0,5.0,4.0,3.0]

In [34]:
best_params_dict = {'LightGBM': Params, 'XGBoost': XGB_Params, 'CatBoost': CatBoost_Params, 'FTTransformer': FTTransformer_Params }

In [35]:
# Train model and make predictions using cross-validation
predictions, trained_models = train_all_models_with_cv(train, test, target, best_params_dict, ensemble_method='voting', weights=weights

# Generate submission
final_predictions = next(iter(predictions.values()))
submission = generate_submission_file(final_predictions, sample)

Training model with CV: Ensemble
Training fold 1/5...
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Fold 1 - Train QWK: 0.7274, Validation QWK: 0.3649
Training fold 2/5...
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Fold 2 - Train QWK: 0.7433, Validation QWK: 0.4220
Training fold 3/5...
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoc

## Model 2

### **Tabnet**

In [36]:
# New: TabNet
from pytorch_tabnet.tab_model import TabNetRegressor
import torch

import random
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = True
seed_everything(42)

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from pytorch_tabnet.callbacks import Callback
import os
import torch
from pytorch_tabnet.callbacks import Callback

class TabNetWrapper(BaseEstimator, RegressorMixin):
    def __init__(self, **kwargs):
        self.model = TabNetRegressor(**kwargs)
        self.kwargs = kwargs
        self.imputer = SimpleImputer(strategy='median')
        self.best_model_path = 'best_tabnet_model.pt'
        
    def fit(self, X, y):
        # Handle missing values
        X_imputed = self.imputer.fit_transform(X)
        
        if hasattr(y, 'values'):
            y = y.values
            
        # Create internal validation set
        X_train, X_valid, y_train, y_valid = train_test_split(
            X_imputed, 
            y, 
            test_size=0.2,
            random_state=42
        )
        
        # Train TabNet model
        history = self.model.fit(
            X_train=X_train,
            y_train=y_train.reshape(-1, 1),
            eval_set=[(X_valid, y_valid.reshape(-1, 1))],
            eval_name=['valid'],
            eval_metric=['mse'],
            max_epochs=200,
            patience=20,
            batch_size=1024,
            virtual_batch_size=128,
            num_workers=0,
            drop_last=False,
            callbacks=[
                TabNetPretrainedModelCheckpoint(
                    filepath=self.best_model_path,
                    monitor='valid_mse',
                    mode='min',
                    save_best_only=True,
                    verbose=True
                )
            ]
        )
        
        # Load the best model
        if os.path.exists(self.best_model_path):
            self.model.load_model(self.best_model_path)
            os.remove(self.best_model_path)  # Remove temporary file
        
        return self
    
    def predict(self, X):
        X_imputed = self.imputer.transform(X)
        return self.model.predict(X_imputed).flatten()
    
    def __deepcopy__(self, memo):
        # Add deepcopy support for scikit-learn
        cls = self.__class__
        result = cls.__new__(cls)
        memo[id(self)] = result
        for k, v in self.__dict__.items():
            setattr(result, k, deepcopy(v, memo))
        return result



class TabNetPretrainedModelCheckpoint(Callback):
    def __init__(self, filepath, monitor='val_loss', mode='min', 
                 save_best_only=True, verbose=1):
        super().__init__()  # Initialize parent class
        self.filepath = filepath
        self.monitor = monitor
        self.mode = mode
        self.save_best_only = save_best_only
        self.verbose = verbose
        self.best = float('inf') if mode == 'min' else -float('inf')
        
    def on_train_begin(self, logs=None):
        self.model = self.trainer  # Use trainer itself as model
        
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        current = logs.get(self.monitor)
        if current is None:
            return
        
        # Check if current metric is better than best
        if (self.mode == 'min' and current < self.best) or \
           (self.mode == 'max' and current > self.best):
            if self.verbose:
                print(f'\nEpoch {epoch}: {self.monitor} improved from {self.best:.4f} to {current:.4f}')
            self.best = current
            if self.save_best_only:
                self.model.save_model(self.filepath)  # Save the entire model

In [37]:
# TabNet hyperparameters
TabNet_Params = {
    'n_d': 64,              # Width of the decision prediction layer
    'n_a': 64,              # Width of the attention embedding for each step
    'n_steps': 5,           # Number of steps in the architecture
    'gamma': 1.5,           # Coefficient for feature selection regularization
    'n_independent': 2,     # Number of independent GLU layer in each GLU block
    'n_shared': 2,          # Number of shared GLU layer in each GLU block
    'lambda_sparse': 1e-4,  # Sparsity regularization
    'optimizer_fn': torch.optim.Adam,
    'optimizer_params': dict(lr=2e-2, weight_decay=1e-5),
    'mask_type': 'entmax',
    'scheduler_params': dict(mode="min", patience=10, min_lr=1e-5, factor=0.5),
    'scheduler_fn': torch.optim.lr_scheduler.ReduceLROnPlateau,
    'verbose': 1,
    'device_name': 'cuda' if torch.cuda.is_available() else 'cpu'
}

In [38]:
# Model parameters for LightGBM
Params2 = {
    'learning_rate': 0.046,
    'max_depth': 12,
    'num_leaves': 478,
    'min_data_in_leaf': 13,
    'feature_fraction': 0.893,
    'bagging_fraction': 0.784,
    'bagging_freq': 4,
    'lambda_l1': 10,  # Increased from 6.59
    'lambda_l2': 0.01,  # Increased from 2.68e-06
    'device': 'cpu'

}


# XGBoost parameters
XGB_Params2 = {
    'learning_rate': 0.05,
    'max_depth': 6,
    'n_estimators': 200,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 1,  # Increased from 0.1
    'reg_lambda': 5,  # Increased from 1
    'random_state': SEED,
    'tree_method': 'auto',

}


CatBoost_Params2 = {
    'learning_rate': 0.05,
    'depth': 6,
    'iterations': 200,
    'random_seed': SEED,
    'verbose': 0,
    'l2_leaf_reg': 10,  # Increase this value
    'task_type': 'CPU'

}

In [39]:
best_params_dict2 = {'LightGBM': Params2, 'XGBoost': XGB_Params2, 'CatBoost': CatBoost_Params2, 'TabNet': TabNet_Params}

In [40]:
weights2 = [4.0,5.0,4.0,4.0]

In [41]:
# Train model and make predictions using cross-validation
predictions2, trained_models2 = train_all_models_with_cv(train2, test2, target2, best_params_dict2, ensemble_method='voting', weights=weights2)

# Generate submission
final_predictions2 = next(iter(predictions.values()))
submission2 = generate_submission_file(final_predictions2, sample)

Training model with CV: Ensemble
Training fold 1/5...
epoch 0  | loss: 2.39934 | valid_mse: 51.45601|  0:00:00s

Epoch 0: valid_mse improved from inf to 51.4560
Successfully saved model at best_tabnet_model.pt.zip
epoch 1  | loss: 1.41375 | valid_mse: 26.73382|  0:00:00s

Epoch 1: valid_mse improved from 51.4560 to 26.7338
Successfully saved model at best_tabnet_model.pt.zip
epoch 2  | loss: 1.01562 | valid_mse: 11.47039|  0:00:00s

Epoch 2: valid_mse improved from 26.7338 to 11.4704
Successfully saved model at best_tabnet_model.pt.zip
epoch 3  | loss: 0.85514 | valid_mse: 3.77951 |  0:00:00s

Epoch 3: valid_mse improved from 11.4704 to 3.7795
Successfully saved model at best_tabnet_model.pt.zip
epoch 4  | loss: 0.75604 | valid_mse: 2.08962 |  0:00:01s

Epoch 4: valid_mse improved from 3.7795 to 2.0896
Successfully saved model at best_tabnet_model.pt.zip
epoch 5  | loss: 0.68189 | valid_mse: 1.45567 |  0:00:01s

Epoch 5: valid_mse improved from 2.0896 to 1.4557
Successfully saved model

## Model 3

In [42]:
import optuna
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

SEED = 42
n_folds = 5

# Set up logging for Optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)  # Limit verbosity

# Paths
train_path = '/kaggle/input/child-mind-institute-problematic-internet-use/train.csv'
test_path = '/kaggle/input/child-mind-institute-problematic-internet-use/test.csv'
sample_path = '/kaggle/input/child-mind-institute-problematic-internet-use/sample_submission.csv'
time_series_train = "/kaggle/input/child-mind-institute-problematic-internet-use/series_train.parquet"
time_series_test = "/kaggle/input/child-mind-institute-problematic-internet-use/series_test.parquet"

# Toggle for using time-series data
use_time_series = False  # Set to False to skip time-series data

# Preprocess data
train3, test3, sample3 = preprocess_csv_data(train_path, test_path, sample_path)

if use_time_series:
    train_ts3, test_ts3 = preprocess_time_series_data(time_series_train, time_series_test, use_autoencoder=True, use_imputer=True, impute_method="mean")
else:
    train_ts3, test_ts3 = None, None
train3, test3 = merge_csv_and_time_series(train3, test3, train_ts3, test_ts3, use_time_series=use_time_series, use_numeric_imputation=True, numeric_impute_method="knn")

train3 = train3.dropna(subset=['sii'])

# Target and features
target3 = train3["sii"]
train3 = train3.drop(columns=["sii"])  # Drop `sii` from features


# Ensure `sii` is not in test data
if "sii" in test3.columns:
    test3 = test3.drop(columns=["sii"])

if "id" in train3.columns:
    train3 = train3.drop(columns=["id"])
    test3 = test3.drop(columns=["id"])

working_dir = '/kaggle/working'
clean_kaggle_working_directory(working_dir)

Train shape after feature engineering:  (3960, 99)
Test shape after feature engineering:  (20, 77)
Train shape after removing columns:  (3960, 65)
Test shape after removing columns:  (20, 64)
Final features included in train: ['sii', 'Basic_Demos-Age', 'Basic_Demos-Sex', 'CGAS-CGAS_Score', 'Physical-Height', 'Physical-Weight', 'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP', 'Fitness_Endurance-Max_Stage', 'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec', 'FGC-FGC_CU', 'FGC-FGC_CU_Zone', 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'FGC-FGC_PU', 'FGC-FGC_PU_Zone', 'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR', 'FGC-FGC_SRR_Zone', 'FGC-FGC_TL', 'FGC-FGC_TL_Zone', 'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI', 'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num', 'PAQ_A-PAQ_A_Total', 'PAQ_C-PAQ_C_Total', 'SDS-SDS_Total_Raw', 'SDS-SDS_Total_T', 'PreInt_EduHx-computerinternet_hoursday', 'BMI_Age', 'Internet_Hours_Age', 'BMI_Internet_

In [43]:
# Model parameters for LightGBM
Params3 = {
    'learning_rate': 0.046,
    'max_depth': 12,
    'num_leaves': 478,
    'min_data_in_leaf': 13,
    'feature_fraction': 0.893,
    'bagging_fraction': 0.784,
    'bagging_freq': 4,
    'lambda_l1': 10,  # Increased from 6.59
    'lambda_l2': 0.01,  # Increased from 2.68e-06
    'device': 'cpu'

}


# XGBoost parameters
XGB_Params3 = {
    'learning_rate': 0.05,
    'max_depth': 6,
    'n_estimators': 200,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 1,  # Increased from 0.1
    'reg_lambda': 5,  # Increased from 1
    'random_state': SEED,
    'tree_method': 'auto',

}


CatBoost_Params3 = {
    'learning_rate': 0.05,
    'depth': 6,
    'iterations': 200,
    'random_seed': SEED,
    'verbose': 0,
    'l2_leaf_reg': 10,  # Increase this value
    'task_type': 'CPU'

}

In [44]:
best_params_dict3 = {'LightGBM': Params3, 'XGBoost': XGB_Params3, 'CatBoost': CatBoost_Params3}

In [45]:
# Train model and make predictions using cross-validation
predictions3, trained_models3 = train_all_models_with_cv(train3, test3, target3, best_params_dict3, ensemble_method='stacking')

# Generate submission
final_predictions3 = next(iter(predictions.values()))
submission3 = generate_submission_file(final_predictions3, sample)

Training model with CV: Ensemble
Training fold 1/5...
Fold 1 - Train QWK: 0.5183, Validation QWK: 0.3435
Training fold 2/5...
Fold 2 - Train QWK: 0.4806, Validation QWK: 0.4508
Training fold 3/5...
Fold 3 - Train QWK: 0.5718, Validation QWK: 0.3856
Training fold 4/5...
Fold 4 - Train QWK: 0.5938, Validation QWK: 0.3079
Training fold 5/5...
Fold 5 - Train QWK: 0.5466, Validation QWK: 0.3549
Mean Train QWK --> 0.5422
Mean Validation QWK ---> 0.3685
KappaOPtimizer.x = [0.53873933 0.96465826 2.86979107]
----> || Optimized QWK SCORE :: 0.461
Submission saved!


# **FINAL**

In [46]:
sub1 = submission
sub2 = submission2
sub3 = submission3

sub1 = sub1.sort_values(by='id').reset_index(drop=True)
sub2 = sub2.sort_values(by='id').reset_index(drop=True)
sub3 = sub3.sort_values(by='id').reset_index(drop=True)

combined = pd.DataFrame({
    'id': sub1['id'],
    'sii_1': sub1['sii'],
    'sii_2': sub2['sii'],
    'sii_3': sub3['sii']
})

def majority_vote(row):
    return row.mode()[0]

combined['final_sii'] = combined[['sii_1', 'sii_2', 'sii_3']].apply(majority_vote, axis=1)

final_submission = combined[['id', 'final_sii']].rename(columns={'final_sii': 'sii'})

final_submission.to_csv('submission.csv', index=False)

print("Majority voting completed and saved to 'Final_Submission.csv'")

Majority voting completed and saved to 'Final_Submission.csv'


In [47]:
final_submission

Unnamed: 0,id,sii
0,00008ff9,0
1,000fd460,0
2,00105258,0
3,00115b9f,0
4,0016bb22,1
5,001f3379,0
6,0038ba98,1
7,0068a485,0
8,0069fbed,1
9,0083e397,1
