# **Predict Student Performance from Game Play**

### Ho Chi Minh City University of Science

#### 21KHDL - Intelligent Data Analysis

#### Lecturers:
- Mr. Nguyễn Tiến Huy
- Mr. Nguyễn Trần Duy Minh
- Mr. Lê Thanh Tùng

#### Student:
- 21127038 - Võ Phú Hãn

<h1 style="text-align: center;"> Student Performance from Game Play - Model Building</h1>

# 0. Main idea

I'll train one XGBoost baseline model per question in the game, using 5-fold GroupKFold cross-validation. Each model uses all data recorded up to the question's checkpoint, including the events in the corresponding levels, to predict whether the session answers the current question correctly. Additionally, new features will be created to improve the model.

# 1. Import the Required Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import gc
import polars as pl

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from xgboost import plot_importance
import lightgbm as lgbm
from sklearn.model_selection import KFold, GroupKFold
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')

# 2. Preprocessing Data for Training

## Load and Prepare Train Data

Load `train.csv` with a predefined dtype schema.

In [2]:
dtypes={
    'elapsed_time':np.int32,
    'event_name':'category',
    'name':'category',
    'level':np.uint8,
    'room_coor_x':np.float32,
    'room_coor_y':np.float32,
    'screen_coor_x':np.float32,
    'screen_coor_y':np.float32,
    'hover_duration':np.float32,
    'text':'str',
    'fqid':'category',
    'room_fqid':'category',
    'text_fqid':'category',
    'fullscreen':'category',
    'hq':'category',
    'music':'category',
    'level_group':'category'
    }

dataset_df = pd.read_csv('/kaggle/input/predict-student-performance-from-game-play/train.csv', dtype=dtypes) ###, nrows=500000)
print("Full train dataset shape is {}".format(dataset_df.shape))

Full train dataset shape is (26296946, 20)


In [3]:
gc.collect()

21

Load `train_labels.csv`

In [4]:
labels = pd.read_csv('/kaggle/input/predict-student-performance-from-game-play/train_labels.csv')

Split `session_id` into `session` and `q` features

In [5]:
labels['session'] = labels.session_id.apply(lambda x: int(x.split('_')[0]) )
labels['q'] = labels.session_id.apply(lambda x: int(x.split('_')[-1][1:]) )

In [6]:
# Display the first 5 examples
labels.head(5)

Unnamed: 0,session_id,correct,session,q
0,20090312431273200_q1,1,20090312431273200,1
1,20090312433251036_q1,0,20090312433251036,1
2,20090312455206810_q1,1,20090312455206810,1
3,20090313091715820_q1,0,20090313091715820,1
4,20090313571836404_q1,1,20090313571836404,1
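
The two `apply` calls above can also be written with vectorized string operations. The following is a minimal sketch on a toy frame (`toy_labels` is made up for illustration), assuming every `session_id` follows the `<session>_q<number>` pattern seen in the output above:

```python
import pandas as pd

# Toy labels frame mimicking the layout of train_labels.csv (values are illustrative)
toy_labels = pd.DataFrame({
    "session_id": ["20090312431273200_q1", "20090312431273200_q18"],
    "correct": [1, 0],
})

# Split "<session>_q<number>" into its two parts in a single pass
parts = toy_labels["session_id"].str.extract(r"^(?P<session>\d+)_q(?P<q>\d+)$")
toy_labels["session"] = parts["session"].astype("int64")
toy_labels["q"] = parts["q"].astype("int64")
```

On the full label file this avoids a Python-level lambda call per row; the resulting columns should match those produced by the original cell.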


There are 3 level groups. The questions of each group are asked after a level checkpoint (4, 12, 22). The main idea is to split the data needed for training into 3 groups, one per checkpoint:
- `level_group` == '0-4'   $\Rightarrow$ `question` 1->3
- `level_group` == '5-12'  $\Rightarrow$ `question` 4->13
- `level_group` == '13-22' $\Rightarrow$ `question` 14->18
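
The mapping above can be sketched as a small helper. The name `question_to_level_group` is hypothetical and not part of the original notebook; it only encodes the three ranges listed above:

```python
def question_to_level_group(q_no: int) -> str:
    """Map a question number (1..18) to the level group whose checkpoint precedes it."""
    if not 1 <= q_no <= 18:
        raise ValueError(f"question number out of range: {q_no}")
    if q_no <= 3:
        return "0-4"
    if q_no <= 13:
        return "5-12"
    return "13-22"
```

Keeping the ranges in one function means the training loop and the out-of-fold loop cannot drift apart if the boundaries ever change.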

Because all previous data in a session is used to predict correctness, I'll split the training data into 3 dataframes:
- `dataset_df_1` within `level` 0->4
- `dataset_df_2` within `level` 0->12
- `dataset_df_3` within `level` 0->22

In [7]:
dataset_df_1 = dataset_df[dataset_df.level_group == '0-4']
dataset_df_2 = dataset_df[dataset_df.level_group != '13-22']
dataset_df_3 = dataset_df
print(f"dataset_df_1: {dataset_df_1.shape}")
print(f"dataset_df_2: {dataset_df_2.shape}")
print(f"dataset_df_3: {dataset_df_3.shape}")

dataset_df_1: (3981005, 20)
dataset_df_2: (12825243, 20)
dataset_df_3: (26296946, 20)


In [8]:
del dataset_df
gc.collect()

42

In [9]:
dataset_df_1.columns

Index(['session_id', 'index', 'elapsed_time', 'event_name', 'name', 'level',
       'page', 'room_coor_x', 'room_coor_y', 'screen_coor_x', 'screen_coor_y',
       'hover_duration', 'text', 'fqid', 'room_fqid', 'text_fqid',
       'fullscreen', 'hq', 'music', 'level_group'],
      dtype='object')

## Load and Prepare Feature Selection Data

In [10]:
feats=pd.read_csv("/kaggle/input/featur/feature_sort.csv")
feats.head()

Unnamed: 0,nabor,tip,quest,kol_col,col1,val1,col2,val2,col3,val3,...,kach4,kach5,kach6,kach7,kach8,kach9,kach10,kach11,kach12,kach13
0,5,,18,1,text_fqid,tunic.flaghouse.entry.flag_girl.symbol_recap,0,0,0,0,...,0.551005,0.54813,0.55003,0.545479,0.543337,0.544326,0.558332,0.561411,0.563527,0.566573
1,1,,18,1,name,undefined,0,0,0,0,...,0.546731,0.550173,0.55003,0.543752,0.53985,0.546219,0.55664,0.560639,0.562642,0.566082
2,4,,18,1,room_fqid,tunic.drycleaner.frontdesk,0,0,0,0,...,0.546318,0.548125,0.547923,0.544786,0.537905,0.542566,0.559101,0.564395,0.563211,0.566013
3,1,,18,1,name,close,0,0,0,0,...,0.543156,0.547942,0.549349,0.542758,0.543017,0.547333,0.558748,0.559335,0.564256,0.566134
4,8,,18,1,text,Make sure to get some old photos for the exhib...,0,0,0,0,...,0.551282,0.54719,0.543475,0.54295,0.539692,0.543701,0.561205,0.561298,0.570363,0.565351


This dataset, sourced from [Vadim Kamaev's notebook](https://www.kaggle.com/code/vadimkamaev/catboost-new/notebook), is generated to support analysis and model training by capturing detailed relationships between features for this particular competition. For a deeper understanding of how this dataset was constructed, refer to [this link](https://www.kaggle.com/code/vadimkamaev/feature-sort).

I'll explain some *key features* used in this model building:
- `quest`: questions (from 1 to 18).  
- (`col1`, `val1`), (`col2`, `val2`), (`col3`, `val3`): pairs of column names and their corresponding unique values (`col1` and `col2` are categorical columns; `col3` is a numerical column).  
- `kol_col`: the number of columns involved in generating the feature (1 or 2).  
- `kach`: the quality score for each feature, likely based on F1-score, calculated during training or validation.

I'll select features with `kach > 10` and split them into 3 subsets, one for each of the 3 training datasets above.

In [11]:

# Select features with kach > 10
feats_sel=feats[feats['kach']>10]
feats_sel_1=feats_sel[feats_sel['quest']<=3]
# Note: level group '5-12' covers questions 4-13, but the cutoff used here is 12
feats_sel_2=feats_sel[feats_sel['quest']<=12]
feats_sel_3=feats_sel

# Get unique feature-value pairs
cols=['kol_col','col1','val1','col2','val2']
feats_sel_1=feats_sel_1[cols].drop_duplicates().reset_index(drop=True)
feats_sel_2=feats_sel_2[cols].drop_duplicates().reset_index(drop=True)
feats_sel_3=feats_sel_3[cols].drop_duplicates().reset_index(drop=True)

print(len(feats_sel_1))
print(len(feats_sel_2))
print(len(feats_sel_3))
feats_sel_1.head()

31
113
172


Unnamed: 0,kol_col,col1,val1,col2,val2
0,1,name,basic,0,0
1,1,room_fqid,tunic.capitol_0.hall,0,0
2,1,room_fqid,tunic.historicalsociety.collection,0,0
3,1,room_fqid,tunic.historicalsociety.stacks,0,0
4,2,room_fqid,tunic.historicalsociety.entry,fqid,tostacks


# 3. Feature Engineering

Generate new features to prepare for model building

In [12]:
CATEGORICAL = ['event_name', 'name','fqid', 'room_fqid', 'text_fqid']
NUMERICAL = ['level','page','room_coor_x', 'room_coor_y', 'screen_coor_x', 'screen_coor_y','delt_time_next']

In [13]:
def feature_engineer(train, feats_sel):
    """
    This function performs feature engineering on a given training dataset by creating new features 
    based on session aggregations and conditional groupings.

    Parameters:
    train (DataFrame): The input dataset containing session data.
    feats_sel (DataFrame): A configuration DataFrame specifying conditions for additional features.

    Returns:
    DataFrame: A new DataFrame with engineered features.
    """

    # Sort by session and elapsed time to ensure chronological order
    train.sort_values(by=['session_id', 'elapsed_time'], inplace=True)

    # Calculate time differences between consecutive rows
    train['d_time'] = train['elapsed_time'].diff(1)
    train['d_time'].fillna(0, inplace=True)  # Replace NaN values in the first row
    train['delt_time'] = train['d_time'].clip(0, 103000)  # Clip large time differences
    train['delt_time_next'] = train['delt_time'].shift(-1)  # Shift for the next time delta

    # Create a new DataFrame to store session features
    new_train = pd.DataFrame(index=train['session_id'].unique())  
    new_train['session_id'] = new_train.index  

    # # Re-sort to ensure consistent ordering (this appears redundant but may be intentional for safety)
    # train.sort_values(by=['session_id', 'elapsed_time'], inplace=True)

    # # Recalculate time differences
    # train['d_time'] = train['elapsed_time'].diff(1)
    # train['d_time'].fillna(0, inplace=True)
    # train['delt_time'] = train['d_time'].clip(0, 103000)
    # train['delt_time_next'] = train['delt_time'].shift(-1)

    # Base feature generation (categorical and numerical aggregates)
    base = True  # Toggle for base feature generation
    if base:
        for c in CATEGORICAL:
            # Count unique values per session for categorical columns
            new_train[f'{c}_nunique'] = train.groupby(['session_id'])[c].agg('nunique')
        for c in NUMERICAL:
            # Compute mean and sum per session for numerical columns
            new_train[f'{c}_mean'] = train.groupby(['session_id'])[c].agg('mean')
            new_train[f'{c}_sum'] = train.groupby(['session_id'])[c].agg('sum')

    # Session counts and time quantiles
    new_train['session_index_count'] = train.groupby(['session_id'])['index'].count()
    new_train['d_time_q3'] = train.groupby(['session_id'])['d_time'].quantile(q=0.3)
    new_train['d_time_q8'] = train.groupby(['session_id'])['d_time'].quantile(q=0.8)
    new_train['d_time_q5'] = train.groupby(['session_id'])['d_time'].quantile(q=0.5)
    new_train['d_time_q65'] = train.groupby(['session_id'])['d_time'].quantile(q=0.65)

    # Hover duration statistics
    new_train['hover_duration_mean'] = train.groupby(['session_id'])['hover_duration'].agg('mean')
    new_train['hover_duration_std'] = train.groupby(['session_id'])['hover_duration'].agg('std') 

    # Delta time statistics
    new_train['delt_time_mean'] = train.groupby(['session_id'])['delt_time'].agg('mean')
    new_train['delt_time_std'] = train.groupby(['session_id'])['delt_time'].agg('std') 
    new_train['delt_time_max'] = train.groupby(['session_id'])['delt_time'].agg('max')
    new_train['delt_time_min'] = train.groupby(['session_id'])['delt_time'].agg('min') 

    # Extracting date and time components from session_id
    new_train['year'] = new_train['session_id'].apply(lambda x: int(str(x)[:2])).astype(np.uint8)  # Extract year
    new_train['month'] = new_train['session_id'].apply(lambda x: int(str(x)[2:4]) + 1).astype(np.uint8)  # Extract month
    new_train['day'] = new_train['session_id'].apply(lambda x: int(str(x)[4:6])).astype(np.uint8)  # Extract day
    new_train['sess_time'] = new_train['session_id'].apply(lambda x: int(str(x)[6:8])).astype(np.uint8) + \
                             new_train['session_id'].apply(lambda x: int(str(x)[8:10])).astype(np.uint8) / 60  # Hour and minute

    # Fill NaN values with -1 for consistency
    new_train = new_train.fillna(-1)

    # Conditional feature generation (based on feats_sel configuration)
    t1 = feats_sel[feats_sel['kol_col'] == 1]  # Single-condition features
    for i in range(len(t1)):
        col1 = t1['col1'].iloc[i]
        val1 = t1['val1'].iloc[i]

        # Create mask based on the condition
        maska1 = (train[col1] == val1)
        
        # Generate features for sessions matching the condition
        new_train[f'{col1}_{hash(val1)}_delt_time_next_sum'] = train[maska1].groupby(['session_id'])['delt_time_next'].sum()
        new_train[f'{col1}_{hash(val1)}_delt_time_mean'] = train[maska1].groupby(['session_id'])['delt_time'].mean()
        new_train[f'{col1}_{hash(val1)}_index_count'] = train[maska1].groupby(['session_id'])['index'].count()

    t2 = feats_sel[feats_sel['kol_col'] == 2]  # Two-condition features
    for i in range(len(t2)):
        col1 = t2['col1'].iloc[i]
        val1 = t2['val1'].iloc[i]
        col2 = t2['col2'].iloc[i]
        val2 = t2['val2'].iloc[i]

        # Create mask based on the combined conditions
        maska2 = (train[col1] == val1) & (train[col2] == val2)
        
        # Generate features for sessions matching the conditions
        new_train[f'{col1}_{hash(val1)}_{col2}_{hash(val2)}_delt_time_next_sum'] = train[maska2].groupby(['session_id'])['delt_time_next'].sum()
        new_train[f'{col1}_{hash(val1)}_{col2}_{hash(val2)}_delt_time_mean'] = train[maska2].groupby(['session_id'])['delt_time'].mean()
        new_train[f'{col1}_{hash(val1)}_{col2}_{hash(val2)}_index_count'] = train[maska2].groupby(['session_id'])['index'].count()

    return new_train
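
Note that `diff(1)` in the function runs over the whole sorted frame, so the first event of every session after the first one is differenced against the last event of the previous session. The clip to `[0, 103000]` bounds the effect on `delt_time`, but `d_time` keeps the raw value. A per-session variant is sketched below on toy data; it is an alternative for comparison, not the computation that produced the outputs in this notebook:

```python
import pandas as pd

# Toy data: two sessions with made-up elapsed times
toy = pd.DataFrame({
    "session_id":   [1, 1, 1, 2, 2],
    "elapsed_time": [0, 500, 900, 0, 250],
})
toy = toy.sort_values(["session_id", "elapsed_time"])

# Global diff: the first row of session 2 is compared with the last row of session 1
toy["d_time_global"] = toy["elapsed_time"].diff().fillna(0)

# Per-session diff: each session starts again at 0
toy["d_time_session"] = toy.groupby("session_id")["elapsed_time"].diff().fillna(0)
```

In the toy frame the global version yields `-900` at the session boundary, while the grouped version yields `0`.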


Apply `feature_engineer()` to each dataset

In [14]:
dataset_df_1 = feature_engineer(dataset_df_1,feats_sel_1)
print("Full prepared dataset_df_1 shape: {}".format(dataset_df_1.shape))
dataset_df_1.head()

Full prepared dataset_df_1 shape: (23562, 128)


Unnamed: 0,session_id,event_name_nunique,name_nunique,fqid_nunique,room_fqid_nunique,text_fqid_nunique,level_mean,level_sum,page_mean,page_sum,...,room_fqid_4040656509072539416_fqid_2491119185084437101_index_count,room_fqid_-3182185621819205830_fqid_-2831142335855386910_delt_time_next_sum,room_fqid_-3182185621819205830_fqid_-2831142335855386910_delt_time_mean,room_fqid_-3182185621819205830_fqid_-2831142335855386910_index_count,room_fqid_8677181303633425806_fqid_-9150430895823747627_delt_time_next_sum,room_fqid_8677181303633425806_fqid_-9150430895823747627_delt_time_mean,room_fqid_8677181303633425806_fqid_-9150430895823747627_index_count,room_fqid_-3182185621819205830_fqid_-6040392737047360567_delt_time_next_sum,room_fqid_-3182185621819205830_fqid_-6040392737047360567_delt_time_mean,room_fqid_-3182185621819205830_fqid_-6040392737047360567_index_count
20090312431273200,20090312431273200,10,3,30,7,17,1.945455,321.0,-1.0,0.0,...,2.0,0.0,30837.0,1,4085.0,710.333333,3.0,30837.0,1585.0,1
20090312433251036,20090312433251036,11,4,22,6,11,1.870504,260.0,0.0,0.0,...,3.0,0.0,37409.0,1,7119.0,912.9,10.0,37409.0,3288.0,1
20090312455206810,20090312455206810,9,3,22,6,12,1.604027,239.0,-1.0,0.0,...,1.0,0.0,28744.0,1,4382.0,816.0,2.0,28744.0,6004.0,1
20090313091715820,20090313091715820,11,4,24,6,14,1.789773,315.0,0.0,0.0,...,2.0,0.0,47849.0,1,15560.0,1519.714286,7.0,50213.0,2174.0,3
20090313571836404,20090313571836404,10,4,22,6,12,1.767857,198.0,0.0,0.0,...,2.0,0.0,31920.0,1,4801.0,607.5,2.0,31920.0,1683.0,1
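
The column names above embed the result of Python's built-in `hash()`. For strings this value is salted per interpreter process unless `PYTHONHASHSEED` is fixed, so the same feature can receive a different column name in another run, for example in a separate inference notebook. One way to get reproducible names is sketched below with a hypothetical helper `stable_token`, which is not part of the original code:

```python
import hashlib

def stable_token(value, length: int = 12) -> str:
    """Return a short hex digest of str(value) that is identical across runs."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return digest[:length]

# Hypothetical replacement for f'{col1}_{hash(val1)}_index_count'
col1, val1 = "room_fqid", "tunic.capitol_0.hall"
column_name = f"{col1}_{stable_token(val1)}_index_count"
```

Truncating the digest keeps the names short; 12 hex characters is an arbitrary choice that makes collisions unlikely for a few hundred feature values.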


In [15]:
gc.collect()

21

In [16]:
dataset_df_2 = feature_engineer(dataset_df_2,feats_sel_2)
print("Full prepared dataset_df_2 shape is {}".format(dataset_df_2.shape))
dataset_df_2.head()

Full prepared dataset_df_2 shape is (23562, 374)


Unnamed: 0,session_id,event_name_nunique,name_nunique,fqid_nunique,room_fqid_nunique,text_fqid_nunique,level_mean,level_sum,page_mean,page_sum,...,room_fqid_4040656509072539416_fqid_2491119185084437101_index_count,room_fqid_-3182185621819205830_fqid_-2831142335855386910_delt_time_next_sum,room_fqid_-3182185621819205830_fqid_-2831142335855386910_delt_time_mean,room_fqid_-3182185621819205830_fqid_-2831142335855386910_index_count,room_fqid_8677181303633425806_fqid_-9150430895823747627_delt_time_next_sum,room_fqid_8677181303633425806_fqid_-9150430895823747627_delt_time_mean,room_fqid_8677181303633425806_fqid_-9150430895823747627_index_count,room_fqid_-3182185621819205830_fqid_-6040392737047360567_delt_time_next_sum,room_fqid_-3182185621819205830_fqid_-6040392737047360567_delt_time_mean,room_fqid_-3182185621819205830_fqid_-6040392737047360567_index_count
20090312431273200,20090312431273200,10,3,62,14,41,5.867679,2705.0,-1.0,0.0,...,2.0,26625.0,30837.0,1,4085.0,710.333333,3.0,30837.0,1585.0,1
20090312433251036,20090312433251036,11,4,61,14,33,7.026217,3752.0,1.666667,15.0,...,3.0,33131.0,37409.0,1,7119.0,912.9,10.0,37409.0,3288.0,1
20090312455206810,20090312455206810,11,4,57,14,31,5.794416,2283.0,2.333333,14.0,...,1.0,17190.0,28744.0,1,4382.0,816.0,2.0,28744.0,6004.0,1
20090313091715820,20090313091715820,11,4,62,14,38,5.928287,2976.0,0.714286,5.0,...,2.0,40065.0,47849.0,1,15560.0,1519.714286,7.0,50213.0,2174.0,3
20090313571836404,20090313571836404,11,5,58,14,34,6.09887,2159.0,1.636364,18.0,...,2.0,24928.0,31920.0,1,4801.0,607.5,2.0,31920.0,1683.0,1


In [17]:
gc.collect()

21

In [18]:
dataset_df_3 = feature_engineer(dataset_df_3,feats_sel_3)
print("Full prepared dataset_df_3 shape is {}".format(dataset_df_3.shape))
dataset_df_3.head()

Full prepared dataset_df_3 shape is (23562, 551)


Unnamed: 0,session_id,event_name_nunique,name_nunique,fqid_nunique,room_fqid_nunique,text_fqid_nunique,level_mean,level_sum,page_mean,page_sum,...,room_fqid_4040656509072539416_fqid_2491119185084437101_index_count,room_fqid_-3182185621819205830_fqid_-2831142335855386910_delt_time_next_sum,room_fqid_-3182185621819205830_fqid_-2831142335855386910_delt_time_mean,room_fqid_-3182185621819205830_fqid_-2831142335855386910_index_count,room_fqid_8677181303633425806_fqid_-9150430895823747627_delt_time_next_sum,room_fqid_8677181303633425806_fqid_-9150430895823747627_delt_time_mean,room_fqid_8677181303633425806_fqid_-9150430895823747627_index_count,room_fqid_-3182185621819205830_fqid_-6040392737047360567_delt_time_next_sum,room_fqid_-3182185621819205830_fqid_-6040392737047360567_delt_time_mean,room_fqid_-3182185621819205830_fqid_-6040392737047360567_index_count
20090312431273200,20090312431273200,10,3,95,19,76,11.366629,10014.0,-1.0,0.0,...,2.0,26625.0,30837.0,1,4085.0,710.333333,3.0,30837.0,1585.0,1
20090312433251036,20090312433251036,11,6,101,19,75,14.631349,26790.0,4.576271,270.0,...,3.0,33131.0,37409.0,1,11520.0,870.214286,14.0,37409.0,3288.0,1
20090312455206810,20090312455206810,11,4,90,19,61,11.514512,8728.0,4.6875,150.0,...,1.0,17190.0,28744.0,1,4382.0,816.0,2.0,28744.0,6004.0,1
20090313091715820,20090313091715820,11,4,98,19,74,11.573011,11492.0,1.444444,13.0,...,4.0,40065.0,47849.0,1,15560.0,1519.714286,7.0,50213.0,2174.0,3
20090313571836404,20090313571836404,11,5,94,19,68,12.057641,8995.0,1.636364,18.0,...,2.0,24928.0,31920.0,1,4801.0,607.5,2.0,31920.0,1683.0,1


In [19]:
gc.collect()

21

Drop columns that have a high missing ratio or contain only one unique value.

In [20]:
from tqdm import tqdm

# Calculate the percentage of missing values for each column in the datasets
null1 = dataset_df_1.isnull().sum().sort_values(ascending=False) / len(dataset_df_1)
null2 = dataset_df_2.isnull().sum().sort_values(ascending=False) / len(dataset_df_2)
null3 = dataset_df_3.isnull().sum().sort_values(ascending=False) / len(dataset_df_3)

# Identify columns with more than 90% missing values to drop
drop1 = list(null1[null1 > 0.9].index)
drop2 = list(null2[null2 > 0.9].index)
drop3 = list(null3[null3 > 0.9].index)

# # Print the number of columns to be dropped due to high missing values
# print(len(drop1), len(drop2), len(drop3))

# Loop through each column in the first dataset to identify columns with a single unique value
for col in tqdm(dataset_df_1.columns):
    if dataset_df_1[col].nunique() == 1:
        # Add columns with only one unique value to the drop list
        drop1.append(col)

# Repeat the process for the second dataset
for col in tqdm(dataset_df_2.columns):
    if dataset_df_2[col].nunique() == 1:
        drop2.append(col)

# Repeat the process for the third dataset
for col in tqdm(dataset_df_3.columns):
    if dataset_df_3[col].nunique() == 1:
        drop3.append(col)

# Create a list of features for training, excluding the dropped columns and 'level_group'
FEATURES1 = [c for c in dataset_df_1.columns if c not in drop1 + ['level_group']]
FEATURES2 = [c for c in dataset_df_2.columns if c not in drop2 + ['level_group']]
FEATURES3 = [c for c in dataset_df_3.columns if c not in drop3 + ['level_group']]

# Print the number of features that will be used for training for each dataset
# print('We will train with', len(FEATURES1), len(FEATURES2), len(FEATURES3), 'features')
print(f"We'll train 'dataset_df_1' with {len(FEATURES1)} features")
print(f"We'll train 'dataset_df_2' with {len(FEATURES2)} features")
print(f"We'll train 'dataset_df_3' with {len(FEATURES3)} features")

100%|██████████| 128/128 [00:00<00:00, 1418.16it/s]
100%|██████████| 374/374 [00:00<00:00, 1639.38it/s]
100%|██████████| 551/551 [00:00<00:00, 1679.88it/s]

We'll train 'dataset_df_1' with 128 features
We'll train 'dataset_df_2' with 353 features
We'll train 'dataset_df_3' with 523 features
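
The three near-identical passes above could be folded into one helper. This is a sketch only; `select_features` and its parameters are hypothetical names, and the toy frame is made up to exercise each rule:

```python
import numpy as np
import pandas as pd

def select_features(df: pd.DataFrame, max_null_ratio: float = 0.9,
                    exclude=("level_group",)) -> list:
    """Keep columns that are not mostly missing, not constant, and not excluded."""
    null_ratio = df.isnull().mean()   # fraction of NaN per column
    n_unique = df.nunique()           # distinct non-NaN values per column
    keep = (null_ratio <= max_null_ratio) & (n_unique != 1)
    return [c for c in df.columns if keep[c] and c not in exclude]

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],        # kept
    "constant": [7, 7, 7, 7],         # dropped: single unique value
    "all_nan": [np.nan] * 4,          # dropped: 100% missing
    "level_group": ["0-4"] * 4,       # dropped: excluded by name
})
features = select_features(toy)
```

With such a helper the three feature lists would be obtained by calling it once per dataset, using the same 0.9 threshold as in the cell above.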





# 4. Model Building & Training

Train one XGB model for each question with 5-fold cross-validation

### XGBoost Introduction
**XGBoost** is a highly efficient and scalable machine learning algorithm based on gradient boosting. It was developed by Tianqi Chen and has become one of the most popular algorithms for structured/tabular data. XGBoost is known for its performance and speed, making it a go-to tool for data science competitions (like Kaggle).

### How It Works
**XGBoost** builds an ensemble of decision trees sequentially, where each new tree corrects the errors of the trees before it. It follows the gradient boosting framework: at every round, the gradient (and, in XGBoost, also the second derivative) of the loss function with respect to the current predictions is computed, and a new tree is fitted to those values, which for squared error are simply the residuals. Unlike AdaBoost, it does not reweight misclassified samples. The new tree's output is scaled by the learning rate and added to the ensemble, and a regularization term on tree complexity helps avoid overfitting.
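
The sequential idea can be sketched in a few lines of NumPy. This is plain first-order gradient boosting with one-split trees (stumps) on made-up 1-D data, not XGBoost's actual algorithm, which additionally uses second-order information, regularization, and histogram-based split finding:

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold split on x that best fits `residual` in the least-squares sense."""
    best = None
    for thr in np.unique(x)[:-1]:                  # candidate split points
        left, right = residual[x <= thr], residual[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, left_value, right_value = best
    return thr, left_value, right_value

def predict_stump(stump, x):
    thr, left_value, right_value = stump
    return np.where(x <= thr, left_value, right_value)

def log_loss(y, raw_score):
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Toy data: the label tends to be 1 for larger x (made-up values)
x = np.array([0.1, 0.4, 0.5, 0.9, 1.3, 1.7, 2.0, 2.2])
y = np.array([0, 0, 1, 0, 1, 1, 1, 1])

learning_rate = 0.3
raw_score = np.zeros_like(x)                       # log-odds 0, i.e. p = 0.5 everywhere
losses = [log_loss(y, raw_score)]
stumps = []

for _ in range(20):
    p = 1.0 / (1.0 + np.exp(-raw_score))
    residual = y - p                               # negative gradient of log loss w.r.t. raw_score
    stump = fit_stump(x, residual)
    raw_score = raw_score + learning_rate * predict_stump(stump, x)
    stumps.append(stump)
    losses.append(log_loss(y, raw_score))
```

Each round fits a stump to the current negative gradient and adds a damped copy of it to the running score, so the values in `losses` shrink from round to round.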

### Pros
- **High Performance:** XGBoost is known for its speed and accuracy, often winning Kaggle competitions.
- **Handles Missing Data:** It can handle missing values in the dataset naturally.
- **Parallelization:** Supports parallel and distributed computing, which speeds up training on multi-core machines and clusters.
- **Regularization:** Built-in regularization (L1 and L2) helps prevent overfitting.
- **Flexibility:** Works for both regression and classification problems.
- **Scalability:** Can handle large datasets efficiently due to optimizations.

### Cons
- **Complexity:** Tuning the model can be complex with many hyperparameters.
- **Interpretability:** Like most ensemble methods, XGBoost models are less interpretable compared to simpler models (e.g., linear regression).
- **Memory Consumption:** For very large datasets, it can consume a significant amount of memory.
- **Overfitting Risk:** While regularization helps, overfitting can still occur if hyperparameters are not properly tuned.


Set XGBoost parameters

In [21]:
xgb_params = {
    'objective': 'binary:logistic',  # Specifies binary classification (logistic regression)
    'eval_metric': 'logloss',        # Evaluation metric used during training (log loss for binary classification)
    'learning_rate': 0.05,           # Step size during training (smaller values require more boosting rounds)
    'max_depth': 4,                  # Maximum depth of trees (helps control model complexity)
    'n_estimators': 6000,            # Number of boosting rounds (trees) to build
    'early_stopping_rounds': 50,     # Stop training early if validation score doesn't improve for 50 rounds
    'tree_method': 'hist',           # Tree-building method (histogram-based method for faster training)
    'subsample': 0.8,                # Fraction of samples to use for each tree (helps prevent overfitting)
    'colsample_bytree': 0.4,         # Fraction of features to use for each tree (helps with regularization)
    'use_label_encoder': False       # Disables label encoding for the target variable (to avoid warnings)
}


In [22]:
# from catboost import CatBoostClassifier, Pool
# cat_params = {
#         'iterations': 1000,
#         'early_stopping_rounds': 90,
#         'depth': 5,
#         'learning_rate': 0.02,
#         'loss_function': "Logloss",
#         'random_seed': 222222,
#         'metric_period': 1,
#         'subsample': 0.8,
#         'colsample_bylevel': 0.4,
#         'verbose': 0,
#         'l2_leaf_reg': 20,
#     }

Train XGB models with 5-fold cross-validation for 18 questions

In [23]:
models_xgb = {}
valids_idx = {}

# Iterate through questions 1 to 18 to train models for each question
for q_no in range(1,19):
    # Select level group for the question based on the q_no.
    if q_no<=3: grp = '0-4'
    elif q_no<=13: grp = '5-12'
    else: grp = '13-22'

    # Select the appropriate dataset and features based on the group (grp)
    if grp == '0-4':
        df = dataset_df_1
        FEATURES = FEATURES1
    if grp == '5-12':
        df = dataset_df_2
        FEATURES = FEATURES2
    if grp == '13-22':
        df = dataset_df_3
        FEATURES = FEATURES3
    print("### q_no", q_no, "grp", grp, "feats : ",len(FEATURES))

    # Generate indices for the training and validation sets for each fold
    split = list(GroupKFold(5).split(df.index.unique(), groups = df.index.unique()))
    
    # Perform 5-fold cross-validation using GroupKFold
    for fold, (train_idx, valid_idx) in enumerate(split):
        
        # Filter the rows in the datasets based on train and valid indices
        train_df = df.iloc[train_idx]
        train_users = train_df.index.values
        valid_df = df.iloc[valid_idx]
        valid_users = valid_df.index.values

        # Store the validation indices for later use
        valids_idx[f'{grp}_{q_no}_{fold}'] = valid_idx


        # Select the labels for the related q_no and session.
        train_labels = labels.loc[labels.q==q_no].set_index('session').loc[train_users]
        valid_labels = labels.loc[labels.q==q_no].set_index('session').loc[valid_users]

        # Prepare the features (X) and target (y) for training and validation
        X_train = train_df.loc[:, train_df.columns != 'level_group']
        y_train = train_labels["correct"]
        X_val = valid_df.loc[:, valid_df.columns != 'level_group']
        y_val = valid_labels["correct"]
        
        # Train model
        xgbm = XGBClassifier(**xgb_params)
        #catm = CatBoostClassifier(**cat_params)

        xgbm.fit(X_train[FEATURES].astype('float32'), y_train,
                    eval_set=[ (X_val[FEATURES].astype('float32'), y_val) ],verbose=0)
        # catm.fit(X_train[FEATURES].astype('float32'), y_train,
        #          eval_set=[ (X_val[FEATURES].astype('float32'), y_val) ],verbose=0)

        # Store the model
        models_xgb[f'{grp}_{q_no}_{fold}'] = xgbm
        print("Done for ",grp,q_no,fold)

### q_no 1 grp 0-4 feats :  128
Done for  0-4 1 0
Done for  0-4 1 1
Done for  0-4 1 2
Done for  0-4 1 3
Done for  0-4 1 4
### q_no 2 grp 0-4 feats :  128
Done for  0-4 2 0
Done for  0-4 2 1
Done for  0-4 2 2
Done for  0-4 2 3
Done for  0-4 2 4
### q_no 3 grp 0-4 feats :  128
Done for  0-4 3 0
Done for  0-4 3 1
Done for  0-4 3 2
Done for  0-4 3 3
Done for  0-4 3 4
### q_no 4 grp 5-12 feats :  353
Done for  5-12 4 0
Done for  5-12 4 1
Done for  5-12 4 2
Done for  5-12 4 3
Done for  5-12 4 4
### q_no 5 grp 5-12 feats :  353
Done for  5-12 5 0
Done for  5-12 5 1
Done for  5-12 5 2
Done for  5-12 5 3
Done for  5-12 5 4
### q_no 6 grp 5-12 feats :  353
Done for  5-12 6 0
Done for  5-12 6 1
Done for  5-12 6 2
Done for  5-12 6 3
Done for  5-12 6 4
### q_no 7 grp 5-12 feats :  353
Done for  5-12 7 0
Done for  5-12 7 1
Done for  5-12 7 2
Done for  5-12 7 3
Done for  5-12 7 4
### q_no 8 grp 5-12 feats :  353
Done for  5-12 8 0
Done for  5-12 8 1
Done for  5-12 8 2
Done for  5-12 8 3
Done for  5-1

## Out-of-Fold Prediction

In [24]:
ALL_USERS = dataset_df_1.index.values # get all users/sessions

# Initialize an empty DataFrame to store out-of-fold (OOF) predictions for all sessions and questions
oof = pd.DataFrame(data=np.zeros((len(ALL_USERS), 18)), index=ALL_USERS)

# Loop through each question number (1 to 18)
for q_no in range(1, 19):
    # Select level group for the question based on the q_no.
    if q_no <= 3: 
        grp = '0-4' 
    elif q_no <= 13: 
        grp = '5-12'
    else: 
        grp = '13-22'
    print("### q_no", q_no, "grp", grp)

    # Select the appropriate dataset and features based on the group (grp)
    if grp == '0-4':
        df = dataset_df_1 
        FEATURES = FEATURES1 
    if grp == '5-12':
        df = dataset_df_2  
        FEATURES = FEATURES2
    if grp == '13-22':
        df = dataset_df_3
        FEATURES = FEATURES3

    y_preds = [] # store validation predictions

    # Loop through 5-fold cross-validation
    for fold in range(5):
        
        # Extract the validation dataframe using stored validation indices
        valid_idx = valids_idx[f'{grp}_{q_no}_{fold}']
        valid_df = df.iloc[valid_idx]
        valid_users = valid_df.index.values
        
        # # Get the true labels for the validation set
        # y_val = labels.loc[labels.q == q_no].set_index('session').loc[valid_users]
        
        # Load the pre-trained XGBoost model
        xgbm = models_xgb[f'{grp}_{q_no}_{fold}']

        # Predict probabilities for the validation set
        y_pred_val_xgb = xgbm.predict_proba(valid_df[FEATURES])
        y_pred_val = y_pred_val_xgb[:, 1]
        
        y_preds.append(y_pred_val)
        
        # Store the predictions in the OOF DataFrame for the valid sessions and current question number
        oof.loc[valid_users, q_no - 1] = y_pred_val

        # # Optionally, you could compute the average of the predictions across all folds
        # y_pred_val1 = np.mean(y_preds, axis=0)
        # oof.loc[valid_users, t - 1] = y_pred_val1
        
        # If using a threshold, you could classify the predictions as 0 or 1 based on the threshold
        # y_pred_val1 = (y_pred_val1 > best_threshold).astype(int).flatten()


### q_no 1 grp 0-4
### q_no 2 grp 0-4
### q_no 3 grp 0-4
### q_no 4 grp 5-12
### q_no 5 grp 5-12
### q_no 6 grp 5-12
### q_no 7 grp 5-12
### q_no 8 grp 5-12
### q_no 9 grp 5-12
### q_no 10 grp 5-12
### q_no 11 grp 5-12
### q_no 12 grp 5-12
### q_no 13 grp 5-12
### q_no 14 grp 13-22
### q_no 15 grp 13-22
### q_no 16 grp 13-22
### q_no 17 grp 13-22
### q_no 18 grp 13-22


In [25]:
oof

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
20090312431273200,0.881475,0.992872,0.967973,0.891234,0.667184,0.898649,0.895482,0.680301,0.851501,0.631158,0.779715,0.912305,0.321548,0.828352,0.678517,0.806626,0.814679,0.987609
20090312433251036,0.815894,0.989989,0.957041,0.487681,0.168589,0.415500,0.487956,0.543411,0.444420,0.228523,0.481028,0.762087,0.120662,0.270547,0.059165,0.656044,0.568798,0.790520
20090312455206810,0.804497,0.981032,0.969205,0.558306,0.709395,0.794033,0.755876,0.696865,0.792604,0.658639,0.706066,0.890314,0.460072,0.519252,0.313371,0.757419,0.789142,0.912568
20090313091715820,0.536271,0.971299,0.926895,0.827294,0.533711,0.731084,0.791153,0.555691,0.696316,0.492798,0.675896,0.893706,0.129585,0.731431,0.499479,0.682918,0.679743,0.978622
20090313571836404,0.942762,0.998094,0.990665,0.979001,0.852007,0.952956,0.937087,0.842838,0.927818,0.777648,0.799010,0.942125,0.509552,0.882008,0.765273,0.769164,0.762779,0.994107
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22100215342220508,0.826889,0.996679,0.959871,0.968835,0.725543,0.853032,0.825838,0.643127,0.823795,0.526427,0.575858,0.922747,0.202798,0.791189,0.620483,0.745951,0.597578,0.986432
22100215460321130,0.462524,0.976916,0.853742,0.885819,0.588612,0.816780,0.723744,0.594588,0.820762,0.542596,0.711282,0.905633,0.171986,0.770272,0.598185,0.745547,0.703264,0.982535
22100217104993650,0.754615,0.985878,0.974652,0.872002,0.693690,0.890691,0.879477,0.668096,0.846325,0.604576,0.707722,0.911637,0.218484,0.911347,0.725775,0.755988,0.641676,0.986674
22100219442786200,0.650519,0.988792,0.932136,0.816956,0.477832,0.835379,0.715597,0.648719,0.770222,0.579957,0.685784,0.904950,0.321814,0.789533,0.512183,0.754351,0.795987,0.983466


## Threshold Optimization
Find the probability threshold that maximizes the macro F1 score over the out-of-fold predictions.

In [26]:
true = oof.copy()

# Populate the `true` DataFrame with the actual labels (ground truth) for all sessions
for k in range(18):
    # Extract true labels for each question and align them with the sessions
    tmp = labels.loc[labels.q == k+1].set_index('session').loc[ALL_USERS]
    true[k] = tmp.correct.values

##################
# Optimize the threshold for converting probabilities into binary predictions (0 or 1)

scores = []
thresholds = []

best_score = 0
best_threshold = 0

print("List of thresholds: ")
# Iterate over thresholds
for threshold in np.arange(0.4, 0.81, 0.01):
    print(f'{threshold:.02f}, ', end='')

    # Apply the threshold to the OOF probabilities to generate binary predictions
    preds = (oof.values.reshape((-1)) > threshold).astype('int')

    # Calculate the F1 score for the binary predictions using the true labels
    m = f1_score(true.values.reshape((-1)), preds, average='macro')

    scores.append(m)
    thresholds.append(threshold)

    # Update the best score and threshold
    if m > best_score:
        best_score = m
        best_threshold = threshold


List of thresholds: 
0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.70, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.80, 

In [27]:
print("Best threshold ", best_threshold, "\tF1 score ", best_score)

Best threshold  0.6200000000000002 	F1 score  0.696613666809234
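
The trailing digits in `0.6200000000000002` are floating-point rounding from the `np.arange` grid, not a meaningful part of the threshold. As a minimal sketch (not the notebook's actual code), the same sweep can be written over an integer grid so that each threshold is exactly `i / 100`. The helper names `macro_f1` and `sweep_thresholds` and the toy data are assumptions made for illustration; `macro_f1` is a hand-written stand-in for `f1_score(..., average='macro')` so the example runs without any dependencies.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores of class 0 and class 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted members scores 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def sweep_thresholds(probs, labels, lo=40, hi=80):
    """Try thresholds lo/100 .. hi/100; return (best_threshold, best_score)."""
    best_t, best_s = None, -1.0
    for i in range(lo, hi + 1):
        t = i / 100  # integer grid: no accumulated float error
        preds = [int(p > t) for p in probs]
        s = macro_f1(labels, preds)
        if s > best_s:  # strict '>' keeps the first (lowest) best threshold
            best_t, best_s = t, s
    return best_t, best_s


# Toy data, made up for illustration (not taken from the notebook)
toy_probs = [0.95, 0.90, 0.72, 0.66, 0.58, 0.55, 0.45, 0.30]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

best_t, best_s = sweep_thresholds(toy_probs, toy_labels)
print("Best threshold", best_t, "F1 score", best_s)
```

With the real data, `probs` and `labels` would be the flattened `oof` and `true` arrays used in the cell above.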


# 5. Evaluation

In [28]:
print('When using optimal threshold...')

# Evaluate F1 scores for each individual question
for k in range(18):
    # Apply the optimal threshold to the OOF predictions for question `k`
    # Convert probabilities to binary predictions (0 or 1) using the `best_threshold`
    # Compute the macro F1 score between the true labels and binary predictions
    m = f1_score(true[k].values, (oof[k].values > best_threshold).astype('int'), average='macro')
    
    # Note: `k` is zero-based, so Q0 in the output is question 1 and Q17 is question 18
    print(f'Q{k}: F1 =', m)

# Evaluate the overall F1 score across all questions
# Flatten the true labels and predictions into 1D arrays for overall evaluation
m = f1_score(true.values.reshape((-1)), (oof.values.reshape((-1)) > best_threshold).astype('int'), average='macro')

print('==> Overall F1 =', m)

When using optimal threshold...
Q0: F1 = 0.6678643298891116
Q1: F1 = 0.5261196890476701
Q2: F1 = 0.5079421982478146
Q3: F1 = 0.6773461025874492
Q4: F1 = 0.6369255385736874
Q5: F1 = 0.6432572373431327
Q6: F1 = 0.6300663952001252
Q7: F1 = 0.5714563228701321
Q8: F1 = 0.6283704217674289
Q9: F1 = 0.5856005244390425
Q10: F1 = 0.6102647885269563
Q11: F1 = 0.5186331466589013
Q12: F1 = 0.45966236142531197
Q13: F1 = 0.6354450526291678
Q14: F1 = 0.6053883336992303
Q15: F1 = 0.49842194969282805
Q16: F1 = 0.5541053108013009
Q17: F1 = 0.4956494787254747
==> Overall F1 = 0.696613666809234
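
The overall score is higher than every per-question score listed above. One likely contributor is that the pooled evaluation mixes questions with very different rates of correct answers: a model that mostly predicts "correct" on easy questions and "incorrect" on hard ones gets credit for both classes once the questions are pooled, while inside a single question the minority class is still poorly predicted. The sketch below illustrates this effect on made-up labels. It is not a re-analysis of the notebook's data, and `macro_f1` is a hand-written helper (an assumption standing in for `f1_score(..., average='macro')`).

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores of class 0 and class 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical "easy" question: almost everyone is correct, model predicts all 1
easy_true = [1, 1, 1, 1, 1, 1, 1, 0]
easy_pred = [1, 1, 1, 1, 1, 1, 1, 1]

# Hypothetical "hard" question: almost everyone is wrong, model predicts all 0
hard_true = [0, 0, 0, 0, 0, 0, 0, 1]
hard_pred = [0, 0, 0, 0, 0, 0, 0, 0]

f1_easy = macro_f1(easy_true, easy_pred)
f1_hard = macro_f1(hard_true, hard_pred)
f1_pooled = macro_f1(easy_true + hard_true, easy_pred + hard_pred)

print("easy question :", f1_easy)
print("hard question :", f1_hard)
print("pooled        :", f1_pooled)
```

In this toy case the pooled macro F1 is well above both per-question values, so the per-question scores and the overall score should be read as answering different questions rather than compared directly.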


# 6. Submission

Here I'll use the `best_threshold` computed in the Threshold Optimization step.

In [29]:
# Reference
# https://www.kaggle.com/code/philculliton/basic-submission-demo
# https://www.kaggle.com/code/cdeotte/random-forest-baseline-0-664/notebook

dfs = {}
import jo_wilder
env = jo_wilder.make_env()
iter_test = env.iter_test()

limits = {'0-4':(1,4), '5-12':(4,14), '13-22':(14,19)}

for (test, sample_submission) in iter_test:
    grp = test.level_group.values[0]
    session_id = test.session_id.values[0]
    ##test=dataText(test)
    
    # Pick the feature list and the selected columns for this level group
    if grp == '0-4':
        FEATURES = FEATURES1
        feats_sel = feats_sel_1
    elif grp == '5-12':
        FEATURES = FEATURES2
        feats_sel = feats_sel_2
    elif grp == '13-22':
        FEATURES = FEATURES3
        feats_sel = feats_sel_3

    sess_ids = test["session_id"].unique()
    df = pd.DataFrame()

    for sess_id in sess_ids:
        df_sess = test[test['session_id']==sess_id]
        if grp == "0-4":
            dfs[sess_id] = df_sess
        else:
            if sess_id in dfs:
                dfs[sess_id] = pd.concat([dfs[sess_id],df_sess])
            else:
                dfs[sess_id] = df_sess
        # DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
        df = pd.concat([df, dfs[sess_id]])
        #print(len(df))

    gc.collect()
    df = df.sort_values(['session_id','index'])

    test_df = feature_engineer(df,feats_sel)
    
    
    a,b = limits[grp]
    for t in range(a,b):
    
        test_ds = test_df.loc[:, test_df.columns != 'level_group']
        
        preds = []
        for fold in range(5):
            xgbm = models_xgb[f'{grp}_{t}_{fold}']
            predictions_xgb = xgbm.predict_proba(test_ds[FEATURES])
            predictions_xgb=predictions_xgb[:,1]
            preds.append(predictions_xgb)
            
        predictions = np.mean(preds,axis=0)
        
        # Match the exact question suffix so that e.g. 'q1' can never match 'q10'..'q18'
        mask = sample_submission.session_id.str.endswith(f'_q{t}')
        n_predictions = (predictions > best_threshold).astype(int)
        sample_submission.loc[mask,'correct'] = n_predictions.flatten()
    
    env.predict(sample_submission)
    
    if grp == '13-22':
        for sess_id in sess_ids:
            if sess_id in dfs:
                del dfs[sess_id]
                
    del test,test_df,df
    gc.collect()

This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set.
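
The core of the loop above is small: each level group maps to a half-open range of question numbers, and for each question the probabilities from the five fold models are averaged and compared with the threshold. The sketch below isolates just that logic in plain Python. The function names `questions_for` and `ensemble_predict`, and the example probabilities, are hypothetical and only for illustration; the real loop gets its probabilities from `predict_proba` on the trained fold models.

```python
# Same mapping as `limits` in the submission cell: (first question, last question + 1)
LIMITS = {'0-4': (1, 4), '5-12': (4, 14), '13-22': (14, 19)}


def questions_for(grp):
    """Return the question numbers answered after the given level group."""
    first, stop = LIMITS[grp]
    return list(range(first, stop))


def ensemble_predict(fold_probs, threshold):
    """Average the per-fold probabilities and apply the decision threshold."""
    mean_prob = sum(fold_probs) / len(fold_probs)
    return int(mean_prob > threshold)


# Every question 1..18 is covered exactly once across the three groups
all_questions = [q for grp in LIMITS for q in questions_for(grp)]
print(all_questions)

# Made-up fold probabilities for one session and one question
example_folds = [0.70, 0.60, 0.50, 0.65, 0.70]
print(ensemble_predict(example_folds, threshold=0.62))
```

Averaging probabilities before thresholding, rather than taking a majority vote of thresholded fold predictions, matches what the loop does with `np.mean(preds, axis=0)`.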


In [30]:
! head submission.csv

session_id,correct
20090109393214576_q1,1
20090109393214576_q2,1
20090109393214576_q3,1
20090109393214576_q4,1
20090109393214576_q5,0
20090109393214576_q6,1
20090109393214576_q7,1
20090109393214576_q8,1
20090109393214576_q9,1
