# **IML Kaggle Challenge 1**
Instructor : Miss Solat Jabeen Sheikh


ERP: 25156


Name: Shahmeer Khan

Kaggle username: ShahmeerKhan10

## **Base description**
This notebook tests several of the models given to us to achieve the best possible score on the provided dataset. The code for every attempted model is kept in the notebook but commented out to avoid confusion; only the best-performing model, with its hyperparameters, is left uncommented. Because the work was done in Google Colab, the CSV files were uploaded to Google Drive and read from there.




In [4]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Drive is mounted to access the CSV files used by the models.

In [5]:
!pip install lightgbm          # For LGBMClassifier
!pip install imbalanced-learn  # For SMOTE
!pip install scikit-learn      # For metrics and model selection (f1_score, cross_val_score, StratifiedKFold, ExtraTreesClassifier)
!pip install xgboost           # For XGBClassifier
!pip install catboost          # For CatBoostClassifier




Installs the remaining packages needed for model training and testing; some of these were needed for GPU usage.


In [6]:
!pip install "dask[dataframe]"



Dask DataFrame was installed for parallel computing. I attempted to use a GPU instead, but usage limits restricted how much I could rely on it.

In [7]:
# Importing the necessary libraries
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from lightgbm import LGBMClassifier
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import VotingClassifier


In [8]:
# Loading the train and test files
df1 = pd.read_csv("/content/drive/MyDrive/train_set.csv")
df2 = pd.read_csv("/content/drive/MyDrive/test_set.csv")

# Splitting data into features (F) and target variable (T)
F = df1.drop(columns=['Y'])
T = df1['Y']

Because the notebook runs in Colab, the CSV files above had to be read from Drive.

# **Models and parameters used for training and testing**
This section includes all the models, scaling, imputation techniques, feature reduction, etc. that didn't contribute to the final best score.

In [9]:
# Imputation techniques

#Option 1: Mean imputation
#imputer = SimpleImputer(strategy='mean')
#F = imputer.fit_transform(F)
#df2 = imputer.fit_transform(df2)

#Option 2: Median imputation
#imputer = SimpleImputer(strategy='median')
#F = imputer.fit_transform(F)
#df2 = imputer.fit_transform(df2)


Both of these imputation techniques improved the final results and ran much faster than KNN imputation, but they were too generalized: KNN imputation gave noticeably better scores.

In [10]:
#Scaling techniques

#Option 1: MinMax Scaler
#scaler = MinMaxScaler()
#F = scaler.fit_transform(F)
#df2 = scaler.fit_transform(df2)

#Option 2 : Standard Scaler
#scaler = StandardScaler()
#F = scaler.fit_transform(F)
#df2 = scaler.fit_transform(df2)

#Option 3: Robust Scaler
#scaler = RobustScaler()
#F = scaler.fit_transform(F)
#df2 = scaler.fit_transform(df2)

Scaling the data had a negative effect here. One likely reason is my preference for tree-based models such as LightGBM and XGBoost, which split on feature thresholds rather than absolute magnitudes and so gain little from scaling. Refitting the scaler on the test set (fit_transform on df2) may also have mapped train and test values differently. No scaling was used for the final best result.
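As a rough check on how tree-based models respond to scaling, the sketch below trains the same decision tree on raw and on standardized features and compares the test predictions. The data is synthetic (an assumption, not the competition data), and the scaler is fit on the training rows only and reused on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the competition data): two features on very different scales.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2)) * np.array([1.0, 1000.0])
y = (X[:, 0] + X[:, 1] / 1000.0 > 0).astype(int)
X_train, X_test, y_train = X[:300], X[300:], y[:300]

# Tree trained on the raw features.
raw_tree = DecisionTreeClassifier(random_state=2).fit(X_train, y_train)
raw_pred = raw_tree.predict(X_test)

# Tree trained on scaled features; the scaler is fit on train only and reused on test.
scaler = StandardScaler().fit(X_train)
scaled_tree = DecisionTreeClassifier(random_state=2).fit(scaler.transform(X_train), y_train)
scaled_pred = scaled_tree.predict(scaler.transform(X_test))

agreement = (raw_pred == scaled_pred).mean()
print(f"Share of identical test predictions: {agreement:.3f}")
```

If the two sets of predictions agree, a consistent scaling is not what hurts a tree model; an inconsistent one (a scaler refit on the test set) is a more plausible suspect.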

In [11]:
#PCA for feature reduction

# Applying PCA to reduce dimensionality
#n_components = 55
#pca = PCA(n_components=n_components)
#F = pca.fit_transform(F)
#df2 = pca.transform(df2)

PCA was very efficient time-wise, but it faced a similar issue: the reduced components diluted the original features and lost relationships between them that tree-based models usually rely on. Training ran much faster, but that loss ended up lowering the score of my final selected model.
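One way to gauge how much information a PCA projection keeps is to look at the explained variance. The sketch below uses synthetic data (an assumption) and asks PCA for the smallest number of components that retains 95% of the variance; the 95% target is illustrative and is not a value used in this notebook. High retained variance does not guarantee that the signal useful for classification survives.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the feature matrices (not the competition data).
rng = np.random.default_rng(2)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 20))
X_train = latent @ mixing + 0.1 * rng.normal(size=(500, 20))
X_test = X_train[:50] + 0.1 * rng.normal(size=(50, 20))

# Keep the smallest number of components that explains at least 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_train_pca = pca.fit_transform(X_train)  # fit on train only
X_test_pca = pca.transform(X_test)        # reuse the same projection on test

retained = pca.explained_variance_ratio_.sum()
print(f"components kept: {pca.n_components_}, variance retained: {retained:.3f}")
```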

In [12]:
#Algorithm based feature reduction

# Random Forest feature reduction
#rf = RandomForestClassifier(n_estimators=200, random_state=2)
#rf.fit(F, T)  # F is your feature matrix, T is the target variable

# Gradient Boosting feature reduction
#gb= GradientBoostingClassifier(n_estimators=200, random_state=2)
#gb.fit(F, T)

# LightGB reduction
#lgb = LGBMClassifier(num_leaves=60,learning_rate=0.1,n_estimators=1000,max_depth=12,subsample=0.9,colsample_bytree=1.0,random_state=2,reg_alpha=0.1,reg_lambda=0.1)
#lgb.fit(F, T)

# Feature Selection using CatBoost
#cb= CatBoostClassifier(max_depth=4,learning_rate=0.05,n_estimators=700,subsample=0.9,colsample_bylevel=1.0,random_seed=2,reg_lambda=0.5)
#cb.fit(F, T)

# Extra Trees feature reduction
#et = ExtraTreesClassifier(n_estimators=700,max_depth=4 ,max_features=0.8,min_samples_split=2,random_state=2, n_jobs=-1)
#et.fit(F, T)

# Decision Tree feature reduction
#dt_initial = DecisionTreeClassifier(criterion='gini', max_depth=100, min_samples_split=10, min_samples_leaf=5)
#dt_initial.fit(F, T)


Algorithm-based feature reduction proved far more fruitful, as it captures the importance of each feature relative to the model it is trained on. With every new model I tried, I also used that model for feature selection. While all of these performed well, they didn't make the final cut for the best-performing model.

In [13]:
# Splitting the dataset into training and test sets (70/30 split)
#trainF, testF, trainT, testT = train_test_split(F, T, test_size=0.3, random_state=2, stratify=T)

I found there was no need for a further split, as separate train and test sets were already provided. Splitting again removed training data, which affected my score negatively.

In [14]:
# Checking for class imbalance in the target variable
class_counts = T.value_counts()
print("Class distribution in target variable:")
print(class_counts)

# Displaying percentage distribution
class_percentages = T.value_counts(normalize=True) * 100
print("\nClass distribution percentages:")
print(class_percentages)


Class distribution in target variable:
Y
0    245473
1       649
Name: count, dtype: int64

Class distribution percentages:
Y
0    99.73631
1     0.26369
Name: proportion, dtype: float64


The classes in this dataset are heavily imbalanced, with a ratio of 245,473 to 649, which made me believe they needed to be balanced.


In [15]:
#Class imbalance

#class_weight='balanced'

# Applying SMOTE to the training data to handle class imbalance
#smote = SMOTE(random_state=2)
#F_r, T_r = smote.fit_resample(F_reduced, T)


The first choice was the class_weight hyperparameter, but it wasn't the best option, as I understood it to be better suited to mildly imbalanced classes. The second choice, SMOTE, is a standard technique for handling such a large class imbalance, but it ended up reducing the score. This could be due to it overfitting the minority class, though I'm not sure. Either way, my final result worked out better without balancing.
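To make the size of the imbalance concrete, the arithmetic below works out the weights that scikit-learn's class_weight='balanced' heuristic would assign for the class counts printed earlier. The negative-to-positive ratio is also shown, since it is the value typically suggested for XGBoost's scale_pos_weight; that option was not tried in this notebook and is mentioned only as an assumption about what could be explored.

```python
# Class counts reported in the class-distribution output earlier in the notebook.
n_negative = 245_473
n_positive = 649
n_samples = n_negative + n_positive
n_classes = 2

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * count_of_class).
weight_negative = n_samples / (n_classes * n_negative)
weight_positive = n_samples / (n_classes * n_positive)

# Ratio of negatives to positives (the usual starting value for scale_pos_weight).
imbalance_ratio = n_negative / n_positive

print(f"balanced weight, class 0: {weight_negative:.4f}")
print(f"balanced weight, class 1: {weight_positive:.2f}")
print(f"negative/positive ratio:  {imbalance_ratio:.1f}")
```

With weights this lopsided, each positive row counts roughly as much as 378 negative rows, which may help explain why reweighting or resampling changed the scores so noticeably.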


In [16]:
# Models with respective hyperparameters

# Decision tree
#dt_final = DecisionTreeClassifier(criterion='gini', max_depth=100, min_samples_split=10, min_samples_leaf=5)

#NAIVE BAYES
#nb_final = GaussianNB()

# RANDOM FOREST
#rf_final = RandomForestClassifier(n_estimators=700, criterion='gini', max_depth=25, min_samples_split=5, min_samples_leaf=2,max_features='sqrt', random_state=2,class_weight='balanced')

# Gradient Boosting
#gb_final = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05, max_depth=8, random_state=2)

# KNN classifier
#knn_final = KNeighborsClassifier(n_neighbors=15, weights='distance', algorithm='auto',metric='manhattan')

# AdaBoost classifier with a Decision Tree as the base estimator
#base_estimator = DecisionTreeClassifier(max_depth=2)  # Shallow tree as base estimator
#ada_final = AdaBoostClassifier(estimator=base_estimator, n_estimators=200, learning_rate=0.05, random_state=2, algorithm="SAMME")

# Catboost classifier
#cb_final = CatBoostClassifier(max_depth=8,learning_rate=0.01,n_estimators=1500,subsample=0.8,colsample_bylevel=0.8,random_seed=2,reg_lambda=2.0)

# BaggingClassifier with XGBoost as the base estimator (estimator must be an XGBClassifier defined beforehand)
#bag_final = BaggingClassifier(estimator=estimator,n_estimators=15,max_samples=0.8,max_features=0.8,random_state=2,n_jobs=-1)

# Extra Trees Classifier
#et_final = ExtraTreesClassifier(n_estimators=1500,max_depth=15,max_features=0.7,min_samples_split=5,min_samples_leaf=3,random_state=2,n_jobs=-1)

# Stacking classifier with the base models and meta-model
#stacked_final = StackingClassifier(estimators=[('xgboost', xgb_final),('lightgbm', lgb_final),('catboost',cb_final)],final_estimator=LogisticRegression(max_iter=100),cv=5,n_jobs=-1)

Ignore the hyperparameters above: they act as placeholders and don't represent the best attempt for each algorithm. Algorithm analysis:


*   Decision tree: worked well and relatively fast, with efficiency depending on the depth of the tree. It wasn't the highest-scoring algorithm, as it didn't capture the relationships between features well.

*   Naive Bayes: also very efficient, but with few hyperparameters to tune there wasn't much room for experimentation. Used with PCA for feature reduction; less code and decent results.

*   Random forest: not a fast algorithm. I had high hopes for it but was let down. Several hyperparameter settings gave a high ROC but really low scores, and PCA-based feature reduction didn't help much. This was before I tried algorithm-based feature reduction, so it might have done better with that.

*   Gradient boosting: faster than random forest and significantly better too. At this point I was still relying on PCA for feature selection, which I later improved on.

*   KNN classifier: one of the worst-performing models. It took a lot of time, and varying the number of neighbors barely changed the low scores. Used with and without PCA, both leaving much to be desired.

*   AdaBoost classifier: one of my first good results, but still with a lot to improve on. Fast execution but not amazing results; the reason is explained later on.

*   Bagging classifier: I expected higher performance, as it took far longer than its base estimator (XGBoost), but not much improvement was seen.

*   Extra trees classifier: I expected a lot more, as with random forest, but again the results weren't great.

*   Stacking: I had the highest hopes for this, as it works on an ensemble of models, but the excessive training time needed for multiple models didn't yield the strongest score. It performed well but not the best.

*   CatBoost classifier: decent results, but I expected more, as with AdaBoost. It took some time to process with the larger hyperparameter values.









# **Best performance code**
Includes the best hyperparameters, model(s), imputation, scaling, etc. for the best-scoring submission.


In [17]:
#Option 3: KNN imputation (best submission)
imputer = KNNImputer(n_neighbors=5)
F = imputer.fit_transform(F)
df2 = imputer.fit_transform(df2)

The best of the three imputation techniques tried. Five neighbors worked best; any higher or lower value lowered the score. It took longer to process but was worth it for the performance gain in the end.
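The cell above calls fit_transform on df2, which fits a second imputer on the test rows. A common alternative, sketched here on toy arrays with hypothetical values, fits the imputer once on the training features and reuses it for the test set, so both sets are filled from the same neighbours. Whether this changes the score on this dataset was not tested.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy arrays standing in for the train and test features (hypothetical values).
train = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0],
                  [4.0, 8.0], [5.0, 10.0], [6.0, 12.0]])
test = np.array([[2.5, np.nan], [np.nan, 9.0]])

imputer = KNNImputer(n_neighbors=2)
train_filled = imputer.fit_transform(train)  # learn neighbours from the training rows
test_filled = imputer.transform(test)        # reuse those training rows for the test set

print(test_filled)
```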

In [18]:
# XGBoost for feature selection (best submission)
xgb = XGBClassifier(max_depth=4,learning_rate=0.05,n_estimators=700,subsample=0.9,colsample_bytree=1.0,random_state=2,reg_alpha=0.5,reg_lambda=0.5,n_jobs=-1)
xgb.fit(F, T)

The best algorithm-based feature selection; I used it for XGBoost on its own and in voting too.

In [19]:
# Get feature importances, select the top k features, and apply the selection to both datasets

#F = F.values (needed for catboost feature reduction)
#df2 = df2.values (needed for catboost feature reduction)

importances = xgb.feature_importances_
indices = np.argsort(importances)[::-1]  # Sort in descending order of importance
k = 46 #(best submission)
top_k_indices = indices[:k]
F_reduced = F[:, top_k_indices]  # For training data
df2_reduced = df2[:, top_k_indices]  # For test data

46 features gave the optimal result. Some models needed all the features to perform well, but XGBoost preferred 46. The F and df2 dataframes had to be converted to arrays for CatBoost feature reduction.
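The top-k selection and the array conversion mentioned above can be illustrated on a small example. The importances and the five-column frame below are hypothetical; in the notebook the importances come from xgb.feature_importances_. The sketch shows that positional slicing works directly on NumPy arrays, while a DataFrame needs .iloc for the same selection.

```python
import numpy as np
import pandas as pd

# Hypothetical importances for five features.
importances = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
k = 3
top_k_indices = np.argsort(importances)[::-1][:k]  # indices of the k largest importances

frame = pd.DataFrame(np.arange(10).reshape(2, 5), columns=list("abcde"))
array = frame.to_numpy()

reduced_array = array[:, top_k_indices]       # positional slicing works on NumPy arrays
reduced_frame = frame.iloc[:, top_k_indices]  # DataFrames need .iloc for the same selection

print(top_k_indices)
print(list(reduced_frame.columns))
```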

In [20]:
# Key LightGBM parameters
#(best submission)
lgb_final = LGBMClassifier(num_leaves=50,learning_rate=0.01,n_estimators=1400,max_depth=12,subsample=0.5,colsample_bytree=0.7,random_state=2,reg_alpha=1.0,reg_lambda=1.5)

#(Best submission)
#XGboost main classifier
xgb_final = XGBClassifier(max_depth=8,learning_rate=0.006,n_estimators=2600,subsample=0.7,colsample_bytree=0.5,random_state=2,reg_alpha=2,reg_lambda=2,n_jobs=-1)

These are the base classifiers for the voting classifier. Individually, these hyperparameters gave the best performance for each model, and combining the two via voting improved on that further.

In [21]:
# Create a VotingClassifier using soft voting (averages predicted probabilities)
ensemble_final = VotingClassifier(estimators=[('lightgbm', lgb_final), ('xgboost', xgb_final)],voting='soft',weights=[1.3,1],n_jobs=-1) #(best submission)

Using LightGBM and XGBoost with soft voting and adjusted weights gave the best model.
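Soft voting with weights amounts to a weighted average of the class probabilities from each base model. The sketch below reproduces that arithmetic with the weights [1.3, 1] from the cell above; the probabilities are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical class-1 probabilities from the two base models for three test rows.
p_lgb = np.array([0.10, 0.60, 0.90])
p_xgb = np.array([0.20, 0.40, 0.80])
weights = np.array([1.3, 1.0])  # LightGBM weighted slightly above XGBoost

# Soft voting: weighted average of the per-model probabilities.
p_ensemble = np.average(np.vstack([p_lgb, p_xgb]), axis=0, weights=weights)
print(p_ensemble.round(4))
```

Each ensemble probability lies between the two model outputs and sits slightly closer to the LightGBM value, since its weight is larger.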

In [22]:
# Cross-validation setup (Stratified KFold to ensure balanced classes in each fold)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Perform cross-validation (ROC AUC scoring)
cv_scores = cross_val_score(ensemble_final, F_reduced, T, cv=cv, scoring='roc_auc', n_jobs=-1)

# Output cross-validation scores
print(f"Cross-validation ROC AUC scores: {cv_scores}")
print(f"Mean ROC AUC score from cross-validation: {cv_scores.mean():.4f}")

Cross-validation ROC AUC scores: [0.96888764 0.9665255  0.95386622 0.97368032 0.96171097]
Mean ROC AUC score from cross-validation: 0.9649


I initially focused on the best single ROC score to get an idea of how to fine-tune the hyperparameters, and eventually moved to cross-validation for a more generalized picture. It is far more time-consuming but significantly better for understanding the ROC scores.

In [23]:
#Fitting the chosen model on the dataset
ensemble_final.fit(F_reduced, T)

# Predict probabilities for the test dataset (df2)
test_probs = ensemble_final.predict_proba(df2_reduced)[:, 1]  # Get probabilities for class 1


Here we fit the chosen model on the data and get the probabilities for the target class. The variable ensemble_final was edited to match whichever model was being tested at the time.

In [24]:
# Prepare the submission file with predicted probabilities
df_sample = pd.read_csv("/content/drive/MyDrive/sample_submission.csv")
df_sample['Y'] = test_probs  # Use predicted probabilities

# Save to CSV for submission
df_sample.to_csv("/content/drive/MyDrive/submission_ensemble.csv", index=False)


The sample submission file is read and filled with the calculated probabilities to produce the final submission file.

# **Final thoughts and major issues faced**

*   My scores initially struggled because I misunderstood the submission format. Instead of submitting the probabilities directly, I was using an ROC-based threshold to assign every row a value of 1 or 0, as I thought the Y column of the final submission had to contain 0 or 1. Checking the test_set.csv file in more detail gave me a better understanding of what was needed.

*   I initially attempted a grid search, but it took forever no matter how few hyperparameters I chose, so I resorted to altering parameters individually as needed. Voting and stacking still took too long, so I looked for GPU options, i.e. Colab in this case. That helped a bit, but I reached my resource limit very quickly. The GPU offered by Kaggle also reached its limit quickly, so I had to use CPU resources efficiently.

*   To use the CPU well, I used the Dask DataFrame package and the n_jobs=-1 hyperparameter, which runs the models on all CPU cores and so reduces processing time.

*   The overall score could probably be improved by testing other models more often, but I was confident in the results from XGBoost and LightGBM and so focused mainly on them.
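The first point above, submitting 0/1 labels instead of probabilities, can be illustrated with a small example, assuming the competition metric is ROC AUC as the cross-validation scoring suggests. The labels and probabilities are hypothetical, and the helper roc_auc is a plain pairwise implementation written for this sketch rather than a library call.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities for five rows.
labels = [0, 0, 0, 1, 1]
probabilities = [0.1, 0.3, 0.6, 0.4, 0.9]
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]  # thresholded at 0.5

auc_probabilities = roc_auc(labels, probabilities)
auc_hard_labels = roc_auc(labels, hard_labels)
print(f"AUC with probabilities: {auc_probabilities:.4f}")  # 0.8333
print(f"AUC with 0/1 labels:    {auc_hard_labels:.4f}")    # 0.5833
```

Thresholding collapses the ranking into two groups, so ordering information within each group is lost; in this example that lowers the AUC.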