# **IML Kaggle Challenge 2**
Instructor : Miss Solat Jabeen Sheikh

ERP: 25156

Name: Shahmeer Khan

Kaggle username: ShahmeerKhan10

## **Base description**
This notebook tests several of the models provided to us in order to achieve the best possible score on the given dataset. The attempted models and their code are kept in the notebook but commented out to avoid confusion; only the best-performing model and its hyperparameters are left uncommented. Because the work was done in Google Colab, the CSV files were uploaded to Google Drive and read from there.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Drive is mounted to access the CSV files used by the models.

In [None]:
!pip install dask[dataframe]
!pip install lightgbm
!pip install imbalanced-learn
!pip install scikit-learn
!pip install xgboost
!pip install catboost

Installs the remaining packages needed for model training and testing; some of these were needed for GPU usage. Dask DataFrame was installed for parallel computing. I attempted to use the GPU instead, but Colab time limits restricted how much I could use it.

In [None]:
# Importing the necessary libraries
import numpy as np
import pandas as pd
import tensorflow as tf
import torch.nn as nn
import torch.optim as optim
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, SGDRegressor
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, PolynomialFeatures, OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from xgboost import XGBRegressor

In [None]:
# Loading the train and test files
df1 = pd.read_csv("/content/drive/MyDrive/train.csv")
df2 = pd.read_csv("/content/drive/MyDrive/test.csv")
sample_submission = pd.read_csv("/content/drive/MyDrive/sample_submission.csv")

# Splitting data into features (F) and target variable (T)
F = df1.drop(columns=['price_doc'])  # Features (all columns except 'price_doc')
T = df1['price_doc']                # Target column

Because the notebook runs in Colab, the CSV files above had to be read from Drive.

In [None]:
# Identify categorical columns
categorical_columns = F.select_dtypes(include=['object', 'category', 'bool']).columns

# Apply one-hot encoding
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
F_encoded = encoder.fit_transform(F[categorical_columns])
df2_encoded = encoder.transform(df2[categorical_columns])

# Convert encoded data into DataFrames and assign appropriate column names
F_encoded = pd.DataFrame(F_encoded, columns=encoder.get_feature_names_out(categorical_columns), index=F.index)
df2_encoded = pd.DataFrame(df2_encoded, columns=encoder.get_feature_names_out(categorical_columns), index=df2.index)

# Concatenate the one-hot encoded columns back with the non-categorical columns
F = pd.concat([F.drop(columns=categorical_columns), F_encoded], axis=1)
df2 = pd.concat([df2.drop(columns=categorical_columns), df2_encoded], axis=1)

# Save the 'row ID' column from the test set (df2), to be included in the submission file later
test_ids = df2['row ID']

# Align the test set columns with the training set columns
missing_columns = set(F.columns) - set(df2.columns)
for col in missing_columns:
    df2[col] = 0  # Add missing columns to the test set with 0 values

# Reorder columns in df2 to match the order of the training set
df2 = df2[F.columns]

This entire process was part of the best submission, which used all feature columns and converted the categorical ones via one-hot encoding.

In [None]:
# Drop categorical columns from both train and test sets
#F = F.drop(columns=categorical_columns)
#df2 = df2.drop(columns=categorical_columns)

I also tried some runs that simply dropped the categorical columns.

# **Models and parameters used for training and testing**
This section includes all the models, scaling and imputation techniques, feature reduction methods, etc. that did not contribute to the final best score.

In [None]:
# Imputation techniques

#Option 1: Mean imputation
#imputer = SimpleImputer(strategy='mean')
#F = imputer.fit_transform(F)
#df2 = imputer.transform(df2)  # reuse the statistics learned from the training set

#Option 2: Median imputation
#imputer = SimpleImputer(strategy='median')
#F = imputer.fit_transform(F)
#df2 = imputer.transform(df2)  # reuse the statistics learned from the training set

While both of these imputation techniques improved the final results, they were far too generalized compared to KNN imputation, which gave much better results. They were significantly more time-efficient, but their output scores were not better than KNN's.
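The difference between the two approaches can be seen on a tiny example. This is only a sketch on made-up arrays (`X_train` and `X_test` are hypothetical, and `n_neighbors=2` is chosen to suit the toy size): mean imputation fills every gap in a column with the same value, while KNN imputation fills each gap from the rows that look most similar.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data with missing values (np.nan)
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 40.0]])
X_test = np.array([[2.5, np.nan], [np.nan, 20.0]])

# Fit each imputer on the training data only, then reuse it on the test data
mean_imputer = SimpleImputer(strategy="mean").fit(X_train)
knn_imputer = KNNImputer(n_neighbors=2).fit(X_train)

X_test_mean = mean_imputer.transform(X_test)  # gaps filled with the training column means
X_test_knn = knn_imputer.transform(X_test)    # gaps filled from the nearest training rows

print(X_test_mean)
print(X_test_knn)
```

KNN imputation has to compute distances between rows, which is why it is much slower than the mean or median strategies on a large dataset.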

In [None]:
#Scaling techniques

#Option 2 : Standard Scaler
#scaler = StandardScaler()
#F = scaler.fit_transform(F)
#df2 = scaler.transform(df2)  # reuse the statistics learned from the training set

#Option 3: Robust Scaler
#scaler = RobustScaler()
#F = scaler.fit_transform(F)
#df2 = scaler.transform(df2)  # reuse the statistics learned from the training set

Scaling the data had a net positive effect. Different scalers yielded different results, and the best-performing one (MinMax scaler) is covered later on.
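One detail worth checking in any scaling step is whether the test set is transformed with the statistics learned from the training set or refitted on its own. The sketch below uses two made-up one-column arrays to show how the two choices differ; it is an illustration, not a claim about which variant produced any particular score.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature train and test sets
X_train = np.array([[0.0], [10.0]])
X_test = np.array([[5.0], [20.0]])

# Shared scaler: the test set is mapped with the training min and max
scaler = MinMaxScaler().fit(X_train)
test_shared = scaler.transform(X_test)

# Refitted scaler: the test set defines its own min and max
test_refit = MinMaxScaler().fit_transform(X_test)

print(test_shared.ravel())  # [0.5 2. ]
print(test_refit.ravel())   # [0. 1.]
```

With the shared scaler, a raw value of 5 means the same thing in both sets. With the refitted scaler, 5 in the test set maps to 0, which is what 0 maps to in the training set, so the model sees shifted inputs.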

In [None]:
#PCA for feature reduction

# Applying PCA to reduce dimensionality
#n_components = 150
#pca = PCA(n_components=n_components)
#F = pca.fit_transform(F)
#df2 = pca.transform(df2)

While feature reduction using PCA did help with other models, it had no benefit for my final random forest submission. This variant of PCA specifies the number of components directly.

In [None]:
#PCA for feature reduction (variance)

# Applying PCA to reduce dimensionality
#pca = PCA(n_components=0.95)
#F = pca.fit_transform(F)
#df2 = pca.transform(df2)

This version of PCA takes the fraction of variance to retain as its input; again, it was not too useful.
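To make the variance-based variant concrete, here is a small sketch on synthetic data (the arrays and the random seed are assumptions for illustration). When `n_components` is a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance exceeds that fraction, and the chosen count is available as `n_components_` after fitting.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 observed features driven by 3 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# Keep as many components as needed to retain at least 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

retained = pca.explained_variance_ratio_.sum()
print("components kept:", pca.n_components_)
print("variance retained:", round(float(retained), 3))
```

Checking `n_components_` on the real data shows how aggressive the 0.95 threshold actually is compared with the fixed choice of 150 components.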

In [None]:
#Algorithm based feature reduction

# GB boost for feature selection
#gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, min_samples_split=5, min_samples_leaf=3, subsample=0.8, random_state=42);
#gb.fit(F, T);

# XG boost features
#xgb = XGBRegressor(n_estimators=80, learning_rate=0.05, max_depth=8, subsample=0.8, colsample_bytree=0.8, random_state=42, objective="reg:squarederror");
#xgb.fit(F, T);

# random forest feature selection
#rf = RandomForestRegressor(n_estimators=100, max_depth=5, min_samples_split=10, min_samples_leaf=3, max_features=0.5, random_state=42, n_jobs=-1 )
#rf.fit(F, T)


In [None]:
# Get feature importances, select the top k features, and reduce the dataset to them
#importances = xgb.feature_importances_
#indices = np.argsort(importances)[::-1]  # Sort in descending order of importance
#k = 200 #(best submission)
#top_k_indices = indices[:k]
#F = F[:, top_k_indices]  # For training data
#df2 = df2[:, top_k_indices]  # For test data

Using these algorithm-specific feature reduction methods, I attempted feature selection by picking the top k indices. However, since my best submission used all features, this was not too beneficial.
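The top-k selection can be sketched on synthetic data as follows. The data, the seed, and the small forest are assumptions made for the example; the target depends only on the first two columns, so those should receive the largest importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only features 0 and 1 influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 10.0 * X[:, 0] + 5.0 * X[:, 1] + 0.1 * rng.normal(size=300)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Rank features by importance (descending) and keep the top k
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
k = 2
top_k_indices = indices[:k]

X_top_k = X[:, top_k_indices]  # the same indices must be applied to the test set
print(top_k_indices, X_top_k.shape)
```

The `X[:, top_k_indices]` indexing assumes NumPy arrays, which is what the imputers and scalers return; on a pandas DataFrame the equivalent would be `.iloc[:, top_k_indices]`.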

In [None]:
# For polynomial regression

# Generate Polynomial Features
#poly = PolynomialFeatures(degree=3)  # Change degree as needed
#F_poly = poly.fit_transform(F)  # Apply polynomial transformation

# Generate Polynomial Features for the test set (after PCA transformation)
#df2_poly = poly.transform(df2)

# Train the model on the reduced feature set
#poly_final = LinearRegression()

# Cross-validation to evaluate RMSE
#cv_scores = cross_val_score(poly_final, F_poly, T, cv=5, scoring='neg_root_mean_squared_error')

# Convert negative RMSE to positive and compute the mean RMSE
#rmse_cv = -cv_scores.mean()

#print(f"Cross-validated RMSE for Polynomial Regression (Degree={poly.degree}): {rmse_cv}")

#poly_final.fit(F_poly, T)

# Prediction for polynomial regression
#predictions = poly_final.predict(df2_poly)

This took up all the GPU available to me on Colab and was not worth it in the end; the results were mediocre for the time and effort spent to get a single score.

In [None]:
#Linear regression with lasso for feature selection

# Lasso for feature selection
#lasso_selector = Lasso(alpha=0.5,max_iter=10000)  # Adjust alpha based on cross-validation
#lasso_selector.fit(F, T)

# Identify the selected features (non-zero coefficients)
#selected_features_mask = lasso_selector.coef_ != 0
#selected_features = F[:, selected_features_mask]

# Train Linear Regression on selected features
#lin_final = LinearRegression()
#lin_final.fit(selected_features, T)

# Cross-validation on the reduced dataset
#cv_scores = cross_val_score(lin_final, selected_features, T, cv=5, scoring='neg_root_mean_squared_error')
#rmse_cv = -cv_scores.mean()
#print(f"Cross-validated RMSE (after Lasso feature selection): {rmse_cv}")

# Transform the test set to include only the selected features
#df2_selected = df2[:, selected_features_mask]

# Make predictions on the test set
#predictions = lin_final.predict(df2_selected)


In [None]:
# Linear Regression Model
#lin_final = LinearRegression()
#lin_final.fit(F,T)

Linear regression did not perform too badly, but its lack of hyperparameters limited what could be tested with it. I used Lasso for feature selection, but it did not help much.

In [None]:
# KNN Regressor Model
#knn_final = KNeighborsRegressor(n_neighbors=60)  # You can adjust the number of neighbors (n_neighbors) as needed
#knn_final.fit(F, T)

Compared to the last classification problem, KNN performed much better here, which may be a sign that it is better suited to regression problems, but again the score was not too impressive.

In [None]:
# Decision Tree Regressor Model
#tree_final = DecisionTreeRegressor(max_depth=10, min_samples_split=15,min_samples_leaf=3,criterion="friedman_mse",max_features=0.5, random_state=42)  # Adjust max_depth and other parameters as needed
#tree_final.fit(F, T)

I really liked the decision tree for its low execution time. It also gave really good results, but random forest improved on them greatly.

In [None]:
# Custom base estimator
#base_est = DecisionTreeRegressor(max_depth=5, min_samples_split=10)

# AdaBoost with a custom base estimator
#ada_final = AdaBoostRegressor(estimator=base_est,n_estimators=500,learning_rate=0.05,random_state=42)
#ada_final.fit(F, T)

AdaBoost took far too long to execute, and I did not have enough time to test it thoroughly. From what I could test, it performed really well but was not the best.

In [None]:
# Gradient boosting
#gb_final = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1, max_depth=5, min_samples_split=10 , min_samples_leaf=5, subsample=0.8, random_state=42);
#gb_final.fit(F, T);

This took even longer than AdaBoost. I had trouble configuring GPU usage, so I could not do much with it; the results were not worth the execution times.

In [None]:
# XG boost model
#xgb_final = XGBRegressor(n_estimators=80, learning_rate=0.05, max_depth=7, subsample=0.8, colsample_bytree=0.8, random_state=42, objective="reg:squarederror");
#xgb_final.fit(F, T);

XGBoost performed really well and would probably have been my best algorithm if not for random forest. I tuned the parameters as far as I could via CV scores, but that was not very helpful in the end. Execution times were low as well.

In [None]:
# Define the Early Stopping callback
#early_stop = EarlyStopping(monitor='val_loss',patience=10,restore_best_weights=True)

#nn_final = Sequential([Dense(128, activation='relu', input_shape=(F.shape[1],)), Dropout(0.3), Dense(128, activation='relu'), Dropout(0.3), Dense(1)])
#nn_final.compile(optimizer='adam', loss='mse', metrics=[tf.keras.metrics.RootMeanSquaredError()])
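# Hold out 20% of the training data for validation so early stopping is not monitored on the training set itself
#nn_final.fit(F, T, validation_split=0.2, epochs=100, batch_size=128, verbose=1,callbacks=[early_stop])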

I had a lot of hope for this one, and execution times were very favourable, but I could not get the most out of it in the end. It might have been the best one given more time to run and tune.

In [None]:
# Define the Early Stopping callback
#early_stop = EarlyStopping(monitor='val_loss',patience=10,restore_best_weights=True )

# Neural Network
#nn_final = Sequential([Dense(256, activation='swish', input_shape=(F.shape[1],), kernel_regularizer=tf.keras.regularizers.l2(1e-4)),Dropout(0.3),Dense(128, activation='swish', kernel_regularizer=tf.keras.regularizers.l2(1e-4)),Dropout(0.3),Dense(64, activation='swish', kernel_regularizer=tf.keras.regularizers.l2(1e-4)),Dense(1)])

#nn_final.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005), loss='mse', metrics=[tf.keras.metrics.RootMeanSquaredError()])

# Train the model with Early Stopping
#nn_final.fit(F, T, validation_split=0.2, epochs=60, batch_size=64, verbose=1, callbacks=[early_stop])

Attempted a neural network with swish activations; it was not much better, but again execution times were fast.

# **Best performance code**
Includes the best hyperparameters, model(s), imputation, scaling, etc. for the code that produced the best result.

In [None]:
#Option 3: KNN imputation (best submission)
imputer = KNNImputer(n_neighbors=10)
F = imputer.fit_transform(F)   # fit the imputer on the training features only
df2 = imputer.transform(df2)   # apply the same fitted imputer to the test set

This was the best imputation technique of the three used. 10 neighbors worked best; any higher or lower value worsened the score.

In [None]:
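# MinMax Scaler
scaler = MinMaxScaler()
F = scaler.fit_transform(F)   # learn each feature's min and max from the training set
df2 = scaler.transform(df2)   # scale the test set with the training min and max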

This was the best scaling technique of the three used; the other two performed very well but not well enough to be selected for the best submission.

In [None]:
#Random forest regressor
rf_final = RandomForestRegressor(n_estimators=500, max_depth=10, min_samples_split=15, min_samples_leaf=3, max_features=0.5, random_state=42, n_jobs=-1 )
rf_final.fit(F, T)

Best-performing algorithm, with XGBoost a close competitor. Runs took too long, so I could not fine-tune the hyperparameters enough, but it still gave the best results in the end.

In [None]:
# Calculate RMSE using cross-validation
cv_scores = cross_val_score(rf_final, F, T, cv=5, scoring='neg_root_mean_squared_error')

# Convert negative RMSE to positive and compute the mean RMSE
rmse_cv = -cv_scores.mean()
print(f"Cross-validated RMSE: {rmse_cv}")

Used cross-validation with the negative RMSE scorer, then negated the result to get a positive RMSE as an estimate of how well my hyperparameters perform.
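A possible refinement, sketched here on synthetic data rather than the competition set, is to wrap imputation, scaling, and the model in a single scikit-learn pipeline. `cross_val_score` then refits the imputer and scaler inside each fold, so the validation fold never influences the preprocessing. The data, seed, and reduced `n_estimators` are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data with roughly 10% of the feature values missing
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)
X[rng.random(X.shape) < 0.1] = np.nan

# Preprocessing and model in one estimator, refitted from scratch in every fold
pipeline = make_pipeline(
    KNNImputer(n_neighbors=10),
    MinMaxScaler(),
    RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42),
)

cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="neg_root_mean_squared_error")
rmse_cv = -cv_scores.mean()  # the scorer returns negative RMSE, so flip the sign
print(f"Cross-validated RMSE: {rmse_cv:.3f}")
```

The trade-off is run time: KNN imputation is repeated for every fold, which may matter given how long the full dataset already takes to process.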

In [None]:
# Make predictions on the test set (df2)
predictions = rf_final.predict(df2)

# Prepare the submission file
# The submission file should have the 'row ID' column from df2 and the predicted 'price_doc'
submission = sample_submission.copy()

# Ensure that we include the 'row ID' column from df2 in the submission
submission['row ID'] = test_ids  # Use 'row ID' from the test set
submission['price_doc'] = predictions  # Add the predicted price_doc values

# Save the submission file to a CSV
submission.to_csv("/content/drive/MyDrive/submission_rf.csv", index=False)

Finally, the model was used to predict values, which were written to the submission file.

# **Final thoughts and major issues faced**


*   Time constraints: many of these cells took hours to run, so submissions were limited, and GPU usage did not always help with execution times.
*   GPU usage caused further issues when polynomial regression was run; it took up most or all of the Colab GPU, so I resorted to CPU usage in the end.
*   I still think more time with neural networks could produce amazing results, but again there was not enough time.
*   Converting categorical columns to one-hot encoding used a lot of CPU and would often crash, so I had to use a lot of GPU for that as well.



