# Advanced Machine Learning - Assignment 1
### Rohan Rocky Britto - Student ID: 24610990

## Data Processing

Import required packages

In [1]:
import pandas as pd
import numpy as np
import re

Import the data processed and stored in previous experiments

In [2]:
X_train = pd.read_csv('../data/processed/X_train.csv')
X_val = pd.read_csv('../data/processed/X_val.csv')
X_test = pd.read_csv('../data/processed/X_test.csv')
y_train = pd.read_csv('../data/processed/y_train.csv')
y_val = pd.read_csv('../data/processed/y_val.csv')

Import the test data to retrieve player_id

In [3]:
df_test = pd.read_csv('../data/raw/test.csv')

## Model Building and Evaluation

Import fit_predict_proba function from the saved functions

In [4]:
import sys
sys.path.append('../src/models')
from functions import fit_predict_proba
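
The source of `fit_predict_proba` is not shown in this notebook. Judging only by how it is called and by the output it prints, it plausibly fits the model and reports the AUROC on the training and validation sets. A minimal sketch under that assumption follows; the name `fit_predict_proba_sketch` and its return value are hypothetical and are not the contents of `../src/models/functions.py`.

```python
from sklearn.metrics import roc_auc_score


def fit_predict_proba_sketch(model, X_train, y_train, X_val, y_val):
    """Hypothetical stand-in: fit the model, then print AUROC on both splits."""
    model.fit(X_train, y_train)
    # Probability of the positive class (column 1) for each split
    train_proba = model.predict_proba(X_train)[:, 1]
    val_proba = model.predict_proba(X_val)[:, 1]
    train_auc = roc_auc_score(y_train, train_proba)
    val_auc = roc_auc_score(y_val, val_proba)
    print("The AUROC value for the training set is: ", train_auc)
    print("The AUROC value for the validation set is: ", val_auc)
    # Returning the scores is an assumption made here for convenience
    return train_auc, val_auc
```

If the real helper behaves this way, each call refits the model passed to it, so the scores always reflect the feature set given in that call.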

### AdaBoost

Since AdaBoost performed best, we tune its hyperparameters to improve performance and reduce overfitting.

In [5]:
from sklearn.ensemble import AdaBoostClassifier

In [6]:
model = AdaBoostClassifier(random_state=8)

In [7]:
model.fit(X_train, y_train.values.ravel())

### Feature Selection

In [8]:
model.feature_importances_

array([0.02, 0.02, 0.24, 0.04, 0.02, 0.02, 0.  , 0.02, 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.02, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.02, 0.02, 0.02, 0.06, 0.  , 0.02, 0.02, 0.  , 0.02,
       0.  , 0.  , 0.02, 0.  , 0.04, 0.04, 0.  , 0.02, 0.12, 0.  , 0.02,
       0.02, 0.  , 0.  , 0.02, 0.02, 0.  , 0.02, 0.02, 0.  , 0.  , 0.  ,
       0.04, 0.02])

The array above shows that many features have an importance of zero, so the fitted model does not rely on them. Removing them may reduce the noise the model tries to fit and thus reduce overfitting.

Find the indices of all features with a non-zero feature importance and copy those columns into new dataframes.

In [9]:
# Column positions of the features with non-zero importance
res = np.flatnonzero(model.feature_importances_).tolist()

In [10]:
X_train_cleaned = X_train.iloc[:,res]
X_val_cleaned = X_val.iloc[:,res]
X_test_cleaned = X_test.iloc[:,res]
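
To make the selection step easier to try in isolation, here is a small sketch of the same idea on synthetic data. The dataset, the column names `f0` to `f19`, and the `demo_*` variables are assumptions made for illustration and are not the assignment's data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data (assumption: not the assignment's dataset)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=4, random_state=8
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])

demo_model = AdaBoostClassifier(random_state=8).fit(X_demo, y_demo)

# Positions of the features whose importance is non-zero
keep_idx = np.flatnonzero(demo_model.feature_importances_)
X_demo_cleaned = X_demo.iloc[:, keep_idx]

print(f"kept {len(keep_idx)} of {X_demo.shape[1]} features")
print(list(X_demo_cleaned.columns))
```

The importances come from the fit on the training data only, and the same column positions are then applied to every split, which keeps the validation and test sets aligned with what the model was trained on.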

Let us check the performance of the model now

In [11]:
fit_predict_proba(model, X_train_cleaned, y_train.values.ravel(), X_val_cleaned, y_val.values.ravel())

The AUROC value for the training set is:  0.9965946779711462
The AUROC value for the validation set is:  0.9960520967401683


The model performance has remained almost the same, so we will continue with only these features.

### Automated Hyperparameter tuning

I will be using Randomized Search for hyperparameter tuning

In [12]:
from sklearn.model_selection import RandomizedSearchCV

From my manual runs, I found the range of hyperparameter values where the model performs best. I have excluded those manual runs from the final submission notebook.

In [26]:
from scipy.stats import uniform
from scipy.stats import randint
hyperparams_dist = {
    'n_estimators': randint(50, 300),   # integers from 50 to 299 (upper bound excluded)
    'learning_rate': uniform(0.3, 0.7)  # uniform(loc, scale): values in [0.3, 1.0]
}

scoring = 'roc_auc'
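
The arguments of `scipy.stats.uniform` are easy to misread: they are `loc` and `scale`, so `uniform(0.3, 0.7)` covers 0.3 to 1.0 rather than 0.3 to 0.7. The short check below samples from both distributions to confirm the ranges being searched; the sample size and the variable names are arbitrary choices for illustration.

```python
from scipy.stats import randint, uniform

# Same distributions as in the search space above
lr_dist = uniform(0.3, 0.7)    # loc=0.3, scale=0.7 -> values in [0.3, 1.0]
n_est_dist = randint(50, 300)  # integers from 50 to 299 (upper bound excluded)

lr_samples = lr_dist.rvs(size=1000, random_state=8)
n_est_samples = n_est_dist.rvs(size=1000, random_state=8)

# Observed extremes should stay inside the ranges noted above
print(lr_samples.min(), lr_samples.max())
print(n_est_samples.min(), n_est_samples.max())
```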

In [27]:
rs_model = RandomizedSearchCV(model, hyperparams_dist, random_state=8, verbose=1, scoring=scoring)

In [28]:
rs_model.fit(X_train_cleaned, y_train.values.ravel())

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [29]:
rs_model.best_params_

{'learning_rate': 0.9114005819542714, 'n_estimators': 291}
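
`best_params_` alone does not show how close the other candidates were. A fitted `RandomizedSearchCV` also exposes `cv_results_` and `best_score_`, which give the mean cross-validated score per sampled candidate. The sketch below shows one way to read them, using synthetic data and a deliberately small search so it runs quickly; the data, the reduced ranges, and the `demo_search` name are assumptions, not the search run in this notebook.

```python
import pandas as pd
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data and a small search (assumptions, chosen for speed)
X_cv, y_cv = make_classification(n_samples=300, n_features=10, random_state=8)
demo_search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=8),
    {"n_estimators": randint(10, 30), "learning_rate": uniform(0.3, 0.7)},
    n_iter=4,
    cv=3,
    scoring="roc_auc",
    random_state=8,
)
demo_search.fit(X_cv, y_cv)

# One row per sampled candidate, best-ranked first
columns = [
    "param_n_estimators",
    "param_learning_rate",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
results = pd.DataFrame(demo_search.cv_results_)[columns].sort_values("rank_test_score")
print(results)
print("best CV AUROC:", demo_search.best_score_)
```

If the top few rows differ by less than their `std_test_score`, the search is not strongly separating those settings, and a smaller `n_estimators` with a similar score may be a reasonable choice.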

In [30]:
final_model = AdaBoostClassifier(random_state=8, **rs_model.best_params_)

In [32]:
fit_predict_proba(final_model, X_train_cleaned, y_train.values.ravel(), X_val_cleaned, y_val.values.ravel())

The AUROC value for the training set is:  0.9989965086788659
The AUROC value for the validation set is:  0.9979736563069224


## Testing and submission file preparation

In [33]:
df_submission = pd.DataFrame({})
df_submission['player_id'] = df_test['player_id']
df_submission['drafted'] = final_model.predict_proba(X_test_cleaned)[:,1]
df_submission.to_csv('../data/processed/submission3.csv', index=False)