# Advanced Machine Learning - Assignment 1
### Rohan Rocky Britto - Student ID: 24610990

## Data Processing

Following the Canvas discussions, I understood that the incorrect values in the height feature could have been introduced by Kaggle and/or Pandas, so I decided to reprocess the data from scratch. I also made some other changes to the pre-processing, so I will retry Random Forest first and then move on to other algorithms for a fair comparison.

Import required packages

In [1]:
import pandas as pd
import numpy as np
import re

Read the raw data file

In [2]:
df = pd.read_csv('../data/raw/train.csv')
df_test = pd.read_csv('../data/raw/test.csv')

Copy the dataframe into a new dataframe for processing

In [3]:
df_cleaned = df.copy()

Move the target variable into a separate Series

In [4]:
target = df_cleaned.pop('drafted')

We need to drop a few features for the following reasons:
<br>&emsp;&emsp;1. The Rec_Rank, dunks_ratio and pick columns have a lot of null values. Filling them with mean values would make the data deviate from the real-world data, so I have decided to drop them.
<br>&emsp;&emsp;2. The type feature has only one unique value and would not help the model make predictions.
<br>&emsp;&emsp;3. num and player_id are identifiers and can lead to overfitting.

In [5]:
df_cleaned.drop(['Rec_Rank', 'dunks_ratio', 'pick', 'type', 'num', 'player_id'], inplace=True, axis=1)
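
The three reasons for dropping columns can be checked with a per-column summary. The following is a minimal sketch on a toy frame, not the actual training data: the values, the `games` column and the 0.5 null-share threshold are assumptions made for illustration only.

```python
import pandas as pd

# Toy frame standing in for the training data (all values are made up)
toy = pd.DataFrame({
    'Rec_Rank': [None, None, 87.5, None, 42.0],
    'type': ['all', 'all', 'all', 'all', 'all'],
    'player_id': ['a1', 'b2', 'c3', 'd4', 'e5'],
    'games': [30, 28, 30, 12, 28],
})

summary = pd.DataFrame({
    'null_share': toy.isnull().mean(),  # fraction of missing rows per column
    'n_unique': toy.nunique(),          # distinct non-null values per column
})
print(summary)

# Hypothetical rules mirroring the three reasons given above
mostly_null = summary.index[summary['null_share'] > 0.5].tolist()
constant = summary.index[summary['n_unique'] == 1].tolist()
id_like = summary.index[summary['n_unique'] == len(toy)].tolist()
print(mostly_null, constant, id_like)
```

On the real data the same summary would be computed on `df_cleaned` before the drop. The cut-off for "a lot of null values" remains a judgment call.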

Create separate lists of columns with numerical and categorical values

In [6]:
num_cols = list(df_cleaned.select_dtypes('number').columns)
cat_cols = list(set(df_cleaned.columns) - set(num_cols))

View the value count of categorical data before converting it

In [7]:
for col in cat_cols:
    print('Value counts for', col, ': ', df_cleaned[col].value_counts(), '\n')

Value counts for yr :  yr
Jr      14923
Fr      14906
So      13252
Sr      12711
0           5
57.1        1
42.9        1
Name: count, dtype: int64 

Value counts for ht :  ht
7-Jun     5578
8-Jun     5498
4-Jun     5363
5-Jun     5353
6-Jun     5126
3-Jun     5125
2-Jun     4648
9-Jun     3988
1-Jun     3539
Jun-00    2984
10-Jun    2491
11-May    1518
10-May    1378
11-Jun    1119
Jul-00     653
9-May      598
8-May      242
-          241
1-Jul      201
7-May       95
2-Jul       88
3-Jul       40
6-May       40
Apr-00      20
0           19
4-Jul       11
5-May        8
6-Jul        7
5-Jul        4
4-May        4
3-May        3
2-May        3
Jr           2
1-May        2
So           1
Fr           1
6'4          1
5-Apr        1
Name: count, dtype: int64 

Value counts for conf :  conf
ACC     2297
A10     2268
SEC     2199
B10     2123
CUSA    2113
MEAC    2027
Slnd    2008
BE      1977
MAC     1914
SB      1857
SWAC    1775
SC      1770
OVC     1769
BSth    1723
B12     1714
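
The ht values above look like feet-inches heights that were converted to dates somewhere upstream, for example `7-Jun` for 6 ft 7 in and `Jun-00` for 6 ft 0 in. This notebook keeps the raw strings and frequency-encodes them later. The sketch below shows how such values could be decoded into inches instead. The month-to-feet reading is an assumption, and the helper name `ht_to_inches` is hypothetical.

```python
import re

# Assumption: the month abbreviation encodes feet and the number encodes inches
MONTH_TO_FEET = {'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7}

def ht_to_inches(value):
    """Return the height in inches, or None when the value is not date-like."""
    text = str(value)
    m = re.fullmatch(r'(\d{1,2})-([A-Z][a-z]{2})', text)      # e.g. '7-Jun'
    if m:
        inches, month = int(m.group(1)), m.group(2)
    else:
        m = re.fullmatch(r'([A-Z][a-z]{2})-(\d{1,2})', text)  # e.g. 'Jun-00'
        if not m:
            return None
        month, inches = m.group(1), int(m.group(2))
    feet = MONTH_TO_FEET.get(month)
    if feet is None or inches > 11:
        return None
    return feet * 12 + inches

for raw in ['7-Jun', 'Jun-00', '11-May', '-', "6'4", 'Jr']:
    print(raw, ht_to_inches(raw))
```

Values that cannot be decoded come back as `None` and would still need to be imputed.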

Replace abnormalities in yr feature with mode value

In [8]:
valid_yr_values = ['So', 'Sr', 'Jr', 'Fr']
# Anything outside the four valid class years (including nulls) is replaced with the mode
df_cleaned['yr'] = df_cleaned['yr'].where(df_cleaned['yr'].isin(valid_yr_values), df_cleaned['yr'].mode()[0])

Replace abnormalities in ht feature with mode value

In [9]:
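# Raw string, with the alternation grouped so that ^ and $ apply to both forms ('7-Jun' and 'Jun-00')
pattern = r'^(?:\d{1,2}-[A-Z][a-z]{2}|[A-Z][a-z]{2}-\d{1,2})$'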
replacement = df['ht'].mode()[0]

def replace_non_matching(item):
    return item if re.match(pattern, str(item)) else replacement

df_cleaned['ht'] = df_cleaned['ht'].apply(replace_non_matching)
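
A side note on the regular expression for ht: in Python's `re` syntax, `|` binds more loosely than the anchors, so in a pattern of the form `^A|B$` the `^` applies only to `A` and the `$` only to `B`. The sketch below compares that ungrouped form with a grouped variant. The sample strings are made up for illustration.

```python
import re

ungrouped = r'^(\d{1,2}-[A-Z][a-z]{2})|([A-Z][a-z]{2}-\d{1,2})$'
grouped = r'^(?:\d{1,2}-[A-Z][a-z]{2}|[A-Z][a-z]{2}-\d{1,2})$'

samples = ['7-Jun', 'Jun-00', '7-June', "6'4", 'Jr']
for s in samples:
    # re.match anchors at the start only; the end anchor must come from the pattern
    print(s, bool(re.match(ungrouped, s)), bool(re.match(grouped, s)))
```

Only `7-June` differs between the two: the ungrouped form accepts it because its first alternative has no end anchor.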

Create a list of columns having null values

In [10]:
null_cols = list(df_cleaned.columns[df_cleaned.isnull().any()])

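Replace null values with the mean for numerical columns and the mode for categorical columns

In [11]:
for col in null_cols:
    if col in num_cols:
        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mean())
    else:
        # mode() returns a Series, so take its first (most frequent) value
        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mode()[0])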
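
A note on filling with the mode: `Series.mode()` returns a Series rather than a scalar, and `fillna` aligns a Series argument by index, so the first element has to be taken before filling. A minimal sketch on a toy column (values are made up):

```python
import pandas as pd

s = pd.Series(['ACC', None, 'SEC', 'ACC', None])

# Passing the Series returned by mode() only fills the row whose index label matches (0 here)
filled_with_series = s.fillna(s.mode())

# Taking the first element gives a scalar that fills every missing row
filled_with_scalar = s.fillna(s.mode()[0])

print(filled_with_series.isnull().sum())
print(filled_with_scalar.isnull().sum())
```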

Define a function for processing test data

In [12]:
def process_data(df_data):
    # Keep only heights in the date-like forms seen in the data ('7-Jun' or 'Jun-00')
    pattern = r'^(?:\d{1,2}-[A-Z][a-z]{2}|[A-Z][a-z]{2}-\d{1,2})$'
    replacement = df['ht'].mode()[0]  # mode of the training data
    def replace_non_matching(item):
        return item if re.match(pattern, str(item)) else replacement
    df_data['ht'] = df_data['ht'].apply(replace_non_matching)

    df_data.drop(['Rec_Rank', 'dunks_ratio', 'pick', 'type', 'num', 'player_id'], inplace=True, axis=1)
    valid_yr_values = ['So', 'Sr', 'Jr', 'Fr']
    # Anything outside the four valid class years (including nulls) is replaced with the mode
    df_data['yr'] = df_data['yr'].where(df_data['yr'].isin(valid_yr_values), df_data['yr'].mode()[0])
    
    null_cols = list(df_data.columns[df_data.isnull().any()])
    for col in null_cols:
        if col in num_cols:
            df_data[col] = df_data[col].fillna(df_data[col].mean())
        else:
            # mode() returns a Series, so take its first (most frequent) value
            df_data[col] = df_data[col].fillna(df_data[col].mode()[0])
    return df_data

Import joblib to save models for future use

In [13]:
import joblib

In [14]:
joblib.dump(process_data, '../src/models/process_data.joblib')

['../src/models/process_data.joblib']

In [15]:
df_test_cleaned = df_test.copy()
df_test_cleaned = process_data(df_test_cleaned)

We will use Frequency Encoding for the categorical features as they have a lot of unique values. Transforming such features with One-Hot Encoding can lead to the curse of dimensionality.

In [16]:
from feature_engine.encoding import CountFrequencyEncoder

In [17]:
freqenc = CountFrequencyEncoder(encoding_method='frequency', variables=cat_cols)

In [18]:
features = freqenc.fit_transform(df_cleaned[cat_cols])
X_test = freqenc.transform(df_test_cleaned[cat_cols])
X_test.fillna(0.0001, inplace=True)
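
To make the encoding step concrete, here is a minimal plain-Python sketch of the idea behind frequency encoding: each category is replaced by its share of the training rows, and categories never seen in training fall back to a small constant. This is not feature_engine's implementation. The helper names are hypothetical, the toy values (including the unseen `WCC`) are made up, and the 0.0001 fallback simply mirrors the fill value used above.

```python
from collections import Counter

def fit_frequencies(values):
    """Map each category to its relative frequency in the training values."""
    counts = Counter(values)
    total = len(values)
    return {cat: n / total for cat, n in counts.items()}

def encode(values, freq_map, unseen=0.0001):
    """Replace each category by its training frequency, or `unseen` if unknown."""
    return [freq_map.get(v, unseen) for v in values]

train_conf = ['ACC', 'ACC', 'SEC', 'A10']
test_conf = ['SEC', 'ACC', 'WCC']   # 'WCC' does not occur in the training values

freq_map = fit_frequencies(train_conf)
encoded = encode(test_conf, freq_map)
print(encoded)  # [0.25, 0.5, 0.0001]
```

The frequencies are learned from the training data only and then reused for the test data, which is why unseen categories need a fallback.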



In [19]:
joblib.dump(freqenc, '../models/freqenc.joblib')

['../models/freqenc.joblib']

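Scale the numerical features. A previously saved scaler is loaded and refit on the reprocessed training data, and the test data is transformed with the same fit.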

In [20]:
scaler = joblib.load('../models/scaler.joblib')

In [21]:
features[num_cols] = scaler.fit_transform(df_cleaned[num_cols])
X_test[num_cols] = scaler.transform(df_test_cleaned[num_cols])

Let us check the class balance in the dataset

In [22]:
target.value_counts()

drafted
0.0    55555
1.0      536
Name: count, dtype: int64

The dataset is highly imbalanced: only 536 of the 56,091 players were drafted. We will oversample the minority class with SMOTE to a 1:10 ratio so that the data does not deviate too much from the real-world scenario while reducing the model's bias towards the majority class.

In [23]:
from imblearn.over_sampling import SMOTE

In [24]:
sm = SMOTE(random_state=8, sampling_strategy=0.1)

In [25]:
features, target = sm.fit_resample(features, target)

In [26]:
joblib.dump(sm, '../models/sm.joblib')

['../models/sm.joblib']
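
As a sanity check on the 1:10 ratio, the expected class counts after resampling can be worked out by hand. The sketch below assumes that a float `sampling_strategy` is interpreted as the target minority-to-majority ratio and that the number of synthetic rows is truncated to an integer, which is my understanding of imbalanced-learn's behaviour rather than something shown in this notebook.

```python
majority, minority = 55555, 536   # class counts from target.value_counts()
ratio = 0.1                       # sampling_strategy passed to SMOTE

# Number of synthetic minority rows needed to reach the target ratio
synthetic = int(majority * ratio - minority)
minority_after = minority + synthetic
total_after = majority + minority_after

print(synthetic, minority_after, total_after)  # 5019 5555 61110
```

The total of 61110 rows agrees with the row counts printed after the split (48888 + 12222).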

Train Test Split

As the testing dataset is separate, I have decided to split the dataset into training and validation sets in an 80:20 ratio.

In [27]:
from sklearn.model_selection import train_test_split

In [28]:
X_train, X_val, y_train, y_val = train_test_split(features, target, test_size=0.2, random_state=8)

In [29]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(48888, 57)
(12222, 57)
(4970, 57)


In [30]:
print(y_train.shape)
print(y_val.shape)

(48888,)
(12222,)


In [31]:
X_train.to_csv('../data/processed/X_train.csv')
X_val.to_csv('../data/processed/X_val.csv')
X_test.to_csv('../data/processed/X_test.csv')
y_train.to_csv('../data/processed/y_train.csv')
y_val.to_csv('../data/processed/y_val.csv')

## Model Building and Evaluation

In [32]:
from sklearn.metrics import roc_auc_score

Define the fit_predict_proba function to fit the model, predict class probabilities and report the AUROC on the training and validation sets

In [33]:
def fit_predict_proba(model, X_train, y_train, X_val, y_val):
    model.fit(X_train, y_train)
    y_train_pred_prob = model.predict_proba(X_train)[:,1]
    y_val_pred_prob = model.predict_proba(X_val)[:,1]
    print('The AUROC value for the training set is: ', roc_auc_score(y_train, y_train_pred_prob))
    print('The AUROC value for the validation set is: ', roc_auc_score(y_val, y_val_pred_prob))
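
For intuition about the metric, AUROC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below is a plain-Python illustration of that definition on made-up scores. The function name `auroc` is hypothetical and it is not how `roc_auc_score` is implemented.

```python
def auroc(y_true, scores):
    """Pairwise AUROC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
print(auroc(y, s))  # 0.75
```

Because the metric depends only on how the scores rank the examples, it is passed predicted probabilities rather than hard class labels.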

### Random Forest

We will use Random Forest Classifier as our prediction model and evaluate its performance

In [34]:
from sklearn.ensemble import RandomForestClassifier

In [35]:
rf1 = RandomForestClassifier(random_state=8)

In [36]:
fit_predict_proba(rf1, X_train, y_train, X_val, y_val)

The AUROC value for the training set is:  1.0
The AUROC value for the validation set is:  0.9991044346553366


The model seems to be slightly overfitting, as the training AUROC is a perfect 1.0. Let us limit the tree depth to reduce it.

In [37]:
rf2 = RandomForestClassifier(max_depth=8, random_state=8)

In [38]:
fit_predict_proba(rf2, X_train, y_train, X_val, y_val)

The AUROC value for the training set is:  0.9979623012261046
The AUROC value for the validation set is:  0.9960464614286201


### Logistic Regression

In [39]:
from sklearn.linear_model import LogisticRegression

In [40]:
lr1 = LogisticRegression(random_state=8, max_iter=1000)

In [41]:
fit_predict_proba(lr1, X_train, y_train, X_val, y_val)

The AUROC value for the training set is:  0.9905439790025555
The AUROC value for the validation set is:  0.9908248144165317


### Support Vector Classifier

In [42]:
from sklearn.svm import SVC

In [43]:
svc1 = SVC(random_state=8, probability=True)

In [44]:
fit_predict_proba(svc1, X_train, y_train, X_val, y_val)

The AUROC value for the training set is:  0.9958669446502685
The AUROC value for the validation set is:  0.9953452326611877


### AdaBoost

In [45]:
from sklearn.ensemble import AdaBoostClassifier

In [46]:
adaboost1 = AdaBoostClassifier(random_state=8)

In [48]:
fit_predict_proba(adaboost1, X_train, y_train, X_val, y_val)

The AUROC value for the training set is:  0.9965946779711462
The AUROC value for the validation set is:  0.9960520967401683


## Testing and submission file preparation

I have assigned the two best-performing models to new variables

In [49]:
model1 = adaboost1
model2 = svc1

In [50]:
df_submission1 = pd.DataFrame({})
df_submission2 = pd.DataFrame({})

Add the player ID from the testing dataset to the submission dataframe

In [51]:
df_submission1['player_id'] = df_test['player_id']
df_submission2['player_id'] = df_test['player_id']

Add the prediction probability to the drafted column

In [52]:
df_submission1['drafted'] = model1.predict_proba(X_test)[:,1]
df_submission2['drafted'] = model2.predict_proba(X_test)[:,1]

Save the dataframe to a csv file

In [53]:
df_submission1.to_csv('../data/processed/submission1.csv', index=False)
df_submission2.to_csv('../data/processed/submission2.csv', index=False)