# Experiment B - Kaggle competition with AdaBoost classifier model

### Thirada Tiamklang 14337188

_24 August 2023_

In this experiment, we predict the probability that American college players are drafted onto NBA team rosters. We use an AdaBoost classifier with hyperparameter tuning to predict the target outcome.

__Table of contents__
1. Load and explore data
2. Train Adaboost classifier Model
3. Hyperparameter Tuning
4. Predict probability on Test Dataset

## 1. Load and explore dataset

In [1]:
import pandas as pd
import numpy as np
from joblib import dump

In [2]:
X_train = pd.read_csv('../data/processed/X_train.csv')
X_val   = pd.read_csv('../data/processed/X_val.csv'  )
X_test  = pd.read_csv('../data/processed/X_test.csv' )
y_train = pd.read_csv('../data/processed/y_train.csv')
y_val   = pd.read_csv('../data/processed/y_val.csv'  )
y_test  = pd.read_csv('../data/processed/y_test.csv' )

In [3]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(35897, 59)
(8975, 59)
(11219, 59)


In [4]:
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

(35897, 1)
(8975, 1)
(11219, 1)


The targets were loaded as single-column DataFrames with shape (n, 1), whereas scikit-learn expects a one-dimensional target with shape (n,).

In [5]:
y_train.head()

Unnamed: 0,drafted
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0


In [6]:
# Convert the single-column DataFrames to 1-dimensional Series
y_train = y_train.squeeze()
y_val = y_val.squeeze()
y_test = y_test.squeeze()

In [7]:
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

(35897,)
(8975,)
(11219,)
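
As a minimal sketch of what `squeeze()` does here, the toy frame below stands in for `y_train` (the values are made up; only the column name `drafted` comes from the output above). Selecting the column by name is shown as an alternative that does not depend on the frame having exactly one column.

```python
import pandas as pd

# Toy stand-in for y_train as loaded by pd.read_csv: one column named 'drafted'
y_frame = pd.DataFrame({"drafted": [0.0, 0.0, 1.0, 0.0]})
print(y_frame.shape)   # two-dimensional: one row per player, one column

# squeeze() collapses the single column into a one-dimensional Series
y_series = y_frame.squeeze()
print(y_series.shape)

# Selecting the column by name gives the same Series and does not rely on
# the frame having exactly one column
y_by_name = y_frame["drafted"]
```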


To compare the AUROC score on the testing set with the results from the previous experiment_A, we will select the same features that were used to train both the baseline model and the polynomial logistic regression model.

In [8]:
features_to_keep = ['adjoe', 'rimmade', 'dunks_ratio', 'adrtg']

# Create new feature sets with selected features for training, validation, and testing
X_train = X_train[features_to_keep]
X_val = X_val[features_to_keep]
X_test = X_test[features_to_keep]

In [9]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(35897, 4)
(8975, 4)
(11219, 4)


We will train the AdaBoost model on these 4 features.

## 2. Train Adaboost classifier Model

__AdaBoost classifier with default hyperparameter__

In [10]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score

In [11]:
adaboost_model = AdaBoostClassifier(random_state=None, n_estimators=50)

In [12]:
adaboost_model.fit(X_train, y_train)

In [13]:
# save model
dump(adaboost_model,  '../models/adaboost_default.joblib')
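
A saved model can be read back with `joblib.load`. The sketch below is a round trip on synthetic data, not the draft data; the file name and temporary directory are hypothetical and chosen only so the example runs anywhere.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data with 4 features (not the NBA draft data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "adaboost_demo.joblib")  # hypothetical file name
    dump(model, model_path)
    restored = load(model_path)

# The restored model should give the same probabilities as the original
same_predictions = np.allclose(
    model.predict_proba(X_demo), restored.predict_proba(X_demo)
)
print(same_predictions)
```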

In [14]:
y_train_probs = adaboost_model.predict_proba(X_train)[:, 1]
y_val_probs = adaboost_model.predict_proba(X_val)[:, 1]
y_test_probs = adaboost_model.predict_proba(X_test)[:, 1]

In [15]:
auroc_train = roc_auc_score(y_train, y_train_probs)
auroc_val = roc_auc_score(y_val, y_val_probs)
auroc_test = roc_auc_score(y_test, y_test_probs)

In [16]:
print("AUROC Score on training set:", auroc_train)
print("AUROC Score on validation set:", auroc_val)
print("AUROC Score on testing set:", auroc_test)

AUROC Score on training set: 0.9793101534690553
AUROC Score on validation set: 0.9527849450628566
AUROC Score on testing set: 0.9530439222551397


The AUROC on the testing set for the AdaBoost classifier with default hyperparameters is __0.9530439222551397__.
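
The same three AUROC calls are repeated for every model in this notebook, so a small helper could compute them in one step. This is a sketch only: `auroc_by_split` is a hypothetical name, and the data are synthetic rather than the draft data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def auroc_by_split(model, splits):
    """Return {split_name: AUROC} for a fitted binary classifier.

    `splits` maps a name to an (X, y) pair.
    """
    return {
        name: roc_auc_score(y, model.predict_proba(X)[:, 1])
        for name, (X, y) in splits.items()
    }


# Synthetic stand-in data (not the NBA draft data)
X_demo, y_demo = make_classification(n_samples=600, n_features=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

demo_model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
scores = auroc_by_split(demo_model, {"train": (X_tr, y_tr), "val": (X_va, y_va)})
print(scores)
```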

## 3. Hyperparameter Tuning

### 3.1 Tuning the _random_state_

In [17]:
adaboost_Tune1 = AdaBoostClassifier(random_state=42, n_estimators=50)

In [18]:
adaboost_Tune1.fit(X_train, y_train)

In [19]:
y_train_probs1 = adaboost_Tune1.predict_proba(X_train)[:, 1]
y_val_probs1 = adaboost_Tune1.predict_proba(X_val)[:, 1]
y_test_probs1 = adaboost_Tune1.predict_proba(X_test)[:, 1]

In [20]:
print(roc_auc_score(y_train, y_train_probs1))
print(roc_auc_score(y_val, y_val_probs1))
print(roc_auc_score(y_test, y_test_probs1))

0.9793101534690553
0.9527849450628566
0.9530439222551397


Setting random_state=42 gives exactly the same AUROC scores as random_state=None, so the seed does not appear to affect the results here.
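
`random_state` is a seed for reproducibility rather than a parameter that controls model capacity, so a direct way to check its influence is to compare predictions across seeds. The sketch below does this on synthetic data (an assumption; results on the draft data may differ): the same seed should reproduce the same probabilities, and the largest difference between two different seeds is printed rather than assumed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data (not the NBA draft data)
X_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=0)


def fit_probs(seed):
    """Fit AdaBoost with the given seed and return positive-class probabilities."""
    model = AdaBoostClassifier(n_estimators=50, random_state=seed)
    return model.fit(X_demo, y_demo).predict_proba(X_demo)[:, 1]


probs_a = fit_probs(42)
probs_b = fit_probs(42)  # same seed as probs_a
probs_c = fit_probs(7)   # different seed

repeatable = np.allclose(probs_a, probs_b)
max_diff = float(np.max(np.abs(probs_a - probs_c)))
print("Same seed reproduces predictions:", repeatable)
print("Largest difference between seeds:", max_diff)
```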

### 3.2 Tuning the _n_estimators_

__n_estimators=100__

In [21]:
adaboost_Tune2 = AdaBoostClassifier(random_state=None, n_estimators=100)

In [22]:
adaboost_Tune2.fit(X_train, y_train)

In [23]:
y_train_probs2 = adaboost_Tune2.predict_proba(X_train)[:, 1]
y_val_probs2 = adaboost_Tune2.predict_proba(X_val)[:, 1]
y_test_probs2 = adaboost_Tune2.predict_proba(X_test)[:, 1]

In [24]:
print(roc_auc_score(y_train, y_train_probs2))
print(roc_auc_score(y_val, y_val_probs2))
print(roc_auc_score(y_test, y_test_probs2))

0.9814974625615549
0.9554226478983145
0.9578687494389083


The AUROC is now higher than for the default model and adaboost_Tune1 on all three sets.

__n_estimators=150__

In [25]:
adaboost_Tune3 = AdaBoostClassifier(random_state=None, n_estimators=150)

In [26]:
adaboost_Tune3.fit(X_train, y_train)

In [27]:
y_train_probs3 = adaboost_Tune3.predict_proba(X_train)[:, 1]
y_val_probs3 = adaboost_Tune3.predict_proba(X_val)[:, 1]
y_test_probs3 = adaboost_Tune3.predict_proba(X_test)[:, 1]

In [28]:
print(roc_auc_score(y_train, y_train_probs3))
print(roc_auc_score(y_val, y_val_probs3))
print(roc_auc_score(y_test, y_test_probs3))

0.9826306212926615
0.9557959789401811
0.9581509785438549


The AUROC rises slightly again on all three sets as n_estimators increases.

__n_estimators=200__

In [29]:
adaboost_Tune4 = AdaBoostClassifier(random_state=None, n_estimators=200)

In [30]:
adaboost_Tune4.fit(X_train, y_train)

In [31]:
y_train_probs4 = adaboost_Tune4.predict_proba(X_train)[:, 1]
y_val_probs4 = adaboost_Tune4.predict_proba(X_val)[:, 1]
y_test_probs4 = adaboost_Tune4.predict_proba(X_test)[:, 1]

In [32]:
print(roc_auc_score(y_train, y_train_probs4))
print(roc_auc_score(y_val, y_val_probs4))
print(roc_auc_score(y_test, y_test_probs4))

0.9835687281110368
0.955498126958101
0.958140878893976


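With n_estimators=200 the training AUROC keeps rising, but the validation and testing AUROC drop slightly, so n_estimators=150 gives the highest validation and testing AUROC.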
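
Instead of refitting for each candidate value, a single model fitted with the largest `n_estimators` can report its validation AUROC after every boosting round through `staged_predict_proba`. The sketch below uses synthetic data (an assumption), and it does not claim that each stage matches a separately fitted smaller model exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise (not the NBA draft data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, flip_y=0.1, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# staged_predict_proba yields the ensemble's probabilities after each boosting round
val_auroc = [
    roc_auc_score(y_va, probs[:, 1]) for probs in model.staged_predict_proba(X_va)
]

# Boosting round (1-based) with the highest validation AUROC
best_round = int(np.argmax(val_auroc)) + 1
print(best_round, val_auroc[best_round - 1])
```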

### 3.3 Tuning _learning_rate_

__learning_rate=0.05__

In [33]:
adaboost_Tune5 = AdaBoostClassifier(random_state=None, n_estimators=150, learning_rate=0.05)

In [34]:
adaboost_Tune5.fit(X_train, y_train)

In [35]:
y_train_probs5 = adaboost_Tune5.predict_proba(X_train)[:, 1]
y_val_probs5 = adaboost_Tune5.predict_proba(X_val)[:, 1]
y_test_probs5 = adaboost_Tune5.predict_proba(X_test)[:, 1]

In [36]:
print(roc_auc_score(y_train, y_train_probs5))
print(roc_auc_score(y_val, y_val_probs5))
print(roc_auc_score(y_test, y_test_probs5))

0.9745993186155162
0.9683702096924401
0.9736331807164019


__learning_rate=0.1__

In [37]:
adaboost_Tune6 = AdaBoostClassifier(random_state=None, n_estimators=150, learning_rate=0.1)

In [38]:
adaboost_Tune6.fit(X_train, y_train)

In [39]:
y_train_probs6 = adaboost_Tune6.predict_proba(X_train)[:, 1]
y_val_probs6 = adaboost_Tune6.predict_proba(X_val)[:, 1]
y_test_probs6 = adaboost_Tune6.predict_proba(X_test)[:, 1]

In [40]:
print(roc_auc_score(y_train, y_train_probs6))
print(roc_auc_score(y_val, y_val_probs6))
print(roc_auc_score(y_test, y_test_probs6))

0.9764938734809504
0.9693903382158607
0.974367088607595


In [41]:
dump(adaboost_Tune6,  '../models/best_adaboost_Tune6.joblib')

__learning_rate=0.5__

In [42]:
adaboost_Tune7 = AdaBoostClassifier(random_state=None, n_estimators=150, learning_rate=0.5)

In [43]:
adaboost_Tune7.fit(X_train, y_train)

In [44]:
y_train_probs7 = adaboost_Tune7.predict_proba(X_train)[:, 1]
y_val_probs7 = adaboost_Tune7.predict_proba(X_val)[:, 1]
y_test_probs7 = adaboost_Tune7.predict_proba(X_test)[:, 1]

In [45]:
print(roc_auc_score(y_train, y_train_probs7))
print(roc_auc_score(y_val, y_val_probs7))
print(roc_auc_score(y_test, y_test_probs7))

0.9801183824415812
0.9607125687730356
0.9649631923871083


### 3.4 Conclusion

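Training the AdaBoost classifier with __random_state=None, n_estimators=150, learning_rate=0.1__ gives the best AUROC on the testing set, 0.974367088607595, with less overfitting than the default learning rate: the gap between training and validation AUROC is about 0.007, compared with about 0.027 for n_estimators=150 at the default learning rate.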
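
The comparison behind this conclusion can be laid out as a small table. The AUROC values below are copied from the outputs in sections 3.2 and 3.3; the `learning_rate=1.0` row is the adaboost_Tune3 run, since 1.0 is scikit-learn's default learning rate for `AdaBoostClassifier`.

```python
import pandas as pd

# AUROC values copied from the outputs above (n_estimators=150 in every row)
results = pd.DataFrame(
    {
        "learning_rate": [0.05, 0.1, 0.5, 1.0],
        "train": [0.9745993186155162, 0.9764938734809504,
                  0.9801183824415812, 0.9826306212926615],
        "val": [0.9683702096924401, 0.9693903382158607,
                0.9607125687730356, 0.9557959789401811],
        "test": [0.9736331807164019, 0.974367088607595,
                 0.9649631923871083, 0.9581509785438549],
    }
).set_index("learning_rate")

# A larger train-validation gap suggests more overfitting
results["train_val_gap"] = results["train"] - results["val"]

best_by_val = results["val"].idxmax()
print(results.round(4))
print("Best learning_rate by validation AUROC:", best_by_val)
```

The gap shrinks as the learning rate decreases. learning_rate=0.1 has the highest validation and testing AUROC, while learning_rate=0.05 has a slightly smaller gap.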

## 4. Predict probability on Test Dataset

In [46]:
raw_test = pd.read_csv('../data/raw/test.csv')

__Cleaned test dataset__

In [47]:
test = pd.read_csv('../data/processed/test_cleaned.csv')

In [48]:
test.head()

Unnamed: 0,team,conf,GP,Min_per,Ortg,usg,eFG,TS_per,ORB_per,DRB_per,...,ogbpm,dgbpm,oreb,dreb,treb,ast,stl,blk,pts,player_id
0,Morgan St.,MEAC,2,3.0,115.1,4.7,50.0,50.0,0.0,4.6,...,-2.46774,-2.27566,0.0,0.3333,0.3333,0.0,0.0,0.0,1.0,cf302b4d-84f7-4124-a25d-a75eed31978b
1,South Carolina St.,MEAC,11,17.6,61.1,18.6,34.7,35.18,2.5,15.7,...,-7.49472,-4.41253,0.2727,1.4545,1.7273,0.4545,0.1818,0.0,2.3636,f91837cd-4f49-4b70-963d-aeb82c6ce3da
2,Binghamton,AE,9,28.6,91.9,23.8,54.1,52.49,6.4,22.5,...,-2.92495,1.71789,1.3333,4.4444,5.7778,1.0,0.6667,1.8889,8.8889,53ec2a29-1e7d-4c6d-86d7-d60d02af8916
3,Illinois,B10,7,1.3,111.0,10.4,83.3,83.33,0.0,13.4,...,-0.767911,0.962469,0.0,0.2857,0.2857,0.0,0.0,0.0,0.7143,32402798-471c-4a54-8cb4-29cd95199014
4,Iowa St.,B12,23,78.5,103.1,21.5,54.0,56.12,3.6,10.2,...,2.89361,-1.019,1.0435,2.8696,3.913,1.1739,0.8261,0.087,14.3043,73b960f9-27b8-4431-9d23-a760e9bbc360


_Label encoding_

In [49]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

In [50]:
num_test = list(test.select_dtypes('number').columns)
cat_test = list(set(test.columns) - set(num_test))

In [51]:
# pd.DataFrame(test) does not necessarily copy; test.copy() would keep `test` unchanged
features_test = pd.DataFrame(test)

In [52]:
le = LabelEncoder()

In [53]:
for col in cat_test:
    features_test[col] = le.fit_transform(test[col])

_Scale data_

In [54]:
scaler = StandardScaler()

In [55]:
features_test[num_test] = scaler.fit_transform(test[num_test])
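
Here the scaler is fit on the test file itself. If the training features were standardised with a scaler fit on the training data (an assumption, since the preprocessing of X_train is not shown in this notebook), reusing that fitted scaler would keep the test features on the scale the model was trained on. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for one numeric feature in the training and test files
train_values = np.array([[10.0], [20.0], [30.0]])
test_values = np.array([[20.0], [40.0]])

# Fit the scaler on the training data only, then reuse it for the test data
scaler_demo = StandardScaler().fit(train_values)
test_scaled = scaler_demo.transform(test_values)

# For comparison: refitting on the test data gives different values
test_refit = StandardScaler().fit_transform(test_values)

print(test_scaled.ravel())
print(test_refit.ravel())
```

The same test value maps to different scaled values under the two approaches, which matters when a model has learned thresholds on the training scale.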

In [56]:
features_test.head()

Unnamed: 0,team,conf,GP,Min_per,Ortg,usg,eFG,TS_per,ORB_per,DRB_per,...,ogbpm,dgbpm,oreb,dreb,treb,ast,stl,blk,pts,player_id
0,178,16,-1.675458,-1.142856,0.72447,-1.942278,0.3062,0.17317,-0.735206,-0.913609,...,-0.047183,-0.446359,-1.019688,-1.057036,-1.117648,-0.927909,-1.114426,-0.684429,-0.973761,4025
1,258,16,-0.587953,-0.619906,-0.803748,0.108028,-0.432411,-0.568201,-0.37958,0.286694,...,-0.791983,-0.99152,-0.620751,-0.325595,-0.445316,-0.533705,-0.73152,-0.684429,-0.694412,4819
2,23,2,-0.829621,-0.225903,0.067903,0.875049,0.504129,0.297732,0.175197,1.022015,...,-0.114923,0.572481,0.930816,1.624937,1.508257,-0.060574,0.289774,4.280789,0.642372,1630
3,115,5,-1.071289,-1.203748,0.608439,-1.101505,1.913766,1.840504,-0.735206,0.037982,...,0.204665,0.379757,-1.019688,-1.088089,-1.140606,-0.927909,-1.114426,-0.684429,-1.03229,965
4,123,6,0.862054,1.561439,0.384866,0.53579,0.499301,0.479323,-0.223104,-0.308051,...,0.747158,-0.125758,0.506863,0.597579,0.608856,0.090255,0.625501,-0.455738,1.751779,2268


In [57]:
features_test.to_csv('../data/processed/features_test.csv', index=False)

_Select features_

In [58]:
# Select the same features that were used for training
features_to_keep = ['adjoe', 'rimmade', 'dunks_ratio', 'adrtg']
X_testdataset = features_test[features_to_keep]

In [59]:
print(test['player_id'].head())

0    4025
1    4819
2    1630
3     965
4    2268
Name: player_id, dtype: int64
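
The integer player_id values printed from `test` indicate that the label encoding applied to `features_test` also changed `test`. This is consistent with `pd.DataFrame(test)` sharing data with `test` rather than copying it, a behaviour that can depend on the pandas version. An explicit `.copy()` keeps the original frame intact, as in this sketch with made-up ids:

```python
import pandas as pd

# Toy stand-in for the cleaned test data (the ids and values are made up)
test_demo = pd.DataFrame({"player_id": ["id-a", "id-b"], "adjoe": [101.2, 95.4]})

# An explicit copy: edits to features_demo do not touch test_demo
features_demo = test_demo.copy()
features_demo["player_id"] = [0, 1]  # stands in for label encoding

print(test_demo["player_id"].tolist())      # original strings are preserved
print(features_demo["player_id"].tolist())  # encoded integers
```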


__Predict probability on test dataset__

In [60]:
# Use the tuned AdaBoost model (adaboost_Tune6) to predict draft probabilities on the test dataset
drafted_probability_ada = adaboost_Tune6.predict_proba(X_testdataset)[:, 1]

In [61]:
results_B = pd.DataFrame({'drafted': drafted_probability_ada})

In [62]:
results_B['player_id'] = raw_test['player_id']
results_B = results_B[['player_id', 'drafted']]
print(results_B)

                                 player_id   drafted
0     cf302b4d-84f7-4124-a25d-a75eed31978b  0.187184
1     f91837cd-4f49-4b70-963d-aeb82c6ce3da  0.183125
2     53ec2a29-1e7d-4c6d-86d7-d60d02af8916  0.231419
3     32402798-471c-4a54-8cb4-29cd95199014  0.388955
4     73b960f9-27b8-4431-9d23-a760e9bbc360  0.377473
...                                    ...       ...
4965  a25ee55f-02a3-4f8e-8194-a5f427e14e7c  0.279007
4966  d0d9f45e-7b01-44b3-8d40-514ec338611d  0.187184
4967  f8df22c4-1602-4fab-896d-8820951aae2f  0.218021
4968  b791c69a-f769-4163-afda-051a6fd20a9d  0.218021
4969  18b51f5d-4746-4121-88fd-c8d0a1399130  0.367796

[4970 rows x 2 columns]


In [64]:
results_B.to_csv('../data/processed/results_B.csv', index=False)
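
Before uploading, the results frame can be given a quick sanity check. The sketch below is illustrative: `check_submission` is a hypothetical helper, the expected column order follows `results_B` above, and the exact format Kaggle requires is an assumption.

```python
import pandas as pd


def check_submission(df, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(df.columns) != ["player_id", "drafted"]:
        return ["unexpected columns"]
    problems = []
    if len(df) != expected_rows:
        problems.append("unexpected row count")
    if df["player_id"].duplicated().any():
        problems.append("duplicate player_id")
    # between() is False for missing values, so NaN probabilities are flagged too
    if not df["drafted"].between(0.0, 1.0).all():
        problems.append("probability outside [0, 1]")
    return problems


# Toy frames with made-up ids and probabilities
demo = pd.DataFrame(
    {"player_id": ["id-a", "id-b", "id-c"], "drafted": [0.19, 0.23, 0.39]}
)
problems = check_submission(demo, expected_rows=3)
bad = check_submission(demo.assign(drafted=[0.1, 1.2, 0.3]), expected_rows=3)
print(problems, bad)
```

For the frame above, the call might look like `check_submission(results_B, expected_rows=len(raw_test))`.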

## References

So, A. (2023). _36120_AdvMLA-Lab2_Exercise3-Solutions.ipynb_. https://colab.research.google.com/drive/1LzgqM0bRDNL9hf0GiE2t2xCzpamOvwAJ?authuser=1#scrollTo=bQ530dp0MJJQ