## Day 35 Lecture 2 Assignment

In this assignment, we will combine what we have learned so far about classification algorithms this week.

In [31]:
%matplotlib inline

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, classification_report, recall_score

In [2]:
admission = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Admission_Predict.csv')

In [3]:
admission.head()

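| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |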


In this assignment, we will predict whether a student is likely to be admitted to a PhD program given their stats. To build the target, find the median of the Chance of Admit column. Then create an admit column that is 1 where the chance of admission is above the median and 0 otherwise.

Below you will process and clean the data, try the SVM classifier, the gradient boosted decision tree classifier and XGBoost, and compare your results.

Have fun!

In [5]:
coa_median = admission['Chance of Admit '].median()

In [7]:
admission['Admit'] = (admission['Chance of Admit '] > coa_median).astype(int)
admission.head()

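| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit | Admit |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 | 1 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 | 1 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 | 0 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 | 1 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 | 0 |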
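
One detail of the thresholding is worth spelling out: the comparison is strict, so a row whose chance equals the median is labeled 0. A minimal sketch on made-up values (the `chances` Series is hypothetical and not taken from the data set):

```python
import pandas as pd

# Hypothetical chances of admission, chosen so that two rows sit exactly on the median.
chances = pd.Series([0.5, 0.7, 0.7, 0.9])
threshold = chances.median()  # 0.7

# Strictly greater than the median -> 1, everything else (including ties) -> 0.
labels = (chances > threshold).astype(int)
print(labels.tolist())  # [0, 0, 0, 1]
```

Ties like this are one reason a median split need not produce two classes of exactly equal size.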


In [8]:
admission.drop('Chance of Admit ', axis=1, inplace=True)

In [9]:
admission.isnull().mean()

Serial No.           0.0
GRE Score            0.0
TOEFL Score          0.0
University Rating    0.0
SOP                  0.0
LOR                  0.0
CGPA                 0.0
Research             0.0
Admit                0.0
dtype: float64

*No missing data.*

In [12]:
admission.Admit.value_counts()

0    209
1    191
Name: Admit, dtype: int64

In [13]:
admission.Research.value_counts()

1    219
0    181
Name: Research, dtype: int64

In [14]:
admission.drop('Serial No.', axis=1, inplace=True)

*The target classes are roughly balanced (209 vs. 191), and the only categorical independent variable, Research, is already encoded as 0/1. Serial No. is dropped because it is only a row identifier.*
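
To put a number on the balance, the counts can be turned into proportions. The sketch below rebuilds a stand-in column from the counts printed above (the `admit_demo` Series is an illustrative reconstruction, not the notebook's `admission` frame):

```python
import pandas as pd

# Rebuild a target column with the class counts reported above (209 zeros, 191 ones).
admit_demo = pd.Series([0] * 209 + [1] * 191, name="Admit")

# normalize=True returns fractions of the total instead of raw counts.
proportions = admit_demo.value_counts(normalize=True)
print(proportions)
```

A split of roughly 52% to 48% is close enough to even that accuracy remains a reasonable headline metric here.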

### Train-Test Split

In [35]:
# splitting with 0.2 as test size
X = admission.drop('Admit', axis=1)
y = admission.Admit

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [36]:
scaler = StandardScaler()
X_train_ = scaler.fit_transform(X_train)
X_test_ = scaler.transform(X_test)

### SVClassifier

In [37]:
start_time = time.time()

svc = SVC()
svc.fit(X_train_, y_train)
svctrain_predict = svc.predict(X_train_)
svctest_predict = svc.predict(X_test_)

print('execution time:', "--- %s seconds ---" % (time.time() - start_time))

print(f'\nSVC Training accuracy score: {accuracy_score(y_train, svctrain_predict)}')
print(f'SVC Test accuracy score: {accuracy_score(y_test, svctest_predict)}')
print(f'SVC Training recall score: {recall_score(y_train, svctrain_predict)}')
print(f'SVC Test recall score: {recall_score(y_test, svctest_predict)}\n')
print(f'{classification_report(y_test, svctest_predict)}')

execution time: --- 0.011095762252807617 seconds ---

SVC Training accuracy score: 0.896875
SVC Test accuracy score: 0.875
SVC Training recall score: 0.8846153846153846
SVC Test recall score: 0.8285714285714286

              precision    recall  f1-score   support

           0       0.87      0.91      0.89        45
           1       0.88      0.83      0.85        35

    accuracy                           0.88        80
   macro avg       0.88      0.87      0.87        80
weighted avg       0.88      0.88      0.87        80
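
The report above can be read back into a confusion matrix. The sketch below builds label lists with matching counts and recomputes the scores. These lists are a reconstruction from the reported support, recall and accuracy, not the notebook's actual predictions, and the `_demo` names are hypothetical:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# 45 rows of class 0 and 35 rows of class 1, as in the support column.
y_true_demo = [0] * 45 + [1] * 35
# 41 of the 45 zeros and 29 of the 35 ones are predicted correctly (70 / 80 = 0.875).
y_pred_demo = [0] * 41 + [1] * 4 + [0] * 6 + [1] * 29

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

recall_demo = recall_score(y_true_demo, y_pred_demo)      # 29 / 35
accuracy_demo = accuracy_score(y_true_demo, y_pred_demo)  # 70 / 80
print(recall_demo, accuracy_demo)
```

Read this way, the test recall of about 0.83 means 6 of the 35 students in class 1 were predicted as class 0.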



### GradientBoostedClassifier

In [38]:
start_time = time.time()

gbc = GradientBoostingClassifier(learning_rate=0.05, max_depth=5)
gbc.fit(X_train_, y_train)
gbctrain_predict = gbc.predict(X_train_)
gbctest_predict = gbc.predict(X_test_)

print('execution time:', "--- %s seconds ---" % (time.time() - start_time))

print(f'\nGBC Training accuracy score: {accuracy_score(y_train, gbctrain_predict)}')
print(f'GBC Test accuracy score: {accuracy_score(y_test, gbctest_predict)}')
print(f'GBC Training recall score: {recall_score(y_train, gbctrain_predict)}')
print(f'GBC Test recall score: {recall_score(y_test, gbctest_predict)}\n')
print(f'{classification_report(y_test, gbctest_predict)}')

execution time: --- 0.15767979621887207 seconds ---

GBC Training accuracy score: 1.0
GBC Test accuracy score: 0.8375
GBC Training recall score: 1.0
GBC Test recall score: 0.7428571428571429

              precision    recall  f1-score   support

           0       0.82      0.91      0.86        45
           1       0.87      0.74      0.80        35

    accuracy                           0.84        80
   macro avg       0.84      0.83      0.83        80
weighted avg       0.84      0.84      0.84        80



*Gradient boosting is not the best model here because it overfits: training accuracy is 1.0 while test accuracy drops to 0.8375. A likely cause is that depth-5 trees are flexible enough to memorize the 320 training rows, and no regularization such as shallower trees or subsampling was applied.*
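
One way to probe the overfitting is to compare the train/test gap of the depth-5 model with a more constrained one. This is a minimal sketch on synthetic data from `make_classification`, used as a stand-in for the admissions set. The `_demo` names, the shallower depth of 2 and `subsample=0.8` are illustrative assumptions, not tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 400 rows and 7 features, like the admissions data in shape only.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=4, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

candidates = {
    "deep": GradientBoostingClassifier(learning_rate=0.05, max_depth=5, random_state=0),
    "shallow_subsampled": GradientBoostingClassifier(
        learning_rate=0.05, max_depth=2, subsample=0.8, random_state=0
    ),
}

# Train/test accuracy gap per model; a larger gap points to more overfitting.
gaps = {}
for name, model in candidates.items():
    model.fit(Xd_train, yd_train)
    train_acc = accuracy_score(yd_train, model.predict(Xd_train))
    test_acc = accuracy_score(yd_test, model.predict(Xd_test))
    gaps[name] = train_acc - test_acc
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gaps[name]:.3f}")
```

If the constrained model shows a clearly smaller gap on the real data as well, tree depth is a plausible driver of the overfitting. The actual numbers depend on the data and the split.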

### AdaBoostClassifier

In [39]:
start_time = time.time()

abc = AdaBoostClassifier(learning_rate=0.05)
abc.fit(X_train_, y_train)
abctrain_predict = abc.predict(X_train_)
abctest_predict = abc.predict(X_test_)

print('execution time:', "--- %s seconds ---" % (time.time() - start_time))

print(f'\nABC Training accuracy score: {accuracy_score(y_train, abctrain_predict)}')
print(f'ABC Test accuracy score: {accuracy_score(y_test, abctest_predict)}')
print(f'ABC Training recall score: {recall_score(y_train, abctrain_predict)}')
print(f'ABC Test recall score: {recall_score(y_test, abctest_predict)}\n')
print(f'{classification_report(y_test, abctest_predict)}')

execution time: --- 0.09810853004455566 seconds ---

ABC Training accuracy score: 0.88125
ABC Test accuracy score: 0.9
ABC Training recall score: 0.8782051282051282
ABC Test recall score: 0.8285714285714286

              precision    recall  f1-score   support

           0       0.88      0.96      0.91        45
           1       0.94      0.83      0.88        35

    accuracy                           0.90        80
   macro avg       0.91      0.89      0.90        80
weighted avg       0.90      0.90      0.90        80



*AdaBoost looks like the best model: it has the best test scores overall and does not overfit the training data. Test accuracy (0.9) is slightly higher than training accuracy (0.88125), but with only 80 test rows a gap of about 1.9 percentage points is small enough to be sampling noise rather than a problem with the model.*
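
A rough check of how much a test accuracy can move by chance, assuming the 80 test rows behave like independent draws, is the binomial standard error. The figures are the ones printed above; the variable names are only for illustration:

```python
import math

test_acc, train_acc, n_test = 0.9, 0.88125, 80

# Binomial standard error of an accuracy estimated from n_test rows.
std_err = math.sqrt(test_acc * (1 - test_acc) / n_test)
gap = test_acc - train_acc

print(f"gap = {gap:.5f}, standard error = {std_err:.5f}")
```

The gap (about 0.019) is smaller than one standard error (about 0.034), so the higher test score is well within what a different random split could produce.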

### XGBoostClassifier

In [40]:
start_time = time.time()

xgb = XGBClassifier(learning_rate=0.05, max_depth=5)
xgb.fit(X_train_, y_train)
xgbtrain_predict = xgb.predict(X_train_)
xgbtest_predict = xgb.predict(X_test_)

print('execution time:', "--- %s seconds ---" % (time.time() - start_time))

print(f'\nXGB Training accuracy score: {accuracy_score(y_train, xgbtrain_predict)}')
print(f'XGB Test accuracy score: {accuracy_score(y_test, xgbtest_predict)}')
print(f'XGB Training recall score: {recall_score(y_train, xgbtrain_predict)}')
print(f'XGB Test recall score: {recall_score(y_test, xgbtest_predict)}\n')
print(f'{classification_report(y_test, xgbtest_predict)}')

execution time: --- 0.04252290725708008 seconds ---

XGB Training accuracy score: 0.9625
XGB Test accuracy score: 0.875
XGB Training recall score: 0.9615384615384616
XGB Test recall score: 0.8

              precision    recall  f1-score   support

           0       0.86      0.93      0.89        45
           1       0.90      0.80      0.85        35

    accuracy                           0.88        80
   macro avg       0.88      0.87      0.87        80
weighted avg       0.88      0.88      0.87        80



*Like gradient boosting, XGBoost overfits (training accuracy 0.9625 vs. test accuracy 0.875), though less severely. The depth-5 trees are again a likely cause given the small data set. While the test scores are good, performance on other unseen data may vary considerably.*
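
Whether tree depth or learning rate matters more can be tested with a cross-validated grid search, which the notebook imports but does not use. To keep this sketch free of extra dependencies it uses scikit-learn's `GradientBoostingClassifier` on synthetic data rather than `XGBClassifier` on the admissions set. The grid values and `n_estimators=50` are illustrative assumptions. `XGBClassifier` follows the scikit-learn estimator interface, so it should be usable in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption for illustration, not the admissions set).
X_syn, y_syn = make_classification(
    n_samples=320, n_features=7, n_informative=4, random_state=0
)

# Small grid over the two hyperparameters discussed in the notes above.
param_grid = {
    "max_depth": [2, 3, 5],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training data only
    scoring="accuracy",
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because every candidate is scored on held-out folds, the selected setting reflects generalization rather than training fit, which is the comparison the train/test gaps above only hint at.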