# Question 2: Model Assessment Strategies
---
Samarth Kumar

Import dependencies

Run the command below to access UCI ML Repository Data

In [None]:
%pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [None]:
import time
import numpy as np
import pandas as pd
from IPython.display import display
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Load the Wine Quality dataset from UCI ML Repository.
wine_quality = fetch_ucirepo(id=186)
X = wine_quality.data.features
y = wine_quality.data.targets

# xFeat is an n x d array.
xFeat = X.to_numpy()

# y is flattened from an n x 1 column into a 1-D array of length n.
y = y.to_numpy().ravel()

# Convert quality scores into binary classification.
# Good Quality (1) when y is at least 6.
# Bad Quality (0) when 0 < y <= 5.
y = np.where(y >= 6, 1, 0)
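
The thresholding step above can be sanity-checked on a handful of scores. This is a minimal sketch: the `demo_scores` array is hypothetical and is not taken from the Wine Quality data. It also counts the two classes, which is worth doing because AUC is only defined when both classes are present.

```python
import numpy as np

# Hypothetical quality scores (assumption, not real dataset rows).
demo_scores = np.array([3, 5, 6, 7, 4, 8, 5, 6])

# Same rule as the notebook: 1 when the score is at least 6, otherwise 0.
demo_labels = np.where(demo_scores >= 6, 1, 0)

# Count how many samples fall into each class (index 0 = bad, index 1 = good).
class_counts = np.bincount(demo_labels, minlength=2)
print(demo_labels, class_counts)
```

Running the same `np.bincount` call on the real `y` would show how balanced the binarized target is.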

Initialize a DecisionTreeClassifier model from Scikit-learn

In [None]:
model = DecisionTreeClassifier(max_depth = 10, min_samples_split = 5)

### (a) Holdout Method

In [None]:
def holdout(model, xFeat, y, testSize):
    start = time.time()
    xTrain, xTest, yTrain, yTest = train_test_split(xFeat, y, test_size=testSize)
    model.fit(xTrain, yTrain)
    trainAuc = roc_auc_score(yTrain, model.predict_proba(xTrain)[:, -1])
    testAuc = roc_auc_score(yTest, model.predict_proba(xTest)[:, -1])
    timeElapsed = time.time() - start
    return trainAuc, testAuc, timeElapsed

### (b) K-Fold Cross-Validation

In [None]:
def kfold(model, xFeat, y, k):
    trainSum, testSum = 0.0, 0.0
    start = time.time()
    kf = KFold(n_splits=k, shuffle=True)
    for train_index, test_index in kf.split(xFeat):
        xTrain, xTest = xFeat[train_index], xFeat[test_index]
        yTrain, yTest = y[train_index], y[test_index]
        model.fit(xTrain, yTrain)
        trainAuc = roc_auc_score(yTrain, model.predict_proba(xTrain)[:,-1])
        testAuc = roc_auc_score(yTest, model.predict_proba(xTest)[:,-1])
        trainSum += trainAuc
        testSum += testAuc
    timeElapsed = time.time() - start
    return trainSum/k, testSum/k, timeElapsed
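
The manual loop above could also be expressed with scikit-learn's `cross_validate`. The sketch below is one possible equivalent, not the notebook's method. It assumes synthetic data from `make_classification` (`X_demo`, `y_demo`) as a stand-in for `xFeat` and `y`, and it fixes `random_state` values that the original code leaves unset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): the notebook itself uses xFeat and y.
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)

clf = DecisionTreeClassifier(max_depth=10, min_samples_split=5, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# scoring="roc_auc" scores each fold from the positive-class probability,
# which matches predict_proba(...)[:, -1] for a binary target.
scores = cross_validate(
    clf, X_demo, y_demo, cv=cv, scoring="roc_auc", return_train_score=True
)

mean_train_auc = scores["train_score"].mean()
mean_test_auc = scores["test_score"].mean()
print(f"train AUC: {mean_train_auc:.3f}, test AUC: {mean_test_auc:.3f}")
```

The returned dictionary also holds `fit_time` and `score_time` per fold, which could replace the manual timer if per-fold timing is of interest.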

### (c) Monte Carlo Cross-Validation

In [None]:
def monte_carlo(model, xFeat, y, testSize, s):
    trainSum, testSum = 0.0, 0.0
    start = time.time()
    for i in range(s):
        state = np.random.randint(0,10000)
        xTrain, xTest, yTrain, yTest = train_test_split(xFeat, y, test_size=testSize, random_state=state)
        model.fit(xTrain, yTrain)
        trainSum += roc_auc_score(yTrain, model.predict_proba(xTrain)[:, -1])
        testSum += roc_auc_score(yTest, model.predict_proba(xTest)[:, -1])
    timeElapsed = time.time() - start
    return trainSum/s, testSum/s, timeElapsed
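
The repeated random splitting above corresponds to what scikit-learn calls `ShuffleSplit`. The sketch below shows that alternative under stated assumptions: synthetic data from `make_classification` stands in for `xFeat` and `y`, and a single seed is fixed so the 40 splits are reproducible (the original draws a fresh random state per iteration).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): the notebook itself uses xFeat and y.
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)
clf = DecisionTreeClassifier(max_depth=10, min_samples_split=5, random_state=0)

# ShuffleSplit draws n_splits independent random train/test partitions.
# Unlike K-Fold, the test sets of different splits may overlap.
splitter = ShuffleSplit(n_splits=40, test_size=0.3, random_state=0)

test_aucs = []
for train_idx, test_idx in splitter.split(X_demo):
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    proba = clf.predict_proba(X_demo[test_idx])[:, 1]
    test_aucs.append(roc_auc_score(y_demo[test_idx], proba))

mean_auc = sum(test_aucs) / len(test_aucs)
print(f"mean test AUC over {len(test_aucs)} splits: {mean_auc:.3f}")
```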

### (d) Table of the AUC and time for each model assessment strategy.

In [None]:
table = pd.DataFrame(columns=['trainAuc', 'testAuc', 'timeElapsed'])
table.loc['Holdout'] = holdout(model, xFeat, y, 0.3)
table.loc['K-Fold'] = kfold(model, xFeat, y, 10)
table.loc['Monte Carlo'] = monte_carlo(model, xFeat, y, 0.3, 40)
print('Results for each model assessment strategy:')
display(table)


Results for each model assessment strategy:


| | trainAuc | testAuc | timeElapsed |
|---|---|---|---|
| Holdout | 0.943394 | 0.763922 | 0.16204 |
| K-Fold | 0.946634 | 0.78701 | 1.321869 |
| Monte Carlo | 0.950889 | 0.770126 | 4.124124 |


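From the table above, K-Fold Cross-Validation yielded the highest average test AUC (0.787), compared with 0.770 for Monte Carlo and 0.764 for Holdout. All three strategies estimate the performance of the same decision tree configuration, and because no random seed was fixed, gaps of this size may partly reflect split-to-split noise. In every case the train AUC (about 0.94 to 0.95) is well above the test AUC, which suggests the tree overfits the training data.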
Because Holdout uses only one split, its estimate can have higher variance. K-Fold CV and Monte Carlo average over several splits, so their estimates are more robust if enough splits are made. Averaging does not make the model itself more accurate; it makes the AUC estimate more stable. One plausible reason for K-Fold's slightly higher test AUC is that each of its fits trains on 90% of the data (k = 10), whereas Holdout and Monte Carlo train on 70% (test size 0.3). Monte Carlo uses repeated random splits, so its stability depends on the number of iterations, s. If too few iterations are used, the variance will not decrease enough.
Holdout took the least time (0.16 seconds) because it trains and evaluates the model only once, although a single split also gives the least stable estimate. K-Fold CV (1.32 seconds) fits the model k = 10 times, and Monte Carlo was the slowest (4.12 seconds) because it re-splits the data and refits the model s = 40 times. Runtime therefore grows roughly with the number of fits: K-Fold spends some extra time to obtain a more stable estimate while remaining significantly faster than Monte Carlo at these settings.
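
The variance argument above can be checked empirically by repeating each strategy under different seeds and comparing how much the resulting estimates spread. The sketch below is illustrative only. It assumes synthetic data from `make_classification` rather than the wine data, and `fit_and_score` is a hypothetical helper introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): the notebook itself uses xFeat and y.
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)


def fit_and_score(clf, x_tr, y_tr, x_te, y_te):
    """Hypothetical helper: fit on the training part, return test AUC."""
    clf.fit(x_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(x_te)[:, 1])


holdout_estimates, kfold_estimates = [], []
for seed in range(20):
    # The tree's own seed is fixed so only the data split varies.
    clf = DecisionTreeClassifier(max_depth=10, min_samples_split=5, random_state=0)

    # One holdout estimate per seed.
    x_tr, x_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    holdout_estimates.append(fit_and_score(clf, x_tr, y_tr, x_te, y_te))

    # One 10-fold estimate (mean over folds) per seed.
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    fold_aucs = [
        fit_and_score(clf, X_demo[tr], y_demo[tr], X_demo[te], y_demo[te])
        for tr, te in kf.split(X_demo)
    ]
    kfold_estimates.append(np.mean(fold_aucs))

holdout_spread = np.std(holdout_estimates)
kfold_spread = np.std(kfold_estimates)
print(f"holdout std: {holdout_spread:.4f}, 10-fold std: {kfold_spread:.4f}")
```

If the argument holds, the standard deviation of the 10-fold estimates would typically come out smaller than that of the single-split Holdout estimates, though the exact values depend on the data and seeds.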