# Question 3: Robustness of Decision Trees and KNN
---
Samarth Kumar

Run this command if the `ucimlrepo` library isn't installed. It is required to load the dataset directly in Python.

In [None]:
%pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


Import dependencies

In [None]:
import numpy as np
import pandas as pd
from IPython.display import display
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

Load the dataset

In [None]:
wine_quality = fetch_ucirepo(id=186)
X = wine_quality.data.features
y = wine_quality.data.targets
xFeat = X.to_numpy()
y = y.to_numpy().ravel()
y = np.where(y >= 6, 1, 0)
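
The last line of the cell above turns the quality score into a binary label. A minimal sketch of that step on a few made-up scores (the `quality` array is hypothetical and is not taken from the dataset):

```python
import numpy as np

# Hypothetical quality scores, standing in for the dataset's target column.
quality = np.array([3, 5, 6, 7, 9])

# Scores of 6 or higher map to class 1; everything else maps to class 0.
labels = np.where(quality >= 6, 1, 0)
print(labels)  # [0 0 1 1 1]
```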

Initialize the Decision Tree and KNN Models

In [None]:
dtModel = DecisionTreeClassifier()
knnModel = KNeighborsClassifier()

### (a) Optimal Parameters for KNN and Decision Tree


#### KNN

In [None]:
# Testing various numbers of neighbors.
neighbors = [5, 10, 15, 20]
knnParameters = 0
knnAuc = 0.0

# Iterate for each value in neighbors array.
for k in neighbors:
    # Initialize the model with the current number of neighbors, k. Initialize train/test AUC sums.
    model = KNeighborsClassifier(n_neighbors=k)
    trainSum, testSum = 0.0, 0.0

    # Split the data, using 5-fold cross-validation.
    kf = KFold(n_splits=5, shuffle=True)
    for train_index, test_index in kf.split(xFeat):
        # Initialize train/test data and train the model.
        xTrain, xTest = xFeat[train_index], xFeat[test_index]
        yTrain, yTest = y[train_index], y[test_index]
        model.fit(xTrain, yTrain)

        # Sum AUC for train/test data.
        trainSum += roc_auc_score(yTrain, model.predict_proba(xTrain)[:, -1])
        testSum += roc_auc_score(yTest, model.predict_proba(xTest)[:, -1])

    # Average the AUC for train/test across the 5 folds.
    trainAuc = trainSum / 5
    testAuc = testSum / 5

    # Update the best parameters when test AUC is better than the current.
    if testAuc > knnAuc:
        knnAuc = testAuc
        knnParameters = k
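
The manual loop above can also be written with scikit-learn's `cross_validate`. The sketch below is one possible variant, not the code used for the reported results: it runs on synthetic data from `make_classification` (an assumption, so it is runnable on its own) and fixes `random_state` so that every value of `k` is scored on the same folds. The names `X_demo`, `y_demo`, `folds`, `cv_auc`, and `best_k` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), not the wine quality dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# One fixed set of folds so every k is scored on identical splits.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

cv_auc = {}
for k in [5, 10, 15, 20]:
    scores = cross_validate(
        KNeighborsClassifier(n_neighbors=k), X_demo, y_demo,
        cv=folds, scoring="roc_auc", return_train_score=True,
    )
    # Store (mean train AUC, mean test AUC) across the 5 folds.
    cv_auc[k] = (scores["train_score"].mean(), scores["test_score"].mean())

# Pick the k with the highest mean test AUC.
best_k = max(cv_auc, key=lambda k: cv_auc[k][1])
```

Reusing the same folds for every candidate removes one source of run-to-run variation when comparing values of `k`.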

#### Decision Tree

In [None]:
# Testing various max depths and minimum samples split.
maxDepth = [5, 10, 15, 20]
samplesSplit = [2, 5, 10, 15]
treeParameters = (0, 0)
treeAuc = 0.0

# Iterate for each maxDepth value.
for d in maxDepth:

    # Iterate for each minimum samples split value.
    for sample in samplesSplit:

        # Initialize the Decision Tree model.
        model = DecisionTreeClassifier(max_depth=d, min_samples_split=sample)
        trainSum, testSum = 0.0, 0.0

        # Split the data, using 5-fold cross-validation, like I previously did for KNN.
        kf = KFold(n_splits=5, shuffle=True)
        for train_index, test_index in kf.split(xFeat):
            # Initialize train/test data and train the model.
            xTrain, xTest = xFeat[train_index], xFeat[test_index]
            yTrain, yTest = y[train_index], y[test_index]
            model.fit(xTrain, yTrain)

            # Sum AUC for train/test data
            trainSum += roc_auc_score(yTrain, model.predict_proba(xTrain)[:, -1])
            testSum += roc_auc_score(yTest, model.predict_proba(xTest)[:, -1])

        # Average the AUC for train/test across the 5 folds.
        trainAuc = trainSum / 5
        testAuc = testSum / 5

        # Update the best parameters when test AUC is better than the current.
        if testAuc > treeAuc:
            treeAuc = testAuc
            treeParameters = (d, sample)
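
The nested loops above amount to a grid search over two hyperparameters. As a hedged sketch, the same search could be expressed with `GridSearchCV`. It uses synthetic data (an assumption) and fixed random seeds, so its numbers will not match the results reported below; `X_demo`, `y_demo`, `param_grid`, and `search` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the wine quality dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The same 4 x 4 grid of candidate values as in the loops above.
param_grid = {"max_depth": [5, 10, 15, 20], "min_samples_split": [2, 5, 10, 15]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # mean test AUC across folds decides the winner
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

best_params = search.best_params_  # dict with max_depth and min_samples_split
best_auc = search.best_score_      # mean cross-validated AUC of the best pair
```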

Display the optimal parameters

In [None]:
display(pd.DataFrame({
    'Decision Tree': [treeParameters, treeAuc],
    'KNN': [knnParameters, knnAuc]
}, index=['Optimal Parameters', 'Optimal AUC']))

| | Decision Tree | KNN |
|---|---|---|
| Optimal Parameters | (10, 15) | 15 |
| Optimal AUC | 0.796502 | 0.710276 |


In the table above, the Decision Tree parameters are given as a tuple: the first value (10) is the maximum depth, and the second value (15) is the minimum number of samples required to split a node. The KNN parameter, 15, is the number of neighbors.

Setting up for parts (b) and (c)

In [None]:
# Initializing array to store all resulting values.
results = []

# Percentages (1%, 5%, 10% to be removed from training data).
percentages = [0.01, 0.05, 0.10]

### (b) KNN: Using hyperparameters found in (a), create 3 subsets removing 1%, 5%, and 10% of the training data and train the models.

In [None]:
# Train the KNN model on the entire dataset with the optimal parameters, record the AUC and accuracy.
# These baseline scores are computed on the same data the model was fit on.
knnModel = KNeighborsClassifier(n_neighbors=knnParameters)
knnModel.fit(xFeat, y)
testAuc_knn = roc_auc_score(y, knnModel.predict_proba(xFeat)[:, -1])
testAccuracy_knn = knnModel.score(xFeat, y)
results.append(['KNN Entire', testAuc_knn, testAccuracy_knn])

# Train KNN on the 3 subsets.
for percent in percentages:
    # Initialize model using the optimal parameters.
    knnModel = KNeighborsClassifier(n_neighbors=knnParameters)

    # Randomly remove a percentage of data, then split into training and test.
    xTrain_reduced, _, yTrain_reduced, _ = train_test_split(xFeat, y, test_size=percent)
    xTrain, xTest, yTrain, yTest = train_test_split(xTrain_reduced, yTrain_reduced, test_size=0.3)

    # Train the model.
    knnModel.fit(xTrain, yTrain)

    # Evaluate AUC and Accuracy, store them.
    testAuc = roc_auc_score(yTest, knnModel.predict_proba(xTest)[:, -1])
    testAccuracy = knnModel.score(xTest, yTest)
    results.append([f'KNN ({percent*100}%)', testAuc, testAccuracy])

### (c) Decision Tree: Using hyperparameters found in (a), create 3 subsets removing 1%, 5%, and 10% of the training data and train the models.

In [None]:
# Train the Decision Tree model on the entire dataset with the optimal parameters, record the AUC and accuracy.
# These baseline scores are computed on the same data the model was fit on.
dtModel = DecisionTreeClassifier(max_depth=treeParameters[0], min_samples_split=treeParameters[1])
dtModel.fit(xFeat, y)
testAuc_tree = roc_auc_score(y, dtModel.predict_proba(xFeat)[:, -1])
testAccuracy_tree = dtModel.score(xFeat, y)
results.append(['Decision Tree Entire', testAuc_tree, testAccuracy_tree])

# Train Decision Tree on the 3 subsets.
for percent in percentages:
    # Initialize the model using the optimal parameters.
    dtModel = DecisionTreeClassifier(max_depth=treeParameters[0], min_samples_split=treeParameters[1])

    # Randomly remove a percentage of data, then split into training and test.
    xTrain_reduced, _, yTrain_reduced, _ = train_test_split(xFeat, y, test_size=percent)
    xTrain, xTest, yTrain, yTest = train_test_split(xTrain_reduced, yTrain_reduced, test_size=0.3)

    # Train the model.
    dtModel.fit(xTrain, yTrain)

    # Evaluate AUC and Accuracy, store them.
    testAuc = roc_auc_score(yTest, dtModel.predict_proba(xTest)[:, -1])
    testAccuracy = dtModel.score(xTest, yTest)
    results.append([f'Decision Tree ({percent*100}%)', testAuc, testAccuracy])
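
In (b) and (c), each reduced model is scored on its own random 30% split, while the "Entire" baseline is scored on the data it was fit on. One alternative design, sketched below under stated assumptions, holds out a single test set and removes rows only from the training portion, so every model is scored on the same unseen rows. This is not the procedure that produced the results in (d). The data is synthetic, the hyperparameters are the tree values reported in (a), and `X_demo`, `y_demo`, `keep`, and `robustness` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the wine quality dataset.
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)

# Hold out one test set that every model is scored on.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

rng = np.random.default_rng(0)
robustness = {}
for frac in [0.0, 0.01, 0.05, 0.10]:
    # Keep a random subset of training rows; frac = 0.0 keeps all of them.
    n_keep = len(X_tr) - int(round(frac * len(X_tr)))
    keep = rng.choice(len(X_tr), size=n_keep, replace=False)

    model = DecisionTreeClassifier(max_depth=10, min_samples_split=15, random_state=0)
    model.fit(X_tr[keep], y_tr[keep])

    # AUC on the shared held-out test set.
    robustness[frac] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

With this layout, differences between entries of `robustness` come from the removed training rows and not from a change in the evaluation set.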

### (d) Report AUC and Accuracy from the 8 models created from (b) and (c).

In [None]:
display(pd.DataFrame(results, columns=['Model', 'Test AUC', 'Test Accuracy']))

| | Model | Test AUC | Test Accuracy |
|---|---|---|---|
| 0 | KNN Entire | 0.791578 | 0.722026 |
| 1 | KNN (1.0%) | 0.681082 | 0.651813 |
| 2 | KNN (5.0%) | 0.690568 | 0.663067 |
| 3 | KNN (10.0%) | 0.713675 | 0.681481 |
| 4 | Decision Tree Entire | 0.929302 | 0.858088 |
| 5 | Decision Tree (1.0%) | 0.769275 | 0.725907 |
| 6 | Decision Tree (5.0%) | 0.753405 | 0.729482 |
| 7 | Decision Tree (10.0%) | 0.776242 | 0.74359 |


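When comparing the KNN and Decision Tree models, the Decision Tree appears more sensitive to reductions in data. The Decision Tree achieved an AUC of 92.9% on the entire dataset, which fell to 76.9% when 1% of the training data was removed. Removing 5% or 10% gave similar AUCs (75.3% and 77.6%), so most of the decline appeared with the first 1%. The accuracy also decreased, from 85.8% to 72.6%. The "Entire" models were scored on the same data they were fit on, while the reduced models were scored on a held-out 30% split, so part of this gap reflects the change in evaluation and not only the removed data.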

In contrast, the KNN model was slightly more robust: its AUC fell by about 11 percentage points (79.2% to 68.1%) when 1% of the data was removed, versus about 16 percentage points for the Decision Tree. However, every KNN variant had lower accuracy than the corresponding Decision Tree variant.

For both models, performance drops noticeably when only 1% of the data is removed, but removing a larger share (5% or 10%) changes it only slightly and the scores level off. Although the Decision Tree was more sensitive initially, it still had better accuracy overall.
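
The drops discussed above can be checked with a short calculation. The sketch below copies the AUC values from the results table in (d) and expresses each drop in percentage points relative to the "Entire" model; the dictionary names `auc` and `drops` are illustrative.

```python
# AUC values copied from the results table in (d).
auc = {
    "KNN": {"entire": 0.791578, "1%": 0.681082, "5%": 0.690568, "10%": 0.713675},
    "Decision Tree": {"entire": 0.929302, "1%": 0.769275, "5%": 0.753405, "10%": 0.776242},
}

# Drop in AUC, in percentage points, relative to the "entire" model.
drops = {
    model: {
        label: round(100 * (vals["entire"] - value), 1)
        for label, value in vals.items()
        if label != "entire"
    }
    for model, vals in auc.items()
}

for model, by_pct in drops.items():
    print(model, by_pct)
```

This gives drops of 11.0, 10.1, and 7.8 points for KNN and 16.0, 17.6, and 15.3 points for the Decision Tree.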