
# CSCI-4962 Homework 2
Author: Trevor Tocchet

In [490]:
#@title Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import tree
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold, StratifiedKFold, train_test_split, KFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

In [487]:
#@title Disable Warnings
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)
warnings.filterwarnings(action='ignore', category=FutureWarning)

In [494]:
#@title Data Pre-Processing
df = pd.read_csv('banknote-data.csv')

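# Features are the first four columns; the class label is the fifth column.
X = df.iloc[:, :4]
# Keep the label as a 1-D Series so scikit-learn does not emit a DataConversionWarning.
Y = df.iloc[:, 4]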

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

In [507]:
#@title Decision Tree

# Initialize the decision tree with the hyperparameters under study.
# Note: min_impurity_split was deprecated in favor of min_impurity_decrease and
# removed in scikit-learn 1.0, so this call requires an older scikit-learn release.
classification_tree = tree.DecisionTreeClassifier(max_depth=None, min_samples_split=2, max_features=None, min_impurity_split=0)

# evaluate the model
cv = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
n_scores = cross_val_score(classification_tree, X, Y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

clf = classification_tree.fit(X, Y)
print('Training set accuracy: %.3f ' % clf.score(X, Y) )

classification_tree.fit(X_train, Y_train)
dt_score = classification_tree.score(X_test, Y_test)
print('Decision tree model accuracy with 0.7/0.3 split (Holdout Method): %.3f' % dt_score)

Accuracy: 0.982 (0.016)
Training set accuracy: 1.000 
Decision tree model accuracy with 0.7/0.3 split (Holdout Method): 0.976


I adjusted four hyperparameters (max_depth, min_samples_split, max_features, min_impurity_split), both individually and in combination, and no setting improved accuracy over the defaults. Increasing max_depth from 1 raised accuracy at each step until it leveled off at a depth of about 6-7. Increasing min_samples_split gradually lowered accuracy. Changing max_features had no noticeable effect on accuracy. Lastly, raising min_impurity_split from 0.4 to 0.5 caused a sharp drop in accuracy, after which accuracy stayed roughly constant for larger values.

Examples:

(max_depth=None, min_samples_split=2, max_features=None, min_impurity_split=0) = 0.982

(max_depth=None, min_samples_split=10000, max_features=None, min_impurity_split=0) = 0.555

(max_depth=None, min_samples_split=2, max_features=None, min_impurity_split=0.4) = 0.852
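The max_depth trend described above can be checked with a small sweep. This is a minimal sketch, not the exact experiment from this notebook: it uses a synthetic dataset from `make_classification` as a stand-in for the banknote data (an assumption, so the numbers will differ), and the names `X_demo`, `y_demo`, and `depth_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the banknote data (assumption: 4 numeric features, 2 classes).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Same cross-validation scheme as the decision tree cell above.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# Cross-validated accuracy for each candidate depth (None = grow until leaves are pure).
depth_scores = {}
for depth in [1, 2, 3, 4, 5, 6, 7, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=cv)
    depth_scores[depth] = scores.mean()
    print(f"max_depth={depth}: {scores.mean():.3f} ({scores.std():.3f})")
```

The same loop works for min_samples_split or max_features by swapping the keyword argument that is varied.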



In [369]:
#@title Bagging (Random Forest)

random_forest = RandomForestClassifier()

# evaluate the model
cv = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
n_scores = cross_val_score(random_forest, X, Y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))


Accuracy: 0.993 (0.008)


In [502]:
#@title Boosting (Gradient Boosting)

gradient_boost = GradientBoostingClassifier()

# evaluate the model
cv = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
n_scores = cross_val_score(gradient_boost, X, Y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

# report performance
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

Accuracy: 0.996 (0.007)


The ensemble methods bagging and boosting both achieved high accuracy, with boosting performing slightly better than bagging (0.996 vs. 0.993), although that gap is smaller than one standard deviation across folds. Bagging combines the predictions of many models, by averaging or voting, where each model is fitted independently on a different bootstrap sample of the same data. Boosting fits models sequentially, each one correcting the prediction errors of the previous models, building a "strong learner" from many "weak learners."
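To make the contrast concrete, the sketch below implements bagging by hand (independent trees on bootstrap samples, combined by majority vote) and uses `staged_predict` to show how a gradient boosting model's accuracy develops as stages are added sequentially. It runs on synthetic data, which is an assumption, and the hand-rolled loop is only an illustration: `RandomForestClassifier` additionally samples features at each split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), binary labels 0/1.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Bagging: each tree is trained independently on its own bootstrap sample.
rng = np.random.default_rng(0)
n_trees = 25  # odd, so a majority vote never ties
votes = np.zeros((n_trees, len(X_te)), dtype=int)
for i in range(n_trees):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))  # sample rows with replacement
    tree_i = DecisionTreeClassifier(random_state=i).fit(X_tr[idx], y_tr[idx])
    votes[i] = tree_i.predict(X_te)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote over the trees
bagging_acc = (bagged_pred == y_te).mean()
print(f"Hand-rolled bagging accuracy: {bagging_acc:.3f}")

# Boosting: stages are fitted one after another, each building on the previous ones.
gb = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
staged_acc = [(pred == y_te).mean() for pred in gb.staged_predict(X_te)]
print(f"Boosting accuracy after 1 stage: {staged_acc[0]:.3f}, after 50: {staged_acc[-1]:.3f}")
```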

In [506]:
#@title Metric Comparison

k_fold = KFold(n_splits=5, random_state=1, shuffle=True) # default number of splits = 5

random_forest.fit(X_train, Y_train)
rf_score = random_forest.score(X_test, Y_test)
print('Random forest model accuracy with 0.7/0.3 split (Holdout Method): %.3f' % rf_score)

n_scores = cross_val_score(random_forest, X, Y, scoring='accuracy', cv=k_fold, n_jobs=-1, error_score='raise')
print('Random forest model accuracy with 5 K-Fold: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

gradient_boost.fit(X_train, Y_train)
gb_score = gradient_boost.score(X_test, Y_test)
print('Gradient boosting model accuracy with 0.7/0.3 split (Holdout Method): %.3f' % gb_score)

n_scores = cross_val_score(gradient_boost, X, Y, scoring='accuracy', cv=k_fold, n_jobs=-1, error_score='raise')
print('Gradient boosting model accuracy with 5 K-Fold: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))


print('\nBalance of the dataset:\n', df['class'].value_counts() , sep='' )


Random forest model accuracy with 0.7/0.3 split (Holdout Method): 0.988
Random forest model accuracy with 5 K-Fold: 0.991 (0.003)
Gradient boosting model accuracy with 0.7/0.3 split (Holdout Method): 0.993
Gradient boosting model accuracy with 5 K-Fold: 0.995 (0.002)

Balance of the dataset:
0    762
1    610
Name: class, dtype: int64


With the holdout method, the accuracy estimate can vary considerably from one random split to the next, especially on small datasets. Each split trains the model on a different subset of the data and scores it on a different test set, so the reported accuracy depends on which samples happen to land on each side. The model is also trained on only part of the available data (70% here), so it can miss patterns that are present in the held-out samples.
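A quick way to see this variability is to repeat the 0.7/0.3 split with different seeds and look at the spread of scores. The sketch below does that on a small synthetic dataset (an assumption; the banknote data is not used here), and `holdout_scores` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset (assumption) so that split-to-split variation is visible.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Repeat the holdout evaluation with 20 different random splits.
holdout_scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))
holdout_scores = np.array(holdout_scores)

print(f"min={holdout_scores.min():.3f}  max={holdout_scores.max():.3f}  "
      f"std={holdout_scores.std():.3f}")
```

The model and its settings are identical in every iteration, so any spread between the minimum and maximum comes from the split alone.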

K-fold cross-validation reduces this problem by repeating the holdout procedure k times: the data is divided into k folds, and each fold serves once as the test set while the remaining k-1 folds are used for training. A fresh model is trained at each iteration, independent of the models from the other iterations, and the k scores are averaged. The resulting accuracy estimate is generally more reliable than a single holdout score. However, if the classes are imbalanced (more samples of one class than another), individual folds may not represent the class distribution of the whole dataset.
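The loop that `cross_val_score` runs internally can be written out roughly as follows. This is an illustrative sketch on synthetic data (an assumption), not scikit-learn's actual implementation; `clone` is used to get an unfitted copy of the model for each fold, and `tested_indices` is a hypothetical helper list that records which samples were used for testing.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

k_fold = KFold(n_splits=5, shuffle=True, random_state=1)
base_model = DecisionTreeClassifier(random_state=0)

fold_scores = []
tested_indices = []
for train_idx, test_idx in k_fold.split(X_demo):
    model = clone(base_model)  # fresh, unfitted copy for every fold
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))
    tested_indices.extend(test_idx)

print(f"5-fold accuracy: {np.mean(fold_scores):.3f} ({np.std(fold_scores):.3f})")
```

Every sample ends up in a test fold exactly once, which is why the averaged score uses all of the data for evaluation while each individual model is still tested only on samples it never saw.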

Stratified K-fold addresses this problem by constructing the folds so that each one preserves approximately the same class proportions as the entire dataset. This yields a more trustworthy accuracy estimate, so different models can be compared fairly.
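The effect of stratification can be seen by comparing the class mix of the test folds produced by `KFold` and `StratifiedKFold`. The sketch below only needs labels, so it builds a label vector with the class counts reported in the output above (762 and 610); the ordering of the labels and the placeholder feature matrix are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Labels with the class counts reported above; features do not matter for splitting.
y_demo = np.array([0] * 762 + [1] * 610)
X_placeholder = np.zeros((len(y_demo), 1))
overall_fraction = y_demo.mean()  # share of class 1 in the whole dataset

def class1_fractions(splitter):
    """Share of class 1 in each test fold produced by the given splitter."""
    return np.array([
        y_demo[test_idx].mean()
        for _, test_idx in splitter.split(X_placeholder, y_demo)
    ])

plain = class1_fractions(KFold(n_splits=10, shuffle=True, random_state=1))
strat = class1_fractions(StratifiedKFold(n_splits=10, shuffle=True, random_state=1))

print(f"Overall class-1 share: {overall_fraction:.3f}")
print("KFold per-fold share:          ", np.round(plain, 3))
print("StratifiedKFold per-fold share:", np.round(strat, 3))
```

With stratification, every fold's class-1 share stays within a fraction of a percentage point of the overall share, whereas plain `KFold` leaves the per-fold mix to chance.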

All of these methods are non-exhaustive: they evaluate only some of the possible train/test partitions of the data rather than all of them.

Because there is a slight imbalance in the dataset (762 vs. 610 samples, as shown above), stratified k-fold is the best choice for comparing the models. Under 10-fold stratified cross-validation, the decision tree classifier had the lowest accuracy (0.982), followed by random forest (0.993), and gradient boosting had the best accuracy (0.996).

More information/resources:


*   https://www.analyticsvidhya.com/blog/2021/05/importance-of-cross-validation-are-evaluation-metrics-enough/
*   https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-right