# **CS 5361/6361 Machine Learning**

**Classifying Fashion MNIST and predicting particle behavior using regression algorithms, while evaluating performance based on running time and accuracy for classification and mean squared error for regression.**

**Authors:** <br>
Ruben Martinez <br>
Francis Owusu Dampare <br>
**Last modified:** 10/17/2024<br>

In [None]:
import time
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

from google.colab import files, drive

from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error, mean_absolute_error
from sklearn import tree
from sklearn.model_selection import train_test_split

# Import all 8 classifiers and regressors.
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression

# **1. Determining which classification algorithm works best for the fashion MNIST dataset, in terms of running time and accuracy.**

In [None]:
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# Normalize the pixel values to be between 0 and 1 and flatten
X_train = np.float32(X_train.reshape(X_train.shape[0],-1)/255)
X_test = np.float32(X_test.reshape(X_test.shape[0],-1)/255)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(60000, 784)
(60000,)
(10000, 784)
(10000,)


In [None]:
classifiers = {'K Nearest Neighbors' : KNeighborsClassifier(metric='euclidean',n_jobs=-1),
               'Decision Tree' : DecisionTreeClassifier(criterion='entropy',splitter='best'),
               'Random Forest' : RandomForestClassifier(criterion='entropy',n_jobs=-1),
               'Logistic Regression' : LogisticRegression(n_jobs=-1)
               }

# Table header
print(f'{"Classifier":<20}{"Training Time (s)":<20}{"Testing Time (s)":<20}{"Training Acc":<15}{"Testing Acc":<15}\n{"-" * 90}')

for classifier_name, model in classifiers.items():
    start = time.time()
    model.fit(X_train, y_train)
    end = time.time()
    training_time = end - start

    train_pred = model.predict(X_train)
    training_accuracy = accuracy_score(y_train, train_pred)

    start = time.time()
    test_pred = model.predict(X_test)
    end = time.time()
    testing_time = end - start

    testing_accuracy = accuracy_score(y_test, test_pred)

    print(f'{classifier_name:<20}{training_time:<20.4f}{testing_time:<20.4f}{training_accuracy:<15.4f}{testing_accuracy:<15.4f}')
    print(f'Confusion matrix:\n{confusion_matrix(y_test, test_pred)}\n')
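    # Optional extra metrics. These need `precision_score` and `recall_score` imported from
    # sklearn.metrics, plus an `average` argument because the labels are multiclass:
    # print(f'Precision: {precision_score(y_test, test_pred, average="macro"):6.4f}')
    # print(f'Recall: {recall_score(y_test, test_pred, average="macro"):6.4f}')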


Classifier          Training Time (s)   Testing Time (s)    Training Acc   Testing Acc    
------------------------------------------------------------------------------------------
K Nearest Neighbors 0.0369              43.6773             0.8998         0.8554         
Confusion matrix:
[[855   1  17  16   3   1 100   1   6   0]
 [  8 968   4  12   4   0   3   0   1   0]
 [ 24   2 819  11  75   0  69   0   0   0]
 [ 41   8  15 860  39   0  34   0   3   0]
 [  2   1 126  26 773   0  71   0   1   0]
 [  1   0   0   0   0 822   5  96   1  75]
 [176   1 132  23  80   0 575   0  13   0]
 [  0   0   0   0   0   3   0 961   0  36]
 [  2   0  10   4   7   0  16   7 953   1]
 [  0   0   0   0   0   2   1  29   0 968]]

Decision Tree       47.6041             0.0086              1.0000         0.8010         
Confusion matrix:
[[736   4  25  46  10   3 163   0  12   1]
 [  8 948   4  25   5   0   8   0   2   0]
 [ 24   2 688  16 144   0 118   0   7   1]
 [ 44  33  16 777  58   0  64   0   6  

# **2. Determining which regression algorithm works best for the particles dataset, in terms of running time and mean squared error.**

In [None]:
drive.mount('/content/drive')
X = np.load('/content/drive/My Drive/Colab Notebooks/particles dataset/particles_X.npy')
y = np.load('/content/drive/My Drive/Colab Notebooks/particles dataset/particles_y.npy')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
(1419954, 5)
(1419954,)
(354989, 5)
(354989,)


In [None]:

regressors = {
    'K Nearest Neighbors': KNeighborsRegressor(metric='euclidean',n_jobs=-1),
    'Decision Tree': DecisionTreeRegressor(criterion='squared_error',splitter='best'),
    'Linear Regression': LinearRegression(n_jobs=-1),
    'Random Forest': RandomForestRegressor(n_estimators=5, criterion='squared_error',n_jobs=-1)
}

# Table header
print(f'{"Regressor":<20}{"Training Time (s)":<20}{"Testing Time (s)":<20}{"MSE":<10}{"MAE":<10}\n{"-" * 90}')

for regressor_name, model in regressors.items():
    start = time.time()
    model.fit(X_train, y_train)
    end = time.time()
    training_time = end - start

    start = time.time()
    pred = model.predict(X_test)
    end = time.time()
    testing_time = end - start

    mse = mean_squared_error(y_test, pred)
    mae = mean_absolute_error(y_test, pred)

    print(f'{regressor_name:<20}{training_time:<20.4f}{testing_time:<20.4f}{mse:<10.4f}{mae:<10.4f}')

Regressor           Training Time (s)   Testing Time (s)    MSE       MAE       
------------------------------------------------------------------------------------------
K Nearest Neighbors 3.8346              15.3108             0.0441    0.1705    
Decision Tree       23.1917             0.7474              0.0759    0.2195    
Linear Regression   0.1757              0.0038              0.0435    0.1683    
Random Forest       63.8585             2.7343              0.0458    0.1732    


# **3. Discussing results in terms of algorithms and dataset characteristics.**





## Classification
The classification algorithms we have covered in the course are **K-Nearest Neighbors (KNN)**, **Decision Tree**, **Random Forest**, and **Logistic Regression**.
The dataset used was Fashion MNIST, which has 70,000 instances with 784 features each (28×28 grayscale images, flattened). 60,000 of these instances were used for training and 10,000 for testing.

### Running Times

The order from fastest to slowest in terms of training time is: KNN, Logistic Regression, Decision Tree, Random Forest.

The order from fastest to slowest in terms of testing time is: Decision Tree, Logistic Regression, Random Forest, KNN.

KNN had the fastest training time, approximately **0.04s**, and Random Forest had the slowest, approximately **110.95s**. This makes sense: KNN is a lazy learner whose training step essentially just stores the data, while Random Forest is an eager learner that spends its training time building trees. The second-slowest algorithm to train was the Decision Tree (**47.60s**). This is intuitive, since building a single tree should be faster than building the multiple trees of a Random Forest. Logistic Regression was third slowest, with a training time of **32.62s**.

For testing time the picture largely reverses. The Decision Tree had the shortest testing time, **0.01s**, and KNN had the longest, **43.68s**. Logistic Regression and Random Forest had the second- and third-shortest testing times, **0.06s** and **0.35s** respectively.

Since KNN had the fastest training time but the slowest testing time, these results suggest a tradeoff between training and testing time. A lazy learner such as KNN defers its work to prediction, where each test instance has to be compared against the training set, while an eager learner such as the Decision Tree is relatively slow to train but fast to test. This is a tendency rather than a strict rule, since it depends on how much work each algorithm does at training time versus prediction time.
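The lazy-versus-eager contrast described above can be observed on a small scale with a timing helper. This is a minimal sketch, not the notebook's measurement code: `time_fit_predict` is a hypothetical helper, the data is random synthetic data rather than Fashion MNIST, and it uses `time.perf_counter()` (a monotonic clock intended for interval timing) where the notebook uses `time.time()`.

```python
import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def time_fit_predict(model, X_train, y_train, X_test):
    """Return (fit_seconds, predict_seconds, predictions) for one model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds, predictions


# Synthetic stand-in data (hypothetical sizes, random labels).
rng = np.random.default_rng(0)
X_demo = rng.random((2000, 50)).astype(np.float32)
y_demo = rng.integers(0, 10, size=2000)
X_query = rng.random((500, 50)).astype(np.float32)

demo_models = {
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
}
for name, model in demo_models.items():
    fit_s, pred_s, preds = time_fit_predict(model, X_demo, y_demo, X_query)
    print(f'{name:<15}fit={fit_s:.4f}s  predict={pred_s:.4f}s')
```

The absolute numbers depend on the machine and on the data size, so only the relative pattern between fit and predict time is meaningful here.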

### Accuracy
On accuracy, all four classification algorithms seem to have broadly similar testing values, although the gap between training and testing accuracy differs: the Decision Tree, for example, scores 1.0000 on the training set but 0.8010 on the test set. The confusion matrices tell a similar story, since the pattern of predictions is alike across the algorithms. Interestingly, the confusion matrices show that many instances of Class 6 get predicted as Class 0 by all the classifiers (176 of them for KNN). This makes sense for Fashion MNIST: Class 6 (shirt) and Class 0 (t-shirt/top) have very similar visual features, which makes it harder for the algorithms to differentiate between them. Shirts and t-shirts/tops can both have short sleeves and similar shapes, causing confusion in the models.
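One way to read a confusion matrix systematically is to rank its off-diagonal entries and compute per-class recall. The sketch below does this for the KNN confusion matrix printed earlier in the notebook; `top_confusions` and `per_class_recall` are hypothetical helper names, and rows are taken to be true classes and columns predicted classes, which is the convention of `sklearn.metrics.confusion_matrix`.

```python
import numpy as np

# KNN confusion matrix copied from the notebook output (rows = true, columns = predicted).
knn_cm = np.array([
    [855,   1,  17,  16,   3,   1, 100,   1,   6,   0],
    [  8, 968,   4,  12,   4,   0,   3,   0,   1,   0],
    [ 24,   2, 819,  11,  75,   0,  69,   0,   0,   0],
    [ 41,   8,  15, 860,  39,   0,  34,   0,   3,   0],
    [  2,   1, 126,  26, 773,   0,  71,   0,   1,   0],
    [  1,   0,   0,   0,   0, 822,   5,  96,   1,  75],
    [176,   1, 132,  23,  80,   0, 575,   0,  13,   0],
    [  0,   0,   0,   0,   0,   3,   0, 961,   0,  36],
    [  2,   0,  10,   4,   7,   0,  16,   7, 953,   1],
    [  0,   0,   0,   0,   0,   2,   1,  29,   0, 968],
])


def top_confusions(cm, k=3):
    """Return the k largest off-diagonal cells as (true_class, predicted_class, count)."""
    off_diagonal = np.array(cm, copy=True)
    np.fill_diagonal(off_diagonal, 0)  # ignore correct predictions
    order = np.argsort(off_diagonal, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, off_diagonal.shape)
    return [(int(r), int(c), int(off_diagonal[r, c])) for r, c in zip(rows, cols)]


def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly."""
    cm = np.asarray(cm)
    return np.diag(cm) / cm.sum(axis=1)


print(top_confusions(knn_cm, k=3))
print(np.round(per_class_recall(knn_cm), 3))
```

For this matrix the largest off-diagonal cell is true class 6 predicted as class 0, and class 6 also has the lowest recall of the ten classes.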

Regarding algorithm performance, KNN performed the second best in terms of accuracy, with a testing accuracy of **0.8554**. This is likely because KNN directly compares the test instance to the training instances, allowing it to handle subtle differences between classes more effectively. The drawback is that it took the longest to test, since KNN evaluates each test instance against the entire training set.

On the other hand, Random Forest performed the best in terms of accuracy, with a testing accuracy of **0.8769**. This is likely because Random Forests reduce overfitting by combining the predictions of multiple decision trees, leading to better generalization on unseen data. Unlike a single Decision Tree, which is prone to overfitting due to its tendency to memorize the training data, a Random Forest introduces randomness by building each tree on a different random subset of the data. This helps it perform well, even on high-dimensional datasets like Fashion MNIST.
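The gap between training and testing accuracy is a simple way to see the overfitting behaviour discussed above. The following is a minimal sketch on synthetic data from `make_classification`, not on Fashion MNIST; the sample sizes, noise level, and `random_state` values are illustrative assumptions. A fully grown tree is expected to memorize the training set, while a forest typically (though not necessarily on every dataset) generalizes better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y); all sizes are illustrative.
X_syn, y_syn = make_classification(n_samples=2000, n_features=40, n_informative=10,
                                   n_classes=4, flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

candidates = {
    'Decision Tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
    'Random Forest': RandomForestClassifier(criterion='entropy', n_estimators=100,
                                            random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    # score() returns mean accuracy for classifiers.
    scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f'{name:<15}train={scores[name][0]:.4f}  test={scores[name][1]:.4f}')
```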

In summary, **KNN** can be considered the best overall algorithm for the Fashion MNIST dataset when considering both running time and accuracy. Random Forest achieved the highest accuracy, but it took by far the longest to train (110.95 seconds). KNN offers a solid balance: it achieved the second-best accuracy, and its combined training and testing time (about 43.7 seconds) is well below that of Random Forest (about 111.3 seconds). The tradeoff is that almost all of KNN's time is spent at testing, where it is the slowest of the four. KNN also shows only a modest gap between training and testing accuracy (0.8998 versus 0.8554), which suggests limited overfitting. Given the criteria of both running time and accuracy, KNN emerges as the most practical choice, providing competitive accuracy without the training overhead of Random Forest.

## Regression
The regression algorithms we covered in this course are K-Nearest Neighbors (KNN), Decision Tree, Random Forest, and Linear Regression. The dataset used was the Particles dataset, consisting of 1,774,943 records with 5 features. We used 80% (1,419,954 records) for training and 20% (354,989 records) for testing.
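The split sizes can be checked with a little arithmetic. This sketch assumes that `train_test_split` with a fractional `test_size` rounds the test count up and gives the remainder to training, which is consistent with the shapes the notebook prints for this dataset.

```python
import math

n_samples = 1_774_943  # total records in the Particles dataset
test_size = 0.2        # fraction passed to train_test_split in the notebook

# Assumed rounding rule: test count rounded up, training gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```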

### Running Time
In terms of running time, Linear Regression had the fastest performance, with the shortest training time of **0.18** seconds and the shortest testing time of **0.004** seconds. Random Forest had the longest training time at **63.86** seconds and the second-longest testing time at **2.73** seconds. KNN had the second-shortest training time of **3.83** seconds, but it had the slowest testing time of **15.31** seconds. The Decision Tree fell in between, with **23.19** seconds for training and **0.75** seconds for testing. The Random Forest algorithm’s performance was constrained by using only 5 trees, as the default number of trees caused the system to crash.

### Performance
In terms of Mean Squared Error (MSE), Linear Regression again performed the best, achieving an MSE of 0.0435, making it the most accurate regressor. KNN had the second-best MSE of 0.0441, followed by Random Forest with 0.0458. The Decision Tree algorithm had the worst MSE at 0.0759.
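For reference, the two error metrics reported in the table can be computed by hand. The sketch below uses a tiny made-up example (the `y_true` and `y_pred` values are hypothetical, not taken from the Particles dataset): MSE averages the squared residuals, so it penalizes large errors more heavily, while MAE averages their absolute values.

```python
import numpy as np

# Hypothetical targets and predictions, chosen only to keep the arithmetic simple.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

residuals = y_pred - y_true          # [0.5, 0.0, -0.5, 1.0]
mse = np.mean(residuals ** 2)        # (0.25 + 0 + 0.25 + 1) / 4
mae = np.mean(np.abs(residuals))     # (0.5 + 0 + 0.5 + 1) / 4

print(f'MSE={mse:.4f}  MAE={mae:.4f}')
```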

Considering both running time and MSE, **Linear Regression** emerges as the best regressor for the Particles dataset: it achieved the lowest error and was also the fastest to train and test. The dataset consists of solar proton flux measurements taken at five different times in the past, with the goal of predicting the flux two hours into the future. This kind of data, continuous values over time, is well suited to Linear Regression because of the model's simplicity and its ability to capture linear relationships. Since solar proton flux likely follows a relatively smooth, continuous trend over time, Linear Regression can capture the general pattern without overfitting, which would explain why it performed best on both running time and MSE. KNN came close in terms of MSE but was significantly slower in testing, which leaves Linear Regression as the most practical and efficient choice for this dataset.
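To illustrate why a smooth signal can be predicted well from a few past values with a linear model, here is a minimal sketch on a synthetic sine wave. It is not the Particles data: `make_lagged` is a hypothetical helper, the choice of five lags and a two-step horizon only loosely mirrors the dataset description, and the fit uses NumPy least squares rather than scikit-learn. A noise-free sine obeys an exact linear recurrence, so the fit here is essentially perfect, and real measurements would not be this clean.

```python
import numpy as np


def make_lagged(series, n_lags=5, horizon=2):
    """Build rows of n_lags consecutive values; the target lies `horizon` steps after the last lag."""
    series = np.asarray(series, dtype=float)
    n_rows = len(series) - n_lags - horizon + 1
    X = np.stack([series[i:i + n_rows] for i in range(n_lags)], axis=1)
    first_target = n_lags + horizon - 1
    y = series[first_target:first_target + n_rows]
    return X, y


# Synthetic smooth signal (illustrative only).
t = np.arange(200)
signal = np.sin(0.1 * t)

X_lag, y_lag = make_lagged(signal, n_lags=5, horizon=2)

# Ordinary least squares with an intercept column.
A = np.column_stack([X_lag, np.ones(len(X_lag))])
coef, *_ = np.linalg.lstsq(A, y_lag, rcond=None)

pred = A @ coef
lag_mse = np.mean((pred - y_lag) ** 2)
print(X_lag.shape, f'MSE={lag_mse:.2e}')
```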