## MNIST ML Project:
**Project Description:**

MNIST is a dataset of handwritten digits from
0 to 9. It has a training set of 60,000 examples and a test set of 10,000
examples. It can be downloaded from: http://yann.lecun.com/exdb/mnist/

K Nearest Neighbors (KNN) is a classifier that assigns a class to a test sample
based on its distance from the training samples. It finds the K training
samples with the smallest distance to the test sample. The most frequent class
among those K points is then selected as the class of the test point.


**Team Members:**

Roaa Fathi Nada

Selsabeel Asim Ali Elbagir

Basma Mahmoud Hashem

Howida Adel Abd El-Halim

In [None]:
from keras.datasets import mnist
from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, f1_score


In [None]:
from skimage.io import imread, imsave
from skimage.transform import resize
from skimage.feature import hog
from skimage import exposure
import matplotlib.pyplot as plt

In [None]:
# Load the MNIST dataset
(x_train_full, y_train_full), (x_test, y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [None]:
x_train, x_val, y_train, y_val = train_test_split(x_train_full, y_train_full, test_size=0.2, random_state=42)

In [None]:
# Print the shape of each set
print("Training set shape:", x_train.shape, y_train.shape)
print("Validation set shape:", x_val.shape, y_val.shape)
print("Testing set shape:", x_test.shape, y_test.shape)

Training set shape: (48000, 28, 28) (48000,)
Validation set shape: (12000, 28, 28) (12000,)
Testing set shape: (10000, 28, 28) (10000,)


# Applying HOG Features to the images (scikit-image)

*   HOG (Histogram of Oriented Gradients) focuses on the structure of the object. It extracts both the magnitude and the orientation of the edges.
*   It uses a detection window of 64x128 pixels (width x height), so each image is first resized to 128 rows by 64 columns.
*   The image is divided into cells of 8x8 pixels, giving 8x16 cells, and the gradient magnitude and orientation are calculated for each cell. The cells are grouped into blocks of 2x2 cells with 50% overlap, so there are 7x15 = 105 blocks in total.
*   The 64 gradient vectors of each 8x8-pixel cell are put into a 9-bin histogram, so each block contributes 4 x 9 = 36 values and the full descriptor has 105 x 36 = 3780 values.
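The block and histogram counts above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming blocks slide one cell at a time (which gives the 50% overlap for 2x2 blocks); the helper name `hog_descriptor_length` is hypothetical and not part of scikit-image.

```python
def hog_descriptor_length(image_shape=(128, 64), pixels_per_cell=(8, 8),
                          cells_per_block=(2, 2), orientations=9):
    """Length of a HOG descriptor, assuming a block stride of one cell."""
    # Number of cells along the rows and the columns of the image
    cells_r = image_shape[0] // pixels_per_cell[0]
    cells_c = image_shape[1] // pixels_per_cell[1]
    # Blocks move one cell at a time, so neighbouring 2x2 blocks overlap by 50%
    blocks_r = cells_r - cells_per_block[0] + 1
    blocks_c = cells_c - cells_per_block[1] + 1
    # Each block concatenates one histogram per cell
    values_per_block = cells_per_block[0] * cells_per_block[1] * orientations
    return blocks_r * blocks_c * values_per_block


# 15 x 7 blocks, 36 values per block
print(hog_descriptor_length())  # 3780
```

The result agrees with the feature shape `(3780,)` printed for the first training image.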


In [None]:
# Function to compute HOG features for a given image
def compute_hog_features(image):
    # Resize the image to the 64x128 detection window (128 rows, 64 columns)
    resized_img = resize(image, (128, 64))

    # Compute HOG features using the scikit-image built-in function.
    # Each block consists of 2x2 cells, and every cell has 8x8 pixels.
    features = hog(resized_img, block_norm='L2-Hys', pixels_per_cell=(8, 8), cells_per_block=(2, 2))

    return features

In [None]:
# Apply HOG features to the entire training set
hog_features_train = [compute_hog_features(img) for img in x_train]

# Apply HOG features to the entire validation set
hog_features_val = [compute_hog_features(img) for img in x_val]

# Apply HOG features to the entire testing set
hog_features_test = [compute_hog_features(img) for img in x_test]

# Display the shape of the HOG features for the first image in the training set
print("HOG Features shape for the first image in the training set:", hog_features_train[0].shape)

HOG Features shape for the first image in the training set: (3780,)


# KNN with 'K' as a parameter and Euclidean distance

Note for teammates: this link gives a clear explanation, in code, of what KNN does:
https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/

Step 1: Calculate Euclidean Distance.

Step 2: Get Nearest Neighbors.

Step 3: Make Predictions.
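The three steps above can be sketched in plain Python. This is only an illustration on made-up 2-D points, not the implementation used in this notebook (which relies on `KNeighborsClassifier`); the function names and the toy data are assumptions.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    # Step 1: straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def get_neighbors(train_X, train_y, query, k):
    # Step 2: labels of the k training samples closest to the query
    distances = [(euclidean_distance(x, query), label)
                 for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    return [label for _, label in distances[:k]]


def knn_predict(train_X, train_y, query, k):
    # Step 3: majority vote among the k nearest labels
    neighbors = get_neighbors(train_X, train_y, query, k)
    return Counter(neighbors).most_common(1)[0][0]


# Toy data: two well-separated clusters (hypothetical values)
toy_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_X, toy_y, (0.5, 0.5), k=3))  # 0
print(knn_predict(toy_X, toy_y, (5.5, 5.5), k=3))  # 1
```

In the notebook itself, the same idea is applied to 3780-dimensional HOG vectors instead of 2-D points.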


In [None]:
import numpy as np

In [None]:
X_train = np.array(hog_features_train)
X_val = np.array(hog_features_val)
X_test = np.array(hog_features_test)

In [None]:
# Flatten the labels
#y_train_flat = y_train.flatten()
#y_val_flat = y_val.flatten()


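The MNIST labels are already one-dimensional (shape `(48000,)` for the training set), so the flattening above is left commented out. For label arrays that carry an extra dimension, flattening ensures that each element corresponds to one label, which simplifies the process of classification for the built-in KNN function.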

In [None]:
# Function to train and evaluate KNN classifier
def calculate_knn_metrics(k):
    # Use the Euclidean metric for our KNN classifier
    knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean')

    # Train the classifier
    knn.fit(X_train, y_train)

    # Make predictions on the validation set
    predictions = knn.predict(X_val)

    # Evaluate accuracy
    accuracy = accuracy_score(y_val, predictions)
    print(f"Accuracy for k={k}: {accuracy * 100:.2f}%")

    # Calculate precision
    precision = precision_score(y_val, predictions, average='weighted')


    # Calculate F1 score
    f1 = f1_score(y_val, predictions, average='weighted')
    return precision, f1




In [None]:
# Test different values of K
for k in [1, 3, 5]:
  precision_knn, f1_knn = calculate_knn_metrics(k)
  print(f"Precision for KNN: {precision_knn:.2f}")
  print(f"F1 Score for KNN: {f1_knn:.2f}")
  print("====================================")

Accuracy for k=1: 97.82%
Precision for KNN: 0.98
F1 Score for KNN: 0.98
Accuracy for k=3: 97.65%
Precision for KNN: 0.98
F1 Score for KNN: 0.98
Accuracy for k=5: 97.58%
Precision for KNN: 0.98
F1 Score for KNN: 0.98


# Implement SVM

In [None]:
def calculate_svm_metrics(X_train, y_train, X_val, y_val, C=1.0, degree=3):
  # Create a linear SVM classifier
  # Note: `degree` only affects kernel='poly'; SVC ignores it when kernel='linear'
  svm = SVC(kernel='linear', C=C, degree=degree)

  # Train the classifier
  svm.fit(X_train, y_train)

  # Make predictions on the validation set
  predictions = svm.predict(X_val)

  # Evaluate accuracy
  accuracy = accuracy_score(y_val, predictions)
  print(f"Accuracy for SVM: {accuracy * 100:.2f}%")

  # Calculate precision
  precision = precision_score(y_val, predictions, average='weighted')

  # Calculate F1 score
  f1 = f1_score(y_val, predictions, average='weighted')

  return precision, f1


## SVM with Different Hyperparameters

In [None]:
# Reshape X_train
#X_train_flat = X_train.reshape(X_train.shape[0], -1)

In [None]:
# Example 1: Smaller regularization parameter (C=0.8)
precision_svm_1, f1_svm_1 = calculate_svm_metrics(X_train, y_train, X_val, y_val, C=0.8, degree=3)
print(f"Precision for SVM (C=0.8): {precision_svm_1:.2f}")
print(f"F1 Score for SVM (C=0.8): {f1_svm_1:.2f}")

Accuracy for SVM: 98.34%
Precision for SVM (C=0.8): 0.98
F1 Score for SVM (C=0.8): 0.98


In [None]:

# Example 2: Larger regularization parameter (C=1.2)
precision_svm_2, f1_svm_2 = calculate_svm_metrics(X_train, y_train, X_val, y_val, C=1.2, degree=3)
print(f"Precision for SVM (C=1.2): {precision_svm_2:.2f}")
print(f"F1 Score for SVM (C=1.2): {f1_svm_2:.2f}")



Accuracy for SVM: 98.34%
Precision for SVM (C=1.2): 0.98
F1 Score for SVM (C=1.2): 0.98


In [None]:
# Example 3: degree=2 (SVC ignores degree when kernel='linear', so this run effectively tests C=1.0)
precision_svm_3, f1_svm_3 = calculate_svm_metrics(X_train, y_train, X_val, y_val, C=1.0, degree=2)
print(f"Precision for SVM (degree=2): {precision_svm_3:.2f}")
print(f"F1 Score for SVM (degree=2): {f1_svm_3:.2f}")



Accuracy for SVM: 98.35%
Precision for SVM (degree=2): 0.98
F1 Score for SVM (degree=2): 0.98


In [None]:
# Example 4: degree=4 (again ignored by the linear kernel, so the result matches the degree=2 run)
precision_svm_4, f1_svm_4 = calculate_svm_metrics(X_train, y_train, X_val, y_val, C=1.0, degree=4)
print(f"Precision for SVM (degree=4): {precision_svm_4:.2f}")
print(f"F1 Score for SVM (degree=4): {f1_svm_4:.2f}")

Accuracy for SVM: 98.35%
Precision for SVM (degree=4): 0.98
F1 Score for SVM (degree=4): 0.98
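
In scikit-learn, the `degree` argument of `SVC` is only used by the polynomial kernel, which would explain why the degree=2 and degree=4 runs above report the same accuracy. A minimal sketch on synthetic data illustrates this; the toy arrays and variable names are assumptions and are not part of the project data.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
                   rng.normal(1.0, 1.0, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

# Two linear SVMs that differ only in `degree`
linear_deg2 = SVC(kernel='linear', C=1.0, degree=2).fit(X_toy, y_toy)
linear_deg4 = SVC(kernel='linear', C=1.0, degree=4).fit(X_toy, y_toy)

# The linear kernel does not use `degree`, so both models agree
same_linear = np.array_equal(linear_deg2.predict(X_toy),
                             linear_deg4.predict(X_toy))
print("Linear kernel, degree 2 vs 4 identical:", same_linear)

# To actually vary the polynomial degree, the kernel has to be 'poly'
poly_deg2 = SVC(kernel='poly', C=1.0, degree=2).fit(X_toy, y_toy)
poly_deg4 = SVC(kernel='poly', C=1.0, degree=4).fit(X_toy, y_toy)
print("Poly degree 2 training accuracy:", poly_deg2.score(X_toy, y_toy))
print("Poly degree 4 training accuracy:", poly_deg4.score(X_toy, y_toy))
```

Switching `calculate_svm_metrics` to `kernel='poly'` would make the degree experiments meaningful, but the accuracies reported in this notebook were obtained with the linear kernel.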


# Bayesian Classifier Method


# What are the parameters of the function?
GaussianNB(priors, var_smoothing)


1.   priors: prior probabilities of the classes, given to the model as an array. They encode how likely each class is before the model learns from the data. If they are not specified, the priors are estimated from the class frequencies in the training data.
2.   var_smoothing: portion of the largest variance of all features that is added to the variances for calculation stability. Its default value is 1e-9.


# How does var_smoothing affect the accuracy?

1.   If a feature has no variance (all samples in that feature have the same value), var_smoothing adds a small value to the variance to prevent division by zero.
2.   Increasing var_smoothing widens the per-feature Gaussians, which may smooth out the model's decision boundaries and reduce overfitting. If the value is too large, it may oversimplify the relationships between features and classes and lower the accuracy.
3.   Decreasing var_smoothing may make the model more sensitive to the training data, which can result in overfitting. In this notebook's experiments, validation accuracy falls from 87.88% at 1e-5 to 83.55% at 1e-9.


**The choice of var_smoothing value should be made carefully, considering the trade-offs between stability, sensitivity to data, and model performance on unseen data.**
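
To make the role of var_smoothing concrete, the sketch below fits `GaussianNB` on a tiny made-up dataset whose second feature is constant, and compares the fitted `epsilon_` attribute (the absolute value added to the variances) with `var_smoothing` times the largest feature variance. The toy arrays and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data: the second feature is constant, so its variance is zero
X_toy = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
                  [10.0, 0.0], [11.0, 0.0], [12.0, 0.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Largest variance over all features, computed on the whole training set
largest_variance = X_toy.var(axis=0).max()

epsilons = {}
for vs in [1e-9, 1e-5, 1e-1]:
    gnb = GaussianNB(var_smoothing=vs).fit(X_toy, y_toy)
    # epsilon_ is the absolute amount added to every per-class variance
    epsilons[vs] = gnb.epsilon_
    print(f"var_smoothing={vs:g}: epsilon_={gnb.epsilon_:.3g}, "
          f"expected={vs * largest_variance:.3g}")
```

Because the added amount scales with the largest feature variance, the same var_smoothing value can have a different effect on differently scaled features.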



In [None]:
def calculate_gaussian_metrics(vs):

  # Create a Gaussian Naive Bayes classifier
  gnb = GaussianNB(var_smoothing= vs)

  # Train the classifier
  gnb.fit(X_train, y_train)

  # Create predictions on the validation set
  predictions = gnb.predict(X_val)

  # Evaluate accuracy
  accuracy = accuracy_score(y_val, predictions)
  print(f"Accuracy for Gaussian Naive Bayes Classifier: {accuracy * 100:.2f}%")

  # Calculate precision
  precision = precision_score(y_val, predictions, average='weighted')

  # Calculate F1 score
  f1 = f1_score(y_val, predictions, average='weighted')

  return precision, f1

In [None]:
for vs in [1e-5, 1e-6, 1e-7, 1e-8, 1e-9]:
  precision_gaussian, f1_gaussian = calculate_gaussian_metrics(vs)
  print(f"Precision for Gaussian Naive Bayes: {precision_gaussian:.2f}")
  print(f"F1 Score for Gaussian Naive Bayes: {f1_gaussian:.2f}")
  print("============================")

Accuracy for Gaussian Naive Bayes Classifier: 87.88%
Precision for Gaussian Naive Bayes: 0.89
F1 Score for Gaussian Naive Bayes: 0.88
Accuracy for Gaussian Naive Bayes Classifier: 86.99%
Precision for Gaussian Naive Bayes: 0.88
F1 Score for Gaussian Naive Bayes: 0.87
Accuracy for Gaussian Naive Bayes Classifier: 86.05%
Precision for Gaussian Naive Bayes: 0.88
F1 Score for Gaussian Naive Bayes: 0.86
Accuracy for Gaussian Naive Bayes Classifier: 84.87%
Precision for Gaussian Naive Bayes: 0.87
F1 Score for Gaussian Naive Bayes: 0.85
Accuracy for Gaussian Naive Bayes Classifier: 83.55%
Precision for Gaussian Naive Bayes: 0.86
F1 Score for Gaussian Naive Bayes: 0.84


# Comparison of the three different models using precision, recall, F-Score

Roaa, add the part where we calculate accuracy for KNN and SVM, this is just precision and F1 score

In [None]:
k = 1  # best accuracy with k equals 1
precision_knn, f1_knn = calculate_knn_metrics(k)
print(f"Precision for KNN: {precision_knn:.2f}")
print(f"F1 Score for KNN: {f1_knn:.2f}")
print("====================================")

Accuracy for k=1: 97.82%
Precision for KNN: 0.98
F1 Score for KNN: 0.98


In [None]:
# Best SVM accuracy was with C=1.0 (degree=2 is passed, but the linear kernel ignores it)
precision_svm, f1_svm = calculate_svm_metrics(X_train, y_train, X_val, y_val, C=1.0, degree=2)
print(f"Precision for SVM: {precision_svm:.2f}")
print(f"F1 Score for SVM: {f1_svm:.2f}")
print("====================================")

Accuracy for SVM: 98.35%
Precision for SVM: 0.98
F1 Score for SVM: 0.98


In [None]:
vs = 1e-5  # this var_smoothing value had the best accuracy for this model
precision_gaussian, f1_gaussian = calculate_gaussian_metrics(vs)
print(f"Precision for Gaussian Naive Bayes: {precision_gaussian:.2f}")
print(f"F1 Score for Gaussian Naive Bayes: {f1_gaussian:.2f}")

Accuracy for Gaussian Naive Bayes Classifier: 87.88%
Precision for Gaussian Naive Bayes: 0.89
F1 Score for Gaussian Naive Bayes: 0.88


In [None]:
# Old calculation should be deleted
# precision_gaussian, f1_gaussian = calculate_gaussian_metrics(X_train, y_train_flat, X_val, y_val_flat)
# print(f"Precision for Gaussian Naive Bayes: {precision_gaussian:.2f}")
# print(f"F1 Score for Gaussian Naive Bayes: {f1_gaussian:.2f}")

Important Notes:
1. A comment on the results and on the comparison of the three applied models should be given (as a printed report).
2. It is expected that the selected models should be experimented with different hyper-parameters.
3. At the final comparison of the results, proper metrics should be selected, such as precision, recall, F1 measure (F-Score), ...
4. Error analysis should be stated, such as correctly/wrongly classified examples. Reasons and suggested improvements should also be included.
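
Notes 3 and 4 mention recall and error analysis, which the cells above do not compute yet. One possible way to obtain them is sketched below. The helper `summarize_errors` and the toy labels are hypothetical; in the notebook it would be called with `y_val` and the predictions of each model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score


def summarize_errors(y_true, y_pred):
    """Return weighted recall, the confusion matrix, and misclassified indices."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Same averaging as the precision and F1 calculations in this notebook
    recall = recall_score(y_true, y_pred, average='weighted')
    # Rows are true classes, columns are predicted classes
    cm = confusion_matrix(y_true, y_pred)
    # Positions of the wrongly classified examples, for visual inspection
    wrong_idx = np.flatnonzero(y_true != y_pred)
    return recall, cm, wrong_idx


# Toy labels (hypothetical): one sample of class 1 is predicted as class 2
y_true_toy = [0, 1, 2, 2, 1, 0]
y_pred_toy = [0, 2, 2, 2, 1, 0]

recall, cm, wrong_idx = summarize_errors(y_true_toy, y_pred_toy)
print(f"Weighted recall: {recall:.2f}")
print(cm)
print("Misclassified indices:", wrong_idx)
```

The misclassified indices can then be used to display the corresponding images from `x_val` and look for common causes of confusion between digits.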