# **Handwritten Digit Classification**
## Single-Task Classification Implementing *GP Classification*, *Neural Networks*, *k-NN*, and *SVM*


### Dataset Description ([Link to Data](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html))

Each entry is an 8x8-pixel image of one handwritten digit, flattened to 64 features. The dataset contains 1797 samples, roughly 180 for each of the 10 classes (0-9). On the held-out test set, Gaussian Process classification with a compound RBF kernel and a Support Vector Machine with an RBF kernel tie for the highest accuracy (0.983), ahead of the neural network and k-NN (0.978 each).

In [418]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sys

from fvgp import GP
from fvgp.gp_kernels import squared_exponential_kernel, matern_kernel_diff2

from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

### Data Pre-Processing

In [424]:
# Load digits dataset
digits = load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
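
The scaling step above fits the statistics on the training split only and reuses them for the test split. A minimal NumPy sketch of that idea follows; `fit_standardizer` and the toy arrays are hypothetical names introduced for illustration, and leaving constant features unscaled mirrors how `StandardScaler` treats zero-variance columns (some border pixels of the digits images are always 0).

```python
import numpy as np

def fit_standardizer(X_train):
    """Return per-feature mean and standard deviation of the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0  # constant features are centered but not rescaled
    return mean, std

# Toy data: the first feature is constant, like an always-empty pixel
X_train = np.array([[0.0, 1.0], [0.0, 3.0], [0.0, 5.0]])
X_test = np.array([[0.0, 7.0]])

mean, std = fit_standardizer(X_train)
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # test data reuses the training statistics
```

Fitting on the training split alone keeps information about the test samples out of the preprocessing.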

## Gaussian Processes
A compound radial basis function (RBF) kernel, the sum of 10 kernels, is defined on top of the off-the-shelf `squared_exponential_kernel` from the `fvgp.gp_kernels` module. Each kernel is anisotropic, with one length scale per feature, in an attempt to capture significant differences between the features. No prior mean is declared, so the model assumes it to be the (constant) mean of the `y_train` values.

In [422]:
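def rbf_kernel_anisotropic(x1, x2, length_scale, constant):
    # Pairwise Euclidean distance after dividing every feature by its own length scale
    distance = np.linalg.norm((x1[:, np.newaxis, :] - x2[np.newaxis, :, :]) / length_scale, axis=2)
    # Note: `distance` is already scaled per feature; `length_scale[0]` rescales it
    # once more inside `squared_exponential_kernel`.
    return constant ** 2 * squared_exponential_kernel(distance, length_scale[0])
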
def compound_rbf_kernel(x1, x2, hyperparameters):
    n_kernels = 10
    n_features = x1.shape[1]
    length_scales = hyperparameters[:n_kernels * n_features].reshape(n_kernels, n_features)
    constants = hyperparameters[n_kernels * n_features:]
    kernels = [
        rbf_kernel_anisotropic(
            x1, x2,
            length_scales[i],
            constants[i]
        )
        for i in range(n_kernels)
    ]
    return sum(kernels)
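
For comparison, a textbook anisotropic RBF kernel can be sketched in plain NumPy. This is not the fvgp-based implementation above: it applies each length scale exactly once, and the function name `anisotropic_rbf` and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def anisotropic_rbf(x1, x2, length_scales, constant):
    """k(x, x') = constant^2 * exp(-0.5 * sum_d ((x_d - x'_d) / l_d)^2)."""
    # Pairwise differences, each feature divided by its own length scale
    diff = (x1[:, np.newaxis, :] - x2[np.newaxis, :, :]) / length_scales
    sq_dist = np.sum(diff ** 2, axis=2)
    return constant ** 2 * np.exp(-0.5 * sq_dist)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = anisotropic_rbf(X, X, length_scales=np.array([1.0, 2.0, 5.0]), constant=1.5)
```

A larger length scale makes the kernel less sensitive to that feature, which is what lets an anisotropic kernel down-weight uninformative pixels.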

In [425]:
n_kernels = 10
length_scales = np.random.uniform(1, 100, X_train.shape[1] * n_kernels)
constants = np.random.uniform(1, 300, n_kernels)

init_hyperparameters = np.concatenate((length_scales, constants))

length_bounds = np.array([[1, 100]] * len(length_scales))
constant_bounds = np.array([[1, 300]] * len(constants))
hps_bounds = np.concatenate((length_bounds, constant_bounds))
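
The flat hyperparameter vector built above holds 10 x 64 length scales followed by 10 constants, 650 values in total. The sketch below, with the hypothetical names `ls_matrix` and `amps`, shows how such a vector is unpacked the same way `compound_rbf_kernel` does it.

```python
import numpy as np

n_kernels, n_features = 10, 64
rng = np.random.default_rng(0)

# Pack: all length scales first, then one constant per kernel
length_scales = rng.uniform(1, 100, n_kernels * n_features)
constants = rng.uniform(1, 300, n_kernels)
hps = np.concatenate((length_scales, constants))

# Unpack: row i of ls_matrix holds the 64 length scales of kernel i
ls_matrix = hps[:n_kernels * n_features].reshape(n_kernels, n_features)
amps = hps[n_kernels * n_features:]
```

Because the bounds array is concatenated in the same order, row `j` of `hps_bounds` constrains entry `j` of the hyperparameter vector.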


#### One-v-Rest Approach

For each of the 10 classes, a separate Gaussian Process Regression (GPR) model is trained on binary targets: samples of the respective class are labeled 1 and all others 0.

In [374]:
gp_models = []
num_classes = 10  # One for each digit 0-9
for class_label in range(num_classes):
    y_train_binary = (y_train == class_label).astype(float)
    gp_model = GP(
        X_train_scaled,
        y_train_binary,
        init_hyperparameters=init_hyperparameters,
        gp_kernel_function=compound_rbf_kernel,
        noise_variances=np.ones(y_train_binary.shape) * 0.25
    )
    gp_model.train(
        hyperparameter_bounds=hps_bounds,
        method='local',
        max_iter=30,
        tolerance=1,
    )
    gp_models.append(gp_model)
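
The binary targets used in the loop above can be illustrated on a handful of labels. This is a small NumPy sketch with made-up labels; `Y_binary` is a hypothetical name, and column `c` corresponds to the `y_train_binary` vector passed to the GP for class `c`.

```python
import numpy as np

y = np.array([0, 3, 3, 9, 1])  # toy labels, not taken from the dataset
num_classes = 10

# Column c is 1.0 where the label equals c and 0.0 elsewhere
Y_binary = np.stack([(y == c).astype(float) for c in range(num_classes)], axis=1)
```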

#### Prediction

During prediction, the posterior mean of each GPR is evaluated at the test points, and the 10 means per sample are converted to probabilities with the softmax function. Each datapoint is then assigned to the class with the greatest probability.

In [373]:
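def softmax(x):
    # Column-wise softmax: each column of x holds the class scores of one sample.
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def predict_probs(X_test, gp_models):
    # One column of posterior means per one-vs-rest GP
    means = np.zeros((X_test.shape[0], len(gp_models)))
    for class_label, gp_model in enumerate(gp_models):
        posterior = gp_model.posterior_mean(X_test)
        means[:, class_label] = posterior["f(x)"].flatten()
    # Transpose so that softmax normalizes over the classes, then transpose back
    return softmax(means.T).T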
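
To make the conversion from posterior means to class labels concrete, here is a row-wise softmax on made-up scores. The helper `softmax_rows` and the numbers are illustrative assumptions, not output of the trained models.

```python
import numpy as np

def softmax_rows(scores):
    """Softmax over the class axis, one row per sample."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

# Two samples, three classes (toy posterior means)
means = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.3, 0.6]])
probs = softmax_rows(means)
labels = probs.argmax(axis=1)
```

Softmax is monotonic, so the predicted label is the same as the argmax of the raw posterior means; the conversion only matters when the probabilities themselves are reported.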

In [386]:
gp_probabilities = predict_probs(X_test_scaled, gp_models)
gp_predictions = np.argmax(gp_probabilities, axis=1)
gp_accuracy = accuracy_score(y_test, gp_predictions)

In [387]:
print(f'GP Classifier – Accuracy: {gp_accuracy}', '\n')
print(classification_report(y_test, gp_predictions))

GP Classifier – Accuracy: 0.9833333333333333 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        17
           1       1.00      1.00      1.00        11
           2       1.00      1.00      1.00        17
           3       0.94      0.94      0.94        17
           4       1.00      1.00      1.00        25
           5       0.96      1.00      0.98        22
           6       1.00      1.00      1.00        19
           7       1.00      0.95      0.97        19
           8       1.00      1.00      1.00         8
           9       0.96      0.96      0.96        25

    accuracy                           0.98       180
   macro avg       0.99      0.98      0.99       180
weighted avg       0.98      0.98      0.98       180



## PyTorch Neural Network
A feedforward neural network is implemented in PyTorch. It consists of three fully connected layers: the first maps the 64 input features to 128 neurons, the second reduces these to 64 neurons, and the output layer maps them to the 10 classes. ReLU activations follow the first two layers to introduce non-linearity, and a softmax in the output layer converts the final outputs into a probability distribution over the 10 classes. The model is trained for 11 epochs with the Adam optimizer (learning rate 0.001) and cross-entropy loss on mini-batches of 32 samples.

In [390]:
# Define the neural network model
class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.layer1 = nn.Linear(64, 128)
        self.layer2 = nn.Linear(128, 64)
        self.output_layer = nn.Linear(64, 10)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.relu(self.layer2(x))
        x = self.softmax(self.output_layer(x))
        return x

model = NeuralNet()
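
One point worth keeping in mind with this architecture: `nn.CrossEntropyLoss` applies a log-softmax to its input, so a model that already ends in a softmax has its probabilities normalized a second time inside the loss. The NumPy sketch below is not part of the original notebook; it only illustrates the effect for a single sample, and `cross_entropy_from_logits` is a hypothetical helper.

```python
import numpy as np

def cross_entropy_from_logits(logits, target):
    """Cross-entropy of one sample: -log_softmax(logits)[target]."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

n_classes = 10
one_hot = np.zeros(n_classes)
one_hot[3] = 1.0  # a "perfect" probability vector for class 3

# Probabilities fed in as if they were logits: the loss is bounded below
loss_on_probs = cross_entropy_from_logits(one_hot, target=3)
floor = np.log(1 + (n_classes - 1) / np.e)  # about 1.46 for 10 classes

# Raw logits can be arbitrarily confident, so the loss can approach 0
loss_on_logits = cross_entropy_from_logits(20.0 * one_hot, target=3)
```

The predicted class is unaffected, since the argmax is preserved, but the gradients are damped. A common alternative is to return the raw output of the last linear layer during training and apply softmax only when probabilities are needed.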

In [391]:
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Prepare data
X_train_scaled_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
X_test_scaled_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)

train_data = TensorDataset(X_train_scaled_tensor, y_train_tensor)
test_data = TensorDataset(X_test_scaled_tensor, y_test_tensor)

train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = DataLoader(test_data, batch_size=32, shuffle=False)

In [392]:
for epoch in range(11):
    model.train()
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()

In [412]:
model.eval()
correct = 0
total = 0
pt_prediction = []
with torch.no_grad():
    # test_loader is not shuffled, so the predictions stay aligned with y_test
    for X_batch, y_batch in test_loader:
        outputs = model(X_batch)
        predicted = outputs.argmax(dim=1)  # index of the highest class score
        total += y_batch.size(0)
        correct += (predicted == y_batch).sum().item()
        pt_prediction.extend(predicted.cpu().numpy())

test_acc = correct / total

In [413]:
print(f'PyTorch – Accuracy: {test_acc}', '\n')
print(classification_report(y_test, pt_prediction))


PyTorch – Accuracy: 0.9777777777777777 

              precision    recall  f1-score   support

           0       1.00      0.94      0.97        17
           1       1.00      1.00      1.00        11
           2       0.94      1.00      0.97        17
           3       1.00      0.94      0.97        17
           4       1.00      1.00      1.00        25
           5       0.96      1.00      0.98        22
           6       1.00      1.00      1.00        19
           7       1.00      0.95      0.97        19
           8       0.89      1.00      0.94         8
           9       0.96      0.96      0.96        25

    accuracy                           0.98       180
   macro avg       0.97      0.98      0.98       180
weighted avg       0.98      0.98      0.98       180



## k-Nearest Neighbors

In [403]:
param_grid = {'n_neighbors': np.arange(3, 19, 2)}
knn = KNeighborsClassifier()

# Use GridSearchCV to find the best parameter
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)


In [415]:
best_knn = grid_search.best_estimator_
knn_prediction = best_knn.predict(X_test_scaled)
knn_accuracy = accuracy_score(y_test, knn_prediction)

In [416]:
print(f'Optimal Neighbors: {grid_search.best_params_["n_neighbors"]}')
print(f'kNN – Accuracy: {knn_accuracy}', '\n')
print(classification_report(y_test, knn_prediction))

Optimal Neighbors: 3
kNN – Accuracy: 0.9777777777777777 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        17
           1       1.00      1.00      1.00        11
           2       0.94      1.00      0.97        17
           3       0.94      0.94      0.94        17
           4       0.96      1.00      0.98        25
           5       1.00      1.00      1.00        22
           6       1.00      1.00      1.00        19
           7       1.00      0.95      0.97        19
           8       1.00      1.00      1.00         8
           9       0.96      0.92      0.94        25

    accuracy                           0.98       180
   macro avg       0.98      0.98      0.98       180
weighted avg       0.98      0.98      0.98       180



## Support Vector Machines

In [419]:
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['linear', 'rbf']
}

svm = SVC()

grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)


best_svm = grid_search.best_estimator_
svm_prediction = best_svm.predict(X_test_scaled)
svm_accuracy = accuracy_score(y_test, svm_prediction)

In [420]:
print(f'Best Parameters: {grid_search.best_params_}')
print(f'SVM – Accuracy: {svm_accuracy}')
print(classification_report(y_test, svm_prediction))

Best Parameters: {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
SVM – Accuracy: 0.9833333333333333
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        17
           1       1.00      1.00      1.00        11
           2       1.00      1.00      1.00        17
           3       1.00      0.94      0.97        17
           4       1.00      1.00      1.00        25
           5       0.96      1.00      0.98        22
           6       1.00      1.00      1.00        19
           7       1.00      0.95      0.97        19
           8       0.89      1.00      0.94         8
           9       0.96      0.96      0.96        25

    accuracy                           0.98       180
   macro avg       0.98      0.98      0.98       180
weighted avg       0.98      0.98      0.98       180

