# Handwritten Digit Classification
### Gaussian Process Classification

#### Dataset Description ([Link to Data](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html))

Each entry corresponds to one handwritten digit stored as an 8x8-pixel image. The dataset contains 1797 samples, with about 180 samples for each of the 10 classes (0-9). We perform Gaussian Process classification using the Jensen-Shannon metric with an exponential kernel, training 10 models in a One-vs-Rest approach. We use PCA for direction optimization and the probit link function to convert the regression outputs to class probabilities. This example should be replicable without too much difficulty or time. Instead of training with global optimization, we use MCMC with 1000 iterations.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sys

from fvgp import GP
from fvgp.gp_kernels import exponential_kernel

from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils import shuffle
from sklearn.decomposition import PCA

from scipy.stats import wasserstein_distance
from scipy.stats import norm

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


In [2]:
# 1. Load and Preprocess the Digits Dataset
digits = load_digits()
X, y = digits.data, digits.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

# Normalize the data to resemble probability distributions
for i in range(len(X_train)):
    X_train[i] = (X_train[i] - np.min(X_train[i])) + 1e-8
    X_train[i] = X_train[i] / np.sum(X_train[i])

for i in range(len(X_test)):
    X_test[i] = (X_test[i] - np.min(X_test[i])) + 1e-8
    X_test[i] = X_test[i] / np.sum(X_test[i])
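
The row-wise loops above can also be written as one vectorized operation. The following is a minimal sketch, assuming a hypothetical helper `to_distribution` (not part of the original notebook) that applies the same shift-and-normalize steps to every row at once:

```python
import numpy as np

def to_distribution(X, eps=1e-8):
    """Shift each row so its minimum equals eps, then scale it to sum to 1."""
    X = np.asarray(X, dtype=float)
    shifted = X - X.min(axis=1, keepdims=True) + eps
    return shifted / shifted.sum(axis=1, keepdims=True)

# Toy 2x4 "images" standing in for the 64-pixel digit rows.
demo = np.array([[0.0, 4.0, 8.0, 4.0],
                 [1.0, 1.0, 3.0, 3.0]])
P = to_distribution(demo)
```

Unlike the loops, this returns a new array instead of modifying its input in place, so the result would need to be assigned back (for example `X_train = to_distribution(X_train)`).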


In [3]:
# 2. Define KL and JS Divergence Functions
def KL(p, q):
    return np.sum(p * np.log(p / q + 1e-10))  # Added epsilon to avoid log(0)

def JS_divergence(p, q):
    M = 0.5 * (p + q)
    return 0.5 * (KL(p, M) + KL(q, M))

# 3. Compute Pairwise JS Divergence Matrices (precomputed once and reused by the kernel)
def compute_JS_matrix(X1, X2):
    n1 = X1.shape[0]
    n2 = X2.shape[0]
    JS_matrix = np.zeros((n1, n2))
    for i in range(n1):
        for j in range(n2):
            JS_matrix[i, j] = JS_divergence(X1[i], X2[j])
    return JS_matrix
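
The double loop evaluates the divergence once per pair in pure Python, which is roughly 2.6 million calls for the 1617x1617 training matrix. A possible speed-up is to vectorize the inner loop with NumPy. The sketch below is one way to do that, assuming a hypothetical function `js_matrix` (not in the original notebook) and rows that are strictly positive and sum to 1:

```python
import numpy as np

def js_matrix(X1, X2):
    """Pairwise Jensen-Shannon divergence (natural log) between rows of X1 and X2.

    Assumes every row is strictly positive and sums to 1.
    """
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    out = np.empty((X1.shape[0], X2.shape[0]))
    for i, p in enumerate(X1):
        M = 0.5 * (p + X2)                            # mixtures, shape (n2, d)
        kl_pm = np.sum(p * np.log(p / M), axis=1)     # KL(p || M) for each row of X2
        kl_qm = np.sum(X2 * np.log(X2 / M), axis=1)   # KL(q || M) for each row of X2
        out[i] = 0.5 * (kl_pm + kl_qm)
    return out

# Small random distributions for a quick sanity check.
rng = np.random.default_rng(0)
A = rng.random((5, 8)) + 1e-8
A /= A.sum(axis=1, keepdims=True)
D = js_matrix(A, A)
```

With the natural logarithm the divergence lies between 0 and ln 2. Note that it is the square root of this quantity, the Jensen-Shannon distance, that satisfies the triangle inequality; `scipy.spatial.distance.jensenshannon` returns that square root.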

In [4]:
print("Computing JS divergence matrices...")
JS_X_train = compute_JS_matrix(X_train, X_train)       # Training vs. Training
JS_X_test = compute_JS_matrix(X_test, X_test)          # Testing vs. Testing
JS_X_train_test = compute_JS_matrix(X_train, X_test)   # Training vs. Testing
print("JS divergence matrices computed.")

Computing JS divergence matrices...
JS divergence matrices computed.


In [5]:
# 4. Define the GP Kernel Function Using Precomputed JS Matrices
def JS_kernel(X1, X2, hyperparameters):
    length_scale = hyperparameters[0]
    n_train = X_train.shape[0]
    n_test = X_test.shape[0]
    if len(X1) == n_train and len(X2) == n_train:
        K = exponential_kernel(JS_X_train, length_scale)
    elif len(X1) == n_test and len(X2) == n_test:
        K = exponential_kernel(JS_X_test, length_scale)
    elif len(X1) == n_train and len(X2) == n_test:
        K = exponential_kernel(JS_X_train_test, length_scale)
    elif len(X1) == n_test and len(X2) == n_train:
        K = exponential_kernel(JS_X_train_test.T, length_scale)
    else:
        raise ValueError("Invalid input sizes for X1 and X2.")
    return K
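
For orientation, an exponential kernel on a precomputed distance matrix is commonly written as exp(-d / length_scale). The sketch below assumes that form and uses a hypothetical stand-in `exp_kernel`; the exact definition of `exponential_kernel` should be checked against the fvgp documentation.

```python
import numpy as np

def exp_kernel(distance, length_scale):
    """Exponential kernel on precomputed distances: exp(-d / length_scale)."""
    return np.exp(-np.asarray(distance, dtype=float) / length_scale)

# Toy symmetric distance matrix with a zero diagonal.
D = np.array([[0.0, 0.2, 0.5],
              [0.2, 0.0, 0.3],
              [0.5, 0.3, 0.0]])
K_short = exp_kernel(D, 0.1)   # short length scale: correlations decay quickly
K_long = exp_kernel(D, 10.0)   # long length scale: all points look similar
```

Because `JS_kernel` selects the precomputed matrix from the lengths of `X1` and `X2` alone, it relies on the training and test sets having different sizes.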

In [6]:
# 5. Initialize Hyperparameters and Bounds
initial_length_scale = 1.0
init_hyperparameters = np.array([initial_length_scale])

# Define bounds for the length scale
length_scale_bounds = np.array([[0.1, 10.0]])


In [7]:
# 6. Train GP Models Using One-vs-Rest Strategy
gp_models = []
num_classes = 10  # Digits 0-9

print("Training GP models...")
for class_label in range(num_classes):
    print(f"Training GP model for class {class_label}...")
    # Binary labels for the current class
    y_train_binary = (y_train == class_label).astype(float)
    # Initialize GP model
    gp_model = GP(
        X_train,
        y_train_binary,
        init_hyperparameters=init_hyperparameters,
        gp_kernel_function=JS_kernel,
        noise_variances=np.zeros(len(y_train_binary)) + 1e-6  # Noise variance
    )

    # Train the GP model (optimize hyperparameters)
    gp_model.train(
        hyperparameter_bounds=length_scale_bounds,
        method='mcmc',
        max_iter=1000,
        tolerance=1e-3,
    )

    gp_models.append(gp_model)
    print(f"GP model for class {class_label} trained.")

print("All GP models trained.")


Training GP models...
Training GP model for class 0...
GP model for class 0 trained.
Training GP model for class 1...
GP model for class 1 trained.
Training GP model for class 2...
GP model for class 2 trained.
Training GP model for class 3...
GP model for class 3 trained.
Training GP model for class 4...
GP model for class 4 trained.
Training GP model for class 5...
GP model for class 5 trained.
Training GP model for class 6...
GP model for class 6 trained.
Training GP model for class 7...
GP model for class 7 trained.
Training GP model for class 8...
GP model for class 8 trained.
Training GP model for class 9...
GP model for class 9 trained.
All GP models trained.
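
With `method='mcmc'`, the hyperparameters are sampled rather than found by an optimizer. The idea can be illustrated with a minimal random-walk Metropolis sketch. This is not fvgp's implementation: `log_likelihood` is a toy stand-in for the GP log marginal likelihood, and `metropolis`, `step`, and `seed` are hypothetical names chosen for illustration. The acceptance test is done in log space, which avoids overflow when exponentiating large differences.

```python
import numpy as np

def log_likelihood(theta):
    # Toy stand-in for a GP log marginal likelihood, peaked at theta = 2.
    return -0.5 * ((theta - 2.0) / 0.5) ** 2

def metropolis(log_like, bounds, theta0, n_iter=1000, step=0.5, seed=0):
    """Random-walk Metropolis with a uniform prior on [bounds[0], bounds[1]]."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    theta, ll = theta0, log_like(theta0)
    samples = np.empty(n_iter)
    for t in range(n_iter):
        proposal = theta + step * rng.normal()
        if lo <= proposal <= hi:  # proposals outside the prior support are rejected
            ll_new = log_like(proposal)
            # Compare in log space instead of exponentiating the ratio.
            if np.log(rng.random()) < ll_new - ll:
                theta, ll = proposal, ll_new
        samples[t] = theta
    return samples

samples = metropolis(log_likelihood, bounds=(0.1, 10.0), theta0=1.0)
estimate = samples[200:].mean()  # discard early iterations as burn-in
```

The bounds and starting value mirror the length-scale bounds and initial value used in the notebook; the step size and burn-in length are arbitrary choices for this toy problem.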


In [18]:
# 7. Define the Probit Link Function (preferred over the logit because it pairs naturally with the Gaussian posterior)
def probit(mu, sigma2):
    # Applies the probit function with variance adjustment.
    adjusted_mu = mu / np.sqrt(1 + sigma2)
    return norm.cdf(adjusted_mu)
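
The variance adjustment comes from a standard identity: if the latent value f is Gaussian with mean mu and variance sigma2, the expected value of Φ(f) equals Φ(mu / sqrt(1 + sigma2)). The sketch below checks this numerically using only the standard library; the helper names are hypothetical and not part of the notebook.

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_closed_form(mu, sigma2):
    """Closed-form expectation of Phi(f) for f ~ N(mu, sigma2)."""
    return std_normal_cdf(mu / math.sqrt(1.0 + sigma2))

def probit_numeric(mu, sigma2, n=20001, width=10.0):
    """Trapezoidal estimate of E[Phi(f)] for f ~ N(mu, sigma2)."""
    sigma = math.sqrt(sigma2)
    a, b = mu - width * sigma, mu + width * sigma
    h = (b - a) / (n - 1)
    total = 0.0
    for k in range(n):
        f = a + k * h
        density = math.exp(-0.5 * ((f - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        weight = 0.5 if k in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * std_normal_cdf(f) * density
    return total * h

closed = probit_closed_form(0.8, 0.5)
numeric = probit_numeric(0.8, 0.5)
```

A larger predictive variance pulls the probability toward 0.5, so uncertain predictions are scored less confidently than the mean alone would suggest.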

In [19]:
# 8. Predict Probabilities Using the Trained GP Models
def predict_probs(X_test, gp_models):
    num_classes = len(gp_models)
    n_test = X_test.shape[0]
    
    # Initialize arrays to store means and variances
    means = np.zeros((n_test, num_classes))
    variances = np.zeros((n_test, num_classes))
    
    for class_label, gp_model in enumerate(gp_models):
        # Compute the posterior mean for the test data
        posterior_mean = gp_model.posterior_mean(X_test)
        mean = posterior_mean["f(x)"]  # Extract mean predictions
        means[:, class_label] = mean.flatten()
        
        # Compute the posterior variance for the test data
        posterior_cov = gp_model.posterior_covariance(X_test, variance_only=True)
        variance = posterior_cov["v(x)"]  # Extract variances
        variances[:, class_label] = variance.flatten()
    
    # Apply the probit link with variance adjustment to convert means and variances to probabilities
    probabilities = probit(means, variances)
    return probabilities
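
Because each One-vs-Rest model is trained independently, the ten scores for a sample need not sum to 1. If calibrated-looking class probabilities are wanted, one simple option is to rescale each row. This is a sketch with a hypothetical helper `normalize_ovr`, not a step from the original notebook:

```python
import numpy as np

def normalize_ovr(probabilities):
    """Rescale one-vs-rest scores so each row sums to 1."""
    probabilities = np.asarray(probabilities, dtype=float)
    return probabilities / probabilities.sum(axis=1, keepdims=True)

# Toy scores for 2 samples and 3 classes.
scores = np.array([[0.9, 0.3, 0.1],
                   [0.2, 0.5, 0.6]])
normalized = normalize_ovr(scores)
labels = normalized.argmax(axis=1)
```

Dividing a row by a positive constant does not change which entry is largest, so the predicted labels are the same with or without this rescaling.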

In [22]:
# 9. Predict Class Labels and Evaluate the Classifier
probabilities = predict_probs(X_test, gp_models)

y_pred = np.argmax(probabilities, axis=1)

accuracy = accuracy_score(y_test, y_pred) * 100
print(f'\nAccuracy: {accuracy:.0f}%')
print('Classification Report:')
print(classification_report(y_test, y_pred))


Accuracy: 99%
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        18
           1       0.90      1.00      0.95        18
           2       1.00      1.00      1.00        18
           3       1.00      1.00      1.00        18
           4       1.00      1.00      1.00        18
           5       1.00      1.00      1.00        18
           6       1.00      0.94      0.97        18
           7       1.00      1.00      1.00        18
           8       1.00      0.94      0.97        18
           9       1.00      1.00      1.00        18

    accuracy                           0.99       180
   macro avg       0.99      0.99      0.99       180
weighted avg       0.99      0.99      0.99       180

