**Assignment 2** 

This assignment requires you to implement image recognition methods. Please understand and use relevant libraries. You are expected to solve both questions.

**Data preparation and rules**

Please use the images of the MNIST handwritten digit recognition dataset. You may use the torchvision.datasets library to obtain the images and splits. You should have 60,000 training images and 10,000 test images. Use the test images only to evaluate your model's performance.


In [1]:
import cv2
import numpy as np
import os
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split


In [2]:
from tensorflow.keras.datasets import mnist
import cv2
import numpy as np

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

def preprocess_images(images):
    # Scale uint8 pixel values from [0, 255] to float32 values in [0, 1]
    images = images.astype('float32') / 255.0
    return images

train_images = preprocess_images(train_images)
test_images = preprocess_images(test_images)
print(train_images.shape)
print(test_images.shape)            

(60000, 28, 28)
(10000, 28, 28)


**Q1: SIFT-BoVW-SVM [4 points]**

1. [2 points] Implement the SIFT detector and descriptor. Compute cluster centers for the Bag-of-Visual-Words approach. Represent the images as histograms (of visual words) and train a linear SVM model for 10-way classification.
Note 1: You may want to use libraries such as cv2 (OpenCV) and sklearn (scikit-learn) for this question. https://scikit-learn.org/stable/modules/svm.html#multi-class-classification may be useful for the SVM.
Note 2: Seed random numbers for reproducibility (running the notebook again should give you the same results!).
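
The histogram step described above can be sketched on its own, independent of SIFT. This is a minimal illustration under stated assumptions: the function name `bovw_histogram` is hypothetical, and the descriptors and cluster centers are synthetic random vectors rather than real SIFT output.

```python
import numpy as np

def bovw_histogram(descriptors, centers):
    """Count, for one image, how many descriptors are nearest to each cluster center."""
    k = centers.shape[0]
    if descriptors.shape[0] == 0:
        return np.zeros(k, dtype=np.float32)  # no keypoints were detected
    # Pairwise squared distances, shape (num_descriptors, k)
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # index of the nearest visual word per descriptor
    return np.bincount(words, minlength=k).astype(np.float32)

rng = np.random.default_rng(0)  # fixed seed for reproducibility
# Synthetic stand-ins: 5 "visual words" and 3 descriptors close to words 0, 0 and 3
centers = rng.normal(size=(5, 128)).astype(np.float32)
noise = 0.01 * rng.normal(size=(3, 128)).astype(np.float32)
descriptors = centers[[0, 0, 3]] + noise

hist = bovw_histogram(descriptors, centers)
print(hist)  # [2. 0. 0. 1. 0.]
```

Each image yields a fixed-length vector of length k regardless of how many keypoints it has, which is what lets a linear SVM consume the result.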

In [8]:
# # Necessary Imports
# import numpy as np
# import cv2
# from sklearn import datasets
# from sklearn.cluster import KMeans
# from sklearn.svm import SVC
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import StandardScaler
# from sklearn.pipeline import make_pipeline

# # Load the scikit-learn digits dataset (note: this is NOT MNIST)
# def load_mnist():
#     digits = datasets.load_digits()
#     # load_digits returns 8x8 grayscale images with pixel values in 0..16
#     images = np.array(digits.images)
#     labels = np.array(digits.target)
#     # Upscale the images for SIFT
#     images = [cv2.resize(img, (32, 32)) for img in images]
#     return images, labels

# # Compute SIFT Features
# def compute_sift_features(images):
#     sift = cv2.SIFT_create()
#     descriptors = []
#     for img in images:
#         kp, desc = sift.detectAndCompute(img.astype(np.uint8), None)
#         if desc is not None:
#             descriptors.append(desc)
#         else:
#             descriptors.append(np.zeros((1, sift.descriptorSize())))
#     return descriptors

# # K-Means Clustering for BoVW
# def cluster_descriptors(descriptors, n_clusters=200):
#     kmeans = KMeans(n_clusters=n_clusters, random_state=42)
#     all_descriptors = np.vstack(descriptors)
#     kmeans.fit(all_descriptors)
#     return kmeans

# # Convert Images to Histograms of Visual Words
# def convert_to_histograms(descriptors, kmeans):
#     histograms = []
#     for desc in descriptors:
#         hist = np.zeros(kmeans.n_clusters)
#         if desc is not None:
#             predictions = kmeans.predict(desc)
#             for p in predictions:
#                 hist[p] += 1
#         histograms.append(hist)
#     return histograms

# # Train a Linear SVM Classifier
# def train_svm(histograms, labels):
#     X_train, X_test, y_train, y_test = train_test_split(histograms, labels, test_size=0.2, random_state=42)
#     scaler = StandardScaler()
#     X_train = scaler.fit_transform(X_train)
#     X_test = scaler.transform(X_test)
    
#     svm = SVC(kernel='linear', probability=True, random_state=42)
#     svm.fit(X_train, y_train)
#     print(f"Test accuracy: {svm.score(X_test, y_test) * 100:.2f}%")
#     return svm

# # Main function to execute the workflow
# def main():
#     images, labels = load_mnist()
#     descriptors = compute_sift_features(images)
#     kmeans = cluster_descriptors(descriptors, n_clusters=150) # You may adjust the number of clusters
#     histograms = convert_to_histograms(descriptors, kmeans)
#     svm = train_svm(histograms, labels)

# if __name__ == "__main__":
#     main()




Test accuracy: 7.78%


In [10]:
from tensorflow.keras.datasets import mnist
import cv2
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load and preprocess the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

def preprocess_images(images):
    images = images.astype('float32') / 255.0
    return images

train_images = preprocess_images(train_images)
test_images = preprocess_images(test_images)

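def calc_features(images, thresh):
    # The first positional argument of cv2.SIFT_create is nfeatures, so `thresh`
    # caps the number of keypoints kept per image (it is not a contrast threshold).
    sift = cv2.SIFT_create(nfeatures=thresh)
    features = []
    for img in images:
        img = np.uint8(img * 255)  # Convert back to the 8-bit format OpenCV expects
        _, des = sift.detectAndCompute(img, None)
        if des is not None:
            features.append(des)
    # SIFT descriptors are 128-dimensional float32 vectors
    return np.vstack(features) if features else np.empty((0, 128), dtype=np.float32)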


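def perform_kmeans(features, k):
    cv2.setRNGSeed(42)  # Seed OpenCV's RNG so the random center initialization is repeatable
    # Stop after 10 iterations or once the centers move by less than 1.0
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    # cv2.kmeans expects float32 samples; run 10 attempts and keep the most compact clustering
    _, _, centers = cv2.kmeans(np.float32(features), k, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
    return centers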

def bag_of_features(features, centers, k):
    # Histogram of visual words: count the descriptors nearest to each cluster center
    vec = np.zeros((1, k), dtype=np.float32)
    for i in range(features.shape[0]):
        diff = np.linalg.norm(features[i] - centers, axis=1)  # broadcasting; no np.tile needed
        idx = np.argmin(diff)
        vec[0, idx] += 1
    return vec

def train_and_evaluate(train_images, train_labels, test_images, test_labels, thresh, k):
    features = calc_features(train_images, thresh)
    centers = perform_kmeans(features, k)
    
    def create_feature_vec(img):
        des = calc_features([img], thresh)
        if des.size > 0:  # calc_features returns an empty array when no keypoints are found
            return bag_of_features(des, centers, k).flatten()
        else:
            return np.zeros((k,))  # Return a zero vector if no features are detected

    # Convert training and testing images to feature vectors
    train_vec = np.array([create_feature_vec(img) for img in train_images])
    test_vec = np.array([create_feature_vec(img) for img in test_images])
    
    # Train SVM
    clf = SVC(kernel='linear', probability=True, random_state=42)  # seeded for reproducibility
    clf.fit(train_vec, train_labels)
    
    # Evaluate
    preds = clf.predict(test_vec)
    return accuracy_score(test_labels, preds), confusion_matrix(test_labels, preds)

thresh = 10  # Passed to cv2.SIFT_create as nfeatures (max keypoints kept per image)
k = 150  # Number of clusters for KMeans
accuracy, conf_mat = train_and_evaluate(train_images, train_labels, test_images, test_labels, thresh, k)

print(f'Accuracy: {accuracy*100:.2f}%')
print('Confusion Matrix:')
print(conf_mat)


Accuracy: 73.55%
Confusion Matrix:
[[ 841    6   17    2    2   27   48   16   12    9]
 [   0 1110    2    2    4    0    5    9    3    0]
 [  48   23  654   29   23   24   28  154   40    9]
 [   7    7   74  736   22   97   15   30   15    7]
 [   4   22   25   13  761   21   25   25   32   54]
 [  50   14   36   71   25  547   61   43   15   30]
 [  68   20   26   12   11   46  587   50   13  125]
 [  10   74  105   21   24   10   33  734   14    3]
 [  23    5   43   28   45   19   36   13  709   53]
 [  24   15   11   15   40   44  130   19   35  676]]


2. [1 point] Keeping everything else constant, plot how classification accuracy changes as you sweep across 6 different values for the number of clusters. Please decide what numbers are meaningful for this question. Explain the trends in classification accuracy that you observe.
Note 1: It is recommended to try hyperparameters in logarithmic steps such as 2x or 3x multiples. An example of 2x multiples is: 1, 2, 5, 10, 20, ... An example of 3x multiples is: 1, 3, 10, 30, 100, ...
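
One way to generate the logarithmic steps from Note 1 and run the sweep is sketched below. The helper names `log_steps` and `sweep` are hypothetical, and `evaluate` stands for any function that maps a cluster count to an accuracy (for example, a thin wrapper around the training code above); the starting value of 10 is only an illustrative choice.

```python
def log_steps(n, start=1, pattern=(1, 2, 5)):
    """Return n values following a repeating per-decade pattern (1-2-5 is roughly 2x steps)."""
    values = []
    decade = start
    while len(values) < n:
        for multiplier in pattern:
            values.append(multiplier * decade)
            if len(values) == n:
                break
        decade *= 10
    return values

def sweep(values, evaluate):
    """Evaluate each hyperparameter value and return (value, score) pairs."""
    return [(v, evaluate(v)) for v in values]

cluster_counts = log_steps(6, start=10)
print(cluster_counts)                    # [10, 20, 50, 100, 200, 500]
print(log_steps(5, pattern=(1, 3)))      # [1, 3, 10, 30, 100]

# Hypothetical usage with the notebook's own function (not executed here):
# results = sweep(cluster_counts,
#                 lambda k: train_and_evaluate(train_images, train_labels,
#                                              test_images, test_labels, thresh, k)[0])
# ks, accs = zip(*results)
# plt.semilogx(ks, accs, marker='o'); plt.xlabel('number of clusters'); plt.ylabel('accuracy')
```

A logarithmic x-axis spaces these values evenly, which makes the trend easier to read than a linear axis would.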


3. [1 point] Show the results for 6 different hyperparameter settings. You may play with the SIFT detector or descriptor and the linear SVM. Keep the number of clusters constant based on the answer to the previous question. Explain the trends in classification accuracy that you observe.

**Q2: CNNs and Transformers [6 points]**
1. [2.5 points] Set up a modular codebase for training a CNN (LeNet) on the task of handwritten digit recognition. You should have clear functional separation between the data (dataset and dataloader), model (nn.Module), and trainer (train/test epoch loops). Implement logging: Weights & Biases is highly recommended; alternatively, create your own plots using another plotting library. Log the training and evaluation losses and accuracies at every epoch, and show the plots for at least one training and evaluation run.
Note 1: Seed random numbers for reproducibility (running the notebook again should give you the same results!).
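
The separation between model and trainer asked for above might look roughly like the following PyTorch sketch. It is not a reference solution: the class is LeNet-style (ReLU and max pooling are assumptions, whereas the original LeNet used different activations and pooling), `run_epoch` is a hypothetical helper name, and random tensors stand in for the MNIST dataloader.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # seed for reproducibility

class LeNet(nn.Module):
    """LeNet-style CNN for 1x28x28 inputs."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),            # -> 16x5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

def run_epoch(model, loader, loss_fn, optimizer=None):
    """One pass over loader. Trains if an optimizer is given, otherwise evaluates.
    Returns (mean loss, accuracy) so the caller can log both every epoch."""
    training = optimizer is not None
    model.train(training)
    total_loss, correct, count = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for x, y in loader:
            logits = model(x)
            loss = loss_fn(logits, y)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * y.size(0)
            correct += (logits.argmax(dim=1) == y).sum().item()
            count += y.size(0)
    return total_loss / count, correct / count

# Smoke test on random data standing in for MNIST batches
x = torch.rand(64, 1, 28, 28)
y = torch.randint(0, 10, (64,))
loader = DataLoader(TensorDataset(x, y), batch_size=16)
model = LeNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
train_loss, train_acc = run_epoch(model, loader, loss_fn, optimizer)
eval_loss, eval_acc = run_epoch(model, loader, loss_fn)
print(f"train loss {train_loss:.3f}, eval loss {eval_loss:.3f}")
```

Returning the per-epoch loss and accuracy from a single function keeps logging (Weights & Biases or a plotting library) outside the training loop itself.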


2. [1 point] Show the results for 6 different settings of hyperparameters. You may want to change the batch size, learning rate, and optimizer. Explain the trends in classification accuracy that you observe. Which hyperparameters are most important?


3. [0.5 points] Compare the best performing CNN (from above) against the SIFT-BoVW-SVM approach. Explain the differences.


4. [0.5 points] How does the performance change if you double the number of convolutional layers?


5. [0.5 points] How does the performance change as you increase the number of training samples: [0.6K, 1.8K, 6K, 18K, 60K]? Explain the trends in classification accuracy that you observe.
Note 1: Make sure that all classes are represented equally within different subsets of the training sets.
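
A class-balanced subset, as Note 1 asks for, can be drawn by sampling the same number of indices from each class. The sketch below is one possible approach; `balanced_subset_indices` is a hypothetical name, and synthetic labels stand in for the MNIST training labels. Because the MNIST classes are not perfectly equal in size, the sketch simply returns every index when the requested size covers the whole set.

```python
import numpy as np

def balanced_subset_indices(labels, subset_size, num_classes=10, seed=0):
    """Return shuffled indices with subset_size // num_classes samples per class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility
    if subset_size >= len(labels):
        return rng.permutation(len(labels))  # full dataset: nothing to subsample
    per_class = subset_size // num_classes
    chosen = []
    for c in range(num_classes):
        class_idx = np.flatnonzero(labels == c)
        chosen.append(rng.choice(class_idx, size=per_class, replace=False))
    indices = np.concatenate(chosen)
    rng.shuffle(indices)
    return indices

# Synthetic labels: 100 examples for each of the 10 classes, in shuffled order
demo_rng = np.random.default_rng(1)
labels = demo_rng.permutation(np.repeat(np.arange(10), 100))

idx = balanced_subset_indices(labels, subset_size=200)
print(np.bincount(labels[idx], minlength=10))  # [20 20 20 20 20 20 20 20 20 20]
```

The returned indices can be used to slice the image and label arrays or, in PyTorch, passed to `torch.utils.data.Subset`.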


6. [1 point] Replace the CNN model with a 2-layer TransformerEncoder. Using a ViT-style prediction scheme, evaluate classification accuracy when training with 6K and 60K images. How do the results compare against CNNs? Explain the trends.
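
A ViT-style model with a 2-layer encoder could be sketched as follows. All hyperparameters here (7x7 patches, embedding size 64, 4 heads) and the class name `TinyViT` are assumptions chosen for illustration, not values given in the assignment. The image is split into patches, a learnable class token is prepended, and the prediction is read from that token's output.

```python
import torch
from torch import nn

torch.manual_seed(0)  # seed for reproducibility

class TinyViT(nn.Module):
    """ViT-style classifier for 1x28x28 inputs with a 2-layer TransformerEncoder."""
    def __init__(self, patch=7, dim=64, heads=4, num_classes=10):
        super().__init__()
        num_patches = (28 // patch) ** 2  # 16 patches for 7x7 patches
        # A strided convolution embeds each non-overlapping patch into a dim-sized token
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        nn.init.normal_(self.pos_embed, std=0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)           # (B, 1, dim)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])  # classify from the class token

vit = TinyViT()
logits = vit(torch.rand(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])
```

Since the model maps a batch of images to class logits, it can be passed to the same train and evaluation loops used for the CNN.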