# MNIST Digit Recognition Workbook

### Starter Code

First, let's load the MNIST dataset and split it into training and testing sets.

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

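# Load MNIST from https://openml.org/d/554 as NumPy arrays.
# as_frame=False avoids a pandas DataFrame; the labels in y are strings ('0'-'9').
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)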

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

### Visualization and Plotting
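
A minimal sketch for viewing a few samples, assuming each row is a flattened 28x28 grayscale image as in `mnist_784`. The helper name `plot_digits` and its parameters are illustrative and not part of any library.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_digits(images, labels, n_cols=5, image_shape=(28, 28)):
    """Show flattened grayscale images in a grid, titled with their labels.

    `plot_digits` is a hypothetical helper written for this workbook.
    """
    images = np.asarray(images, dtype=float)
    labels = np.asarray(labels)
    n_rows = int(np.ceil(len(images) / n_cols))
    fig, axes = plt.subplots(
        n_rows, n_cols, figsize=(1.5 * n_cols, 1.7 * n_rows), squeeze=False
    )
    for ax in axes.ravel():
        ax.axis("off")  # hide ticks everywhere, including unused grid cells
    for ax, image, label in zip(axes.ravel(), images, labels):
        ax.imshow(image.reshape(image_shape), cmap="gray")
        ax.set_title(str(label))
    fig.tight_layout()
    return fig
```

With the split from the starter code, `plot_digits(X_train[:10], y_train[:10])` should display ten training digits with their labels.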

## Part 1: PCA + KNN

### Prelude

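Principal Component Analysis (PCA) is a linear technique that projects data onto the orthogonal directions along which it varies most, so a small number of components can capture most of the structure in a dataset. It is often used to make high-dimensional data easier to explore and visualize. Here, you will use PCA to reduce the dimensionality of the MNIST dataset (784 pixel features per image) before applying the k-nearest neighbors (KNN) algorithm for classification.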

### Steps

1. **Load the Dataset:** Start by loading the MNIST dataset.
2. **Apply PCA:** Reduce the dimensionality of the dataset.
3. **KNN Classification:** Use the KNN algorithm to classify the digits.
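
The steps above can be sketched as a single scikit-learn pipeline. This is one possible arrangement rather than a required solution; the function name `build_pca_knn` and its default values are assumptions chosen for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


def build_pca_knn(n_components=0.95, n_neighbors=5):
    """Return an unfitted pipeline: PCA projection followed by KNN.

    `build_pca_knn` is a hypothetical helper; the defaults are starting points.
    """
    return make_pipeline(
        # A float in (0, 1) keeps enough components to explain that
        # fraction of the variance; an int sets the count directly.
        PCA(n_components=n_components),
        KNeighborsClassifier(n_neighbors=n_neighbors),
    )
```

With the split from the starter code, `model = build_pca_knn().fit(X_train, y_train)` followed by `model.score(X_test, y_test)` reports test accuracy. Fitting PCA inside the pipeline means the projection is learned from the training data only.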

### Tips
- Choosing `n_components` for PCA: Start with `n_components=0.95`, which keeps the smallest number of components that together explain at least 95% of the variance. Experiment with other values to see how the results change.
- Choosing `n_neighbors` for KNN: Common starting points are 3, 5, and 7. Very small values tend to overfit, so choose the value using held-out validation data rather than test-set accuracy.
- Explore: Use visualizations like plotting some of the digits before and after PCA to understand what is retained and what is lost.
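
To see what PCA retains, one option is to project some digits down and map them back to pixel space with `inverse_transform`. The sketch below assumes flattened images as input; `pca_reconstruct` is a hypothetical helper name.

```python
import numpy as np
from sklearn.decomposition import PCA


def pca_reconstruct(X_fit, X_show, n_components=0.95):
    """Fit PCA on X_fit, then project X_show down and back to pixel space.

    Returns the reconstructed images and the number of components kept.
    """
    pca = PCA(n_components=n_components).fit(np.asarray(X_fit))
    reduced = pca.transform(np.asarray(X_show))
    restored = pca.inverse_transform(reduced)  # same shape as X_show
    return restored, pca.n_components_
```

For example, `restored, k = pca_reconstruct(X_train, X_train[:10])` gives ten reconstructions; reshaping `restored[i]` to 28x28 and plotting it beside the original shows which details the `k` retained components lose.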

## Part 2: K-Means + SVM

### Prelude

K-Means is a popular clustering algorithm, and Support Vector Machines (SVMs) are a powerful classification method. In this part, you will use K-Means to extract features from the dataset and then use these features to train an SVM classifier.

### Steps

1. **K-Means Clustering:** Apply K-Means to find clusters in the dataset.
2. **Feature Extraction:** Use the distances from each point to the cluster centroids as features.
3. **SVM Classification:** Use the SVM classifier to classify the digits.
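
A minimal sketch of these steps, relying on the fact that scikit-learn's `KMeans.transform` returns each sample's distance to every cluster center. The name `build_kmeans_svm`, the default values, and the `StandardScaler` step are assumptions added for illustration, not requirements of the exercise.

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_kmeans_svm(n_clusters=10, kernel="rbf", C=1.0, random_state=42):
    """Return an unfitted pipeline: centroid distances as features, then an SVM.

    `build_kmeans_svm` is a hypothetical helper written for this workbook.
    """
    return make_pipeline(
        # Inside a pipeline, KMeans acts as a transformer: each sample
        # becomes a vector of n_clusters distances to the centroids.
        KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state),
        StandardScaler(),  # assumption: put the distances on a common scale
        SVC(kernel=kernel, C=C),
    )
```

Training an SVM on all of the MNIST training data can be slow, so it may help to fit on a subset first, for example `build_kmeans_svm().fit(X_train[:5000], y_train[:5000])`.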

**Additional Tips for Students:**
- **Choosing the Number of Clusters (k) in K-Means:** Start with `k=10` since there are 10 digits (0-9). Experiment with different values to see if they improve the performance.
- **Selecting SVM Kernel:** Try different kernels like 'linear', 'poly', 'rbf', and 'sigmoid'. Observe how the choice of kernel affects accuracy.
- **Visualization:** Consider visualizing the centroids of the clusters. Each centroid is a point in the same space as the input data and can be viewed as an "average" digit if reshaped to 28x28 pixels.
- **Cross-Validation:** Use cross-validation to find the best parameters for both K-Means and SVM to further improve the model.
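
One way to follow the cross-validation tip is a grid search over both stages at once. This is a sketch under assumptions: the function name `tune_kmeans_svm`, the candidate values, and the step names `kmeans` and `svm` are all illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def tune_kmeans_svm(X, y, cluster_options=(10, 20, 50),
                    kernels=("linear", "rbf"), cv=3):
    """Cross-validate every (n_clusters, kernel) pair and return the fitted search.

    `tune_kmeans_svm` is a hypothetical helper; the grids are example values.
    """
    pipeline = Pipeline([
        ("kmeans", KMeans(n_init=10, random_state=42)),
        ("svm", SVC()),
    ])
    param_grid = {
        "kmeans__n_clusters": list(cluster_options),  # step name + "__" + parameter
        "svm__kernel": list(kernels),
    }
    search = GridSearchCV(pipeline, param_grid, cv=cv)
    search.fit(X, y)
    return search
```

After fitting, `search.best_params_` holds the best combination and `search.best_score_` its mean cross-validated accuracy. Each grid point refits the pipeline `cv` times, so a subset of the training data keeps the search affordable.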

---

## Part 3: SIFT + SVM

### Prelude

Scale-Invariant Feature Transform (SIFT) is an algorithm to detect and describe local features in images. After extracting these features, you will use an SVM classifier for the classification.

### Steps

1. **SIFT Feature Extraction:** Extract SIFT features from each image.
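2. **Feature Description:** Combine each image's SIFT descriptors into one fixed-length feature vector (for example, by averaging them) so that every image is described in the same way.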
3. **SVM Classification:** Use these descriptions to train and predict using SVM.

### Starter Code

#### SIFT Feature Extraction

First, let's define a function to extract SIFT features from an image.

**Additional Tips for Students:**
- **SIFT Feature Size:** SIFT descriptors are 128-dimensional; ensure all feature vectors are the same length.
- **Choosing SVM Kernel:** Try 'linear', 'poly', 'rbf', and 'sigmoid' kernels to observe their effects.
- **Regularization Parameter (C):** Experiment with different values of `C`; smaller values specify stronger regularization.
- **Handling Missing Descriptors:** In case no keypoints are found in an image, use a zero vector for that image’s descriptors.
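
To compare regularization strengths, a short cross-validation loop is often enough. The sketch assumes `features` is already a fixed-length matrix with one row per image; the name `compare_c_values` and the candidate values are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def compare_c_values(features, labels, c_values=(0.1, 1.0, 10.0),
                     kernel="rbf", cv=3):
    """Return {C: mean cross-validated accuracy} for an SVM on fixed features.

    `compare_c_values` is a hypothetical helper; the C values are examples.
    """
    features = np.asarray(features)
    scores = {}
    for c in c_values:
        clf = SVC(kernel=kernel, C=c)  # smaller C means stronger regularization
        scores[c] = cross_val_score(clf, features, labels, cv=cv).mean()
    return scores
```

The returned dictionary can be printed or plotted to see where accuracy stops improving as `C` grows.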

---