# Turing Machine & Deep Learning 2023
*Author: Satchit Chatterji (satchit.chatterji@gmail.com)*

## Lecture 2: Supervised ML (Classification)
> Today's question: How can I classify handwritten digits?

The MNIST dataset is a widely used benchmark in machine learning and computer vision. It consists of 70,000 grayscale images of handwritten digits from 0 to 9. Each image is 28x28 pixels, i.e. a 28x28 matrix of numerical values. MNIST is often used for digit recognition and serves as a standard dataset for developing and evaluating image classification algorithms and models.

Today, we'll try to classify digits in this dataset using methods we learnt in the lecture.

#### Learning outcomes:
- Using benchmark data sets (MNIST)
- Reshaping inputs, outputs
- Logistic Regression
- Decision Trees
- Random Forests
- SVMs

# Loading MNIST

Because it is so popular, we can get an easily accessible version via e.g. [TensorFlow](https://www.tensorflow.org/) that already comes with a train/test split.
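
As a minimal sketch of what loading could look like: the Keras loader `mnist.load_data()` returns two `(images, labels)` tuples, one for training and one for testing. The wrapper name `load_mnist_splits` is hypothetical and only used here for illustration; the variable names follow the ones used later in this notebook.

```python
def load_mnist_splits():
    """Return ((train_X, train_y), (test_X, test_y)) as NumPy arrays."""
    # Imported lazily so that defining this helper does not require TensorFlow.
    from tensorflow.keras.datasets import mnist

    # load_data() downloads the data on first use and caches it locally.
    (train_X, train_y), (test_X, test_y) = mnist.load_data()
    return (train_X, train_y), (test_X, test_y)
```

Calling `load_mnist_splits()` should give image arrays of shape `(60000, 28, 28)` for training and `(10000, 28, 28)` for testing, together with matching label vectors.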

In [None]:
# get dataset
from tensorflow.keras.datasets import mnist
# get common libraries
import matplotlib.pyplot as plt
import numpy as np

## Exploring MNIST

In [None]:
# load data

### Label distribution

In [None]:
plt.hist(...)
plt.xticks(0.9*np.arange(10)+0.45, range(10))

plt.ylabel(...)
plt.xlabel(...)
plt.show()
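
Alongside the histogram, the class balance can also be checked numerically. The sketch below uses a small made-up label vector (`toy_labels`) in place of the real labels, so the numbers are for illustration only.

```python
import numpy as np

# Toy stand-in for a label vector such as train_y (values are made up).
toy_labels = np.array([0, 1, 1, 3, 3, 3, 9])

# bincount returns one count per integer class; minlength pads unseen digits.
counts = np.bincount(toy_labels, minlength=10)

for digit, count in enumerate(counts):
    print(f"digit {digit}: {count} samples ({count / toy_labels.size:.1%})")
```

If the counts are roughly equal across digits, the dataset can be treated as balanced.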

### Exploring the images

In [None]:
# shape of the datasets

In [None]:
# shape of each image

In [None]:
# plot first image


In [None]:
# plot a bunch more images

fig, axs = plt.subplots(5,5)
axs = axs.flatten()
fig.tight_layout(pad=0.3)

for i, ax in enumerate(axs):
    ax.imshow(...)
    ax.set_title(...)
    ax.axis("off")

### Reshaping MNIST

The ML algorithms used in this notebook usually expect each input sample to be a vector, not a matrix. Thus, we need to reshape every 28x28 image into a single vector with 28 × 28 = 784 entries.

Later on, we will also need to reshape these back into single images, so we add a function for that too.
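
One possible way to do this reshaping with NumPy is sketched below on a toy array. The helper names `flatten_images` and `unflatten_image` are hypothetical and chosen so they do not clash with the functions you write in the next cell.

```python
import numpy as np

# Toy batch standing in for MNIST: 3 "images" of 28x28 pixels.
toy_images = np.arange(3 * 28 * 28).reshape(3, 28, 28)

def flatten_images(samples):
    """Hypothetical helper: (n, 28, 28) -> (n, 784)."""
    # -1 lets NumPy infer the 784 from the remaining dimensions.
    return samples.reshape(samples.shape[0], -1)

def unflatten_image(vector):
    """Hypothetical helper: (784,) -> (28, 28)."""
    return vector.reshape(28, 28)

flat = flatten_images(toy_images)
print(flat.shape)                      # (3, 784)
print(unflatten_image(flat[0]).shape)  # (28, 28)
```

Reshaping only changes how the same values are indexed, so unflattening a flattened image gives back the original pixels.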

In [None]:
def flatten_mnist(samples):
    return ...

def unflatten_mnist(image):
    return ...

train_X, test_X = flatten_mnist(train_X), flatten_mnist(test_X)

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

... # create the logistic regression object
... # fit on training data

### Metrics

In [None]:
from sklearn.metrics import confusion_matrix as cm
import seaborn as sns

# compute cm and plot

The raw confusion matrix for such an accurate baseline model doesn't seem to be too helpful. Instead, we can get the metrics computed directly for us.
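
To see how such metrics relate to the confusion matrix, here is a small sketch on made-up labels for a 3-class problem. It builds the matrix by hand with NumPy (rows are true classes, columns are predicted classes, the same convention scikit-learn uses) and derives per-class precision and recall from it. The names `toy_true` and `toy_pred` are illustrative only.

```python
import numpy as np

# Made-up labels for a 3-class problem, for illustration only.
toy_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
toy_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

n_classes = 3
conf = np.zeros((n_classes, n_classes), dtype=int)
# Rows index the true class, columns the predicted class.
np.add.at(conf, (toy_true, toy_pred), 1)

correct = np.diag(conf)
recall = correct / conf.sum(axis=1)     # share of each true class that was found
precision = correct / conf.sum(axis=0)  # share of each predicted class that is right

print(conf)
print("recall:   ", recall)
print("precision:", precision)
```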

In [None]:
# use a function from sklearn for a full classification report
from sklearn.metrics import classification_report

# make predictions
...
# send to classification_report
print("Logistic Regression")
print(classification_report(test_y, pred_y, digits=4))

For now, as is common for balanced datasets, we use accuracy as a good initial metric to see how the model is doing. So let's compute it directly and display it for both the training and test sets.
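
Accuracy is simply the fraction of predictions that match the true labels. A minimal sketch on made-up labels follows; the function name `accuracy` and the toy lists are assumptions for illustration.

```python
import numpy as np

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    return float(np.mean(true_labels == predicted_labels))

toy_true = [3, 1, 4, 1, 5, 9, 2, 6]
toy_pred = [3, 1, 4, 7, 5, 9, 2, 0]
print(accuracy(toy_true, toy_pred))  # 0.75
```

For scikit-learn classifiers, the `score(X, y)` method returns this same mean accuracy, so it does not need to be computed by hand.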

In [None]:
# compute score for a single dataset
...

In [None]:
def get_accuracies(...):
    print("Training score:", ...)
    print("Testing score: ", ...)
    
get_accuracies(...)

In [None]:
def show_incorrect_preds(pred_y, test_X, test_y):
    incorrect_idxs = pred_y!=test_y

    incorrect_pred_y = pred_y[incorrect_idxs]
    incorrect_test_X = test_X[incorrect_idxs]
    incorrect_test_y = test_y[incorrect_idxs]

    fig, axs = plt.subplots(5,5)
    axs = axs.flatten()
    fig.tight_layout(pad=0.3)

    for i, ax in enumerate(axs):
        ax.imshow(unflatten_mnist(incorrect_test_X[i]), cmap=plt.get_cmap('gray'))
        ax.set_title(fr"{incorrect_pred_y[i]}$\neq${incorrect_test_y[i]}")
        ax.axis("off")
        
show_incorrect_preds(pred_y, test_X, test_y)

# Decision Tree

In [None]:
from sklearn import tree

modelDT = tree.DecisionTreeClassifier(...)
modelDT = modelDT.fit(...)

In [None]:
print("Decision Tree")
get_accuracies(..., train_X, train_y, test_X, test_y)

In [None]:
pred_y = modelDT.predict(...)
show_incorrect_preds(pred_y, test_X, test_y)

# Random Forests

In [None]:
from sklearn.ensemble import RandomForestClassifier as RFC

modelRFC = RFC(...)
modelRFC = modelRFC.fit(train_X, train_y)

In [None]:
print("Random Forest")
get_accuracies(modelRFC, train_X, train_y, test_X, test_y)

In [None]:
pred_y = modelRFC.predict(test_X)
show_incorrect_preds(pred_y, test_X, test_y)

# SVMs

SVMs come in a number of flavors with a number of hyperparameters you can play around with. Try them all out here! Keep in mind that since we have so many high-dimensional data points, this method can be excruciatingly slow.

Let's run these cells first before going back to the lecture.

In [None]:
from sklearn.svm import SVC

modelSVC = SVC(kernel="rbf", C=10, gamma=0.01)
modelSVC = modelSVC.fit(train_X, train_y)

In [None]:
print("SVM")
get_accuracies(modelSVC, train_X, train_y, test_X, test_y)

In [None]:
pred_y = modelSVC.predict(test_X)
show_incorrect_preds(pred_y, test_X, test_y)
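
Because fitting an SVM on the full dataset can take a long time, one option when experimenting with hyperparameters is to fit on a small subset first. The sketch below illustrates the idea on the small 8x8 digits dataset bundled with scikit-learn (`load_digits`), not on MNIST itself. The subset size, the scaling to [0, 1], `gamma="scale"` and all variable names are assumptions made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Small 8x8 digits dataset bundled with scikit-learn (not MNIST itself).
digits = load_digits()
X = digits.data / 16.0  # pixel values range from 0 to 16; scale to [0, 1]
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on a subset first to get a quick estimate of runtime and accuracy.
n_subset = 500
toy_svc = SVC(kernel="rbf", C=10, gamma="scale")
toy_svc.fit(X_train[:n_subset], y_train[:n_subset])

print("Training score:", toy_svc.score(X_train[:n_subset], y_train[:n_subset]))
print("Testing score: ", toy_svc.score(X_test, y_test))
```

Once a setting looks promising on the subset, it can be refit on more data. Note that RBF kernels are sensitive to the scale of the input features, which is why the pixel values are rescaled here.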