## MNIST dataset with SVM, Decision Trees, and Random Forests

This week we're going to apply other classification algorithms to the MNIST data. As usual, click on **Add data**, search for 'mnist.npz', and add the MNIST data to this notebook. Then run the first cell to show the path to the MNIST data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

First we'll again focus on 0's and 1's. Please load the MNIST data and pick the 0's and 1's from both x/y_train and x/y_test as before.

In [None]:
# Load the MNIST data
mnist = np.load('/kaggle/input/mnist-numpy/mnist.npz')

# Choose 0's and 1's for x_train, y_train, x_test, y_test
# Remember to normalize x_train and x_test by 255 (make the range from 0 to 1 for each pixel)
x_train = mnist['x_train'][mnist['y_train']<=1]/255
y_train = mnist['y_train'][mnist['y_train']<=1]
x_test = mnist['x_test'][mnist['y_test']<=1]/255
y_test = mnist['y_test'][mnist['y_test']<=1]
print('x train shape:', x_train.shape)

## SVM with two features

We'll first use the two features we calculated before as inputs for the SVM: the average pixel density and the average density in the center of the image. (The 'center' is defined as the 6x8 block of pixels, 6 columns by 8 rows, in the center of the 28x28 image.)

Please compute the two averages for each image in x_train (for 0's and 1's), and produce a (new) x_train where each entry is (full average, center average). The shape of the resulting x_train should be (12665, 2).

Hint: We did this in the in-class exercise of Week 02 before!
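
The two averages can also be computed without a Python loop. The following is a minimal sketch on synthetic random images (not the MNIST data), assuming the central window covers rows 10 to 17 and columns 11 to 16; the names `images`, `full_mean`, `center_mean`, and `features` are made up for illustration.

```python
import numpy as np

# Synthetic stand-in for the MNIST images: 5 random 28x28 images in [0, 1)
rng = np.random.default_rng(0)
images = rng.random((5, 28, 28))

# Average over all pixels of each image (axes 1 and 2 are rows and columns)
full_mean = images.mean(axis=(1, 2))

# Average over the assumed central window: rows 10-17, columns 11-16
center_mean = images[:, 10:18, 11:17].mean(axis=(1, 2))

# Stack the two averages as columns -> shape (n_images, 2)
features = np.column_stack([full_mean, center_mean])
print(features.shape)
```

Each row of `features` should match what a per-image loop would give for the same image.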

In [None]:
# Prepare the training sample such that each entry is (full average, center average)
x_train = np.array([[img.mean(),img[10:18, 11:17].mean()] for img in x_train])
print('x train shape becomes:', x_train.shape)

# Please do the same for x_test
x_test = np.array([[img.mean(),img[10:18, 11:17].mean()] for img in x_test])

Now we'll create a linear SVM (an SVM with a linear kernel) and check its classification performance. Just run the next cell.

In [None]:
# import svm from sklearn
from sklearn import svm

# Train an SVM with a linear kernel; C is the regularization parameter
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(x_train, y_train)
s_train = clf.score(x_train, y_train)
s_test = clf.score(x_test, y_test)
print('Performance (training):', s_train)
print('Performance (testing):', s_test)

The performance is quite good! Only a few images in the test sample are misidentified. How many of them are there? What do they look like?

In the next cell, find how many images are misidentified in x_test. Furthermore, plot a few of them (and, better, label the true digit and the predicted digit, e.g. '0->1' means a true 0 that is misidentified as 1).

Hint: use **np.where()**

Ref: [Usage of SVM classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) Check 'Methods'.
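
As a reminder of how **np.where()** behaves, here is a small sketch with made-up label arrays (`y_true` and `y_pred` are hypothetical and are not taken from the MNIST data). With a single condition, it returns a tuple holding one index array per axis, so `[0]` extracts the indices for 1-D input.

```python
import numpy as np

# Hypothetical labels, for illustration only
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])

# np.where(condition) returns a tuple with one index array per axis
mismatch = np.where(y_true != y_pred)[0]
print('indices that disagree:', mismatch)
print('number of disagreements:', mismatch.size)

# Label each disagreement as 'truth->prediction'
for i in mismatch:
    print(f'{y_true[i]}->{y_pred[i]}')
```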

In [None]:
# Find misidentified entries in x_test
# Compare the prediction from the classifier (clf) with the ground truth (y_test)
y_pred_svm = clf.predict(x_test)
misid = np.where(y_test != y_pred_svm)[0]
print('Number of misidentified images:', misid.size)

# x_test now holds the two averages, so recover the original images from mnist
img_test = mnist['x_test'][mnist['y_test']<=1]

# Plot up to 6 misidentified images, labelled 'truth->prediction'
fig = plt.figure(figsize=(3,5), dpi=80)
for i, idx in enumerate(misid[:6]):
    plt.subplot(3,2,i+1)
    plt.title(f'{y_test[idx]}->{y_pred_svm[idx]}')
    plt.axis('off')
    plt.imshow(img_test[idx], cmap='gray')
plt.tight_layout()
plt.show()

We can visualize the SVM 'border line' that separates 0's and 1's using a filled contour plot. Run the cell below to see the plot. Please make sure you understand what the code does for making this plot.
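
For a linear kernel, the border line can also be read directly from the fitted model: it is the set of points where `w[0]*x0 + w[1]*x1 + b = 0`, with `w` taken from `coef_` and `b` from `intercept_`. The sketch below checks this on synthetic, well-separated 2D data (not the MNIST averages); `X`, `y`, and `model` are illustrative names.

```python
import numpy as np
from sklearn import svm

# Synthetic, well-separated 2D data standing in for the two averages
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.3, size=(50, 2)),
               rng.normal([3.0, 3.0], 0.3, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = svm.SVC(kernel='linear', C=1.0).fit(X, y)

# Weights and offset of the separating line (available for linear kernels only)
w = model.coef_[0]
b = model.intercept_[0]

# The sign of w.x + b decides the class: positive -> class 1, negative -> class 0
scores = X @ w + b
manual_pred = (scores > 0).astype(int)
print('manual prediction agrees with predict():',
      np.array_equal(manual_pred, model.predict(X)))
```

The filled contour plot shows the same information: the color changes where `w.x + b` changes sign.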

In [None]:
# Plotting the plane found by SVM on the all-mean v.s. center-mean 2D plane
fig = plt.figure(figsize=(6,6), dpi=80)

# 'Sampling points' on the 2D plane
xv, yv = np.meshgrid(np.linspace(0.,0.45,100),np.linspace(-0.05,1.05,100))

# ravel = reshape(-1) returns 1-D
# np.c_: add the arrays along the 2nd axis 
# prediction made on each point of (all_mean, center_mean)
zv = clf.predict(np.c_[xv.ravel(), yv.ravel()])

# contourf: plot 'filled' contour
plt.contourf(xv, yv, zv.reshape(xv.shape),alpha=.3, cmap='Blues')

# overlay scatter plots of x_train data
plt.scatter(x_train[:,0][y_train==0], x_train[:,1][y_train==0], c = 'y', s=5, alpha=0.8)
plt.scatter(x_train[:,0][y_train==1], x_train[:,1][y_train==1], c = 'g', s=5, alpha=0.8)
plt.show()

## Decision Tree with two features

Now we build a decision tree to classify 0's and 1's with the two averages. We will use the confusion matrix on the **test sample** to judge its performance. Run the cell below to see what happens!
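
For reference, here is how to read a confusion matrix from scikit-learn, sketched on made-up labels (the lists below are hypothetical). Entry `cm[i, j]` counts the samples whose true label is `i` and whose predicted label is `j`, so correct predictions sit on the diagonal.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Accuracy is the diagonal sum divided by the total number of samples
accuracy = cm.trace() / cm.sum()
print('accuracy:', accuracy)
```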

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(x_train , y_train)

# Compare the prediction and the ground truth using x_test
y_pred = dtree.predict(x_test)

from sklearn.metrics import confusion_matrix
cmdtree = confusion_matrix(y_test,y_pred)
# Print the confusion matrix
cmdtree

The cell below prints out the score (mean accuracy) of the decision tree on the training sample and on the test sample.

[Usage of Decision Tree Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

In [None]:
# Score of Decision Tree 
print ('Performance on training data:', dtree.score(x_train, y_train))
print ('Performance on test data:', dtree.score(x_test, y_test))

You'll see that you reach 100% accuracy on the training data, i.e. the decision tree makes *no* mistakes on any of the 0's and 1's in the training data. How does this happen? Please make a contour plot in the cell below showing the border line from the decision tree.

In [None]:
# Contour from the decision tree
# Plot the regions found by the decision tree on the all-mean vs. center-mean 2D plane
fig = plt.figure(figsize=(6,6), dpi=80)

# 'Sampling points' on the 2D plane
xv, yv = np.meshgrid(np.linspace(0.,0.45,100),np.linspace(-0.05,1.05,100))

# ravel = reshape(-1) returns 1-D
# np.c_: add the arrays along the 2nd axis 
# prediction made on each point of (all_mean, center_mean)
zv = dtree.predict(np.c_[xv.ravel(), yv.ravel()])

# contourf: plot 'filled' contour
plt.contourf(xv, yv, zv.reshape(xv.shape),alpha=.3, cmap='Blues')

# overlay scatter plots of x_train data
plt.scatter(x_train[:,0][y_train==0], x_train[:,1][y_train==0], c = 'y', s=5, alpha=0.8)
plt.scatter(x_train[:,0][y_train==1], x_train[:,1][y_train==1], c = 'g', s=5, alpha=0.8)
plt.show()

You'll see that the contour plot from the decision tree looks very 'unnatural', which suggests the decision tree is overfitting the training data. Although it still performs well on the test sample of 0's and 1's, we'll see later an example where a decision tree fails to generalize to the test sample.
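
One way to see why an unconstrained tree can reach 100% training accuracy is to look at how deep it grows. The sketch below uses synthetic overlapping 2D classes (not the MNIST averages) and compares a default tree with one limited by `max_depth=3`; the value 3 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic overlapping 2D classes, for illustration only
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(1.0, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

# Default tree: keeps splitting until every leaf is pure
deep = DecisionTreeClassifier(random_state=0).fit(X, y)

# Constrained tree: at most 3 levels of splits
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

for name, tree in [('unconstrained', deep), ('max_depth=3', shallow)]:
    print(name, '| depth:', tree.get_depth(),
          '| leaves:', tree.get_n_leaves(),
          '| training accuracy:', tree.score(X, y))
```

The unconstrained tree carves out a small region for nearly every awkward point, which is what produces the fragmented contour.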

## Random Forest with two features

We claim that an ensemble of decision trees can reduce overfitting. Is that the case? Let's try with a Random Forest.
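
As a rough illustration of the idea, the sketch below trains a single tree and a forest on synthetic overlapping 2D data (not the MNIST averages) and prints their accuracy on a separate test sample. In scikit-learn, each tree in the forest is by default fit on a bootstrap resample of the training data, and the forest combines the trees' predictions. The data generator `make_data` and the choice `n_estimators=100` are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_data(n):
    """Return n points per class drawn from two overlapping 2D Gaussians."""
    X = np.vstack([rng.normal(0.0, 1.0, size=(n, 2)),
                   rng.normal(1.5, 1.0, size=(n, 2))])
    y = np.array([0] * n + [1] * n)
    return X, y

X_tr, y_tr = make_data(300)
X_te, y_te = make_data(300)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# The fitted trees are stored in forest.estimators_
print('trees in the forest:', len(forest.estimators_))
print('single tree   test accuracy:', tree.score(X_te, y_te))
print('random forest test accuracy:', forest.score(X_te, y_te))
```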

In the next cell we build a random forest model. Please print out the performance (score) on the training and on the test data, and make a contour plot in the following cell.

In [None]:
# Random Forest 
from sklearn.ensemble import RandomForestClassifier
rforest = RandomForestClassifier()
rforest.fit(x_train , y_train)

# Print the scores of Random Forest
print ('Performance on training data:', rforest.score(x_train, y_train))
print ('Performance on test data:',rforest.score(x_test, y_test) )

In [None]:
# Random Forest contour plot
# Plot the regions found by the random forest on the all-mean vs. center-mean 2D plane
fig = plt.figure(figsize=(6,6), dpi=80)

# 'Sampling points' on the 2D plane
xv, yv = np.meshgrid(np.linspace(0.,0.45,100),np.linspace(-0.05,1.05,100))

# ravel = reshape(-1) returns 1-D
# np.c_: add the arrays along the 2nd axis 
# prediction made on each point of (all_mean, center_mean)
zv = rforest.predict(np.c_[xv.ravel(), yv.ravel()])

# contourf: plot 'filled' contour
plt.contourf(xv, yv, zv.reshape(xv.shape),alpha=.3, cmap='Blues')

# overlay scatter plots of x_train data
plt.scatter(x_train[:,0][y_train==0], x_train[:,1][y_train==0], c = 'y', s=5, alpha=0.8)
plt.scatter(x_train[:,0][y_train==1], x_train[:,1][y_train==1], c = 'g', s=5, alpha=0.8)
plt.show()

## Classification of all digits with SVM, Decision Tree, and Random Forest

We now apply SVM, Decision Tree, and Random Forest to classify *all* digits (0-9) in the MNIST data! To save time, we will use the first 10000 samples from the MNIST data.

In the next cell, we'll select the first 10000 samples, and treat all 784 pixels as features (using **reshape**) as we did last week.
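
As a quick reminder of how **reshape** flattens images, here is a sketch on a small placeholder array (4 blank images, not the MNIST data). Passing `-1` lets NumPy infer the remaining dimension, here 28*28 = 784.

```python
import numpy as np

# Placeholder stack of 4 blank 28x28 images
images = np.zeros((4, 28, 28))

# Keep the first axis (one row per image) and flatten the rest
flat = images.reshape(len(images), -1)
print(flat.shape)

# The flattening can be undone, e.g. for plotting an image again
restored = flat.reshape(-1, 28, 28)
print(restored.shape)
```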

In [None]:
# Select the first 10000 samples from x_train, y_train, x_test, y_test
x_full_train = mnist['x_train'][:10000]/255.
print('x full train shape:', x_full_train.shape)
y_full_train = mnist['y_train'][:10000]
x_full_test = mnist['x_test']/255.
y_full_test = mnist['y_test']

# flatten input 28*28 images as 1-D arrays. The shape should be (10000, 784) for x_train and x_test.
x_full_train = np.reshape(x_full_train,(10000,-1))
print('x full train shape now becomes:', x_full_train.shape)
x_full_test = np.reshape(x_full_test,(10000,-1))

Please fit the 'full_train' data with an SVM and compare the performances on the training and test samples.

In [None]:
# Create an SVM model and do the fit. Then compare the performances on the training and test samples

# Train an SVM with a linear kernel; C is the regularization parameter
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(x_full_train, y_full_train)
s_train = clf.score(x_full_train, y_full_train)
s_test = clf.score(x_full_test, y_full_test)

print('All-digit Performance (training):', s_train)
print('All-digit Performance (testing):', s_test)

Looks pretty good! What images (digits) are misidentified? Please plot a couple of mis-identified images from the x_full_test data, labeling the true and the predicted digits.
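
Besides plotting individual images, it can help to count which (true, predicted) pairs occur most often among the mistakes. The sketch below does this on made-up labels (`y_true` and `y_pred` are hypothetical, not classifier output).

```python
from collections import Counter
import numpy as np

# Hypothetical labels, for illustration only
y_true = np.array([3, 5, 8, 3, 9, 4, 5, 3])
y_pred = np.array([3, 3, 8, 5, 4, 4, 3, 8])

# Boolean mask of the misidentified entries
wrong = y_true != y_pred

# Count each (true, predicted) pair among the mistakes
pairs = Counter(zip(y_true[wrong].tolist(), y_pred[wrong].tolist()))
for (t, p), n in pairs.most_common():
    print(f'{t}->{p}: {n}')
```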

In [None]:
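# Find and plot a couple of misidentified images from x_full_test
# Label the true digit and the predicted digit as 'truth->prediction'
y_full_pred_svm = clf.predict(x_full_test)
unmatched = np.where(y_full_test != y_full_pred_svm)[0]
print('Number of misidentified images:', unmatched.size)

# Plot up to 16 misidentified images; each row of x_full_test is reshaped back to 28x28
fig = plt.figure(figsize=(8,8), dpi=80)
for i, idx in enumerate(unmatched[:16]):
    plt.subplot(4,4,i+1)
    plt.title(f'{y_full_test[idx]}->{y_full_pred_svm[idx]}')
    plt.axis('off')
    plt.imshow(x_full_test[idx].reshape(28,28), cmap='gray')
plt.tight_layout()
plt.show()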

Now let's fit the 10-digit data using a Decision Tree, and compare the performances on the training and test samples. What do you find?

In [None]:
# Create and fit a Decision Tree with all digits
dtree = DecisionTreeClassifier()
dtree.fit(x_full_train, y_full_train)

# Compare the prediction and the ground truth using x_test
#y_full_pred = dtree.predict(x_full_test)


# Score of Decision Tree (should see the outcome of overtraining)
print('All-digit Performance (training):', dtree.score(x_full_train, y_full_train))
print('All-digit Performance (testing):', dtree.score(x_full_test, y_full_test))

While we have a perfect performance on the training data, the accuracy is only about 80% for the test sample, which is clearly a sign of overfitting (sometimes called overtraining).

Will using Random Forest be better? Let's find out in the next cell.

In [None]:
# Create and fit a Random Forest with all digits
rforest = RandomForestClassifier()
rforest.fit(x_full_train , y_full_train)

# Score of Random Forest 
print('All-digit Performance (training):', rforest.score(x_full_train, y_full_train))
print('All-digit Performance (testing):', rforest.score(x_full_test, y_full_test) )