# PHYS591000 in 2022
# inClass-exercise 02
---

## Add the MNIST dataset to the notebook

1. Click "+ Add data" in the upper right corner, search for 'mnist npz', then select the first result.
2. Run the first cell, which prints the path of the MNIST dataset.



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
#import plotting
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Load the MNIST file and compute the means

Let's load the MNIST file. This time we'll pick the 0's and 1's in one go.  

In [None]:
mnist = np.load('/kaggle/input/mnist-numpy/mnist.npz')

# Select the 0's and 1's together and scale the pixel values to [0, 1]
x_train = mnist['x_train'][mnist['y_train']<=1]/255.
y_train = mnist['y_train'][mnist['y_train']<=1]

# Check how many samples we have in x_train and y_train
print(x_train.shape)
print(y_train.shape)


Next we calculate the means (average pixel densities). 
We will use a slightly different expression this time, which will be useful later.

In [None]:
# Average pixel density for the full image
x_train_fullavg = np.array([img.mean() for img in x_train])

# As a check, plot the full average density for 0 and 1
# Is it the same as your plot from the Week 01 exercise?
mean0 = x_train_fullavg[y_train==0]
mean1 = x_train_fullavg[y_train==1]

fig = plt.figure(figsize=(6,6), dpi=80)
plt.hist(mean0, bins=50, color='y', label='mean0')
plt.hist(mean1, bins=50, color='g', alpha=0.8, label='mean1')
plt.legend()
plt.show()

Now we'll calculate one more feature, the average density in the center of the image. The 'center' is defined as the 6x8 pixels (6 columns wide, 8 rows tall) in the center of the 28x28 image.
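
As a quick illustration of the slicing involved, here is a minimal sketch on a made-up image (`img` below is a toy array, not an MNIST sample). It assumes the center block is rows 10 to 17 and columns 11 to 16, which matches the ranges used in the cell that follows.

```python
import numpy as np

# Toy 28x28 image: all zeros except the assumed center block, which is set to 1
img = np.zeros((28, 28))
img[10:18, 11:17] = 1.0

# Slice axis-0 (rows) and axis-1 (columns) at the same time
center = img[10:18, 11:17]
print(center.shape)   # 8 rows by 6 columns

# Average over the center block only, and over the full image
center_avg = center.mean()
full_avg = img.mean()
print(center_avg, full_avg)
```

The center average is 1 here because every pixel in the block is 1, while the full average is diluted by the surrounding zeros.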

In [None]:
# Calculate the 'center average'
# Hint: Take the mean over a range in axis-0 and axis-1 for each 'img' in x_train
# Hint2: For example, the range for axis-0 of each img is [10:18]

x_train_cenavg = np.array([img[10:18,11:17].mean() for img in x_train])

cen_mean0 = x_train_cenavg[y_train==0]
cen_mean1 = x_train_cenavg[y_train==1]

# Plot the center average density for 0 and 1
# Does the plot make sense?
fig = plt.figure(figsize=(6,6), dpi=80)
plt.hist(cen_mean0, bins=50, color='y', label='cen_mean0')
plt.hist(cen_mean1, bins=50, color='g', alpha=0.8, label='cen_mean1')
plt.legend()
plt.show()

We can compare the ROC curves for each feature using Scikit-learn. The ROC curves should be computed with the **test data**. 

A ROC curve shows the true positive rate (tpr) against the false positive rate (fpr) as the threshold (thr) on the classifier 'score' is varied. In our case, the 'score' is simply the average pixel density. 

[Learn more about ROC curve in Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html)
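
To make the roles of the score, the thresholds and `pos_label` concrete, here is a minimal sketch on four made-up samples (the labels and scores below are illustrative, not MNIST values). It also shows that swapping the positive class turns the AUC into one minus its value.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy labels and scores, chosen only for illustration
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Treat class 1 as positive: larger scores should mean "more like 1"
fpr, tpr, thr = roc_curve(y_true, scores, pos_label=1)
roc_auc = auc(fpr, tpr)
print('fpr:', fpr)
print('tpr:', tpr)
print('AUC (pos_label=1):', roc_auc)

# Treat class 0 as positive with the same scores
fpr0, tpr0, thr0 = roc_curve(y_true, scores, pos_label=0)
roc_auc0 = auc(fpr0, tpr0)
print('AUC (pos_label=0):', roc_auc0)
```

An AUC well below 0.5 therefore usually means the score is informative but the positive class was chosen the wrong way around.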

In [None]:
# Prepare the test samples of 0 and 1 as we've done for the training samples

x_test = mnist['x_test'][mnist['y_test']<=1]/255.
y_test = mnist['y_test'][mnist['y_test']<=1]

# Import Scikit-learn ROC functions
from sklearn.metrics import roc_curve, auc

# ROC for full average.
# The score is simply the full average. 

y_fullavg_score = np.array([img.mean() for img in x_test])

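# roc_curve assumes that a larger score means "more like the positive class".
# The 0's have the higher full average, so set pos_label=0 here,
# even though 0 is otherwise defined as the background.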
fpr_full, tpr_full, thr_full = roc_curve(y_test, y_fullavg_score, pos_label=0)
# Calculate AUC
roc_auc_full = auc(fpr_full, tpr_full)
print('AUC for full average is:',roc_auc_full)

# Plot ROC
plt.figure(figsize=(5,5))
plt.title('Receiver Operating Characteristic')
plt.plot(fpr_full,tpr_full, color='red',label = 'All mean AUC = %0.2f' % roc_auc_full)
# Add a diagonal line representing the ROC from random choice
plt.plot([0, 1], [0, 1],linestyle='--', c='grey')
plt.legend(loc = 'lower right')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.grid()
plt.show()

Please plot the ROC curve from the center average and compare it to the one from the full average (plot both ROC curves on the same plot).

In [None]:
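# Score from center average and compute ROC
y_cenavg_score = np.array([img[10:18,11:17].mean() for img in x_test])

# For labels in {0, 1} the default positive class is 1; it is written out here for clarity
fpr_cen, tpr_cen, thr_cen = roc_curve(y_test, y_cenavg_score, pos_label=1)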

# Calculate AUC
roc_auc_cen = auc(fpr_cen, tpr_cen)
print('AUC for center average is:',roc_auc_cen)

# Plot ROC for the two averages on the same plot
plt.figure(figsize=(5,5))
plt.title('Receiver Operating Characteristic')
plt.plot(fpr_full,tpr_full, color='red',label = 'All mean AUC = %0.2f' % roc_auc_full)
plt.plot(fpr_cen,tpr_cen, color='blue',label = 'Center mean AUC = %0.2f' % roc_auc_cen)
# Add a diagonal line representing the ROC from random choice
plt.plot([0, 1], [0, 1],linestyle='--', c='grey')
plt.legend(loc = 'lower right')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.grid()
plt.show()

So the center average is indeed another useful feature. Before combining the two features, let's take a closer look at the relation between the two averages.

In [None]:
# What is the correlation of the two features?
R = np.corrcoef(x_train_fullavg, x_train_cenavg)
print(R)

# Make a 2D plot of full average vs. center average
fig = plt.figure(figsize=(6,6), dpi=80)
plt.scatter(mean0, cen_mean0, c='y', s=20, label='True 0')
plt.scatter(mean1, cen_mean1, c='g', s=20, label='True 1')
plt.legend()
plt.show()
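
For reading the printed matrix: `np.corrcoef` returns a symmetric 2x2 array with ones on the diagonal, and the off-diagonal entry is the Pearson correlation coefficient of the two inputs. A minimal sketch on synthetic data (the arrays `a` and `b` are made up for illustration) compares it with the textbook formula:

```python
import numpy as np

# Two synthetic, negatively correlated features (illustrative only)
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = -0.5 * a + rng.normal(scale=0.1, size=1000)

# 2x2 correlation matrix; R[0, 1] == R[1, 0] is the Pearson coefficient
R = np.corrcoef(a, b)
print(R)

# Same coefficient from the definition: covariance / (std_a * std_b)
r_manual = np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std())
print(r_manual)
```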


## LDA from Scikit-learn 

Let's use Scikit-learn to perform the LDA with the two averages. First we'll prepare the inputs for LDA: The training sample should be an array of \[full average, center average\] for each sample in the training data.

In [None]:
# Prepare the training sample such that each entry is [full average, center average]
x_train_new = np.array([[img.mean(),img[10:18,11:17].mean()] for img in x_train])

# Prepare the test sample as well
x_test_new = np.array([[img.mean(),img[10:18,11:17].mean()] for img in x_test])


# Check the shapes: x_train_new should be (12665, 2)
print(x_train_new.shape)
print(x_test_new.shape)
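
As an optional alternative to the list comprehension, the same two features can be computed without a Python loop by averaging over the image axes. This is a sketch on random stand-in data (`imgs` is hypothetical, with the same layout as `x_train`), checked against the loop version:

```python
import numpy as np

# Stand-in for x_train: 5 random "images" of shape 28x28
rng = np.random.default_rng(0)
imgs = rng.random((5, 28, 28))

# Loop version, one [full average, center average] pair per image
loop = np.array([[img.mean(), img[10:18, 11:17].mean()] for img in imgs])

# Vectorized version: average over axis 1 and axis 2 of the whole stack
full_avg = imgs.mean(axis=(1, 2))
cen_avg = imgs[:, 10:18, 11:17].mean(axis=(1, 2))
vec = np.column_stack([full_avg, cen_avg])

print(vec.shape)
print(np.allclose(loop, vec))
```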


Next we import LDA from Scikit-learn. We will train the LDA, and compare the performance on the training and the test data.

[Learn more about LDA in Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html)

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Training w/ LDA from skLearn
clf = LinearDiscriminantAnalysis()
f_train = clf.fit_transform(x_train_new, y_train)

# Compare the performance of the LDA on the training and the test data
# Are they similar?
s_train = clf.score(x_train_new, y_train)
s_test = clf.score(x_test_new, y_test)
print('Performance (training):', s_train)
print('Performance (testing):', s_test)
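
For classifiers in Scikit-learn, `score` returns the mean accuracy, i.e. the fraction of samples whose predicted label matches the true label. A minimal sketch on synthetic two-feature data (the Gaussian blobs and the names `X`, `y`, `clf_toy` are made up for illustration) verifies this by hand:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two well-separated Gaussian blobs as toy classes 0 and 1
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

clf_toy = LinearDiscriminantAnalysis().fit(X, y)

# score() versus the accuracy computed directly from the predictions
acc_score = clf_toy.score(X, y)
acc_manual = np.mean(clf_toy.predict(X) == y)
print(acc_score, acc_manual)
```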

The code below shows what the mapped distribution from LDA looks like:

In [None]:
# Plotting LDA results
fig = plt.figure(figsize=(6,6), dpi=80)
plt.hist(f_train[y_train==0], bins=50, color='y', label='True 0')
plt.hist(f_train[y_train==1], bins=50, color='g', alpha=0.5, label='True 1')
plt.legend()
plt.show()

Again we can plot the ROC curve along with the AUC from the LDA trained by Scikit-learn. Does it perform better than using a single feature?

In [None]:
# The score from LDA
y_pred = clf.decision_function(x_test_new)

# Calculate the AUC and make the ROC plot
fpr_lda, tpr_lda, thr_lda = roc_curve(y_test, y_pred, pos_label=1)
# Calculate AUC
roc_auc_lda = auc(fpr_lda, tpr_lda)
print('AUC for LDA is:',roc_auc_lda)

# Plot ROC
plt.figure(figsize=(5,5))
plt.title('Receiver Operating Characteristic')
plt.plot(fpr_lda,tpr_lda, color='red',label = 'LDA AUC = %0.2f' % roc_auc_lda)
# Add a diagonal line representing the ROC from random choice
plt.plot([0, 1], [0, 1],linestyle='--', c='grey')
plt.legend(loc = 'lower right')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.grid()
plt.show()
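
To see what kind of score LDA produces, note that `decision_function` is a linear combination of the input features: the fitted `coef_` times the features, plus `intercept_`. The sketch below checks this on synthetic data (the blobs and the names `X`, `y`, `lda` are made up for illustration), and also checks that a positive score corresponds to predicting class 1.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy two-class data with two features per sample
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)

lda = LinearDiscriminantAnalysis().fit(X, y)

# Score from Scikit-learn versus the explicit linear combination
score = lda.decision_function(X)
manual = X @ lda.coef_[0] + lda.intercept_[0]
print(np.allclose(score, manual))

# In the two-class case, score > 0 means the second class (here 1) is predicted
pred_from_score = (score > 0).astype(int)
print(np.array_equal(pred_from_score, lda.predict(X)))
```

This is why the LDA output can be passed to `roc_curve` like the single-feature averages: it is one number per sample, with larger values pointing toward class 1.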