# PDP Problem Set 1

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
sns.set_context('notebook', font_scale=1.25)
%matplotlib inline

## Section 1: MNIST Dataset
The [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database) is a large database of handwritten digits. Each digit has been preprocessed to be in black-and-white and to fit into a 28x28 pixel bounding box. The challenge of the MNIST dataset is to train a machine learning model that can accurately classify the digit class (0-9) from a given image. Some example images from the MNIST dataset are presented below:

<img src="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png" />


For the sake of this exercise, we will only use 1000 example images per digit. We load the data below. All of the digit images are stored in **data**, whereas the categories are stored in **labels**. 

In [None]:
## Load data from compressed file.
npz = np.load('mnist.npz')

## Extract data.
data, labels = npz['data'], npz['labels']

print('data:   shape = (%s, %s)' %data.shape)
print('labels: shape = (%s,)' %labels.shape)

To confirm the accuracy of the labels, plot a composite image of each digit 0-9. In other words, plot an average of each digit using your favorite heatmap function.

**Note:** To plot the 2d image, you will need to reshape the second axis from *shape=(784)* to *shape=(28,28)*.
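
The averaging and reshaping steps can be sketched as follows. This is only a minimal illustration: the arrays are random stand-ins rather than the real MNIST file, it assumes `data` has shape (n_images, 784) and `labels` holds integer classes 0-9, and `composites` is an illustrative name.

```python
import numpy as np

# Random stand-ins for the real arrays (assumption: data is (n_images, 784)
# and labels is (n_images,) with integer classes 0-9).
rng = np.random.default_rng(0)
data = rng.random((100, 784))
labels = np.repeat(np.arange(10), 10)

# Average all images that share a label, one row per digit.
composites = np.stack([data[labels == d].mean(axis=0) for d in range(10)])

# Reshape the second axis from 784 pixels to a 28x28 image.
composites = composites.reshape(10, 28, 28)

print(composites.shape)
```

Each `composites[d]` is then a 2d array that can be passed to a heatmap function such as `plt.imshow` or `sns.heatmap`.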

## Section 2: Logistic (Multinomial) Classification

In this next section, we will predict the category label using a [multinomial classifier](https://en.wikipedia.org/wiki/Multinomial_distribution). Multinomial classification is an extension of logistic regression to handle more than 2 categories. As such, it can only model linear relationships between the inputs (here, pixel values) and the log-odds of each category.
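
To make the "linear" part concrete, here is a rough sketch of what a multinomial classifier computes at prediction time: one linear score per class, followed by a softmax. The weights below are random stand-ins rather than fitted values, and the names `X`, `W`, `b`, and `softmax` are illustrative assumptions, not part of the exercise.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical stand-ins: 4 images, 784 pixels, 10 classes.
rng = np.random.default_rng(0)
X = rng.random((4, 784))
W = rng.normal(scale=0.01, size=(10, 784))  # one weight vector per class
b = np.zeros(10)                            # one intercept per class

scores = X @ W.T + b              # linear in the pixel values
probs = softmax(scores)           # class probabilities; each row sums to 1
predicted = probs.argmax(axis=1)  # most probable class per image
```

Because each class has exactly one weight per pixel, every row of `W` can itself be reshaped to 28x28 and viewed as an image.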

Import the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) classifier from *scikit-learn* and fit it to the dataset using the following parameters:

- C=0.01
- multi_class='multinomial'
- solver='lbfgs'
- max_iter=1000

Extract the classification weights from `fit.coef_`. Plot the classification weights for each of the ten digits, just as you did for the average digit images above, and answer the questions below.

**Q:** Describe the classification weights for each digit. Are the weights isomorphic to their respective digits?

> &nbsp;

**Q:** What does this tell us about the "representations" returned by the multinomial classifier? Is it "learning" something new about the digits?

> &nbsp;

## Section 3: Simple Neural Networks

In this next section, we will predict the category label using a [multilayer perceptron](https://scikit-learn.org/stable/modules/neural_networks_supervised.html) (i.e. neural network). Unlike the multinomial classifier, a neural network with a nonlinear hidden layer can learn nonlinear relationships.
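
As a rough picture of where the nonlinearity enters, the sketch below runs a forward pass through a network with one hidden layer of five logistic units and a softmax output. All weights are random stand-ins (not a fitted model), and the names `W1`, `b1`, `W2`, `b2` are illustrative. The shapes follow the (inputs, outputs) layout that scikit-learn uses for each layer's weight matrix.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Random stand-ins (assumptions, not fitted weights): 4 images, 784 pixels,
# 5 hidden units, 10 output classes.
rng = np.random.default_rng(0)
X = rng.random((4, 784))
W1, b1 = rng.normal(scale=0.1, size=(784, 5)), np.zeros(5)
W2, b2 = rng.normal(scale=0.1, size=(5, 10)), np.zeros(10)

# First layer: a linear map followed by the nonlinear logistic activation.
hidden = sigmoid(X @ W1 + b1)   # shape (4, 5)

# Second layer: a linear map from hidden units to class scores, then softmax.
scores = hidden @ W2 + b2       # shape (4, 10)
scores = scores - scores.max(axis=1, keepdims=True)
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
```

Without the `sigmoid` call the two layers would collapse into a single linear map, which is why the activation function is what separates this model from the multinomial classifier.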

Import the [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier) classifier from *scikit-learn* and fit it to the dataset using the following parameters:

- hidden_layer_sizes=(5,)
- activation='logistic'
- solver='sgd'
- alpha=1
- learning_rate='constant'
- learning_rate_init=0.1
- max_iter=1000
- n_iter_no_change=100
- momentum=0
- random_state=0

Extract the weights of the first layer from `fit.coefs_[0]` (for `MLPClassifier`, `coefs_` is a list with one weight matrix per layer). Plot the weights for each of the five hidden units and answer the questions below.

**Q:** Describe the weights for each hidden unit. Are the weights isomorphic to particular digits?

> &nbsp;

**Q:** What does this tell us about the "representations" returned by the neural network? Is it "learning" something new about the digits?

> &nbsp;

Now, plot the weights from the second layer, `fit.coefs_[1]` (again using a heatmap), with the 5 hidden units on the x-axis and the 10 outputs on the y-axis. Answer the questions below.

**Q:** Based on what you know about the architecture of neural networks, what can we infer from the second layer of weights? In other words, what do the weights of the second layer tell us about the relationship between the "latent representations" learned in the first layer and the output labels?

> &nbsp;

One interesting way to demonstrate the relationship between the two weight layers is to run the network "in reverse". In other words, we can pretend that we passed a digit *label* to the output layer of the network and observe what 28x28 image it predicts.

A "hacky" way to run the network in reverse is to take the dot product of the two weight layers. (We say this is hacky because we are ignoring the activation functions and bias units. That is fine for this simple exercise.) Below, compute the dot product between the two layers. You will end up with 10 "predicted" images. Plot them and answer the questions below.
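
The shape bookkeeping for this step can be sketched as follows, using random stand-in matrices in place of the fitted weights. It assumes the first layer has shape (784, 5) and the second (5, 10). The min-max rescaling at the end is one possible way to put the images on a common scale, not a required part of the exercise, and the names `reverse` and `normalized` are illustrative.

```python
import numpy as np

# Random stand-ins for the two weight layers (assumed shapes).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(784, 5))   # pixels -> hidden units
W2 = rng.normal(size=(5, 10))    # hidden units -> output labels

# (784, 5) @ (5, 10) -> (784, 10); transpose so each row is one label,
# then reshape each row of 784 values into a 28x28 image.
reverse = (W1 @ W2).T.reshape(10, 28, 28)

# Optional: rescale every image to the range [0, 1] independently.
lo = reverse.min(axis=(1, 2), keepdims=True)
hi = reverse.max(axis=(1, 2), keepdims=True)
normalized = (reverse - lo) / (hi - lo)

print(reverse.shape, normalized.shape)
```

With random weights the resulting images are noise; any digit-like structure can only come from the fitted weights.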

**Q:** Do the images resemble the original digits? (The images may be clearer if you normalize each image to be on the same scale. Because we are using a "hacky" approach, do not worry if they are not perfect.)

> &nbsp;