# Instructions

1. Add your name and HW Group Number below.
2. Complete each question. Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", and delete any `raise NotImplementedError()` lines.
3. Where applicable, run the test cases *below* each question to check your work. **Note**: In addition to the test cases you can see, the instructor may run additional test cases, including using *other datasets* to validate your code.
4. Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). You can also use the **Validate** button to run all test cases.
5. Turn in your homework by going to the main screen in JupyterHub, clicking the Assignments menu, and submitting.



In [1]:
"""
Name: Vishnu Challa
HW Group Number: 40
"""

'\nName: Vishnu Challa\nHW Group Number: 40\n'

## HW3: Artificial Neural Networks


### 1) Neural Network Playground

First, go to TensorFlow's [Neural Network Playground](https://playground.tensorflow.org/). This website is an interactive, exploratory visualization of how the features, number of layers, training time, etc., influence the classification boundaries of an ANN. For now, we'll only concern ourselves with *classification* problems.

Play with the visualization, and then answer the following questions below.

#### Scenarios

1. Using the default network topology, try training the network with the different activation functions (ReLU, Tanh, Sigmoid, Linear). What effect does the activation function have on the training time? What effect does the activation function have on the shape of the classification boundaries?
2. Take a look at [this setup](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=xor&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=2,2&seed=0.21855&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false). Train until the classification boundary converges. This is one of the rare cases where the nodes in an ANN can be (semi) interpreted. What do the nodes in the first hidden layer represent? What about the second hidden layer? How do you think the ANN uses these learned "features" to make a decision?

#### Exploration
For each of the following questions:
* Make a prediction before you begin exploring and testing.
* Include a link to your scenario.
* Explain why you think this scenario has this property.

**Questions**

3. Find a scenario where a simple model (fewer neurons) outperforms a complex model. (Think in terms of overfitting.)
4. Find a scenario where a model with no hidden layers performs well.
5. Find a scenario where a model with no hidden layers performs poorly no matter the features.
6. Find a scenario where it takes a lot of training time to get a correct solution.

1. With the default network topology, the activation functions rank ReLU > Tanh > Sigmoid > Linear in performance, i.e. ReLU performs best and Linear performs worst. The training time taken by each function is as follows: <br>
ReLU = 000,355 Epochs <br>
Tanh = 000,530 Epochs <br>
Sigmoid = 005,030 Epochs <br>
Linear = Infinite Epochs <br>
The Linear function does perform better on linearly separable data. Tanh and Sigmoid perform almost the same on any kind of randomly scattered data, whereas ReLU strikes a balance between Linear on one side and Sigmoid and Tanh on the other for any kind of randomly shaped data.<br><br>
2. The first hidden layer sets the initial linear boundaries used to classify the data. The second hidden layer builds its boundaries from the outputs of the first hidden layer and tries to optimize the model further so that it makes better decisions on the test data. The ANN uses these learned features by adjusting the weights on its inputs, and keeps adjusting them to minimize the training loss. This might lead to overfitting in some cases, so the number of epochs in the training phase should always be limited; that way we can settle on a balanced model to use for predictions. <br><br>
3. For linearly separable data, a complex model that has only one neuron in the first hidden layer and more neurons in the later layers takes an infinite amount of time to train.<br>
   Simple model: https://tinyurl.com/bddp4yaw <br>
   Complex model: https://tinyurl.com/yckm5kvv <br>
   Both models use the Sigmoid function, and the simple model outperforms the complex one when linearly separable data is fed as input.<br>
   This is because, in the complex model, the single neuron in the first hidden layer sets an initial linear boundary, and the later layers take more time to set linear boundaries on top of it. That is why the complex model performs badly.<br><br>
4. For linearly separable data, a simple model with no hidden layers performs better than any ANN that has hidden layers.<br>
   Model with no hidden layers: https://tinyurl.com/ztere6cx <br>
   Model with hidden layers: https://tinyurl.com/ntkccp4v <br>
   The model with no hidden layers performs better because it is easy to set a boundary for linearly separable data. Increasing the number of hidden layers makes the model unnecessarily complex, as it tries to set more boundaries.<br><br>
5. For randomly shaped data, a model with no hidden layers is not enough: a hidden layer is needed to make predictions when ReLU is used as the activation function.<br>
   Model with no hidden layers: https://tinyurl.com/4bem4v23 <br>
   Model with hidden layers: https://tinyurl.com/bdfskk4s <br>
   In this case the model with hidden layers performed better because the input data had a complex shape, and the hidden layers use activation functions to set the boundaries and adjust the weights on all the input features. It is therefore easier to classify this data with an ANN that has hidden layers.<br><br>
6. For randomly shaped data, a simple neural network using the ReLU activation takes a long time to train.<br>
   Model: https://tinyurl.com/35u8jw35<br>
   In this case the model takes an infinite amount of time to train because the input data is randomly shaped and the ReLU activation continuously tries to set boundaries that bring no change to the updated weights.
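
The role of hidden layers discussed in the answers above can also be checked outside the Playground. The following is a minimal sketch, not the Playground's own code: it assumes a synthetic XOR-style dataset (the hypothetical helper `make_xor`) and uses scikit-learn's `LogisticRegression` as a stand-in for a network with no hidden layers. It compares that model on the raw features, the same model with a hand-made product feature, and a small tanh network.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def make_xor(n):
    """Hypothetical helper: points in [-1, 1]^2, labelled 1 when both
    coordinates share a sign (an XOR-style pattern)."""
    X = rng.uniform(-1, 1, size=(n, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)
    return X, y

def add_product(X):
    """Append the product of the two coordinates as a third feature."""
    return np.column_stack([X, X[:, 0] * X[:, 1]])

X_tr, y_tr = make_xor(500)
X_te, y_te = make_xor(1000)

# No hidden layer, raw features: only a single straight boundary is available.
raw_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# No hidden layer, but with the product feature the classes become separable.
feat_acc = (LogisticRegression()
            .fit(add_product(X_tr), y_tr)
            .score(add_product(X_te), y_te))

# One small tanh hidden layer on the raw features.
mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", max_iter=1000, random_state=0)
mlp_acc = mlp.fit(X_tr, y_tr).score(X_te, y_te)

print(f"no hidden layer, raw features:    {raw_acc:.2f}")
print(f"no hidden layer, product feature: {feat_acc:.2f}")
print(f"one tanh hidden layer:            {mlp_acc:.2f}")
```

On raw features the linear model should stay near chance level, while either a suitable input feature or a hidden layer should let the model separate the classes.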

## 2) Training and Testing a Neural Network (Group)

For this problem, you'll be looking at a subset of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits), which contains images of hand-written digits: 10 classes where each class refers to a digit.

Each data entry is an 8x8 input matrix in which each element is an integer in the range 0..16. The matrix is flattened in the dataset.
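
To make the flattening concrete, here is a small sketch that reshapes one row of the dataset back into its 8x8 grid. The variable names (`digits`, `flat`, `image`) are chosen for illustration only and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

flat = digits.data[0]          # one sample: 64 pixel intensities in a row
image = flat.reshape(8, 8)     # the same values laid out as an 8x8 grid

print(digits.data.shape)       # (n_samples, 64)
print(image.astype(int))       # the digit as an 8x8 integer matrix
print(digits.target[0])        # its class label
```

scikit-learn also exposes the unflattened version directly as `digits.images`, which should match the reshaped rows.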


For this question, **you have enough experience to do the entire model pipeline yourself**. That means *loading the data, creating splits, scaling the data, training and tuning the model, and evaluating the model.*

In [62]:
#Import necessary libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

random_state = 42

### Step 1: Load the data. Use `np.unique()` to check the class balance.

In [63]:
from sklearn.datasets import load_digits
df = load_digits()

In [64]:
df.data

array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])

In [65]:
# Get a distribution of the class label (target)
np.unique(df.target, return_counts=True)

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([178, 182, 177, 183, 181, 182, 181, 179, 174, 180]))

### Step 2: Split the data into X (features) and Y (class)

Assign the variables below to split the dataset into X (features) and Y (target)

In [66]:
X = Y = None
X = df.data
Y = df.target

### Step 3: Create your train/test split. Use the provided random_state.

**Note**: You should use a `train_size` of 0.1, or 10%. Normally we would want to use more of our data for training, but since ANNs are computationally expensive, we're keeping the training dataset small.
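
As a quick sanity check on the split sizes, the sketch below applies the same `train_size` to a plain index array instead of the real features. The sample count of 1797 is the sum of the class counts printed in Step 1; `indices`, `train_idx` and `test_idx` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 1797                      # total number of digits (sum of class counts)
indices = np.arange(n_samples)        # stand-in for the rows of X

train_idx, test_idx = train_test_split(indices, train_size=0.1, random_state=42)

# 10% of 1797 is 179.7, so the training share ends up as 179 rows
# and the remaining 1618 rows form the test set.
print(len(train_idx), len(test_idx))
```

These sizes are the ones the shape assertions after the split expect.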

In [67]:
from sklearn.model_selection import train_test_split

X_train = X_test = y_train = y_test = None
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.1, random_state=random_state)

In [68]:
assert X_train.shape == (179, 64)
assert y_train.shape == (179, )
assert X_test.shape == (1618, 64)
assert y_test.shape == (1618, )

### Step 4: Use a [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to normalize the image data. 

Pixel data, like other data we've encountered, should often be scaled before classification. While in practice scaling image data can be more complex, in this exercise we'll continue to use the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

Fit the scaler only on the training X features, and then apply it to both the training and test X features. We do this because, in practice, we wouldn't be able to see the data in the test X, so it shouldn't affect the feature transformation. We therefore use only X_train to fit the transformation.
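
The following sketch, on a tiny made-up array rather than the digits data, shows what "fit on train, apply to both" amounts to: the mean and standard deviation come from the training rows only and are reused for the test rows. It also illustrates why a column that is constant in the training set (such as an always-blank pixel) comes out with a standard deviation of 0 rather than 1. The arrays and the `manual_*` names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: the first column is constant in the training set.
X_train = np.array([[0., 2., 10.],
                    [0., 4., 20.],
                    [0., 6., 30.]])
X_test = np.array([[0., 8., 20.],
                   [1., 2., 40.]])

scaler = StandardScaler().fit(X_train)     # statistics from the training rows only

# The same transformation written out by hand.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
std[std == 0] = 1.0                        # constant columns are left unscaled

manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std        # test rows reuse the training statistics

print(np.allclose(manual_train, scaler.transform(X_train)))
print(np.allclose(manual_test, scaler.transform(X_test)))
print(manual_train.std(axis=0))            # 0 for the constant column, 1 elsewhere
```

Because the test rows are scaled with training statistics, their per-column mean and standard deviation will generally not be exactly 0 and 1.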

In [69]:
from sklearn.preprocessing import StandardScaler

# Assign these variables the standardized training and test datasets
X_stand_train = X_stand_test = None
scaler = StandardScaler()
X_stand_train = scaler.fit_transform(X_train)
X_stand_test = scaler.transform(X_test)

In [70]:
train_zero_indices = [0, 7, 23, 31, 32, 39]
test_zero_indices = [0, 32, 39]

for i in range(X_stand_train.shape[1]):
    if i in train_zero_indices:
        np.testing.assert_almost_equal(np.mean(X_stand_train[:, i]), 0)
        np.testing.assert_almost_equal(np.std(X_stand_train[:, i]), 0)
    else:
        np.testing.assert_almost_equal(np.mean(X_stand_train[:, i]), 0)
        np.testing.assert_almost_equal(np.std(X_stand_train[:, i]), 1)
        
    if i in test_zero_indices:
        np.testing.assert_almost_equal(np.mean(X_stand_test[:, i]), 0)
        np.testing.assert_almost_equal(np.std(X_stand_test[:, i]), 0)
    else:
        assert np.mean(X_stand_test[:, i]) != 0
        assert np.std(X_stand_test[:, i]) != 1

### Step 5:  Train an MLP with default hyperparameters.

For the following, you'll be using sklearn's built in Multi-layer Perceptron classifier [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

Use the default hyperparams aside from `max_iter`. `max_iter` is the maximum number of training iterations the ANN goes through before it is forced to stop. The default `max_iter=200` takes too long for our data currently. 

**Use random_state as the random_state and max_iter=20**. The default parameters will use a single hidden layer.
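
To see the effect of `max_iter`, the sketch below trains a default MLP on a small slice of the raw digits data (an arbitrary choice for speed, not the scaled split used in this assignment) and inspects how many iterations actually ran. The name `demo_clf` is illustrative.

```python
import warnings
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = load_digits(return_X_y=True)

with warnings.catch_warnings():
    # Stopping at max_iter typically raises a ConvergenceWarning; hide it here.
    warnings.simplefilter("ignore")
    demo_clf = MLPClassifier(max_iter=20, random_state=42)
    demo_clf.fit(X_demo[:200], y_demo[:200])

print(demo_clf.n_iter_)              # iterations actually run (at most max_iter)
print(len(demo_clf.loss_curve_))     # one training-loss value per iteration
print(demo_clf.hidden_layer_sizes)   # default architecture: one hidden layer
```

Plotting `loss_curve_` is one way to judge whether the loss was still falling when training was cut off.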



In [71]:
from sklearn.neural_network import MLPClassifier

In [72]:
clf=None
clf = MLPClassifier(random_state=random_state, max_iter=20).fit(X_stand_train, y_train)

### Step 6:  Evaluate the model on the test dataset using a confusion matrix and a classification report

Like all classifiers, the MLP has a [`predict`](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier.predict) function that is used to make predictions on training or test data.

In [73]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [74]:
# Evaluate the classifier and assign mlp_cm to the confusion matrix of the evaluation
mlp_cm = None
y_pred = clf.predict(X_stand_test)
mlp_cm = confusion_matrix(y_test, y_pred)
mlp_cm

array([[146,   0,   3,   0,   3,   0,   0,   0,   1,   0],
       [  1,  87,  27,   4,  12,   5,   3,   2,  20,   0],
       [  0,   4, 127,   6,   2,   1,   4,   0,  10,   5],
       [  0,   2,  19, 115,   0,   2,   2,   0,   9,  17],
       [  6,   2,   3,   2, 135,   7,   0,   3,   1,   3],
       [  7,   3,  12,   0,   1, 133,   1,   0,   3,   2],
       [ 21,   0,  82,   0,  12,   6,  35,   0,  11,   2],
       [  0,   5,   1,  45,   2,  13,   0,  91,  10,   0],
       [  7,  10,  21,   5,  12,  21,   2,   0,  53,  21],
       [ 10,   2,  36,  13,   5,  17,   1,   4,  28,  51]])

In [75]:
np.testing.assert_almost_equal(mlp_cm, [[146,   0,   3,   0,   3,   0,   0,   0,   1,   0],
       [  1,  87,  27,   4,  12,   5,   3,   2,  20,   0],
       [  0,   4, 127,   6,   2,   1,   4,   0,  10,   5],
       [  0,   2,  19, 115,   0,   2,   2,   0,   9,  17],
       [  6,   2,   3,   2, 135,   7,   0,   3,   1,   3],
       [  7,   3,  12,   0,   1, 133,   1,   0,   3,   2],
       [ 21,   0,  82,   0,  12,   6,  35,   0,  11,   2],
       [  0,   5,   1,  45,   2,  13,   0,  91,  10,   0],
       [  7,  10,  21,   5,  12,  21,   2,   0,  53,  21],
       [ 10,   2,  36,  13,   5,  17,   1,   4,  28,  51]])

In [76]:
# Similarly generate a classification report for the test dataset
mlp_clf_report = None
mlp_clf_report = classification_report(y_test, y_pred)
print(mlp_clf_report)

              precision    recall  f1-score   support

           0       0.74      0.95      0.83       153
           1       0.76      0.54      0.63       161
           2       0.38      0.80      0.52       159
           3       0.61      0.69      0.65       166
           4       0.73      0.83      0.78       162
           5       0.65      0.82      0.72       162
           6       0.73      0.21      0.32       169
           7       0.91      0.54      0.68       167
           8       0.36      0.35      0.36       152
           9       0.50      0.31      0.38       167

    accuracy                           0.60      1618
   macro avg       0.64      0.60      0.59      1618
weighted avg       0.64      0.60      0.59      1618



In [77]:
assert mlp_clf_report == '              precision    recall  f1-score   support\n\n           0       0.74      0.95      0.83       153\n           1       0.76      0.54      0.63       161\n           2       0.38      0.80      0.52       159\n           3       0.61      0.69      0.65       166\n           4       0.73      0.83      0.78       162\n           5       0.65      0.82      0.72       162\n           6       0.73      0.21      0.32       169\n           7       0.91      0.54      0.68       167\n           8       0.36      0.35      0.36       152\n           9       0.50      0.31      0.38       167\n\n    accuracy                           0.60      1618\n   macro avg       0.64      0.60      0.59      1618\nweighted avg       0.64      0.60      0.59      1618\n'

In [78]:
# For comparison, generate a classification report for the *training* dataset
mlp_clf_report = None
x_pred = clf.predict(X_stand_train)
mlp_clf_report = classification_report(y_train, x_pred)
print(mlp_clf_report)

              precision    recall  f1-score   support

           0       0.89      1.00      0.94        25
           1       0.82      0.67      0.74        21
           2       0.67      0.89      0.76        18
           3       0.74      0.82      0.78        17
           4       0.78      0.95      0.86        19
           5       0.79      0.95      0.86        20
           6       0.50      0.25      0.33        12
           7       0.88      0.58      0.70        12
           8       0.63      0.55      0.59        22
           9       0.64      0.54      0.58        13

    accuracy                           0.75       179
   macro avg       0.73      0.72      0.71       179
weighted avg       0.75      0.75      0.74       179



In [79]:
assert mlp_clf_report == '              precision    recall  f1-score   support\n\n           0       0.89      1.00      0.94        25\n           1       0.82      0.67      0.74        21\n           2       0.67      0.89      0.76        18\n           3       0.74      0.82      0.78        17\n           4       0.78      0.95      0.86        19\n           5       0.79      0.95      0.86        20\n           6       0.50      0.25      0.33        12\n           7       0.88      0.58      0.70        12\n           8       0.63      0.55      0.59        22\n           9       0.64      0.54      0.58        13\n\n    accuracy                           0.75       179\n   macro avg       0.73      0.72      0.71       179\nweighted avg       0.75      0.75      0.74       179\n'

How well did the classifier do? What digit did it do best on? Which digits did it confuse the most? Do you think the classifier is likely over-fitting, underfitting or neither?

The classifier did fairly well, reaching an accuracy of 0.60 on the test set. Based on the report metrics, the model did best on predicting '0' (recall 0.95, f1-score 0.83) and was most confused when predicting '6' (recall 0.21), which it most often labeled as '2' (82 of 169 test samples). I think the classifier is neither over-fitting nor underfitting and is performing in a balanced way, because the class labels are limited to the range 0-9 and the number of iterations performed while training the neural network should be enough to train the classifier.
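
The figures in the classification report can be re-derived from the test confusion matrix shown above. The sketch below copies that matrix by hand and computes recall, precision, overall accuracy, and the most frequent confusion; the variable names are illustrative.

```python
import numpy as np

# Test-set confusion matrix from Step 6 (rows: true digit, columns: predicted digit).
cm = np.array([
    [146,   0,   3,   0,   3,   0,   0,   0,   1,   0],
    [  1,  87,  27,   4,  12,   5,   3,   2,  20,   0],
    [  0,   4, 127,   6,   2,   1,   4,   0,  10,   5],
    [  0,   2,  19, 115,   0,   2,   2,   0,   9,  17],
    [  6,   2,   3,   2, 135,   7,   0,   3,   1,   3],
    [  7,   3,  12,   0,   1, 133,   1,   0,   3,   2],
    [ 21,   0,  82,   0,  12,   6,  35,   0,  11,   2],
    [  0,   5,   1,  45,   2,  13,   0,  91,  10,   0],
    [  7,  10,  21,   5,  12,  21,   2,   0,  53,  21],
    [ 10,   2,  36,  13,   5,  17,   1,   4,  28,  51],
])

support = cm.sum(axis=1)                    # true samples per digit
recall = np.diag(cm) / support              # correct / all true samples of the digit
precision = np.diag(cm) / cm.sum(axis=0)    # correct / all predictions of the digit
accuracy = np.trace(cm) / cm.sum()

# Largest off-diagonal cell: the most frequent confusion.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_digit, predicted_digit = (int(i) for i in
                               np.unravel_index(off_diag.argmax(), off_diag.shape))

print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(accuracy, 2))
print(true_digit, predicted_digit, off_diag.max())
```

Reading the matrix this way makes it easier to see which pairs of digits are confused, which the per-class report alone does not show.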

## 3) Hyperparameters

**Hyperparams**:

ANNs have *a lot* of hyperparams. This can include simple things such as the number of layers and nodes, up to tuning the learning rate and the gradient descent algorithm used. 

This process can require a lot of experimentation and intuition gained through experience, but it can be automated to some extent using hyperparameter tuning. When we have multiple hyperparameters, we use an approach called GridSearch, where we try all combinations of various hyperparameters to find the combination that works best.

For the following, you will practice hyperparameter tuning for the [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) with sklearn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) function. You should explore different combinations of the following parameters:

* `activation`: The activation function of the ANN. Defaults to ReLU.
* `max_iter`: The ANN trains until either the loss stops improving by a specified threshold or `max_iter` iterations are reached. Warning: the more you increase this, the longer training will take! Patience is a virtue.
* `hidden_layer_sizes`: A tuple representing the structure of the hidden layers. For example, giving the tuple `(100,50)` means that there are two hidden layers: the first of size 100 and the second of size 50. The tuple `(100,)` would mean a single hidden layer of size 100.
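
As an illustration of how `hidden_layer_sizes` maps onto the network, the sketch below fits an MLP with the `(100,50)` example on the raw digits data for a few iterations and prints the shape of each weight matrix. The short training run and the name `shape_clf` are assumptions made for this demo only.

```python
import warnings
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = load_digits(return_X_y=True)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")    # a 5-iteration run will not converge
    shape_clf = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=5,
                              random_state=42).fit(X_demo, y_demo)

# One weight matrix per connection between consecutive layers:
# 64 inputs -> 100 hidden -> 50 hidden -> 10 output classes.
shapes = [w.shape for w in shape_clf.coefs_]
print(shapes)
print(shape_clf.n_layers_)             # input + 2 hidden + output
```

The input and output sizes are inferred from the data (64 pixels, 10 classes), so only the hidden layers need to be specified.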

Normally we would try many more possible combinations (and larger networks), but we've kept the list short to reduce computation time.

**Try different permutations of these hyperparams and see how they affect the classification scores of your model.**

In [80]:
# import the library
from sklearn.model_selection import GridSearchCV

In [81]:
# The parameter list you will explore
parameters = {'activation':['logistic', 'relu'], 'max_iter':[5, 10], 'hidden_layer_sizes':[(50,),(20,)]}

Now it's your turn. First initialize an MLPClassifier, making sure to **use "random_state" as the random_state**. Then feed the parameter list defined above, as well as the training data (**use "X_stand_train"**), to GridSearchCV to create a classifier with the best combination of the parameters. To do so, it uses cross-validation within the training dataset, so you never have to peek at your test dataset. It then fits the final classifier to the whole standardized training dataset.

**Note**: You should use cv=2 in your grid search, to reduce the number of folds tested.
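
To get a feel for the size of this search, the sketch below enumerates the candidate combinations with scikit-learn's `ParameterGrid`, using the same parameter list as above. It assumes that GridSearchCV visits candidates in this order, which is consistent with the winning index and parameters checked later in this section; `grid` and `cv` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'activation': ['logistic', 'relu'],
              'max_iter': [5, 10],
              'hidden_layer_sizes': [(50,), (20,)]}

grid = list(ParameterGrid(parameters))
cv = 2

print(len(grid))           # number of candidate models: 2 * 2 * 2
print(len(grid) * cv)      # model fits during cross-validation

# List each candidate with its index.
for i, params in enumerate(grid):
    print(i, params)
```

With 8 candidates and 2 folds, the search trains 16 models, plus one final refit of the best candidate on the whole training set.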

In [82]:
# Assign clf to the optimized (with grid search) MLP model
# TIP: Again, if you want to track the training progress, try passing "verbose = True" to the MLP
clf = None
mlp_classifier = MLPClassifier(random_state=random_state)
clf = GridSearchCV(mlp_classifier, parameters, cv=2)
clf.fit(X_stand_train, y_train)

GridSearchCV(cv=2, estimator=MLPClassifier(random_state=42),
             param_grid={'activation': ['logistic', 'relu'],
                         'hidden_layer_sizes': [(50,), (20,)],
                         'max_iter': [5, 10]})

In [83]:
# Now let's see the parameters of the winning model of our grid search
# This model is the one clf actually uses when you call clf.predict
clf.best_estimator_

MLPClassifier(hidden_layer_sizes=(50,), max_iter=10, random_state=42)

In [84]:
assert list(clf.cv_results_['rank_test_score']) == [5, 2, 7, 7, 4, 1, 6, 3]
np.testing.assert_almost_equal(round(clf.best_score_,4), 0.2067)
assert clf.best_params_['hidden_layer_sizes'] == (50,)
assert clf.best_index_ == 5

In [85]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Now you will use the estimator with the best found parameters to generate predictions (stored as "y_pred") on testing dataset, **remember to use "X_stand_test"**

In [86]:
y_pred = None
y_pred = clf.predict(X_stand_test)

In [87]:
assert list(confusion_matrix(y_test,y_pred)[0]) == [18,  9,  2,  0, 22,  1,  1,  0,  1, 99]

In [88]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.43      0.12      0.18       153
           1       0.22      0.61      0.32       161
           2       0.13      0.26      0.18       159
           3       0.50      0.01      0.01       166
           4       0.08      0.13      0.10       162
           5       0.86      0.33      0.48       162
           6       0.05      0.03      0.04       169
           7       0.44      0.04      0.08       167
           8       0.14      0.09      0.10       152
           9       0.19      0.31      0.24       167

    accuracy                           0.19      1618
   macro avg       0.30      0.19      0.17      1618
weighted avg       0.30      0.19      0.17      1618



Note that in this toy example, we used a very limited set of hyperparameters to reduce training time, and so our tuned model will actually do worse than our original. However, in practice, the tuned model should generally perform better.

**Remember**: Make sure to complete all problems (.ipynb files) in this assignment. When you finish, double-check the submission instructions at the top of this file, and submit on JupyterHub.