In [None]:
# Initialize Otter Grader
import otter
grader = otter.Notebook()

![data-x](https://raw.githubusercontent.com/afo/data-x-plaksha/master/imgsource/dx_logo.png)

___

#### NAME:

#### STUDENT ID:
___

# **HW3-4: Neural Network**
**(Total 120 points)**

Before we get started, we'd like to introduce you to a ["neural network playground"](https://playground.tensorflow.org) that lets you visualize and tinker with neural networks. It is especially helpful if artificial neural networks are new to you.

The notebook uses [tensorflow](https://www.tensorflow.org/overview) to create and train neural networks. More specifically, we will use the high-level `tensorflow.keras` [API](https://www.tensorflow.org/api_docs/python/tf/keras). Run the code cell below to load the necessary modules.

In [1]:
from tensorflow import keras
from tensorflow.python.framework.ops import disable_eager_execution
from tensorflow.keras.backend import clear_session
from matplotlib import pyplot as plt
import matplotlib.gridspec as gridspec
import pandas as pd
import numpy as np

# We disable tensorflow eager execution mode to enhance CPU and RAM efficiency
disable_eager_execution()

## 0. Introduction
**(0 points)**

This notebook demonstrates various **fundamental ideas of engineering a neural network** model. The MNIST database happens to offer a good platform to showcase these ideas. While the ideas discussed here are general and can be applied to a wide class of neural network architectures, our "vanilla" models (models with just fully-connected layers) do not scale up well to many other computer vision problems. For harder problems, more sophisticated neural network architectures are needed. To learn about architectures designed for computer vision, check out [convolutional neural networks](https://datax.berkeley.edu/wp-content/uploads/2020/09/slides-m430-convolutional-neural-networks.pdf).

### 0.a Load MNIST Database

In this homework, we will work on the [MNIST Database](http://yann.lecun.com/exdb/mnist/). The database contains grey-scale images of handwritten digits (digitized as 28-by-28 pixels) from ten different classes (digits 0\~9). There are 60000 entries in the MNIST training dataset and 10000 entries in the MNIST test dataset.

In [2]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.mnist.load_data()

print("Shape of the full training images array:", X_train_full.shape)
print("Shape of the full training labels array:", y_train_full.shape)
print("Shape of the test images array:",X_test.shape)
print("Shape of the test labels array:",y_test.shape)

To reduce the computational requirements, we will use just 1/6 of the MNIST training data (10000 samples, roughly 1000 per class). We then use `train_test_split` to further split the subset into the training dataset (`X_train`, `y_train`, 6000 samples) and the validation dataset (`X_val`, `y_val`, 4000 samples) that we will use for the following problems. Stratified sampling is used so that each dataset keeps the class proportions of the full training set, which are close to balanced across the 10 classes.

In [3]:
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
## Stratified Sampling
X_train_sample, y_train_sample = resample(X_train_full,y_train_full,
                                          replace=False,n_samples=10000,
                                          stratify=y_train_full,random_state=0)
## Train-validation split
X_train, X_val, y_train, y_val = train_test_split(X_train_sample,y_train_sample,test_size=0.4,
                                                  stratify=y_train_sample,random_state=0)
del X_train_sample, y_train_sample
print("Shape of the training images array:", X_train.shape)
print("Shape of the training labels array:", y_train.shape)
print("Shape of the validation images array:",X_val.shape)
print("Shape of the validation labels array:",y_val.shape)

Let's take a look at the images in `X_train` and see if they match with the labels in `y_train`. We also want to make sure the data are shuffled, which is an important prerequisite for stochastic gradient descent (SGD) methods to work.

In [4]:
fig = plt.figure(constrained_layout=True,figsize=(15,4.5))
spec_arr = gridspec.GridSpec(ncols=10, nrows=3, figure=fig)
for ximg, ylabel, spec in zip(X_train,y_train,spec_arr):
  ax = fig.add_subplot(spec)
  ax.imshow(ximg)
  ax.set_title("({0})".format(ylabel))
  ax.set_xticks([])
  ax.set_yticks([])

### 0.b Preprocessing Data

Each MNIST pixel value ranges from 0\~255 (8-bit), but we scale the pixels uniformly to 0\~1 to make weight initialization easier. This converts the data type, and we store the scaled pixels as 16-bit (half-precision) floats to save memory without sacrificing much accuracy.
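
To get a feel for how little precision is lost, the sketch below applies the same cast-then-divide recipe to all 256 possible pixel values and compares the result against a double-precision reference. It is a standalone NumPy illustration; the variable names are ours and are not used elsewhere in the notebook.

```python
import numpy as np

# All 256 possible 8-bit pixel values.
pixels = np.arange(256, dtype=np.uint8)

# Same recipe as the preprocessing cell: cast to float16, then divide by 255.
scaled_f16 = np.array(pixels, dtype=np.float16) / 255
# Double-precision reference for comparison.
scaled_f64 = pixels.astype(np.float64) / 255

# Largest rounding error introduced by storing the result in half precision.
max_abs_err = np.max(np.abs(scaled_f16.astype(np.float64) - scaled_f64))

print("dtype:", scaled_f16.dtype)
print("bytes per pixel:", scaled_f16.itemsize, "vs", scaled_f64.itemsize)
print("largest rounding error:", max_abs_err)
```

The rounding error stays well below the spacing between adjacent pixel levels (1/255), while each stored pixel takes 2 bytes instead of 8.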

In [5]:
X_train = np.array(X_train,dtype=np.float16) / 255
X_val   = np.array(X_val  ,dtype=np.float16) / 255
X_test  = np.array(X_test ,dtype=np.float16) / 255

### 0.c Building and Training a Neural Network

#### 0.c.1 Weight Initializers
Initialization of parameters (weights) is important for training a neural network. To name some of the widely used initialization methods:
- [Glorot initializer](http://proceedings.mlr.press/v9/glorot10a.html) is designed for neurons with sigmoid or tanh activation functions.
- [He initializer](https://www.cv-foundation.org/openaccess/content_iccv_2015/html/He_Delving_Deep_into_ICCV_2015_paper.html) is designed for neurons with Rectified Linear Unit (ReLu) like activation functions.

> These are the two initializers ([`tf.keras.initializers.GlorotNormal`](https://www.tensorflow.org/api_docs/python/tf/keras/initializers/GlorotNormal) and [`tf.keras.initializers.HeNormal`](https://www.tensorflow.org/api_docs/python/tf/keras/initializers/HeNormal)) we will use throughout this homework. For both initialization techniques, the initial weights are drawn randomly from a certain probability distribution. To ensure consistent results across different runs, we set the seed for random number generation.
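
To make the difference between the two initializers concrete, here is a minimal NumPy sketch of the standard deviations they use: $\sqrt{2/(\text{fan\_in}+\text{fan\_out})}$ for Glorot normal and $\sqrt{2/\text{fan\_in}}$ for He normal. Note the simplification: the Keras initializers draw from a *truncated* normal distribution, whereas this sketch samples a plain normal. The helper names are hypothetical and not part of Keras.

```python
import numpy as np

def glorot_normal_std(fan_in, fan_out):
    # Glorot/Xavier: balances the signal variance in the forward and backward pass.
    return np.sqrt(2.0 / (fan_in + fan_out))

def he_normal_std(fan_in):
    # He: compensates for ReLU zeroing out roughly half of the activations.
    return np.sqrt(2.0 / fan_in)

fan_in, fan_out = 784, 100  # shape of the first Dense layer in this notebook
rng = np.random.default_rng(0)

# Simplification: plain normal samples (Keras truncates the distribution).
w_glorot = rng.normal(0.0, glorot_normal_std(fan_in, fan_out), size=(fan_in, fan_out))
w_he = rng.normal(0.0, he_normal_std(fan_in), size=(fan_in, fan_out))

print("Glorot std:", glorot_normal_std(fan_in, fan_out), "empirical:", w_glorot.std())
print("He std:    ", he_normal_std(fan_in), "empirical:", w_he.std())
```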

In [6]:
initializer_G = keras.initializers.GlorotNormal(seed=0)
initializer_H = keras.initializers.HeNormal(seed=0)

#### 0.c.2 Keras [Sequential Model](https://www.tensorflow.org/guide/keras/sequential_model)
[`keras.Sequential`](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential) makes it easy to create models layer-wise. The following code creates a model with three layers.

1. First, each input image ($28\times28$ pixels) is "flattened" into a $28\times28=784$ dimensional vector. The vector is the input to the next layer.

2. A "Dense" (aka "fully connected") layer connecting each element of the input vector to each of the 100 hidden units (neurons). There is a weight parameter associated with each of the $784\times100$ connections, and a bias parameter for each of the 100 neurons. Therefore, the total number of parameters is $(784+1)\times100$ for the layer. More formally, for each 784-dimensional input vector $x$, the layer outputs a 100-dimensional $y=f(w^T x+b)$ where
 - $w$ is a 784$\times$100 weight matrix.
 - $b$ is a 100-dimensional bias vector.
 - $f$ is the non-linear activation function (elementwise "[`sigmoid`](https://www.tensorflow.org/api_docs/python/tf/keras/activations/sigmoid)" in our example).

3. A second Dense layer which takes inputs from the first Dense layer and generates the 10-dimensional outputs of our model. Each of the ten neurons corresponds to one of the class labels (0\~9) that an image can be classified into. It follows the same mathematical formulation as in 2, except that the activation function $f$ is "[`softmax`](https://www.tensorflow.org/api_docs/python/tf/keras/activations/softmax)" to give a probabilistic interpretation of our model outputs. This is due to the properties of the softmax function:
 - Each element of the output vector is $\in [0,1]$.
 - The summation of the 10 elements of the output vector gives 1.
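
The three layers described above can be sketched in plain NumPy as follows. This is only an illustration of the math $y=f(w^T x+b)$ with randomly drawn weights and a random stand-in image; it is not how Keras implements the layers, and all names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the maximum does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
image = rng.random((28, 28))      # stand-in for one scaled MNIST image

# Layer 1: Flatten, 28x28 -> 784
x = image.reshape(-1)

# Layer 2: Dense with 100 sigmoid units
w1 = rng.normal(0.0, 0.05, size=(784, 100))
b1 = np.zeros(100)
hidden = sigmoid(w1.T @ x + b1)

# Layer 3: Dense with 10 softmax units
w2 = rng.normal(0.0, 0.05, size=(100, 10))
b2 = np.zeros(10)
probs = softmax(w2.T @ hidden + b2)

n_params_hidden = w1.size + b1.size   # (784 + 1) * 100
n_params_output = w2.size + b2.size   # (100 + 1) * 10
print("output probabilities sum to", probs.sum())
print("parameters per layer:", n_params_hidden, n_params_output)
```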

We then use the `keras.Model.summary` method to get a profile of the model. Does the number of parameters for each layer match with your expectations? 

> The initializers we declared above are used to initialize the parameters for the two Dense layers. Doing the same will help ensure the **reproducibility of your results** in the following practices.

In [7]:
model_basic = keras.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.Dense(units=100,activation="sigmoid",kernel_initializer=initializer_G),
    keras.layers.Dense(units=10,activation="softmax",kernel_initializer=initializer_G)
])
model_basic.summary()

We then configure the model for training through [`keras.Model.compile`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile). Two essential components should be specified in addition to the layers.

- The **loss function** quantifies the "goodness" of the model output. Model training is essentially a process trying to minimize the loss function by updating the parameters. For our multi-class classification problem, [`SparseCategoricalCrossentropy`](https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy) is a natural choice.
- The **optimizer**, i.e. the algorithm used to minimize the loss. For deep neural networks, virtually all popular optimizers are based on the [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) (SGD) method. Here we use a basic [`SGD`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD) optimizer with learning rate set to 1.

Additionally, you can specify extra **metrics** to keep track of during training besides the loss value. For example, [`Accuracy`](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Accuracy) can be a useful indicator of the model performance.
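
For intuition about what the loss measures, here is a minimal NumPy sketch of sparse categorical cross-entropy: the average negative log-probability that the model assigns to the true class, where labels are integers rather than one-hot vectors. The function name and the clipping constant `eps` are our own choices for this illustration.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    # Clip to avoid log(0), then pick the predicted probability of the true class.
    probs = np.clip(probs, eps, 1.0)
    picked = probs[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(picked))

y_true = np.array([3, 7])

# A model that knows nothing: uniform probability over the 10 classes.
uniform = np.full((2, 10), 0.1)
loss_uniform = sparse_categorical_crossentropy(y_true, uniform)

# A model that is confident and correct: 0.91 on the true class.
confident = np.full((2, 10), 0.01)
confident[np.arange(2), y_true] = 0.91
loss_confident = sparse_categorical_crossentropy(y_true, confident)

print("uniform guess loss:", loss_uniform)      # equals ln(10)
print("confident correct loss:", loss_confident)
```

A loss near $\ln 10 \approx 2.30$ therefore indicates a model that is no better than guessing among ten classes.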

In [8]:
model_basic.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1),
              metrics=["accuracy"])

The `train_nn` function we define here will be used to fit the models throughout the notebook. It calls the [`keras.Model.fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) method of which some relevant arguments are explained below.
- The SGD `batch_size` is set to 32, meaning that for each SGD iteration (each time the weights get updated) the gradient is computed from 32 samples. A reasonable batch size helps speed up the optimization through parallelized computation.
- The SGD `epochs` determines the number of times that all samples in `X_train` are traversed by SGD during training.

> We set `shuffle=False` to make the training results repeatable.
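
As a quick sanity check on what these settings imply, the arithmetic below counts the weight updates for our 6000 training samples. It assumes the last, smaller batch of each epoch is still used for an update (nothing is dropped).

```python
import math

n_train = 6000      # samples in X_train
batch_size = 32
epochs = 5

steps_per_epoch = math.ceil(n_train / batch_size)               # 188
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size  # 16
total_updates = steps_per_epoch * epochs                        # 940

print("weight updates per epoch:", steps_per_epoch)
print("size of the last batch in each epoch:", last_batch_size)
print("weight updates over all epochs:", total_updates)
```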

In [9]:
def train_nn(model,epochs=5,verbose=0):
  history = model.fit(X_train,y_train,validation_data=(X_val,y_val),
                      batch_size=32,
                      epochs=epochs,shuffle=False,verbose=verbose)
  return history
history_basic = train_nn(model_basic,verbose=1)
pd.DataFrame(history_basic.history).plot(figsize=(5,5))
plt.grid(True)

The "history" object returned by the fit method contains essential information about the training history. Here we keep a record of the example in `history_basic`, which will serve as a baseline for various comparisons in the following sections.

> Unless you have re-initialized `model_basic`, **do not run the code cell above more than once**. Doing so will overwrite `history_basic` with a different training history that starts from already-trained weights. Although this will not affect the grading, it will undermine the purpose of a baseline training history.

#### 0.c.3 Making Predictions

Similarly to scikit-learn estimators, you can call the [`keras.Model.predict`](https://www.tensorflow.org/api_docs/python/tf/keras/Model?version=nightly#predict) method to make predictions. Recall that we used "softmax" for the ten output neurons, so the model will output ten numbers which can be interpreted as the probabilities (the model's certainty) that the image belongs to each of the classes.
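
To turn such probability vectors into class labels, we take the index of the largest entry. The NumPy sketch below uses made-up probabilities (not real model output) to show the conversion and how accuracy follows from it.

```python
import numpy as np

# Made-up softmax outputs for three images (each row sums to 1).
probs = np.full((3, 10), 0.02)
probs[0, 7] = 0.82
probs[1, 2] = 0.82
probs[2, 1] = 0.82

y_true = np.array([7, 2, 4])   # hypothetical true labels

# Predicted class = index of the highest probability in each row.
y_pred = np.argmax(probs, axis=1)
# Accuracy = fraction of predictions that match the true labels.
acc = np.mean(y_pred == y_true)

print("predicted labels:", y_pred)
print("accuracy:", acc)
```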

In [10]:
prediction_basic_9 = model_basic.predict(X_test[:9])

fig = plt.figure(constrained_layout=True,figsize=(12,6))
spec_arr = gridspec.GridSpec(ncols=6, nrows=3, figure=fig)

for n, (ximg, ylabel, pred) in enumerate(zip(X_test[:9],y_test[:9],prediction_basic_9)):
  ax_img  = fig.add_subplot(spec_arr[2*n])
  ax_img.imshow(ximg.astype(float))
  ax_img.set_title("({0})".format(ylabel))
  ax_img.set_xticks([])
  ax_img.set_yticks([])

  ax_pred = fig.add_subplot(spec_arr[2*n+1])
  ax_pred.set_xticks(range(10))
  ax_pred.bar(range(10),pred,color='C0' if np.argmax(pred)==ylabel else 'C1')
  ax_pred.set_ylim([0,1])
  ax_pred.set_xlabel('Class')
  ax_pred.set_ylabel('Modeled Probability')

The code cell below classifies images in `X_test` and evaluates the prediction metric. You should see that the test accuracy is 0.9292.

In [11]:
print ('test acc:',model_basic.evaluate(X_test,y_test)[1])

[`keras.backend.clear_session`](https://www.tensorflow.org/api_docs/python/tf/keras/backend/clear_session) frees your model from memory. Call it whenever necessary to remove unused models, especially if you work on a platform that is memory-limited.

In [12]:
clear_session()

## 1. Training: Loss Minimization 
**(Total 20 points)**

In this section, we dive a bit into the optimizers which are the workhorse of deep learning.

Run the block below to define useful wrapper functions that will be used throughout the section.
- `build_vanillaNN` builds the neural network model we want to train. The network structure is the same as in 0.c.2. However, the `optimizer` argument is left unspecified, which is exactly what you will work on.
- We also redefine the `train_nn` function here, specifying the training and validation data, batch size, and shuffling criteria. Simply call the function with your compiled model to start the training procedure.

In [13]:
def build_vanillaNN(optimizer):
  model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.Dense(100,activation="sigmoid",kernel_initializer=initializer_G),
    keras.layers.Dense(10,activation="softmax",kernel_initializer=initializer_G)
  ])
  model.compile(loss="sparse_categorical_crossentropy",
                optimizer=optimizer,
                metrics=["accuracy"])
  return model

def train_nn(model,epochs=5,verbose=0,validation=True):
  if validation:
    history = model.fit(X_train,y_train,validation_data=(X_val,y_val),
                        batch_size=32,epochs=epochs,shuffle=False,verbose=verbose)
  else:
    history = model.fit(X_train,y_train,
                        batch_size=32,epochs=epochs,shuffle=False,verbose=verbose)
  return history

- `plot_learning` is a helper function that can be used to visualize the learning curves. It can be used to plot learning curves from different models together for the ease of comparisons. 

In [14]:
def plot_learning(history,label,c,ax_loss,ax_acc):
  train_label = "train ({0})".format(label)
  val_label   = "validation ({0})".format(label)
  if 'loss' in history.history:
    ax_loss.plot(history.history['loss'],c=c,label=train_label)
  if 'val_loss' in history.history:
    ax_loss.plot(history.history['val_loss'],'--',c=c,label=val_label)
  if 'accuracy' in history.history:
    ax_acc.plot(history.history['accuracy'],c=c,label=train_label)
  if 'val_accuracy' in history.history:
    ax_acc.plot(history.history['val_accuracy'],'--',c=c,label=val_label)

  for ax in [ax_loss,ax_acc]:
    ax.legend()
    ax.grid(True)
    ax.set_xlabel("epoch")
  ax_loss.set_ylabel("Loss")
  ax_acc.set_ylabel("Accuracy")

### 1.a Learning Rate

In 0.c.2, we trained the network using a plain SGD optimizer with the learning rate fixed to 1, which yielded a reasonable learning curve. What if we increase or decrease the learning rate? Feel free to explore.

#### 1.a.1 Constant Learning Rate
**(5 points)**

Build a model through `build_vanillaNN`. Find an SGD learning rate that makes the average training loss $\le$ 0.1 for the 5th epoch.

**Submission format:**
- The final model (compiled with [`SGD`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD) optimizer with the learning rate you find) should be stored in **`model_1a1`**.
- Run `train_nn(model_1a1)` to generate the learning curve. The last three lines of code in the cell plot the curves.
  - You can see from the plots whether you will pass the test, i.e. `history_1a1.history['loss'][4]<=0.1`.

<!--
BEGIN QUESTION
name: q1a1
manual: false
points: 5
-->

In [15]:
clear_session()
## Your code here
...
model_1a1 = ...

history_1a1 = train_nn(model_1a1)

## Visualizing learning curves
lr_num = keras.backend.eval(history_1a1.model.optimizer.learning_rate)
fig, AX = plt.subplots(nrows=1,ncols=2,figsize=(9,4))
plot_learning(history_basic,'lr=1.0','C0',AX[0],AX[1])
plot_learning(history_1a1,'lr={0}'.format(lr_num),'C1',AX[0],AX[1])

In [None]:
grader.check("q1a1")

Why is there an optimal learning rate to minimize the training error over a few epochs? What happens when the learning rate is too large/small?
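
A toy problem can help you reason about these questions. The sketch below runs plain gradient descent on the one-dimensional function $f(w)=\frac{1}{2}w^2$ (whose gradient is simply $w$) with three learning rates. It is a deliberately simplified stand-in: the real training loss is neither one-dimensional nor quadratic, so the specific numbers do not carry over to the homework.

```python
def final_loss(lr, steps=5, w0=1.0):
    """Run `steps` gradient-descent updates on f(w) = 0.5 * w**2 and return f(w)."""
    w = w0
    for _ in range(steps):
        grad = w              # derivative of 0.5 * w**2
        w = w - lr * grad     # each update multiplies w by (1 - lr)
    return 0.5 * w ** 2

initial_loss = 0.5            # f(1.0)
for lr in [0.01, 0.5, 2.5]:
    print("lr = {0:<4}: loss after 5 steps = {1:.6f}".format(lr, final_loss(lr)))
```

With the small rate the loss barely moves, with the moderate rate it shrinks quickly, and with the large rate each step overshoots the minimum so the loss grows.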

#### 1.a.2 Scheduled Learning Rate
**(5 points)**

A large learning rate may work well when the model is far from the optimum. As the training process gradually converges towards the optimum, a lower learning rate usually works better than the large learning rate that was ideal in the beginning. 

Therefore, an adaptive learning rate can speed up the overall training significantly. For example, we can schedule the learning rate to decrease over steps based on a certain criterion. The [keras LearningRateSchedule](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/LearningRateSchedule) class provides the flexibility to specify this manually. Let's try to utilize it and beat your training result from 1.a.1.
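
To see what a schedule does, the sketch below re-implements in plain Python the formulas documented for two Keras schedules (in their continuous, non-staircase form): inverse time decay, $\eta_0/(1+r\,s/S)$, and exponential decay, $\eta_0\,r^{s/S}$, where $s$ is the update step, $S$ is `decay_steps` and $r$ is `decay_rate`. The function names and the hyperparameter values are ours and are for illustration only; they are not a suggested solution.

```python
def inverse_time_decay(initial_lr, step, decay_steps, decay_rate):
    # Learning rate shrinks like 1 / (1 + decay_rate * step / decay_steps).
    return initial_lr / (1.0 + decay_rate * step / decay_steps)

def exponential_decay(initial_lr, step, decay_steps, decay_rate):
    # Learning rate is multiplied by decay_rate every decay_steps steps.
    return initial_lr * decay_rate ** (step / decay_steps)

steps_per_epoch = 188   # ceil(6000 / 32) updates per epoch in this notebook
for epoch in range(6):
    step = epoch * steps_per_epoch
    lr_inv = inverse_time_decay(1.0, step, decay_steps=steps_per_epoch, decay_rate=1.0)
    lr_exp = exponential_decay(1.0, step, decay_steps=steps_per_epoch, decay_rate=0.5)
    print("after epoch {0}: inverse-time lr = {1:.4f}, exponential lr = {2:.4f}".format(
        epoch, lr_inv, lr_exp))
```

Note that the schedule is evaluated per update step, not per epoch, so `decay_steps` should be chosen with the number of updates per epoch in mind.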

Build a model through `build_vanillaNN`. Design an SGD learning rate scheduler that enables better training performance than 1.a.1.

**Submission format:**
- When declaring an [`SGD`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD) optimizer, pass a [`LearningRateSchedule`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/LearningRateSchedule) based object (e.g. [`InverseTimeDecay`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/InverseTimeDecay?version=nightly), [`ExponentialDecay`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/ExponentialDecay?version=nightly)) as the argument for `learning_rate`. You have the freedom to set the scheduling criterion.
- The final model compiled with the SGD optimizer specified above should be stored in **`model_1a2`**.
 - We will check if `model_1a2.optimizer.learning_rate` is an instance of a class in [`keras.optimizers.schedules`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules?version=nightly).
- `train_nn(model_1a2)` generates the learning curves. The following plots compare your new results with the results from 1.a.1. You will pass the second test if the average training loss of the 5th epoch is lower than that from 1.a.1.
 - That is, `history_1a2.history['loss'][4] < history_1a1.history['loss'][4]`.

<!--
BEGIN QUESTION
name: q1a2
manual: false
points: 5
-->

In [18]:
clear_session()
from tensorflow.keras.optimizers.schedules import *
## Your code here
...

history_1a2 = train_nn(model_1a2)
fig, AX = plt.subplots(nrows=1,ncols=2,figsize=(9,4.5))
plot_learning(history_1a1,'1a1: fixed lr','C0',AX[0],AX[1])
plot_learning(history_1a2,'1a2: scheduled lr','C1',AX[0],AX[1])

In [None]:
grader.check("q1a2")

### 1.b Momentum
**(5 points)**

[Momentum](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum) is another technique that accelerates the gradient descent. It has a theoretical justification for convex optimization problems, and empirically works well for some non-convex problems such as neural network training. For `tensorflow.keras`, `momentum` is another hyperparameter that can be tuned for the [`SGD`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD) optimizer. We will utilize it to speed up our training in this problem.
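
The update rule documented for the Keras `SGD` optimizer with momentum is `velocity = momentum * velocity - learning_rate * gradient`, followed by `w = w + velocity`. The sketch below applies it to the toy function $f(w)=\frac{1}{2}w^2$ so that you can follow the first two steps by hand. The helper name and the values used are illustrative assumptions, not a suggested answer.

```python
def sgd_steps(w0, lr, momentum, n_steps):
    """Apply n_steps of SGD with momentum to f(w) = 0.5 * w**2 and return w."""
    w, velocity = w0, 0.0
    for _ in range(n_steps):
        grad = w                                   # derivative of 0.5 * w**2
        velocity = momentum * velocity - lr * grad
        w = w + velocity
    return w

# Without momentum: 1.0 -> 0.9 -> 0.81
w_plain = sgd_steps(1.0, lr=0.1, momentum=0.0, n_steps=2)
# With momentum 0.9: 1.0 -> 0.9 -> 0.72 (the previous step is partly repeated)
w_momentum = sgd_steps(1.0, lr=0.1, momentum=0.9, n_steps=2)

print("plain SGD after 2 steps:   ", w_plain)
print("with momentum after 2 steps:", w_momentum)
```

The first step is identical in both cases because the velocity starts at zero; from the second step on, momentum keeps pushing in the direction of earlier updates.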

Build a model through `build_vanillaNN`. Find an SGD momentum value that makes the average training loss < 0.08 for the 5th epoch.

**Submission format:**
- Declare an [`SGD`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD) optimizer with `learning_rate=1.0` (same as 0.c.2) and non-zero `momentum` term.
- Find a `momentum` value that makes the average training loss of the 5th epoch < 0.08. The model compiled with the SGD optimizer specified above should be stored in **`model_1b`**.
 - The momentum value should be > 0.
- Run `train_nn(model_1b)` to generate the learning curve.
 - We will test if `history_1b.history['loss'][4] < 0.08`.

<!--
BEGIN QUESTION
name: q1b
manual: false
points: 5
-->

In [22]:
clear_session()
## Your code here
...
model_1b = ...

history_1b = train_nn(model_1b)

## Visualizing learning curves
momentum_num = keras.backend.eval(history_1b.model.optimizer.momentum)
fig, AX = plt.subplots(nrows=1,ncols=2,figsize=(9,4))
plot_learning(history_basic,'no momentum','C0',AX[0],AX[1])
plot_learning(history_1b,'momentum={0:.2f}'.format(momentum_num),'C1',AX[0],AX[1])

In [None]:
grader.check("q1b")

### 1.c More Optimizers
**(5 points)**

There are algorithms that build on SGD but try to outsmart it by automatically adapting the update step sizes. To name a few implemented in keras:
- [`keras.optimizers.Adagrad`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adagrad)
- [`keras.optimizers.RMSprop`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/RMSprop)
- [`keras.optimizers.Adam`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam)

To learn more details about them, see the reference in the API pages. Pick one of the advanced optimizers and try it out by training the model with `build_vanillaNN`. 
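
As an example of what "adapting the step size" means, here is a NumPy sketch of a single Adam update in its textbook form. The Keras implementation may differ in details such as where `eps` enters, so treat this as an approximation; the function name and default values below are our assumptions.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One textbook Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# The very first step has size ~lr, whether the gradient is tiny or huge.
first_steps = []
for grad in [0.01, 100.0]:
    w_new, _, _ = adam_step(w=0.0, grad=grad, m=0.0, v=0.0, t=1)
    first_steps.append(abs(w_new))
    print("gradient = {0:>6}: first step size = {1:.6f}".format(grad, abs(w_new)))
```

Because the update is normalized by the running gradient magnitude, the useful range of learning rates is very different from plain SGD, where the step is proportional to the gradient.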

**Submission format:**
- Declare an optimizer that is not `SGD`, and set relevant hyperparameters that result in good training performance.
- The model compiled with the advanced optimizer specified above should be stored in **`model_1c`**.
- Run `train_nn(model_1c)` to generate the learning curve.
 - We will test if `history_1c.history['loss'][4] < 0.07`.

<!--
BEGIN QUESTION
name: q1c
manual: false
points: 5
-->

In [26]:
clear_session()
## Your code here
...
model_1c = ...

history_1c = train_nn(model_1c)

## Visualizing learning curves
optimizer_name = history_1c.model.optimizer.__class__.__name__
fig, AX = plt.subplots(nrows=1,ncols=2,figsize=(9,4))
plot_learning(history_1a1,'1a1: SGD','C0',AX[0],AX[1])
plot_learning(history_1c,'1c: {0}'.format(optimizer_name),'C1',AX[0],AX[1])

In [None]:
grader.check("q1c")

## 2. Customizing Neural Networks
**(Total 30 points)**

In this section, we will keep working on "shallow" neural networks (with just one hidden layer), but we will explore the interesting effects of the hidden layer's hyperparameters.

### 2.a Activation Functions

In the previous sections, we used the bio-inspired "sigmoid" as the activation function of the hidden layer. It possesses certain nice properties. For example, it is good at learning logical functions. Although it was a popular choice for a long time, its gradient becomes very small when the unit saturates, which makes training harder. In principle, any non-linear function may be used as an activation function for a neural network. There are some other activation functions that have proven useful:

1. **tanh** is similar to sigmoid, but it allows the output to be negative. Therefore, it allows for a larger reachable parameter space than sigmoid.

2. **Rectified Linear Unit (ReLu)** has a constant gradient in the positive region, so it results in more controllable gradients than sigmoid. Besides, it is computationally efficient. However, a ReLu neuron cannot learn (its gradient is 0) if it outputs zero for every sample; it only keeps learning as long as it outputs non-zero values for at least some samples. This is called the dying ReLu problem.

3. **Exponential Linear Unit (ELU)** is similar to ReLu but is differentiable everywhere, making it numerically more stable than ReLu. Moreover, it allows for negative output, which alleviates the dying problem.
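
For reference, the four activation functions discussed here can be written in a few lines of NumPy. This is only a sketch of their definitions (with ELU's `alpha` set to 1); in the homework you should use the built-in Keras activations by name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # output in (0, 1)

def tanh(z):
    return np.tanh(z)                          # output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)                  # zero output and zero gradient for z < 0

def elu(z, alpha=1.0):
    # Identity for z > 0, smooth saturation towards -alpha for z < 0.
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu), ("elu", elu)]:
    print("{0:>7}: {1}".format(name, np.round(f(z), 3)))
```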

Note that there is no silver bullet activation function that works the best for all networks. For each problem, the activation function can be chosen through (cross-)validation. 
This motivates us to build models with the newly learned activation functions. For all questions under 2.a, you will still create models with 3 layers as specified below:
- The first layer should be a `Flatten` layer. 
- The second layer should be a `Dense` layer with 100 neurons.
- The third layer, the output layer, should be a `Dense` layer with 10 neurons with softmax activation.

#### 2.a.1 Hyperbolic Tangent (tanh)
**(5 points)**

Train a model with the hidden layer activation function being "[`tanh`](https://www.tensorflow.org/api_docs/python/tf/keras/activations/tanh)." Use any optimization techniques to achieve low training and validation losses.

**Submission format:**
- You have the freedom to use any initializers and optimizers, but make sure the activation function of the second layer is "tanh".
- The model compiled with the designed optimizer specified above should be stored in **`model_2a1`**.
- Run `train_nn(model_2a1)` to generate the learning curves.
 - We will test if `history_2a1.history['loss'][4] < 0.08`.
 - We will test if `history_2a1.history['val_loss'][4] < 0.28`.

<!--
BEGIN QUESTION
name: q2a1
manual: false
points: 5
-->

In [31]:
clear_session()
## Your code here
...

model_2a1 = ...

history_2a1 = train_nn(model_2a1,epochs=5)
fig, AX = plt.subplots(nrows=1,ncols=2,figsize=(9,4.5))
plot_learning(history_2a1,'2a1: tanh','C0',AX[0],AX[1])

In [None]:
grader.check("q2a1")

#### 2.a.2 Rectified Linear Unit (ReLu)
**(5 points)**

Train a model with the hidden layer activation function being "[`relu`](https://www.tensorflow.org/api_docs/python/tf/keras/activations/relu)." Use any optimization techniques to achieve low training and validation losses.

**Submission format:**
- You have the freedom to use any initializers and optimizers, but make sure the activation function of the second layer is "relu".
- The model compiled with the designed optimizer specified above should be stored in **`model_2a2`**.
- Run `train_nn(model_2a2)` to generate the learning curves.
 - We will test if `history_2a2.history['loss'][4] < 0.08`.
 - We will test if `history_2a2.history['val_loss'][4] < 0.28`.

<!--
BEGIN QUESTION
name: q2a2
manual: false
points: 5
-->

In [35]:
clear_session()
## Your code here
...

model_2a2 = ...

history_2a2 = train_nn(model_2a2,epochs=5)
fig, AX = plt.subplots(nrows=1,ncols=2,figsize=(9,4.5))
plot_learning(history_2a2,'2a2: relu','C0',AX[0],AX[1])

In [None]:
grader.check("q2a2")

#### 2.a.3 Exponential Linear Unit (ELU)
**(5 points)**

Train a model with the hidden layer activation function being "[`elu`](https://www.tensorflow.org/api_docs/python/tf/keras/activations/elu)." Use any optimization techniques to achieve low training and validation losses.

**Submission format:**
- You have the freedom to use any initializers and optimizers, but make sure the activation function of the second layer is "elu".
- The model compiled with the designed optimizer specified above should be stored in **`model_2a3`**.
- Run `train_nn(model_2a3)` to generate the learning curves.
 - We will test if `history_2a3.history['loss'][4] < 0.10`.
 - We will test if `history_2a3.history['val_loss'][4] < 0.28`.

<!--
BEGIN QUESTION
name: q2a3
manual: false
points: 5
-->

In [39]:
clear_session()
## Your code here
...

model_2a3 = ...

history_2a3 = train_nn(model_2a3,epochs=5)
fig, AX = plt.subplots(nrows=1,ncols=2,figsize=(9,4.5))
plot_learning(history_2a3,'2a3: elu','C0',AX[0],AX[1])

In [None]:
grader.check("q2a3")

### 2.b Number of Neurons
**(15 points)**

In our model, the number of parameters increases linearly with the "width" of the hidden layer. How does that affect the training and validation performance?

Create models with 3 layers, but vary the number of neurons from 50 to 150 for the hidden layer:
- The first layer should be a `Flatten` layer. 
- The second layer should be a `Dense` layer with `sigmoid` activations and `kernel_initializer=initializer_G`. `units` should be set according to the elements in `nunits_grid`.
- The third layer, the output layer, should be a `Dense` layer with 10 neurons with softmax activation.

**Submission format:**
- For each element in `nunits_grid` in ascending order, build and train (you can still use `train_nn`) a model according to above specifications. Use **the same optimizer** to train all the models.
- Append the training history for each model to the list `histories_2b`. That is, we should have `histories_2b[0]` being results for `units=50`, `histories_2b[1]` being results for `units=100`, and so on.
  - For the model with `units=50`, you should have an average training loss for the 5th epoch < 0.1.
  - We expect the training loss not to increase as `units` increases. This will be tested for `units` in `[100,150]` through `histories_2b[1:3]`.


In [43]:
clear_session()

nunits_grid  = [50,100,150]
histories_2b = [] 
# Your code here
...

fig, AX = plt.subplots(nrows=1,ncols=2,figsize=(9,4.5))
Colors = ['C0','C1','C2']
for history, nunits, c in zip(histories_2b,nunits_grid,Colors):
  plot_learning(history,'{0}'.format(nunits),c,AX[0],AX[1])

<!--
BEGIN QUESTION
name: q2b1
manual: false
points: 5
-->

In [44]:
# tests for units=50 (5 points)

In [None]:
grader.check("q2b1")

<!--
BEGIN QUESTION
name: q2b2
manual: false
points: 5
-->

In [49]:
# tests for units=100 (5 points)

In [None]:
grader.check("q2b2")

<!--
BEGIN QUESTION
name: q2b3
manual: false
points: 5
-->

In [54]:
# tests for units=150 (5 points)

In [None]:
grader.check("q2b3")

How did the validation loss and accuracy change as you increased `units`? You may find that with \~120K parameters for 150 hidden neurons, the overfitting problem is not worse than in the case with 100 neurons (\~80K parameters). More interestingly, there is evidence showing that [over-parameterization helps neural networks generalize](https://openreview.net/forum?id=BygfghAcYX)! This is a curious behaviour of neural networks that was not expected in conventional machine learning theory, and is one of the reasons that gigantic neural networks are used in practice.
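
The parameter counts quoted above follow directly from the layer shapes, as the short calculation below shows. The helper function is ours; you can compare its output with `model.summary()`.

```python
def count_params(n_hidden, n_in=784, n_out=10):
    """Weights plus biases of a Flatten -> Dense(n_hidden) -> Dense(n_out) network."""
    hidden_layer = (n_in + 1) * n_hidden      # one weight per input, plus one bias, per unit
    output_layer = (n_hidden + 1) * n_out
    return hidden_layer + output_layer

for units in [50, 100, 150]:
    print("units = {0:>3}: {1} parameters".format(units, count_params(units)))
# units =  50: 39760 parameters
# units = 100: 79510 parameters
# units = 150: 119260 parameters
```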

## 3. Deep Neural Networks 
**(Total 20 points)**

A "deep" neural network has more than one hidden layer. This is motivated by the observation that a deep network is able to learn features at different levels of abstraction. The deeper the data get passed down the network, the higher the abstraction level. This allows architectures such as convolutional neural networks (CNNs) to learn well-structured and useful feature maps. Therefore, practical CNNs usually have tens or hundreds of layers. For a plain fully-connected architecture such as the one we have here, networks with 2 hidden layers usually outperform networks with 1 hidden layer. However, stacking even more fully-connected layers may not further boost the performance of fully-connected networks.

### 3.a More Hidden Layers
**(8 points)**

Create and train models with more hidden layers! Here we will build models with up to 3 hidden layers.

**Submission format:**
- For each element in `nhlayers_grid` in ascending order, build and train (you can still use `train_nn`) a model with the corresponding number of hidden `Dense` layers. Use **the same optimizer** to train all the models.
- For each model, the first layer should still be a `Flatten` layer, and the final layer should still be a `Dense` layer with 10 units and softmax activation. Therefore, the **total number of layers should be 2+(# of hidden layers)**.
- Append the training history for each model to the list `histories_3a`. That is, we should have `histories_3a[0]` being results for `nhlayers=1`, `histories_3a[1]` being results for `nhlayers=2`, and so on.
- There is no training loss requirement for this question. We will simply test if your models comply with the specifications above. 
- We suggest the following hyperparameters:
 - Use default `RMSprop` optimizer.
 - For each hidden layer, `units=100`, `activation='relu'`, and `kernel_initializer=initializer_H`.

In [59]:
clear_session()

nhlayers_grid = [1,2,3]
histories_3a  = []
# Your code here
...

fig, AX = plt.subplots(nrows=1,ncols=2,figsize=(9,4.5))
for history,n,c in zip(histories_3a,nhlayers_grid,['C0','C1','C2','C3']):
  plot_learning(history,'{0}'.format(n), c,AX[0],AX[1])

<!--
BEGIN QUESTION
name: q3a1
manual: false
points: 4
-->

In [60]:
# test for 2 hidden dense (4 points)

In [None]:
grader.check("q3a1")

<!--
BEGIN QUESTION
name: q3a2
manual: false
points: 4
-->

In [62]:
# test for 3 hidden dense (4 points)

In [None]:
grader.check("q3a2")

### 3.b Batch Normalization
**(12 points)**

As the network gets deeper, it becomes harder to train because the gradients are more likely to either explode or vanish. An effective way to counter this problem is [`BatchNormalization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization?version=nightly), which standardizes the inputs to each layer, analogous to standardizing the input features of a model. It is also a great way to accelerate training. Now, we are going to add it to our model and see how much it helps.
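To make the standardization step concrete, here is a minimal NumPy sketch of what a batch normalization layer computes in training mode: each feature is centered and scaled using the statistics of the current batch, then rescaled by a learnable `gamma` and shifted by a learnable `beta`. The function name `batch_norm_train` and the toy data are illustrative assumptions, not part of the homework code; the Keras layer additionally tracks moving averages of the mean and variance for use at inference time, which this sketch omits.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Training-mode batch normalization for a 2D batch (samples x features).

    Hypothetical helper for illustration only.
    """
    mean = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                        # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # standardize each feature
    return gamma * x_hat + beta                # learnable scale and shift

rng = np.random.default_rng(0)
# Toy "hidden layer outputs": a batch of 32 samples with 100 units each,
# deliberately far from zero mean and unit variance.
activations = rng.normal(loc=5.0, scale=3.0, size=(32, 100))
out = batch_norm_train(activations, gamma=np.ones(100), beta=np.zeros(100))

print("mean of first units:", out.mean(axis=0)[:3])
print("std of first units: ", out.std(axis=0)[:3])
```

With `gamma=1` and `beta=0`, every unit leaves the layer with approximately zero mean and unit standard deviation, regardless of the scale of the incoming activations.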

**Submission format:**
- For each element in `nhDenselayers_grid` in ascending order, repeat what you did in the last question, but now add a `BatchNormalization` layer after each `Dense` hidden layer.
- For each model, the first layer should still be a `Flatten` layer, and the final layer should still be a `Dense` layer with 10 units and softmax activation. The number of hidden layers doubles compared with 3.a because of the newly added `BatchNormalization` layers. Therefore, the **total number of layers should be 2+2$\times$(# of hidden Dense layers)**.
- Append the training history for each model to the list `histories_3b`. That is, we should have `histories_3b[0]` being results for `nhDenselayers=1`, `histories_3b[1]` being results for `nhDenselayers=2`, and so on.
- The other suggestions follow 3.a.
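The layer-count formulas from 3.a and 3.b can be checked with a small bookkeeping sketch. It only builds lists of layer descriptions as strings (no Keras objects), and the helper name `layer_plan` is a hypothetical one introduced here for illustration.

```python
def layer_plan(n_hidden_dense, batch_norm=False):
    """Return the ordered layer descriptions for a model in 3.a or 3.b."""
    layers = ["Flatten"]
    for _ in range(n_hidden_dense):
        layers.append("Dense(100, relu)")
        if batch_norm:
            # 3.b: one BatchNormalization layer after each hidden Dense layer
            layers.append("BatchNormalization")
    layers.append("Dense(10, softmax)")
    return layers

for n in [1, 2, 3]:
    plain = layer_plan(n)
    with_bn = layer_plan(n, batch_norm=True)
    print(n, len(plain), len(with_bn))
```

For `n` hidden `Dense` layers, the plain plan has `2 + n` entries and the batch-normalized plan has `2 + 2 * n` entries.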

In [64]:
clear_session()

nhDenselayers_grid = [1,2,3]
histories_3b  = []
# Your code here
...

fig, AX = plt.subplots(nrows=1,ncols=2,figsize=(9,4.5))
for history,n,c in zip(histories_3b,nhDenselayers_grid,['C0','C1','C2','C3']):
  plot_learning(history,'{0}'.format(n), c,AX[0],AX[1])

<!--
BEGIN QUESTION
name: q3b1
manual: false
points: 4
-->

In [65]:
# test for 1 hidden dense (4 points)

In [None]:
grader.check("q3b1")

<!--
BEGIN QUESTION
name: q3b2
manual: false
points: 4
-->

In [67]:
# test for 2 hidden dense (4 points)

In [None]:
grader.check("q3b2")

<!--
BEGIN QUESTION
name: q3b3
manual: false
points: 4
-->

In [69]:
# test for 3 hidden dense (4 points)

In [None]:
grader.check("q3b3")

You should be able to see substantial improvements in the training loss. Note that inserting the `BatchNormalization` layer after the linear transformation and right before the activation function is more common than the ordering we have used here.

## 4. Regularizations
**(Total 20 points)**

Being heavily parameterized nonlinear models, neural networks are prone to overfitting. Overfitting may be resolved either by including more training data or through regularization (e.g. penalizing large weights as we have learned in HW2). Aside from weight penalties (see [`tf.keras.regularizers`](https://www.tensorflow.org/api_docs/python/tf/keras/regularizers)), there exists an interesting and powerful regularization technique for neural networks: [Dropout](https://cs231n.github.io/neural-networks-2/#reg). It is also implemented in tensorflow keras as [`tf.keras.layers.Dropout`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout).
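As a rough illustration of the idea, the following NumPy sketch applies "inverted" dropout to a batch of activations: each unit is zeroed with probability `rate`, and the surviving units are scaled by `1 / (1 - rate)` so that the expected activation is unchanged. This mirrors the training-time behaviour of the Keras layer; at inference time dropout is simply switched off. The function name `dropout_train` and the toy data are assumptions made for this example.

```python
import numpy as np

def dropout_train(x, rate, rng):
    """Training-mode inverted dropout (hypothetical helper for illustration)."""
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob   # True for units that are kept
    return x * mask / keep_prob              # rescale survivors to preserve the mean

rng = np.random.default_rng(42)              # seeded so the mask is reproducible
hidden = np.ones((1000, 100))                # toy activations, all equal to 1
dropped = dropout_train(hidden, rate=0.2, rng=rng)

print("fraction zeroed:", np.mean(dropped == 0))
print("mean activation:", dropped.mean())
```

Because the mask is random, the result depends on the seed, which is why seeding matters for reproducible validation losses.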

Now, apply any regularization technique to reduce the validation loss for a neural network with the following specifications:
- There should be only one hidden Dense layer with 100 units and `relu` activation.
- You may add any number of `Dropout` layers, but do not add layers with parameters (such as `BatchNormalization`). `Dropout` layers do not have parameters. The dropout rate is a hyperparameter.
- Therefore, the number of parameters is fixed to 79510. This will be tested to ensure **everyone has the same baseline model**.
- You are free to use **any initializer, optimizers and regularizers with the hyperparameters you like**.
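The fixed parameter count can be verified by hand: each `Dense` layer contributes one weight per input-output pair plus one bias per output. The short sketch below does this arithmetic for a fully-connected network; the helper name `dense_param_count` is a hypothetical one used only for this illustration.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully-connected network.

    layer_sizes lists the width of the input followed by each Dense layer.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus bias vector
    return total

# 28x28 input flattened to 784, one hidden layer of 100 units, 10 output classes
print(dense_param_count([28 * 28, 100, 10]))

# The 150-unit network discussed earlier, for comparison
print(dense_param_count([28 * 28, 150, 10]))
```

The first call gives 78400 + 100 + 1000 + 10 = 79510 parameters, and `Dropout` layers add nothing to this total.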

**Submission format:**
- The model compiled with the optimizer and regularizers should be stored in **`model_4`**.
- Run `train_nn(model_4)` to generate the learning curves. Your grade will be determined by the validation loss after 10 epochs of training.
 - You get 100% marks if `history_4.history['val_loss'][9] < 0.20`.
 - You get 75% marks if `0.20 <= history_4.history['val_loss'][9] < 0.21`.
 - You get 50% marks if `0.21 <= history_4.history['val_loss'][9] < 0.22`.
 - You get 25% marks if `0.22 <= history_4.history['val_loss'][9] < 0.23`.

 **Important:** To make sure your validation loss is reproducible by the grader, **seed everything you use that requires random number generation**. This may include the `kernel_initializer` and the `Dropout` layer.

In [71]:
clear_session()
## Your code here
...

model_4 = ...

In [72]:
history_4 = train_nn(model_4,epochs=10)
fig, AX = plt.subplots(nrows=1,ncols=2,figsize=(9,4.5))
plot_learning(history_4,'Prob 4','C0',AX[0],AX[1])

<!--
BEGIN QUESTION
name: q41
manual: false
points: 5
-->

In [73]:
# test for val loss < 0.20 (5 points)

In [None]:
grader.check("q41")

<!--
BEGIN QUESTION
name: q42
manual: false
points: 5
-->

In [76]:
# test for val loss < 0.21 (5 points)

In [None]:
grader.check("q42")

<!--
BEGIN QUESTION
name: q43
manual: false
points: 5
-->

In [79]:
# test for val loss < 0.22 (5 points)

In [None]:
grader.check("q43")

<!--
BEGIN QUESTION
name: q44
manual: false
points: 5
-->

In [82]:
# test for val loss < 0.23 (5 points)

In [None]:
grader.check("q44")

## 5. Hyperparameter Tuning
**(Total 30 points)**

Equipped with all the techniques we just learned, you are ready to build your own neural network!
- For this question, we unlock the full training dataset. Your model will be trained on 80% of `(X_train_full,y_train_full)` for 5 epochs and validated on the remaining 20%.
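The 80/20 split is not random: according to the Keras documentation, `validation_split` takes the validation data from the last samples of the arrays passed to `fit`, before any shuffling. The sketch below mimics that behaviour with NumPy on stand-in arrays; the helper name `split_like_validation_split` and the dummy data are assumptions for illustration, not the notebook's own code.

```python
import numpy as np

def split_like_validation_split(x, y, validation_split):
    """Hold out the last fraction of the samples, as Keras is documented to do."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Stand-ins for the 60000 MNIST training images and labels
x = np.arange(60000)
y = x % 10
(train_x, train_y), (val_x, val_y) = split_like_validation_split(x, y, 0.2)

print(len(train_x), len(val_x))
print("first validation index:", val_x[0])
```

With 60000 training entries, this leaves 48000 samples for training and the final 12000 for validation.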

In [85]:
# Do not execute this cell more than once (the pixels would be divided by 255 again); otherwise you will need to reload MNIST
X_train_full = np.array(X_train_full,dtype=np.float16)/255

- There are no restrictions on the model structure, but make sure your training can be **finished in a reasonable time span** (say, < 2 minutes.) 
- **"accuracy"** should be added to the tracked metrics when you compile the model.
- Store your final model in **`model_final`**. Your model's validation accuracy after 5 epochs of training will determine the points you get. The code cell following the problem cell handles the training and validation.
 - You get 100% marks if the validation score is > 97.00%.
 - You get 75% marks if the validation score is > 96.75% but <= 97.00%.
 - You get 50% marks if the validation score is > 96.50% but <= 96.75%.
 - You get 25% marks if the validation score is > 96.25% but <= 96.50%.

 **Important:** To make sure your validation accuracy is reproducible by the grader, **seed everything you use that requires random number generation**. This may include the `kernel_initializer` and the `Dropout` layer.


In [86]:
clear_session()
## Your code here
...
model_final = ...

In [87]:
from timeit import default_timer

assert (model_final.history is None),('Do not pass a model that has been trained before')
start_time = default_timer()
history = model_final.fit(X_train_full,y_train_full,validation_split=0.2,
                          epochs=5,batch_size=32,shuffle=False,verbose=1)
end_time = default_timer()
print ('Elapsed training time: {0}s'.format(end_time-start_time))
fig, AX = plt.subplots(nrows=1,ncols=2,figsize=(9,4.5))
plot_learning(history,'{0}'.format(''), 'C1',AX[0],AX[1])
print ('validation accuracy:', history.history['val_accuracy'][-1])

<!--
BEGIN QUESTION
name: q51
manual: false
points: 7.5
-->

In [88]:
# test for accuracy > 0.9700 (7.5 points)

In [None]:
grader.check("q51")

<!--
BEGIN QUESTION
name: q52
manual: false
points: 7.5
-->

In [90]:
# test for accuracy > 0.9675 (7.5 points)

In [None]:
grader.check("q52")

<!--
BEGIN QUESTION
name: q53
manual: false
points: 7.5
-->

In [92]:
# test for accuracy > 0.9650 (7.5 points)

In [None]:
grader.check("q53")

<!--
BEGIN QUESTION
name: q54
manual: false
points: 7.5
-->

In [94]:
# test for accuracy > 0.9625 (7.5 points)

In [None]:
grader.check("q54")

Finally, we have the test accuracy.

In [96]:
model_final.evaluate(X_test,y_test)

# Submit
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output.
**Please save before submitting!**

In [None]:
# Save your notebook first, then run this cell to create a pdf for your reference.