# Chapter-10: Deep Learning Hyperparameters

## Regularization
In regularization, we choose a relatively large neural network. Since a large network tends to overfit, we place constraints on the weights so that they stay small. This way, we keep enough nodes to capture the complexity of the data, but because no node is used at its full strength, overfitting is kept in check. In simple words, instead of dropping hidden nodes, we keep them but with smaller weights.


### Regularization in regression
In regression, we try to minimize the sum of squared errors. Regression will overfit if there are too many polynomial terms in the list of predictor variables. Overdoing feature engineering and creating too many derived variables can also lead to overfitting. With regularization, we can still retain a large number of features.

Below is a worked example on a small data set that demonstrates how regularization works in regression.

In [None]:
import pandas as pd
import numpy as np

The data set

In [None]:
x=[-0.99768,-0.69574,-0.40373,-0.10236,0.22024,0.47742,0.82229]
y=[2.0885,1.1646,0.3287,0.46013,0.44808,0.10013,-0.32952]

input_data = pd.DataFrame(list(zip(x, y)), columns =['x', 'y']) 
print(input_data)

Plotting the data

In [None]:
x = np.array(input_data.x)
y = input_data.y
#scatter plot x and y
import matplotlib.pyplot as plt
%matplotlib inline
plt.title("Input data", fontsize=20)
plt.scatter(x,y,s=50,c="g")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

We will now build a simple regression line, a second-order polynomial regression, and a fifth-order
polynomial regression. The simple regression will underfit the above data, and the fifth-order
polynomial will overfit it.

Simple regression

In [None]:
import statsmodels.api as sm
x1 = sm.add_constant(x)
m1 = sm.OLS(y,x1).fit()
#SSE
print("m1 SSE", m1.ssr)

Second order polynomial regression

In [None]:
x2 = sm.add_constant(np.column_stack([x,np.square(x)]))
m2 = sm.OLS(y,x2).fit()
print("m2 SSE", m2.ssr)

Fifth order polynomial regression

In [None]:
x3 = sm.add_constant(np.column_stack([x, np.power(x,2),np.power(x,3),np.power(x,4),np.power(x,5)]))
m3 = sm.OLS(y,x3).fit()
print("m3 SSE", m3.ssr)

We will take the fifth-order polynomial model, which is already overfitted.

We will now use the regularization parameter. The weights will be newly calculated using a
regularized cost function. We will get all six weights, but as lambda increases, the weights shrink.
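
The calculation in the next cell can be written compactly as follows. This is a sketch inferred from the code rather than a formula given in the text: $\lambda$ stands for the `reg` variable, $D$ for the matrix `d` (an identity matrix whose first diagonal entry is zero, so the intercept $w_0$ is not penalized), and $x_i^{\top}$ for row $i$ of `X`.

$$
J(w) = \sum_{i=1}^{n} \left(y_i - x_i^{\top} w\right)^2 + \lambda \sum_{j=1}^{5} w_j^2
$$

Setting the gradient of $J$ to zero gives the linear system that the code solves:

$$
\left(X^{\top} X + \lambda D\right) w = X^{\top} y
$$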

In [None]:
X = x3
y = np.array(y)
n_col = X.shape[1]
d = np.identity(n_col)
d[0,0] = 0
w = []

# Solve the regularized normal equations (X'X + lambda*d) w = X'y for each lambda.
# rcond=None selects NumPy's current default cutoff and avoids a warning on older versions.
for reg in (0, 1, 10):
    w.append(np.linalg.lstsq(X.T.dot(X) + reg * d, X.T.dot(y), rcond=None)[0])

In [None]:
print("Regularized weights  lambda=0 \n", w[0])
print("Regularized weights  lambda=1 \n", w[1])
print("Regularized weights  lambda=10 \n", w[2])

With lambda = 0, we should see the same weights as the fifth-order polynomial regression that we built previously. Let us compare these two models. Below is the code for fetching the old and new weights.

In [None]:
print("Regularized Weights With lambda = 0 \n", list(w[0]))
print("Standard Weights With inbuilt package \n",list(m3.params))

As expected the regularized weights with lambda=0 are the same as standard weights. Now we will
see the plot and observe the results from the three models.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (8,6)
plt.title('Model results for different lambda values', fontsize=20)
plt.scatter(x,y, s = 50, c = "g")
x_new = np.linspace(x.min(), x.max(), 200)
# raw strings (r'...') keep the backslash in \lambda from being read as an escape sequence
plt.plot(x_new, np.poly1d(np.polyfit(x, X.dot(w[0]), 5))(x_new),label=r'$\lambda$ = 0', c = "b")
plt.plot(x_new, np.poly1d(np.polyfit(x, X.dot(w[1]), 5))(x_new),label=r'$\lambda$ = 1', c = "r")
plt.plot(x_new, np.poly1d(np.polyfit(x, X.dot(w[2]), 5))(x_new),label=r'$\lambda$ = 10', c = "g")
plt.legend(loc='upper right');
plt.show()

Below are the observations from the above graph. The blue line is the model with lambda=0; this
model is overfitted. The green line is the model with lambda=10; this model is underfitted. The
red line is the model with lambda=1; it is better than the other two models.

In conclusion, all three models are built with a fifth-order polynomial. By
changing the value of the regularization parameter, we can avoid overfitting. From the above output, we
can choose the model with lambda=1.
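
The effect of lambda can also be checked numerically. The sketch below rebuilds the same fifth-order design matrix from the seven data points and reports the size of the penalized weights and the training SSE for a few lambda values. The names `x_demo`, `y_demo`, `X_demo`, `D_demo` and `ridge_fit` are illustrative and not part of the original notebook, and the extra value lambda=0.1 is added only for comparison.

```python
import numpy as np

# Same seven data points as in the worked example above.
x_demo = np.array([-0.99768, -0.69574, -0.40373, -0.10236, 0.22024, 0.47742, 0.82229])
y_demo = np.array([2.0885, 1.1646, 0.3287, 0.46013, 0.44808, 0.10013, -0.32952])

# Design matrix with columns 1, x, x^2, ..., x^5.
X_demo = np.vander(x_demo, N=6, increasing=True)

# Penalty matrix: identity with a zero in the intercept position.
D_demo = np.identity(X_demo.shape[1])
D_demo[0, 0] = 0

def ridge_fit(lam):
    """Solve (X'X + lam * D) w = X'y for the regularized weights."""
    return np.linalg.solve(X_demo.T @ X_demo + lam * D_demo, X_demo.T @ y_demo)

lambdas = [0, 0.1, 1, 10]
weight_norms = []
train_sse = []
for lam in lambdas:
    w_lam = ridge_fit(lam)
    weight_norms.append(np.linalg.norm(w_lam[1:]))            # size of the penalized weights
    train_sse.append(np.sum((y_demo - X_demo @ w_lam) ** 2))  # error on the training points
    print(f"lambda={lam:>4}: ||w||={weight_norms[-1]:.3f}, train SSE={train_sse[-1]:.4f}")
```

The weight norm falls and the training SSE rises as lambda grows. Training SSE alone therefore always favors lambda=0, which is why it cannot be the only criterion for choosing lambda.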

In [None]:
#weights
print("Final Weights \n", w[1])
#prediction
pred = X.dot(w[1])
#SSE
SSE_Final = sum(np.square(y-pred))
print("Final SSE ", SSE_Final)

Importing the required packages and libraries for the neural network examples

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

Loading the dataset

In [None]:
(X_train, Y_train), (X_test, Y_test) = keras.datasets.mnist.load_data()
num_classes=10
x_train = X_train.reshape(60000, 784)
x_test = X_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

## Convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(Y_train, num_classes)
y_test = keras.utils.to_categorical(Y_test, num_classes)

print(x_train.shape, 'train input samples')
print(x_test.shape, 'test input samples')

print(y_train.shape, 'train output samples')
print(y_test.shape, 'test output samples')

### L1 and L2 regularization
* In regression, adding too many polynomial terms led to overfitting. We did not reduce the number of
polynomial terms; instead, we used a regularized cost function. Until now, the penalty we added to the cost function was the sum of squares of the weights. This method is known as the L2 norm or L2 regularization. A regression model with this method is known as Ridge regression.
* Alternatively, we can use a cost function that penalizes the sum of absolute values of the weights.
This method is known as the L1 norm or L1 regularization. In regression, it is known as Lasso regression.
In Keras, we can use the kernel_regularizer parameter inside the layers.Dense() function. Below is the code
for building the model without regularization and with regularization.

When we build a model with too many hidden nodes, the accuracy on the train data is high
starting from the first epoch. The same configuration with regularization shows lower accuracy in
its epochs while training.

* without regularization

In [None]:
model = keras.Sequential()
model.add(layers.Dense(256, activation='sigmoid', input_shape=(784,)))
model.add(layers.Dense(128, activation='sigmoid'))
model.add(layers.Dense(10, activation='softmax'))
model.summary()

In [None]:
model.compile(loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train,epochs=10)

In [None]:
#Final Results
loss, acc = model.evaluate(x_train,  y_train, verbose=2)
print("Train Accuracy: {:5.2f}%".format(100*acc))

loss, acc = model.evaluate(x_test,  y_test, verbose=2)
print("Test Accuracy: {:5.2f}%".format(100*acc))

We can see slight overfitting in this model. We can see above 90% accuracy from the second epoch
itself. We will build the same model with regularization now.

* with regularization

In [None]:
from tensorflow.keras import regularizers
model_r = keras.Sequential()
model_r.add(layers.Dense(256, activation='sigmoid', input_shape=(784,), kernel_regularizer=regularizers.l2(0.01)))
model_r.add(layers.Dense(128, activation='sigmoid',kernel_regularizer=regularizers.l2(0.01)))
model_r.add(layers.Dense(10, activation='softmax'))
model_r.summary()

In [None]:
model_r.compile(loss='categorical_crossentropy', metrics=['accuracy'])
model_r.fit(x_train, y_train,epochs=10)

In [None]:
#Final Results
loss, acc = model_r.evaluate(x_train,  y_train, verbose=2)
print("Train Accuracy: {:5.2f}%".format(100*acc))

loss, acc = model_r.evaluate(x_test,  y_test, verbose=2)
print("Test Accuracy: {:5.2f}%".format(100*acc))

We can now see the impact of regularization on the weights. Epoch by epoch comparison on the
train data reveals the impact of the penalty on weights. The final model also shows no signs of
overfitting. We can apply L1 regularization also in a similar manner using the parameter
kernel_regularizer=regularizers.l1(0.01)

### Dropout regularization
There is one more regularization that helps us to reduce the dominance of individual hidden nodes. The dropout method is another effective way of avoiding overfitting. Ignoring a few hidden nodes while training the model is called the dropout method.

There are three crucial points to note in the dropout method.
* The first point is that dropout is not applied at the overall network level; it is applied at each
hidden layer level. We can even apply dropout on a few layers and keep all nodes in the rest
of the layers.
* The second point is that the dropout happens at each iteration. We do not drop the nodes once
and then train the network for all iterations. We drop nodes randomly in each iteration, so we
cannot guess what the network architecture is in a given iteration.
* The third point is about weights. All the weights are considered at the time of prediction.
Each weight is multiplied by q; here q=1-p, where p is the dropout rate. (Keras obtains the same effect by scaling the retained outputs by 1/q during training instead, so no rescaling is needed at prediction time.)
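
The three points can be sketched with NumPy for a single hidden layer. This is a minimal illustration under stated assumptions, not the Keras implementation: the layer outputs are random made-up numbers, and the rescaling by q is applied to the node outputs, which is equivalent to multiplying the outgoing weights by q.

```python
import numpy as np

rng = np.random.default_rng(0)    # fixed seed so the sketch is repeatable
p = 0.7                           # dropout rate: probability of dropping a node
q = 1 - p                         # probability of keeping a node

hidden_out = rng.random((4, 10))  # made-up outputs of one hidden layer (4 samples, 10 nodes)

# Training: draw a fresh random mask in every iteration and zero the dropped nodes.
keep_mask = rng.random(hidden_out.shape) < q
train_out = hidden_out * keep_mask

# Prediction: keep every node, but scale by q to match the expected training output.
predict_out = hidden_out * q

print("nodes kept per sample:", keep_mask.sum(axis=1))
```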

We need to specify dropout as a layer. It is an imaginary layer that is applied to the hidden layers.
The dropout rate can vary from layer to layer.

In [None]:
from tensorflow.keras.layers import Dropout
model_rd = keras.Sequential()

model_rd.add(layers.Dense(256, activation='sigmoid', input_shape=(784,)))
model_rd.add(Dropout(0.7))

model_rd.add(layers.Dense(128, activation='sigmoid'))
model_rd.add(Dropout(0.6))

model_rd.add(layers.Dense(10, activation='softmax'))
model_rd.summary()

From the above code, we can see a dropout layer after each hidden layer. We chose p=0.7 for the first
hidden layer and p=0.6 for the second, which means that 70% of the nodes will be dropped from the
first hidden layer and 60% of the nodes will be dropped from the second hidden layer. At any given
iteration, we will see on average only about 77 nodes in the first hidden layer and 51 nodes in the
second layer.
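
The node counts quoted above follow from multiplying each layer size by its keep probability 1-p. A short check, with illustrative variable names:

```python
layer_sizes = [256, 128]    # nodes in the two hidden layers
drop_rates = [0.7, 0.6]     # dropout rate applied after each hidden layer

# Expected number of nodes that stay active in one training iteration.
expected_active = [n * (1 - p_drop) for n, p_drop in zip(layer_sizes, drop_rates)]
print([round(v) for v in expected_active])   # [77, 51]
```

These are averages; the actual number of active nodes varies from one iteration to the next.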

Dropout is an abstract layer with zero nodes. We can now train the model. In each epoch, we will
see lower accuracy, as only a subset of the nodes is active during training. Finally, when we
calculate the accuracy on train and test data, where all the nodes are used, we will see a higher value.

In [None]:
model_rd.compile(loss='categorical_crossentropy', metrics=['accuracy'])
model_rd.fit(x_train, y_train,epochs=10)

In [None]:
#Final Results
loss, acc = model_rd.evaluate(x_train,  y_train, verbose=2)
print("Train Accuracy: {:5.2f}%".format(100*acc))

loss, acc = model_rd.evaluate(x_test,  y_test, verbose=2)
print("Test Accuracy: {:5.2f}%".format(100*acc))

We can see from the output the model is free of overfitting.

### Early stopping method
The early stopping method follows a simple approach to avoid overfitting. If there are too many
hidden nodes and layers, then the accuracy on train data increases as the number of epochs
increases. With sufficient hidden nodes and sufficient epochs, the accuracy might even reach 100%.
The accuracy on validation data, however, stops improving at some point; early stopping halts the
training at that point.

First, we need to know how to store the model and its weights in each epoch. For storing the model
weights, we use the h5py package. Using this package, we can store the model weights in a file. The
model weights file will have the hdf5 extension.

In [None]:
model_re = keras.Sequential()
model_re.add(layers.Dense(256, activation='sigmoid', input_shape=(784,)))
model_re.add(layers.Dense(128, activation='sigmoid'))
model_re.add(layers.Dense(10, activation='softmax'))
model_re.summary()

In [None]:
model_re.compile(loss='categorical_crossentropy', metrics=['accuracy'])
#Enable saving checkpoints
# Save a checkpoint at the end of every epoch
import os
from tensorflow.keras.callbacks import ModelCheckpoint
import h5py

# checkpoint
# create the directory that stores the checkpoints: "early_stopping_checkpoints"
os.makedirs("early_stopping_checkpoints", exist_ok=True)
# forward slashes avoid backslash-escape problems in the path string
# note: recent Keras releases (Keras 3) may require the file name to end in .keras or .h5
checkpoint = ModelCheckpoint("early_stopping_checkpoints/epoch-{epoch:02d}.hdf5")
model_re.fit(x_train, y_train,epochs=10,validation_data=(x_test, y_test),callbacks=[checkpoint])

The above code saves the model weight files inside the directory that we have mentioned in the
code. We can load the model weights from a particular epoch. Imagine that the model starts
overfitting after epoch 7; then we can load the weights from epoch 7 using the
load_weights() function.

In [None]:
model_re.load_weights("early_stopping_checkpoints/epoch-07.hdf5")# change the file name to the epoch you want to load

We can either perform early stopping manually using the above approach of storing weights and
calling load_weights(), or directly use the keras.callbacks.EarlyStopping() function. In this
function, we have to mention the validation measure and the minimum improvement on the test data
that we want to see in each iteration.

In [None]:
model_re = keras.Sequential()
model_re.add(layers.Dense(256, activation='sigmoid', input_shape=(784,)))
model_re.add(layers.Dense(128, activation='sigmoid'))
model_re.add(layers.Dense(10, activation='softmax'))

model_re.compile(loss='categorical_crossentropy', metrics=['accuracy'])

es = keras.callbacks.EarlyStopping(monitor='val_accuracy',
                              min_delta=0.01,
                              patience=2)

#train the model with call back method
model_re.fit(x_train, y_train, epochs=30,validation_data=(x_test, y_test), callbacks=[es])

In the above code
* monitor - The measure to monitor, accuracy or loss.
* min_delta - The minimum improvement required in each step.
* patience - How many epochs to wait after the termination condition has been
reached. Sometimes test accuracy seems to decrease in one epoch but increases again
after that; to avoid stopping too strictly, we can use the patience parameter. The model will wait
for a few more epochs before exiting.
* Overall, the above code means: terminate the model building when the accuracy
improvement on the validation data is less than 0.01 for two consecutive epochs.

Though we have mentioned 30 epochs, the model will exit when it reaches the min_delta
termination criteria.
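
The interplay of min_delta and patience can be traced by hand on a short list of validation accuracies. The function below is a simplified sketch of that logic, not the Keras source; the name `early_stopping_epoch` and the accuracy values are made up for illustration.

```python
def early_stopping_epoch(val_scores, min_delta=0.01, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score - min_delta > best:
            # Improvement of more than min_delta over the best score so far.
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation accuracies, one per epoch.
demo_scores = [0.90, 0.93, 0.935, 0.938, 0.95]
stop_epoch = early_stopping_epoch(demo_scores)
print("training stops after epoch", stop_epoch)
```

Here epochs 3 and 4 both improve on the best score by less than 0.01, so training stops after epoch 4 and never reaches the better score in epoch 5. A larger patience value would have let it continue.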

## Activation function
Below is the example code that shows how to configure a network with different activation
functions. We can mention different activation functions in different layers.

In [None]:
model2 = keras.Sequential()
model2.add(layers.Dense(15, activation='sigmoid', input_shape=(784,)))
model2.add(layers.Dense(15, activation='relu'))
model2.add(layers.Dense(15, activation='tanh'))
model2.add(layers.Dense(15, activation='relu'))
model2.add(layers.Dense(10, activation='softmax'))
model2.summary()

In [None]:
model2.compile(loss='categorical_crossentropy', metrics=['accuracy'])
model2.fit(x_train, y_train,epochs=10)
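
For reference, the activation functions named in the model above can be written in a few lines of NumPy. This is a minimal sketch of the standard definitions; the helper names and the input values are illustrative and are not taken from Keras.

```python
import numpy as np

def sigmoid(z):
    """Squash values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Keep positive values and replace negative values with zero."""
    return np.maximum(0.0, z)

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)          # subtract the max for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

z_demo = np.array([-2.0, 0.0, 3.0])  # made-up pre-activation values
print("sigmoid:", sigmoid(z_demo))
print("relu:   ", relu(z_demo))
print("tanh:   ", np.tanh(z_demo))
print("softmax:", softmax(z_demo))
```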

## Learning function
To reach the minimum value of the error, we move the weights in the direction in which the overall error
reduces. The negative gradient of the error gives us this direction. We multiply the gradient term
by a factor known as the learning rate. By increasing or decreasing the value of the learning rate,
we can dictate how much the weights should move in one iteration.
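
The role of the learning rate is easiest to see on a one-dimensional toy problem. The sketch below minimizes the error f(w) = (w - 3)^2 with three learning rates; the error function, the function name `gradient_descent`, and the chosen rates are assumptions made for illustration and are unrelated to the MNIST models.

```python
def gradient_descent(learning_rate, steps=20, w_start=0.0):
    """Minimize the toy error f(w) = (w - 3)^2 and return the final w."""
    w_toy = w_start
    for _ in range(steps):
        grad = 2.0 * (w_toy - 3.0)            # derivative of (w - 3)^2
        w_toy = w_toy - learning_rate * grad  # move against the gradient
    return w_toy

final_w = {lr: gradient_descent(lr) for lr in (1.0, 0.001, 0.1)}
for lr, w_final in final_w.items():
    print(f"learning rate {lr}: w = {w_final:.4f}, distance from minimum = {abs(w_final - 3):.4f}")
```

With a rate of 1.0 the weight jumps back and forth between two points and never gets closer to the minimum; with 0.001 it moves in the right direction but has barely progressed after 20 steps; with 0.1 it ends up close to the minimum at w = 3.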

The learning rate is part of the gradient descent function. We need to mention the learning rate in
the optimizer function. For example, tf.keras.optimizers.SGD() is an optimizer function. SGD is
Stochastic Gradient Descent, a specific type of gradient descent function. Below is the code to mention
the learning rate in the optimizer function.

In [None]:
model3 = keras.Sequential()
model3.add(layers.Dense(20, activation='sigmoid', input_shape=(784,)))
model3.add(layers.Dense(20, activation='sigmoid'))
model3.add(layers.Dense(10, activation='softmax'))
model3.summary()

#High Learning Rate
opt_new = tf.keras.optimizers.SGD(learning_rate=10)
model3.compile(optimizer=opt_new, loss='categorical_crossentropy', metrics=['accuracy'])
model3.fit(x_train, y_train,epochs=20)

From the above output we can see that the model got stuck at an accuracy of 0.2. The weights must
have been oscillating between two points, not able to penetrate further inside a minimum. Now we
will try with a very low learning rate.

In [None]:
model3 = keras.Sequential()
model3.add(layers.Dense(20, activation='sigmoid', input_shape=(784,)))
model3.add(layers.Dense(20, activation='sigmoid'))
model3.add(layers.Dense(10, activation='softmax'))
model3.summary()

#Low learning rate
opt_new = tf.keras.optimizers.SGD(learning_rate=0.00001)
model3.compile(optimizer=opt_new, loss='categorical_crossentropy', metrics=['accuracy'])
model3.fit(x_train, y_train,epochs=20)

We can see from the above output that the model is either stuck at a local minimum or model is
extremely slow in learning. Now we will try building the model with medium learning rate.

In [None]:
model3 = keras.Sequential()
model3.add(layers.Dense(20, activation='sigmoid', input_shape=(784,)))
model3.add(layers.Dense(20, activation='sigmoid'))
model3.add(layers.Dense(10, activation='softmax'))
model3.summary()

#Optimal learning rate
opt_new = tf.keras.optimizers.SGD(learning_rate=0.01)
model3.compile(optimizer=opt_new, loss='categorical_crossentropy', metrics=['accuracy'])
model3.fit(x_train, y_train,epochs=20)

There is no shortcut for reaching the optimal learning rate. Since the optimal learning rate is a range
of values, it will not be an enormous challenge to find the optimal learning rate.

## Momentum
The learning rate can be made better by adding one more factor called momentum. In the
initial epochs, the weights change by larger values. As the error function approaches its
minimum value, the delta weights become smaller. If we are reaching a local minimum, then
momentum can help push us out of the local minimum and help the algorithm converge much faster.

Momentum should be seen as an aid to the learning rate; it is not an alternative to the learning
rate. Below is the code to include momentum along with the learning rate.

In [None]:
model3 = keras.Sequential()
model3.add(layers.Dense(20, activation='sigmoid', input_shape=(784,)))
model3.add(layers.Dense(20, activation='sigmoid'))
model3.add(layers.Dense(10, activation='softmax'))
model3.summary()

#Optimal learning rate with momentum
opt_new = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.5)
model3.compile(optimizer=opt_new, loss='categorical_crossentropy', metrics=['accuracy'])
model3.fit(x_train, y_train,epochs=20)

The above code is the same as the previous model. We have included a momentum term
additionally.

We can observe faster convergence after including the momentum term. This parameter comes in
very handy for wide and deep neural networks.
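
The speed-up can be reproduced on a toy problem. The sketch below applies the classical momentum update (velocity = momentum * velocity - learning_rate * gradient, then w = w + velocity) to the error f(w) = (w - 3)^2, using the same learning rate and momentum values as the model above. The toy error and the function name `descend` are assumptions made for illustration.

```python
def descend(learning_rate, momentum, steps=100, w_start=0.0):
    """Minimize the toy error f(w) = (w - 3)^2 with the classical momentum update."""
    w_toy = w_start
    velocity = 0.0
    for _ in range(steps):
        grad = 2.0 * (w_toy - 3.0)                             # derivative of (w - 3)^2
        velocity = momentum * velocity - learning_rate * grad  # blend old step with new gradient
        w_toy = w_toy + velocity
    return w_toy

plain = descend(learning_rate=0.01, momentum=0.0)
with_momentum = descend(learning_rate=0.01, momentum=0.5)
print(f"without momentum: distance from minimum = {abs(plain - 3):.4f}")
print(f"with momentum:    distance from minimum = {abs(with_momentum - 3):.4f}")
```

After the same number of steps, the run with momentum ends up much closer to the minimum, because each step carries over part of the previous step in the same direction.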

## Optimizers
We discussed the gradient descent algorithm for updating the weights. The original theory of
Gradient Descent uses all the data to calculate the gradients to update the weights.

If there are millions of points, then calculating the gradient for all those points in one go takes a
long time.

### SGD (Stochastic Gradient Descent)
Stochastic Gradient Descent, commonly known as SGD, uses an estimate of the gradient instead of the
actual gradient. SGD approximates the overall gradient using a single point. This makes sure that
the individual gradients are calculated faster and the weights are updated rapidly.



### Minibatch Gradient Descent
We will take SGD and make a small modification to it. Instead of calculating gradients for every
single record, we will make a batch of records: we take a small subset of the data and calculate the
gradient on it to update the weights.
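
The idea can be sketched in NumPy for a linear regression, where the gradient is easy to write down. This is an illustrative sketch, not what Keras does internally: the function name `minibatch_gd`, the toy data, and the default settings are all assumptions. Setting batch_size to the number of records gives full-batch gradient descent, and setting it to 1 gives SGD.

```python
import numpy as np

def minibatch_gd(features, targets, batch_size, learning_rate=0.1, epochs=50, seed=0):
    """Fit linear-regression weights with mini-batch gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = features.shape
    weights = np.zeros(n_cols)
    for _ in range(epochs):
        order = rng.permutation(n_rows)                 # reshuffle the records every epoch
        for start in range(0, n_rows, batch_size):
            batch = order[start:start + batch_size]
            errors = features[batch] @ weights - targets[batch]
            grad = 2.0 * features[batch].T @ errors / len(batch)  # gradient on this batch only
            weights -= learning_rate * grad             # one weight update per batch
    return weights

# Made-up noiseless data: y = 1 + 2 * x, with a column of ones for the intercept.
x_toy = np.linspace(-1, 1, 200)
features_toy = np.column_stack([np.ones_like(x_toy), x_toy])
targets_toy = 1.0 + 2.0 * x_toy

for size in (len(x_toy), 1, 32):   # full-batch GD, SGD, mini-batch
    print(f"batch_size={size:>3}: weights = {minibatch_gd(features_toy, targets_toy, size)}")
```

With the same number of epochs, the full-batch run makes only one update per epoch, whereas the smaller batch sizes update the weights many times per epoch and get closer to the true weights (1, 2).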

In [None]:
model4 = keras.Sequential()
model4.add(layers.Dense(20, activation='sigmoid', input_shape=(784,)))
model4.add(layers.Dense(20, activation='sigmoid'))
model4.add(layers.Dense(10, activation='softmax'))
model4.summary()


opt_new = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.5)
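# pass the SGD optimizer created above; without it, compile() falls back to its default optimizer
model4.compile(optimizer=opt_new, loss='categorical_crossentropy', metrics=['accuracy'])

#Batch size=full data (GD)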
model4.fit(x_train, y_train,batch_size=x_train.shape[0], epochs=10)

We need to pass the batch size parameter in the model.fit() function.

From the above output, we can observe that the loss has not reduced drastically even after 10
epochs. Now we will build the model with SGD, where the batch size is 1.

In [None]:
model4 = keras.Sequential()
model4.add(layers.Dense(20, activation='sigmoid', input_shape=(784,)))
model4.add(layers.Dense(20, activation='sigmoid'))
model4.add(layers.Dense(10, activation='softmax'))
model4.summary()


opt_new = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.5)
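# pass the SGD optimizer created above; without it, compile() falls back to its default optimizer
model4.compile(optimizer=opt_new, loss='categorical_crossentropy', metrics=['accuracy'])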

#Batch size=1 (SGD)
model4.fit(x_train, y_train,batch_size=1, epochs=2)

From the above output, we can observe that within two epochs, we achieved high accuracy.
However, the problem is in execution time. Each epoch takes a significantly higher amount of time.
Now we will build the third model with a batch size of 512.

In [None]:
model4 = keras.Sequential()
model4.add(layers.Dense(20, activation='sigmoid', input_shape=(784,)))
model4.add(layers.Dense(20, activation='sigmoid'))
model4.add(layers.Dense(10, activation='softmax'))
model4.summary()


opt_new = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.5)
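# pass the SGD optimizer created above; without it, compile() falls back to its default optimizer
model4.compile(optimizer=opt_new, loss='categorical_crossentropy', metrics=['accuracy'])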

#Batch size = 512
model4.fit(x_train, y_train,batch_size=512, epochs=10)

This output shows better results than GD and SGD. Mini-batch GD is used in solving real business
problems since it is better than the other two in terms of execution time and accuracy.
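
One way to see the trade-off is to count how many weight updates each setting makes in one epoch over the 60,000 training images. The short calculation below uses illustrative variable names:

```python
import math

n_train = 60000   # number of MNIST training images used above

# Number of weight updates in one epoch for each batch size tried above.
updates_per_epoch = {b: math.ceil(n_train / b) for b in (60000, 1, 512)}
for batch_size, updates in updates_per_epoch.items():
    print(f"batch_size={batch_size:>5}: {updates} weight updates per epoch")
```

Full-batch GD updates the weights once per epoch, so ten epochs give only ten updates. SGD makes 60,000 small updates per epoch, which learns quickly per epoch but runs slowly. A batch size of 512 sits in between with 118 updates per epoch.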