### 1.	Is it OK to initialize all the weights to the same value as long as that value is selected randomly using He initialization?

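No. If all the weights in a layer start at the same value, every neuron in that layer computes the same output and receives the same gradient, so they all learn the same features during training. It does not matter that the shared value was drawn using He initialization: the weights must be sampled independently to break the symmetry. In fact, any constant initialization scheme will perform very poorly.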
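
To make the symmetry argument concrete, here is a minimal framework-free sketch. The tiny network (2 inputs, 2 tanh hidden units, 1 linear output), the squared-error loss, and the constant 0.5 are illustrative assumptions, not part of the exercise. With every weight set to the same constant, both hidden units receive exactly the same gradient, so no update can ever make them differ.

```python
import math

def hidden_gradients(w_hidden, w_out, x, y):
    """Gradient of 0.5 * (y_hat - y)**2 w.r.t. each hidden unit's weights.

    Network: h_j = tanh(w_hidden[j] . x), y_hat = sum_j w_out[j] * h_j.
    """
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y_hat = sum(wo * hj for wo, hj in zip(w_out, h))
    err = y_hat - y
    grads = []
    for wo, hj in zip(w_out, h):
        delta = err * wo * (1.0 - hj * hj)  # backpropagate through tanh
        grads.append([delta * xi for xi in x])
    return grads

c = 0.5  # one shared value for every weight (illustrative)
grads = hidden_gradients([[c, c], [c, c]], [c, c], x=[1.0, -2.0], y=1.0)
print(grads[0] == grads[1])  # prints True: both hidden units get the same update
```

Drawing each weight independently makes the two gradient vectors differ, which is what lets the units specialize.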

### 2.	Is it OK to initialize the bias terms to 0?

Yes. Setting the biases to 0 does not create any problems: the randomly initialized, non-zero weights already break the symmetry, so even with zero biases every neuron computes a different value.

### 3.	Name three advantages of the SELU activation function over ReLU.

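1) Like ReLU, SELU does not saturate for positive inputs, which helps avoid vanishing gradients in deep networks. Unlike ReLU, it can also output negative values, so the mean activation of each layer stays closer to 0.

2) In contrast to ReLUs, SELUs cannot die: the gradient is non-zero for negative inputs, so a unit can always recover.

3) Under the right conditions (standardized inputs, LeCun normal initialization, a plain stack of dense layers), a SELU network self-normalizes, so it often learns faster and better than networks using other activation functions, even when those are combined with batch normalization.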
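
The second point can be checked numerically. The sketch below is plain Python and uses the standard SELU constants; the helper names `selu`, `selu_grad`, and `relu_grad` are introduced here for illustration only. It compares the gradients of SELU and ReLU for negative and positive inputs.

```python
import math

# Standard SELU constants (alpha and scale)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """scale * z for z > 0, scale * alpha * (exp(z) - 1) otherwise."""
    return SELU_SCALE * (z if z > 0 else SELU_ALPHA * (math.exp(z) - 1.0))

def selu_grad(z):
    """Derivative of SELU: always strictly positive."""
    return SELU_SCALE * (1.0 if z > 0 else SELU_ALPHA * math.exp(z))

def relu_grad(z):
    """Derivative of ReLU: exactly 0 for negative inputs."""
    return 1.0 if z > 0 else 0.0

for z in (-3.0, -1.0, 2.0):
    print(z, round(selu(z), 4), round(selu_grad(z), 4), relu_grad(z))
```

For negative inputs the ReLU gradient is exactly 0, while the SELU gradient is small but positive, so the unit still receives a learning signal.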

### 4.	In which cases would you want to use each of the following activation functions: SELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?

`Sigmoid / logistic`

- It is commonly used in the output layer of models that must predict a probability, for example in binary classification. Since a probability always lies between 0 and 1, the sigmoid's output range makes it the natural choice.

- The function is differentiable everywhere and provides a smooth gradient, which prevents jumps in the output values.


`tanh`

- The output of the tanh activation function is zero-centered, so output values can easily be read as strongly negative, neutral, or strongly positive.

- It is usually used in the hidden layers of a neural network: its values lie between -1 and 1, so the mean activation of a hidden layer is 0 or very close to it. This helps center the data and makes learning much easier for the next layer.

`softmax`

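- The softmax function is used as the activation function in the output layer of neural network models that predict a multinomial probability distribution, i.e. one probability per mutually exclusive class, with all probabilities summing to 1.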
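
As a small illustration of the softmax output described above, here is a plain-Python sketch. The function name and the example logits are assumptions made for this example. It subtracts the largest logit before exponentiating, a common trick to avoid overflow.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The outputs are positive, keep the ordering of the logits, and sum to 1, which is why they can be read as class probabilities.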


`ReLU`

- Since only a subset of the neurons is activated at any time (the others output exactly 0) and the function itself is a simple threshold at zero, ReLU is far more computationally efficient than the sigmoid and tanh functions.

- ReLU accelerates the convergence of gradient descent towards a minimum of the loss function because it is linear and non-saturating for positive inputs. Hence ReLU is commonly used in hidden layers.

`Leaky ReLU`

- Its advantages are similar to those of ReLU, and it suffers much less from the dying ReLU problem because its gradient is non-zero for negative inputs.

- Leaky ReLU is commonly used in hidden layers.

`SELU (Scaled Exponential Linear Unit)`

- SELU normalizes the activations internally (self-normalization), which is faster than external normalization such as batch normalization, so the network tends to converge faster. It is used in hidden layers.



### 5.	What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer?

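If the momentum is that close to 1, the algorithm will likely pick up a lot of speed, hopefully moving roughly toward the global minimum, but its momentum will carry it right past the minimum, which is called overshooting. It will then slow down, come back, and overshoot again. It may oscillate this way many times before settling, so SGD will take much longer to converge than with a smaller momentum value, and within a realistic training budget it may not converge at all.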
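
The overshooting behaviour can be reproduced on a toy problem. The sketch below runs momentum SGD on the one-dimensional function f(x) = 0.5 * x**2; the function, the learning rate of 0.1, the 500 steps, and the helper name `tail_amplitude` are all illustrative assumptions. It reports how far x still swings from the minimum during the last 100 steps.

```python
def tail_amplitude(momentum, lr=0.1, steps=500, tail=100):
    """Run momentum SGD on f(x) = 0.5 * x**2 and return max |x| over the last `tail` steps."""
    x, v = 1.0, 0.0
    history = []
    for _ in range(steps):
        grad = x                      # f'(x) = x
        v = momentum * v - lr * grad  # velocity update
        x = x + v                     # parameter update
        history.append(abs(x))
    return max(history[-tail:])

print(tail_amplitude(0.9))      # tiny: x has settled at the minimum
print(tail_amplitude(0.99999))  # still large: x keeps swinging past the minimum
```

With momentum 0.9 the oscillation dies out quickly; with 0.99999 almost none of the velocity is lost at each step, so the iterate is still far from the minimum after the same number of updates.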

### 6.	Name three ways you can produce a sparse model.

- Train the model normally, then zero out tiny weights.
- For more sparsity, apply ℓ1 regularization (Lasso) during training, which pushes the optimizer to zero out as many weights as it can.
- A third option is to combine ℓ1 regularization with dual averaging, using TensorFlow's FTRLOptimizer class.
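
The first two options can be sketched in plain Python. The helper names, the thresholds, and the example weights are assumptions for illustration. `prune_small` zeroes out tiny weights after training. `soft_threshold` shows the shrink-toward-zero effect of an ℓ1 penalty as a proximal step, which is a stand-in chosen for clarity and not how a Keras ℓ1 regularizer is implemented.

```python
import math

def prune_small(weights, threshold=1e-2):
    """Zero out weights whose magnitude is below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def soft_threshold(weights, lam):
    """Shrink each weight toward 0 by lam and clip at 0 (proximal step for an l1 penalty)."""
    return [math.copysign(max(abs(w) - lam, 0.0), w) for w in weights]

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

weights = [0.5, -0.001, 0.02, 0.0003, -0.3]
pruned = prune_small(weights)
shrunk = soft_threshold(weights, lam=0.05)
print(pruned, sparsity(pruned))
print(shrunk, sparsity(shrunk))
```

Pruning only removes weights that are already negligible, whereas the ℓ1-style shrinkage also drives moderately small weights to exactly zero, which is why it yields more sparsity.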

### 7.	Does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)? What about MC Dropout?

Yes, dropout generally slows down training. At each iteration a random subset of neurons is dropped: those neurons contribute nothing to the forward pass and their weights are not updated during backpropagation, as if they did not exist. Each step therefore trains only part of the network, and the model typically needs more epochs to converge.

Dropout has no impact on inference speed, since it is switched off at prediction time and all neurons participate.

Monte Carlo Dropout boils down to training a neural network with regular dropout and keeping it switched on at inference time. This way, we can generate multiple different predictions for each instance and average them. Training is unaffected, but inference is slower, because each instance needs one forward pass per sample: averaging 10 predictions costs roughly 10 times as much as a single prediction.
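
The difference between the training and inference behaviour of dropout can be sketched in a few lines of plain Python. This is an illustrative version of inverted dropout, with a hypothetical helper name and example values, not the Keras implementation.

```python
import random

def dropout(activations, rate, training, rng=random):
    """Inverted dropout: during training, drop each unit with probability `rate`
    and rescale the survivors by 1 / (1 - rate); at inference, do nothing."""
    if not training:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [1.0, 2.0, 3.0, 4.0]
print(dropout(acts, rate=0.5, training=False))  # unchanged at inference
print(dropout(acts, rate=0.5, training=True))   # random units zeroed, the rest doubled
```

Regular inference corresponds to `training=False`, so it adds no cost. MC Dropout corresponds to calling the layer with `training=True` several times at prediction time and averaging the results.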

### 8.	Practice training a deep neural network on the CIFAR10 image dataset:

- a.	Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but it’s the point of this exercise). Use He initialization and the ELU activation function.

- b.	Using Nadam optimization and early stopping, train the network on the CIFAR10 dataset. You can load it with keras.datasets.cifar10.load_data(). The dataset is composed of 60,000 32 × 32–pixel color images (50,000 for training, 10,000 for testing) with 10 classes, so you’ll need a softmax output layer with 10 neurons. Remember to search for the right learning rate each time you change the model’s architecture or hyperparameters.

- c.	Now try adding Batch Normalization and compare the learning curves: Is it converging faster than before? Does it produce a better model? How does it affect training speed?

- d.	Try replacing Batch Normalization with SELU, and make the necessary adjustments to ensure the network self-normalizes (i.e., standardize the input features, use LeCun normal initialization, make sure the DNN contains only a sequence of dense layers, etc.).

- e.	Try regularizing the model with alpha dropout. Then, without retraining your model, see if you can achieve better accuracy using MC Dropout.


In [None]:
import tensorflow as tf

from tensorflow.keras import datasets, layers, models, activations
import matplotlib.pyplot as plt
from tensorflow.keras.callbacks import EarlyStopping

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i])
    # The CIFAR labels happen to be arrays, 
    # which is why you need the extra index
    plt.xlabel(class_names[train_labels[i][0]])
plt.show()

In [None]:
initializer = tf.keras.initializers.HeNormal()

In [None]:
model = models.Sequential()
# Flatten the 32x32x3 image; the exercise asks for a DNN made of dense layers only
model.add(layers.Flatten(input_shape=(32, 32, 3)))
# 20 hidden layers of 100 neurons each, ELU activation, He initialization.
# The initializer is passed by name so that each layer gets its own instance:
# in some Keras versions a shared unseeded instance returns identical weights on every call.
for _ in range(20):
    model.add(layers.Dense(100, activation='elu', kernel_initializer='he_normal'))
model.add(layers.Dense(10, activation='softmax'))

# The output layer already applies softmax, so the loss receives probabilities, not logits
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=50,
                    validation_data=(test_images, test_labels))

In [None]:
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

print('\nTest accuracy:', test_acc)

In [None]:
model = models.Sequential()
# Same architecture as above: dense layers only, ELU activation, He initialization
model.add(layers.Flatten(input_shape=(32, 32, 3)))
for _ in range(20):
    model.add(layers.Dense(100, activation='elu', kernel_initializer='he_normal'))
model.add(layers.Dense(10, activation='softmax'))

# Nadam optimizer; from_logits=False because the output layer already applies softmax
model.compile(optimizer='nadam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

# Stop when the validation loss has not improved for 5 consecutive epochs
early_stopping = EarlyStopping(patience=5)

# Note: the test set doubles as validation data here, so the test accuracy reported below is optimistic
history = model.fit(train_images, train_labels, epochs=50,
                    validation_data=(test_images, test_labels),
                    callbacks=[early_stopping])

In [None]:
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

print('\nTest accuracy:', test_acc)

In [None]:
initializer = tf.keras.initializers.LecunNormal()

In [None]:
model = models.Sequential()
model.add(layers.Flatten(input_shape=(32, 32, 3)))
# Same ELU / He-initialized stack as before, with batch normalization after each hidden layer.
# (LeCun initialization is only needed for the SELU models further down.)
for _ in range(20):
    model.add(layers.Dense(100, activation='elu', kernel_initializer='he_normal'))
    model.add(layers.BatchNormalization())
model.add(layers.Dense(10, activation='softmax'))

# from_logits=False because the output layer already applies softmax
model.compile(optimizer='nadam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=50,
                    validation_data=(test_images, test_labels),
                    callbacks=[early_stopping])

In [None]:
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

print('\nTest accuracy:', test_acc)

In [None]:
model = models.Sequential()
# Self-normalization requires a plain sequence of dense layers, so no convolutional layer here
model.add(layers.Flatten(input_shape=(32, 32, 3)))
# It also requires standardized input features: learn each feature's mean and variance
# from the training set (layers.Normalization is available in recent TensorFlow releases)
normalization = layers.Normalization()
normalization.adapt(train_images.reshape(len(train_images), -1))
model.add(normalization)
# 20 hidden layers with SELU activation and LeCun normal initialization,
# passed by name so that each layer gets its own initializer instance
for _ in range(20):
    model.add(layers.Dense(100, activation='selu', kernel_initializer='lecun_normal'))
model.add(layers.Dense(10, activation='softmax'))


In [None]:
# from_logits=False because the output layer already applies softmax
model.compile(optimizer='nadam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=50,
                    validation_data=(test_images, test_labels),
                    callbacks=[early_stopping])

In [None]:
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

print('\nTest accuracy:', test_acc)

In [None]:
model = models.Sequential()
# Dense layers only, with standardized inputs, so that the SELU network can self-normalize
model.add(layers.Flatten(input_shape=(32, 32, 3)))
normalization = layers.Normalization()
normalization.adapt(train_images.reshape(len(train_images), -1))
model.add(normalization)
for _ in range(20):
    model.add(layers.Dense(100, activation='selu', kernel_initializer='lecun_normal'))
    # AlphaDropout expects a drop rate in [0, 1); the original value of 10 is invalid,
    # so a 10% rate (0.1) is assumed here
    model.add(layers.AlphaDropout(0.1))
model.add(layers.Dense(10, activation='softmax'))

# from_logits=False because the output layer already applies softmax
model.compile(optimizer='nadam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=50,
                    validation_data=(test_images, test_labels),
                    callbacks=[early_stopping])

In [None]:
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

print('\nTest accuracy:', test_acc)
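
The last part of exercise e (MC Dropout without retraining) is not covered by the cells above. One possible sketch is shown below. The helper name `mc_dropout_predict` and the default of 10 samples are assumptions. The function only relies on the model being callable with `training=True`, which is how Keras models keep dropout layers active at prediction time.

```python
import numpy as np

def mc_dropout_predict(model, X, n_samples=10):
    """Average the predicted class probabilities over several stochastic forward passes.

    `model` is any callable that accepts `training=True` (as Keras models do),
    so that dropout stays switched on while predicting.
    """
    y_probas = np.stack([np.asarray(model(X, training=True))
                         for _ in range(n_samples)])
    return y_probas.mean(axis=0)
```

With the alpha dropout model trained above, `y_proba = mc_dropout_predict(model, test_images)` followed by `np.mean(y_proba.argmax(axis=1) == test_labels[:, 0])` would give the MC Dropout accuracy to compare with the regular test accuracy. Keep in mind that `training=True` also switches batch normalization layers to training mode, so this sketch is only appropriate for models without them.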