**1. Is it OK to initialize all the weights to the same value as long as that value is selected randomly using He initialization?**

**Answer**: No, it's not OK. Initializing all weights to the same value, even if that value is drawn from a distribution like He initialization, will lead to the same gradients during backpropagation. This means that all neurons in a given layer will always get updated by the same amount, effectively making them identical throughout training. The network won't be able to exploit the power of a deep architecture. Diverse initial weights break symmetry and allow neurons to learn different features.

**2. Is it OK to initialize the bias terms to 0?**

**Answer**: Yes, it's generally acceptable to initialize bias terms to 0. This is because the asymmetry breaking is typically provided by the small random numbers in the weights. Thus, starting the biases at zero usually won't hinder learning.

**3. Name three advantages of the SELU activation function over ReLU.**

**Answer**:
- **Self-normalization**: When used with the correct initialization and network architecture, SELU activation functions tend to output values that preserve a mean of 0 and standard deviation of 1 during training, which can help mitigate the vanishing/exploding gradients problem.
  
- **No dying units**: Unlike ReLU, which can have neurons that stop outputting anything other than 0 (dying ReLUs), SELU doesn't have this problem as it's smooth and differentiable everywhere.

- **Mitigates vanishing gradients**: The negative slope for values less than 0 in SELU helps mitigate the vanishing gradients problem, which ReLUs can sometimes exacerbate for negative input values.

**4. In which cases would you want to use each of the following activation functions: SELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?**

**Answer**:
- **SELU**: Useful for deep neural networks as it helps keep the activations' mean and variance close to 0 and 1, respectively. Works best with a specific architecture (e.g., purely dense layers).

- **Leaky ReLU and its variants**: Useful when there's a concern about dying ReLUs. Variants like Parametric or Exponential Leaky ReLU provide more flexibility and can sometimes outperform the standard Leaky ReLU.

- **ReLU**: A good default for most situations in feedforward deep networks due to its simplicity and efficiency. However, it can be problematic if there are many dying ReLUs.

- **tanh**: Useful when outputs need to be scaled between -1 and 1. Common in older architectures and in certain RNN structures.

- **logistic (sigmoid)**: Often found in binary classification tasks as the activation function for the output layer. Also used in older architectures.

- **softmax**: Specifically used in the output layer for multi-class classification problems. Outputs a probability distribution over N classes.

**5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer?**

**Answer**: Setting the momentum hyperparameter too close to 1 can cause the optimizer to become very sensitive to the most recent gradients and might overshoot a lot. This can lead to oscillations or divergence and may prevent the optimizer from settling into a minimum.

**6. Name three ways you can produce a sparse model.**

**Answer**: 
- **L1 Regularization**: This imposes a penalty on the absolute values of the weights. This tends to produce sparse weight matrices where many weights are exactly zero.

- **Pruning**: After training a model, small-weight connections can be pruned (set to zero), and the model can be fine-tuned further with the pruned architecture.

- **Using specialized techniques or architectures**: Such as the TensorFlow Model Optimization Toolkit (TF-MOT), which provides tools to produce sparse models.

**7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)? What about MC Dropout?**

**Answer**: 
- **Dropout**: It does slow down training since, at each training step, it randomly drops a fraction of the inputs. However, it doesn't slow down inference. During inference, dropout layers are turned off, and all neurons are used.

- **MC Dropout**: MC Dropout does slow down inference. This is because, even during inference, dropout is kept active, and the network needs multiple forward passes to obtain an averaged prediction. The number of forward passes depends on how many samples you decide to use for the Monte Carlo approximation.

**8. Practice training a deep neural network on the CIFAR10 image dataset:**

**a. Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but it’s the point of this exercise). Use He initialization and the ELU activation function.**

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Build the DNN
model = keras.models.Sequential()

# Input layer
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))

# Add 20 hidden layers of 100 neurons each
for _ in range(20):
    model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"))


**b. Using Nadam optimization and early stopping, train the network on the CIFAR10 dataset. You can load it with keras.datasets.cifar10.load_​data(). The dataset is composed of 60,000 32 × 32–pixel color images (50,000 for training, 10,000 for testing) with 10 classes, so you’ll need a softmax output layer with 10 neurons. Remember to search for the right learning rate each time you change the model’s architecture or hyperparameters.**

In [2]:
# Load the CIFAR10 dataset
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.cifar10.load_data()

# Split the full training set into a validation set and a (smaller) training set
X_train = X_train_full[5000:]
y_train = y_train_full[5000:]
X_valid = X_train_full[:5000]
y_valid = y_train_full[:5000]

# Output layer
model.add(keras.layers.Dense(10, activation="softmax"))

# Compile the model with Nadam optimization
optimizer = keras.optimizers.Nadam(lr=5e-5)
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Early stopping
early_stopping_cb = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)

# Train the model
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_valid, y_valid), callbacks=[early_stopping_cb])



Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100


**c. Now try adding Batch Normalization and compare the learning curves: Is it converging faster than before? Does it produce a better model? How does it affect training speed?**

In [15]:
# Clear previous model
keras.backend.clear_session()

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))

for _ in range(20):
    model.add(keras.layers.Dense(100, kernel_initializer="he_normal"))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Activation("elu"))

model.add(keras.layers.Dense(10, activation="softmax"))
optimizer = tf.keras.optimizers.legacy.Nadam(lr=5e-5)
#optimizer.build(model.trainable_variables)
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])

#model.build(input_shape=(None, 32, 32, 3))
#optimizer = tf.keras.optimizers.legacy.Nadam(lr=5e-5)
#model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])

In [16]:
history_bn = model.fit(X_train, y_train, epochs=100, validation_data=(X_valid, y_valid), callbacks=[early_stopping_cb])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100


**d. Try replacing Batch Normalization with SELU, and make the necessary adjustements to ensure the network self-normalizes (i.e., standardize the input features, use LeCun normal initialization, make sure the DNN contains only a sequence of dense layers, etc.).**

In [17]:
# Clear previous model
keras.backend.clear_session()

# Standardize the input
pixel_means = X_train.mean(axis=0, keepdims=True)
pixel_stds = X_train.std(axis=0, keepdims=True)
X_train_scaled = (X_train - pixel_means) / pixel_stds
X_valid_scaled = (X_valid - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))

for _ in range(20):
    model.add(keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"))

model.add(keras.layers.Dense(10, activation="softmax"))

**e. Try regularizing the model with alpha dropout. Then, without retraining your model,see if you can achieve better accuracy using MC Dropout.**

In [18]:
# Add AlphaDropout for regularization
model.add(keras.layers.AlphaDropout(rate=0.1))

# Train the model
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
history_alpha = model.fit(X_train_scaled, y_train, epochs=100, validation_data=(X_valid_scaled, y_valid), callbacks=[early_stopping_cb])

# Use MC Dropout for predictions
y_probas = np.stack([model(X_test_scaled, training=True) for sample in range(100)])
y_proba = y_probas.mean(axis=0)
y_pred = np.argmax(y_proba, axis=1)

accuracy = np.sum(y_pred == y_test[:, 0]) / len(y_test)
print(accuracy)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
0.4597
