1. Is it OK to initialize all the weights to the same value as long as that value is selected
randomly using He initialization?


No. Every weight should be sampled independently. Drawing a single random value (even one taken from the He distribution) and assigning it to all the weights defeats the purpose of random initialization. The point of sampling the weights randomly is to break symmetry: if all the weights in a layer start with the same value, every neuron in that layer computes the same output and receives the same gradient, so the neurons stay identical after every update. Backpropagation cannot break this symmetry, and the layer behaves as if it had just one neuron.

He initialization addresses a different issue: it sets the scale of the random weights (a variance of 2 / fan_in for the normal variant) so that signals neither vanish nor explode as they flow through layers that use ReLU and its variants. It only helps if each weight is drawn separately from that distribution.
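
To make the symmetry argument concrete, here is a minimal NumPy sketch of a tiny one-hidden-layer network. It is not part of the original exercise: the layer sizes, the toy data, and the helper name `hidden_weight_gradient` are illustrative assumptions. It compares the gradient of the hidden-layer weights when every weight shares one He-sampled value with the gradient obtained when each weight is sampled independently.

```python
import numpy as np

rng = np.random.default_rng(42)
n_inputs, n_hidden = 4, 3           # illustrative sizes
X = rng.normal(size=(8, n_inputs))  # toy batch of 8 instances
y = rng.normal(size=(8, 1))         # toy regression targets

def hidden_weight_gradient(W1, W2):
    """Gradient of the MSE loss w.r.t. W1 (ReLU hidden layer, linear output)."""
    Z = X @ W1                       # pre-activations, shape (8, n_hidden)
    H = np.maximum(Z, 0.0)           # ReLU
    error = H @ W2 - y               # shape (8, 1)
    dH = error @ W2.T                # backprop through the output layer
    dZ = dH * (Z > 0)                # backprop through ReLU
    return X.T @ dZ / len(X)         # shape (n_inputs, n_hidden)

he_std = np.sqrt(2.0 / n_inputs)     # He normal standard deviation

# Case 1: one He-sampled value copied into every weight
shared_value = rng.normal(scale=he_std)
W1_same = np.full((n_inputs, n_hidden), shared_value)
W2_same = np.full((n_hidden, 1), shared_value)
grad_same = hidden_weight_gradient(W1_same, W2_same)

# Case 2: every weight sampled independently (proper He initialization)
W1_rand = rng.normal(scale=he_std, size=(n_inputs, n_hidden))
W2_rand = rng.normal(scale=he_std, size=(n_hidden, 1))
grad_rand = hidden_weight_gradient(W1_rand, W2_rand)

# Compare the gradient columns (one column per hidden neuron)
columns_identical_same = bool(np.allclose(grad_same, grad_same[:, :1]))
columns_identical_rand = bool(np.allclose(grad_rand, grad_rand[:, :1]))
print("shared value:", columns_identical_same)
print("independent samples:", columns_identical_rand)
```

With a shared value, every hidden neuron receives the same gradient column, so a gradient step leaves the neurons identical. Independent sampling gives each neuron its own gradient.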

2. Is it OK to initialize the bias terms to 0?


Yes, it is perfectly fine to initialize the bias terms to 0. This is a common practice, and Keras initializes the biases of its Dense layers to zeros by default. Zero biases do not cause a symmetry problem, because the randomly initialized weights already make each neuron compute a different function.

Some practitioners initialize the biases to a small positive value, such as 0.1, in networks that use ReLU, so that most neurons output a nonzero value at the start of training and therefore receive a gradient. Whether this helps depends on the architecture and the data, so initializing the biases to 0 remains a reasonable default.
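
As a toy illustration of the ReLU argument, the sketch below counts how many ReLU neurons are active with a bias of 0 versus 0.1. It assumes the weighted sums (before the bias is added) follow a standard normal distribution, which is a simplification chosen for the example; the helper name `fraction_active` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weighted sums (without the bias) for 10,000 neurons
weighted_sums = rng.normal(size=10_000)

def fraction_active(bias):
    """Fraction of ReLU neurons with a strictly positive output for a given bias."""
    outputs = np.maximum(weighted_sums + bias, 0.0)
    return float(np.mean(outputs > 0))

active_zero_bias = fraction_active(0.0)
active_small_bias = fraction_active(0.1)
print(f"bias=0.0: {active_zero_bias:.3f} active")
print(f"bias=0.1: {active_small_bias:.3f} active")
```

The difference is small under this assumption, which is consistent with zero biases being a sensible default.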

3. Name three advantages of the SELU activation function over ReLU.


Here are three advantages of the SELU activation function over ReLU:

Avoids the dying ReLU problem: With ReLU, a neuron whose weighted sum is negative for every training instance outputs 0 and has a zero gradient, so it stops learning (the "dying ReLU" problem). SELU has a nonzero derivative for negative inputs, so such neurons keep receiving gradient updates.

Self-normalizing: Under the right conditions (a plain stack of dense layers, standardized inputs, and LeCun normal initialization), the output of each SELU layer tends to keep a mean of 0 and a standard deviation of 1 during training. This mitigates the vanishing/exploding gradients problem without Batch Normalization.

Improved performance: Because SELU can output negative values, the mean activation of each layer stays closer to 0 than with ReLU, whose outputs are never negative. Together with the two properties above, this often lets very deep stacks of dense layers train better with SELU than with ReLU, although the benefit is not guaranteed for every architecture.
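
The self-normalizing behavior can be checked numerically. The following NumPy sketch pushes standardized random inputs through a stack of randomly initialized dense layers and reports the statistics of the last layer's outputs. The depth, width, batch size, and the helper name `final_layer_stats` are assumptions chosen for illustration, and the weights are untrained, so this only shows what happens at initialization.

```python
import numpy as np

# SELU constants (alpha and scale) from the self-normalizing networks paper
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    # np.minimum keeps exp() from overflowing on large positive inputs
    return SCALE * np.where(z > 0, z, ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0))

def relu(z):
    return np.maximum(z, 0.0)

def final_layer_stats(activation, n_layers=20, n_units=100, seed=42):
    """Push standardized inputs through a stack of dense layers and return
    the mean and standard deviation of the last layer's outputs."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(1000, n_units))  # standardized inputs
    for _ in range(n_layers):
        # LeCun normal initialization: variance 1 / fan_in
        W = rng.normal(scale=np.sqrt(1.0 / n_units), size=(n_units, n_units))
        A = activation(A @ W)
    return float(A.mean()), float(A.std())

selu_mean, selu_std = final_layer_stats(selu)
relu_mean, relu_std = final_layer_stats(relu)
print(f"SELU after 20 layers: mean={selu_mean:.3f}, std={selu_std:.3f}")
print(f"ReLU after 20 layers: mean={relu_mean:.5f}, std={relu_std:.5f}")
```

Under these assumptions the SELU outputs stay close to mean 0 and standard deviation 1, while the ReLU outputs shrink toward 0 with the same initialization.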

4. In which cases would you want to use each of the following activation functions: SELU, leaky
ReLU (and its variants), ReLU, tanh, logistic, and softmax?


- SELU: SELU is a good choice for deep networks made of a plain stack of dense layers, where it lets the network self-normalize: with standardized inputs and LeCun normal initialization, the output of each layer tends to keep a mean of 0 and a standard deviation of 1, which improves convergence. The self-normalizing property is only guaranteed for this kind of architecture. It is not guaranteed with skip connections, recurrent layers, or regularization such as L1/L2 or regular dropout.

- Leaky ReLU (and its variants): Use leaky ReLU in place of ReLU when you want to avoid the "dying ReLU" problem, in which neurons whose weighted sum is negative for every training instance output only 0 and stop learning. Leaky ReLU keeps a small slope (typically 0.01) for negative inputs, so the gradient never vanishes entirely. Variants include randomized leaky ReLU (RReLU), where the slope is picked randomly during training, and parametric ReLU (PReLU), where the slope is learned during training. The Exponential Linear Unit (ELU) is a related smooth alternative that shares some of SELU's advantages.

- ReLU: ReLU is a reasonable default, as it is simple, cheap to compute, and well supported by libraries and hardware accelerators. Because it does not saturate for positive inputs, it helps alleviate the vanishing gradients problem in deep networks.

- tanh: tanh is useful in the output layer when you need outputs between -1 and 1. It is rarely used in hidden layers nowadays, except in recurrent neural networks (RNNs), where its bounded output helps keep the hidden state from growing without limit.

- logistic: logistic (also known as sigmoid) is commonly used in the output layer for binary classification, as it outputs values between 0 and 1, which can be interpreted as probabilities. It is generally not recommended for hidden layers, as it saturates for large positive or negative inputs and can lead to the vanishing gradients problem.

- softmax: softmax is typically used in the output layer for multi-class classification with mutually exclusive classes, as it outputs a probability distribution over the classes.
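
The output ranges mentioned above are easy to verify numerically. The NumPy definitions below are a minimal sketch written for this comparison (they are not the Keras implementations), and the input vector `z` is an arbitrary example.

```python
import numpy as np

def leaky_relu(z, negative_slope=0.01):
    return np.where(z > 0, z, negative_slope * z)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])  # arbitrary example inputs

relu_out = np.maximum(z, 0.0)   # never negative
leaky_out = leaky_relu(z)       # small negative outputs for negative inputs
tanh_out = np.tanh(z)           # strictly between -1 and 1
logistic_out = logistic(z)      # strictly between 0 and 1
softmax_out = softmax(z)        # non-negative, sums to 1

for name, out in [("relu", relu_out), ("leaky_relu", leaky_out), ("tanh", tanh_out),
                  ("logistic", logistic_out), ("softmax", softmax_out)]:
    print(f"{name:>10}: {np.round(out, 3)}")
```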

5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999)
when using an SGD optimizer?


If you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer, the optimizer will likely pick up a lot of speed and overshoot the minimum of the cost function. The momentum term is an exponentially decaying sum of past gradients, and with a momentum of 0.99999 old gradients are almost never forgotten, so the accumulated velocity carries the parameters right past the minimum. The optimizer then slows down, comes back, and overshoots again. It may oscillate like this many times before converging, so training takes much longer than with a smaller momentum value (0.9 is a common choice).
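
The oscillation can be reproduced on a one-dimensional toy problem. This sketch minimizes f(x) = x**2 / 2 with a plain momentum update. The function, learning rate, step count, and helper names are assumptions picked for illustration, not settings from the exercise.

```python
def momentum_descent(momentum, learning_rate=0.1, n_steps=500, x0=1.0):
    """Minimize f(x) = x**2 / 2 (whose gradient is x) with momentum.
    Returns the list of all iterates, starting with x0."""
    x, velocity = x0, 0.0
    history = [x]
    for _ in range(n_steps):
        gradient = x
        velocity = momentum * velocity - learning_rate * gradient
        x += velocity
        history.append(x)
    return history

def count_sign_changes(values):
    """Number of times consecutive values have opposite signs."""
    return sum(1 for a, b in zip(values, values[1:]) if a * b < 0)

moderate = momentum_descent(momentum=0.9)
extreme = momentum_descent(momentum=0.99999)

moderate_final = abs(moderate[-1])
extreme_tail_max = max(abs(x) for x in extreme[-100:])
extreme_crossings = count_sign_changes(extreme)

print("momentum=0.9:     final |x| =", moderate_final)
print("momentum=0.99999: largest |x| in the last 100 steps =", extreme_tail_max)
print("momentum=0.99999: times x crossed the minimum =", extreme_crossings)
```

With momentum 0.9 the iterate settles at the minimum, whereas with 0.99999 it keeps swinging back and forth across it with almost no damping over the same number of steps.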






6. Name three ways you can produce a sparse model.


Here are three ways to produce a sparse model:

1. L1 regularization: Adding an L1 penalty term to the cost function pushes the optimizer to zero out as many weights as it can during training, resulting in a sparse model.
2. Pruning after training: Train the model normally, then set the weights whose magnitude is below a small threshold to exactly zero, which removes the corresponding connections.
3. Pruning during training: Prune connections iteratively during training based on their magnitude, for example with the TensorFlow Model Optimization Toolkit.

Dropout does not produce a sparse model: it only zeroes out activations temporarily during training, and all the weights are still used at inference time.
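
Zeroing out small weights after training can be sketched in a few lines of NumPy. The random matrix below is only a stand-in for a trained weight matrix, and the threshold of 0.05 and the helper names are arbitrary assumptions for this example.

```python
import numpy as np

def prune_small_weights(weights, threshold):
    """Return a copy of `weights` with entries of magnitude below `threshold` set to 0."""
    pruned = weights.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return float(np.mean(weights == 0.0))

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(100, 100))  # stand-in for trained weights
pruned = prune_small_weights(weights, threshold=0.05)

print(f"sparsity before pruning: {sparsity(weights):.2f}")
print(f"sparsity after pruning:  {sparsity(pruned):.2f}")
```

In practice you would evaluate the pruned model on a validation set, since a threshold that is too aggressive degrades accuracy.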

7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on
new instances)? What about MC Dropout?


Yes, dropout does slow down training, often cited as roughly a factor of two. The per-step overhead is small (sampling a random mask and rescaling the surviving activations to preserve their expected value). The main cost is that each training step updates only a random subnetwork, so the model generally needs more epochs to converge.

However, dropout does not slow down inference since it is only applied during training. Inference is simply a forward pass through the network without any dropout applied.

MC Dropout, on the other hand, does slow down inference. Training is unchanged, but dropout stays active when making predictions: you run several forward passes (say, 10 or more), each with a different random dropout mask, and average the predictions. Inference is therefore slower by a factor roughly equal to the number of passes, which can be expensive for large networks and large datasets.
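
The cost of MC Dropout is visible in a small NumPy sketch. The two-layer network, its random "trained" weights, the dropout rate, and the function names below are all hypothetical, chosen to keep the example self-contained. The point is the contrast between one deterministic pass and many stochastic ones.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

def forward(X, W1, W2, rate, rng=None):
    """One forward pass. Dropout is applied to the hidden layer only if `rng` is given."""
    hidden = np.maximum(X @ W1, 0.0)
    if rng is not None:
        keep_mask = rng.random(hidden.shape) >= rate  # drop each unit with probability `rate`
        hidden = hidden * keep_mask / (1.0 - rate)    # rescale to preserve the expected value
    return softmax(hidden @ W2)

def mc_dropout_predict(X, W1, W2, rate=0.5, n_samples=100, seed=0):
    """Average class probabilities over `n_samples` stochastic forward passes."""
    rng = np.random.default_rng(seed)
    samples = np.stack([forward(X, W1, W2, rate, rng) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.std(axis=0)

rng = np.random.default_rng(42)
W1 = rng.normal(size=(4, 16))    # hypothetical trained weights
W2 = rng.normal(size=(16, 3))
X_new = rng.normal(size=(5, 4))  # 5 new instances with 4 features each

plain_probas = forward(X_new, W1, W2, rate=0.5)      # regular inference: 1 pass, no dropout
mc_mean, mc_std = mc_dropout_predict(X_new, W1, W2)  # MC Dropout: 100 passes

print("regular inference:\n", plain_probas.round(2))
print("MC Dropout mean:\n", mc_mean.round(2))
print("MC Dropout std (uncertainty estimate):\n", mc_std.round(2))
```

The standard deviation across passes gives a rough per-class uncertainty estimate, which is what the extra inference cost buys.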

8. Practice training a deep neural network on the CIFAR10 image dataset:

a. Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but it’s the
point of this exercise). Use He initialization and the ELU activation function.

b. Using Nadam optimization and early stopping, train the network on the CIFAR10
dataset. You can load it with keras.datasets.cifar10.load_data(). The dataset is
composed of 60,000 32 × 32–pixel color images (50,000 for training, 10,000 for
testing) with 10 classes, so you’ll need a softmax output layer with 10 neurons.
Remember to search for the right learning rate each time you change the model’s
architecture or hyperparameters.

c. Now try adding Batch Normalization and compare the learning curves: Is it
converging faster than before? Does it produce a better model? How does it affect
training speed?

d. Try replacing Batch Normalization with SELU, and make the necessary adjustments
to ensure the network self-normalizes (i.e., standardize the input features, use
LeCun normal initialization, make sure the DNN contains only a sequence of dense
layers, etc.).

e. Try regularizing the model with alpha dropout. Then, without retraining your

In [2]:
pip install tensorflow



In [None]:
from tensorflow import keras

model = keras.models.Sequential()

# Add input layer
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))

# Add hidden layers
for _ in range(20):
    model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"))

# Add output layer
model.add(keras.layers.Dense(10, activation="softmax"))


In [None]:
# Load CIFAR10 dataset
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.cifar10.load_data()

# Scale pixel values to [0, 1]
X_train_full = X_train_full.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

# Compile the model
# (`learning_rate` replaces the deprecated `lr` argument)
model.compile(loss="sparse_categorical_crossentropy", optimizer=keras.optimizers.Nadam(learning_rate=5e-5), metrics=["accuracy"])

# Define early stopping callback
early_stopping_cb = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)

# Train the model
history = model.fit(X_train_full, y_train_full, epochs=100, validation_split=0.2, callbacks=[early_stopping_cb])


Epoch 1/100
[...]
Epoch 20/100
 197/1250 [===>..........................] - ETA: 18s - loss: 1.6480 - accuracy: 0.3969

In [None]:
model = keras.models.Sequential()

# Add input layer
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))

# Add hidden layers with Batch Normalization
for _ in range(20):
    model.add(keras.layers.Dense(100, kernel_initializer="he_normal"))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Activation("elu"))

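# Add output layer
model.add(keras.layers.Dense(10, activation="softmax"))

# Compile and train with the same settings as the first model so the learning
# curves can be compared (ideally, search for a new learning rate first)
model.compile(loss="sparse_categorical_crossentropy", optimizer=keras.optimizers.Nadam(learning_rate=5e-5), metrics=["accuracy"])
history_bn = model.fit(X_train_full, y_train_full, epochs=100, validation_split=0.2, callbacks=[early_stopping_cb])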


In [None]:
from tensorflow.keras.layers import Dense, Flatten, Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Nadam
from tensorflow.keras.initializers import lecun_normal

# Standardize the input features (mean 0, standard deviation 1 per pixel),
# which SELU needs in order to self-normalize
pixel_means = X_train_full.mean(axis=0, keepdims=True)
pixel_stds = X_train_full.std(axis=0, keepdims=True)
X_train_scaled = (X_train_full - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds

# Define the model
model = Sequential()

# Add input layer and flatten each 32 x 32 x 3 image into a vector
model.add(Input(shape=(32, 32, 3)))
model.add(Flatten())

# Add hidden layers with SELU and LeCun normal initialization
for _ in range(20):
    model.add(Dense(100, activation="selu", kernel_initializer=lecun_normal()))

# Add output layer
model.add(Dense(10, activation="softmax"))

# Compile the model (`learning_rate` replaces the deprecated `lr` argument)
model.compile(loss="sparse_categorical_crossentropy", optimizer=Nadam(learning_rate=5e-5), metrics=["accuracy"])

# Train the model on the standardized features
history = model.fit(X_train_scaled, y_train_full, epochs=100, validation_split=0.2, callbacks=[early_stopping_cb])


In [None]:
model = keras.models.Sequential()

# Add input layer
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))

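# Add hidden layers with SELU, LeCun normal initialization, and alpha dropout
# (alpha dropout is designed to preserve SELU's self-normalizing property)
for _ in range(20):
    model.add(keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"))
    model.add(keras.layers.AlphaDropout(rate=0.1))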

# Add output layer
model.add(keras.layers.Dense(10
