What is the problem that Glorot initialization and He initialization aim to fix?
> Unstable gradients: vanishing gradients (typical) or exploding gradients (mostly in RNNs). With vanishing gradients, the gradients become smaller and smaller as backpropagation progresses down to the lower layers. This means the lower layers barely get updated, so training never converges to a good solution. Saturating activation functions aggravate the problem because their gradients approach zero for large negative or positive inputs, which essentially leaves neurons stuck once their inputs get sufficiently large in magnitude.

Correct answer:
> Glorot initialization and He initialization were designed to make the output standard deviation as close as possible to the input standard deviation, at least at the beginning of training. This reduces the vanishing/exploding gradients problem.
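
The effect described above can be checked with a small simulation. The following is a minimal sketch, not the book's code: it assumes a plain stack of bias-free dense ReLU layers, uses the root mean square (RMS) of the activations as a stand-in for the signal's scale, and compares He initialization (weight variance 2 / fan_in) with an arbitrary small standard deviation of 0.01. The function name `relu_forward_rms` and all sizes are illustrative choices.

```python
import math
import random

def relu_forward_rms(weight_std, width=128, depth=10, seed=0):
    """Return the RMS activation after each of `depth` dense ReLU layers."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]  # one random input, RMS close to 1
    rms_per_layer = []
    for _ in range(depth):
        # Each unit gets `width` independently sampled weights (no bias), then ReLU.
        x = [max(0.0, sum(rng.gauss(0.0, weight_std) * xi for xi in x))
             for _ in range(width)]
        rms_per_layer.append(math.sqrt(sum(v * v for v in x) / width))
    return rms_per_layer

width = 128
he_rms = relu_forward_rms(math.sqrt(2.0 / width), width=width)  # He: variance 2 / fan_in
tiny_rms = relu_forward_rms(0.01, width=width)                  # arbitrary small std
print("He init, last layer RMS:  ", he_rms[-1])
print("std=0.01, last layer RMS: ", tiny_rms[-1])
```

With He initialization the signal scale should stay within the same order of magnitude from layer to layer, while the small fixed standard deviation makes it shrink toward zero, which is the forward-pass counterpart of vanishing gradients.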

Is it OK to initialize all the weights to the same value as long as that value is selected randomly using He initialization?
> No, because then it's equivalent to initializing all the weights to zero (the old problem) if the bias term learns that value. You have to sample the weights randomly from either a normal distribution or a uniform distribution (with limit r = sqrt(3*sigma^2)), per the initialization formulas.

Correct answer:
> No, all weights should be sampled independently; they should not all have the same initial value. One important goal of sampling weights randomly is to break symmetry: if all the weights have the same initial value, even if that value is not zero, then symmetry is not broken (i.e., all neurons in a given layer are equivalent), and backpropagation will be unable to break it. Concretely, this means that all the neurons in any given layer will always have the same weights. It's like having just one neuron per layer, and much slower. It is virtually impossible for such a configuration to converge to a good solution.
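
The symmetry argument can be illustrated with a tiny hand-written network. This is a sketch under stated assumptions: a hypothetical 2-input MLP with three tanh hidden units, no biases, a squared-error loss, and plain batch gradient descent on an XOR-like toy dataset. Every weight starts at the same value `w0`.

```python
import math

def train_constant_init(w0=0.5, hidden=3, steps=50, lr=0.05):
    """Train a tiny tanh MLP whose weights all start at the same value w0."""
    data = [((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0), ((0.0, 0.0), 0.0)]
    W1 = [[w0, w0] for _ in range(hidden)]  # hidden-layer weights, one row per neuron
    W2 = [w0] * hidden                      # output-layer weights
    for _ in range(steps):
        gW1 = [[0.0, 0.0] for _ in range(hidden)]
        gW2 = [0.0] * hidden
        for x, y in data:
            h = [math.tanh(row[0] * x[0] + row[1] * x[1]) for row in W1]
            err = sum(w * hj for w, hj in zip(W2, h)) - y  # gradient of 0.5 * err**2 w.r.t. output
            for j in range(hidden):
                gW2[j] += err * h[j]
                dpre = err * W2[j] * (1.0 - h[j] ** 2)     # backprop through tanh
                gW1[j][0] += dpre * x[0]
                gW1[j][1] += dpre * x[1]
        for j in range(hidden):
            W2[j] -= lr * gW2[j]
            W1[j][0] -= lr * gW1[j][0]
            W1[j][1] -= lr * gW1[j][1]
    return W1, W2

W1, W2 = train_constant_init()
print("hidden neurons identical after training:", all(row == W1[0] for row in W1))
```

Because every hidden neuron sees the same inputs and receives the same gradient, the rows of `W1` move together and never diverge, so the three neurons behave like a single one.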

Is it OK to initialize the bias terms to 0?
> Yes. This is in fact what's happening by default.

In which cases would you want to use each of the activation functions we discussed in this chapter?
> Sigmoid - binary or multi-label classification in the output layer
> Tanh - output layer when the output must lie in a fixed range (-1 to 1 by default)
> Softmax - multi-class classification in the output layer
> ReLU - common activation function in the hidden layers of shallow networks. Neurons can die (i.e., output only zeros).
> Leaky ReLU - solves the dying-neuron problem of ReLU
> Swish - common activation function in the hidden layers of deep networks, but has a higher computational cost.
> SELU - self-normalizing, i.e., each layer's output tends to preserve a mean of 0 and a standard deviation of 1. Only works in plain MLPs; the inputs must be standardized, the layers must be initialized with LeCun normal, and no regularization technique is allowed.

> The softplus activation function is useful in the output layer when you need to ensure that the output will always be positive.
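
For reference, the activation functions listed above can be written as scalar functions in a few lines. This is an illustrative sketch rather than any library's implementation; the default slope `alpha=0.01` for leaky ReLU and `beta=1.0` for Swish are assumptions, and the SELU constants are the standard published values.

```python
import math

def sigmoid(x):
    # Numerically stable for both signs of x
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small negative slope keeps the gradient nonzero for x < 0
    return x if x > 0 else alpha * x

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))

def softplus(x):
    # log(1 + exp(x)), rewritten to avoid overflow; never negative
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

for name, fn in [("relu", relu), ("leaky_relu", leaky_relu), ("swish", swish),
                 ("selu", selu), ("softplus", softplus), ("tanh", math.tanh)]:
    print(name, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0)])
```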

What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer?
> The sum of past gradients would decay, i.e. the rate of change may explode. Momentum performs an "exponentially decaying sum" - not average.

Correct answer:
> If you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer, then the algorithm will likely pick up a lot of speed, hopefully moving roughly toward the global minimum, but its momentum will carry it right past the minimum. Then it will slow down and come back, accelerate again, overshoot again, and so on. It may oscillate this way many times before converging, so overall it will take much longer to converge than with a smaller momentum value.
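
The overshooting behavior can be reproduced on a one-dimensional toy problem. The sketch below assumes the loss f(x) = 0.5 * x**2, a learning rate of 0.1, and the update m ← βm − η∇f, x ← x + m. It uses β = 0.999 instead of 0.99999 only to keep the run short; the function name and the stopping tolerance are illustrative choices.

```python
def run_momentum_sgd(beta, lr=0.1, x0=1.0, tol=1e-3, max_steps=200_000):
    """Minimize f(x) = 0.5 * x**2 with momentum; return (steps, overshoots)."""
    x, m = x0, 0.0
    overshoots = 0
    for step in range(1, max_steps + 1):
        grad = x                  # derivative of 0.5 * x**2
        m = beta * m - lr * grad  # momentum vector accumulates past gradients
        new_x = x + m
        if new_x * x < 0:         # sign flip: the iterate passed the minimum at 0
            overshoots += 1
        x = new_x
        if abs(x) < tol and abs(m) < tol:
            return step, overshoots
    return max_steps, overshoots

for beta in (0.9, 0.999):
    steps, overshoots = run_momentum_sgd(beta)
    print(f"beta={beta}: {steps} steps, {overshoots} overshoots")
```

In this setup the larger momentum value should need far more steps and cross the minimum many more times before settling, which matches the oscillation described in the answer.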

Name three ways you can produce a sparse model.
> ReLU to kill neurons
> L1 regularization
> Dropout  ==> wrong!

Correct answer:
> One way to produce a sparse model (i.e., with most weights equal to zero) is to train the model normally, then zero out tiny weights. For more sparsity, you can apply ℓ1 regularization during training, which pushes the optimizer toward sparsity. A third option is to use the TensorFlow Model Optimization Toolkit.
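
The first option (train normally, then zero out tiny weights) amounts to a simple thresholding pass over the trained weights. A minimal sketch, assuming the weights are available as a list of rows and that a cutoff of 0.01 is acceptable; both the helper names and the threshold are hypothetical.

```python
def zero_out_tiny_weights(weights, threshold=1e-2):
    """Return a copy of `weights` with every entry of magnitude below `threshold` set to 0."""
    return [[0.0 if abs(w) < threshold else w for w in row] for row in weights]

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)

trained = [[0.5, -0.001], [0.003, -0.8]]  # made-up example weights
pruned = zero_out_tiny_weights(trained)
print(pruned, sparsity(pruned))
```

In practice the threshold trades sparsity against accuracy, so the pruned model should be re-evaluated on a validation set.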

Does dropout slow down training?
> It is more likely to speed up training because fewer neurons are being calculated
> ==> INCORRECT, actually slows down the convergence process

Does it slow down inference (i.e., making predictions on new instances)?
> No, as there is no effect on inference - the network architecture is unaffected

What about MC dropout?
> Slows down inference, as the network needs to run K stochastic forward passes per prediction

Correct answer:
> Yes, dropout does slow down training, in general roughly by a factor of two. However, it has no impact on inference speed since it is only turned on during training. MC Dropout is exactly like dropout during training, but it is still active during inference, so each inference is slowed down slightly. More importantly, when using MC Dropout you generally want to run inference 10 times or more to get better predictions. This means that making predictions is slowed down by a factor of 10 or more.
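
The inference cost of MC Dropout follows directly from how it is computed: dropout stays active and several stochastic forward passes are averaged. The sketch below is a toy illustration, assuming a hypothetical single linear unit with inverted dropout applied to its inputs; it is not how any particular framework implements it.

```python
import random

def dropout(values, rate, rng, active):
    """Inverted dropout: zero each value with probability `rate`, rescale the survivors."""
    if not active:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def predict(x, weights, rate, rng, dropout_active):
    """One forward pass of a single linear unit with dropout on its inputs."""
    dropped = dropout(x, rate, rng, dropout_active)
    return sum(w * v for w, v in zip(weights, dropped))

def mc_dropout_predict(x, weights, rate=0.5, n_samples=10, seed=0):
    """Average n_samples stochastic passes; the cost grows linearly with n_samples."""
    rng = random.Random(seed)
    samples = [predict(x, weights, rate, rng, dropout_active=True)
               for _ in range(n_samples)]
    return sum(samples) / n_samples, samples

x, weights = [1.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4]
plain = predict(x, weights, 0.5, random.Random(0), dropout_active=False)  # regular inference
mean, samples = mc_dropout_predict(x, weights)
print("regular inference:", plain)
print("MC dropout mean over", len(samples), "passes:", mean)
```

Regular inference is a single deterministic pass, while the MC Dropout prediction requires `n_samples` passes, which is where the factor of 10 or more comes from.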