### Exponential Linear Unit as an activation function outperformed ReLU

It takes on negative values when z < 0, which pushes the average output closer to 0 and helps alleviate the vanishing gradients problem

nonzero gradient for z < 0 so no dead neurons

Helps gradient descent because the function is smooth everywhere when α = 1, including around z = 0 (it doesn't bounce as much)
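As a rough illustration of the points above, here is a minimal NumPy sketch of ELU and its derivative. The names `elu` and `elu_grad` and the default `alpha=1.0` are assumptions made for this example.

```python
import numpy as np

def elu(z, alpha=1.0):
    """ELU: z for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    z = np.asarray(z, dtype=float)
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(z >= 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def elu_grad(z, alpha=1.0):
    """Derivative of ELU: 1 for z >= 0, alpha * exp(z) for z < 0."""
    z = np.asarray(z, dtype=float)
    return np.where(z >= 0, 1.0, alpha * np.exp(np.minimum(z, 0.0)))

z = np.array([-3.0, -1.0, 0.0, 2.0])
print(elu(z))       # negative inputs give negative (not zero) outputs
print(elu_grad(z))  # gradient stays above 0 for z < 0, so neurons cannot die
```

With `alpha=1.0` the derivative approaches 1 from both sides of z = 0, which is the smoothness the notes refer to.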

### SELU is a scaled ELU activation function

Cannot use regularization techniques like l1 or l2, max-norm, batch-norm, or regular dropout (they would break self-normalization)

self-normalization is only guaranteed with plain MLPs

input features must be standardized: mean 0 and SD of 1
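To see why standardized inputs matter, the sketch below applies SELU to standard-normal values and checks that the output keeps roughly mean 0 and variance 1. The two constants are the values published in the SELU paper; the names `SELU_ALPHA`, `SELU_SCALE` and `selu` are just labels chosen for this example.

```python
import numpy as np

# Constants quoted from the SELU paper (names are illustrative)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """SELU: a scaled ELU with fixed alpha and scale."""
    z = np.asarray(z, dtype=float)
    negative_part = SELU_ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0)
    return SELU_SCALE * np.where(z >= 0, z, negative_part)

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)  # standardized pre-activations
a = selu(z)
print(a.mean(), a.var())  # both should stay close to 0 and 1
```

If the input is not standardized, the output statistics drift away from (0, 1), which is why the standardization requirement matters.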

### GELU (Gaussian Error Linear Unit) looks like ReLU but is smooth everywhere, which makes it easier for gradient descent to fit complex problems

SiLU, z·σ(z), was rediscovered under the name Swish and outperformed GELU in the Swish authors' tests (generalized Swish adds β to scale the sigmoid function's input)

Mish is a smooth, nonconvex, and nonmonotonic variant of ReLU and outperformed Swish
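The three functions in this section can be written in a few lines. This is a minimal sketch using their commonly cited definitions; the function names and the sample inputs are assumptions for illustration.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gelu(z):
    """GELU: z * Phi(z), where Phi is the standard normal CDF."""
    z = np.asarray(z, dtype=float)
    erf = np.vectorize(math.erf)  # NumPy has no built-in erf
    return z * 0.5 * (1.0 + erf(z / math.sqrt(2.0)))

def swish(z, beta=1.0):
    """Swish: z * sigmoid(beta * z); beta = 1 gives SiLU."""
    z = np.asarray(z, dtype=float)
    return z * sigmoid(beta * z)

def mish(z):
    """Mish: z * tanh(softplus(z))."""
    z = np.asarray(z, dtype=float)
    return z * np.tanh(np.log1p(np.exp(z)))

z = np.array([-5.0, -1.0, 0.0, 1.0])
print(gelu(z))
print(swish(z))
print(mish(z))  # the value at -1 is lower than at -5: nonmonotonic
```

All three dip slightly below zero for small negative inputs and then rise back toward zero, unlike ReLU, which is flat there.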

### ReLU is a good default (hardware accelerators provide ReLU-specific optimizations)

Swish is a better default for more complex tasks; Mish may give slightly better results

If runtime latency matters, prefer LeakyReLU, or Parametric Leaky ReLU (PReLU) for more complex tasks

### Batch Normalization (BN) reduces the danger of vanishing/exploding gradients

Adds an operation in the model just before or after the activation function of each hidden layer. It zero-centers and normalizes each input, then scales and shifts the result (using two new parameter vectors per layer: one for scaling, one for shifting)

No need for `StandardScaler` or `Normalization` if BN is first layer

#### **1. Compute the Mini-Batch Mean**
$$
\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i
$$

#### **2. Compute the Mini-Batch Variance**
$$
\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2
$$

#### **3. Normalize the Inputs**
$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
$$
where $ \epsilon $ is a small constant to prevent division by zero. (smoothing term)

#### **4. Scale and Shift**
$$
y_i = \gamma \hat{x}_i + \beta
$$
where:
- $ \gamma $ (scale) and $ \beta $ (shift) are **learnable parameters**.
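The four steps above can be sketched in NumPy as follows. The function name `batch_norm_train`, the toy data, and the `eps=1e-3` default are assumptions for this example, not part of any library API.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-3):
    """Batch norm forward pass over a mini-batch X of shape (m, features)."""
    mu = X.mean(axis=0)                    # 1. mini-batch mean
    var = X.var(axis=0)                    # 2. mini-batch variance (1/m)
    X_hat = (X - mu) / np.sqrt(var + eps)  # 3. normalize
    return gamma * X_hat + beta            # 4. scale and shift

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(256, 4))  # unscaled toy inputs
gamma = np.full(4, 2.0)
beta = np.full(4, -1.0)

Y = batch_norm_train(X, gamma, beta)
print(Y.mean(axis=0))  # close to beta
print(Y.std(axis=0))   # close to gamma
```

After the transform, each feature's mean is set by β and its standard deviation by γ, regardless of the input scale.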

At test time we may need predictions for individual instances rather than batches, so there is no reliable batch mean/SD to normalize with.

Most implementations of batch normalization therefore estimate the final statistics during training, using a moving average of the layer's input means and standard deviations. Keras does this automatically.
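A rough sketch of the idea: keep exponential moving averages of the batch statistics during training, then use them to normalize a single instance at inference time. The variable names, the `momentum` value, and the synthetic batches are assumptions for illustration, and this sketch tracks a moving variance rather than a standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
momentum = 0.99
moving_mean = np.zeros(3)
moving_var = np.ones(3)

# "Training": update the moving statistics once per mini-batch
for _ in range(2000):
    batch = rng.normal(loc=[1.0, -2.0, 0.5], scale=[1.0, 2.0, 0.5], size=(32, 3))
    moving_mean = momentum * moving_mean + (1 - momentum) * batch.mean(axis=0)
    moving_var = momentum * moving_var + (1 - momentum) * batch.var(axis=0)

def batch_norm_inference(x, gamma, beta, eps=1e-3):
    """Normalize with the stored moving statistics, not batch statistics."""
    return gamma * (x - moving_mean) / np.sqrt(moving_var + eps) + beta

x_single = np.array([1.0, -2.0, 0.5])  # one instance, no batch needed
print(moving_mean)
print(batch_norm_inference(x_single, gamma=np.ones(3), beta=np.zeros(3)))
```

Because the instance sits at the true feature means, its normalized values come out near zero.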

It's possible to fuse the BN layer with the previous layer after training, avoiding the runtime penalty.
This is done by updating the previous layer's weights and biases so that it directly produces outputs of the appropriate scale and offset.

If the previous layer computes XW + b, then the BN layer computes γ⊗(XW + b – μ) / σ + β (ignoring the smoothing term ε). Defining W′ = γ⊗W / σ and b′ = γ⊗(b – μ) / σ + β, the equation simplifies to XW′ + b′. So if we replace the previous layer's weights and biases (W and b) with the updated weights and biases (W′ and b′), we can get rid of the BN layer (⊗ is element-wise multiplication).
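The fusing identity is easy to check numerically. The sketch below uses random toy values for every tensor (all of them assumptions for the example) and compares the two-step computation with the fused layer.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))            # 5 instances, 4 inputs
W = rng.normal(size=(4, 3))            # dense layer with 3 units
b = rng.normal(size=3)
gamma = rng.normal(size=3)             # BN scale
beta = rng.normal(size=3)              # BN shift
mu = rng.normal(size=3)                # BN moving mean
sigma = rng.uniform(0.5, 2.0, size=3)  # BN moving SD (epsilon ignored)

# Dense layer followed by a separate BN layer
bn_out = gamma * (X @ W + b - mu) / sigma + beta

# Fused layer: one scale factor per output unit, applied to W's columns
W_fused = gamma * W / sigma
b_fused = gamma * (b - mu) / sigma + beta
fused_out = X @ W_fused + b_fused

print(np.allclose(bn_out, fused_out))  # True
```

Broadcasting scales each column of `W` by the factor for the matching output unit, which is what the element-wise ⊗ in the formula means here.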

For small networks BN might not have much impact, but for deeper networks it can make a huge difference

In [2]:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
])

In [3]:
model.summary()

first BN layer: 3,136 parameters = 4 × 784 (γ, β, μ, and σ)

μ and σ are the moving averages; they are not trainable (backpropagation does not update them)

In [4]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('gamma', True),
 ('beta', True),
 ('moving_mean', False),
 ('moving_variance', False)]

In [6]:
# There is some debate about whether to put BN before or after the activation function
# Since a BN layer includes one offset parameter per input, you can remove the bias term from the previous layer
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

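Hyperparameters for `BatchNormalization`:

momentum: used to update the moving averages. Given a new value $v$ (a mean or SD computed over the current batch), the running average $\hat{v}$ is updated as

$$ \hat{v} \leftarrow \hat{v} \times \text{momentum} + v \times (1 - \text{momentum}) $$

axis: determines which axis is normalized.

Defaults to -1, the last axis (using means and SDs computed across the other axes).
If you want to treat each pixel independently, set axis=[1, 2]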

### Gradient Clipping mitigates the exploding gradients (setting a threshold)

In [7]:
# optimizer = tf.keras.optimizers.SGD(clipvalue=1.0)
# model.compile([...], optimizer=optimizer)

`clipvalue=1.0` clips every component of the gradient vector to a value between -1.0 and 1.0; the threshold is a hyperparameter you can tune.

Clipping by value can change the orientation of the gradient, for example [0.9, 100.0] becomes [0.9, 1.0].
To keep the orientation, clip by norm instead: `clipnorm=1.0` rescales the whole gradient when its ℓ2 norm exceeds the threshold, so [0.9, 100.0] -> [0.00899, 0.9999]
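The difference between the two strategies can be reproduced with a few lines of NumPy. This is only a sketch of the idea, not Keras's internal implementation; the helper names `clip_by_value` and `clip_by_norm` are made up for the example.

```python
import numpy as np

def clip_by_value(grad, threshold=1.0):
    """Clip each component independently to [-threshold, threshold]."""
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, threshold=1.0):
    """Rescale the whole vector if its L2 norm exceeds the threshold."""
    norm = np.linalg.norm(grad)
    return grad * threshold / norm if norm > threshold else grad

grad = np.array([0.9, 100.0])
print(clip_by_value(grad))  # second component capped: direction changes
print(clip_by_norm(grad))   # both components shrink by the same factor
```

Clipping by norm preserves the ratio between components, so the update still points in the original direction, only with a shorter step.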

### Reusing Layers

You can reuse the lower layers of a deep NN instead of dropping them: freeze them so that backpropagation only updates the top layers

### transfer learning

Say you have images of T-shirts and sandals and want to reuse a model trained on the fashion dataset

In [9]:
# my_model_A was trained on fashion dataset
# model_A = tf.keras.models.load_model("my_model_A")
# model_B_on_A = tf.keras.Sequential(model_A.layers[:-1])
# model_B_on_A.add(tf.keras.layers.Dense(1, activation="sigmoid"))

In [11]:
# model_A_clone = tf.keras.models.clone_model(model_A)
# model_A_clone.set_weights(model_A.get_weights()) # otherwise weights are initialized randomly

In [12]:
# to avoid large error gradients that may wreck the reused weights,
# freeze the reused layers during the first few epochs
"""
    for layer in model_B_on_A.layers[:-1]:
        layer.trainable = False

    optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
    model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
"""

Now unfreeze the reused layers and continue training; it is a good idea to reduce the learning rate

In [13]:
"""
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))
"""

In [14]:
# model_B_on_A.evaluate(X_test_B, y_test_B)

Transfer learning works best with deep convolutional neural networks; it does not work very well with small dense networks

### Unsupervised pretraining

Use this when you don't have much labeled training data and cannot find a model trained on a similar task. Pretrain on the unlabeled data with an autoencoder or a GAN, then train for the final task on just the labeled data using supervised learning

### optimizers

(Regular gradient descent will take small steps when slope is gentle and big steps when slope is steep but will never pick up speed)

- momentum
- Nesterov accelerated gradient
- AdaGrad
- RMSProp
- Adam

Momentum is like a ball rolling in a bowl: it cares about previous gradients. At each iteration it subtracts the local gradient (times the learning rate) from the momentum vector, then updates the weights by adding the momentum vector. The gradient is used as an acceleration, not as a speed

To simulate some friction and prevent the momentum from growing too large, there is a new hyperparameter β, between 0 (high friction) and 1 (low friction). Some friction is good: it gets rid of oscillations and speeds up convergence

### Gradient Descent:
$$ w := w - \alpha \nabla J(w) $$

### Momentum Gradient Descent:
$$ v := \beta v - \alpha \nabla J(w) $$
$$ w := w + v $$
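A plain-Python sketch of the two update rules on a toy one-dimensional cost. The cost J(w) = w², the starting point, and the hyperparameter values are assumptions chosen only to make the comparison visible.

```python
def grad_J(w):
    """Gradient of the assumed toy cost J(w) = w**2."""
    return 2.0 * w

def gradient_descent(w, alpha=0.01, steps=100):
    for _ in range(steps):
        w = w - alpha * grad_J(w)  # w := w - alpha * grad
    return w

def momentum_descent(w, alpha=0.01, beta=0.9, steps=100):
    v = 0.0
    for _ in range(steps):
        v = beta * v - alpha * grad_J(w)  # gradient acts as acceleration
        w = w + v                         # velocity moves the weight
    return w

print(gradient_descent(5.0))   # still fairly far from the minimum at 0
print(momentum_descent(5.0))   # much closer after the same number of steps
```

With the same learning rate and step count, the momentum version accumulates speed along the consistent downhill direction and ends far closer to the minimum.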

In [17]:
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

### Nesterov Accelerated Gradient (NAG)

A variant of momentum optimization that measures the gradient of the cost function not at the local position but slightly ahead, in the direction of the momentum. It applies the gradient measured after the momentum step (generally faster than regular momentum)

$$ v := \beta v - \alpha \nabla J(w + \beta v) $$
$$ w := w + v $$
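The only change from regular momentum is where the gradient is evaluated. A minimal sketch on an assumed toy cost J(w) = w², with illustrative hyperparameter values:

```python
def grad_J(w):
    """Gradient of the assumed toy cost J(w) = w**2."""
    return 2.0 * w

def nesterov_descent(w, alpha=0.01, beta=0.9, steps=100):
    v = 0.0
    for _ in range(steps):
        lookahead = w + beta * v                  # where momentum is about to take us
        v = beta * v - alpha * grad_J(lookahead)  # gradient measured slightly ahead
        w = w + v
    return w

print(nesterov_descent(5.0, steps=2))
print(nesterov_descent(5.0))
```

Because the gradient is measured at the look-ahead point, the update starts correcting for overshoot one step earlier than plain momentum would.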

In [19]:
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

### AdaGrad Adaptive Gradient Algorithm

Corrects the direction toward the global optimum by scaling down the gradient vector along the steepest dimensions

$$ G_t := G_{t-1} + \nabla J(w_t)^2 $$
$$ w_{t+1} := w_t - \frac{\alpha}{\sqrt{G_t} + \epsilon} \nabla J(w_t) $$
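A small sketch of these two update lines on an elongated bowl, where one dimension is 25 times steeper than the other. The cost function, starting point, and `alpha` are assumptions for illustration.

```python
import numpy as np

def grad_J(w):
    """Gradient of the assumed cost J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return np.array([1.0, 25.0]) * w

def adagrad(w, alpha=0.5, eps=1e-8, steps=50):
    G = np.zeros_like(w)
    for _ in range(steps):
        g = grad_J(w)
        G = G + g ** 2                          # accumulate squared gradients
        w = w - alpha / (np.sqrt(G) + eps) * g  # per-dimension scaled step
    return w, G

w0 = np.array([2.0, 2.0])
w1, G1 = adagrad(w0, steps=1)
print(grad_J(w0))  # raw gradient is 25x larger along the steep axis
print(w1)          # yet both coordinates moved by the same amount

w50, G50 = adagrad(w0, steps=50)
print(alpha_eff := 0.5 / np.sqrt(G50))  # effective learning rates after 50 steps
```

Since `G` only ever grows, the effective learning rate in every dimension can only shrink over time.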

AdaGrad performs well for simple quadratic problems, but it often stops too early when training NNs: the learning rate gets scaled down so much that the algorithm ends up stopping before reaching the global optimum. It runs the risk of never converging

### RMSProp

Accumulates only the gradients from the most recent iterations, as opposed to all the gradients since the beginning of training, by using exponential decay in the first step:

$$ E[g^2]_t := \beta E[g^2]_{t-1} + (1 - \beta) \nabla J(w_t)^2 $$
$$ w_{t+1} := w_t - \frac{\alpha}{\sqrt{E[g^2]_t} + \epsilon} \nabla J(w_t) $$
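A minimal sketch of one RMSProp step, run with a constant gradient to show that the accumulator levels off instead of growing forever. The function name, the default `beta=0.9`, and the constant gradient are assumptions for the example.

```python
import numpy as np

def rmsprop_step(w, avg_sq, grad, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update; returns the new weights and accumulator."""
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2  # decaying average of grad^2
    w = w - alpha / (np.sqrt(avg_sq) + eps) * grad
    return w, avg_sq

w = np.array([1.0])
avg_sq = np.zeros(1)
for _ in range(200):
    w, avg_sq = rmsprop_step(w, avg_sq, grad=np.array([2.0]))

print(avg_sq)  # settles near grad**2 instead of growing without bound
print(w)
```

Because old squared gradients decay away, the step size does not keep shrinking the way it does with an ever-growing sum.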

This was the preferred optimizer before Adam came around; it did much better than AdaGrad (except on simple problems)

### Adam adaptive moment estimation

keeps track of exponentially decaying average of past gradients (not sum!)

(the decaying average is just 1 – β1 times the decaying sum)

#### Update biased moment estimates

$$ m_t := \beta_1 m_{t-1} + (1 - \beta_1) \nabla J(w_t) $$  
$$ v_t := \beta_2 v_{t-1} + (1 - \beta_2) \nabla J(w_t)^2 $$

#### Bias correction
$$ \hat{m}_t := \frac{m_t}{1 - \beta_1^t} $$  
$$ \hat{v}_t := \frac{v_t}{1 - \beta_2^t} $$

#### updated weights
$$ w_{t+1} := w_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t $$
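The three stages above map onto a single function. This is a sketch following the equations as written; the function name `adam_step`, the `eps` default, and the toy cost J(w) = 0.5·‖w‖² (whose gradient is simply w) are assumptions for illustration.

```python
import numpy as np

def adam_step(w, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # biased first moment
    v = beta2 * v + (1 - beta2) * grad ** 2  # biased second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha / (np.sqrt(v_hat) + eps) * m_hat
    return w, m, v

w = np.array([1.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)

# Assumed toy cost J(w) = 0.5 * ||w||^2, so the gradient equals w
w, m, v = adam_step(w, m, v, grad=w, t=1)
print(w)  # each coordinate moved by about alpha, whatever the gradient size
```

On the first iteration the bias correction makes the corrected moments equal the gradient and its square, so every coordinate moves by roughly `alpha` toward the minimum.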

In [21]:
# optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

Three variants of Adam:
- AdaMax
- Nadam
- AdamW