### Exponential Linear Unit (ELU) as an activation function outperformed the ReLU variants in its authors' experiments

it takes on negative values when z < 0, so the average output of the unit is closer to 0, which helps alleviate vanishing gradients

nonzero gradient for z < 0 so no dead neurons

Helps gradient descent because with α = 1 the function is smooth everywhere, including around z = 0 (it doesn't bounce as much)
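A minimal pure-Python sketch of ELU and its derivative, to make the points above concrete. It assumes the usual default α = 1; the names `elu` and `elu_grad` are illustrative only, not a library API:

```python
import math

def elu(z, alpha=1.0):
    """ELU: z for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

def elu_grad(z, alpha=1.0):
    """Derivative of ELU: 1 for z >= 0, alpha * exp(z) for z < 0."""
    return 1.0 if z >= 0 else alpha * math.exp(z)

for z in (-3.0, -1.0, 0.0, 2.0):
    print(f"z={z:+.1f}  elu={elu(z):+.4f}  grad={elu_grad(z):.4f}")
```

For z < 0 the gradient α·exp(z) gets small but never reaches exactly zero, which is why ELU units cannot die the way ReLU units can.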

### SELU is a scaled ELU activation function

Cannot use regularization techniques like L1 or L2, max-norm, batch-norm, or regular dropout (they break self-normalization)

self-normalization is only guaranteed with plain MLPs

input features must be standardized: mean 0 and SD of 1

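### GELU (Gaussian Error Linear Unit) looks like ReLU but is smooth everywhere, which makes it easier for gradient descent to fit complex problems

SiLU, z·σ(z), was rediscovered under the name Swish and outperformed GELU in the Swish authors' experiments (generalized Swish adds a hyperparameter β to scale the sigmoid function's input)

Mish is a smooth, nonconvex, and nonmonotonic variant of ReLU that outperformed Swish in its authors' experiments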
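The three functions above can be written out in a few lines. This is a pure-Python sketch using the standard definitions (GELU as z·Φ(z), Swish as z·σ(βz), Mish as z·tanh(softplus(z))); the function names are illustrative, not a library API:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gelu(z):
    """GELU(z) = z * Phi(z), with Phi the standard normal CDF."""
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def swish(z, beta=1.0):
    """Swish(z) = z * sigmoid(beta * z); beta = 1 gives SiLU."""
    return z * sigmoid(beta * z)

def mish(z):
    """Mish(z) = z * tanh(softplus(z)), softplus(z) = log(1 + exp(z))."""
    return z * math.tanh(math.log1p(math.exp(z)))

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z={z:+.1f}  gelu={gelu(z):+.4f}  swish={swish(z):+.4f}  mish={mish(z):+.4f}")
```

All three dip slightly below zero for negative inputs and then come back up toward zero, which is the nonmonotonic behavior mentioned above.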

### ReLU is a good default (hardware accelerators provide ReLU-specific optimizations)

Swish is probably a better default for more complex tasks, Mish may give slightly better results

if runtime latency matters, prefer leaky ReLU, or parametrized leaky ReLU (PReLU) for more complex tasks

### Batch Normalization (BN) reduces the danger of vanishing/exploding gradients

adds an operation in the model just before or after the activation function of each hidden layer. It zero-centers and normalizes each input, then scales and shifts the result (using two new parameter vectors per layer: one for scaling, one for shifting)

No need for `StandardScaler` or `Normalization` if BN is first layer

#### **1. Compute the Mini-Batch Mean**
$$
\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i
$$

#### **2. Compute the Mini-Batch Variance**
$$
\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2
$$

#### **3. Normalize the Inputs**
$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
$$
where $ \epsilon $ is a small constant to prevent division by zero. (smoothing term)

#### **4. Scale and Shift**
$$
y_i = \gamma \hat{x}_i + \beta
$$
where:
- $ \gamma $ (scale) and $ \beta $ (shift) are **learnable parameters**.
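The four steps above, written out for a single feature over one mini-batch. This is a plain-Python sketch of the training-time computation only; the function name `batch_norm` and the values of `gamma`, `beta`, and `eps` are illustrative assumptions:

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-3):
    """Apply the four BN steps to one feature across a mini-batch."""
    m = len(xs)
    mu = sum(xs) / m                                        # 1. mini-batch mean
    var = sum((x - mu) ** 2 for x in xs) / m                # 2. mini-batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]   # 3. normalize
    return [gamma * xh + beta for xh in x_hat]              # 4. scale and shift

batch = [1.0, 2.0, 3.0, 4.0]
out = batch_norm(batch, gamma=2.0, beta=0.5)
print(out)
```

The output has mean β and a standard deviation very close to γ (slightly smaller because of ε).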

if we want to make predictions for individual instances rather than batches, there is no reliable batch mean/SD to compute at test time, so:

most implementations of batch normalization estimate these final statistics during training by using a moving
average of the layer's input means and standard deviations. Keras does this automatically.

it's possible to fuse the BN layer with the previous layer after training, avoiding the runtime penalty.
This is done by updating the previous layer's weights and biases so that it directly produces outputs of the appropriate scale and offset.

If the previous layer computes XW + b, then the BN layer computes γ ⊗ (XW + b − μ) / σ + β (ignoring the smoothing term ε), where ⊗ is element-wise multiplication. If we define W′ = γ ⊗ W / σ and b′ = γ ⊗ (b − μ) / σ + β, the expression simplifies to XW′ + b′. So if we replace the previous layer's weights and biases (W and b) with the updated weights and biases (W′ and b′), we can get rid of the BN layer.
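A quick numeric check of the fusion formulas for a single dense unit, ignoring ε as the text does. All numbers and names below are made up for illustration:

```python
def dense(x, w, b):
    """Output of one dense unit: x . w + b"""
    return sum(xi * wi for xi, wi in zip(x, w)) + b

x = [0.5, -1.0, 2.0]                 # one input instance
w, b = [0.2, 0.4, -0.1], 0.3         # previous layer's weights and bias
gamma, beta = 1.5, -0.2              # BN scale and shift
mu, sigma = 0.1, 2.0                 # BN moving mean and standard deviation

# Dense layer followed by a separate BN layer
bn_out = gamma * (dense(x, w, b) - mu) / sigma + beta

# Fused layer: W' = gamma * W / sigma, b' = gamma * (b - mu) / sigma + beta
w_fused = [gamma * wi / sigma for wi in w]
b_fused = gamma * (b - mu) / sigma + beta
fused_out = dense(x, w_fused, b_fused)

print(bn_out, fused_out)             # the two values agree up to rounding
```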

For small networks BN might not have much impact, but for deeper networks it can make a huge difference

In [2]:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
])

In [3]:
model.summary()

first BN layer: 3,136 parameters = 4 × 784 (γ, β, μ, and σ for each of the 784 inputs)

μ and σ are the moving averages; they are not trainable (not affected by backpropagation)

In [4]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('gamma', True),
 ('beta', True),
 ('moving_mean', False),
 ('moving_variance', False)]

In [6]:
# There is some debate about whether to put BN before or after the activation function
# A BN layer includes one offset parameter per input, so you can remove the bias term from the previous layer
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

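Hyperparams for `BatchNormalization`

momentum:

used to update the moving averages: given a new value v (a batch mean or SD), the running average $\hat{v}$ is updated as

$\hat{v} \leftarrow \hat{v} \times \text{momentum} + v \times (1 - \text{momentum})$

axis:

defaults to -1, the last axis (normalized using the means and SDs computed across the other axes)
if you want to treat each pixel independently in a batch of shape [batch size, height, width], set axis=[1, 2]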
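The moving-average update controlled by `momentum` can be sketched as follows. The helper name `update_running` is hypothetical, and momentum = 0.9 is used only so the effect is visible after a few steps:

```python
def update_running(v_hat, v, momentum=0.99):
    """v_hat <- v_hat * momentum + v * (1 - momentum)"""
    return v_hat * momentum + v * (1.0 - momentum)

running_mean = 0.0
for batch_mean in [1.0, 1.0, 1.0]:   # pretend every batch has mean 1.0
    running_mean = update_running(running_mean, batch_mean, momentum=0.9)
    print(running_mean)
```

With a momentum close to 1 the running average moves slowly toward the batch statistics, which is why larger datasets and smaller mini-batches usually call for more 9s.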

### Gradient Clipping mitigates the exploding gradients (setting a threshold)

In [7]:
# optimizer = tf.keras.optimizers.SGD(clipvalue=1.0)
# model.compile([...], optimizer=optimizer)

clipvalue=1.0 clips every component of the gradient vector to a value between -1.0 and 1.0 (the threshold is a hyperparameter you can tune)

this may change the orientation of the gradient vector, example: [0.9, 100.0] becomes [0.9, 1.0]
to keep the orientation, clip by norm instead: clipnorm=1.0 rescales [0.9, 100.0] to [0.00899964, 0.9999595]
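The difference between the two clipping modes, as a pure-Python sketch. The helper names are illustrative; they only mimic what `clipvalue` and `clipnorm` do to a single gradient vector:

```python
import math

def clip_by_value(grad, threshold=1.0):
    """Clip each component independently to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grad]

def clip_by_norm(grad, threshold=1.0):
    """Rescale the whole vector if its L2 norm exceeds the threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    return [g * threshold / norm for g in grad]

grad = [0.9, 100.0]
by_value = clip_by_value(grad)
by_norm = clip_by_norm(grad)
print(by_value)   # [0.9, 1.0]: the direction changes
print(by_norm)    # same direction as grad, norm 1
```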

### Reusing Layers

You can reuse the lower layers of a deep NN trained on a similar task and freeze them, so backpropagation only updates the top layers instead of training the whole network from scratch

### transfer learning

say you have images of T-shirts and sandals and want to reuse the training from a model already trained on the fashion dataset

In [9]:
# my_model_A was trained on the fashion dataset
# model_A = tf.keras.models.load_model("my_model_A")
# model_B_on_A = tf.keras.Sequential(model_A.layers[:-1])  # shares its layers with model_A
# model_B_on_A.add(tf.keras.layers.Dense(1, activation="sigmoid"))

In [11]:
# model_A_clone = tf.keras.models.clone_model(model_A)
# model_A_clone.set_weights(model_A.get_weights()) # otherwise weights are initialized randomly

In [12]:
# to avoid large error gradients that may wreck the reused weights,
# freeze the reused layers during the first few epochs
"""
    for layer in model_B_on_A.layers[:-1]:
        layer.trainable = False

    optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
    model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
"""

now unfreeze reused layers and continue training, good idea to reduce learning rate

In [13]:
"""
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))
"""

In [14]:
# model_B_on_A.evaluate(X_test_B, y_test_B)

Transfer learning works best with deep convolutional neural networks; it does not work very well with small dense networks

### Unsupervised pretraining

you can use this if you don't have much labeled training data and cannot find a model trained on a similar task: train an unsupervised model such as an autoencoder or a GAN on all the data, reuse its lower layers, and train the final task on just the labeled data using supervised learning

### optimizers

(Regular gradient descent will take small steps when slope is gentle and big steps when slope is steep but will never pick up speed)

- momentum
- Nesterov accelerated gradient
- AdaGrad
- RMSProp
- Adam

momentum is like a ball rolling in a bowl: it cares about previous gradients. At each iteration it subtracts the local gradient (times the learning rate) from the momentum vector, then updates the weights by adding the momentum vector, so the gradient is used as an acceleration, not as a speed

To simulate some sort of friction and prevent the momentum from growing too large there is a new hyperparameter β,
between 0 (high friction) and 1 (no friction). Some friction is good: it gets rid of oscillations and speeds up convergence

### Gradient Descent:
$$ w := w - \alpha \nabla J(w) $$

### Momentum Gradient Descent:
$$ v := \beta v - \alpha \nabla J(w) $$
$$ w := w + v $$
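To see the "picks up speed" effect, here is a toy comparison of the two update rules above on J(w) = w², whose gradient is 2w. The cost function, starting point, and hyperparameters are arbitrary choices for illustration:

```python
def grad(w):
    return 2.0 * w                      # gradient of J(w) = w^2

alpha, beta = 0.01, 0.9
w_gd = w_mom = 5.0                      # same starting point for both
v = 0.0                                 # momentum vector starts at zero

for _ in range(50):
    w_gd = w_gd - alpha * grad(w_gd)    # plain gradient descent
    v = beta * v - alpha * grad(w_mom)  # momentum: gradient acts as acceleration
    w_mom = w_mom + v

print(f"plain GD: {w_gd:.4f}   momentum: {w_mom:.4f}")
```

After the same 50 steps the momentum version ends up much closer to the optimum at w = 0, although it gets there by overshooting and oscillating a little.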

In [17]:
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

### Nesterov Accelerated Gradient (NAG)

a variant of momentum optimization that measures the gradient of the cost function not at the local position w but slightly ahead in the direction of the momentum, at w + βv, so the gradient is applied after the momentum step (usually faster than regular momentum)

$$ v := \beta v - \alpha \nabla J(w + \beta v) $$
$$ w := w + v $$
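A sketch of one NAG update following the two equations above, again on the toy cost J(w) = w². The function names and hyperparameter values are illustrative assumptions:

```python
def grad(w):
    return 2.0 * w                      # gradient of J(w) = w^2

def nag_step(w, v, alpha=0.01, beta=0.9):
    """One Nesterov step: evaluate the gradient at the look-ahead point w + beta * v."""
    v = beta * v - alpha * grad(w + beta * v)
    return w + v, v

w, v = 5.0, 0.0
w1, v1 = nag_step(w, v)                 # first step: no momentum yet, same as plain GD
for _ in range(50):
    w, v = nag_step(w, v)
print(f"after 1 step: {w1:.4f}   after 50 steps: {w:.4f}")
```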

In [19]:
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

### AdaGrad Adaptive Gradient Algorithm

corrects the direction toward the global optimum by scaling down the gradient vector along the steepest dimensions

$$ G_t := G_{t-1} + \nabla J(w_t)^2 $$
$$ w_{t+1} := w_t - \frac{\alpha}{\sqrt{G_t} + \epsilon} \nabla J(w_t) $$

AdaGrad performs well for simple quadratic problems, but it often stops too early when training NNs: the learning rate gets scaled down so much that the algorithm ends up stopping before reaching the global optimum, so it runs the risk of never converging to it
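A small sketch of the AdaGrad update above, applied per parameter. The gradient is held constant purely for illustration, and the function name is hypothetical:

```python
import math

def adagrad_step(w, G, grad, alpha=0.1, eps=1e-8):
    """One AdaGrad update on lists of per-parameter values."""
    G = [Gi + g * g for Gi, g in zip(G, grad)]        # accumulate squared gradients
    w = [wi - alpha / (math.sqrt(Gi) + eps) * g       # per-dimension scaled step
         for wi, Gi, g in zip(w, G, grad)]
    return w, G

w, G = [0.0, 0.0], [0.0, 0.0]
grad = [1.0, 10.0]      # second dimension is 10x steeper
for t in range(1, 4):
    w_new, G = adagrad_step(w, G, grad)
    steps = [abs(a - b) for a, b in zip(w_new, w)]
    print(t, steps)
    w = w_new
```

Both dimensions move by the same amount even though one gradient is ten times larger, and the step shrinks every iteration, which is the early-stopping behavior described above.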

### RMSProp

accumulates only the gradients from the most recent iterations, as opposed to all the gradients since the beginning of training, by using exponential decay in the first step:

$$ E[g^2]_t := \beta E[g^2]_{t-1} + (1 - \beta) \nabla J(w_t)^2 $$
$$ w_{t+1} := w_t - \frac{\alpha}{\sqrt{E[g^2]_t} + \epsilon} \nabla J(w_t) $$

this was the preferred optimizer before Adam came around; it did much better than AdaGrad (except on simple problems)
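A sketch of the RMSProp update above for a single parameter, with a constant gradient assumed only to show that the step size settles instead of shrinking toward zero. The function name and defaults are illustrative:

```python
import math

def rmsprop_step(w, s, grad, alpha=0.001, beta=0.9, eps=1e-7):
    """One RMSProp update; s is the decaying average of squared gradients."""
    s = beta * s + (1.0 - beta) * grad ** 2
    w = w - alpha / (math.sqrt(s) + eps) * grad
    return w, s

w, s = 0.0, 0.0
for _ in range(200):
    w_prev = w
    w, s = rmsprop_step(w, s, grad=1.0)
last_step = abs(w - w_prev)
print(f"s = {s:.6f}, last step size = {last_step:.6f}")
```

Because old squared gradients decay away, s converges to the recent squared gradient and the step size stabilizes near α.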

### Adam adaptive moment estimation

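keeps track of an exponentially decaying average of past gradients (like momentum, but an average, not a sum!) and an exponentially decaying average of past squared gradients (like RMSProp)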

(the decaying average is just 1 – β1 times the decaying sum)

#### Update biased moment estimates

$$ m_t := \beta_1 m_{t-1} + (1 - \beta_1) \nabla J(w_t) $$  
$$ v_t := \beta_2 v_{t-1} + (1 - \beta_2) \nabla J(w_t)^2 $$

#### bias correction
$$ \hat{m}_t := \frac{m_t}{1 - \beta_1^t} $$  
$$ \hat{v}_t := \frac{v_t}{1 - \beta_2^t} $$

#### updated weights
$$ w_{t+1} := w_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t $$
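The five Adam equations above map directly onto a few lines of code. This is a single-parameter sketch; the function name is illustrative and `t` is assumed to start at 1:

```python
import math

def adam_step(w, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a single parameter (t is the 1-based step count)."""
    m = beta1 * m + (1.0 - beta1) * grad          # decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - alpha / (math.sqrt(v_hat) + eps) * m_hat
    return w, m, v

w, m, v = adam_step(w=1.0, m=0.0, v=0.0, grad=4.0, t=1)
print(w, m, v)
```

On the very first step the bias correction gives m̂ ≈ g and v̂ ≈ g², so the update is roughly α in the direction opposite the gradient, whatever the gradient's magnitude.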

In [21]:
# optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

Three variants of Adam:
- AdaMax
- Nadam
- AdamW

### AdaMax

works like Adam (equations above) but replaces the second-moment update with `s <- max(β2s, abs(∇θJ(θ)))`, drops the bias correction of s, and scales down the update by a factor of s, which is the max of the absolute values of the time-decayed gradients. Can be more stable than Adam, depending on the dataset
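A single-parameter sketch of the AdaMax variant just described. It assumes the usual form where only the first moment is bias-corrected; the function name and default values are illustrative:

```python
def adamax_step(w, m, s, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One AdaMax update: s tracks the max of the time-decayed absolute gradients."""
    m = beta1 * m + (1.0 - beta1) * grad      # same first moment as Adam
    s = max(beta2 * s, abs(grad))             # replaces Adam's squared-gradient average
    m_hat = m / (1.0 - beta1 ** t)            # only m gets a bias correction
    w = w - alpha * m_hat / (s + eps)         # update scaled down by s
    return w, m, s

w, m, s = adamax_step(w=1.0, m=0.0, s=0.0, grad=4.0, t=1)
w2, m2, s2 = adamax_step(w, m, s, grad=1.0, t=2)
print(s, s2)   # s decays slowly once the gradient gets smaller
```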

### Nadam

Adam optimization + the Nesterov trick, so it often converges slightly faster than Adam. Generally outperforms Adam but is sometimes outperformed by RMSProp

### AdamW

integrates a regularization technique called weight decay, which reduces the size of the model's weights at each training iteration by multiplying them by a decay factor such as 0.99.

Combining Adam with L2 regularization results in models that often don't generalize as well as those produced by SGD; AdamW fixes this issue.
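For context, with plain SGD the two ideas coincide: an L2 penalty λw² adds 2λw to the gradient, which is the same as multiplying the weight by (1 − 2αλ) before the gradient step. The sketch below checks this numerically; the values of `alpha` and `lam` are chosen (as an assumption) so the decay factor comes out at 0.99:

```python
alpha, lam = 0.1, 0.05
decay = 1.0 - 2.0 * alpha * lam          # 0.99

def sgd_l2_step(w, g):
    """SGD step where the L2 penalty lam * w^2 is added to the loss."""
    return w - alpha * (g + 2.0 * lam * w)

def sgd_weight_decay_step(w, g):
    """SGD step where the weight is first multiplied by a decay factor."""
    return decay * w - alpha * g

w, g = 2.0, 0.5
print(sgd_l2_step(w, g), sgd_weight_decay_step(w, g))
```

With Adam the penalty's gradient would be rescaled by the adaptive per-parameter step, so the equivalence no longer holds, which is the gap AdamW addresses by applying the decay directly to the weights.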

If none of the Adam variants work well, your dataset might just not like adaptive gradients; try NAG instead

`tf.keras.optimizers.Adam`, `tf.keras.optimizers.Nadam`, `tf.keras.optimizers.Adamax`, `tf.keras.optimizers.experimental.AdamW` (exposed as `tf.keras.optimizers.AdamW` in more recent releases); with AdamW we may want to tune the `weight_decay` hyperparam

These optimizers rely only on first-order partial derivatives (Jacobians). Second-order partial derivatives (Hessians) are hard to apply to NNs: they are slow to compute and often don't even fit in memory

### Learning rate

setting it too high may make training diverge, too low may take a very long time to converge to the optimum. It's good to start with a large learning rate and then reduce it.

It can also be beneficial to start with a low learning rate, increase it, then drop it again.

#### Scheduling
- Power:
  η(t) = η0 / (1 + t/s)^c; with c = 1 the rate drops to η0/2 after s steps, η0/3 after 2s steps, η0/4 after 3s steps, and so on
- Exponential:
  η(t) = η0 · 0.1^(t/s); the rate gradually drops by a factor of 10 every s steps
- Piecewise constant:
  e.g. η0 = 0.1 for 5 epochs, then η1 = 0.001 for 50 epochs; involves fiddling around to find the right values
- Performance:
  measure the validation error every N steps and reduce the learning rate by a factor of λ when the error stops dropping
- 1cycle:
  starts at η0, increases the learning rate linearly up to η1 halfway through training, then decreases it back down to η0

both performance and exponential scheduling perform well, but 1cycle might perform even better
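The formula-based schedules in the list above can be sketched as plain functions of the step t. The function names and default values are illustrative, and the 1cycle version is simplified: it only has the linear ramp up and ramp down, with no extra final phase:

```python
def power_schedule(t, lr0=0.01, s=1000, c=1.0):
    """eta(t) = eta0 / (1 + t/s)^c"""
    return lr0 / (1.0 + t / s) ** c

def exponential_schedule(t, lr0=0.01, s=1000):
    """eta(t) = eta0 * 0.1^(t/s): divides the rate by 10 every s steps."""
    return lr0 * 0.1 ** (t / s)

def one_cycle(t, total_steps, lr0=0.001, lr1=0.01):
    """Simplified 1cycle: linear ramp lr0 -> lr1 -> lr0, peaking halfway."""
    half = total_steps / 2
    if t <= half:
        return lr0 + (lr1 - lr0) * t / half
    return lr1 - (lr1 - lr0) * (t - half) / half

for t in (0, 1000, 2000, 3000):
    print(t, power_schedule(t), exponential_schedule(t))
print([round(one_cycle(t, 100), 4) for t in (0, 25, 50, 75, 100)])
```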

In [24]:
# power scheduling
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)
# ^ old way of doing this

# InverseTimeDecay computes lr0 / (1 + decay_rate * step / decay_steps),
# i.e. power scheduling with c = 1 and s = decay_steps / decay_rate
initial_learning_rate = 0.01
lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate,
    decay_steps=10_000,  # s: number of steps after which the rate is halved
    decay_rate=1.0       # with decay_steps=10_000 this matches decay=1e-4 above
)

# Set up the optimizer with the learning rate schedule
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)

In [25]:
# exponential decay
def exponential_decay_fn(epoch):
    lr0 = 0.01  # Initial learning rate
    s = 20  # Decay step
    return lr0 * 0.1 ** (epoch / s)

# Define the LearningRateScheduler callback with the decay function
lr_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay_fn)
# history = model.fit(X_train, y_train, [...], callbacks=[lr_scheduler])

In [26]:
# There's a built-in exponential decay schedule:
# Define the exponential decay schedule
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,  # Initial learning rate
    decay_steps=1000,  # Number of steps per decay
    decay_rate=0.96,  # Decay factor
    staircase=True,  # If True, decay in discrete intervals (step-wise)
)

# Use the learning rate schedule with an optimizer
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)

### Exponential Decay Example

Let's assume the following parameters for the exponential decay:

- Initial learning rate $\text{lr}_0 = 0.01$
- Decay rate $\text{decay\_rate} = 0.96$
- Decay steps $\text{decay\_steps} = 1000$

The formula for exponential decay is:

$$
\text{lr}(t) = \text{lr}_0 \times \text{decay\_rate}^{\frac{t}{\text{decay\_steps}}}
$$

Where:
- $\text{lr}_0$ is the initial learning rate.
- $t$ is the current training step (one step per mini-batch, not per epoch).
- $\text{decay\_rate}$ is the factor by which the learning rate is multiplied every $\text{decay\_steps}$ steps.
- $\text{decay\_steps}$ is the number of steps per decay.

With `staircase=True` the exponent is rounded down to $\lfloor t / \text{decay\_steps} \rfloor$, so the rate only changes once every 1000 steps.

#### Step 1: Initial Learning Rate

At the start (step 0), the learning rate is:

$$
\text{lr}(0) = 0.01
$$

#### Step 2: After 1000 Steps

After 1000 steps, the learning rate has decayed once by the `decay_rate` of 0.96. The new learning rate is:

$$
\text{lr}(1000) = 0.01 \times 0.96 = 0.0096
$$

#### Step 3: After 2000 Steps

After 2000 steps, the learning rate has decayed a second time, and the new learning rate is:

$$
\text{lr}(2000) = 0.01 \times 0.96^2 = 0.01 \times 0.9216 = 0.009216
$$

You could have the learning rate update at every step (useful if there are many steps per epoch) instead of at the beginning of each epoch

In [28]:
# piece-wise
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001

# to use it: tf.keras.callbacks.LearningRateScheduler(piecewise_constant_fn)

# performance
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
# history = model.fit(X_train, y_train, [...], callbacks=[lr_scheduler])

In [31]:
# updates after each step:
import numpy as np
import math

# Generate synthetic training data (1000 samples, 10 features each)
X_train = np.random.rand(1000, 10).astype(np.float32)

batch_size = 32
n_epochs = 25
n_steps = n_epochs * math.ceil(len(X_train) / batch_size)
scheduled_learning_rate = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=n_steps, decay_rate=0.1)
optimizer = tf.keras.optimizers.SGD(learning_rate=scheduled_learning_rate)

### L1 and L2 Regularization (L1: Lasso Regression, L2: Ridge Regression)

(L2 to constrain a NN's connection weights, L1 if you want a sparse model)

L1 penalty term: adds the sum of the absolute values of the weights to the loss function

L1 feature selection: pushes the weights of unimportant features to exactly zero, so it selects the important features

L2 penalty term: adds the sum of the squared values of the weights to the loss function

L2 feature selection: none, the weights shrink but rarely reach zero, so all features contribute

In [32]:
layer = tf.keras.layers.Dense(100, activation="relu",
                              kernel_initializer="he_normal",
                              kernel_regularizer=tf.keras.regularizers.l2(0.01))

if both are needed: `tf.keras.regularizers.l1_l2()` L1 + L2 = Elastic Net

$$
L1\_loss = \lambda \sum |w_i|
$$

$$
L2\_loss = \lambda \sum w_i^2
$$

Elastic Net combines both penalties:

$$
\lambda_1 \sum |w_i| + \lambda_2 \sum w_i^2
$$
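The three penalty terms above, computed on a small made-up weight vector. The helper names and λ values are illustrative only:

```python
def l1_penalty(weights, lam):
    """lambda * sum of absolute values"""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lambda * sum of squares"""
    return lam * sum(w * w for w in weights)

def elastic_net_penalty(weights, lam1, lam2):
    """L1 and L2 penalties combined, each with its own lambda."""
    return l1_penalty(weights, lam1) + l2_penalty(weights, lam2)

weights = [0.5, -1.5, 0.0, 2.0]
print(l1_penalty(weights, 0.01))                 # about 0.04
print(l2_penalty(weights, 0.01))                 # about 0.065
print(elastic_net_penalty(weights, 0.01, 0.01))  # about 0.105
```

The L2 term is dominated by the largest weights (2.0 contributes 4.0 of the 6.5 total), while the L1 term charges every weight in proportion to its size.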

instead of repeating the same arguments (regularizer, activation, initializer) in all hidden layers, you could use a loop, or create a thin wrapper with `functools.partial()`

In [34]:
from functools import partial

RegularizedDense = partial(
    tf.keras.layers.Dense,
    activation="relu",
    kernel_initializer="he_normal",
    kernel_regularizer=tf.keras.regularizers.l2(0.01)
)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(100),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax")
])

L2 regularization is fine with SGD, momentum, and Nesterov momentum, but do not use it with Adam and its variants: use AdamW instead

### Dropout Regularization

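at every training step, every neuron (including inputs, excluding outputs) has a probability p of being temporarily ignored for that step

even state-of-the-art networks typically gain 1-2% accuracy

the dropout rate hyperparam p is typically set between 10-50%: 20-30% in recurrent NNs, 40-50% in CNNs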
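What a dropout layer does to one vector of activations during training can be sketched like this. It assumes the "inverted" form, where surviving activations are scaled up by 1 / (1 − rate), which is how the Keras `Dropout` layer behaves; the function name and the seed are illustrative:

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and scale the
    survivors by 1 / (1 - rate) so the expected value is unchanged."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() >= rate else 0.0
            for a in activations]

rng = random.Random(42)              # fixed seed, only for reproducibility
activations = [1.0] * 10_000
dropped = dropout(activations, rate=0.2, rng=rng)
mean_after = sum(dropped) / len(dropped)
print(f"fraction zeroed: {dropped.count(0.0) / len(dropped):.3f}")
print(f"mean after dropout: {mean_after:.3f}")
```

Thanks to the rescaling, the next layer sees inputs of the same average magnitude with or without dropout, so nothing needs to change at inference time.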

In [35]:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28,28]),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(10, activation="softmax")
])

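dropout is only active during training, so make sure to evaluate the training loss without dropout (after training) before comparing it with the validation loss

increase the dropout rate if the model is overfitting, and vice versa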

### Monte Carlo (MC) Dropout

can improve performance of any trained dropout model without having to retrain it

In [39]:
# Generate synthetic test data (200 images of 28x28 pixels)
X_test = np.random.rand(200, 28, 28).astype(np.float32)

y_probas = np.stack([model(X_test, training=True)
                     for sample in range(100)])
y_proba = y_probas.mean(axis=0)

model(X) is similar to model.predict(X) except it returns a tensor rather than a NumPy array

training=True ensures that the Dropout layer remains active

Averaging over multiple predictions with dropout turned on gives us a Monte Carlo estimate

### Max-Norm Regularization

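explicitly limits the magnitude of the weight vectors: for each neuron, the incoming weights w are constrained so that ‖w‖₂ ≤ r, where r is the max-norm hyperparameter

reducing r increases the amount of regularization and helps reduce overfitting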
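The constraint is typically enforced by rescaling a neuron's weight vector after each training step whenever its norm exceeds r. A pure-Python sketch of that rescaling (the function name is illustrative):

```python
import math

def max_norm(w, r=1.0):
    """Rescale w so that its L2 norm is at most r: w <- w * r / ||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= r:
        return list(w)               # already inside the ball, leave untouched
    return [wi * r / norm for wi in w]

print(max_norm([3.0, 4.0], r=1.0))   # norm 5 -> rescaled to norm 1
print(max_norm([0.3, 0.4], r=1.0))   # norm 0.5 -> unchanged
```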

In [40]:
dense = tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal",
                              kernel_constraint=tf.keras.constraints.max_norm(1.))

If a sparse model is needed, use L1 regularization (and optionally zero out the tiny weights after training); if an even sparser model is needed, use the TensorFlow Model Optimization Toolkit

if a low-latency model is needed (one that performs lightning-fast predictions), use ReLU or leaky ReLU, fold the batch normalization layers into the previous layers after training, and possibly reduce the float precision from 32 bits to 16 or 8

use MC dropout to boost performance and get more reliable probability estimates, along with uncertainty estimates