### Exponential Linear Unit (ELU) as an activation function outperformed ReLU

Takes on negative values when z < 0, so the average output is closer to 0, which helps alleviate vanishing gradients

Nonzero gradient for z < 0, so no dead neurons

Helps gradient descent because, with α = 1, the function is smooth everywhere, including around z = 0 (doesn't bounce as much)
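A minimal sketch of ELU and its derivative in plain Python, to make the "nonzero gradient for z < 0" point concrete. The function names and the default α = 1 are illustrative choices, not taken from any library.

```python
import math

def elu(z, alpha=1.0):
    """ELU: z for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

def elu_grad(z, alpha=1.0):
    """Derivative of ELU: 1 for z >= 0, alpha * exp(z) for z < 0."""
    return 1.0 if z >= 0 else alpha * math.exp(z)

def relu_grad(z):
    """Derivative of ReLU (taken as 0 at z = 0), for comparison."""
    return 1.0 if z > 0 else 0.0

for z in (-3.0, -1.0, -0.01, 0.0, 2.0):
    print(f"z={z:6.2f}  elu={elu(z):8.4f}  "
          f"elu'={elu_grad(z):.4f}  relu'={relu_grad(z):.1f}")
```

For negative z the ELU gradient shrinks but never reaches exactly zero, whereas the ReLU gradient is zero there. With α = 1 the ELU gradient approaches 1 as z approaches 0 from the left, so there is no kink at z = 0.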

### SELU is a scaled ELU activation function

Cannot use regularization techniques like ℓ1 or ℓ2 regularization, max-norm, batch-norm, or regular dropout

Self-normalization is only guaranteed with plain MLPs

input features must be standardized: mean 0 and SD of 1
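The self-normalizing behavior can be checked numerically with a small NumPy experiment: push standardized inputs through a deep stack of SELU layers and watch the activation mean and standard deviation. This is only a sketch under assumptions: the constants are the ones published in the SELU paper, the weights use LeCun normal initialization (variance 1 / fan_in), and the layer width, depth, and seed are arbitrary.

```python
import numpy as np

# Constants from the SELU paper (Klambauer et al., 2017).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """Scaled ELU, applied element-wise."""
    negative_part = SELU_ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0)
    return SELU_SCALE * np.where(z > 0, z, negative_part)

rng = np.random.default_rng(42)
n_units = 100                          # arbitrary layer width
Z = rng.normal(size=(500, n_units))    # standardized inputs: mean 0, SD 1

for layer in range(50):                # plain MLP: dense layers only, no biases
    # LeCun normal initialization (assumed here): variance = 1 / fan_in
    W = rng.normal(scale=np.sqrt(1.0 / n_units), size=(n_units, n_units))
    Z = selu(Z @ W)

final_mean, final_std = Z.mean(), Z.std()
print(f"after 50 layers: mean={final_mean:.2f}, std={final_std:.2f}")
```

The printed mean should stay near 0 and the standard deviation near 1 even after 50 layers, which is the property that breaks if the inputs are not standardized or the architecture is not a plain MLP.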

### GELU (Gaussian Error Linear Unit) looks like ReLU but is smooth everywhere, which makes it easier for gradient descent to fit complex patterns

Swish (SiLU generalized with a hyperparameter β that scales the sigmoid function's input) outperformed GELU in its authors' experiments

Mish is a smooth, nonconvex, and nonmonotonic variant of ReLU and outperformed Swish
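The three functions are short enough to write out directly. The sketch below uses the standard definitions (GELU as z·Φ(z), Swish as z·σ(βz), Mish as z·tanh(softplus(z))); the helper names are illustrative and not tied to any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gelu(z):
    """GELU: z * Phi(z), where Phi is the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def swish(z, beta=1.0):
    """Swish: z * sigmoid(beta * z); beta = 1 gives SiLU."""
    return z * sigmoid(beta * z)

def mish(z):
    """Mish: z * tanh(softplus(z)), with softplus(z) = log(1 + exp(z))."""
    return z * math.tanh(math.log1p(math.exp(z)))

for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z={z:5.1f}  gelu={gelu(z):8.4f}  "
          f"swish={swish(z):8.4f}  mish={mish(z):8.4f}")
```

All three dip below zero for moderately negative z and then climb back toward zero as z becomes more negative, which is the nonmonotonic behavior that ReLU lacks.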

### ReLU is a good default (hardware accelerators provide ReLU-specific optimizations)

Swish is probably a better default for more complex tasks; Mish may give slightly better results

If runtime latency matters, prefer LeakyReLU, or parametric leaky ReLU (PReLU) for more complex tasks

### Batch Normalization (BN) reduces the danger of vanishing/exploding gradients

Adds an operation to the model just before or after the activation function of each hidden layer. It zero-centers and normalizes each input, then scales and shifts the result (using two new parameter vectors per layer: one for scaling, one for shifting)

No need for `StandardScaler` or a `Normalization` layer if BN is the first layer

#### **1. Compute the Mini-Batch Mean**
$$
\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i
$$

#### **2. Compute the Mini-Batch Variance**
$$
\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2
$$

#### **3. Normalize the Inputs**
$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
$$
where $ \epsilon $ is a small constant (the smoothing term) that prevents division by zero.

#### **4. Scale and Shift**
$$
y_i = \gamma \hat{x}_i + \beta
$$
where:
- $ \gamma $ (scale) and $ \beta $ (shift) are **learnable parameters**.
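The four steps above can be sketched in NumPy for a mini-batch `X` of shape (m, n_features). The function name `batch_norm_train` is hypothetical, and `eps=1e-3` is an assumed value chosen to match the Keras default for `epsilon`.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-3):
    """Training-time batch normalization over a mini-batch (one row per instance)."""
    mu = X.mean(axis=0)                      # 1. mini-batch mean, per feature
    var = X.var(axis=0)                      # 2. mini-batch variance (1/m, not 1/(m-1))
    X_hat = (X - mu) / np.sqrt(var + eps)    # 3. normalize
    Y = gamma * X_hat + beta                 # 4. scale and shift
    return Y, mu, var

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # 32 instances, 4 features
gamma = np.ones(4)                                  # scale, learnable in a real layer
beta = np.zeros(4)                                  # shift, learnable in a real layer

Y, mu, var = batch_norm_train(X, gamma, beta)
print("output mean per feature:", Y.mean(axis=0))
print("output SD per feature:  ", Y.std(axis=0))
```

With γ = 1 and β = 0 the output has mean 0 and a standard deviation just under 1 for every feature (slightly under because of ε); training then adjusts γ and β to whatever scale and offset suit the next layer.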

If we want to make predictions for individual instances rather than batches, there is no reliable batch mean/SD to compute, so:

most implementations of batch normalization estimate the final statistics during training by using a moving average of the layer's input means and standard deviations. Keras does this automatically.
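A rough sketch of how such moving averages can be maintained and then used for a single instance. The `RunningStats` class is hypothetical; the update rule (an exponential moving average controlled by `momentum`) and the initial values (mean 0, variance 1) are assumptions modeled on the Keras `BatchNormalization` layer, which tracks a moving variance rather than a standard deviation.

```python
import numpy as np

class RunningStats:
    """Exponential moving averages of per-feature batch means and variances."""

    def __init__(self, n_features, momentum=0.99):
        self.momentum = momentum
        self.moving_mean = np.zeros(n_features)
        self.moving_var = np.ones(n_features)

    def update(self, X_batch):
        """Called once per training batch."""
        m = self.momentum
        self.moving_mean = m * self.moving_mean + (1 - m) * X_batch.mean(axis=0)
        self.moving_var = m * self.moving_var + (1 - m) * X_batch.var(axis=0)

    def normalize(self, x, eps=1e-3):
        """Inference-time normalization: works for a single instance."""
        return (x - self.moving_mean) / np.sqrt(self.moving_var + eps)

rng = np.random.default_rng(1)
stats = RunningStats(n_features=3, momentum=0.9)   # lower momentum: faster convergence in this demo
for _ in range(500):
    stats.update(rng.normal(loc=2.0, scale=4.0, size=(64, 3)))

x_single = np.array([2.0, 6.0, -2.0])              # one instance, no batch needed
print("moving mean:", stats.moving_mean)
print("moving var: ", stats.moving_var)
print("normalized: ", stats.normalize(x_single))
```

After enough batches the moving mean and variance settle near the true input statistics (here about 2 and 16), so a lone instance can be normalized without any batch around it.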

It's possible to fuse the BN layer with the previous layer after training, avoiding the runtime penalty.
This is done by updating the previous layer's weights and biases so that it directly produces outputs of the appropriate scale and offset.

If the previous layer computes XW + b, then the BN layer computes γ ⊗ (XW + b − μ) / σ + β (ignoring the smoothing term ε; ⊗ is element-wise multiplication). Defining W′ = γ ⊗ W / σ and b′ = γ ⊗ (b − μ) / σ + β, the equation simplifies to XW′ + b′. So if we replace the previous layer's weights and biases (W and b) with the updated weights and biases (W′ and b′), we can get rid of the BN layer.
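The fusion algebra can be verified numerically. The sketch below uses random values as stand-ins for a trained dense layer and its BN statistics; all variable names are illustrative, and σ is taken as sqrt(variance + ε) so the smoothing term is included rather than ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 5, 3

X = rng.normal(size=(8, n_in))              # a batch of 8 instances
W = rng.normal(size=(n_in, n_out))          # dense layer weights
b = rng.normal(size=n_out)                  # dense layer biases

gamma = rng.normal(size=n_out)              # BN scale
beta = rng.normal(size=n_out)               # BN shift
mu = rng.normal(size=n_out)                 # BN moving mean
var = rng.uniform(0.5, 2.0, size=n_out)     # BN moving variance
eps = 1e-3
sigma = np.sqrt(var + eps)

# Dense layer followed by BN (inference mode).
bn_out = gamma * (X @ W + b - mu) / sigma + beta

# Fused layer: W' = gamma * W / sigma (per output column), b' = gamma * (b - mu) / sigma + beta
W_fused = W * (gamma / sigma)               # broadcasts across the rows of W
b_fused = gamma * (b - mu) / sigma + beta
fused_out = X @ W_fused + b_fused

print("outputs match:", np.allclose(bn_out, fused_out))
```

Because γ, σ, μ, and β each have one entry per output unit, the scaling applies column by column to W, which is what the broadcast in `W * (gamma / sigma)` does.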

For small networks BN might not have much impact, but for deeper networks it can make a huge difference

In [2]:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
])

In [3]:
model.summary()

First BN layer: 3,136 parameters = 4 × 784 (γ, β, μ, and σ for each of the 784 inputs)

μ and σ are the moving averages; they are not trainable
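The same arithmetic extends to the whole model. The sketch below recomputes the counts by hand for the layer sizes used above (784 → 300 → 100 → 10); the helper functions are hypothetical and the result can be compared against what `model.summary()` reports.

```python
def dense_params(n_in, n_out, use_bias=True):
    """Weights plus (optionally) one bias per output unit."""
    return n_in * n_out + (n_out if use_bias else 0)

def bn_params(n_features):
    """Returns (trainable, non_trainable): gamma and beta vs. the two moving averages."""
    return 2 * n_features, 2 * n_features

trainable = 0
non_trainable = 0

# BN sits in front of each hidden Dense layer, so it sees 784, 300, and 100 features.
for n_features in (784, 300, 100):
    t, nt = bn_params(n_features)
    trainable += t
    non_trainable += nt

for n_in, n_out in ((784, 300), (300, 100), (100, 10)):
    trainable += dense_params(n_in, n_out)

first_bn_total = sum(bn_params(784))
total = trainable + non_trainable
print("first BN layer:", first_bn_total)
print("trainable:", trainable, "non-trainable:", non_trainable, "total:", total)
```

Half of every BN layer's parameters are moving averages, so they appear in the non-trainable count and are never touched by backpropagation.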

In [4]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('gamma', True),
 ('beta', True),
 ('moving_mean', False),
 ('moving_variance', False)]

In [None]:
# There is some debate about whether to put BN before or after the activation function.
# Since a BN layer includes one offset parameter per input, the previous layer's bias term can be removed (use_bias=False).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),  # assumption: "relu", matching the earlier model
])