## Artificial Neural Network

---

### The perceptron

Threshold logic unit (TLU)

1. The inputs and output are numbers (instead of binary on/off values), and each input connection is associated with a weight.
2. The TLU first computes a linear function of its inputs.
3. Then it applies a step function to the result.
4. It’s almost like logistic regression, except it uses a step function instead of the logistic (sigmoid) function.

A perceptron is composed of one or more TLUs organized in a single layer.

Remember that each neuron also has a bias term, which is added to the weighted sum of its inputs.
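
The TLU steps above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the hand-picked AND weights are assumptions for the example, not something learned or taken from a library.

```python
def heaviside(z):
    """Step function: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def tlu(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, followed by a step function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return heaviside(z)


# Hand-picked (not learned) weights that make the TLU act like a logical AND.
and_weights, and_bias = [1.0, 1.0], -1.5
and_outputs = [tlu([a, b], and_weights, and_bias) for a in (0, 1) for b in (0, 1)]
```

Only the input pair (1, 1) pushes the weighted sum above zero, so `and_outputs` holds a 1 in the last position only.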

---

### Backpropagation for perceptron

1. The weights of the perceptron are initialized with small random values.
2. The input data is fed into the network, and calculations are carried out layer by layer from the input layer to the output layer to produce the output.
    1. The input values are multiplied by their respective weights.
    2. These products are summed, and the sum is passed through an activation function to produce the output.
3. The error (or loss) is calculated by comparing the network’s output with the actual target value using a loss function. A common loss function for a single output is the mean squared error (MSE).
4. Backpropagation involves three main steps:
    1. Calculate the gradient: The gradient of the error with respect to each weight is calculated using the chain rule of calculus. This determines how a change in each weight affects the error.
    2. Update the weights: The weights are updated in the direction that reduces the error, which is opposite to the gradient.
    3. Iterate.
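
A minimal sketch of the loop above for a single neuron. Since the step function has no useful gradient, this sketch swaps in a sigmoid activation; the learning rate, epoch count, seed and the OR dataset are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, weights, bias):
    """Forward pass: weighted sum plus bias, then the activation."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_neuron(data, lr=0.5, epochs=5000, seed=0):
    """Train one sigmoid neuron with per-sample gradient descent on squared error."""
    rng = random.Random(seed)
    n_inputs = len(data[0][0])
    # Step 1: small random initial weights.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = 0.0
    for _ in range(epochs):
        for x, target in data:
            # Step 2: forward pass.
            y = predict(x, weights, bias)
            # Steps 3-4: loss L = (y - target)^2, so by the chain rule
            # dL/dz = 2 * (y - target) * y * (1 - y), and dz/dw_i = x_i, dz/db = 1.
            delta = 2.0 * (y - target) * y * (1.0 - y)
            # Move each parameter against its gradient.
            weights = [w - lr * delta * xi for w, xi in zip(weights, x)]
            bias -= lr * delta
    return weights, bias


# Logical OR as a toy, linearly separable dataset.
or_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
or_weights, or_bias = train_neuron(or_data)
```

After training, `predict(x, or_weights, or_bias)` should fall below 0.5 for `[0, 0]` and above 0.5 for the other three inputs.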

---

### Backpropagation for multi-layer perceptron

**Solved Example Back Propagation Algorithm Multi-Layer Perceptron Network by Dr. Mahesh Huddar**

https://www.youtube.com/watch?v=tUoUdOdTkRw

Forward pass

Backward pass

Chain rule

---

### Activation functions

1. **Heaviside**: step function → 0 or 1
2. **Tanh**: S shaped → -1 to 1
3. **Sigmoid**: S shaped → 0 to 1 → logistic function → use it when you need a gradient (the step function is flat, so its gradient is 0 and gradient descent makes no progress)
    
    If you want to guarantee that the output will always fall within a given range of values, then use the Sigmoid function.
    
4. **Rectified Linear Unit (ReLU)**: outputs max(0, z), so the output is never negative → _/ shaped function
    
    If you want to guarantee that the output will never be negative, then use the ReLU activation function.
    
5. **Leaky ReLU:** Keeps a small non-zero slope for z < 0, so that neurons never "die" (get stuck outputting 0).
6. **Randomized Leaky ReLU (RReLU):** The slope for z < 0 is picked randomly within a given range during training and fixed to an average value at test time.
7. **Parametric Leaky ReLU (PReLU):** The slope for z < 0 is not a hyperparameter; it is a parameter learned during training.
8. **Exponential Linear Unit (ELU):** Reported to outperform the ReLU variants, with shorter training time and better results. Its only disadvantage is that it is slower to compute than ReLU.
9. **Scaled Exponential Linear Unit (SELU):** A scaled version of ELU (scale factor of about 1.05).
10. **Gaussian Error Linear Unit (GELU):** Often performs better than the other activation functions discussed above, but is computationally intensive.
11. **Swish and Mish:** Other variants of ReLU.
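
For reference, here is a sketch of several of these functions in plain Python. The default values (`negative_slope=0.01`, `alpha=1.0`) are common choices, assumed here for illustration only.

```python
import math


def heaviside(z):
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    return math.tanh(z)


def relu(z):
    return max(0.0, z)


def leaky_relu(z, negative_slope=0.01):
    # Small non-zero slope on the negative side.
    return z if z > 0 else negative_slope * z


def elu(z, alpha=1.0):
    # Smoothly approaches -alpha for very negative z.
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


def gelu(z):
    # z times the standard normal CDF evaluated at z.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def swish(z):
    return z * sigmoid(z)
```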

---

TensorFlow playground → understanding MLPs → effect of hyperparameters (number of layers, neurons, activation function and more)

---

**How to decide the number of neurons for each layer in a neural network?**

https://medium.com/geekculture/introduction-to-neural-network-2f8b8221fbd3

- The number of hidden neurons should be between the size of the input layer and the size of the output layer.
- The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
- The number of hidden neurons should be less than twice the size of the input layer.

GridSearchCV

RandomizedSearchCV

---

### The Vanishing/Exploding Gradients Problems

If the weights are initialized with high variance, then even when the inputs have low variance, the outputs of a layer can have a much greater variance: the variance keeps growing from one layer to the next.

With a saturating activation function such as the sigmoid, these large values land in the flat regions of the function, where the gradient is close to 0. This leads to vanishing gradients.

So how do we solve this?

There are two questions to answer:

1. How to initialize the weights?
2. Which activation function to use?

Solving problem 1

**Glorot and He Initialization** 

For weight initialization:

Main idea

1. The variance of the outputs of each layer should be equal to the variance of its inputs.
2. The gradients should have equal variance before and after flowing through a layer in the reverse direction.

- Fan-in: The number of input units to a layer.
- Fan-out: The number of output units from a layer.

Both conditions cannot hold exactly unless fan-in equals fan-out, so the connection weights are initialized randomly with a compromise variance → Glorot initialization

- **Mean**: The mean of the weights is typically 0.
- **Variance**: The variance of the weights is set to:
    
    Var(W) = $\sigma^{2} = \frac{2}{ fan_{in} + fan_{out}}$
    

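This keeps the variance of the signal roughly the same between the inputs and outputs of each layer. He initialization, usually paired with ReLU and its variants, uses $\sigma^{2} = \frac{2}{fan_{in}}$ instead.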
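
A small sketch of sampling weights with the Glorot variance, using only the standard library. The function names are made up for this example. The uniform variant and the He variance ($2 / fan_{in}$, usually paired with ReLU) are standard results that the notes above do not derive, so treat them as additions.

```python
import math
import random


def glorot_normal(fan_in, fan_out, rng):
    """fan_in x fan_out matrix, mean 0, variance 2 / (fan_in + fan_out)."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def glorot_uniform(fan_in, fan_out, rng):
    """U(-r, r) with r = sqrt(6 / (fan_in + fan_out)) has the same variance."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


def he_normal(fan_in, fan_out, rng):
    """He initialization: mean 0, variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def empirical_variance(matrix):
    """Population variance over all entries of the matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)


rng = random.Random(42)
target_var = 2.0 / (200 + 100)
var_normal = empirical_variance(glorot_normal(200, 100, rng))
var_uniform = empirical_variance(glorot_uniform(200, 100, rng))
var_he = empirical_variance(he_normal(200, 100, rng))
```

With 20,000 sampled weights, `var_normal` and `var_uniform` should both land close to `target_var`, and `var_he` close to `2 / 200`.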

Solving problem 2

Use ReLU or one of its variants.

**Batch Normalization**

However, solving these two problems still does not guarantee that the vanishing/exploding gradients problem will not come back during training.

To address this, we have batch normalization.

It is applied just before or after the activation function in each layer.

Steps in BN:

1. Compute the Mean and Variance: For a given mini-batch, compute the mean and variance of the activations.
    
    $\mu_{batch} = \frac{1}{m} \sum_{i=1}^{m}x_i$
    
    $\sigma_{batch}^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_{batch})^2$
    
2. Normalize the Activations: Subtract the mean and divide by the standard deviation to normalize the activations.
    
    $\hat{x_i} = \frac{x_i - \mu_{batch}}{\sqrt{\sigma_{batch}^2 + \epsilon}}$
    
3. **Scale and Shift**: Introduce two trainable parameters, γ (scale) and β (shift), to allow the model to learn the optimal scaling and shifting of the normalized activations.
    
    $y_i = \gamma \times \hat{x_i} + \beta$     → for each training instance and for each feature
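
The three BN steps above, sketched for a single feature over one mini-batch. This covers only the training-time computation. At inference time, implementations typically use running averages of the mean and variance instead of batch statistics, which this sketch leaves out. The function name and sample values are assumptions.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    m = len(batch)
    # Step 1: batch mean and variance.
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    # Step 2: normalize (eps avoids division by zero).
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    # Step 3: scale by gamma and shift by beta (both trainable in a real layer).
    return [gamma * xh + beta for xh in x_hat]


activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
rescaled = batch_norm(activations, gamma=2.0, beta=3.0)
```

With the default `gamma=1` and `beta=0`, the output has mean 0 and variance very close to 1. With `gamma=2` and `beta=3`, the mean moves to 3.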
    

BN performs scaling. 

BN also acts as a regularizer, which reduces the need for other regularization techniques.

**Gradient Clipping**

Mostly used in RNNs to deal with exploding gradients.

Clip the gradients during backpropagation so that they never exceed some threshold.
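
Two common ways to clip, sketched in plain Python (the function names are made up for this example). Clipping by value caps each component independently and can change the direction of the gradient vector. Clipping by norm rescales the whole vector and keeps its direction.

```python
import math


def clip_by_value(grads, threshold):
    """Cap every gradient component to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grads]


def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


clipped_value = clip_by_value([0.9, 100.0], 1.0)
clipped_norm = clip_by_norm([3.0, 4.0], 1.0)
```

In Keras, the optimizers accept `clipvalue` and `clipnorm` arguments that play the same roles.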

---

**Reusing  pretrained models**

1. Transfer learning
    
    If the task at hand is similar to a task that has already been solved by a deep neural network (DNN), we can reuse some or many layers of the existing DNN to help increase the accuracy of the model. We can freeze the reused layers for the first few epochs so that the new layers have time to learn reasonable weights, then unfreeze them to fine-tune. However, it is usually very difficult to find a good configuration, and the approach is generally used with CNNs.
    
2. Unsupervised Pretraining
    
    If you cannot find any model trained on a similar task, use unsupervised pretraining with GANs or autoencoders (historically, RBMs).
    
3. Pretraining on an Auxiliary Task
    
    If not much labeled training data is available, first train a neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task.
    

---

### Faster Optimizers

So far we have only used SGD, where we simply update the parameters using the gradient values.
We can speed up training by using different optimizers.

1. Momentum
    
    Picture a bowling ball rolling down a gentle slope on a smooth surface: it will start out slowly, but it will quickly pick up momentum until it eventually reaches terminal velocity.
    
    $\beta$ (the momentum) is a hyperparameter and $\alpha$ is the learning rate; both are set before training.
    
    In Keras: `optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)`
    
    1. Initialize Velocity (accumulated gradient)
    $v_0 = 0$
    2. Compute gradient
    $g_t = \nabla_\theta J(\theta_t)$
    3. Update the velocity
    $v_t = \beta v_{t-1} + (1 - \beta) g_t$
    4. Update the parameters
    $\theta_{t+1} = \theta_t - \alpha v_t$
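
The four momentum steps above, sketched on a one-dimensional toy problem. The cost function $J(\theta) = (\theta - 3)^2$, the hyperparameter values and the function name are assumptions picked for the example.

```python
def momentum_minimize(grad_fn, theta, alpha=0.1, beta=0.9, steps=300):
    """Gradient descent with momentum, following the update rules above."""
    v = 0.0  # step 1: velocity starts at zero
    for _ in range(steps):
        g = grad_fn(theta)               # step 2: gradient at the current position
        v = beta * v + (1.0 - beta) * g  # step 3: update the velocity
        theta = theta - alpha * v        # step 4: update the parameter
    return theta


# Toy cost J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_momentum = momentum_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
```

Starting from 0, `theta_momentum` ends up very close to the minimum at 3.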

1. Nesterov Accelerated Gradient
    
    It measures the gradient of the cost function slightly ahead in the direction of the momentum
    
    1. Initialize Velocity (accumulated gradient)
    $v_0 = 0$
    2. Lookahead position
    $\theta_{lookahead} = \theta_t - \beta v_{t-1}$
    3. Compute gradient
    $g_t = \nabla_\theta J(\theta_{lookahead})$
    4. Update the velocity
    $v_t = \beta v_{t-1} + \alpha g_t$
    5. Update the parameters
    $\theta_{t+1} = \theta_t - v_t$
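
The same toy problem with the Nesterov update rules listed above. The only change from plain momentum is that the gradient is evaluated at the lookahead position. The cost function and hyperparameter values are again assumptions for illustration.

```python
def nesterov_minimize(grad_fn, theta, alpha=0.1, beta=0.9, steps=300):
    """Nesterov accelerated gradient, following the update rules above."""
    v = 0.0  # step 1: velocity starts at zero
    for _ in range(steps):
        lookahead = theta - beta * v  # step 2: peek ahead along the momentum
        g = grad_fn(lookahead)        # step 3: gradient at the lookahead point
        v = beta * v + alpha * g      # step 4: update the velocity
        theta = theta - v             # step 5: update the parameter
    return theta


# Toy cost J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_nesterov = nesterov_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
```

In Keras, the equivalent switch is the `nesterov=True` argument of `tf.keras.optimizers.SGD`.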

1. Adagrad
    
    It maintains a running sum of the squares of the gradients for each parameter. 
    
    1. Initialize Accumulated Squared Gradients
    $G_0 = 0$
    2. Compute gradient
    $g_t = \nabla_\theta J(\theta_t)$
    3. Update Accumulated Squared Gradients
    $G_t = G_{t-1} + g_t^2$
    4. Update parameters
    $\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \odot g_t$
    
    Here, α is the initial learning rate, ϵ is a small constant to prevent division by zero, and ⊙ denotes element-wise multiplication.
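
A sketch of the AdaGrad steps on an elongated bowl, $J(x, y) = x^2 + 10y^2$. The cost function, starting point and hyperparameter values are assumptions. Because each parameter is scaled by its own accumulated squared gradients, the steep `y` direction and the shallow `x` direction progress at the same relative pace.

```python
import math


def adagrad_minimize(grad_fn, theta, alpha=0.5, eps=1e-8, steps=2000):
    """AdaGrad with one accumulator per parameter (element-wise operations)."""
    G = [0.0] * len(theta)  # accumulated squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        theta = [t - alpha / math.sqrt(Gi + eps) * gi
                 for t, Gi, gi in zip(theta, G, g)]
    return theta


def bowl_grad(theta):
    """Gradient of J(x, y) = x^2 + 10 * y^2."""
    return [2.0 * theta[0], 20.0 * theta[1]]


theta_adagrad = adagrad_minimize(bowl_grad, [5.0, 5.0])
```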
    

1. RMSProp
    
    AdaGrad runs the risk of slowing down a bit too fast and never converging to the global optimum. RMSProp addresses this by accumulating only the gradients from the most recent iterations, through exponential decay.
    
    1. Initialize Accumulated Squared Gradients
    $E[g^2]_0 = 0$
    $E[g^2]_t$ is the exponentially decaying average of past squared gradients at time step t.
    2. Compute gradient
    $g_t = \nabla_\theta J(\theta_t)$
    3. Update accumulated squared gradient
    $E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) g_t^2$
    4. Update parameters
    $\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t$
    α is the learning rate, ϵ is a small constant to prevent division by zero, and ⊙ denotes element-wise multiplication.
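
The RMSProp steps above, sketched on the toy cost $J(x, y) = x^2 + 10y^2$. The cost function, starting point and hyperparameter values are assumptions. With a constant learning rate, the result hovers near the minimum rather than hitting it exactly.

```python
import math


def rmsprop_minimize(grad_fn, theta, alpha=0.01, beta=0.9, eps=1e-8, steps=2000):
    """RMSProp: like AdaGrad, but with a decaying average of squared gradients."""
    avg_sq = [0.0] * len(theta)  # E[g^2], one entry per parameter
    for _ in range(steps):
        g = grad_fn(theta)
        avg_sq = [beta * a + (1.0 - beta) * gi * gi for a, gi in zip(avg_sq, g)]
        theta = [t - alpha / math.sqrt(a + eps) * gi
                 for t, a, gi in zip(theta, avg_sq, g)]
    return theta


def bowl_grad(theta):
    """Gradient of J(x, y) = x^2 + 10 * y^2."""
    return [2.0 * theta[0], 20.0 * theta[1]]


theta_rmsprop = rmsprop_minimize(bowl_grad, [5.0, 5.0])
```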

1. Adam
    
    a. Initialize Moment Estimates and Time Step
    
    $m_0 = 0$ (First moment estimate)
    
    $v_0 = 0$ (Second moment estimate)
    
    t = 0  (Time step)
    
    b. Compute Gradient
    $g_t = \nabla_\theta J(\theta_t)$
    
    c. Update Time Step 
    
    t = t + 1
    
    d. Update Biased First Moment Estimate
    
    $m_t = \beta_1 m_{t-1} + (1 - \beta_1)g_t$
    
    $\beta_1$ is the decay rate for the first moment estimate (typically around 0.9).
    
    e. Update Biased Second Moment Estimate
    
    $v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2$
    
    $\beta_2$ is the decay rate for the second moment estimate (typically around 0.999).
    
    f. Compute Bias-Corrected First Moment Estimate
    
    $\hat{m_t} = \frac{m_t}{1 - \beta_1^t}$
    
    g. Compute Bias-Corrected Second Moment Estimate
    
    $\hat{v_t} = \frac{v_t}{1 - \beta_2^t}$
    
    h. Update Parameters
    
    $\theta_{t+1} = \theta_t - \frac{\alpha \hat{m_t}}{\sqrt{\hat{v_t}} + \epsilon}$
    
    Here, α is the learning rate, and ϵ is a small constant to prevent division by zero.
    The optimizers below are all variants of Adam.
    
2. AdaMax
3. Nadam
4. AdamW
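
The base Adam update (steps a to h above) can be sketched as follows. The first moment uses the plain gradient and the second moment uses the squared gradient. The toy cost $J(x, y) = x^2 + 10y^2$, the starting point and the step count are assumptions. The variants listed above are not covered here.

```python
import math


def adam_minimize(grad_fn, theta, alpha=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=5000):
    """Adam with bias-corrected first and second moment estimates."""
    m = [0.0] * len(theta)  # first moment estimate
    v = [0.0] * len(theta)  # second moment estimate
    for t in range(1, steps + 1):  # t is the time step, starting at 1
        g = grad_fn(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction: m and v start at 0, so early values are too small.
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        theta = [th - alpha * mh / (math.sqrt(vh) + eps)
                 for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


def bowl_grad(theta):
    """Gradient of J(x, y) = x^2 + 10 * y^2."""
    return [2.0 * theta[0], 20.0 * theta[1]]


theta_adam = adam_minimize(bowl_grad, [5.0, 5.0])
```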

---

**Learning Rate Scheduling**

Starting with a large learning rate and then reducing it once training stops making fast progress usually works better than a constant learning rate.

Another option is to start with a low learning rate, increase it, then drop it again. These strategies are called learning schedules.

1. Power scheduling
    
    $\alpha_t = \frac{\alpha_0}{(1 + kt)^\gamma}$
    
    - $\alpha_t$ is the learning rate at time step t.
    - $\alpha_0$ is the initial learning rate.
    - k is a hyperparameter that controls how quickly the learning rate decays.
    - γ is the power factor that determines the rate of decay (usually between 0 and 1).
    - t is the current time step (or epoch).
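
The power schedule formula as a one-line function. The default values for `alpha0`, `k` and `gamma` are arbitrary choices for illustration.

```python
def power_schedule(t, alpha0=0.1, k=0.01, gamma=1.0):
    """Learning rate alpha0 / (1 + k * t) ** gamma at time step t."""
    return alpha0 / (1.0 + k * t) ** gamma


power_rates = [power_schedule(t) for t in (0, 100, 300)]
```

With these defaults, the rate halves after 100 steps and is down to a quarter after 300 steps, so the decay slows down over time.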

1. Piecewise constant scheduling
    - $\eta_0$ as the initial learning rate
    - $\eta_i$ as the learning rate at the i-th interval
    - $T_i$ as the epoch or iteration at which the learning rate changes
    
    $\eta(t) = \eta_i$ for $T_{i-1} \le t < T_i$
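
A sketch of a piecewise constant schedule using `bisect` from the standard library. It assumes that each boundary belongs to the interval that starts there, so the rate changes exactly at `t = T_i`. The boundaries and rates are made-up example values.

```python
import bisect


def piecewise_constant(t, boundaries, rates):
    """Return the rate for step t.

    boundaries: sorted steps at which the rate changes.
    rates: one more entry than boundaries (the rate for each interval).
    """
    return rates[bisect.bisect_right(boundaries, t)]


boundaries = [5, 15]
rates = [0.1, 0.05, 0.01]
schedule = [piecewise_constant(t, boundaries, rates) for t in (0, 4, 5, 14, 15, 100)]
```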
    
2. Performance scheduling
    
    Define
    
    - η(t) as the learning rate at time t (epoch or iteration)
    - $\eta_{new}$ as the updated learning rate
    - ρ as the reduction factor (a value between 0 and 1).
    - metric(t) as the performance metric at time t (e.g., validation loss or accuracy).
    - patience as the number of epochs or iterations to wait before reducing the learning rate after the performance metric stops improving.
    - min_delta as the minimum change in the monitored metric to qualify as an improvement.
    
    Learning rate is updated as follows
    
    1. Initialize the learning rate to $\eta_0$.
    2. For each epoch or iteration t, track the best performance metric observed so far. The comparison below assumes that a higher metric is better (e.g. accuracy); for a loss, flip it.
        
        If metric(t) − best_metric > min_delta:
        
        1. Update best_metric to metric(t).
        2. Reset epochs_since_improvement to 0.
        
        Else:
        
        1. Increment epochs_since_improvement.
        2. If epochs_since_improvement exceeds patience:
            1. Update the learning rate $\eta(t)$ to $\eta_{new} = \eta(t) \times \rho$.
            2. Reset epochs_since_improvement to 0.
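
The procedure above as a small class. The class name `PlateauScheduler` is hypothetical, and the sketch assumes a metric where higher is better (such as validation accuracy). For a loss, the comparison would need to be flipped.

```python
class PlateauScheduler:
    """Cut the learning rate when the metric stops improving."""

    def __init__(self, lr, factor=0.5, patience=2, min_delta=1e-4):
        self.lr = lr
        self.factor = factor        # rho in the notes
        self.patience = patience
        self.min_delta = min_delta
        self.best_metric = float("-inf")
        self.epochs_since_improvement = 0

    def step(self, metric):
        """Record one epoch's metric and return the (possibly reduced) rate."""
        if metric - self.best_metric > self.min_delta:
            self.best_metric = metric
            self.epochs_since_improvement = 0
        else:
            self.epochs_since_improvement += 1
            if self.epochs_since_improvement > self.patience:
                self.lr *= self.factor
                self.epochs_since_improvement = 0
        return self.lr


scheduler = PlateauScheduler(lr=0.1, factor=0.5, patience=2)
lr_history = [scheduler.step(m) for m in (0.5, 0.6, 0.6, 0.6, 0.6)]
```

The metric stalls at 0.6 from the third epoch on. Once the stall has lasted longer than `patience` epochs, the rate is halved. Keras offers a ready-made version of this idea as the `tf.keras.callbacks.ReduceLROnPlateau` callback.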

1. Exponential scheduling
    - $\eta_0$ as the initial learning rate.
    - γ as the decay rate, a constant between 0 and 1.
    - t as the current time step (epoch or iteration).
    
    Learning rate is given by 
    
    $\eta(t) = \eta_0 \cdot \gamma^t$
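
The exponential schedule as a function. The default values of `eta0` and `gamma` are arbitrary example choices.

```python
def exponential_schedule(t, eta0=0.1, gamma=0.9):
    """Learning rate eta0 * gamma ** t at time step t."""
    return eta0 * gamma ** t


exp_rates = [exponential_schedule(t) for t in range(5)]
```

Each step multiplies the rate by the same factor `gamma`, so the rate keeps shrinking by a constant ratio.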
    
2. 1cycle scheduling
    
    It follows a cyclical pattern with a single cycle, starting from an initial value, increasing to a maximum value, and then decreasing back to a minimum value.
    
    - $\eta_{min}$ as the initial minimum learning rate.
    - $\eta_{max}$ as the maximum learning rate.
    - T as the total number of iterations or epochs.
    - t as the current time step (iteration or epoch).
    - phase_1_end as the time step at the end of the first phase (halfway point).
    
    Learning rate at any time t is given by
    
    if  $t \le phase\_1\_end$  then $\eta_{min} + \frac{t}{phase\_1\_end} (\eta_{max} - \eta_{min})$
    
    if  $t > phase\_1\_end$  then $\eta_{max} - \frac{t - phase\_1\_end}{T - phase\_1\_end} (\eta_{max} - \eta_{min})$
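
A sketch of the 1cycle schedule. It assumes that the second phase is a linear decrease from $\eta_{max}$ back to $\eta_{min}$, mirroring the first phase, and that the first phase ends at the halfway point. The default rates are example values.

```python
def one_cycle(t, total_steps, eta_min=0.001, eta_max=0.01):
    """Linear warm-up to eta_max at the halfway point, then linear decay."""
    phase_1_end = total_steps / 2
    if t <= phase_1_end:
        # Phase 1: increase linearly from eta_min to eta_max.
        return eta_min + (t / phase_1_end) * (eta_max - eta_min)
    # Phase 2 (assumed): decrease linearly from eta_max back to eta_min.
    progress = (t - phase_1_end) / (total_steps - phase_1_end)
    return eta_max - progress * (eta_max - eta_min)


cycle = [one_cycle(t, total_steps=100) for t in (0, 25, 50, 75, 100)]
```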
    

---

### Regularization for Neural Networks

Neural networks are prone to overfitting because they have a lot of parameters. Regularization can be used to prevent this.

Early stopping and batch normalization already act as regularizers.

1. L1 and L2 regularization
2. Dropout
    
    Working of Dropout
    
    At every training step, every neuron has a probability p of being temporarily “dropped out”, meaning it will be entirely ignored during this training step, but it may be active during the next step. The hyperparameter p is called the dropout rate, and it is typically set between 10% and 50%: closer to 20%–30% in recurrent neural nets, and closer to 40%–50% in convolutional neural networks.
    
    Neurons cannot co-adapt with neighboring neurons and they have to be as useful as possible on their own. Typically, dropout is turned off during inference, where all neurons are used to make predictions.
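
A sketch of dropout applied to a list of activations. It uses the common "inverted dropout" convention, where surviving activations are divided by the keep probability during training so that nothing needs rescaling at inference time. That convention is an assumption here, as the notes above do not say which one is used.

```python
import random


def dropout(activations, p, training, rng):
    """Zero each activation with probability p during training.

    Survivors are divided by (1 - p) so the expected output is unchanged.
    At inference time the activations pass through untouched.
    """
    if not training or p == 0.0:
        return list(activations)
    keep_prob = 1.0 - p
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
train_out = dropout([1.0] * 10000, p=0.5, training=True, rng=rng)
infer_out = dropout([1.0, 2.0, 3.0], p=0.5, training=False, rng=rng)
```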
    
3. Monte Carlo (MC) Dropout
    
    **Training Phase**
    
    1. Apply dropout to the neural network as usual with a dropout rate p.
    2. Train the model on the training data with the standard optimization process.
    
    **Inference Phase with MC Dropout**
    
    1. Enable dropout during the inference phase (which is normally turned off).
    2. Perform multiple stochastic forward passes through the network for each input sample. Let’s say we perform N forward passes.
    3. Each forward pass results in a different set of neurons being dropped out, creating an ensemble of N different predictions for each input. The mean of these predictions is used as the final prediction, and their variance (uncertainty) can be checked for confidence assessment.
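
A sketch of the inference phase. The "model" is a hypothetical single linear neuron with dropout applied to its inputs, just enough to show the repeated stochastic passes. The weights, dropout rate and number of passes are made-up values.

```python
import random
import statistics


def noisy_forward(x, weights, p, rng):
    """One forward pass of a toy linear neuron with dropout left on."""
    keep_prob = 1.0 - p
    return sum(w * (xi / keep_prob if rng.random() < keep_prob else 0.0)
               for w, xi in zip(weights, x))


def mc_dropout_predict(x, weights, p=0.2, n_passes=100, seed=0):
    """Run N stochastic passes and return (mean, variance) of the predictions."""
    rng = random.Random(seed)
    preds = [noisy_forward(x, weights, p, rng) for _ in range(n_passes)]
    return statistics.mean(preds), statistics.pvariance(preds)


x, weights = [1.0, 2.0, 3.0], [0.5, -0.2, 0.1]
mc_mean, mc_var = mc_dropout_predict(x, weights)
det_mean, det_var = mc_dropout_predict(x, weights, p=0.0)
```

With `p=0.0` every pass is identical, so the variance collapses to 0. A larger variance with dropout enabled signals a less certain prediction.
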
4. Max-Norm Regularization
    - W: the weight matrix of a particular layer.
    - ∥W∥: the norm of the weight matrix.
    - c: the maximum allowed norm.
    
    Procedure
    
    1. Initialize the weights of the neural network.
    2. Define the maximum norm threshold c.
    3. During training, after each weight update, check the norm of the weight matrix.
        
        If ∥W∥ > c, rescale W so that its norm is c:
        
        $W \leftarrow W \cdot \frac{c}{\|W\|}$
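
The rescaling step, sketched for a flat list of weights with the L2 norm. Treating the weights as a single vector is a simplification made for this example. The function name is made up.

```python
import math


def max_norm_rescale(weights, c):
    """Rescale the weights so that their L2 norm does not exceed c."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= c:
        return list(weights)  # already within the limit, leave unchanged
    return [w * c / norm for w in weights]


rescaled = max_norm_rescale([3.0, 4.0], c=1.0)   # norm 5 is scaled down to 1
untouched = max_norm_rescale([0.3, 0.4], c=1.0)  # norm 0.5 stays as is
```

In Keras, a similar constraint can be attached to a layer with `kernel_constraint=tf.keras.constraints.max_norm(1.0)`.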