<a href="https://www.kaggle.com/code/tommasofacchin/fundamentals-of-neural-networks?scriptVersionId=266197821" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

Personal notes from my studies. Sources are university lectures, YouTube, books and ChatGPT.

# **NEURAL NETWORK ARCHITECTURE**
A **Neural Network** is a machine learning model inspired by the human brain. It consists of layers of interconnected neurons that process and learn from data.
<br>Neural networks are typically structured into **three types of layers**.

## **1. Input Layer**
The **input layer** is where the network receives raw data. It does NOT perform any computations—its only job is to pass the input values to the next layer.
<br>The number of neurons in the input layer is equal to the number of features in the dataset.
<br>Example Inputs:
* Image Processing: Pixel values from an image
* Text Processing: Word embeddings (numerical representation of words)
* Tabular Data: Age, salary, temperature, etc.

If you're building a neural network for handwritten digit recognition on 28×28-pixel images, the input layer has 784 neurons (28 × 28 = 784, one per pixel).

## **2. Hidden Layers**
Hidden layers are where the real learning happens. These layers perform transformations using **weights**, biases, and activation functions to extract important patterns.
<br>Each neuron in a hidden layer does the following:
1. **Weighted Sum Calculation**
   * Each neuron receives inputs, multiplies them by weights, and adds a bias term:
$$
     z=Wx+b
$$
    * $W$ = Weights, $x$ = Input values, $b$ = Bias
2. **Activation function**
   * The result, $z$, is passed through an activation function to introduce non-linearity, allowing the network to learn complex patterns.
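
The two steps above can be sketched for a whole layer at once. This is a minimal pure-Python illustration, not a framework implementation: the helper name `dense_forward` and the weight values are made up for the example, and sigmoid is just one possible activation.

```python
import math

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(W, x, b, activation):
    """Compute activation(Wx + b); W is a list of rows, one row per neuron."""
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    return [activation(z_i) for z_i in z]

# Hypothetical layer with 3 inputs and 2 neurons (values chosen for illustration).
W = [[0.2, -0.5, 0.1],
     [0.4, 0.3, -0.2]]
b = [0.0, 0.1]
x = [1.0, 2.0, 3.0]

a = dense_forward(W, x, b, sigmoid)  # one activation per neuron
```

Here the weighted sums are $z = (-0.5, 0.5)$, so the two activations are `sigmoid(-0.5)` and `sigmoid(0.5)`.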

<br>The number of hidden layers and neurons is a key design choice: more hidden layers make the network more powerful, but also increase the risk of **overfitting**.
<br>A deep neural network might have **multiple hidden layers**, where the first learns edges in an image, the next learns textures, and deeper layers recognize objects like faces or digits.

## **3. Output Layer**
The output layer produces the **final prediction**. It has one neuron per output value (for example, one per class in a classification task).
<br>If we are classifying digits (0–9), the output layer will have **10 neurons**, each representing the probability of a specific digit.

## **Fully Connected (Dense) Layers vs. Convolutional Layers**
**Fully Connected (Dense) Layers:**
* Every neuron in one layer is connected to **every** neuron in the next layer.
* Used in standard **feedforward neural networks (FNNs)**.
* Good for structured/tabular data but **inefficient for images** due to large weight matrices.

**Convolutional Layers (CNNs):**
* Used in **Convolutional Neural Networks (CNNs)** for image processing.
* Instead of connecting every neuron, it applies **kernels** that scan small parts of the image to detect edges, textures, and objects.
* More efficient for vision tasks.


## **Kernel**
**A kernel** (also called a filter) is a small square matrix used in CNNs to detect patterns such as **edges**, **textures**, and **objects** in images. It slides over an image (or feature map) and performs **convolution** to extract features.
* The kernel is applied to small regions of the image at a time.
* Multiply each pixel value by the corresponding kernel value.
* Sum up all the multiplied values to get a new pixel value.
* Move to the next region and repeat the process.

If we apply a **blur kernel**, it smooths the image. If we apply an **edge-detection kernel**, it highlights edges.
<br>**CNNs don’t use predefined kernels**. Instead, they learn the best kernels during training.
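
The sliding multiply-and-sum procedure can be sketched in a few lines. This is a minimal illustration assuming no padding and a stride of 1; the function name `convolve2d`, the toy image, and the hand-written kernel are hypothetical. Like CNN layers in most frameworks, it does not flip the kernel (strictly speaking a cross-correlation).

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            # Multiply each pixel in the current region by the matching kernel value and sum.
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Toy 4x4 image: dark columns (0) on the left, bright columns (1) on the right.
image = [[0, 0, 1, 1] for _ in range(4)]
# Hand-written vertical edge detector (a trained CNN would learn its own values).
kernel = [[-1, 1],
          [-1, 1]]

feature_map = convolve2d(image, kernel)
```

The feature map is non-zero only in the middle column, which is where the dark-to-bright edge sits.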

# **PERCEPTRON & MULTI-LAYER PERCEPTRON (MLP)**
A Perceptron is the simplest type of artificial neuron, introduced by Frank Rosenblatt in 1958. It mimics a biological neuron, performing the following steps:
1. Takes multiple inputs $x_1,x_2,...,x_n$.
2. Each input is multiplied by a **weight** $w_1,w_2,...,w_n$.
3. The weighted sum is passed through an **activation function**.
4. Produces an **output** (either 0 or 1 for binary classification).
$$
 z = w_1x_1 + w_2x_2 + ... + w_nx_n + b
$$
$$ 
 y=f(z)
$$
where :
* $x_i$ = input
* $w_i$ = weight
* $b$ = bias
* $f(z)$ = activation function (usually a step function in Perceptrons)

A single perceptron works well for **linearly separable** problems (e.g. AND, OR gates), but cannot solve **non-linearly separable** problems like **XOR**, which require hidden layers.
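
As a small sketch of the formulas above, here is a perceptron with a step activation computing the AND and OR gates. The weights and biases are picked by hand for illustration (they are not learned here), and the function name `perceptron` is just a label for this example.

```python
def perceptron(x, w, b):
    """Return 1 if the weighted sum w.x + b is non-negative, otherwise 0."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if z >= 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hand-picked parameters: AND fires only when both inputs are 1,
# OR fires when at least one input is 1.
and_outputs = [perceptron(x, w=(1, 1), b=-1.5) for x in inputs]
or_outputs = [perceptron(x, w=(1, 1), b=-0.5) for x in inputs]
```

No single choice of `w` and `b` produces the XOR outputs `[0, 1, 1, 0]`, because no straight line separates those two classes.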

## **Multi-Layer Perceptron (MLP)**
An **MLP** is a **neural network** with one or more **hidden layers** that perform transformations using weights and activation functions.
<br>For each layer:
$$
 z^{(l)} = W^{(l)}a^{(l-1)} + b^{(l)}
$$
$$ 
 a^{(l)}=f(z^{(l)})
$$
where:
* **$W^{(l)}$** = weights of layer **l**
* **$a^{(l)}$** = activations of layer **l** (with $a^{(0)} = x$, the network input)
* **$b^{(l)}$** = biases of layer **l**
* **$f(z)$** = activation function

## **Why is MLP Powerful?**
* Solves Non-Linear Problems (e.g., XOR).
* Multiple Layers Enable Feature Extraction.
* Works for Classification & Regression.
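
To make the XOR claim concrete, the sketch below wires up a two-layer MLP by hand: the hidden layer computes OR and AND, and the output neuron fires when OR is true but AND is not. The weights are chosen manually for illustration (a trained network would learn its own), and the helper names are hypothetical.

```python
def step(z):
    """Step activation: 1 for non-negative input, otherwise 0."""
    return 1 if z >= 0 else 0

def layer(W, a_prev, b, f):
    """One MLP layer: a = f(W a_prev + b), with W given as a list of rows."""
    return [f(sum(w * a for w, a in zip(row, a_prev)) + b_i)
            for row, b_i in zip(W, b)]

# Hidden layer: neuron 1 acts as OR, neuron 2 acts as AND (hand-picked values).
W1, b1 = [[1, 1], [1, 1]], [-0.5, -1.5]
# Output layer: fires when OR is 1 and AND is 0.
W2, b2 = [[1, -1]], [-0.5]

def xor_mlp(x):
    a1 = layer(W1, x, b1, step)
    a2 = layer(W2, a1, b2, step)
    return a2[0]

xor_outputs = [xor_mlp(x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```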

# **ACTIVATION FUNCTIONS**

An **activation function** is a mathematical function that determines how strongly a neuron in a neural network is activated for a given input. An **activation unit** is a single neuron that applies an activation function to the weighted sum of its inputs.

## **Why Are Activation Functions Important?**
Neural networks rely on activation functions to:
* Introduce **non-linearity** - Without activation functions, deep networks behave like a simple linear model.
* Control **neuron outputs** - They decide how much information passes through each neuron.
* Help with **backpropagation** - Differentiable activations provide the gradients that learning relies on, and some (e.g., ReLU) keep those gradients from shrinking in deep networks.

## **Types of Activation Functions**
* **Threshold Function:** Outputs 1 if input is above a threshold, otherwise 0.
  <br>Not used with backpropagation because its derivative is zero everywhere except at the threshold, where it is undefined, so no gradient signal reaches the weights. It also only produces binary outputs.
  $$
f(x) =
\begin{cases} 
1 & \text{if } x \geq 0 \\
0 & \text{if } x < 0
\end{cases}
$$
* **Sigmoid Function:** Converts inputs into a probability-like value between 0 and 1.
  <br>Used in binary classification, but can cause **vanishing gradients** in deep networks (very small gradients slow learning).
  $$
f(x) = \frac{1}{1 + e^{-x}}
$$
* **Tanh Function:** Similar to sigmoid, but outputs values between -1 and 1.
  <br>It's zero-centered, which helps optimization, but it still suffers from vanishing gradients.
  $$
f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
$$
* **ReLU (Rectified Linear Unit):** Outputs x if positive, otherwise 0.
  <br>Mitigates vanishing gradients (gradient is 1 for x>0) and works well in deep networks, but can suffer from dying neurons (for x < 0 the gradient is 0, so a neuron stuck there stops learning).
  $$
f(x) = \max(0, x)
$$
* **Leaky ReLU:** Has a small slope for negative x instead of zero.
  $$
f(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
$$
* **Softmax (For Multi-Class Classification):** Converts scores into probabilities summing to 1.
  <br>Used in the final layer of multi-class classifiers, where each output is interpreted as a class probability.
  $$
f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}
$$
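
The formulas above translate almost directly into code. The sketch below uses only the standard library; the Leaky ReLU slope `alpha=0.01` is an assumed value (the notes leave $\alpha$ unspecified), and the softmax subtracts the maximum score first, a common trick that avoids overflow without changing the result.

```python
import math

def threshold(x):
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):  # alpha=0.01 is an assumed slope, not from the notes
    return x if x > 0 else alpha * x

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```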

## **Vanishing Gradient Problem**
The vanishing gradient problem occurs when gradients **become too small** during backpropagation, causing **slow or no learning** in deep neural networks.

**Why does it happen?**
<br>During backpropagation, gradients are computed using the **chain rule**. If each layer's gradient factor is less than 1, multiplying many small numbers **shrinks** the gradient exponentially, meaning that **earlier layers stop learning** (since their weight updates are almost zero).
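
A rough numerical sketch of this shrinking effect: the sigmoid derivative is at most 0.25 (reached at $z = 0$), so even in the best case the product of activation derivatives falls off exponentially with depth. This simplification ignores the weight matrices, which also enter the real chain-rule product; `gradient_scale` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def gradient_scale(num_layers, z=0.0):
    """Product of sigmoid derivatives across `num_layers` layers (weights ignored)."""
    return sigmoid_derivative(z) ** num_layers

# Best-case scaling factor reaching the first layer, for networks of different depth.
scales = {n: gradient_scale(n) for n in (1, 5, 10, 20)}
```

With 10 layers the factor is already below one millionth, which is why the earliest layers barely update.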

**When does it happen?**
* **Sigmoid & Tanh Activation Functions:** These functions squash outputs into small ranges (Sigmoid: 0 to 1, Tanh: -1 to 1). Their **derivatives are small** for inputs of large magnitude (the sigmoid derivative never exceeds 0.25), which results in **tiny gradients** and slows down learning.
* **Deep Networks (Many Layers):** The more layers, the more times gradients are multiplied, so small gradients **vanish** as they backpropagate through the layers.
* **Poor Weight Initialization:** If the initial weights are too large, sigmoid/tanh activations **saturate** and their derivatives approach zero; if they are too small, the signal shrinks from layer to layer. Both worsen the problem.

  
**Solutions to Vanishing Gradients**
* **Use ReLU Instead of Sigmoid/Tanh:** ReLU has a gradient of 1 for positive inputs, avoiding vanishing gradients. However, ReLU can cause dying neurons (where gradients become zero for negative inputs).
* **Batch Normalization (BN):** BN normalizes activations, keeping them in a range where activation functions are not saturated. Helps stabilize learning and allows higher learning rates.
* **Use Advanced Optimizers:** Adam and RMSprop adapt the step size per parameter, so weights with small gradients still receive meaningful updates.

# **BACKPROPAGATION**
Backpropagation is the **core learning algorithm** for neural networks. It efficiently computes the gradient of the **loss function** with respect to every weight by propagating errors **backward** through the network with the **chain rule**; **gradient descent** then uses those gradients to update the weights.

## **Why Do We Need Backpropagation?**
When training a neural network, we want to find the optimal weights $W$ and biases $b$ that minimize the error (loss function). However, computing how much each weight contributes to the total error is not straightforward in deep networks.

- **Forward Pass:** Data flows from input → hidden layers → output.
- **Compute Loss:** Compare predicted output to actual output.
- **Backward Pass:** Adjust weights proportionally to their impact on the loss using gradients.

The goal is to find the gradients of the loss function with respect to each weight efficiently.

## **Step-by-Step Backpropagation**
**1. Forward Pass (Compute Predictions)**
   Each neuron in the network performs:
   - **Linear transformation:** Compute weighted sum of inputs:
   $$
     z=Wx+b
   $$
   - **Apply activation function:**
   $$
     a=f(z)
   $$
   - The process repeats layer by layer until we get the final output $\hat{y}$.
     
**2. Compute Loss**
<br>The loss function $L$ measures how different the predicted output $\hat{y}$ is from the actual output $y$.
   <br> For example, using **MSE** for regression or **Cross-Entropy Loss** for classification. 

**3. Backward Pass (Compute Gradients)**
<br>To update weights, we need to compute how much each weight contributed to the error. This is done using partial derivatives of the loss function.
<br>Backpropagation applies the **chain rule** to compute gradients step by step from output to input.

Gradient of Loss for the **output layer**:
$$
\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial W}
$$
Gradient of Loss for **hidden layers**:
$$
\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W}
$$
where:
* $\frac{\partial L}{\partial a}$ is the **gradient from the next layer** (recursively computed).
* $\frac{\partial a}{\partial z}$ is the **derivative of the activation function**.
* $\frac{\partial z}{\partial W} = x$, since $z = Wx + b$ (for a hidden layer, $x$ is the activation coming from the previous layer).

Instead of manually computing gradients, frameworks like **TensorFlow** & **PyTorch** use **automatic differentiation** to handle backprop efficiently.

**4. Update Weights (Gradient Descent)**
<br>After computing gradients, we update weights using gradient descent:
$$
W = W - \eta \frac{\partial L}{\partial W}
$$
where:
* $\eta$ = learning rate (controls step size).
* $\frac{\partial L}{\partial W}$ = gradient of loss w.r.t weights.

Steps 1–4 repeat for multiple epochs (an epoch is one complete pass through the entire training dataset). Training stops when one of the following happens:
* Loss stops decreasing significantly.
* Validation loss increases (stopping here helps prevent overfitting).
* A predefined epoch limit is reached.
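
The four steps can be traced by hand on the smallest possible network. The sketch below assumes a scalar input, one sigmoid hidden neuron, a linear output neuron, and a squared-error loss on a single training example; the starting parameters, the learning rate, and all function names are illustrative choices, not part of the notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass: hidden activation h and prediction y_hat."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)   # hidden layer: a = f(z)
    y_hat = w2 * h + b2        # linear output layer
    return h, y_hat

def loss_fn(params, x, y):
    _, y_hat = forward(params, x)
    return (y_hat - y) ** 2

def backward(params, x, y):
    """Backward pass: chain rule from the loss back to every parameter."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    dL_dyhat = 2.0 * (y_hat - y)        # dL/dy_hat
    dL_dw2 = dL_dyhat * h               # dz2/dw2 = h
    dL_db2 = dL_dyhat
    dL_dh = dL_dyhat * w2               # gradient passed back to the hidden layer
    dL_dz1 = dL_dh * h * (1.0 - h)      # sigmoid derivative: h * (1 - h)
    dL_dw1 = dL_dz1 * x                 # dz1/dw1 = x
    dL_db1 = dL_dz1
    return [dL_dw1, dL_db1, dL_dw2, dL_db2]

params = [0.5, 0.0, -0.3, 0.1]  # hypothetical starting values: w1, b1, w2, b2
x, y = 1.5, 1.0                 # a single training example
eta = 0.1                       # learning rate

initial_loss = loss_fn(params, x, y)
for _ in range(200):
    grads = backward(params, x, y)
    params = [p - eta * g for p, g in zip(params, grads)]  # gradient descent update
final_loss = loss_fn(params, x, y)
```

A useful sanity check for hand-written gradients like these is to compare them against finite differences of `loss_fn`.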

# **GRADIENT DESCENT**
Gradient Descent is an optimization algorithm used to minimize a function, typically a **loss function** (which measures the difference between the predicted and actual values) in ML.
<br> It does this by iteratively adjusting parameters (weights) in the direction of **steepest descent** (the negative gradient).

## **Why do we need Gradient Descent**
* Neural networks **learn by adjusting weights** to minimize error.
* The error is **measured by a loss function** (e.g. MSE for regression or cross-entropy for classification).
* Gradient Descent **helps find the optimal weights** that minimize the loss function.

## **How it works**
Gradient Descent follows this update rule : 
$$ \theta = \theta -\alpha\times\nabla J(\theta)$$
Where:
* $\theta$ : Model parameters (Weights, biases).
* $J(\theta)$ : Loss function.
* $\nabla J(\theta)$ : Gradient of the loss function.
* $\alpha$ : Learning rate.

**Step-by-step Process**
1. **Compute the Loss** (Difference between predictions & actual values).
2. **Compute the Gradient** (Partial derivatives of loss w.r.t. each parameter).
3. **Update Weights** (Move in the negative gradient direction).
4. **Repeat until convergence** (or stop when loss stops decreasing).
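
As a minimal sketch of this loop, the example below minimizes the toy function $J(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$ and whose minimum is at $\theta = 3$. The function, the learning rate, and the stopping tolerance are assumptions chosen for illustration.

```python
def J(theta):
    """Toy loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def grad_J(theta):
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, alpha=0.1, tol=1e-8, max_steps=1000):
    """Repeat theta <- theta - alpha * grad until the gradient is nearly zero."""
    for _ in range(max_steps):
        g = grad_J(theta)
        if abs(g) < tol:           # converged: gradient is (almost) zero
            break
        theta = theta - alpha * g  # move against the gradient
    return theta

theta_star = gradient_descent(theta=0.0)
```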

## **Types of Gradient Descent**
* **Batch Gradient Descent (BGD)**
  <br>Uses **entire dataset** to compute gradients in each step.
  <br>More stable updates but **slow for large datasets**. 
  <br>Used when dataset is small and computational power is high.
  
* **Stochastic Gradient Descent (SGD)**
  <br>Updates **after each data point** instead of the entire dataset.
  <br>Faster but more noisy (fluctuations in loss).
  <br>The noise can help escape shallow local minima.

  
* **Mini-Batch Gradient Descent**
  <br>Uses **small batches** of data for updates.
  <br>**Balances** between BGD (stable) and SGD (fast).
  <br>Most commonly used in deep learning.
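
The three variants differ only in how many samples feed each update, so one training loop with a `batch_size` parameter covers all of them. The sketch below fits a single weight `w` to noiseless data $y = 2x$; the data, learning rate, epoch count, and the function name `train` are illustrative assumptions.

```python
import random

def train(data, batch_size, alpha=0.05, epochs=200, seed=0):
    """Fit y = w * x with gradient descent; batch_size selects the variant."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new sample order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the batch MSE with respect to w.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= alpha * grad
    return w

data = [(i / 10.0, 2.0 * i / 10.0) for i in range(1, 11)]  # noiseless y = 2x

w_batch = train(data, batch_size=len(data))  # batch GD: one update per epoch
w_sgd = train(data, batch_size=1)            # SGD: one update per sample
w_mini = train(data, batch_size=4)           # mini-batch: a few updates per epoch
```

All three runs approach $w = 2$. Batch gradient descent makes the fewest updates per epoch, so in this toy setup it is the furthest from the target after the same number of epochs.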

  
## **Learning Rate & Convergence**
The learning rate $\alpha$ determines step size:
* **Too High :** May overshoot and never converge.
* **Too Low :** Slow convergence; may get stuck in local minima.
* **Optimal :** Reaches the minimum efficiently.
  
<br> **Adaptive methods** (like Adam or RMSprop) dynamically adjust learning rates for better training.

## **Types of Loss Functions**
1. **Regression Loss Function**
   * **Mean Squared Error (MSE):**
    $$
     MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
    $$
     Penalizes large errors more than smaller ones, making it sensitive to outliers.
   * **Mean Absolute Error (MAE):**
     $$
     MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
     $$
     Less sensitive to outliers, but may be harder to optimize because its gradient has constant magnitude and is undefined at zero error.
   * **Huber Loss:**
     A combination of MSE and MAE that is robust to outliers.

2. **Classification Loss Function**
   * **Binary Cross-Entropy (Log Loss)** (For binary classification):
    $$
     BCE = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
    $$
   * **Categorical Cross-Entropy** (For multi-class classification):
     $$
     CCE = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log (\hat{y}_{ij})
     $$
     Used when output probabilities sum to 1 (Softmax activation).
   * **Sparse Categorical Cross-Entropy:** Similar to categorical cross-entropy but used when class labels are integers instead of one-hot encoded vectors.
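
The loss formulas above can be written as short functions. In this sketch the probabilities are clipped by a small `eps` to avoid `log(0)` (an implementation detail assumed here, not part of the formulas), and the categorical losses are averaged over the samples so they are on the same scale as BCE; drop the division to get a plain sum.

```python
import math

def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def _clip(p, eps=1e-12):
    """Keep probabilities away from 0 and 1 so the logarithm stays finite."""
    return min(max(p, eps), 1.0 - eps)

def binary_cross_entropy(y_true, y_pred):
    total = sum(y * math.log(_clip(p)) + (1 - y) * math.log(1 - _clip(p))
                for y, p in zip(y_true, y_pred))
    return -total / len(y_true)

def categorical_cross_entropy(y_true, y_pred):
    """y_true rows are one-hot; y_pred rows are probabilities (e.g. softmax output)."""
    total = sum(y_ij * math.log(_clip(p_ij))
                for y_row, p_row in zip(y_true, y_pred)
                for y_ij, p_ij in zip(y_row, p_row))
    return -total / len(y_true)  # averaged over samples

def sparse_categorical_cross_entropy(labels, y_pred):
    """Same loss, but labels are integer class indices instead of one-hot rows."""
    total = sum(math.log(_clip(p_row[label])) for label, p_row in zip(labels, y_pred))
    return -total / len(labels)
```

For a one-hot target and its matching integer label, the categorical and sparse versions return the same value.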

# **ADVANCED OPTIMIZATION ALGORITHMS**

Gradient Descent is the foundation for training neural networks, but there are several **advanced optimizers** that improve convergence speed, stability, and performance. These optimizers adapt the learning process to the model and dataset.


## **Momentum**

Momentum helps accelerate gradient descent by **accumulating past gradients** to build velocity.

### How It Works
Instead of updating weights purely based on the current gradient, momentum uses a fraction of the previous update:

$$
v_t = \beta v_{t-1} + (1 - \beta) \nabla_\theta J(\theta)
$$
$$
\theta = \theta - \alpha v_t
$$

Where:  
* $v_t$ = velocity (momentum term)  
* $\beta$ = momentum coefficient (usually 0.9)  
* $\alpha$ = learning rate  
* $\nabla_\theta J(\theta)$ = current gradient  

### Benefits
* Accelerates convergence in **consistent gradient directions**.  
* Reduces oscillations in **ravines** of the loss surface.  
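
A minimal sketch of the momentum update, applied to the toy loss $J(\theta) = (\theta - 3)^2$. The loss, the step count, and the function names are assumptions for illustration; the update itself follows the two formulas above, including the $(1 - \beta)$ factor.

```python
def grad_J(theta):
    """Gradient of the toy loss J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def momentum_descent(theta, alpha=0.1, beta=0.9, steps=300):
    v = 0.0  # velocity starts at zero
    for _ in range(steps):
        v = beta * v + (1.0 - beta) * grad_J(theta)  # accumulate past gradients
        theta = theta - alpha * v                    # step along the velocity
    return theta

theta_momentum = momentum_descent(theta=0.0)
```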


## **Nesterov Accelerated Gradient (NAG)**

NAG is a **look-ahead version of momentum**, checking the future position before calculating the gradient.

$$
v_t = \beta v_{t-1} + \alpha \nabla_\theta J(\theta - \beta v_{t-1})
$$
$$
\theta = \theta - v_t
$$

### Benefits
* Often provides **faster convergence** than standard momentum.  
* Reduces **overshooting** in high-curvature areas.  


## **Adagrad**

Adagrad adapts the learning rate **per parameter**, scaling it inversely with the sum of squared past gradients.

$$
\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \nabla_\theta J(\theta_t)
$$

Where:  
* $G_t$ = sum of squares of gradients for each parameter  
* $\epsilon$ = small number for numerical stability  

### Benefits
* Automatically adjusts learning rates.  
* Works well for **sparse data**.  

### Limitation
* Learning rate can **shrink too much** over time, causing slow convergence.


## **RMSprop**

RMSprop modifies Adagrad to **avoid the diminishing learning rate problem** by using an exponential moving average of squared gradients:

$$
E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) (\nabla_\theta J(\theta_t))^2
$$
$$
\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \nabla_\theta J(\theta_t)
$$

### Benefits
* Works well for **non-stationary objectives** (like RNNs).  
* Popular in deep learning due to **fast convergence**.  


## **Adam (Adaptive Moment Estimation)**

Adam combines **momentum** and **RMSprop**, tracking exponential moving averages of both the first moment (mean) and the second moment (uncentered variance) of the gradients.

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J(\theta_t)
$$
$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta J(\theta_t))^2
$$
Bias-corrected estimates:
$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$
Update rule:
$$
\theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

### Default Hyperparameters
* $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
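
The Adam equations can be sketched for a single scalar parameter as follows. The function name `adam_minimize` and the toy loss $J(\theta) = (\theta - 3)^2$ are assumptions for illustration; the demo call uses $\alpha = 0.1$ instead of the default so that the toy problem converges in a modest number of steps.

```python
import math

def adam_minimize(grad_fn, theta, alpha=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Run Adam on a scalar parameter and return the final value."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (mean of squared gradients)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

def grad_J(theta):
    """Gradient of the toy loss J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

theta_adam = adam_minimize(grad_J, theta=0.0, alpha=0.1, steps=1000)
```

Thanks to the bias correction, the very first step has a size of about $\alpha$ regardless of how large the gradient is.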

### Benefits
* Fast convergence.  
* Robust to **sparse gradients**.  
* Often the **default choice** in deep learning.  


## **AdamW (Adam with Weight Decay)**

AdamW is a modification of Adam that **decouples weight decay from the gradient update**, improving regularization and generalization.

### Motivation
In standard Adam, L2 regularization adds the term $\lambda\theta$ **to the gradient**, so the penalty gets rescaled by Adam's adaptive learning rates and parameters with a large gradient history are regularized less than intended. AdamW separates these two concepts:

* Weight decay is applied **directly to the weights**, independent of the gradient moments.
* Leads to more **consistent regularization**, especially in deep networks.

### Update Rule
AdamW modifies the parameter update as follows:

1. Compute Adam updates (momentum + RMSprop):
$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J(\theta_t)
$$
$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta J(\theta_t))^2
$$
$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

2. Apply weight decay **directly** to parameters:
$$
\theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \alpha \lambda \theta_t
$$

Where:  
* $\lambda$ = weight decay coefficient  
* $\alpha, \beta_1, \beta_2, \epsilon$ = same as Adam  
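
A single AdamW step for one scalar parameter might look like the sketch below. The function name `adamw_step` and the default `weight_decay=0.01` are assumptions for this example; the key line is the last term of the update, which shrinks the weight directly instead of going through the moment estimates.

```python
import math

def adamw_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Return updated (theta, m, v) after one AdamW step with gradient g at step t >= 1."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adam_update = alpha * m_hat / (math.sqrt(v_hat) + eps)
    decay = alpha * weight_decay * theta  # decoupled: not scaled by the adaptive term
    return theta - adam_update - decay, m, v

# With a zero gradient, only the weight decay acts on the parameter.
theta_after, m_after, v_after = adamw_step(theta=1.0, g=0.0, m=0.0, v=0.0, t=1)
```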

### Benefits
* Improves **generalization** compared to standard Adam with L2 regularization.  
* Widely used in **large-scale deep learning** (e.g., Transformers).  
* More stable training for very deep networks.  

### Notes
* Often used with **learning rate schedules** like cosine decay or linear warmup.  
* Default hyperparameters are usually similar to Adam: $\alpha=0.001$, $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$.


## **Adam Variants**

* **Adamax**: Uses infinity norm instead of $L_2$ norm, more stable in some cases.  
* **Nadam**: Adam + Nesterov momentum. Can improve convergence speed.  
* **AMSGrad**: Keeps the running maximum of the second-moment estimate so the effective step size never increases, fixing a case where Adam can fail to **converge**.


# **LEARNING RATE SCHEDULES**

A fixed learning rate ($\alpha$) may not be optimal throughout training: different layers and training phases may benefit from different learning rates. **Learning rate schedules** adjust $\alpha$ over time to improve convergence and stability. They help avoid getting stuck in local minima and reduce oscillations as training converges. They are often combined with optimizers like Adam, RMSprop, or SGD.

### Common Strategies

**Step Decay**
   - Reduces the learning rate by a factor every few epochs.
   $$
   \alpha_t = \alpha_0 \cdot \text{decay\_factor}^{\lfloor t / \text{step\_size} \rfloor}
   $$
   - Simple and widely used, especially for CNNs.

**Exponential Decay**
   - Learning rate decreases exponentially over epochs:
   $$
   \alpha_t = \alpha_0 \cdot e^{-k t}
   $$
   - $k$ = decay rate.

**Polynomial Decay**
   - Learning rate decreases following a polynomial:
   $$
   \alpha_t = \alpha_0 \cdot \left(1 - \frac{t}{T}\right)^p
   $$
   - $T$ = total steps, $p$ = power.

**Cosine Annealing**
   - Smoothly decreases the learning rate using a cosine function:
   $$
   \alpha_t = \alpha_{\min} + \frac{1}{2} (\alpha_0 - \alpha_{\min}) \left(1 + \cos\frac{\pi t}{T}\right)
   $$
   - Often used with **warm restarts** (SGDR) to escape local minima.

**Warmup**
   - Starts with a very small learning rate, gradually increasing to $\alpha_0$.
   - Helps prevent instability at the beginning of training, especially in Transformers and very deep networks.

**Cyclical Learning Rate (CLR)**
   - Learning rate oscillates between $\alpha_{\min}$ and $\alpha_{\max}$ periodically.
   - Can improve generalization and escape shallow local minima.
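
Each schedule with a formula above is a one-line function of the step or epoch `t`. The sketch below follows those formulas; the linear form of `linear_warmup` is an assumption, since the notes describe warmup only in words, and all function names are illustrative.

```python
import math

def step_decay(t, alpha0, decay_factor, step_size):
    return alpha0 * decay_factor ** (t // step_size)

def exponential_decay(t, alpha0, k):
    return alpha0 * math.exp(-k * t)

def polynomial_decay(t, alpha0, T, p):
    return alpha0 * (1 - t / T) ** p

def cosine_annealing(t, alpha0, alpha_min, T):
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + math.cos(math.pi * t / T))

def linear_warmup(t, alpha0, warmup_steps):
    """Assumed linear ramp from alpha0 / warmup_steps up to alpha0."""
    return alpha0 * min(1.0, (t + 1) / warmup_steps)

# Learning rate at a few epochs under step decay: halve every 10 epochs.
step_values = [step_decay(t, alpha0=0.1, decay_factor=0.5, step_size=10)
               for t in (0, 10, 25)]
```

In practice a warmup phase is typically followed by one of the decay schedules, for example warmup for the first few hundred steps and cosine annealing afterwards.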


# **WEIGHT INITIALIZATION**

Proper weight initialization is crucial for training deep neural networks, as poor initialization can lead to problems like **vanishing gradients** or **exploding gradients**.

## **Random Initialization**
Weights are initialized randomly, usually from a normal or uniform distribution. However, this can cause:
- **Vanishing gradients** (if weights are too small).
- **Exploding gradients** (if weights are too large).
$$
W \sim \mathcal{N} \left( 0, 1\right)
$$
This approach often fails in deep networks due to unstable gradients.
## **Xavier (Glorot) Initialization**
Designed for **sigmoid** and **tanh** activations, Xavier initialization ensures that the variance of activations remains stable across layers. 
<br>The main idea is to set the initial weights of the network in a way that allows the activations and gradients to flow effectively during both forward and backpropagation. It considers the number of input and output units of each layer to determine the scale of the random initialization: the more units a layer connects, the smaller the variance of its weights.
$$
W \sim \mathcal{N} \left( 0, \frac{2}{n_{in} + n_{out}} \right) \quad \text{or} \quad W \sim U \left( -\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}} \right)
$$
- **$n_{in}$** = number of input neurons.
- **$n_{out}$** = number of output neurons.


## **He Initialization**
Designed for **ReLU and Leaky ReLU** activations. Since ReLU zeroes out negative values (roughly half of the pre-activations), He initialization doubles the weight variance compared with $\frac{1}{n_{in}}$ to keep the signal from shrinking layer by layer.
$$
W \sim \mathcal{N} \left( 0, \frac{2}{n_{in}} \right) \quad \text{or} \quad W \sim U \left( -\sqrt{\frac{6}{n_{in}}}, \sqrt{\frac{6}{n_{in}}} \right)
$$
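
The initializers can be sketched with the standard `random` module. The normal Xavier variant below uses variance $\frac{2}{n_{in} + n_{out}}$, the value that matches the uniform bound $\sqrt{6 / (n_{in} + n_{out})}$ (a uniform distribution on $[-a, a]$ has variance $a^2 / 3$). Function names and layer sizes are illustrative.

```python
import math
import random

def xavier_normal(n_in, n_out, rng):
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def xavier_uniform(n_in, n_out, rng):
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]

def he_normal(n_in, n_out, rng):
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def variance(matrix):
    """Sample variance of all entries, used here only to check the scale."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
W_xavier = xavier_normal(200, 200, rng)  # expected variance 2 / 400 = 0.005
W_he = he_normal(200, 200, rng)          # expected variance 2 / 200 = 0.01
```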

# **BATCH NORMALIZATION (BN)**
Batch Normalization (BN) is a technique used to improve the training of deep neural networks by **normalizing the inputs** to each layer. It helps address issues like vanishing/exploding gradients, which can slow down training or make the network hard to optimize. It also allows the network to train faster and potentially improve generalization.
<br> BN is typically applied after the linear transformation of the layer and before the activation function (e.g., ReLU). This is because we want the activations to have normalized distributions.

## **How Batch Normalization Works**
In deep networks, the distribution of each layer's inputs can change during training, causing instability and slow learning. This is known as **internal covariate shift**. Batch Normalization addresses this by normalizing the activations of each layer so they have a mean of 0 and a standard deviation of 1.
1. **Normalize** the output of a layer (before applying the activation function) by subtracting the mean and dividing by the standard deviation of the batch. This ensures the activations are centered around zero and have a unit variance.
$$
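\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
$$
Where:
* $\mu_B$ is the mean of the batch.
* $\sigma_B^2$ is the variance of the batch (so $\sigma_B$ is its standard deviation).
* $\epsilon$ is a small constant that prevents division by zero.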

2. **Scale and shift** the normalized values using two learnable parameters, $\gamma$ (scale) and $\beta$ (shift), which allow the model to undo normalization if necessary. This helps retain the network's capacity to model complex relationships.
$$
y = \gamma \cdot \hat{x} + \beta
$$
Where:
* $\hat{x}$ is the normalized value.
* $\gamma$ and $\beta$ are the learnable parameters that allow the model to scale and shift the normalized output.
* $y$ is the final output after applying the scaling and shifting.

3. The **batch statistics** (mean and variance) are computed for each mini-batch during training, but during inference, fixed values are used (the running averages of the batch mean and variance from training).
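
The three steps can be sketched for a single feature across one mini-batch. The small constant `eps` added under the square root is a standard safeguard against division by zero (assumed here), and the function names are illustrative. A real layer would also keep running averages of the batch statistics during training; here they are simply passed in at inference.

```python
import math

def batch_norm_train(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    y = [gamma * xh + beta for xh in x_hat]
    return y, mu, var

def batch_norm_infer(x, running_mean, running_var, gamma=1.0, beta=0.0, eps=1e-5):
    """At inference, use fixed statistics collected during training."""
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta

batch = [1.0, 2.0, 3.0, 4.0, 5.0]
y, mu, var = batch_norm_train(batch)
```

With `gamma=1` and `beta=0` the output has mean 0 and variance close to 1; other values of the two learnable parameters rescale and recenter it.
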
## **Benefits of Batch Normalization**
1. **Stabilizes Training:** By normalizing the inputs to each layer, BN reduces the effect of outliers and helps the model converge faster.
2. **Faster Convergence:** BN allows for higher learning rates since the network is less likely to diverge during training.
3. **Reduced Sensitivity to Initialization:** BN makes training less sensitive to weight initialization, which mitigates the effect of a poor initial choice.
4. **Improved Regularization:** It introduces some noise into the learning process, acting as a form of regularization. This can help reduce overfitting, though it is not a replacement for other regularization methods like dropout.

# **DROPOUT**
**Dropout** is a regularization technique used in training deep neural networks to prevent overfitting and improve generalization. It works by randomly "dropping out" or deactivating a fraction of neurons during each forward and backward pass in training.

**How Dropout Works**
1. **Randomly Drop Neurons:** During training, dropout randomly sets the output of each neuron to zero with a given probability, called the **dropout rate** and denoted $p$. For example, if $p=0.2$, then on average 20% of the neurons in the layer are dropped (set to zero) at each training step. A higher dropout rate might be used for very large networks or more complex tasks.
2. **Scaling Neuron Outputs:** The remaining neurons that are not dropped out are scaled by $\frac{1}{1-p}$. This is done to maintain the overall scale of the activations, ensuring that the expected value of the outputs during training remains the same as during inference (when no neurons are dropped out). This variant is known as **inverted dropout**.
3. **Training vs Inference:** During training, dropout is applied to the network, but during inference (when making predictions), no neurons are dropped out and all neurons are used. With the training-time scaling from step 2, no extra scaling is needed at inference. The alternative formulation skips the scaling during training and instead multiplies the outputs by $1-p$ at inference; only one of the two scalings is applied, never both.

## **Dropout in Practice**
* **Training Phase:** Randomly deactivate a fraction of neurons.
* **Inference Phase:** Use all neurons. With inverted dropout (scaling by $\frac{1}{1-p}$ during training), no extra scaling is needed.
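
A minimal sketch of the inverted-dropout variant, where the surviving activations are scaled by $\frac{1}{1-p}$ during training so that inference needs no scaling at all. The function name `dropout` and the explicit random generator argument are choices made for this example.

```python
import random

def dropout(activations, p, training, rng):
    """Inverted dropout: zero each activation with probability p while training."""
    if not training or p == 0.0:
        return list(activations)  # inference: all neurons used, no scaling
    keep = 1.0 - p
    # Survivors are scaled by 1 / (1 - p) so the expected output is unchanged.
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)
activations = [1.0] * 10000

train_out = dropout(activations, p=0.2, training=True, rng=rng)
infer_out = dropout(activations, p=0.2, training=False, rng=rng)
```

About 20% of `train_out` is zero and the rest is 1.25, so its average stays close to 1, matching the untouched `infer_out`.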
  
## **Why Dropout Helps**
* **Prevents Overfitting:** Dropout forces the network to not rely too heavily on any one neuron or connection, as different random subsets of neurons are activated during each training step. This prevents the model from becoming too specialized to the training data, improving its ability to generalize to new, unseen data.
* **Reduces Co-adaptation of Neurons:** In the absence of dropout, neurons may co-adapt, meaning they might rely on each other too much. Dropout ensures that each neuron must independently contribute to learning, which leads to a more robust model.
* **Improves Model Robustness:** By forcing the model to make predictions without certain features during training, it forces the network to learn more robust features and not memorize specific patterns in the training data.

# **REGULARIZATION**

Regularization is a set of techniques used to **prevent overfitting** and improve the ability of a neural network to generalize on unseen data. While Dropout and Batch Normalization are common methods, there are additional strategies that can help your model perform better.


## **L1 and L2 Regularization**

L1 and L2 are two common forms of regularization that **penalize large weights** in the network. By adding a penalty term to the loss function, the network is encouraged to keep weights small and simple, reducing overfitting.

### **L2 Regularization (Ridge)**
L2 regularization adds the **sum of squared weights** to the loss function:
$$
L_{new} = L_{original} + \lambda \sum_{i} W_i^2
$$
* $L_{original}$ = original loss (e.g., MSE or Cross-Entropy)  
* $\lambda$ = regularization strength (hyperparameter)  
* $W_i$ = weights of the network  

**Effect:**  
* Penalizes large weights more strongly.  
* Encourages weights to be small but **rarely exactly zero**.  

### **L1 Regularization (Lasso)**
L1 regularization adds the **sum of absolute weights** to the loss function:
$$
L_{new} = L_{original} + \lambda \sum_{i} |W_i|
$$

**Effect:**  
* Encourages sparsity in weights (some weights become exactly zero).  
* Can be used for **feature selection**, removing irrelevant inputs automatically.  

**Practical Tip:**  
- Combine L1 and L2 in what’s called **Elastic Net** to get benefits of both.
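
Both penalties are simple sums over the weights, as the sketch below shows. The helper names and the example weights are hypothetical; `lam` stands for $\lambda$.

```python
def l2_penalty(weights, lam):
    """lambda times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    """lambda times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def regularized_loss(original_loss, weights, lam, kind="l2"):
    penalty = l2_penalty(weights, lam) if kind == "l2" else l1_penalty(weights, lam)
    return original_loss + penalty

weights = [1.0, -2.0, 3.0]
loss_l2 = regularized_loss(0.5, weights, lam=0.1, kind="l2")  # 0.5 + 0.1 * 14
loss_l1 = regularized_loss(0.5, weights, lam=0.1, kind="l1")  # 0.5 + 0.1 * 6
```

The L2 penalty grows quadratically, so the largest weight (3.0) contributes 9 of the 14 units, while under L1 every unit of weight magnitude costs the same.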


## **Early Stopping**

Early stopping is a **training strategy** where the network **stops training when performance on a validation set stops improving**.  

### How It Works:
1. Split your data into **training** and **validation** sets.  
2. Train the model and monitor **validation loss** at each epoch.  
3. Stop training when the validation loss **does not decrease** for a set number of epochs (patience).  

**Benefits:**  
* Prevents overfitting by not letting the network memorize the training data.  
* Often works well with high-capacity networks.  

**Example in practice:**  
If after 10 epochs your validation loss hasn’t improved, stop training and use the model from the epoch with the **lowest validation loss**.
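
The patience logic can be sketched independently of any training framework by running it over a list of recorded validation losses. The function name and the loss values below are made up for illustration; in a real loop the model weights from the best epoch would be saved and restored.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch) as 0-based indices into val_losses."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch  # stop: no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1  # ran out of epochs without stopping early

# Hypothetical validation losses, one per epoch.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]

best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
```

With a patience of 3, training stops at epoch 5 and keeps the model from epoch 2. A larger patience would have reached the better loss at epoch 6, which is the trade-off the patience setting controls.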


## **Data Augmentation**

Data augmentation artificially **increases the size of your dataset** by applying random transformations to existing data. This helps the model generalize better by **seeing more diverse examples**.  

### Common Techniques for Images:
* Rotation  
* Flipping (horizontal/vertical)  
* Zooming or scaling  
* Color jitter  
* Random cropping  

### Common Techniques for Text:
* Synonym replacement  
* Random insertion or deletion of words  
* Back-translation (translate to another language and back)  

**Effect:**  
* Reduces overfitting by making the model **less dependent on specific features** in the training data.  
* Especially useful when training data is limited.


## **Noise Injection**

Adding small amounts of **noise** to inputs, weights, or activations during training can act as a regularizer.  

### How it works:
* Add Gaussian noise to input data or hidden layer activations.  
* Slightly perturb the weights during training.  

**Effect:**  
* Forces the network to be robust to small variations in the data.  
* Helps generalization, similar to Dropout, but can be applied in different ways.
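
Input noise is the simplest variant to sketch. The helper below is a hypothetical example that adds zero-mean Gaussian noise to a list of input values; the standard deviation `std` is an assumed hyperparameter you would have to tune.

```python
import random


def add_gaussian_noise(values, std=0.1, rng=None):
    """Return a copy of `values` with zero-mean Gaussian noise added to each entry."""
    if rng is None:
        rng = random.Random()
    return [v + rng.gauss(0.0, std) for v in values]


rng = random.Random(42)  # fixed seed only so the sketch is reproducible
inputs = [0.2, 0.5, 0.9]
noisy_inputs = add_gaussian_noise(inputs, std=0.05, rng=rng)
print(noisy_inputs)
```

The noise is applied only while training; at evaluation time the clean inputs are used.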


## **Combining Techniques**

In practice, the most powerful networks **combine multiple regularization methods**:

| Technique | Best Use Case |
|-----------|---------------|
| Dropout | Deep networks, CNNs, RNNs |
| Batch Normalization | Stabilizing training, allowing higher learning rates |
| L1 / L2 | Any neural network, especially fully connected layers |
| Early Stopping | When training for many epochs |
| Data Augmentation | When dataset is small or overfitting is strong |
| Noise Injection | Adds robustness in sensitive tasks |

> **Rule of thumb:** Start simple (Dropout + L2), then add data augmentation or early stopping as needed.
 


# **OVERFITTING AND UNDERFITTING**
**Overfitting** and **underfitting** are two common problems in machine learning and statistical modeling, both of which can negatively affect the model’s ability to generalize to new, unseen data.

## **Overfitting**
**Overfitting** occurs when a model learns not only the underlying patterns in the training data but also the noise or random fluctuations. As a result, the model performs very well on the training set but fails to generalize to new data.
* **High training accuracy, low test accuracy:** The model memorizes the training data, so it achieves very low error on the training set but performs poorly on the validation or test set.
* **Model is too complex:** The model has too many parameters relative to the amount of data available. This can happen if the model is too flexible or has too many layers, features, or nodes (in the case of neural networks).
* **Noise fitting:** The model may start fitting noise or irrelevant patterns in the training data that don't generalize to unseen data.

**Causes of Overfitting**
* **Too many features:** When there are more features than necessary, the model may find patterns that don't exist in the real world.
* **Too complex a model:** High-capacity models, such as deep neural networks with many layers or high-degree polynomial regression, are more prone to overfitting, especially when the dataset is small or noisy.
* **Insufficient training data:** When there's not enough data, the model has fewer examples to learn from and may start to memorize rather than generalize.
  
**How to Combat Overfitting**
* **Cross-validation:** Use techniques like k-fold cross-validation to evaluate the model on several different train/validation splits. This gives a more reliable estimate of how well it generalizes and makes overfitting easier to detect.
* **Simplify the model:** Reduce the model’s complexity, such as using fewer features, fewer layers, or regularization techniques.
* **Regularization:** L1 (Lasso) and L2 (Ridge) regularization penalize overly large weights, which encourages simpler models. Dropout (in deep learning) works differently: it randomly disables neurons during training so the network cannot rely on any single one.
* **Early stopping:** Stop training before the model starts to memorize the data. For neural networks, this can be done by monitoring the validation loss and halting training when it starts to increase.
* **More data:** Collecting more data can help the model better generalize and avoid overfitting to noise.
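
As a sketch of the k-fold idea mentioned above, the generator below splits sample indices into k validation folds and uses the remaining indices for training. The name `k_fold_indices` is hypothetical, and the data is not shuffled here (in practice you would usually shuffle first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

You would train one model per fold and average the validation scores; a large gap between training and validation scores across folds is a sign of overfitting.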

## **Underfitting**
**Underfitting** occurs when the model is too simple to capture the underlying patterns in the data. The model does not perform well on either the training data or the test data because it has not learned the relevant features or relationships.
* **High bias:** The model makes strong assumptions that are too simplistic, resulting in poor performance on both the training and test sets.
* **Poor training accuracy:** The model may have high error on the training data itself, indicating that it has not learned the patterns in the data properly.
* **Model is too simple:** The model lacks sufficient complexity to capture the underlying structure in the data. For example, using a linear regression model for data that follows a non-linear pattern can result in underfitting.

**Causes of Underfitting**
* **Too simple a model:** If the model is too basic (e.g., using a linear model for non-linear data), it will not have enough capacity to learn the complexities of the data.
* **Not enough features:** The model may not have enough relevant features to represent the data well.
* **Insufficient training:** If the model hasn’t been trained long enough, or the learning rate is poorly chosen (too low to make progress in the available epochs, or too high to converge), it may not capture the patterns in the data.

**How to Combat Underfitting**
* **Increase model complexity:** Use more sophisticated models or algorithms that have the capacity to capture more complex patterns, such as switching from linear regression to polynomial regression, or using more layers in a neural network.
* **Use more features:** Include more relevant features to give the model more information to work with.
* **Improve training:** Train the model for a longer period or adjust the learning rate to help the model converge to a better solution.

## **Finding the Balance**
The goal in machine learning is to find a model that strikes a balance between overfitting and underfitting. This balance is often referred to as the **bias-variance trade-off**:
* **High bias (underfitting):** The model is too simple and cannot capture the underlying patterns, leading to poor performance on both training and test data.
* **High variance (overfitting):** The model is too complex and learns the noise in the training data, leading to great performance on the training set but poor generalization to new data.

<br>The optimal model will have:
* **Low bias:** It captures the underlying patterns in the data.
* **Low variance:** It generalizes well to unseen data, not overfitting to the noise.

To achieve the best performance, you want to find a model with **just the right complexity**, complex enough to capture the underlying patterns, but simple enough to avoid memorizing the data.

# **EVALUATION METRICS**

Evaluation metrics are tools used to **measure the performance** of a machine learning model. Choosing the right metric is crucial, because it tells you whether your model is actually doing what you want.

Metrics vary depending on the **type of problem**: regression, classification, or ranking tasks.


## **Metrics for Regression**

Regression models predict **continuous values**, so metrics focus on measuring the **difference between predicted and actual values**.

### **Mean Squared Error (MSE)**
$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
* Measures **average squared difference** between predictions and actual values.  
* Penalizes **large errors more than small errors**.  
* Good for emphasizing big mistakes.

### **Root Mean Squared Error (RMSE)**
$$
RMSE = \sqrt{MSE}
$$
* Expressed in the **same units as the target**, making it easier to interpret.  

### **Mean Absolute Error (MAE)**
$$
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$
* Measures the **average absolute difference** between predictions and actual values.  
* Less sensitive to **outliers** than MSE.
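
The three error formulas above translate almost directly into code. This is a minimal sketch on plain Python lists; the sample values are made up so that one prediction is far off and the different sensitivity to large errors becomes visible.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 3.0]  # the last prediction is off by 4

print(mse(y_true, y_pred))   # 5.25
print(rmse(y_true, y_pred))  # about 2.29
print(mae(y_true, y_pred))   # 1.75
```

The single error of 4 contributes 16 of the 21 total squared error, but only 4 of the 7 total absolute error, which is why MSE and RMSE react more strongly to outliers than MAE.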

### **R-Squared ($R^2$)**
$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$
* Measures how well the model explains the **variance of the target**.  
* $R^2 = 1$ → perfect prediction, $R^2 = 0$ → model predicts no better than always predicting the mean $\bar{y}$. $R^2$ can be negative when the model does worse than that.  
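
A short sketch of the $R^2$ formula, with three made-up prediction lists to show the reference points. The function assumes the targets are not all identical, otherwise the denominator is zero.

```python
def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot  # assumes ss_tot > 0


y_true = [1.0, 2.0, 3.0, 4.0]

print(r_squared(y_true, [1.0, 2.0, 3.0, 4.0]))  # 1.0, perfect prediction
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0, same as predicting the mean
print(r_squared(y_true, [1.5, 2.5, 2.5, 3.5]))  # 0.8
```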


## **Metrics for Classification**

Classification models predict **discrete categories**, so metrics focus on **how often predictions match the true labels**.

### **Confusion Matrix**
A table that summarizes **true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN)**.  

| Actual \ Predicted | Positive | Negative |
|------------------|---------|---------|
| Positive          | TP      | FN      |
| Negative          | FP      | TN      |

**Useful for calculating other metrics.**
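
Counting the four cells for a binary problem can be sketched as follows. The labels are made up, and `positive=1` is an assumption about how the positive class is encoded.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1   # predicted positive, actually positive
        elif pred == positive:
            fp += 1   # predicted positive, actually negative
        elif truth == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```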


### **Accuracy**
$$
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
$$
* Percentage of **correct predictions**.  
* Works well for **balanced datasets**.  
* Misleading if classes are **imbalanced**.

### **Precision**
$$
Precision = \frac{TP}{TP + FP}
$$
* Measures **how many predicted positives are actually positive**.  
* Important when **false positives are costly** (e.g., spam detection).

### **Recall (Sensitivity)**
$$
Recall = \frac{TP}{TP + FN}
$$
* Measures **how many actual positives were correctly predicted**.  
* Important when **missing a positive is costly** (e.g., disease detection).

### **F1 Score**
$$
F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}
$$
* Harmonic mean of precision and recall.  
* Good for **imbalanced datasets**.  
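
The four formulas above can be computed directly from the confusion matrix counts. In this sketch the counts are invented to mimic an imbalanced dataset (15 positives out of 1000 predictions would be rare), and returning 0.0 when a denominator is zero is a convention assumed here, not a universal rule.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion matrix counts."""
    def safe_div(numerator, denominator):
        # Assumed convention: an undefined ratio is reported as 0.0
        return numerator / denominator if denominator else 0.0

    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Imbalanced example: only 10 of 1000 samples are actually positive
metrics = classification_metrics(tp=5, tn=980, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy comes out at 0.985 even though the model finds only half of the positives (recall 0.5) and two thirds of its positive predictions are wrong (precision about 0.333). The F1 score of 0.4 reflects that much better than accuracy does.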


### **ROC Curve & AUC**
* **ROC Curve:** Plots **True Positive Rate (Recall)** vs **False Positive Rate (FPR)** at different thresholds.  
* **AUC (Area Under the Curve):** Measures **overall ability to distinguish classes**.  
  * AUC = 1 → perfect classifier  
  * AUC = 0.5 → random guessing  
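
A rough sketch of both ideas. `tpr_fpr_at_threshold` gives one point of the ROC curve. `roc_auc` does not integrate the curve; it uses the equivalent ranking view of AUC, namely the probability that a randomly chosen positive gets a higher score than a randomly chosen negative (ties count as half). Both function names are hypothetical, and the code assumes labels are 0/1 with at least one sample of each class.

```python
def tpr_fpr_at_threshold(y_true, scores, threshold):
    """One ROC point: predict positive when score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, fp / n_neg  # (TPR, FPR)


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(tpr_fpr_at_threshold(y_true, scores, threshold=0.4))  # (0.5, 0.5)
print(roc_auc(y_true, scores))                              # 0.75
```

The pairwise loop is quadratic in the number of samples, so it is only meant to show the idea on small examples.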


## **Metrics for Multi-Class Classification**

For multi-class tasks (more than 2 classes), metrics are extended:

* **Macro Average:** Compute metric per class, then average (treats all classes equally).  
* **Weighted Average:** Compute metric per class, then average weighted by class frequency.  
* **Confusion Matrix:** Expanded to **n x n** table for n classes.  
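
The difference between macro and weighted averaging is easiest to see on a small imbalanced example. The sketch below uses per-class recall as the metric; the labels and helper names are made up for illustration.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return ({label: recall}, {label: number of true samples})."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {label: correct[label] / support[label] for label in support}
    return recalls, support


def macro_average(per_class):
    """Plain mean over classes: every class counts the same."""
    return sum(per_class.values()) / len(per_class)


def weighted_average(per_class, support):
    """Mean over classes weighted by how many true samples each class has."""
    total = sum(support.values())
    return sum(per_class[label] * support[label] for label in per_class) / total


y_true = ["cat"] * 6 + ["dog"] * 3 + ["bird"]
y_pred = ["cat"] * 5 + ["dog"] + ["dog", "dog", "cat"] + ["cat"]

recalls, support = per_class_recall(y_true, y_pred)
print(recalls)                              # cat 5/6, dog 2/3, bird 0
print(macro_average(recalls))               # about 0.5
print(weighted_average(recalls, support))   # about 0.7
```

The rare class "bird" is never predicted correctly. That pulls the macro average down to about 0.5, while the weighted average stays near 0.7 because "bird" has only one sample.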


## **Metrics for Ranking & Probabilistic Models**

Some models output **scores or probabilities** rather than exact classes. Metrics include:

* **Log Loss (Cross-Entropy Loss):**
$$
L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log(\hat{y}_{ij})
$$
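* Measures **difference between predicted probabilities and true labels**, where $y_{ij}$ is 1 if sample $i$ belongs to class $j$ (0 otherwise) and $\hat{y}_{ij}$ is the predicted probability of class $j$ for sample $i$.  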
* Lower is better.  
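
A minimal sketch of the log loss formula for one-hot labels. Clipping the probabilities with a small `eps` is an assumption added here to avoid `log(0)`; it is not part of the formula itself.

```python
import math


def log_loss(y_true_onehot, y_prob, eps=1e-15):
    """Average cross-entropy between one-hot labels and predicted probabilities."""
    total = 0.0
    for true_row, prob_row in zip(y_true_onehot, y_prob):
        for y, p in zip(true_row, prob_row):
            p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
            total += y * math.log(p)
    return -total / len(y_true_onehot)


y_true_onehot = [[1, 0, 0],
                 [0, 1, 0]]
confident = [[0.7, 0.2, 0.1],
             [0.1, 0.8, 0.1]]
unsure = [[0.4, 0.3, 0.3],
          [0.3, 0.4, 0.3]]

print(log_loss(y_true_onehot, confident))  # about 0.29
print(log_loss(y_true_onehot, unsure))     # about 0.92
```

Both prediction sets pick the correct class, but the less confident one gets a higher loss, since only the probability assigned to the true class enters the sum.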

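* **Precision@k / Recall@k:** Used in **recommendation systems** to measure how many of the top k ranked predictions are relevant (precision) and how many of the relevant items appear in the top k (recall).  
* **Mean Average Precision (MAP):** For each query, average the precision at every rank where a relevant item appears. MAP is the mean of these values across all queries or samples.  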
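
The ranking metrics can be sketched on lists of relevance flags, where each list is one query's results in ranked order (1 = relevant, 0 = not relevant). This version of average precision assumes every relevant item appears somewhere in the ranked list; other definitions divide by the total number of relevant items instead.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(ranked_relevance[:k]) / k


def average_precision(ranked_relevance):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(queries):
    """Average precision, averaged over all queries."""
    return sum(average_precision(q) for q in queries) / len(queries)


query_1 = [1, 0, 1, 0]  # relevant items at ranks 1 and 3
query_2 = [0, 1, 0, 0]  # relevant item at rank 2

print(precision_at_k(query_1, k=2))                # 0.5
print(average_precision(query_1))                  # (1/1 + 2/3) / 2, about 0.833
print(average_precision(query_2))                  # 0.5
print(mean_average_precision([query_1, query_2]))  # about 0.667
```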

# Useful resources

### **YouTube**
*3Blue1Brown*: Neural networks Course 
www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi

*Normalized Nerd*: Why do we need Cross Entropy Loss? (Visualized)
www.youtube.com/watch?v=gIx974WtVb4

*StatQuest with Josh Starmer*: Gradient Descent, Step-by-Step
www.youtube.com/watch?v=sDv4f4s2SB8

*Emergent Garden*: Watching Neural Networks Learn
www.youtube.com/watch?v=TkwXa7Cvfr8

*Emergent Garden*: Why Neural Networks can learn (almost) anything
www.youtube.com/watch?v=0QczhVg5HaI

*Artem Kirsanov*: The Most Important Algorithm in Machine Learning
www.youtube.com/watch?v=SmZmBKc7Lrs

*Artem Kirsanov*: What Textbooks Don't Tell You About Curve Fitting
www.youtube.com/watch?v=q7seckj1hwM

*Artem Kirsanov*: The Key Equation Behind Probability
www.youtube.com/watch?v=KHVR587oW8I

*Samson Zhang*: Building a neural network FROM SCRATCH
www.youtube.com/watch?v=w8yWXqWQYmU

*Layerwise Lectures*: Hopfield network: How are memories stored in neural networks? 
www.youtube.com/watch?v=piF6D6CQxUw