---
## Dropout
---
**Why do we need Dropout?**
- Dropout is a regularization technique that prevents overfitting in deep neural networks (DNNs).
- It helps the model generalize better by introducing randomness during training, which makes the model more robust.
--- 
**What is Overfitting?** 
![image.png](attachment:84ce5afb-a29d-4973-9886-417add1d13d2.png)

- The image above contrasts a well-fitted model with an overfitted one.
- DNNs contain multiple non-linear hidden layers, which makes them very expressive models that can learn very complicated relationships between inputs and outputs. However, many of these complicated relationships are the result of sampling noise that may not exist in the test data, and this ultimately leads to overfitting.
- Overfitting occurs when a machine learning model learns the noise in the training data instead of capturing the underlying pattern. The result is poor generalization: the model performs well on training data but fails on unseen data.
- In the left graph, the model captures the general trend in the data and maintains a balance between bias and variance. This results in good generalization.
- In the right graph (overfitting), the model fits the data points too closely and captures even small fluctuations (noise). This leads to poor performance on test data.
- Overfitting can be reduced with regularization methods. In our case, that method is dropout.
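
To make the gap between training and test error concrete, here is a minimal sketch that uses polynomial curve fitting as a small stand-in for a neural network. The sine target, the noise level, the sample sizes, and the two degrees are all assumptions chosen for illustration. The higher-degree fit always reaches a training error at least as low as the lower-degree fit, and on noisy data like this it typically does worse on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice


def make_data(n):
    """Noisy samples of a sine curve (an assumed toy target)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n)
    return x, y


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


x_train, y_train = make_data(12)    # small training set
x_test, y_test = make_data(200)     # held-out data from the same process

results = {}
for degree in (3, 9):               # modest model vs. very flexible model
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```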
---
**Dropout Explained** 
![image.png](attachment:5161e6ff-031f-4fb5-aefa-6b9000c3216e.png)

- We prevent overfitting using Dropout by randomly deactivating a fraction of neurons during training.
- This forces the network to develop more robust features rather than relying on specific neurons.
- As the image above shows:
  - (a) Standard Neural Net: A fully connected network with two hidden layers. Every neuron is connected to all neurons in the next layer.
  - (b) After Applying Dropout: Some neurons (marked with "X") are randomly dropped, meaning they do not participate in forward or backward propagation during that training step.
  - By dropping neurons, the model prevents neurons from becoming overly dependent on each other (co-adaptation) and learns redundant, more generalized patterns.
---
**How does Dropout work?**
- It operates by randomly setting a fraction of neurons' activations to zero during each forward pass in training.
1. **During Training**
   - For each neuron, set its output to zero with probability $p$ (dropout rate).
   - Scale the remaining activations by $1 / (1 - p)$ to maintain the expected output magnitude (this variant is commonly called inverted dropout).
2. **During Testing/Validation**
   - Dropout is turned off, so **all neurons are active**.
   - No scaling is applied, because the $1 / (1 - p)$ factor used during training already keeps the expected activations the same in both modes.
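
The two modes above can be sketched in a few lines of NumPy. This is a minimal illustration, not a framework implementation: the function name `dropout`, its `training` flag, and the all-ones test input are hypothetical choices made for this example.

```python
import numpy as np


def dropout(h, p, training, rng):
    """Inverted dropout (hypothetical helper for illustration).

    Training: zero each activation with probability p and scale the
    survivors by 1 / (1 - p). Evaluation: return the input unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("dropout rate p must satisfy 0 <= p < 1")
    if not training or p == 0.0:
        return h
    keep_prob = 1.0 - p
    mask = rng.random(h.shape) < keep_prob   # True with probability 1 - p
    return h * mask / keep_prob


rng = np.random.default_rng(0)
h = np.ones(10_000)      # assumed toy activations, all equal to 1
p = 0.5

h_train = dropout(h, p, training=True, rng=rng)
h_eval = dropout(h, p, training=False, rng=rng)

print("fraction zeroed in training:", np.mean(h_train == 0.0))
print("mean activation in training:", h_train.mean())
print("mean activation in evaluation:", h_eval.mean())
```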
---
**Mathematical Formulation** 
Let:

- $h^{(l)}$ be the activations of layer $l$
- $p$ be the dropout rate (e.g., $p = 0.5$ means 50% of neurons are dropped)
- $M^{(l)}$ be a mask vector whose elements are sampled independently from $\text{Bernoulli}(1 - p)$
---
Forward Propagation with Dropout:
- $$M^{(l)} \sim \text{Bernoulli}(1 - p)$$
- $$\tilde{h}^{(l)} = M^{(l)} \odot h^{(l)}$$
- $$h^{(l+1)} = f(W^{(l)} \tilde{h}^{(l)} + b^{(l)})$$

Where:

- $\odot$ represents element-wise multiplication
- $W^{(l)}$ and $b^{(l)}$ are the weights and biases
- $f(\cdot)$ is the activation function (e.g., ReLU, Sigmoid, etc.)
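
The three forward-propagation equations map directly onto array operations. The sketch below assumes a layer with 4 inputs and 3 outputs, ReLU as $f$, and random values for $h^{(l)}$ and $W^{(l)}$; all of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                    # dropout rate

h = rng.normal(size=4)                     # h^(l): activations of layer l
W = rng.normal(size=(3, 4))                # W^(l): weights (assumed shape)
b = np.zeros(3)                            # b^(l): biases


def relu(z):
    """Activation function f, here ReLU."""
    return np.maximum(z, 0.0)


M = rng.binomial(1, 1 - p, size=h.shape)   # M^(l) ~ Bernoulli(1 - p)
h_tilde = M * h                            # element-wise masking
h_next = relu(W @ h_tilde + b)             # h^(l+1)

print("mask:      ", M)
print("masked h:  ", h_tilde)
print("next layer:", h_next)
```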

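Inference Scaling:
- The forward pass above applies the mask without rescaling, so each activation is kept with probability $1 - p$. At inference no mask is sampled, and the full activations are scaled by $1 - p$ to match their expected value during training:
- $$h^{(l)}_{\text{test}} = (1 - p)\, h^{(l)}$$
- With the $1 / (1 - p)$ training-time scaling described earlier, this factor is already accounted for and inference uses $h^{(l)}$ unchanged.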
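
A quick numerical check shows why a factor of $1 - p$ is the right inference-time correction when the training mask is applied without rescaling. The sketch averages the pre-activation $W^{(l)} (M^{(l)} \odot h^{(l)}) + b^{(l)}$ over many sampled masks and compares it with the pre-activation computed from $(1 - p)\, h^{(l)}$. The layer sizes, random values, and number of masks are assumptions; the match is exact in expectation only before the non-linearity $f$ is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                    # dropout rate

h = rng.normal(size=4)                     # assumed activations h^(l)
W = rng.normal(size=(3, 4))                # assumed weights W^(l)
b = rng.normal(size=3)                     # assumed biases b^(l)

n_masks = 100_000                          # number of sampled masks
M = rng.binomial(1, 1 - p, size=(n_masks, h.size))

# Training-style pre-activations, one row per sampled mask, then averaged.
z_train_mean = ((M * h) @ W.T + b).mean(axis=0)

# Inference-style pre-activation: no mask, activations scaled by (1 - p).
z_test = W @ ((1 - p) * h) + b

print("mean over masks:  ", z_train_mean)
print("scaled inference: ", z_test)
print("max abs difference:", np.abs(z_train_mean - z_test).max())
```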
---

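**Conclusion**: Dropout reduces overfitting by randomly deactivating neurons during training, which keeps the network from relying on any specific neuron and improves generalization to unseen data.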