## Neural Networks From Scratch - Lec 9 - ReLU Activation Function

The video explains the **ReLU (Rectified Linear Unit) activation function**, its properties, advantages, and drawbacks in neural networks.

### **Key Points:**

### **What is ReLU?**
![ReLU activation function](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSzWrwwaceB2GuWOReRTEbLXBIWqD1xr3STkQ&s)
ReLU is a simple yet powerful activation function used in neural networks. It is defined as:

$$
 f(x) = \max(0, x) 
$$

- Returns **0** for negative inputs and passes positive inputs unchanged.
- It is a **piecewise linear function**: each piece (0 for negative inputs, $x$ for positive inputs) is linear, but the kink at $x = 0$ makes the function as a whole non-linear.

### **Why Use ReLU?**
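- Unlike sigmoid and tanh, whose derivatives shrink toward 0 as inputs saturate, ReLU has a gradient of exactly 1 for positive inputs, which greatly reduces the **vanishing gradient problem**.
- This typically enables **faster training** in deep networks and helps the model learn complex patterns.
- **Computationally efficient** (a single comparison with 0, no exponentials) and widely used in deep learning.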
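
To make the gradient argument concrete, here is a minimal sketch that compares the product of local derivatives across a stack of layers. The depth and the pre-activation value are hypothetical, and the weights are ignored entirely, so this illustrates the trend only and does not simulate real training.

```python
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)          # never larger than 0.25

def relu_derivative(x):
    return np.where(x > 0, 1.0, 0.0)

depth = 10   # hypothetical number of stacked layers
x = 0.5      # hypothetical positive pre-activation, reused at every layer

# Backpropagation multiplies one local derivative per layer
# (weight matrices are left out of this simplified picture).
sigmoid_chain = float(sigmoid_derivative(x)) ** depth
relu_chain = float(relu_derivative(x)) ** depth

print(f"sigmoid chain: {sigmoid_chain:.2e}")
print(f"relu chain:    {relu_chain:.2e}")
```

Because the sigmoid derivative is at most 0.25, its ten-layer product falls below one millionth, while the ReLU product stays at 1 as long as the pre-activations remain positive.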

### **Limitations of ReLU:**
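- **Dying Neurons:** If a neuron's weights and bias make its pre-activation negative for every input, it always outputs 0, its gradient is also 0, and it stops updating, so it can become permanently inactive.
- **Exploding Gradient:** ReLU is unbounded for positive inputs, so with high learning rates activations and gradients can grow too large, leading to NaN errors.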
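
The dying-neuron case can be reproduced with a single neuron. This is a sketch under stated assumptions: the batch, the weights, and the large negative bias are all hypothetical values chosen to force the dead state.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # hypothetical batch: 100 samples, 3 features

w = np.array([0.1, -0.2, 0.05])     # hypothetical weights of one neuron
b = -10.0                           # large negative bias, chosen to kill the neuron

z = X @ w + b                       # pre-activations, all far below zero
a = np.maximum(0, z)                # ReLU outputs
local_grad = np.where(z > 0, 1.0, 0.0)

# Gradient w.r.t. the weights, summed over the batch,
# assuming an upstream gradient of 1 for every sample.
grad_w = (local_grad[:, None] * X).sum(axis=0)

print("active samples:", int((a > 0).sum()))
print("weight gradient:", grad_w)
```

Every output is 0 and the weight gradient is 0 as well, so gradient descent has no signal to move this neuron back into the active region.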

### **How to Mitigate Issues?**
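- Use **He initialization** for weights, drawing them from a zero-mean normal distribution with variance $2/n_{in}$ (standard deviation $\sqrt{2/n_{in}}$), where $n_{in}$ is the number of inputs to the layer:
  $$ W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right) $$
- Keep **bias values small** (for example, initialized to zero), since a large negative bias pushes pre-activations below zero and can leave neurons dead.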
- Normalize input data to **zero mean and unit variance**:
  $$ x' = \frac{x - \mu}{\sigma} $$
- Apply **L1/L2 regularization** to prevent large weight values:
  $$ L_2 = \lambda \sum W^2 $$
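
The mitigations above can be sketched in NumPy as follows. The helper names (`he_init`, `standardize`, `l2_penalty`), the layer sizes, the `eps` guard, and the value of `lam` are illustrative assumptions, not something taken from the video.

```python
import numpy as np

rng = np.random.default_rng(42)

def he_init(n_in, n_out):
    """Sample weights from a normal distribution with std = sqrt(2 / n_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def standardize(X, eps=1e-8):
    """Shift each feature to zero mean and scale it to unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps)   # eps guards against division by zero

def l2_penalty(W, lam):
    """L2 regularization term: lam * sum of squared weights."""
    return lam * np.sum(W ** 2)

# Hypothetical data: 200 samples with 4 raw features in [0, 255].
X = rng.uniform(0, 255, size=(200, 4))
X_norm = standardize(X)

W = he_init(4, 16)     # hypothetical layer with 4 inputs and 16 units
b = np.zeros(16)       # small (zero) biases
penalty = l2_penalty(W, lam=1e-3)
```

In a training loop the penalty would be added to the data loss, which contributes an extra `2 * lam * W` term to the weight gradient.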

### **Practical Insights:**
- **Use ReLU for hidden layers**, but **softmax** for output layers in classification tasks.
- Variants of ReLU (like **Leaky ReLU**) address the dying neuron problem by keeping a small slope $\alpha$ (commonly 0.01) for negative inputs:
  $$ f(x) = \max(\alpha x, x), \quad 0 < \alpha < 1 $$
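
A minimal sketch of Leaky ReLU and its derivative, assuming the commonly used default slope of 0.01. The function names are illustrative.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Positive inputs pass through; negative inputs are scaled by alpha.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # Slope is 1 for positive inputs and alpha otherwise, so it is never exactly 0.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, 0.0, 3.0])
y = leaky_relu(x)                  # values: -0.02, 0.0, 3.0
dy = leaky_relu_derivative(x)      # values: 0.01, 0.01, 1.0
```

Because the derivative never reaches 0, a neuron with negative pre-activations still receives a small gradient and can recover.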

### **Python Implementation:**
```python
import numpy as np

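def relu(x):
    # Element-wise max(0, x): negatives become 0, positives pass through.
    return np.maximum(0, x)

def relu_derivative(x):
    # Gradient is 1 where x > 0 and 0 elsewhere
    # (the value at x = 0 is set to 0 by convention).
    return np.where(x > 0, 1, 0)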
```
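
As a usage sketch, the two functions can be placed in the forward and backward pass of a single dense layer. The inputs, weights, and the upstream gradient of ones are hypothetical values picked for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return np.where(x > 0, 1, 0)

# Hypothetical layer: 2 samples, 3 inputs, 2 hidden units.
X = np.array([[1.0, -2.0,  0.5],
              [0.0,  1.0, -1.0]])
W = np.array([[ 0.2, -0.4],
              [ 0.1,  0.3],
              [-0.5,  0.6]])
b = np.zeros(2)

z = X @ W + b                        # pre-activations
a = relu(z)                          # forward pass

upstream = np.ones_like(a)           # assumed gradient arriving from the next layer
dz = upstream * relu_derivative(z)   # gradient is blocked wherever z <= 0
dW = X.T @ dz                        # gradient with respect to the weights
db = dz.sum(axis=0)                  # gradient with respect to the biases
```

Only the entries of `z` that are positive let the upstream gradient through, which is why `dz` is zero everywhere else.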

The video emphasizes that despite its simplicity, ReLU is one of the most effective activation functions for deep learning models.
