# Activation Functions, Their Ranges, and Derivatives

| **Activation Function** | **Formula** | **Output Range** | **Derivative Range** | **Notes** |
|--------------------------|-------------|------------------|-----------------------|-----------|
| **Sigmoid (Logistic)** | $ \sigma(x) = \frac{1}{1 + e^{-x}} $ | (0, 1) | (0, 0.25], max = 0.25 at $x=0$ | Saturates at extremes → causes **vanishing gradients** |
| **tanh (Hyperbolic Tangent)** | $ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $ | (-1, 1) | (0, 1], max = 1 at $x=0$ | Zero-centered → often preferred over sigmoid in hidden layers |
| **ReLU (Rectified Linear Unit)** | $ f(x) = \max(0, x) $ | [0, ∞) | {0, 1} | Simple & efficient, but **dead ReLU problem** (neurons stuck at 0) |
| **Leaky ReLU** | $ f(x) = \max(\alpha x, x), \; \alpha \approx 0.01 $ | (-∞, ∞) | {α, 1} | Fixes dead ReLU by allowing small negative slope |
| **ELU (Exponential Linear Unit)** | $ f(x) = \begin{cases} x, & x > 0 \\ \alpha(e^x - 1), & x \leq 0 \end{cases} $ | (-α, ∞) | (0, 1] for α = 1 (slope is $\alpha e^x$ for $x \leq 0$) | Smooth near zero, helps convergence |
| **Softmax** | $ \sigma(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $ | (0, 1), sums to 1 | Depends on input | Used in **classification output layer** |
| **Swish** (Google) | $ f(x) = x \cdot \sigma(x) $ | ≈ [-0.28, ∞) | ≈ [-0.10, 1.10] | Smooth, non-monotonic & often better than ReLU |
| **GELU (Gaussian Error Linear Unit)** | $ f(x) = x \cdot \Phi(x), \;\; \Phi(x) = \text{Gaussian CDF} $ | ≈ [-0.17, ∞) | ≈ [-0.13, 1.13] | Default in **Transformers** (BERT, GPT) |

---

## ✅ Quick Insights
- **Sigmoid/tanh** → prone to vanishing gradients (derivatives are at most 0.25 and 1, respectively, and much smaller once the unit saturates).  
- **ReLU family (ReLU, Leaky ReLU, ELU)** → reduce vanishing gradient, but ReLU can “die”.  
- **Swish/GELU** → smooth, modern, widely used in deep models (esp. NLP).  
- **Softmax** → not for hidden layers, but for **multi-class classification output**.
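
The derivative ranges in the table can be checked numerically. The sketch below samples each derivative on a grid and reports the smallest and largest value it sees. The grid (from -10 to 10 in steps of 0.01) and the helper names are assumptions chosen for illustration, and the reported bounds are only as tight as the grid allows.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_swish(x):
    # Swish is x * sigmoid(x); product rule gives s + x * s * (1 - s).
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def d_gelu(x):
    # Exact GELU is x * Phi(x), so the derivative is Phi(x) + x * phi(x).
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

# Hypothetical sampling grid: -10 to 10 in steps of 0.01.
grid = [i / 100.0 for i in range(-1000, 1001)]

def derivative_range(fn):
    """Return the (min, max) of fn over the sampling grid."""
    values = [fn(x) for x in grid]
    return min(values), max(values)

for name, fn in [("sigmoid", d_sigmoid), ("tanh", d_tanh),
                 ("swish", d_swish), ("gelu", d_gelu)]:
    lo, hi = derivative_range(fn)
    print(f"{name:8s} derivative in [{lo:.4f}, {hi:.4f}]")
```

Sigmoid and tanh never exceed 0.25 and 1, while the Swish and GELU derivatives dip slightly below 0 and rise slightly above 1 because both functions are non-monotonic.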


# Problems with RNNs

This section covers the main problems with simple RNNs.  
These problems motivate the next neural network variants, **LSTM RNN** and **GRU RNN**, which are designed to address them.

---

## Weight Parameters in RNN
A simple RNN has three weight matrices:
- **$W_I$:** Input weight matrix  
- **$W_H$:** Hidden state weight matrix  
- **$W_O$:** Output weight matrix  

These weights are shared across time steps and updated during training to minimize the loss.
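
To make the weight sharing concrete, here is a minimal forward-pass sketch in plain Python. The function names, the tanh activation, the absence of bias terms, and all sizes and values are assumptions for illustration. The point is that the same three matrices are applied at every time step.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_forward(inputs, W_I, W_H, W_O, h0):
    """Run a simple RNN; W_I, W_H and W_O are reused at every time step."""
    h = h0
    outputs = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(W_I, x), matvec(W_H, h))]
        h = [math.tanh(p) for p in pre]   # new hidden state
        outputs.append(matvec(W_O, h))    # output at this step
    return outputs, h

# Hypothetical sizes and values: 2 input features, 3 hidden units, 1 output.
W_I = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]
W_H = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]
W_O = [[1.0, -1.0, 0.5]]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

outputs, final_h = rnn_forward(sequence, W_I, W_H, W_O, h0=[0.0, 0.0, 0.0])
print(len(outputs), len(final_h))
```

Because the parameters are shared, the gradient of the loss with respect to each matrix collects a contribution from every time step.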

---

## Short-Term Dependencies in RNN
- Example: *"The food is good."*  
- The output depends on nearby words within a short context window.  
- Simple RNNs handle this case well.

---

## Long-Term Dependencies and Challenges
- Example: *"Hey, my name is Krish, and I like sports like cricket and volleyball. I also like to make..."*  
- Prediction depends on words much earlier in the sequence.  
- Simple RNNs **struggle** when sequence length grows large (e.g., 50–100 words).

---

## Vanishing Gradient Problem in RNN
- Main issue: **Vanishing gradient problem**.  
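- During **backpropagation through time (BPTT)**, the gradient that reaches an early time step is a product of per-step derivatives, one factor for each step in between.  
- When each factor has magnitude below 1 (sigmoid derivatives are at most 0.25, and tanh derivatives reach 1 only at zero), repeated multiplication shrinks the gradient exponentially → earlier steps have little influence on the weight update.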
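
As a rough worked bound, assume sigmoid activations and ignore the recurrent-weight factor (both are simplifications made here for illustration). A product of $n$ sigmoid derivatives evaluated at pre-activations $z_k$ is then at most $0.25^n$:

$$
\prod_{k=1}^{n} \sigma'(z_k) \le 0.25^{n}, \qquad
0.25^{10} \approx 9.5 \times 10^{-7}, \qquad
0.25^{50} \approx 7.9 \times 10^{-31}
$$

Even in this best case for sigmoid, the signal from 50 steps back is far too small to affect the update in practice.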

---

## Diagram Overview

![RNN problems](images/Problems_with_RNN_1.png)

![RNN problems](images/Problems_with_RNN_2.png)

---

## Mathematical Illustration of Vanishing Gradient

The derivative of the loss **L** with respect to the hidden weight **W_H** sums one contribution from every time step **t**, where $O_t$ denotes the hidden-state output at step $t$:

$$
\frac{\partial L}{\partial W_H} = \sum_{t=1}^T 
\frac{\partial L}{\partial \hat{y}} \cdot 
\frac{\partial \hat{y}}{\partial O_T} \cdot 
\prod_{k=t+1}^T \frac{\partial O_k}{\partial O_{k-1}} \cdot 
\frac{\partial O_t}{\partial W_H}
$$

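Here, each factor $ \frac{\partial O_k}{\partial O_{k-1}} $ typically has magnitude less than one (for example with sigmoid or tanh activations and small recurrent weights). The product for step **t** contains **T − t** such factors, so it shrinks exponentially as the gap between **t** and **T** grows, and the terms for early time steps contribute almost nothing to the sum.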
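
The product of per-step factors can be computed directly for a scalar RNN. The sketch below assumes the recurrence $O_k = \tanh(w_i x_k + w_h O_{k-1})$, for which $\frac{\partial O_k}{\partial O_{k-1}} = w_h (1 - O_k^2)$. The weights, the starting state, and the constant input are hypothetical values picked for illustration.

```python
import math

def run_from(o_start, xs, w_i, w_h):
    """Apply O_k = tanh(w_i * x_k + w_h * O_{k-1}) over xs; return all states."""
    states = [o_start]
    for x in xs:
        states.append(math.tanh(w_i * x + w_h * states[-1]))
    return states

def jacobian_product(states, w_h):
    """Product of dO_k/dO_{k-1} = w_h * (1 - O_k**2) over the later states."""
    product = 1.0
    for o_k in states[1:]:
        product *= w_h * (1.0 - o_k ** 2)
    return product

# Hypothetical scalar weights, starting state, and constant input sequence.
w_i, w_h = 0.5, 0.9
o_1 = 0.1
for steps in (5, 20, 50):
    states = run_from(o_1, [1.0] * steps, w_i, w_h)
    print(steps, jacobian_product(states, w_h))
```

The printed product drops by many orders of magnitude as the number of steps grows, which is the shrinking factor in the gradient formula above.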

---

## Impact of Activation Functions
- **Sigmoid:** Derivatives lie in (0, 0.25] → strong vanishing effect.  
- **tanh:** Derivatives lie in (0, 1] and equal 1 only at $x=0$ → still suffers for long sequences.  
- **ReLU / Leaky ReLU:** Derivative is exactly 1 for positive inputs (0 or α for negative inputs) → helps reduce vanishing gradient.
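
The comparison above can be illustrated with a scalar recurrence $h_k = f(w_i x + w_h h_{k-1})$, multiplying the per-step factors $w_h f'(z_k)$ for each activation. The setup is an assumption for illustration: $w_h = 1$ so that only the activation derivative matters, with a constant positive input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

# Each entry pairs an activation with its derivative, both as functions of z.
ACTIVATIONS = {
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
    "tanh": (math.tanh, lambda z: 1.0 - math.tanh(z) ** 2),
    "relu": (relu, lambda z: 1.0 if z > 0 else 0.0),
}

def gradient_factor(name, steps, w_h=1.0, w_i=0.5, x=1.0):
    """Product of dh_k/dh_{k-1} = w_h * f'(z_k) along a scalar recurrence."""
    f, df = ACTIVATIONS[name]
    h, product = 0.0, 1.0
    for _ in range(steps):
        z = w_i * x + w_h * h
        product *= w_h * df(z)
        h = f(z)
    return product

for name in ACTIVATIONS:
    print(name, gradient_factor(name, steps=30))
```

In this setup the ReLU product stays exactly 1.0 because every pre-activation is positive, while the sigmoid and tanh products collapse toward zero. The ReLU hidden state is unbounded here, so this shows the gradient behaviour only and is not a recommended configuration.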

---

## Mitigating Vanishing Gradient
- Use **ReLU-based activations**.  
- Use advanced architectures:  
  - **LSTM (Long Short-Term Memory)**  
  - **GRU (Gated Recurrent Unit)**  

These introduce gating mechanisms that control what information is kept, written, and forgotten at each step, which gives gradients a more direct path through time.
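
As a minimal sketch of why gating helps, consider the standard LSTM cell update $c_t = f_t \cdot c_{t-1} + i_t \cdot g_t$. Along the direct cell-state path, $\frac{\partial c_t}{\partial c_{t-1}} = f_t$, so the gradient is scaled by the forget gate rather than by a saturating activation derivative. The scalar cell below is a simplification: the parameter names and values are hypothetical, and the indirect dependence of the gates on $h_{t-1}$ is ignored when forming the product.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; returns the new h, new c, and the forget gate."""
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c, f

# Hypothetical parameters: every weight is 0.1, biases are 0 except b_f = 4.
params = {f"{kind}_{gate}": 0.1 for kind in "wu" for gate in "fiog"}
params.update({"b_f": 4.0, "b_i": 0.0, "b_o": 0.0, "b_g": 0.0})

def cell_path_product(steps, p, x=1.0):
    """Product of forget gates, i.e. the direct dc_t/dc_{t-1} factors."""
    h, c, product = 0.0, 0.0, 1.0
    for _ in range(steps):
        h, c, f = lstm_step(x, h, c, p)
        product *= f
    return product

print(cell_path_product(50, params))
```

With the forget gate held close to 1 by the large bias, the product over 50 steps stays at a usable magnitude, many orders of magnitude above the $0.25^{50}$ upper bound for a chain of sigmoid derivatives.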

---

## Summary
- **Problem:** Simple RNNs fail to capture long-term dependencies due to vanishing gradients.  
- **Cause:** Repeated multiplication of small derivatives during BPTT.  
- **Impact:** Earlier time steps have negligible influence.  
- **Solution:** Use ReLU activations or advanced architectures like **LSTM** and **GRU** that handle long sequences effectively.
