# Understanding Slope, Derivatives, and Gradient Descent

### 1. Slope
- Slope tells us **how steep a line is**.  
- In a straight line equation `y = mx + c`,  
  - `m` = slope  
  - It means: **change in y / change in x (Δy/Δx)**.  
- Example: If slope = 2, then for every +1 change in `x`, `y` increases by +2.  
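
As a quick check of the Δy/Δx definition, here is a minimal sketch that computes the slope from two points. The helper name `slope` and the line `y = 2x + 1` (with `c = 1`) are illustrative choices, not part of the notes above.

```python
def slope(x1, y1, x2, y2):
    """Slope of the line through two points: change in y over change in x."""
    return (y2 - y1) / (x2 - x1)

# Two points on the illustrative line y = 2x + 1: (0, 1) and (3, 7)
m = slope(0, 1, 3, 7)
print(m)  # 2.0
```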

---

### 2. Derivatives
- Derivative is a generalization of slope for **any curve** (not just straight lines).  
- It tells us: **rate of change of a function w.r.t. its variable**.  
- Example:  
  - For `y = x²`, derivative = `dy/dx = 2x`.  
  - At `x = 3`, slope of curve = `2(3) = 6`.  
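
The `dy/dx = 2x` result can be checked numerically. The sketch below approximates the derivative with a central difference; the function name `numerical_derivative` and the step size `h` are assumptions made for illustration.

```python
def f(x):
    return x ** 2

def numerical_derivative(func, x, h=1e-5):
    """Central-difference approximation of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

approx = numerical_derivative(f, 3.0)   # numerical slope at x = 3
exact = 2 * 3.0                         # analytic slope from dy/dx = 2x
print(approx, exact)
```

The two values should agree to several decimal places, which is a handy way to sanity-check a hand-derived derivative.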

---

### 3. Partial Derivatives
- Used when a function has **multiple variables**.  
- Example: `f(x, y) = x² + y²`  
  - Partial derivative wrt `x`: `∂f/∂x = 2x` (treat `y` as constant)  
  - Partial derivative wrt `y`: `∂f/∂y = 2y` (treat `x` as constant)  
- In Machine Learning, cost functions depend on **many weights**, so we use partial derivatives.
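
"Treat the other variable as constant" can be sketched directly: nudge one input while holding the other fixed. The helper names `partial_x` and `partial_y` and the evaluation point `(1, 2)` are hypothetical, chosen only for this example.

```python
def f(x, y):
    return x ** 2 + y ** 2

def partial_x(func, x, y, h=1e-5):
    """Approximate df/dx by varying x only; y is held constant."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def partial_y(func, x, y, h=1e-5):
    """Approximate df/dy by varying y only; x is held constant."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

gx = partial_x(f, 1.0, 2.0)   # analytic value: 2x = 2
gy = partial_y(f, 1.0, 2.0)   # analytic value: 2y = 4
print(gx, gy)
```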

---

### 4. Linear vs Non-linear Case
- **Linear case**: slope is constant (e.g., straight line).  
- **Non-linear case**: slope changes from point to point, so we use **derivatives** to get the slope at each specific point.  

---

### 5. Gradient Descent
- Gradient = vector of **all partial derivatives**.  
- Descent = moving in the direction of **negative gradient** (downhill).  
- Why? Because we want to minimize the **loss function**.  

**Steps of Gradient Descent:**
1. Start with random weights.  
2. Compute the loss (error).  
3. Calculate the derivatives (slopes) of the loss with respect to each weight.  
4. Update weights:  
   `new_weight = old_weight - learning_rate * derivative`  
5. Repeat until convergence.  
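
The steps above can be sketched on a one-weight problem. The loss `(w - 3)²`, the learning rate `0.1`, and the fixed starting weight are all assumptions chosen for illustration; a real model would start from random weights and use a loss computed from data.

```python
def loss(w):
    # Hypothetical one-weight loss with its minimum at w = 3
    return (w - 3.0) ** 2

def derivative(w):
    # d(loss)/dw for the loss above
    return 2.0 * (w - 3.0)

w = 0.0                # step 1: starting weight (fixed here, not random, for reproducibility)
learning_rate = 0.1

for step in range(100):             # step 5: repeat
    current_loss = loss(w)          # step 2: compute the loss
    grad = derivative(w)            # step 3: calculate the derivative
    w = w - learning_rate * grad    # step 4: update the weight

print(w)  # very close to 3.0
```

Each update shrinks the distance to the minimum by a constant factor here, so 100 iterations are more than enough. On real loss surfaces the number of iterations and a suitable learning rate have to be found by experiment.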

---

✅ Example in ML:  
- Suppose we are predicting house prices.  
- Loss function = Mean Squared Error (MSE).  
- Gradient descent helps us adjust the weights attached to input features (like `size`, `rooms`) to minimize error.  
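
A minimal sketch of this example follows. The five houses, the linear model with a bias term, the learning rate, and the iteration count are all made up for illustration; they are not real data or tuned values.

```python
# Hypothetical data: (size in 1000 sq ft, number of rooms, price in $1000s)
houses = [
    (1.0, 2, 150.0),
    (1.5, 3, 210.0),
    (2.0, 3, 260.0),
    (2.5, 4, 320.0),
    (3.0, 5, 380.0),
]

def predict(w_size, w_rooms, bias, size, rooms):
    return w_size * size + w_rooms * rooms + bias

def mse(w_size, w_rooms, bias):
    """Mean Squared Error of the model over all houses."""
    errors = [predict(w_size, w_rooms, bias, s, r) - p for s, r, p in houses]
    return sum(e ** 2 for e in errors) / len(houses)

w_size, w_rooms, bias = 0.0, 0.0, 0.0   # fixed start for reproducibility
learning_rate = 0.01
n = len(houses)
initial_loss = mse(w_size, w_rooms, bias)

for _ in range(2000):
    errors = [predict(w_size, w_rooms, bias, s, r) - p for s, r, p in houses]
    # Partial derivatives of MSE with respect to each weight
    grad_size = 2 / n * sum(e * s for e, (s, r, p) in zip(errors, houses))
    grad_rooms = 2 / n * sum(e * r for e, (s, r, p) in zip(errors, houses))
    grad_bias = 2 / n * sum(errors)
    # Gradient descent update for every weight
    w_size -= learning_rate * grad_size
    w_rooms -= learning_rate * grad_rooms
    bias -= learning_rate * grad_bias

final_loss = mse(w_size, w_rooms, bias)
print(initial_loss, final_loss)
```

The features are kept on a small scale on purpose: with raw square-foot values in the thousands, the same learning rate would make the updates diverge.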

---


# RNN Backpropagation Through Time (BPTT)

Previously, we explored the forward propagation process and the basic architecture of a simple RNN. Now, after computing the **loss function** at the end of the forward pass, our goal is to **reduce this loss by updating the weights through backpropagation**.

---
### Diagram Overview
![RNN Backward Propagation with time](images/RNN_backward_propagation_with_time.png)

---

## Loss Calculation and Objective
After obtaining the predicted output $\hat{y}$, we calculate the loss by comparing it with the true label $y$.  

Our objective is to minimize this loss by updating the weights:
- $w_I$ (input weights)
- $w_H$ (hidden weights)
- $w_O$ (output weights)

---

## Weight Update Formula
The general weight update rule using gradient descent is:

$$
w_{new} = w_{old} - \eta \times \frac{\partial Loss}{\partial w_{old}}
$$

where:
- $w_{old}$ = current weight  
- $w_{new}$ = updated weight  
- $\eta$ = learning rate  
- $\frac{\partial Loss}{\partial w_{old}}$ = gradient of the loss w.r.t. the weight  

---

## Updating Output Weights ($w_O$)

Using the chain rule:

$$
\frac{\partial Loss}{\partial w_O} = \frac{\partial Loss}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial w_O}
$$

Update rule:

$$
w_O^{new} = w_O^{old} - \eta \times \frac{\partial Loss}{\partial w_O}
$$
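
As a concrete instance of the two chain-rule factors, the sketch below assumes a linear output `y_hat = w_O * o_3` and a squared-error loss `0.5 * (y_hat - y)**2`. Neither form is stated in these notes, and all numeric values are hypothetical.

```python
# Assumed model (not given in the notes): y_hat = w_O * o_3, Loss = 0.5 * (y_hat - y)**2
o_3 = 0.8    # final hidden output, taken as given from the forward pass
y = 1.0      # true label
w_O = 0.5    # current output weight
eta = 0.1    # learning rate

y_hat = w_O * o_3

dloss_dyhat = y_hat - y                 # dLoss/dy_hat for the assumed loss
dyhat_dwO = o_3                         # dy_hat/dw_O for the assumed output layer
dloss_dwO = dloss_dyhat * dyhat_dwO     # chain rule

w_O_new = w_O - eta * dloss_dwO         # update rule
print(dloss_dwO, w_O_new)
```

The prediction (0.4) is below the label (1.0), so the gradient is negative and the update increases `w_O`.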

---

## Updating Hidden Weights ($w_H$)

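The hidden weights are shared across all time steps, so we **sum gradients over all $t$**. Here $o_t$ is the hidden-layer output at time step $t$. Each time step contributes one chain-rule term, and `+=` below means that term is added to the running total. The final factor in each term, $\frac{\partial o_t}{\partial w_H}$, is the direct dependence of $o_t$ on $w_H$ with $o_{t-1}$ held fixed.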

- At $t=3$:  
$$
\frac{\partial Loss}{\partial w_H} = 
\frac{\partial Loss}{\partial \hat{y}} \times 
\frac{\partial \hat{y}}{\partial o_3} \times 
\frac{\partial o_3}{\partial w_H}
$$

- At $t=2$:  
$$
\frac{\partial Loss}{\partial w_H} += 
\frac{\partial Loss}{\partial \hat{y}} \times 
\frac{\partial \hat{y}}{\partial o_3} \times 
\frac{\partial o_3}{\partial o_2} \times 
\frac{\partial o_2}{\partial w_H}
$$

- At $t=1$:  
$$
\frac{\partial Loss}{\partial w_H} += 
\frac{\partial Loss}{\partial \hat{y}} \times 
\frac{\partial \hat{y}}{\partial o_3} \times 
\frac{\partial o_3}{\partial o_2} \times 
\frac{\partial o_2}{\partial o_1} \times 
\frac{\partial o_1}{\partial w_H}
$$

Update rule:

$$
w_H^{new} = w_H^{old} - \eta \times \frac{\partial Loss}{\partial w_H}
$$
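
The accumulation over time steps can be sketched for a scalar RNN and checked against a numerical gradient. The cell form `o_t = tanh(w_I * x_t + w_H * o_{t-1})`, the linear output `y_hat = w_O * o_3`, the squared-error loss, the zero initial state, and every numeric value are assumptions made for this example.

```python
import math

def forward(w_I, w_H, w_O, xs):
    """Run the assumed scalar RNN; return all hidden outputs and the prediction."""
    outputs = [0.0]                          # o_0 = 0 (assumed initial state)
    for x in xs:
        outputs.append(math.tanh(w_I * x + w_H * outputs[-1]))
    y_hat = w_O * outputs[-1]
    return outputs, y_hat

def loss_fn(w_I, w_H, w_O, xs, y):
    _, y_hat = forward(w_I, w_H, w_O, xs)
    return 0.5 * (y_hat - y) ** 2

def grad_w_H(w_I, w_H, w_O, xs, y):
    """Accumulate dLoss/dw_H from the last time step back to the first."""
    outputs, y_hat = forward(w_I, w_H, w_O, xs)
    upstream = (y_hat - y) * w_O             # dLoss/dy_hat * dy_hat/do_T
    total = 0.0
    for t in range(len(xs), 0, -1):
        local = 1.0 - outputs[t] ** 2        # derivative of tanh at step t
        total += upstream * local * outputs[t - 1]   # direct do_t/dw_H term
        upstream = upstream * local * w_H            # extend chain by do_t/do_{t-1}
    return total

w_I, w_H, w_O = 0.5, 0.8, 1.2
xs, y = [1.0, 0.5, -0.3], 0.5

analytic = grad_w_H(w_I, w_H, w_O, xs, y)
h = 1e-5
numeric = (loss_fn(w_I, w_H + h, w_O, xs, y) - loss_fn(w_I, w_H - h, w_O, xs, y)) / (2 * h)
print(analytic, numeric)
```

With a zero initial state, the $t=1$ term contributes nothing because $o_0 = 0$; the loop still visits it so the structure matches the three terms above.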

---

## Updating Input Weights ($w_I$)

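The input weights are also shared across all time steps, so the same accumulation applies. The chains are identical to those for $w_H$ except for the last factor: $\frac{\partial o_t}{\partial w_I}$ depends on the input at step $t$ rather than on the previous hidden output.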

- At $t=3$:  
$$
\frac{\partial Loss}{\partial w_I} = 
\frac{\partial Loss}{\partial \hat{y}} \times 
\frac{\partial \hat{y}}{\partial o_3} \times 
\frac{\partial o_3}{\partial w_I}
$$

- At $t=2$:  
$$
\frac{\partial Loss}{\partial w_I} += 
\frac{\partial Loss}{\partial \hat{y}} \times 
\frac{\partial \hat{y}}{\partial o_3} \times 
\frac{\partial o_3}{\partial o_2} \times 
\frac{\partial o_2}{\partial w_I}
$$

- At $t=1$:  
$$
\frac{\partial Loss}{\partial w_I} += 
\frac{\partial Loss}{\partial \hat{y}} \times 
\frac{\partial \hat{y}}{\partial o_3} \times 
\frac{\partial o_3}{\partial o_2} \times 
\frac{\partial o_2}{\partial o_1} \times 
\frac{\partial o_1}{\partial w_I}
$$

Update rule:

$$
w_I^{new} = w_I^{old} - \eta \times \frac{\partial Loss}{\partial w_I}
$$
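
To make the difference between the $w_I$ and $w_H$ chains concrete, here is a worked sketch under an assumed cell form that these notes do not spell out: $o_t = \tanh(w_I x_t + w_H o_{t-1})$, where $x_t$ (the input at step $t$) is a symbol introduced here for illustration. Under that assumption, the direct partial derivatives are:

$$
\frac{\partial o_t}{\partial w_I} = (1 - o_t^2)\, x_t, \qquad
\frac{\partial o_t}{\partial w_H} = (1 - o_t^2)\, o_{t-1}, \qquad
\frac{\partial o_t}{\partial o_{t-1}} = (1 - o_t^2)\, w_H
$$

The linking factors $\frac{\partial o_t}{\partial o_{t-1}}$ are the same in both chains. Only the final factor changes: it picks up $x_t$ for $w_I$ and $o_{t-1}$ for $w_H$.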

---

# ✅ Summary
- **Loss** is computed between predicted $\hat{y}$ and true label $y$.  
- **Weights ($w_I, w_H, w_O$)** are updated using **gradient descent**.  
- **$w_O$** needs only a single chain-rule step, since it connects the final hidden output directly to $\hat{y}$.  
- **$w_H$** and **$w_I$** require **Backpropagation Through Time (BPTT)**: gradients are summed across all time steps using the chain rule.  
- Updates repeat until the loss converges. Ideally this is a global minimum, but gradient descent may settle in a local one.  
