```{contents}
```

## Gradient Descent Optimizer

Gradient Descent (GD) is an **optimizer** that minimizes the loss (or cost) function by iteratively updating the weights of a neural network, using the gradients computed during **backpropagation**.

---

### Purpose of an Optimizer

* Forward propagation calculates predicted outputs $\hat{y}$ from inputs $X$.
* Loss function measures error between predicted $\hat{y}$ and actual $y$.
* Optimizer updates **weights** to reduce this loss iteratively.

**Weight update formula:**
$$
w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial L}{\partial w_{\text{old}}}
$$

Where:

* $\eta$ = learning rate (step size)
* $\frac{\partial L}{\partial w}$ = gradient of the loss with respect to the weights
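As a minimal sketch of the update formula, the snippet below applies a single step to one weight. The loss $L(w) = (w - 3)^2$, the starting weight, and the learning rate are illustrative assumptions, not values from these notes.

```python
def gd_step(w_old, grad, lr):
    """Return the weight after one gradient descent step: w_old - lr * grad."""
    return w_old - lr * grad


# Hypothetical loss L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3).
w_old = 0.0
grad = 2 * (w_old - 3)                 # gradient evaluated at w_old: -6.0
w_new = gd_step(w_old, grad, lr=0.25)  # 0.0 - 0.25 * (-6.0)

print(w_new)  # 1.5
```

The gradient is negative at `w_old`, so subtracting it moves the weight up, toward the minimum at $w = 3$.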

---

### Gradient Descent Process

1. Initialize weights randomly.
2. Forward propagate all data points to calculate predictions.
3. Compute the **cost function** over all data points.

   * For example, MSE:
     $$
     \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
     $$
4. Backpropagate gradients and update weights using the GD update formula.
5. Repeat for multiple **epochs** until convergence.
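The five steps above can be sketched in plain Python for the simplest possible model, a line $\hat{y} = wx + b$ trained with MSE. The model, the toy data, the learning rate, and the epoch count are all assumptions chosen for illustration; a real network would obtain the gradients through backpropagation rather than the hand-derived expressions used here.

```python
import random


def predict(w, b, xs):
    """Forward pass of a hypothetical one-feature linear model."""
    return [w * x + b for x in xs]


def mse(ys, y_hats):
    """Mean squared error over all data points."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats)) / len(ys)


def mse_gradients(xs, ys, y_hats):
    """Gradients of MSE with respect to w and b, averaged over all points."""
    n = len(xs)
    dw = sum(-2 * (y - y_hat) * x for x, y, y_hat in zip(xs, ys, y_hats)) / n
    db = sum(-2 * (y - y_hat) for y, y_hat in zip(ys, y_hats)) / n
    return dw, db


def train(xs, ys, lr=0.05, epochs=2000, seed=0):
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)   # 1. random initialization
    history = []
    for _ in range(epochs):                         # 5. repeat for many epochs
        y_hats = predict(w, b, xs)                  # 2. forward pass on all data
        history.append(mse(ys, y_hats))             # 3. cost over all data
        dw, db = mse_gradients(xs, ys, y_hats)      # 4. gradients ...
        w, b = w - lr * dw, b - lr * db             #    ... and weight update
    return w, b, history


# Toy data generated from y = 2x + 1 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b, history = train(xs, ys)
print(round(w, 4), round(b, 4))
```

Every epoch uses the whole dataset for a single update, which is what distinguishes this classic (batch) form from the mini-batch variants.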

---

### Epochs vs Iterations

* **Epoch:** One complete pass through all training data.
* **Iteration:** One weight update, computed on the batch of data used for that update (a mini-batch, or the full dataset).
* In **classic gradient descent**, all data points are used for one update, so
  **1 epoch = 1 iteration**.
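The relationship between the two terms is simple arithmetic. The helper below is a hypothetical function written for this note, and the dataset and batch sizes are made-up numbers.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of weight updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


n_samples = 1000  # hypothetical dataset size

# Classic gradient descent: the batch is the whole dataset.
print(iterations_per_epoch(n_samples, batch_size=n_samples))  # 1

# Mini-batch settings: several updates per epoch.
print(iterations_per_epoch(n_samples, batch_size=100))        # 10
print(iterations_per_epoch(n_samples, batch_size=32))         # 32
```

With a batch size of 32 the final batch of the epoch is smaller than the others (8 samples), which is why the count is rounded up.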

---

### Visualization

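* For simple models such as linear regression with MSE, the loss surface is convex (a parabola in one dimension). Neural network loss surfaces are generally non-convex.
* Gradient descent moves weights **downhill**, in the direction of the negative gradient. On a convex surface this leads toward the **global minimum** of the loss.
* When gradient = 0, weights stop updating. On a convex surface this point is the global minimum (convergence); on a non-convex surface it may be a local minimum or a saddle point.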
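In place of a plot, the downhill movement can be traced numerically on a one-dimensional parabola. The loss $L(w) = w^2$, the starting point, and the learning rate are assumptions picked so the numbers stay exact.

```python
def loss(w):
    """Hypothetical convex loss: a parabola with its minimum at w = 0."""
    return w ** 2


def grad(w):
    """Derivative of the loss above."""
    return 2 * w


w = 4.0
lr = 0.25
path = [w]
for _ in range(5):
    w = w - lr * grad(w)   # each step halves w for this loss and learning rate
    path.append(w)

print(path)                      # [4.0, 2.0, 1.0, 0.5, 0.25, 0.125]
print([loss(p) for p in path])   # loss shrinks at every step
```

The steps get smaller as the weight approaches the minimum because the gradient itself shrinks toward zero there.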

---

### Advantages

* Convergence to the global minimum is guaranteed for convex loss surfaces, provided the learning rate is small enough.
* Simple, and the foundation for the other gradient-based optimizers.
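Convergence on a convex surface still depends on the learning rate. The sketch below, which assumes the loss $L(w) = w^2$ and two arbitrary learning rates, shows one run that settles near the minimum and one that overshoots further on every step.

```python
def run_gd(lr, steps=20, w_start=4.0):
    """Run gradient descent on the hypothetical loss L(w) = w^2."""
    w = w_start
    for _ in range(steps):
        w = w - lr * (2 * w)   # dL/dw = 2w
    return w


small_lr = run_gd(lr=0.1)   # each step multiplies w by 0.8
large_lr = run_gd(lr=1.1)   # each step multiplies w by -1.2

print(abs(small_lr) < 4.0)  # True: moved toward the minimum at 0
print(abs(large_lr) > 4.0)  # True: moved away from it
```

For this particular loss, any learning rate below 1.0 shrinks the weight toward zero, while anything above 1.0 makes it grow in magnitude.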

---

### Disadvantages

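* **Resource intensive:** Each update needs gradients over the entire dataset, which in a straightforward implementation means holding all data in memory at once.

  * For large datasets, RAM and GPU memory usage is very high.
* Slow for very large datasets: the weights are updated only once per full pass over the data.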

---

**Key takeaway:**

Gradient Descent works well for small datasets but struggles with very large datasets due to memory and computational requirements. Variants like **Stochastic Gradient Descent (SGD)** or **Mini-batch GD** address this limitation.

