```{contents}
```

## Gradient Descent

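Gradient descent is an iterative optimization algorithm that minimizes a cost function $J(\theta)$ by repeatedly moving the model parameters $\theta$ a small step, scaled by the learning rate $\alpha$, in the direction of the negative gradient. Its variants differ mainly in **how much data** they use to compute each update.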
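
As a minimal sketch of this update rule, the snippet below minimizes the one-dimensional cost $J(\theta) = (\theta - 3)^2$. The function name `gradient_descent_1d`, the choice of cost, and the values of `alpha` and `n_iters` are illustrative assumptions, not part of the original notes.

```python
def gradient_descent_1d(grad, theta0, alpha=0.1, n_iters=100):
    """Repeatedly apply theta := theta - alpha * dJ/dtheta."""
    theta = theta0
    for _ in range(n_iters):
        theta = theta - alpha * grad(theta)
    return theta


# J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3) and its minimum at theta = 3.
theta_min = gradient_descent_1d(lambda t: 2.0 * (t - 3.0), theta0=0.0)
print(theta_min)
```

With these settings each step shrinks the distance to the minimum by a factor of 0.8, so `theta_min` ends up very close to 3.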

---

## Types of Gradient Descent

### Batch Gradient Descent

* **Definition**: Uses the **entire training dataset** to compute the gradient of the cost function in each iteration.
* **Formula**:

  $$
  \theta := \theta - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta J(\theta; x^{(i)}, y^{(i)})
  $$

  where $m$ = number of training examples.
* **Pros**:

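  * Converges to the **global minimum** for convex cost functions (such as linear regression with squared error), provided the learning rate is small enough.
  * Stable, low-variance updates, because each step uses the exact gradient over the whole dataset.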
* **Cons**:

  * Each update requires a full pass over the data, so it is very slow on large datasets.
  * High memory cost when the whole dataset is loaded at once to compute the gradient.
* ✅ Best suited for **small to medium datasets**.
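
The batch update above can be sketched in NumPy for linear regression. This assumes a squared-error cost, whose per-example gradient is $(x^{(i)\top}\theta - y^{(i)})\,x^{(i)}$; the function name `batch_gradient_descent`, the synthetic data, and the hyperparameters are illustrative assumptions.

```python
import numpy as np


def batch_gradient_descent(X, y, alpha=0.1, n_iters=500):
    """Full-batch gradient descent for linear regression with squared-error cost."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        residuals = X @ theta - y      # prediction errors on all m examples
        grad = X.T @ residuals / m     # average of the per-example gradients
        theta -= alpha * grad          # one update per full pass over the data
    return theta


# Synthetic, noise-free data: y = 2 - 3 * x (a bias column of ones is prepended).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_theta = np.array([2.0, -3.0])
y = X @ true_theta

theta = batch_gradient_descent(X, y)
print(theta)
```

Every iteration touches all 200 rows of `X`, which is exactly what makes this variant stable but expensive as the dataset grows.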

---

### Stochastic Gradient Descent (SGD)

* **Definition**: Updates the parameters after **each individual training example**, typically visiting the examples in random order.
* **Formula**:

  $$
  \theta := \theta - \alpha \cdot \nabla_\theta J(\theta; x^{(i)}, y^{(i)})
  $$

  where $(x^{(i)}, y^{(i)})$ is a single training example, usually chosen at random.
* **Pros**:

  * Each update is cheap, so the parameters start improving after only a few examples.
  * Noisy updates can help it escape shallow **local minima** in non-convex problems.
* **Cons**:

  * Updates are noisy → the cost function fluctuates rather than decreasing smoothly.
  * With a constant learning rate it oscillates around the minimum instead of settling exactly on it; a decaying learning rate reduces this.
* ✅ Best for **very large datasets** or **online learning**.
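
A per-example update loop might look like the following sketch. It again assumes linear regression with a squared-error cost; the function name `sgd`, the synthetic noisy data, and the hyperparameters are illustrative assumptions.

```python
import numpy as np


def sgd(X, y, alpha=0.05, n_epochs=50, seed=0):
    """Stochastic gradient descent: one parameter update per training example."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):       # visit examples in a fresh random order
            error = X[i] @ theta - y[i]    # scalar error on example i
            theta -= alpha * error * X[i]  # gradient of the squared error on example i
    return theta


# Synthetic data: y = 2 - 3 * x plus a little Gaussian noise.
rng = np.random.default_rng(1)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=m)])
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=m)

theta = sgd(X, y)
print(theta)
```

Because the step size is constant and the labels are noisy, the result lands near `true_theta` but keeps jittering around it from one update to the next.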

---

### Mini-Batch Gradient Descent

* **Definition**: A compromise between Batch and SGD. Uses **small random subsets (mini-batches)** of the data to update parameters.
* **Formula**:

  $$
  \theta := \theta - \alpha \cdot \frac{1}{b} \sum_{i=1}^{b} \nabla_\theta J(\theta; x^{(i)}, y^{(i)})
  $$

  where $b$ = mini-batch size (e.g., 32, 64, 128) and the sum runs over the examples in the current mini-batch.
* **Pros**:

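  * Updates far more often than Batch GD, so it usually makes faster progress per pass over the data.
  * Less noisy than SGD, because each gradient is averaged over several examples.
  * Each batch can be processed with **vectorized** operations, which map well onto parallel hardware such as GPUs.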
* **Cons**:

  * Batch size is an extra hyperparameter to tune (too small → noisy updates, too large → slow, memory-hungry updates).
* ✅ Best for **deep learning and neural networks**.
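
The mini-batch update can be sketched the same way, with each gradient computed in one vectorized step over a slice of the shuffled data. As before, the linear-regression setting, the function name `minibatch_gd`, the synthetic data, and the hyperparameters are illustrative assumptions.

```python
import numpy as np


def minibatch_gd(X, y, alpha=0.1, batch_size=32, n_epochs=100, seed=0):
    """Mini-batch gradient descent for linear regression with squared-error cost."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        order = rng.permutation(m)                    # reshuffle once per epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]     # last batch may be smaller
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)  # vectorized batch gradient
            theta -= alpha * grad
    return theta


# Synthetic data: y = 2 - 3 * x plus a little Gaussian noise.
rng = np.random.default_rng(1)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=m)])
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=m)

theta = minibatch_gd(X, y)
print(theta)
```

In this sketch, setting `batch_size=m` reproduces Batch GD and `batch_size=1` reproduces SGD, which is one way to see mini-batch as the middle ground between the two.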

---

### Comparison Summary

| Type              | Update Frequency            | Speed                                  | Stability   | Use Case                             |
| ----------------- | --------------------------- | -------------------------------------- | ----------- | ------------------------------------ |
| **Batch GD**      | After the full dataset      | Slow per update                        | Very stable | Small to medium datasets             |
| **Stochastic GD** | After each training example | Fast per update                        | Noisy       | Very large datasets, online learning |
| **Mini-Batch GD** | After each mini-batch       | Fast per update, vectorizes well       | Balanced    | Deep learning and neural networks    |
