```{contents}
```

## Stochastic Gradient Descent (SGD) Optimizer

---

### 1️⃣ Concept

* **Goal:** Reduce the loss function by updating weights during training.
* **Difference from Gradient Descent:**

  * Gradient Descent (GD) computes the gradient over **all training data** before each weight update.
  * SGD uses **one data point at a time**: one forward and backward pass, then one weight update, per sample.

**Weight update formula:**

$$
w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial L}{\partial w_{\text{old}}}
$$

* $\eta$ = learning rate
* $L$ = loss for **one sample**
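
As a worked instance of the formula, here is a minimal sketch of a single update. It assumes a linear model $\hat{y} = w \cdot x$ and the squared loss $L = \tfrac{1}{2}(\hat{y} - y)^2$; neither is specified in these notes, and the function name `sgd_step` is hypothetical.

```python
def sgd_step(w, x, y, lr):
    """One SGD update for y_hat = w . x with loss L = 0.5 * (y_hat - y)**2."""
    y_hat = sum(wi * xi for wi, xi in zip(w, x))      # forward pass on one sample
    error = y_hat - y
    grad = [error * xi for xi in x]                   # dL/dw_i = (y_hat - y) * x_i
    return [wi - lr * gi for wi, gi in zip(w, grad)]  # w_new = w_old - lr * grad


# One sample (x, y) = ([1, 2], 1), starting from zero weights, learning rate 0.1.
w_new = sgd_step([0.0, 0.0], [1.0, 2.0], 1.0, 0.1)
print(w_new)
```

Here the prediction is 0, the error is -1, and the gradient is `[-1, -2]`, so each weight moves by the learning rate times its input value.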

---

### 2️⃣ How it works

1. Pick **one data point** (typically in random order).
2. Compute the predicted output $\hat{y}$.
3. Calculate the loss $L(y, \hat{y})$ for that point.
4. Update the weights using the gradient of that loss.
5. Repeat for the next data point.

* **One epoch:** one pass over **all data points**.
* With 1000 data points, one epoch means 1000 iterations (weight updates).
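
The steps above can be sketched as a short training loop. This is an illustration under stated assumptions, not a reference implementation: the model $\hat{y} = wx + b$, the squared loss, the toy data, and the name `train_sgd` are all chosen here for the example.

```python
import random


def train_sgd(data, lr=0.1, epochs=500, seed=0):
    """Fit y ~ w*x + b with per-sample SGD; returns (w, b, number_of_updates)."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    updates = 0
    for _ in range(epochs):
        rng.shuffle(data)          # new random visiting order each epoch
        for x, y in data:          # step 1: pick one data point
            y_hat = w * x + b      # step 2: predicted output
            error = y_hat - y      # step 3: loss is 0.5 * error**2
            w -= lr * error * x    # step 4: update with dL/dw = error * x
            b -= lr * error        #         and dL/db = error
            updates += 1           # step 5: move on to the next point
    return w, b, updates


# Hypothetical noise-free toy data drawn from y = 2x + 1.
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(10)]
w, b, updates = train_sgd(data)
print(w, b, updates)
```

With 10 data points and 500 epochs the loop performs 10 × 500 = 5000 updates, matching the "one iteration per data point" count above.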

---

### 3️⃣ Advantages

* **Memory efficient:** Can train on systems with limited RAM or GPU memory.
* **Scales to large datasets:** Each update needs only one sample in memory, so the full dataset never has to be loaded at once.

---

### 4️⃣ Disadvantages

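1. **Time-consuming:** One weight update per data point means many small sequential updates per epoch, which is slow for large datasets.
2. **Noisy updates:** The convergence path jumps around instead of following the smooth path of GD.
3. **Slower convergence:** May take more epochs to settle near a minimum, because individual steps do not always reduce the overall loss.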

---

### 5️⃣ Noise explanation

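* Because each weight update is based on **one sample**, the loss fluctuates from one update to the next.
* The training curve is jagged rather than smooth.
* SGD usually still ends up near a minimum, but the path is less stable. Reaching the global minimum is only assured under extra conditions, such as a convex loss and a suitably decaying learning rate.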
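
To make the source of the noise concrete, the sketch below compares per-sample gradients with the full-dataset gradient at a fixed weight. The model $\hat{y} = wx$, the squared loss, and the four data points are assumptions made up for this illustration.

```python
import statistics

# Toy 1-D model y_hat = w * x with loss 0.5 * (y_hat - y)**2 (illustrative data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
w = 1.0

# Gradient each sample would produce on its own: dL/dw = (w*x - y) * x.
per_sample_grads = [(w * x - y) * x for x, y in zip(xs, ys)]

# Gradient that full-batch GD would use: the average over all samples.
full_grad = sum(per_sample_grads) / len(per_sample_grads)

# How far the single-sample gradients scatter around that average.
spread = statistics.pstdev(per_sample_grads)

print(per_sample_grads, full_grad, spread)
```

Each single-sample gradient differs from the average, so an SGD step taken on one sample points in a slightly different direction than the GD step would, which is what produces the jagged curve.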

---

### 6️⃣ Solution to noise

* **Mini-batch SGD:** Uses a small batch of data points (e.g., 32 or 64) per update instead of one.
* Averaging the gradient over the batch reduces noise and speeds up convergence while keeping memory requirements low.
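
A minimal mini-batch variant might look like the sketch below. As before, the linear model, squared loss, toy data, hyperparameters, and the name `minibatch_sgd` are assumptions for illustration. The only change from per-sample SGD is that the gradient is averaged over a batch before each update.

```python
import random


def minibatch_sgd(data, batch_size=32, lr=0.5, epochs=300, seed=0):
    """Fit y ~ w*x + b with mini-batch SGD; returns (w, b, number_of_updates)."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    updates = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]   # last batch may be smaller
            errors = [(w * x + b - y, x) for x, y in batch]
            grad_w = sum(e * x for e, x in errors) / len(batch)  # batch average
            grad_b = sum(e for e, _ in errors) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
            updates += 1
    return w, b, updates


# Hypothetical noise-free toy data drawn from y = 2x + 1.
data = [(i / 100.0, 2.0 * (i / 100.0) + 1.0) for i in range(100)]
w, b, updates = minibatch_sgd(data)
print(w, b, updates)
```

With 100 points and a batch size of 32, each epoch has 4 updates (batches of 32, 32, 32, and 4) instead of 100, so 300 epochs give 1200 updates in total.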

---



**Summary:**

* **SGD = weight update per sample** → memory efficient, introduces noise, slower convergence.
* **Mini-batch SGD = weight update per batch** → balances speed, stability, and memory.

