```{contents}
```


## Optimization Techniques

### 1. Weight Initialization

Proper initialization helps avoid slow learning and vanishing or exploding gradients.

* **Random Initialization** – Simple baseline, but a poorly chosen scale can make activations or gradients vanish or explode
* **Xavier/Glorot Initialization** – For sigmoid/tanh activations
* **He Initialization** – For ReLU and variants

Goal: Keep activations and gradients stable across layers.
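
As a rough illustration of how the Xavier/Glorot and He schemes pick their scale, here is a minimal sketch using only the standard library. The function names and the nested-list weight matrix are assumptions made for this example; deep learning frameworks ship their own initializers.

```python
import math
import random

def xavier_uniform_limit(fan_in, fan_out):
    """Bound a for sampling from U(-a, a) under Xavier/Glorot uniform init."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_normal_std(fan_in):
    """Standard deviation for He normal init (intended for ReLU layers)."""
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, scheme="he", seed=0):
    """Return a fan_out x fan_in weight matrix as nested lists (hypothetical helper)."""
    rng = random.Random(seed)
    if scheme == "xavier":
        a = xavier_uniform_limit(fan_in, fan_out)
        return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]
    if scheme == "he":
        std = he_normal_std(fan_in)
        return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]
    raise ValueError(f"unknown scheme: {scheme}")

W = init_weights(fan_in=256, fan_out=128, scheme="he")
```

Both schemes shrink the scale as the layer gets wider, which is what keeps the variance of activations roughly constant from layer to layer.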

---

### 2. Learning Rate Scheduling

Adjusts the learning rate (LR) during training; the LR controls how large each weight update is.

* **Step Decay** – Reduce LR by a fixed factor every few epochs
* **Exponential Decay** – Reduce LR gradually every step
* **Reduce on Plateau** – Lower LR when the validation metric stops improving
* **Warm Restarts / Cyclical LR** – Vary LR in cycles

Goal: Fast initial learning + stable convergence.
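
The schedules above can be sketched as plain functions of the epoch or step. This is a simplified illustration: the parameter names (`drop`, `epochs_per_drop`, `decay_rate`, `cycle_len`) and their default values are assumptions for this example, and cosine annealing is used here as one common form of warm restarts.

```python
import math

def step_decay(base_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the LR by `drop` once every `epochs_per_drop` epochs."""
    return base_lr * drop ** (epoch // epochs_per_drop)

def exponential_decay(base_lr, step, decay_rate=0.01):
    """Shrink the LR smoothly at every step."""
    return base_lr * math.exp(-decay_rate * step)

def cosine_warm_restarts(base_lr, step, cycle_len, min_lr=0.0):
    """Cosine decay from base_lr to min_lr that restarts every cycle_len steps."""
    t = step % cycle_len
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t / cycle_len))

for epoch in (0, 10, 25):
    print(epoch, step_decay(0.1, epoch))
```

Reduce on Plateau is not shown because it depends on the validation history rather than on the step count alone.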

---

### 3. Gradient Clipping

Prevents **exploding gradients**, especially in deep and recurrent networks.

* Sets a max norm or value for gradients
* Keeps training stable
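
A minimal sketch of both clipping styles, assuming the gradients have been flattened into a single list of floats (the function names are made up for this example):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

def clip_by_value(grads, limit):
    """Clamp each gradient independently to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Norm clipping preserves the direction of the gradient vector, while value clipping can change it because each component is clamped separately.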

---

### 4. Regularization

Reduces overfitting by constraining the model's effective complexity.

* **L1 Regularization** – Encourages sparsity (weights can become exactly 0)
* **L2 Regularization (Weight Decay)** – Shrinks large weights; equivalent to weight decay under plain SGD
* **Dropout** – Randomly zeroes neuron activations during training
* **Max Norm Constraints** – Caps the norm of each neuron's weight vector
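
The four techniques above can be sketched on flat lists of floats. This is an illustrative sketch, not a framework API: the names `lam`, `p_drop`, and `c` are assumptions, and the dropout shown is the "inverted" variant that rescales surviving activations during training.

```python
import math
import random

def l1_penalty(weights, lam):
    """Term added to the loss: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Term added to the loss: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def max_norm(weights, c):
    """Rescale a weight vector so its L2 norm is at most c."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= c:
        return list(weights)
    return [w * c / norm for w in weights]

rng = random.Random(0)
hidden = dropout([1.0] * 8, p_drop=0.5, rng=rng)
```

With `training=False` the dropout function returns its input unchanged, which mirrors how dropout is disabled at inference time.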

---

### 5. Batch Normalization

Normalizes layer inputs using the mean and variance of each mini-batch during training.

* Stabilizes and speeds up convergence
* Was originally proposed as a way to reduce internal covariate shift
* Acts as mild regularization
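
To make the normalization step concrete, here is a minimal sketch for a single feature across one mini-batch. The scale `gamma` and shift `beta` are learnable in practice and are fixed here for simplicity; the running statistics used at inference time are left out.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

normalized = batch_norm([1.0, 2.0, 3.0])
```

The small constant `eps` guards against division by zero when every value in the batch is identical.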

---

### 6. Early Stopping

Stops training when validation loss stops improving.

* Prevents overfitting
* Saves compute time
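
One common way to decide when the loss has "stopped improving" is a patience counter. The sketch below replays a list of validation losses; `patience` and `min_delta` are assumed parameter names, and a real training loop would also save the model checkpoint at the best epoch.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement: remember it and reset the patience counter.
            best, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training used every epoch.
    return len(val_losses) - 1, best_epoch

stop, best = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9])
```

Here the loss last improves at epoch 2, so with a patience of 3 the loop stops at epoch 5 and the epoch-2 weights would be the ones to keep.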

---

### 7. Data Augmentation (for vision/NLP tasks)

Artificially expands training data.

* Rotations, flips, crops (for images)
* Noise addition, masking, synonym replacement (for text/audio)
* Improves generalization
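
For the image case, a flip and a random crop can be sketched on a tiny grayscale image stored as nested lists. The helper names are assumptions for this example; real pipelines typically rely on an image or tensor library.

```python
import random

def horizontal_flip(image):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def augment(image, rng, crop_h, crop_w, flip_prob=0.5):
    """Apply a random crop, then flip with probability flip_prob."""
    out = random_crop(image, crop_h, crop_w, rng)
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    return out

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sample = augment(img, random.Random(0), crop_h=2, crop_w=2)
```

Each call produces a slightly different view of the same image, so the label stays valid while the model sees more varied inputs.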

---

### 8. Momentum-Based Training (if supported by optimizer)

Can help move past shallow local minima and plateaus, and speeds up convergence.

* Adds a fraction of the previous update (a decaying sum of past gradients) to each new update

Momentum is not an optimizer itself, but it is used inside many optimizers.
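
A minimal sketch of the classical momentum update for a single scalar weight, applied to the toy loss f(w) = w². The names `velocity`, `lr`, and `momentum` and the values used are assumptions for this example.

```python
def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One update: velocity keeps a decaying sum of past gradient steps."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w, v = 1.0, 0.0
for _ in range(2):
    grad = 2.0 * w          # derivative of f(w) = w**2
    w, v = sgd_momentum_step(w, grad, v)
```

After the first step the velocity is nonzero, so the second step moves further than plain gradient descent with the same learning rate would.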

---

### 9. Label Smoothing

Prevents overconfident predictions by softening labels.

Example: With two classes, instead of the hard target "cat" = 1.0, other = 0.0, use 0.9 and 0.1. With more classes, the 0.1 is typically spread across the other classes.
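
Two common ways to soften a one-hot target are sketched below; which one a given library uses is not something this example assumes. `smooth_others` matches the 0.9 / 0.1 example, while `smooth_uniform` mixes the one-hot vector with a uniform distribution over all classes. The function names and `eps` are illustrative.

```python
def smooth_others(one_hot, eps=0.1):
    """True class gets 1 - eps; the other classes share eps equally."""
    k = len(one_hot)
    return [1.0 - eps if y == 1 else eps / (k - 1) for y in one_hot]

def smooth_uniform(one_hot, eps=0.1):
    """Mix the one-hot target with a uniform distribution over all k classes."""
    k = len(one_hot)
    return [(1.0 - eps) * y + eps / k for y in one_hot]

two_class = smooth_others([1, 0])           # true class 0.9, other class 0.1
four_class = smooth_uniform([1, 0, 0, 0])   # true class 0.925, others 0.025 each
```

In both variants the smoothed targets still sum to 1, so they remain a valid probability distribution for a cross-entropy loss.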

---

### 10. Gradient Accumulation

Useful when memory limits the batch size.

* Accumulates gradients over several small batches
* Updates weights once every N small batches, simulating a batch N times larger
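
The accumulate-then-update pattern can be sketched for a single scalar weight, where each micro-batch gradient is given as a precomputed float. The names `accum_steps` and `micro_batch_grads` are assumptions for this example; in a framework the same idea is expressed by delaying the optimizer step.

```python
def train_with_accumulation(micro_batch_grads, w, lr=0.1, accum_steps=4):
    """Apply one weight update per `accum_steps` micro-batches."""
    accumulated = 0.0
    updates = 0
    for i, g in enumerate(micro_batch_grads, start=1):
        # Divide so the sum equals the mean gradient of one large batch.
        accumulated += g / accum_steps
        if i % accum_steps == 0:
            w -= lr * accumulated
            accumulated = 0.0
            updates += 1
    # Leftover gradients from an incomplete final group are discarded here.
    return w, updates

w_final, n_updates = train_with_accumulation(
    [1.0, 2.0, 3.0, 4.0, 1.0, 1.0, 1.0, 1.0], w=0.0
)
```

Eight micro-batches with `accum_steps=4` produce two weight updates, each using the average gradient of its group of four.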

---

### ✅ In Short:

| Goal                | Techniques                                                |
| ------------------- | --------------------------------------------------------- |
| Prevent Overfitting | Dropout, L1/L2, Early Stopping, Data Augmentation         |
| Stabilize Training  | Batch Norm, Gradient Clipping, Weight Init                |
| Improve Convergence | Learning Rate Scheduling, Momentum, Gradient Accumulation |

