```{contents}
```


# Cross-Validation

## Why do we need Cross-Validation?

* When training ML models, we split data into:

  * **Training set** → model learns patterns.
  * **Validation set** → used for hyperparameter tuning.
  * **Test set** → unseen data, used only at the end to check performance.

If we split the data only once (say 70% train, 30% test), the measured score depends on which records happen to land in the test set.
👉 Cross-validation helps us get a **more reliable performance estimate** by training and validating on multiple splits.
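
The effect of the random split can be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed; the iris dataset, the logistic regression model, and the seeds 0 to 4 are illustrative choices, not part of these notes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Single 70/30 hold-out split, repeated with different random seeds.
holdout_scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    holdout_scores.append(model.fit(X_train, y_train).score(X_test, y_test))

# 5-fold cross-validation: every record is used for validation exactly once.
cv_scores = cross_val_score(model, X, y, cv=5)

print("hold-out accuracy per seed:", [round(s, 3) for s in holdout_scores])
print("5-fold CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
```

The hold-out scores typically differ from seed to seed, while the cross-validated mean summarizes several splits in one number.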

---

## Types of Cross-Validation

### 1. **Leave-One-Out CV (LOOCV)**

* Take one data point as validation, rest as training.
* Repeat for every data point.
* Final score = average over all *n* experiments.

✅ Advantage: maximum use of training data.
❌ Disadvantages:

* Computationally expensive (n experiments for n records).
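* High-variance estimate: each validation set is a single record, and the *n* training sets are almost identical, so the individual results are noisy and strongly correlated.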
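
A minimal LOOCV sketch, assuming scikit-learn's `LeaveOneOut` splitter. The 30-record subsample of iris and the logistic regression model are illustrative assumptions chosen to keep the *n* fits cheap.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
X, y = X[::5], y[::5]  # keep 30 records (10 per class) so n fits stay cheap

loo = LeaveOneOut()
# One fit per record; each score is 0.0 or 1.0 (one validation record).
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

n_experiments = loo.get_n_splits(X)
loo_accuracy = scores.mean()
print(n_experiments, "experiments, mean accuracy:", round(loo_accuracy, 3))
```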

---

### 2. **Leave-P-Out CV**

* Instead of leaving 1 record, leave *p* records as validation.
* Train on the rest.
* Repeat for every possible combination of *p* records, i.e. C(n, p) splits for *n* records.

✅ More flexible than LOOCV (which is the special case *p* = 1).
❌ Impractical for large datasets (combinatorial explosion).
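
The combinatorial growth can be checked directly. This sketch assumes scikit-learn's `LeavePOut` splitter and uses a toy array of 10 records; the sizes 10, 100, and 1000 are illustrative.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(10).reshape(-1, 1)  # 10 toy records
lpo = LeavePOut(p=2)

# Enumerate every split: each one holds out a distinct pair of records.
splits = list(lpo.split(X))
n_splits = len(splits)
print(n_splits, "splits for n=10, p=2")

# The number of model fits is C(n, p), which grows quickly with n.
for n in (10, 100, 1000):
    print(f"n={n:>4}, p=2 -> {comb(n, 2)} fits")
```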

---

### 3. **K-Fold Cross-Validation**

* Split the dataset into *k* folds of (roughly) equal size.
* Train on *k-1* folds, validate on the remaining fold.
* Repeat *k* times (each fold used once as validation).
* Final score = average of all folds.

✅ Balance between efficiency and reliability.
✅ Most common method in ML.
❌ A fold may not preserve the overall class distribution (a problem for classification, especially with imbalanced classes).
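
The steps above can be written out by hand with scikit-learn's `KFold`. This is a sketch under assumptions: the iris dataset, the logistic regression model, and *k* = 5 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# shuffle=True matters here: iris is sorted by class, so unshuffled
# folds would each be dominated by one or two classes.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
val_seen = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])            # train on k-1 folds
    fold_scores.append(model.score(X[val_idx], y[val_idx]))  # validate on 1
    val_seen.extend(val_idx.tolist())

final_score = sum(fold_scores) / len(fold_scores)    # average over folds
print("per-fold accuracy:", [round(s, 3) for s in fold_scores])
print("final score:", round(final_score, 3))
```

The same result is available in one call through `cross_val_score(model, X, y, cv=kf)`; the explicit loop is shown only to mirror the bullet points.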

---

### 4. **Stratified K-Fold CV**

* Same as K-Fold, but ensures **class distribution** is preserved in each fold.
* Example: if the dataset has 60% positive and 40% negative labels → each fold keeps a \~60:40 ratio.

✅ Very important for classification tasks with **imbalanced data**.
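
The 60:40 example can be reproduced with a synthetic label vector. This sketch assumes scikit-learn's `KFold` and `StratifiedKFold`; the 100-record dataset and the helper `positive_share` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([1] * 60 + [0] * 40)  # 60% positive, 40% negative
X = np.zeros((len(y), 1))          # features are irrelevant for the split


def positive_share(splitter):
    """Fraction of positive labels in each validation fold."""
    return [float(y[val_idx].mean()) for _, val_idx in splitter.split(X, y)]


plain = positive_share(KFold(n_splits=5, shuffle=True, random_state=0))
stratified = positive_share(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("plain K-Fold positive share:     ", [round(s, 2) for s in plain])
print("stratified K-Fold positive share:", [round(s, 2) for s in stratified])
```

With plain K-Fold the positive share can drift from fold to fold, while the stratified folds each hold 12 positives and 8 negatives.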

---

### 5. **Time Series Cross-Validation**

* In time series, order matters → can’t randomly shuffle.
* Train on past data, validate on future data.
* Example:

  * Train = Day 1–100, Validate = Day 101–120
  * Train = Day 1–120, Validate = Day 121–140

✅ Used in forecasting, stock prediction, and other tasks where records are ordered in time (e.g. tracking sentiment over time).
❌ Training size keeps growing → computationally heavier.
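
The day ranges in the example above can be reproduced with scikit-learn's `TimeSeriesSplit`. This is a sketch that assumes a version of scikit-learn where `TimeSeriesSplit` accepts the `test_size` parameter (0.24 or later); the `days` array is an illustrative stand-in for real time-ordered data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

days = np.arange(1, 141)  # Day 1 .. Day 140, already in time order

# Expanding training window, fixed 20-day validation window, no shuffling.
tscv = TimeSeriesSplit(n_splits=2, test_size=20)

windows = []
for train_idx, val_idx in tscv.split(days):
    train_days, val_days = days[train_idx], days[val_idx]
    window = (int(train_days[0]), int(train_days[-1]),
              int(val_days[0]), int(val_days[-1]))
    windows.append(window)
    print("Train = Day %d-%d, Validate = Day %d-%d" % window)
```

Every validation window lies strictly after its training window, so the model is never evaluated on data older than what it was trained on.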

---

## 🔹 Summary Table

| CV Type               | Works Best For            | Pros                   | Cons                             |
| --------------------- | ------------------------- | ---------------------- | -------------------------------- |
| **Hold-Out**          | Large datasets            | Simple, fast           | High variance, depends on split  |
| **LOOCV**             | Very small data           | Uses max training data | Slow, high-variance estimate     |
| **Leave-P-Out**       | Small datasets            | Flexible               | Impractical for big data         |
| **K-Fold**            | General ML tasks          | Reliable, efficient    | Random split may cause imbalance |
| **Stratified K-Fold** | Classification            | Maintains class ratios | Needs discrete class labels      |
| **Time Series CV**    | Forecasting/temporal data | Respects time order    | Increasing training size         |

---

## 🔹 Why Cross-Validation?

* Provides a more **stable estimate** of performance than a single split.
* Helps detect **overfitting** by testing the model on multiple validation sets.
* Essential for **hyperparameter tuning** (GridSearchCV, RandomizedSearchCV, Bayesian Optimization).
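
A minimal sketch of cross-validated hyperparameter tuning with `GridSearchCV`. The iris dataset, the logistic regression model, and the grid of `C` values are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each candidate C is scored by 5-fold CV; the best mean score wins.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=cv)
search.fit(X, y)

print("best C:", search.best_params_["C"])
print("mean CV accuracy of best C: %.3f" % search.best_score_)
```

Because the search picks its winner using the validation folds, the final performance check still belongs on a separate test set that the search never saw.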

