# CROSS VALIDATION TECHNIQUES

## 1). K-Fold Cross-Validation (Explained)

**What is it?**  

K-Fold Cross Validation is a way to check how well a machine learning model will perform on **unseen data**.  

Instead of keeping *one* fixed test set, we split the dataset into **K (roughly) equal parts, called folds**.  
- In each round, **1 fold** is used as the **test set**, and the other **K-1 folds** are used for **training**.  
- This repeats **K times**, so every sample gets a chance to be in the test set exactly once.  
- Finally, we average the performance across all folds.

---

### 🔎 Example: 12 samples with 4-fold CV
Imagine our dataset has 12 points (0–11). If we split into **4 folds**, each fold will have **3 samples**:

```
Fold 1: Test = [0,1,2]      | Train = [3..11]
Fold 2: Test = [3,4,5]      | Train = [0..2, 6..11]
Fold 3: Test = [6,7,8]      | Train = [0..5, 9..11]
Fold 4: Test = [9,10,11]    | Train = [0..8]
```

So each sample is tested exactly once.  
This avoids the risk of testing on the same data you trained on.
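
The fold layout above can be reproduced with scikit-learn's `KFold`. This is a minimal sketch: the array `indices` is just a stand-in for a real dataset, and `shuffle=False` is assumed so the folds stay contiguous as in the table.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(12)                 # stand-in dataset: sample IDs 0..11
kf = KFold(n_splits=4, shuffle=False)   # contiguous folds, no shuffling

folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(indices), start=1):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print(f"Fold {fold}: Test = {test_idx.tolist()} | Train = {train_idx.tolist()}")

# Collect every index that appeared in a test fold.
tested = sorted(i for _, test in folds for i in test)
print("Tested samples:", tested)
```

Each index shows up in `tested` exactly once, and no index is ever in the train and test side of the same fold.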

---

### ⚖️ Why is this useful?
- More **reliable estimate** of model accuracy (less dependent on a lucky/unlucky test split).  
- Helps spot problems like **overfitting**.  
- Gives a mean and a standard deviation, so you can tell whether your model is **stable**.

---

### 🛠️ Steps
1. Choose number of folds `K`.  
2. Split dataset into K parts.  
3. For each fold:  
   - Train on K-1 folds.  
   - Test on the remaining fold.  
4. Collect performance scores.  
5. Average them; the result is your **cross-validation score**.
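
The steps above can also be written out by hand, without a splitter class. The sketch below is illustrative only: `manual_kfold_scores` is a hypothetical helper name, the one-time shuffle with a fixed seed is an added assumption, and the dataset size (200 samples) is chosen just so every training fold contains both classes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def manual_kfold_scores(X, y, k, seed=0):
    """Hypothetical helper: shuffle once, cut into k folds, score each fold."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))        # step 2: shuffle, then split
    folds = np.array_split(order, k)       # k roughly equal index arrays
    scores = []
    for i in range(k):                     # step 3: one round per fold
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = LogisticRegression()
        model.fit(X[train_idx], y[train_idx])                  # train on k-1 folds
        scores.append(model.score(X[test_idx], y[test_idx]))   # test on the rest
    return np.array(scores)                # step 4: collected scores

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
scores = manual_kfold_scores(X, y, k=5)    # step 1: K = 5
print("Fold scores:", scores)
print("CV score (mean):", scores.mean(), "| std:", scores.std())  # step 5
```

In practice `cross_val_score` does the same bookkeeping; the manual loop is mainly useful for seeing what happens in each round.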

---

### ⚠️ Disadvantages & When Not to Use K-Fold Cross-Validation

- **Computationally expensive:** Training and evaluating the model K times can be slow, especially for large datasets or complex models.
- **Not ideal for time series data:** K-Fold ignores temporal order, so a model can end up trained on observations that come after the ones it is tested on (with or without shuffling). For time series, use techniques like TimeSeriesSplit or walk-forward validation.
- **Imbalanced data:** If classes are not evenly distributed, some folds may have very few samples of a class. Use StratifiedKFold for classification tasks with imbalanced classes.
- **Data leakage risk:** If data points are not independent (e.g., multiple rows per user), splitting randomly can leak information between train and test sets. GroupKFold or custom splits are better in such cases.
- **Small datasets:** On very small datasets, the variance between folds can be high, making results less reliable.
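
Two of the alternatives named above, `GroupKFold` and `TimeSeriesSplit`, can be checked on toy indices. This is a sketch under assumptions: the `groups` array (4 hypothetical users with 3 rows each) is made up for illustration, and the rows of `X` are assumed to be in time order for the second loop.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
groups = np.repeat([0, 1, 2, 3], 3)   # hypothetical: 4 users, 3 rows each

# GroupKFold: all rows of a user stay on the same side of the split.
group_overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, groups=groups):
    shared = set(groups[train_idx].tolist()) & set(groups[test_idx].tolist())
    group_overlaps.append(len(shared))
    print("Test users:", sorted(set(groups[test_idx].tolist())))

# TimeSeriesSplit: training indices always precede the test indices.
ts_ok = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    ts_ok.append(bool(train_idx.max() < test_idx.min()))
    print(f"Train = {train_idx.tolist()} | Test = {test_idx.tolist()}")
```

No user appears on both sides of a `GroupKFold` split, and every `TimeSeriesSplit` round tests only on rows that come after its training rows.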

```python
# Example: K-Fold Cross-Validation in Python (sklearn)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

np.set_printoptions(precision=2, suppress=True, floatmode='fixed', linewidth=100)

# Create a small synthetic dataset (12 samples, 2 features)
X, y = make_classification(n_samples=12, n_features=2, n_informative=2, n_redundant=0, random_state=42)

# 4 contiguous folds of 3 samples each (no shuffling)
kf = KFold(n_splits=4, shuffle=False)
model = LogisticRegression()

# cv selects the cross-validation splitter; one accuracy score is returned per fold
scores = cross_val_score(model, X, y, cv=kf)
print("Sklearn cross_val_score scores:", np.array2string(scores, separator=','))
print("Sklearn Mean accuracy:", np.mean(scores))
print("Sklearn Std deviation:", np.std(scores))
```

```
Sklearn cross_val_score scores: [1.00,1.00,0.33,0.00]
Sklearn Mean accuracy: 0.5833333333333334
Sklearn Std deviation: 0.4330127018922193
```


## 2). Hold-Out Method (Explained)

**What is it?**  
The Hold-Out method is the simplest way to evaluate a machine learning model. You split your dataset into two (sometimes three) parts:
- **Training set:** Used to train the model.
- **Test set:** Used to evaluate the model's performance on unseen data.
- (Optional) **Validation set:** Used to tune model parameters before the final test.

---

### 🔎 Example: 12 samples with Hold-Out (8 train, 4 test)
Imagine our dataset has 12 points (0–11). If we use an 8/4 split:

```
Train = [0,1,2,3,4,5,6,7]   | Test = [8,9,10,11]
```

Or, visually:

```
[Train, Train, Train, Train, Train, Train, Train, Train, Test, Test, Test, Test]
[   0 ,    1 ,    2 ,    3 ,    4 ,    5 ,    6 ,    7 ,   8 ,   9 ,  10 , 11 ]
```

Here only the last 4 samples are used for testing and the rest for training. (This example splits sequentially; the split can also be random.)

---

### 🛠️ Steps
1. Shuffle the dataset (optional, but recommended).
2. Split the data into training and test sets (commonly 70/30, 80/20, or 60/20/20 for train/validation/test).
3. Train the model on the training set.
4. Evaluate the model on the test set.
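
For the three-way 60/20/20 variant mentioned in step 2, one common approach is to call `train_test_split` twice. The sketch below assumes a stand-in dataset of 100 samples; the second `test_size` is 0.25 because 25% of the remaining 80% equals 20% of the whole.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # 100 stand-in samples
y = np.arange(100) % 2              # alternating 0/1 labels

# First cut: hold out 20% as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Second cut: 25% of the remaining 80% becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20
```

The validation set is used for tuning; the test set is touched only once, at the very end.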

---

### ✅ Advantages
- **Simple and fast:** Only one split and one training/testing cycle.
- **Good for very large datasets:** When you have lots of data, a single split is often enough.

---

### ⚠️ Disadvantages
- **High variance:** Results depend heavily on how the data is split. A "lucky" or "unlucky" split can give misleading results.
- **Not ideal for small datasets:** You might waste valuable data for testing instead of training.
- **No estimate of model stability:** You only get one performance score, not an average or standard deviation.
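
One way to see the dependence on the split is to repeat the hold-out procedure with different random seeds and compare the scores. This is only a sketch: the dataset size, the `flip_y=0.2` label noise, and the ten seeds are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Noisy synthetic data so that the test score is not trivially perfect.
X, y = make_classification(n_samples=100, n_features=5, flip_y=0.2, random_state=0)

holdout_scores = []
for seed in range(10):
    # Same data, same model, only the random split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression().fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

holdout_scores = np.array(holdout_scores)
print("Scores per split:", holdout_scores.round(2))
print("Min:", holdout_scores.min(), "| Max:", holdout_scores.max())
```

The gap between the minimum and maximum gives a rough feel for how much a single hold-out score can move with the split.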

---

**Summary:**
- The Hold-Out method is quick and easy, but less reliable for small datasets or when you want a robust estimate of model performance.

```python
# Example: Hold-Out Method in Python
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Create a synthetic dataset
X, y = make_classification(n_samples=12, n_features=2, n_informative=2, n_redundant=0, random_state=42)

# Split into 8 train and 4 test samples.
# shuffle=False keeps the original order, so random_state has no effect here.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=4, random_state=42, shuffle=False)

# Show which samples are in train/test (valid because the split is sequential)
print("Train indices:", np.arange(len(X_train)))
print("Test indices:", np.arange(len(X_train), len(X)))

# Train and evaluate
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Test set predictions:", y_pred)
print("Test set true labels:", y_test)
print("Test set accuracy:", accuracy_score(y_test, y_pred))
```

```
Train indices: [0 1 2 3 4 5 6 7]
Test indices: [ 8  9 10 11]
Test set predictions: [1 1 0 1]
Test set true labels: [1 0 0 0]
Test set accuracy: 0.5
```


## 3). Stratified K-Fold Cross-Validation (Explained)

**What is it?**  
Stratified K-Fold Cross-Validation is a variation of K-Fold that ensures each fold has (as much as possible) the same proportion of each class label as the full dataset. This is especially important for classification problems with imbalanced classes.

---

### 🔎 Visual Example: 12 samples, 2 classes (0 and 1), 4-fold stratified
Suppose our dataset labels are:

```
Labels: [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
```

- **Normal K-Fold** might split the data like this (class balance not guaranteed):

```
Fold 1: [0, 0, 0]   (all class 0)
Fold 2: [0, 1, 1]   (mixed)
Fold 3: [1, 1, 0]   (mixed)
Fold 4: [0, 1, 1]   (mixed)
```

- **Stratified K-Fold** tries to keep the class ratio in each fold similar to the whole dataset:

```
Fold 1: [0, 1, 0]
Fold 2: [0, 1, 1]
Fold 3: [0, 1, 1]
Fold 4: [0, 1, 0]
```

So, if the dataset is 50% class 0 and 50% class 1, each fold will be close to 50/50.
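
The comparison can be checked directly on the label vector from this example. In the sketch below the feature matrix `X` is a dummy array of zeros (an assumption; the splitters only need its length), and the exact stratified assignment may differ from the illustration above, although each fold still contains both classes.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1])
X = np.zeros((len(y), 1))   # dummy features; irrelevant for the split itself

# Test-fold labels for plain (sequential) K-Fold and for Stratified K-Fold.
plain = [y[test].tolist() for _, test in KFold(n_splits=4).split(X)]
strat = [y[test].tolist() for _, test in StratifiedKFold(n_splits=4).split(X, y)]

for i, (p, s) in enumerate(zip(plain, strat), start=1):
    print(f"Fold {i}: K-Fold labels = {p} | Stratified labels = {s}")
```

Plain K-Fold produces a first fold with only class 0, while every stratified fold holds one or two samples of each class.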

---

### 🛠️ Steps
1. Choose number of folds `K`.
2. Split the dataset so each fold has a similar class distribution as the whole dataset.
3. For each fold:
   - Train on K-1 folds.
   - Test on the remaining fold.
4. Collect and average performance scores.

---

### 🔄 Difference: K-Fold vs. Stratified K-Fold
- **K-Fold:** Splits data into K parts without looking at the labels, so class balance in each fold is not guaranteed.
- **Stratified K-Fold:** Splits data so each fold has a similar class distribution as the full dataset (better for imbalanced data).

---

### ✅ When to use Stratified K-Fold?
- When you have **classification problems** (especially with imbalanced classes).
- When you want more reliable and fair evaluation across all classes.

---

**Summary:**
- Use Stratified K-Fold for classification tasks to avoid folds with missing or underrepresented classes.

```python
# Example: Stratified K-Fold Cross-Validation in Python
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Create a synthetic two-class dataset (weights=[0.5, 0.5] requests balanced classes)
X, y = make_classification(n_samples=12, n_features=2, n_informative=2, n_redundant=0, n_classes=2, weights=[0.5, 0.5], random_state=42)

# Show class distribution
print("Labels:", y)

skf = StratifiedKFold(n_splits=4, shuffle=False)
scores = []
# split() needs y so it can preserve the class ratio in every fold
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {fold} - Test indices: {test_idx}, Test labels: {y[test_idx]}")
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"  Accuracy: {acc:.2f}")
    scores.append(acc)
print("Stratified K-Fold scores:", np.array2string(np.array(scores), separator=','))
print("Mean accuracy:", np.mean(scores))
print("Std deviation:", np.std(scores))
```

```
Labels: [0 0 1 1 1 0 1 1 1 0 0 0]
Fold 1 - Test indices: [0 1 2], Test labels: [0 0 1]
  Accuracy: 1.00
Fold 2 - Test indices: [3 5 9], Test labels: [1 0 0]
  Accuracy: 0.67
Fold 3 - Test indices: [ 4  6 10], Test labels: [1 1 0]
  Accuracy: 0.67
Fold 4 - Test indices: [ 7  8 11], Test labels: [1 1 0]
  Accuracy: 0.67
Stratified K-Fold scores: [1.00,0.67,0.67,0.67]
Mean accuracy: 0.7499999999999999
Std deviation: 0.14433756729740646
```


## 4). Leave-One-Out Cross-Validation (LOO or LOOCV)

**What is it?**  
Leave-One-Out Cross-Validation is an extreme case of K-Fold where the number of folds equals the number of samples in the dataset. Each sample is used once as a test set (of size 1), and the rest as the training set.

---

### 🔎 Visual Example: 5 samples
Suppose our dataset has 5 samples (0–4):

```
Iteration 1: Test = [0] | Train = [1,2,3,4]
Iteration 2: Test = [1] | Train = [0,2,3,4]
Iteration 3: Test = [2] | Train = [0,1,3,4]
Iteration 4: Test = [3] | Train = [0,1,2,4]
Iteration 5: Test = [4] | Train = [0,1,2,3]
```

So, each sample is tested exactly once, and the model is trained on all the others each time.
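
The "extreme case of K-Fold" description can be verified on these 5 samples: `LeaveOneOut` should produce the same splits as `KFold` with as many folds as samples. A minimal sketch, with `X` as a stand-in array of sample IDs:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(5).reshape(-1, 1)   # stand-in dataset: 5 samples

# Collect (train, test) index lists from both splitters.
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X)]
kf_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=len(X)).split(X)]

for i, (train, test) in enumerate(loo_splits, start=1):
    print(f"Iteration {i}: Test = {test} | Train = {train}")
print("Same splits as KFold(n_splits=5):", loo_splits == kf_splits)
```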

---

### 🛠️ Steps
1. For each sample in the dataset:
   - Use that sample as the test set.
   - Use all other samples as the training set.
   - Train and evaluate the model.
2. Collect all performance scores and average them.

---

### ✅ Advantages
- **Maximizes training data:** Each model is trained on as much data as possible (all but one sample).
- **No randomness:** Every possible single-sample test split is used, so the splits do not depend on a random seed.
- **Good for very small datasets:** Makes the most of limited data.

---

### ⚠️ Disadvantages
- **Very slow for large datasets:** Number of models trained = number of samples.
- **High variance:** Each test set is just one sample, so results can be noisy.
- **Not practical for big data.**

---

**Summary:**
- Leave-One-Out is best for small datasets where you want to use every sample for both training and testing, but is too slow for large datasets.

```python
# Example: Leave-One-Out Cross-Validation in Python
from sklearn.model_selection import LeaveOneOut
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Create a small synthetic dataset
X, y = make_classification(n_samples=8, n_features=2, n_informative=2, n_redundant=0, random_state=42)

loo = LeaveOneOut()
scores = []
# One iteration per sample: the test set is a single row
for iteration, (train_idx, test_idx) in enumerate(loo.split(X), start=1):
    print(f"Iteration {iteration} - Test index: {test_idx[0]}, Test label: {y[test_idx][0]}")
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)  # 0.0 or 1.0 for a single sample
    print(f"  Prediction: {y_pred[0]}, Accuracy: {acc:.2f}")
    scores.append(acc)
print("LOO scores:", np.array2string(np.array(scores), separator=','))
print("Mean accuracy:", np.mean(scores))
print("Std deviation:", np.std(scores))
```

```
Iteration 1 - Test index: 0, Test label: 0
  Prediction: 0, Accuracy: 1.00
Iteration 2 - Test index: 1, Test label: 1
  Prediction: 1, Accuracy: 1.00
Iteration 3 - Test index: 2, Test label: 0
  Prediction: 1, Accuracy: 0.00
Iteration 4 - Test index: 3, Test label: 0
  Prediction: 0, Accuracy: 1.00
Iteration 5 - Test index: 4, Test label: 1
  Prediction: 0, Accuracy: 0.00
Iteration 6 - Test index: 5, Test label: 1
  Prediction: 0, Accuracy: 0.00
Iteration 7 - Test index: 6, Test label: 0
  Prediction: 0, Accuracy: 1.00
Iteration 8 - Test index: 7, Test label: 1
  Prediction: 1, Accuracy: 1.00
LOO scores: [1.00,1.00,0.00,1.00,0.00,0.00,1.00,1.00]
Mean accuracy: 0.625
Std deviation: 0.4841229182759271
```


## 5). Leave-P-Out Cross-Validation (LPOCV)

**What is it?**  
Leave-P-Out Cross-Validation is a generalization of Leave-One-Out. Instead of leaving out one sample at a time, you leave out every possible combination of `P` samples as the test set, and use the remaining samples for training. This is repeated for all possible combinations.

---

### 🔎 Visual Example: 4 samples, P=2
Suppose our dataset has 4 samples (0–3) and we set P=2:

All possible ways to leave out 2 samples:
```
Iteration 1: Test = [0,1] | Train = [2,3]
Iteration 2: Test = [0,2] | Train = [1,3]
Iteration 3: Test = [0,3] | Train = [1,2]
Iteration 4: Test = [1,2] | Train = [0,3]
Iteration 5: Test = [1,3] | Train = [0,2]
Iteration 6: Test = [2,3] | Train = [0,1]
```

So, for each possible pair of samples, those are used as the test set, and the rest for training.
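
The six splits listed above can be generated with scikit-learn's `LeavePOut`. A minimal sketch, with `X` as a stand-in array of 4 sample IDs:

```python
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(4).reshape(-1, 1)   # stand-in dataset: 4 samples
lpo = LeavePOut(p=2)

# Store (test, train) index lists for every pair of left-out samples.
splits = [(te.tolist(), tr.tolist()) for tr, te in lpo.split(X)]
for i, (test, train) in enumerate(splits, start=1):
    print(f"Iteration {i}: Test = {test} | Train = {train}")

n_splits = lpo.get_n_splits(X)
print("Number of splits:", n_splits)
```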

---

### 🛠️ Steps
1. For every possible combination of `P` samples:
   - Use those `P` samples as the test set.
   - Use the remaining samples as the training set.
   - Train and evaluate the model.
2. Collect all performance scores and average them.

---

### ✅ Advantages
- **Very thorough:** Every possible way to split out `P` samples is tested.
- **Good for very small datasets:** Makes the most of limited data.
- **No randomness:** All possible splits are used.

---

### ⚠️ Disadvantages
- **Extremely slow for large datasets or large P:** The number of combinations grows rapidly (combinatorial explosion).
- **Not practical for big data.**
- **High computational cost:** Not used in practice for anything but very small datasets.
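
To put numbers on the combinatorial growth: the number of Leave-P-Out splits is the binomial coefficient "n choose P", i.e. n! / (P! (n - P)!). The sketch below uses a hypothetical helper `n_lpo_splits` and a few arbitrary dataset sizes.

```python
from math import comb

def n_lpo_splits(n_samples, p):
    """Hypothetical helper: number of models Leave-P-Out has to train."""
    return comb(n_samples, p)

# Rows: dataset size; columns: P = 1, 2, 3
for n in (6, 20, 100):
    print(n, [n_lpo_splits(n, p) for p in (1, 2, 3)])
```

With only 100 samples, P=2 already means 4,950 models and P=3 means 161,700.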

---

**Summary:**
- Leave-P-Out is a very exhaustive method, best for small datasets and small values of P. For larger datasets, it quickly becomes impractical.

```python
# Example: Leave-P-Out Cross-Validation in Python (P=2)
from sklearn.model_selection import LeavePOut
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Create a very small synthetic dataset
X, y = make_classification(n_samples=6, n_features=2, n_informative=2, n_redundant=0, random_state=42)

lpo = LeavePOut(p=2)
scores = []
# One iteration per pair of samples left out as the test set
for iteration, (train_idx, test_idx) in enumerate(lpo.split(X), start=1):
    print(f"Iteration {iteration} - Test indices: {test_idx}, Test labels: {y[test_idx]}")
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"  Predictions: {y_pred}, Accuracy: {acc:.2f}")
    scores.append(acc)
print("LPOCV scores:", np.array2string(np.array(scores), separator=','))
print("Mean accuracy:", np.mean(scores))
print("Std deviation:", np.std(scores))
```

```
Iteration 1 - Test indices: [0 1], Test labels: [0 0]
  Predictions: [1 1], Accuracy: 0.00
Iteration 2 - Test indices: [0 2], Test labels: [0 0]
  Predictions: [1 1], Accuracy: 0.00
Iteration 3 - Test indices: [0 3], Test labels: [0 1]
  Predictions: [0 1], Accuracy: 1.00
Iteration 4 - Test indices: [0 4], Test labels: [0 1]
  Predictions: [0 1], Accuracy: 1.00
Iteration 5 - Test indices: [0 5], Test labels: [0 1]
  Predictions: [0 0], Accuracy: 0.50
Iteration 6 - Test indices: [1 2], Test labels: [0 0]
  Predictions: [0 1], Accuracy: 0.50
Iteration 7 - Test indices: [1 3], Test labels: [0 1]
  Predictions: [0 1], Accuracy: 1.00
Iteration 8 - Test indices: [1 4], Test labels: [0 1]
  Predictions: [0 1], Accuracy: 1.00
Iteration 9 - Test indices: [1 5], Test labels: [0 1]
  Predictions: [0 0], Accuracy: 0.50
Iteration 10 - Test indices: [2 3], Test labels: [0 1]
  Predictions: [0 1], Accuracy: 1.00
Iteration 11 - Test indices: [2 4], Test labels: [0 1]
  Predictions: [0 1], Accuracy: 1.00
...
```

## 6). Monte Carlo Cross-Validation (Shuffle-Split)

**What is it?**  
Monte Carlo Cross-Validation (also called Shuffle-Split) is a resampling method where the dataset is randomly split into training and test sets multiple times. Each split is independent, and the process is repeated for a specified number of iterations.

---

### 🔎 Visual Example: 8 samples, 3 iterations, 75% train / 25% test
Suppose our dataset has 8 samples (0–7). For each iteration, we randomly split the data:

```
Iteration 1: Train = [0,2,3,4,5,7] | Test = [1,6]
Iteration 2: Train = [1,2,4,5,6,7] | Test = [0,3]
Iteration 3: Train = [0,1,2,3,5,6] | Test = [4,7]
```

- The splits are random and can overlap.
- Each sample may appear in the test set multiple times, or not at all.
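
A quick way to see this uneven coverage is to count how often each sample lands in a test set. The sketch below assumes `random_state=0` purely for reproducibility; the particular indices drawn depend on that seed, so no specific split is claimed here.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(8).reshape(-1, 1)   # stand-in dataset: 8 samples
ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

# Tally how many times each sample is drawn into a test set.
test_counts = np.zeros(len(X), dtype=int)
for train_idx, test_idx in ss.split(X):
    test_counts[test_idx] += 1
    print(f"Train = {sorted(train_idx.tolist())} | Test = {sorted(test_idx.tolist())}")

print("Times each sample was tested:", test_counts.tolist())
```

Three iterations with two test samples each give six test slots for eight samples, so at least two samples are never tested in this setup.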

---

### 🛠️ Steps
1. Choose the number of iterations and the train/test split ratio.
2. For each iteration:
   - Randomly split the data into training and test sets.
   - Train and evaluate the model.
3. Collect all performance scores and average them.

---

### ✅ Advantages
- **Flexible:** You can control the number of splits and the size of train/test sets.
- **Good for small to medium datasets:** More robust than a single hold-out split.
- **Less computationally expensive than exhaustive methods (like LPOCV).**

---

### ⚠️ Disadvantages
- **Randomness:** Some samples may never be in the test set, others may be in the test set multiple times.
- **Not as systematic as K-Fold:** Not all samples are guaranteed to be tested exactly once.
- **Results can vary between runs unless you set a random seed.**

---

**Summary:**
- Monte Carlo Cross-Validation is a flexible, randomized method for estimating model performance, especially useful when you want to repeat random splits many times.

```python
# Example: Monte Carlo Cross-Validation (Shuffle-Split) in Python
from sklearn.model_selection import ShuffleSplit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Create a small synthetic dataset
X, y = make_classification(n_samples=8, n_features=2, n_informative=2, n_redundant=0, random_state=42)

# 3 independent random splits, each with 2 test samples (6 train / 2 test)
ss = ShuffleSplit(n_splits=3, test_size=2, random_state=42)
scores = []
for iteration, (train_idx, test_idx) in enumerate(ss.split(X), start=1):
    print(f"Iteration {iteration} - Train indices: {train_idx}, Test indices: {test_idx}")
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"  Predictions: {y_pred}, True: {y_test}, Accuracy: {acc:.2f}")
    scores.append(acc)
print("ShuffleSplit scores:", np.array2string(np.array(scores), separator=','))
print("Mean accuracy:", np.mean(scores))
print("Std deviation:", np.std(scores))
```

```
Iteration 1 - Train indices: [0 7 2 4 3 6], Test indices: [1 5]
  Predictions: [1 0], True: [1 1], Accuracy: 0.50
Iteration 2 - Train indices: [0 4 5 2 1 6], Test indices: [3 7]
  Predictions: [1 1], True: [0 1], Accuracy: 0.50
Iteration 3 - Train indices: [3 1 4 5 2 7], Test indices: [0 6]
  Predictions: [1 1], True: [0 0], Accuracy: 0.00
ShuffleSplit scores: [0.50,0.50,0.00]
Mean accuracy: 0.3333333333333333
Std deviation: 0.23570226039551584
```
