# 🎯 Cross-Validation in Machine Learning
### Siva.Jasthi@metrostate.edu
### Machine Learning and Data Mining

---

## 📚 What You'll Learn Today

In this notebook, you'll discover:
- What cross-validation is and why it's crucial for ML
- Different types of cross-validation techniques
- When to use each technique
- How to implement them in Python
- How to interpret the results

---
## 🤔 What is Cross-Validation?

### The Problem
Imagine you're studying for a test. If you only ever practice with one set of questions and simply memorize their answers, you may score perfectly on those questions yet struggle when the real test asks something new. In machine learning, that's called **overfitting**!

### The Solution: Cross-Validation
Cross-validation is like practicing with different practice tests to make sure you really understand the material, not just memorize specific questions.

### Real-World Analogy 🎮
Think of it like testing a video game:
- **Bad approach:** Only test one level over and over
- **Good approach (Cross-Validation):** Test different levels to make sure the game works everywhere

### Why Do We Need It?
1. **Reliability:** Get a better estimate of how your model performs on new data
2. **Fairness:** Test on multiple different subsets of data
3. **Confidence:** Know how consistent your model's performance is
4. **Optimization:** Compare different models fairly
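
To make the "memorizing versus learning" idea concrete, here is a minimal sketch that compares a model's accuracy on its own training data with its cross-validated accuracy. The dataset settings (`n_samples=300`, `flip_y=0.1`) and the unpruned decision tree are illustrative assumptions, chosen only to make the gap easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (an illustrative assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    flip_y=0.1, random_state=0
)

# An unpruned tree can memorize every training point
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_demo, y_demo)
train_acc = tree.score(X_demo, y_demo)

# Cross-validation scores the model only on data it did not train on
cv_acc = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

print(f"Accuracy on training data: {train_acc:.3f}")
print(f"5-fold CV accuracy:        {cv_acc:.3f}")
```

The training score looks perfect because the tree has seen those exact points; the cross-validated score is the more honest estimate of how it would do on new data.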

---

## 📊 Types of Cross-Validation

We'll explore these techniques:

| Technique | When to Use | Pros | Cons |
|-----------|-------------|------|------|
| **K-Fold** | Most situations | Fast, reliable | Folds may not preserve class proportions |
| **Stratified K-Fold** | Imbalanced classes | Preserves class distribution | Only for classification |
| **Leave-One-Out (LOO)** | Small datasets | Uses all data | Very slow |
| **Leave-P-Out (LPO)** | Very small datasets | Thorough | Extremely slow |
| **Time Series Split** | Time-based data | Respects time order | Only for time series |
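
Leave-P-Out appears in the table but is not demonstrated later in this notebook, so here is a small sketch of why it is listed as "extremely slow". It leaves out every possible group of `p` samples, so the number of splits is the binomial coefficient C(n, p). The 6-sample toy array is an assumption made purely for illustration.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X_toy = np.arange(12).reshape(6, 2)  # 6 toy samples, 2 features each

loo_splits = LeaveOneOut().get_n_splits(X_toy)   # one split per sample
lpo_splits = LeavePOut(p=2).get_n_splits(X_toy)  # one split per pair of samples

print(f"Leave-One-Out splits: {loo_splits}")
print(f"Leave-2-Out splits:   {lpo_splits}")

# The Leave-P-Out count grows combinatorially: C(n, p)
print(f"C(100, 2) = {comb(100, 2)}")
```

Even with only 100 samples, leaving out 2 at a time already means thousands of model fits.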

---

## 🔧 Setup: Import Libraries

Let's import all the tools we'll need for our cross-validation journey!

In [None]:
# Import cross-validation tools
from sklearn.model_selection import (
    cross_val_score,      # Function to perform cross-validation
    KFold,                # K-Fold CV
    StratifiedKFold,      # Stratified K-Fold CV
    LeaveOneOut,          # Leave-One-Out CV
    LeavePOut,            # Leave-P-Out CV
    TimeSeriesSplit       # Time Series CV
)

# Import machine learning models
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier

# Import datasets
from sklearn.datasets import (
    make_classification,      # Create synthetic classification data
    load_iris,                # Classic flower classification dataset
    fetch_california_housing, # Housing price prediction dataset
    load_wine                 # Wine classification dataset
)

# Import utilities
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility (so we all get the same results!)
np.random.seed(42)

print("✅ All libraries imported successfully!")
print("🚀 Ready to explore Cross-Validation!")

---
# 1️⃣ K-Fold Cross-Validation

## 🎯 The Main Technique

### How It Works (Deck of Cards Analogy 🃏)
Imagine you have a deck of 100 cards:
1. **Shuffle** the deck
2. **Divide** into 5 equal piles (20 cards each)
3. **Test** using one pile, **train** using the other 4 piles
4. **Repeat** 5 times, using each pile as the test set once
5. **Average** all 5 scores for final result

### Visual Representation
```
Fold 1: [TEST] [TRAIN] [TRAIN] [TRAIN] [TRAIN]
Fold 2: [TRAIN] [TEST] [TRAIN] [TRAIN] [TRAIN]
Fold 3: [TRAIN] [TRAIN] [TEST] [TRAIN] [TRAIN]
Fold 4: [TRAIN] [TRAIN] [TRAIN] [TEST] [TRAIN]
Fold 5: [TRAIN] [TRAIN] [TRAIN] [TRAIN] [TEST]
```

### Key Parameters
- `n_splits=5`: Number of folds (commonly 5 or 10)
- `shuffle=True`: Randomly mix data before splitting
- `random_state=42`: Ensures same shuffle every time (reproducibility)
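
To see these parameters in action, the sketch below prints the actual index splits that `KFold` produces. The 10-sample toy array is an assumption standing in for the deck of cards; only `KFold` itself comes from the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

X_cards = np.arange(10).reshape(-1, 1)  # 10 toy samples standing in for the deck

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kfold_demo.split(X_cards), 1):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    test_folds.append(test_idx)

# Every sample lands in exactly one test fold
all_test = np.sort(np.concatenate(test_folds))
print("Test indices combined:", all_test)
```

Each sample is used for testing exactly once and for training in the other four folds.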

---

## 📝 Example 1: K-Fold on Logistic Regression

Let's classify data into two categories (like spam vs. not spam emails).

In [None]:
print("="*60)
print("K-FOLD CROSS-VALIDATION ON LOGISTIC REGRESSION")
print("="*60)

# STEP 1: Create a synthetic dataset for binary classification
# Think of this as creating fake data about two groups (like cats vs dogs)
X, y = make_classification(
    n_samples=100,        # 100 data points
    n_features=10,        # 10 measurements per data point
    n_classes=2,          # 2 categories (binary classification)
    n_informative=8,      # 8 features are actually useful
    n_redundant=2,        # 2 features are linear combinations of the informative ones
    random_state=42       # For reproducibility
)

print(f"📊 Dataset created: {X.shape[0]} samples, {X.shape[1]} features")
print(f"📈 Classes distribution: {np.bincount(y)}")
print()

# STEP 2: Create the model
# Logistic Regression is good for yes/no, true/false predictions
model = LogisticRegression(max_iter=1000)  # max_iter prevents warnings

# STEP 3: Set up K-Fold Cross-Validation
kfold = KFold(
    n_splits=5,           # Split data into 5 parts
    shuffle=True,         # Randomly shuffle before splitting
    random_state=42       # Same shuffle every time
)

# STEP 4: Run cross-validation
# This trains and tests the model 5 times, each time with different test data
results = cross_val_score(
    model,                # The model to test
    X,                    # The features
    y,                    # The labels
    cv=kfold,             # The cross-validation strategy
    scoring='accuracy'    # How to measure success (% correct)
)

# STEP 5: Display and interpret results
print("🔍 K-Fold Cross-Validation Results:")
print("-" * 50)
for i, score in enumerate(results, 1):
    print(f"Fold {i}: {score:.3f} ({score*100:.1f}% accuracy)")

print("-" * 50)
print(f"\n📊 Summary Statistics:")
print(f"   Mean Accuracy: {results.mean():.3f} ({results.mean()*100:.1f}%)")
print(f"   Std Deviation: {results.std():.3f} ({results.std()*100:.1f}%)")
print(f"   Min Accuracy:  {results.min():.3f} ({results.min()*100:.1f}%)")
print(f"   Max Accuracy:  {results.max():.3f} ({results.max()*100:.1f}%)")

# INTERPRETATION GUIDE
print("\n💡 What does this mean?")
if results.std() < 0.05:
    print("   ✅ Low standard deviation = Consistent performance!")
else:
    print("   ⚠️  High standard deviation = Performance varies a lot")

if results.mean() > 0.85:
    print("   ✅ High mean accuracy = Model performs well!")
elif results.mean() > 0.70:
    print("   ⚠️  Moderate accuracy = Room for improvement")
else:
    print("   ❌ Low accuracy = Model needs work")

print("\n" + "="*60)

### 🎓 Understanding the Results

**Mean Accuracy**: The average performance across all 5 folds
- Think of it as your overall grade

**Standard Deviation**: How much the scores vary
- Low std dev (< 0.05): Your model is consistent! 🎯
- High std dev (> 0.10): Your model's performance is unpredictable 🎲

**Why 5 folds?**
- Common choices: 5 or 10 folds
- More folds = more training data per fold, but more models to fit (slower)
- Fewer folds = faster, but each model sees less training data, so the score estimate tends to be more pessimistic
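
The trade-off can be made concrete by counting how many samples each model trains on for different fold counts. This is a small sketch on 100 placeholder samples; the array `X_hundred` is an assumption used only to drive the splitter.

```python
import numpy as np
from sklearn.model_selection import KFold

X_hundred = np.zeros((100, 1))  # 100 placeholder samples; values are irrelevant here

train_sizes = {}
for k in (2, 5, 10):
    # Look at the first split only; all folds have the same size here
    train_idx, test_idx = next(KFold(n_splits=k).split(X_hundred))
    train_sizes[k] = len(train_idx)
    print(f"k={k:>2}: {k} model fits, "
          f"{len(train_idx)} training / {len(test_idx)} test samples per fold")
```

With 2 folds each model sees only half the data; with 10 folds it sees 90%, at the cost of fitting ten models.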

---

---
# 2️⃣ Stratified K-Fold Cross-Validation

## 🎯 The "Fair Distribution" Technique

### The Problem It Solves
Imagine you have a class with:
- 90 students who like pizza 🍕
- 10 students who like salad 🥗

If you randomly split into groups, one group might have NO salad lovers!

### How Stratified K-Fold Helps
It ensures each fold has the **same proportion** of each class:
- Each fold will have ~90% pizza lovers and ~10% salad lovers
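
The classroom example can be checked directly by counting labels in each test fold. The sketch below uses 90 zeros and 10 ones as stand-ins for pizza and salad lovers; the helper name `count_test_labels` is hypothetical and exists only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 90 "pizza" labels (0) and 10 "salad" labels (1), mirroring the classroom example
y_class = np.array([0] * 90 + [1] * 10)
X_class = np.zeros((100, 1))  # features are irrelevant for the split itself

def count_test_labels(cv):
    """Return [pizza_count, salad_count] for each test fold."""
    return [np.bincount(y_class[test_idx], minlength=2).tolist()
            for _, test_idx in cv.split(X_class, y_class)]

plain_counts = count_test_labels(KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = count_test_labels(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold           :", plain_counts)
print("StratifiedKFold :", strat_counts)
```

With plain `KFold` the number of salad lovers per fold depends on the shuffle, while `StratifiedKFold` gives every fold the same 18-to-2 mix.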

### When to Use
- ✅ **Use when:** You have imbalanced classes (unequal numbers in categories)
- ❌ **Don't use when:** You have regression problems (predicting numbers, not categories)

---

In [None]:
print("="*60)
print("STRATIFIED K-FOLD CROSS-VALIDATION")
print("="*60)

# STEP 1: Create an IMBALANCED dataset
# This simulates real-world scenarios like fraud detection
# (where fraud cases are rare)
X_imbalanced, y_imbalanced = make_classification(
    n_samples=1000,
    n_features=10,
    n_classes=2,
    weights=[0.9, 0.1],    # 90% class 0, 10% class 1
    flip_y=0.01,           # Add 1% label noise
    random_state=42
)

# Check the imbalance
class_counts = np.bincount(y_imbalanced)
print(f"\n📊 Dataset Distribution:")
print(f"   Class 0: {class_counts[0]} samples ({class_counts[0]/len(y_imbalanced)*100:.1f}%)")
print(f"   Class 1: {class_counts[1]} samples ({class_counts[1]/len(y_imbalanced)*100:.1f}%)")
print(f"   ⚠️  This is IMBALANCED! Perfect for Stratified K-Fold.")
print()

# STEP 2: Create model
model = LogisticRegression(max_iter=1000)

# STEP 3: Compare Regular K-Fold vs Stratified K-Fold

# Regular K-Fold (might create unfair splits)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
regular_results = cross_val_score(model, X_imbalanced, y_imbalanced, cv=kfold)

# Stratified K-Fold (ensures fair class distribution)
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_results = cross_val_score(model, X_imbalanced, y_imbalanced, cv=stratified_kfold)

# STEP 4: Compare results
print("🔄 Regular K-Fold Results:")
print(f"   Scores: {[f'{s:.3f}' for s in regular_results]}")
print(f"   Mean: {regular_results.mean():.3f}, Std: {regular_results.std():.3f}")
print()

print("✨ Stratified K-Fold Results:")
print(f"   Scores: {[f'{s:.3f}' for s in stratified_results]}")
print(f"   Mean: {stratified_results.mean():.3f}, Std: {stratified_results.std():.3f}")
print()

print("📊 Comparison:")
print(f"   Difference in Std Dev: {abs(regular_results.std() - stratified_results.std()):.4f}")
if stratified_results.std() < regular_results.std():
    print("   ✅ Stratified K-Fold is MORE CONSISTENT!")
else:
    print("   Similar consistency in this case")

print("\n" + "="*60)

### 🎓 Why Use Stratified K-Fold?

**Real-World Examples:**
1. **Medical Diagnosis**: Rare diseases (few positive cases)
2. **Fraud Detection**: Most transactions are legitimate
3. **Spam Detection**: Most emails aren't spam

**Key Benefit:**
- Each fold has the same ratio of classes as the original dataset
- More reliable performance estimates
- Fairer comparison between models

---

---
# 3️⃣ Leave-One-Out Cross-Validation (LOO)

## 🎯 The "Test Each Sample" Technique

### How It Works
Imagine you have 100 data points:
1. Use 1 point for testing, 99 for training
2. Repeat 100 times, each time using a different point for testing
3. Average all 100 results
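
One way to think about LOO is as K-Fold taken to the extreme, where the number of folds equals the number of samples. The sketch below checks this on an 8-sample toy array (an assumption for illustration) by comparing the test sets produced by `LeaveOneOut` and by an unshuffled `KFold` with `n_splits` set to the dataset size.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_tiny = np.arange(8).reshape(-1, 1)  # 8 toy samples

# Collect the test indices of every split from both strategies
loo_tests = [test_idx.tolist() for _, test_idx in LeaveOneOut().split(X_tiny)]
kfold_tests = [test_idx.tolist()
               for _, test_idx in KFold(n_splits=len(X_tiny)).split(X_tiny)]

print("LOO test sets:     ", loo_tests)
print("KFold(n) test sets:", kfold_tests)
```

Both produce one single-sample test set per data point, which is why LOO needs as many model fits as there are samples.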

### Visual Representation
```
Iteration 1:  [TEST] [TRAIN] [TRAIN] ... [TRAIN]  (99 training points)
Iteration 2:  [TRAIN] [TEST] [TRAIN] ... [TRAIN]
Iteration 3:  [TRAIN] [TRAIN] [TEST] ... [TRAIN]
...
Iteration 100: [TRAIN] [TRAIN] [TRAIN] ... [TEST]
```

### Pros and Cons
✅ **Pros:**
- Maximum use of data (99% for training each time!)
- Deterministic: there is no random shuffling, so results never depend on a seed
- Good for small datasets

❌ **Cons:**
- VERY slow (trains 100 models for 100 data points)
- Computationally expensive
- Per-iteration scores are all-or-nothing (one test sample each), so the estimate can be noisy

---

In [None]:
print("="*60)
print("LEAVE-ONE-OUT CROSS-VALIDATION")
print("="*60)

# STEP 1: Create a SMALL dataset (LOO is slow, so we use small data)
# Using only 50 samples to keep it fast
X_small, y_small = make_classification(
    n_samples=50,         # Small dataset
    n_features=5,         # Fewer features
    n_classes=2,
    n_informative=3,
    random_state=42
)

print(f"\n📊 Small Dataset: {X_small.shape[0]} samples")
print(f"   ⚠️  LOO will train {X_small.shape[0]} different models!")
print()

# STEP 2: Create model
model = LogisticRegression(max_iter=1000)

# STEP 3: Set up Leave-One-Out
loo = LeaveOneOut()

# Count how many iterations we'll have
n_iterations = loo.get_n_splits(X_small)
print(f"🔄 Running {n_iterations} iterations...")

# STEP 4: Run LOO cross-validation
import time
start_time = time.time()

loo_results = cross_val_score(model, X_small, y_small, cv=loo)

end_time = time.time()
elapsed_time = end_time - start_time

# STEP 5: Display results
print(f"✅ Completed in {elapsed_time:.2f} seconds")
print()

# Show first 10 results (otherwise too many to display!)
print(f"📊 First 10 Results (out of {len(loo_results)}):")
print(f"   {loo_results[:10]}")
print(f"   ...")
print()

print("📈 Summary Statistics:")
print(f"   Mean Accuracy: {loo_results.mean():.3f} ({loo_results.mean()*100:.1f}%)")
print(f"   Std Deviation: {loo_results.std():.3f}")
print(f"   Correct Predictions: {int(np.sum(loo_results))} out of {len(loo_results)}")
print()

# STEP 6: Compare with K-Fold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
start_time = time.time()
kfold_results = cross_val_score(model, X_small, y_small, cv=kfold)
kfold_time = time.time() - start_time

print("⚡ Speed Comparison:")
print(f"   LOO:    {elapsed_time:.3f} seconds ({n_iterations} models)")
print(f"   K-Fold: {kfold_time:.3f} seconds (5 models)")
print(f"   LOO is {elapsed_time/kfold_time:.1f}x SLOWER!")

print("\n" + "="*60)

### 🎓 When to Use LOO?

**Use LOO when:**
- Dataset is VERY small (< 100 samples)
- You need maximum use of training data
- Computational time is not a concern

**Avoid LOO when:**
- Dataset is large (> 1000 samples)
- You need fast results
- K-Fold gives similar accuracy in less time

**Pro Tip:** For most situations, **5-Fold or 10-Fold is better** than LOO!

---

---
# 4️⃣ Cross-Validation on Regression Problems

## 🏠 Predicting Housing Prices

So far we've done **classification** (predicting categories).
Now let's try **regression** (predicting numbers)!

### Real-World Example
Predicting house prices based on:
- Number of rooms 🛏️
- Location 📍
- Age of house 🏚️
- Population density 👥

---

In [None]:
print("="*60)
print("K-FOLD CROSS-VALIDATION ON LINEAR REGRESSION")
print("Predicting California Housing Prices")
print("="*60)

# STEP 1: Load the California Housing dataset
# This is real data about houses in California!
housing = fetch_california_housing()

print("\n📊 Dataset Information:")
print(f"   Samples: {housing.data.shape[0]}")
print(f"   Features: {housing.data.shape[1]}")
print(f"\n🏠 Features (what we measure):")
for i, feature in enumerate(housing.feature_names, 1):
    print(f"   {i}. {feature}")
print(f"\n🎯 Target: {housing.target_names[0]} (in $100,000s)")
print()

# Prepare data
X, y = housing.data, housing.target

# STEP 2: Create Linear Regression model
# This finds the best line to fit the data
lin_reg = LinearRegression()

# STEP 3: Set up K-Fold
k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=42)

# STEP 4: Perform K-fold cross-validation MANUALLY
# (to see what's happening behind the scenes)
print("🔄 Running K-Fold Cross-Validation...")
print("-" * 50)

kfold_scores = []
for fold, (train_index, test_index) in enumerate(kfold.split(X), 1):
    # Split data
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train model
    lin_reg.fit(X_train, y_train)

    # Test model (R² score: 1.0 is perfect, 0.0 is no better than always predicting the mean)
    score = lin_reg.score(X_test, y_test)
    kfold_scores.append(score)

    print(f"Fold {fold}: R² = {score:.4f}")
    print(f"   Training samples: {len(X_train)}")
    print(f"   Testing samples:  {len(X_test)}")

print("-" * 50)

# STEP 5: Calculate and display statistics
mean_score = np.mean(kfold_scores)
std_score = np.std(kfold_scores)

print(f"\n📊 Final Results:")
print(f"   Mean R² Score: {mean_score:.4f}")
print(f"   Std Deviation: {std_score:.4f}")
print(f"   Min Score: {min(kfold_scores):.4f}")
print(f"   Max Score: {max(kfold_scores):.4f}")

# STEP 6: Interpret R² score
print(f"\n💡 What does R² = {mean_score:.4f} mean?")
variance_explained = mean_score * 100
print(f"   The model explains {variance_explained:.1f}% of the variance in house prices")

if mean_score > 0.7:
    print("   ✅ Good! The model captures most of the patterns")
elif mean_score > 0.5:
    print("   ⚠️  Moderate. Some patterns are missed")
else:
    print("   ❌ Poor. Model needs improvement")

print("\n" + "="*60)

### 🎓 Understanding R² Score

**What is R² (R-squared)?**
- Measures how well your model predicts values
- Range: at most 1.0; usually between 0.0 and 1.0, but negative when the model does worse than always predicting the mean

**Interpretation Guide:**
- **R² = 1.0**: Perfect predictions! 🎯
- **R² = 0.8**: Explains 80% of variance (very good) ✅
- **R² = 0.6**: Explains 60% of variance (okay) 👍
- **R² = 0.3**: Explains 30% of variance (needs work) ⚠️
- **R² = 0.0**: Model is useless (just guessing the average) ❌
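
The interpretation guide follows from the definition R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations from the mean. The sketch below works through three toy predictions; the five target values and the helper `r2_manual` are assumptions made for illustration, and the results are compared against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy targets (illustrative values)

def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

predictions = {
    "perfect": y_true.copy(),
    "always the mean": np.full_like(y_true, y_true.mean()),
    "reversed": y_true[::-1].copy(),
}

r2_values = {name: r2_manual(y_true, y_pred) for name, y_pred in predictions.items()}
for name, value in r2_values.items():
    # The manual formula should agree with scikit-learn's r2_score
    sk_value = r2_score(y_true, predictions[name])
    print(f"{name:<16} manual R² = {value:+.2f}, r2_score = {sk_value:+.2f}")
```

Predicting the mean gives exactly 0, and predictions that are worse than the mean push the score below zero.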

**Note:** For regression, we can't use Stratified K-Fold (that's only for classification!)

---

---
# 5️⃣ Cross-Validation on Decision Trees

## 🌳 Classification: Iris Flowers

Decision Trees make decisions like a flowchart:
- "Is petal length > 2.5cm?"
  - If YES → "Is petal width > 1.7cm?"
  - If NO → It's a Setosa!

Let's classify iris flowers into 3 species! 🌸

---

In [None]:
print("="*60)
print("DECISION TREE CLASSIFICATION - IRIS FLOWERS")
print("="*60)

# STEP 1: Load the famous Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

print("\n🌸 Iris Dataset:")
print(f"   Samples: {X.shape[0]}")
print(f"   Features: {X.shape[1]}")
print(f"\n📏 Features:")
for i, feature in enumerate(iris.feature_names, 1):
    print(f"   {i}. {feature}")

print(f"\n🎯 Target Classes (Flower Species):")
for i, species in enumerate(iris.target_names):
    count = np.sum(y == i)
    print(f"   {i}. {species}: {count} samples")
print()

# STEP 2: Create Decision Tree model
dtc = DecisionTreeClassifier(
    max_depth=3,          # Limit tree depth to prevent overfitting
    random_state=42
)

# STEP 3: Compare K-Fold vs Stratified K-Fold
k = 5

# Regular K-Fold
kfold = KFold(n_splits=k, shuffle=True, random_state=42)
kfold_scores = cross_val_score(dtc, X, y, cv=kfold)

# Stratified K-Fold (better for multi-class classification)
strat_kfold = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
strat_scores = cross_val_score(dtc, X, y, cv=strat_kfold)

# STEP 4: Display results side by side
print("📊 Comparison of Cross-Validation Methods:")
print("-" * 50)
print(f"{'Fold':<10} {'K-Fold':<15} {'Stratified K-Fold':<20}")
print("-" * 50)

for i in range(k):
    print(f"Fold {i+1:<5} {kfold_scores[i]:.3f} ({kfold_scores[i]*100:>5.1f}%)  "
          f"{strat_scores[i]:.3f} ({strat_scores[i]*100:>5.1f}%)")

print("-" * 50)
print(f"{'Mean':<10} {kfold_scores.mean():.3f} ({kfold_scores.mean()*100:>5.1f}%)  "
      f"{strat_scores.mean():.3f} ({strat_scores.mean()*100:>5.1f}%)")
print(f"{'Std Dev':<10} {kfold_scores.std():.3f} ({kfold_scores.std()*100:>5.1f}%)  "
      f"{strat_scores.std():.3f} ({strat_scores.std()*100:>5.1f}%)")

print("\n💡 Notice:")
if strat_scores.std() < kfold_scores.std():
    print("   ✅ Stratified K-Fold has LOWER variance (more consistent)")
    print("   This is because it maintains class balance in each fold!")

print("\n" + "="*60)

---
# 6️⃣ Time Series Cross-Validation

## ⏰ The "Respect Time Order" Technique

### Why Time Series is Different
Imagine predicting tomorrow's weather:
- ✅ You CAN use yesterday's data to predict tomorrow
- ❌ You CANNOT use tomorrow's data to predict yesterday!

### The Problem with Regular K-Fold
Regular K-Fold ignores time order. Whether or not it shuffles, some folds train on later data and test on earlier data:
```
❌ Wrong: Train on [Future] → Test on [Past]
```

### Time Series Split Solution
Always trains on past, tests on future:
```
Split 1: [Train ████        ] [Test █]
Split 2: [Train █████       ] [Test █]
Split 3: [Train ██████      ] [Test █]
Split 4: [Train ███████     ] [Test █]
Split 5: [Train ████████    ] [Test █]
         ←─── Past    Future ──→
```
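
The diagram can be reproduced with real indices. The sketch below runs `TimeSeriesSplit` on 12 toy time steps (an assumption for illustration) and prints each split, so you can confirm that every training index comes before every test index.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_days = np.arange(12).reshape(-1, 1)  # 12 toy time steps, already in chronological order

ts_splits = list(TimeSeriesSplit(n_splits=5).split(X_days))
for i, (train_idx, test_idx) in enumerate(ts_splits, 1):
    print(f"Split {i}: train={train_idx.tolist()}, test={test_idx.tolist()}")
```

The training window grows with each split while the test window always sits immediately after it.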

### Use Cases 📈
- Stock prices prediction
- Weather forecasting
- Sales forecasting
- Any data with time stamps!

---

In [None]:
print("="*60)
print("TIME SERIES CROSS-VALIDATION")
print("Forecasting Sales Over Time")
print("="*60)

# STEP 1: Create a time series dataset
# Simulating daily sales data with trend and seasonality
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
n_days = len(dates)

# Create synthetic sales data with:
# - Upward trend (business growing)
# - Weekly seasonality (weekends are busier)
# - Random noise
trend = np.linspace(100, 150, n_days)  # Growing from 100 to 150
seasonality = 20 * np.sin(2 * np.pi * np.arange(n_days) / 7)  # Weekly pattern
noise = np.random.randn(n_days) * 5  # Random variation
sales = trend + seasonality + noise

# Create DataFrame
data = pd.DataFrame({
    'date': dates,
    'sales': sales
})
data.set_index('date', inplace=True)

print(f"\n📅 Time Period: {dates[0].date()} to {dates[-1].date()}")
print(f"📊 Total Days: {n_days}")
print(f"💰 Sales Range: ${sales.min():.2f} to ${sales.max():.2f}")
print()

# STEP 2: Set up Time Series Cross-Validation
n_splits = 5
tscv = TimeSeriesSplit(n_splits=n_splits)

print(f"🔄 Running Time Series Cross-Validation ({n_splits} splits)...")
print("-" * 70)

# STEP 3: Visualize the splits
print(f"\n{'Split':<8} {'Train Period':<25} {'Test Period':<25} {'Train Size':<12} {'Test Size'}")
print("-" * 70)

for i, (train_index, test_index) in enumerate(tscv.split(data), 1):
    train_start = data.index[train_index[0]].date()
    train_end = data.index[train_index[-1]].date()
    test_start = data.index[test_index[0]].date()
    test_end = data.index[test_index[-1]].date()

    print(f"Split {i}  {train_start} to {train_end}  "
          f"{test_start} to {test_end}  "
          f"{len(train_index):<12} {len(test_index)}")

print("-" * 70)

# STEP 4: Train and evaluate model on each split
predictions_list = []
true_labels_list = []
scores = []

model = LinearRegression()

for fold, (train_index, test_index) in enumerate(tscv.split(data), 1):
    # Split data
    train_data = data.iloc[train_index]
    test_data = data.iloc[test_index]

    # Prepare features (using the day number, i.e. the row position, as the feature)
    X_train = train_index.reshape(-1, 1)
    y_train = train_data['sales'].values
    X_test = test_index.reshape(-1, 1)
    y_test = test_data['sales'].values

    # Train model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Calculate R¬≤ score
    score = model.score(X_test, y_test)
    scores.append(score)

    # Store for later
    predictions_list.append(y_pred)
    true_labels_list.append(y_test)

# STEP 5: Calculate overall performance
predictions = np.concatenate(predictions_list)
true_labels = np.concatenate(true_labels_list)
mse = ((predictions - true_labels) ** 2).mean()
mae = np.abs(predictions - true_labels).mean()

print(f"\n📊 Cross-Validation Results:")
print(f"   Mean R² Score: {np.mean(scores):.4f}")
print(f"   Std Deviation: {np.std(scores):.4f}")
print(f"\n📈 Error Metrics:")
print(f"   Mean Squared Error: {mse:.2f} (in squared dollars)")
print(f"   Mean Absolute Error: ${mae:.2f}")
print(f"   (On average, predictions are off by ${mae:.2f})")

print("\n💡 Key Insight:")
print("   Notice how training size GROWS with each split!")
print("   This mimics real-world forecasting: as time passes, more history is available for training")

print("\n" + "="*60)

### 🎓 Time Series Best Practices

**DO:**
- ✅ Use TimeSeriesSplit for any time-ordered data
- ✅ Keep data in chronological order
- ✅ Train on past, test on future
- ✅ Consider seasonality (daily, weekly, monthly patterns)

**DON'T:**
- ❌ Use regular K-Fold (it tests on data older than the training data!)
- ❌ Use future data to predict the past
- ❌ Shuffle time series data

**Real-World Applications:**
- 📈 Stock market prediction
- 🌡️ Temperature forecasting
- 🏪 Retail sales forecasting
- 📊 Website traffic prediction
- 💰 Cryptocurrency price prediction

---

---
# 7️⃣ Visualizing Cross-Validation Results

## 📊 Let's Make Beautiful Charts!

Visualizations help us understand:
- How consistent our model is
- Which folds performed best/worst
- How different CV methods compare

---

In [None]:
print("="*60)
print("VISUALIZING CROSS-VALIDATION RESULTS")
print("="*60)

# STEP 1: Generate results from multiple CV methods
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Different CV methods
cv_methods = {
    '3-Fold': KFold(n_splits=3, shuffle=True, random_state=42),
    '5-Fold': KFold(n_splits=5, shuffle=True, random_state=42),
    '10-Fold': KFold(n_splits=10, shuffle=True, random_state=42),
    'Stratified 5-Fold': StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
}

results = {}
for name, cv in cv_methods.items():
    scores = cross_val_score(model, X, y, cv=cv)
    results[name] = scores
    print(f"✅ {name}: Mean = {scores.mean():.3f}, Std = {scores.std():.3f}")

print()

# STEP 2: Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Cross-Validation Results Comparison', fontsize=16, fontweight='bold')

# Plot 1: Box Plot
ax1 = axes[0, 0]
ax1.boxplot(list(results.values()))
ax1.set_xticklabels(results.keys())  # Works across Matplotlib versions (`labels=` is deprecated in 3.9+)
ax1.set_title('Box Plot: Score Distribution', fontweight='bold')
ax1.set_ylabel('Accuracy Score')
ax1.set_ylim([0.85, 1.0])
ax1.grid(True, alpha=0.3)
ax1.tick_params(axis='x', rotation=45)

# Plot 2: Bar Chart with Error Bars
ax2 = axes[0, 1]
means = [np.mean(scores) for scores in results.values()]
stds = [np.std(scores) for scores in results.values()]
x_pos = np.arange(len(results))
ax2.bar(x_pos, means, yerr=stds, capsize=5, alpha=0.7, color='skyblue', edgecolor='navy')
ax2.set_xticks(x_pos)
ax2.set_xticklabels(results.keys(), rotation=45, ha='right')
ax2.set_title('Mean Accuracy with Error Bars', fontweight='bold')
ax2.set_ylabel('Mean Accuracy')
ax2.set_ylim([0.85, 1.0])
ax2.grid(True, alpha=0.3, axis='y')

# Plot 3: Individual Fold Scores
ax3 = axes[1, 0]
for name, scores in results.items():
    ax3.plot(range(1, len(scores) + 1), scores, marker='o', label=name, linewidth=2)
ax3.set_title('Scores Across Folds', fontweight='bold')
ax3.set_xlabel('Fold Number')
ax3.set_ylabel('Accuracy Score')
ax3.legend(loc='lower right')
ax3.grid(True, alpha=0.3)
ax3.set_ylim([0.85, 1.0])

# Plot 4: Violin Plot
ax4 = axes[1, 1]
positions = range(1, len(results) + 1)
parts = ax4.violinplot(results.values(), positions=positions, showmeans=True, showmedians=True)
ax4.set_xticks(positions)
ax4.set_xticklabels(results.keys(), rotation=45, ha='right')
ax4.set_title('Violin Plot: Score Distribution', fontweight='bold')
ax4.set_ylabel('Accuracy Score')
ax4.set_ylim([0.85, 1.0])
ax4.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n💡 Reading the Charts:")
print("   📦 Box Plot: Shows median, quartiles, and outliers")
print("   📊 Bar Chart: Shows average with error bars (±1 std dev)")
print("   📈 Line Plot: Shows how each fold performed")
print("   🎻 Violin Plot: Shows full distribution shape")

print("\n" + "="*60)

---
# 8️⃣ Best Practices & Common Pitfalls

## ✅ Best Practices

### 1. Choosing the Right Number of Folds
- **Small datasets (< 1000):** Use 5-10 folds
- **Large datasets (> 10,000):** Use 3-5 folds
- **Very small datasets (< 100):** Consider LOO

### 2. Always Shuffle (Except Time Series!)
```python
# ✅ Good
KFold(n_splits=5, shuffle=True, random_state=42)

# ❌ Bad (unless data is pre-shuffled)
KFold(n_splits=5, shuffle=False)
```

### 3. Use Stratified for Imbalanced Data
```python
# When class 1 is only 10% of data
StratifiedKFold(n_splits=5)  # ✅ Better
```

### 4. Set Random State for Reproducibility
```python
# ✅ Results will be the same every time
KFold(n_splits=5, shuffle=True, random_state=42)
```
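
A quick way to convince yourself is to build the same splitter twice and compare the splits. This is a minimal sketch on 20 toy samples; the helper `first_test_fold` is hypothetical and only exists for this check.

```python
import numpy as np
from sklearn.model_selection import KFold

X_rep = np.arange(20).reshape(-1, 1)  # 20 toy samples

def first_test_fold(random_state):
    """Return the test indices of the first fold for a given seed."""
    cv = KFold(n_splits=5, shuffle=True, random_state=random_state)
    _, test_idx = next(cv.split(X_rep))
    return test_idx.tolist()

same_seed_a = first_test_fold(42)
same_seed_b = first_test_fold(42)

print("Seed 42, run 1:", same_seed_a)
print("Seed 42, run 2:", same_seed_b)
print("Identical splits:", same_seed_a == same_seed_b)
```

With the seed fixed, anyone rerunning the notebook evaluates the model on exactly the same folds.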

### 5. Look at Standard Deviation
- Low std dev (< 0.05): Consistent model! 🎯
- High std dev (> 0.10): Unstable model ⚠️

---

## ❌ Common Pitfalls to Avoid

### 1. Data Leakage
```python
# ❌ WRONG: Scaling before splitting
X_scaled = scaler.fit_transform(X)  # Test data leaked into training!
cross_val_score(model, X_scaled, y)

# ✅ CORRECT: Scale inside each fold
# (We'll learn this in the Pipeline chapter)
```
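
As a brief preview, one common way to scale inside each fold is to wrap the scaler and the model in a scikit-learn pipeline and pass the whole pipeline to `cross_val_score`. The sketch below is only an illustration: the Wine dataset, `StandardScaler`, and the variable names are assumptions, not the approach the Pipeline chapter necessarily takes.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# The scaler is part of the estimator, so cross_val_score refits it on the
# training portion of every fold and it never sees that fold's test portion
leak_free_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipeline_scores = cross_val_score(leak_free_model, X_wine, y_wine, cv=cv)

print(f"Mean: {pipeline_scores.mean():.3f}, Std: {pipeline_scores.std():.3f}")
```

Because the scaling statistics are recomputed per fold, the test fold stays unseen until scoring time.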

### 2. Using K-Fold on Time Series
```python
# ❌ WRONG: Random splits break time order
KFold(n_splits=5, shuffle=True)

# ✅ CORRECT: Use TimeSeriesSplit
TimeSeriesSplit(n_splits=5)
```

### 3. Ignoring Class Imbalance
```python
# When you have: 90% class A, 10% class B

# ❌ WRONG: Regular K-Fold
KFold(n_splits=5)

# ✅ CORRECT: Stratified K-Fold
StratifiedKFold(n_splits=5)
```

### 4. Too Many or Too Few Folds
```python
# ❌ Too few: Not reliable
KFold(n_splits=2)

# ❌ Too many: Slow and high variance
KFold(n_splits=50)

# ✅ Just right: Standard choice
KFold(n_splits=5)
```

### 5. Not Setting Random State
```python
# ❌ Results change every time
KFold(n_splits=5, shuffle=True)

# ✅ Reproducible results
KFold(n_splits=5, shuffle=True, random_state=42)
```

---

---
# 9️⃣ Quick Reference Guide

## 🎯 Decision Tree: Which CV Method Should I Use?

```
START
  |
  ├─ Is it time series data?
  │   ├─ YES → Use TimeSeriesSplit
  │   └─ NO ↓
  |
  ├─ Is it classification or regression?
  │   ├─ REGRESSION → Use KFold (5-10 folds)
  │   └─ CLASSIFICATION ↓
  |
  ├─ Are classes balanced?
  │   ├─ YES → Use KFold (5-10 folds)
  │   └─ NO → Use StratifiedKFold (5-10 folds)
  |
  ├─ Is dataset very small (< 100 samples)?
  │   ├─ YES → Consider LeaveOneOut
  │   └─ NO → Stick with K-Fold
```
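
The flowchart can also be written as a small function. The sketch below is one possible reading of it: `choose_cv` is a hypothetical helper (not part of scikit-learn), and it assumes the "very small dataset" check takes priority over the class-balance check, which the flowchart leaves open.

```python
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold, TimeSeriesSplit

def choose_cv(is_time_series, is_classification, is_balanced, n_samples, n_splits=5):
    """Hypothetical helper mirroring the flowchart; not a scikit-learn function."""
    if is_time_series:
        return TimeSeriesSplit(n_splits=n_splits)
    if n_samples < 100:
        # Assumption: very small datasets go straight to Leave-One-Out
        return LeaveOneOut()
    if is_classification and not is_balanced:
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    return KFold(n_splits=n_splits, shuffle=True, random_state=42)

# Example: an imbalanced classification problem with 1,000 samples
cv_choice = choose_cv(is_time_series=False, is_classification=True,
                      is_balanced=False, n_samples=1000)
print(type(cv_choice).__name__)
```

Treat this as a starting point rather than a rule; the "Consider" in the flowchart signals that the final choice still depends on your compute budget and data.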

---

## 📋 Cheat Sheet

| Scenario | Best Method | Why? |
|----------|-------------|------|
| Balanced classification | K-Fold (5-10) | Fast and reliable |
| Imbalanced classification | Stratified K-Fold | Preserves class ratios |
| Regression | K-Fold (5-10) | Standard approach |
| Time series | TimeSeriesSplit | Respects time order |
| Very small dataset | LeaveOneOut | Maximizes training data |
| Large dataset | K-Fold (3-5) | Faster computation |

---

## 💻 Code Templates

### Template 1: Basic K-Fold
```python
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)

print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")
```

### Template 2: Stratified K-Fold
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

model = LogisticRegression()
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")
```

### Template 3: Time Series Split
```python
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression

model = LinearRegression()
tscv = TimeSeriesSplit(n_splits=5)

for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(f"Fold score: {score:.3f}")
```

---

---
# 🎓 Practice Exercises

## Exercise 1: Basic Cross-Validation
**Task:** Load the Wine dataset and perform 5-fold cross-validation using a Decision Tree Classifier.

**Your Goals:**
1. Calculate mean accuracy
2. Calculate standard deviation
3. Determine if the model is consistent

**Starter Code:**
```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, KFold

# Your code here!
```

---

## Exercise 2: Compare CV Methods
**Task:** Using the Iris dataset, compare:
- 3-Fold CV
- 5-Fold CV
- 10-Fold CV
- Stratified 5-Fold CV

**Questions to Answer:**
1. Which method gives the highest mean score?
2. Which method has the lowest standard deviation?
3. Which would you choose and why?

---

## Exercise 3: Time Series Challenge
**Task:** Create a synthetic time series dataset and:
1. Apply regular K-Fold (wrong approach)
2. Apply TimeSeriesSplit (correct approach)
3. Compare the results
4. Explain why one is better than the other

**Hint:** Create data with a clear trend!

---

## Exercise 4: Fix the Mistakes
**Task:** This code has several cross-validation mistakes. Find and fix them!

```python
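import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Load data
X, y = load_iris(return_X_y=True)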

# Create imbalanced dataset (90% class 0, 10% class 1)
mask = (y == 0) | ((y == 1) & (np.random.rand(len(y)) < 0.1))
X, y = X[mask], (y[mask] > 0).astype(int)

# Cross-validation (FIND THE MISTAKES!)
model = LogisticRegression()
kfold = KFold(n_splits=2)  # Mistake 1?
scores = cross_val_score(model, X, y, cv=kfold)  # Mistake 2?

print(f"Score: {scores[0]}")  # Mistake 3?
```

**Questions:**
1. What are the 3 mistakes?
2. How would you fix them?
3. Why are they mistakes?

---

In [None]:
# Exercise Space - Write your solutions here!

# Exercise 1: Your solution
print("=" * 60)
print("EXERCISE 1 SOLUTION")
print("=" * 60)

# Your code here


print("\n" + "=" * 60)
print("EXERCISE 2 SOLUTION")
print("=" * 60)

# Your code here


print("\n" + "=" * 60)
print("EXERCISE 3 SOLUTION")
print("=" * 60)

# Your code here

---
# 🎉 Congratulations!

## 📚 What You've Learned

You now understand:
- ✅ What cross-validation is and why it's important
- ✅ Different types of CV techniques (K-Fold, Stratified, LOO, Time Series)
- ✅ When to use each technique
- ✅ How to implement them in Python
- ✅ How to interpret CV results
- ✅ Common pitfalls and best practices

---

## 🚀 Next Steps

1. **Practice** with different datasets
2. **Experiment** with different numbers of folds
3. **Combine** CV with hyperparameter tuning (coming soon!)
4. **Apply** to your own projects

---

## 💡 Remember

> "Cross-validation is like getting a second opinion, third opinion, fourth opinion... It makes you more confident in your model's performance!"

---

## 📖 Additional Resources

- [Scikit-learn Cross-Validation Guide](https://scikit-learn.org/stable/modules/cross_validation.html)
- [StatQuest: Cross Validation](https://www.youtube.com/watch?v=fSytzGwwBVw)
- Practice on [Kaggle](https://www.kaggle.com/)

---

## ❓ Questions?

Contact: **Siva.Jasthi@metrostate.edu**

---

**Happy Learning! 🎓**