# Chapter 43: Cross-Validation and Model Selection
## Improving Model Reliability Through Smart Validation Techniques

---

## 43.1 Introduction to Cross-Validation

In machine learning, it's not enough for a model to perform well on the training data; it must **generalize well to new, unseen data**.

**Cross-validation** is a technique used to assess model performance reliably by training and testing on different subsets of the dataset.

### Benefits of Cross-Validation

It helps us:

- Detect **overfitting** and **underfitting**
- Choose the **best model** or **hyperparameters**
- Estimate how accurately the model is likely to perform on **unseen data**
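
As a rough illustration of the second point, the sketch below compares a few settings of one hyperparameter by their mean cross-validated accuracy. The candidate values for `C` and the use of the Iris dataset here are assumptions chosen for illustration, not a recommended search grid.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hypothetical candidate values for the regularization strength C
candidate_C = [0.01, 0.1, 1.0, 10.0]

mean_scores = {}
for C in candidate_C:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    # For a classifier, cv=5 means 5-fold stratified cross-validation
    mean_scores[C] = cross_val_score(model, X, y, cv=5).mean()

# Keep the setting with the highest mean cross-validated accuracy
best_C = max(mean_scores, key=mean_scores.get)
print("Mean accuracy per C:", mean_scores)
print("Best C:", best_C)
```

The same loop works for comparing different model types: score each candidate with the same folds and keep the one with the best average.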

---

## 43.2 The Problem with Train-Test Split

If we split the data only once into training and test sets, the measured performance depends entirely on that **one random split**.

A bad split may lead to:

- Biased or misleading performance estimates  
- Overly optimistic or pessimistic accuracy  
- Poor generalization measurement  

Cross-validation solves this by using **multiple train-test splits** and averaging results, giving a more **reliable estimate** of model performance.
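
To see how much a single split can matter, the following sketch repeats an 80/20 split with ten different seeds and records the test accuracy each time. The seeds, the split ratio, and the helper name `single_split_accuracy` are assumptions made for this illustration; the exact numbers will depend on your scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

def single_split_accuracy(seed):
    """Accuracy from one 80/20 train-test split drawn with the given seed."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Same model, same data, only the random split changes
split_scores = [single_split_accuracy(seed) for seed in range(10)]
print("Accuracy per split:", split_scores)
print("Min:", min(split_scores), "Max:", max(split_scores))
```

Any spread between the minimum and maximum comes purely from which rows happened to land in the test set, which is the variability that averaging over several splits is meant to smooth out.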


## 43.3 Types of Cross-Validation

1. **K-Fold Cross-Validation**: The dataset is split into k folds of (nearly) equal size. Each fold serves as the test set exactly once, while the remaining k-1 folds are used for training. The final score is the average of the k evaluations.
2. **Stratified K-Fold**: Like K-Fold, but each fold keeps approximately the same class distribution as the full dataset. This matters for classification problems, especially when classes are imbalanced.
3. **Leave-One-Out (LOO)**: Each data point is used once as a single-sample test set while all the others form the training set. This needs one model fit per data point, so it is very slow for large datasets.
4. **Shuffle Split / Repeated K-Fold**: Shuffle Split draws multiple independent random train-test splits; Repeated K-Fold runs K-Fold several times with a different shuffling each time. Both give a more robust evaluation at the cost of extra model fits.
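
The splitting schemes above can be inspected directly, without training any model. This minimal sketch uses a toy array of 10 samples (an assumption made purely for illustration) to show which indices K-Fold assigns to each test fold and how many splits Leave-One-Out produces.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = np.arange(10).reshape(-1, 1)  # 10 toy samples, one feature each

# K-Fold without shuffling: consecutive blocks of indices become the test folds
kf_toy = KFold(n_splits=5)
test_folds = [test_idx.tolist() for _, test_idx in kf_toy.split(X_toy)]
print(test_folds)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# Leave-One-Out: one split (and therefore one model fit) per sample
n_loo_splits = LeaveOneOut().get_n_splits(X_toy)
print(n_loo_splits)  # 10
```

Every index appears in exactly one test fold, which is what lets K-Fold use all of the data for both training and evaluation.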

### Libraries Required

In [9]:
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import pandas as pd


In [10]:
# Load dataset
iris = load_iris()

X = iris.data
y = iris.target

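# Scale features to zero mean and unit variance.
# Note: fitting the scaler on all rows lets the future test folds influence
# the scaling; wrapping the scaler and model in a Pipeline avoids this.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)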

# Create DataFrame for scaled features
X_scaled_df = pd.DataFrame(
    X_scaled,
    columns=iris.feature_names
)

print(X_scaled_df.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0          -0.900681          1.019004          -1.340227         -1.315444
1          -1.143017         -0.131979          -1.340227         -1.315444
2          -1.385353          0.328414          -1.397064         -1.315444
3          -1.506521          0.098217          -1.283389         -1.315444
4          -1.021849          1.249201          -1.340227         -1.315444


### 43.5.3 Apply K-Fold Cross-Validation

In [14]:
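# 5 folds, shuffled with a fixed seed so the result is reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)

# Fit and score the model once per fold (classifiers are scored by accuracy by default)
scores = cross_val_score(model, X_scaled, y, cv=kf)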

print("Accuracy for each fold:", scores)
print("Average Accuracy:", scores.mean())


Accuracy for each fold: [0.96666667 0.96666667 0.96666667 0.93333333 0.93333333]
Average Accuracy: 0.9533333333333334
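
The cell above scales the whole dataset once before cross-validating, so each test fold has already contributed to the scaler's mean and variance. One common way to avoid this is to put the scaler and the model in a pipeline, so the scaler is re-fit on the training folds only. The sketch below is one possible variant under that assumption, not a replacement for the result shown above; its scores may differ slightly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Within each split, the scaler is fit on the training folds only
# and then applied to the held-out fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

kf = KFold(n_splits=5, shuffle=True, random_state=1)
pipeline_scores = cross_val_score(pipeline, X, y, cv=kf)  # note: unscaled X

print("Accuracy for each fold:", pipeline_scores)
print("Average Accuracy:", pipeline_scores.mean())
```

On a small, clean dataset like Iris the difference is usually minor, but with more aggressive preprocessing the leaked information can make scores look better than they should.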


### Stratified K-Fold Cross-Validation


In [16]:
# Same 5-fold setup, but each fold preserves the class proportions of y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_scaled, y, cv=skf)
print("Stratified K-Fold Accuracy for each fold:", scores)
print("Average Accuracy:", scores.mean())

Stratified K-Fold Accuracy for each fold: [1.         0.96666667 0.9        1.         0.9       ]
Average Accuracy: 0.9533333333333334


### Shuffle Split

In [13]:
from sklearn.model_selection import ShuffleSplit

ss = ShuffleSplit(
    n_splits=10,      # number of re-shuffling & splitting iterations
    test_size=0.2,
    random_state=42
)

scores = cross_val_score(model, X_scaled, y, cv=ss)

print("Shuffle Split Accuracy for each split:", scores)
print("Average Accuracy:", scores.mean())


Shuffle Split Accuracy for each split: [1.         0.96666667 0.96666667 0.93333333 0.93333333 1.
 0.9        0.96666667 1.         0.93333333]
Average Accuracy: 0.96
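
Repeated K-Fold, mentioned alongside Shuffle Split earlier, can be run the same way. The sketch below is a minimal example assuming 5 folds repeated 3 times (both values chosen for illustration), which yields 15 scores; it scales inside a pipeline rather than reusing `X_scaled`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation repeated 3 times, reshuffled on each repeat
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
repeated_scores = cross_val_score(model, X, y, cv=rkf)

print("Number of evaluations:", len(repeated_scores))  # 15
print("Average Accuracy:", repeated_scores.mean())
print("Standard deviation:", repeated_scores.std())
```

Reporting the standard deviation next to the mean gives a sense of how much the estimate moves from split to split. Unlike Shuffle Split, each repeat still tests every sample exactly once.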
