### 🔷 *Cross-validation & GridSearchCV*

* Understand the need for *cross-validation*

* Learn KFold, StratifiedKFold, and when to use each

* Implement *GridSearchCV* to tune XGBoost parameters

* 👉 *Practice*: Tune max_depth, n_estimators, and learning_rate using GridSearchCV

---

## 🔷 **Step 1: Why Do We Need Cross-validation?**

(And what actually happens behind the scenes)

---

### 🧠 First, Let's Understand the Problem:

When we train a machine learning model, we usually split the data like this:

* **80%** → training

* **20%** → testing

You train on 80%, then test how well it performs on the 20%.

But here's the catch:

> 🤔 What if your 20% test set just happens to be **really easy or really hard**?

> That single accuracy score could be **misleading**.

So how can we be more confident that our model will perform well on *any* data?
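
To make the problem concrete, here is a minimal sketch that repeats the same 80/20 split with ten different shuffles and records the test accuracy each time. The `DecisionTreeClassifier` is only a lightweight stand-in for whatever model you are evaluating, and the names `split_scores` and `seed` are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Repeat the same 80/20 split with different shuffles and record test accuracy.
split_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0)  # stand-in model (assumption)
    model.fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

print("Accuracy per split:", [round(s, 3) for s in split_scores])
print("Lowest:", min(split_scores), "Highest:", max(split_scores))
```

Any gap between the lowest and highest score comes purely from which rows landed in the test set, since the model and the data are otherwise identical.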

---

### ✅ Solution: Cross-validation

**Cross-validation** is a smarter way of testing.

Instead of testing on just one slice of the data, we do it on **many slices**.

---

### 🔄 How Does It Work?

Let's take an example of **5-Fold Cross-validation** (K=5):

Imagine your dataset has 100 rows.

In 5-Fold CV:

1. Split it into **5 equal parts** (each part has 20 rows).

2. For each round:

   * Use 4 parts to train (80 rows)

   * Use 1 part to test (20 rows)

3. Repeat this **5 times**, each time with a different part as the test set.

4. Finally, take the **average accuracy** of all 5 rounds.

```text
Fold 1: Train → parts 2+3+4+5 | Test → part 1
Fold 2: Train → parts 1+3+4+5 | Test → part 2
Fold 3: Train → parts 1+2+4+5 | Test → part 3
Fold 4: Train → parts 1+2+3+5 | Test → part 4
Fold 5: Train → parts 1+2+3+4 | Test → part 5
```

This gives you a **fairer estimate** of your model's performance.
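
The same 100-row, 5-fold layout can be reproduced with scikit-learn's `KFold`. This is only a sketch: `rows` is a made-up array standing in for a real dataset, and shuffling is left off so that each test fold is a contiguous block of 20 rows.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(100)      # stand-in for a dataset with 100 rows (assumption)
kfold = KFold(n_splits=5)  # no shuffling, so folds are contiguous blocks

fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(rows), start=1):
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"Fold {fold}: train on {len(train_idx)} rows, "
          f"test on rows {test_idx[0]}-{test_idx[-1]}")
```

Every row ends up in a test set exactly once, which is why the averaged score uses the whole dataset for evaluation.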

---

### ✅ Why Cross-validation is Better:

| Without CV (Train/Test Split)                    | With Cross-validation                                   |
| ------------------------------------------------ | ------------------------------------------------------- |
| Only one test score                              | Multiple scores (more reliable)                         |
| Score depends on which rows land in the test set | Averaging over folds smooths out lucky/unlucky splits   |
| Good for quick testing                           | Good for final model evaluation                         |

---

### 📘 Key Terms:

* **K**: Number of folds (e.g., 5, 10)
* **Fold**: Each split (subset) of the data
* **Validation set**: The part used to test the model during CV
* **Training set**: The rest used to train the model

---

### 📌 Important Variants:

| Method            | When to Use                                                         |
| ----------------- | ------------------------------------------------------------------- |
| `KFold`           | Regression, or classification with balanced classes                 |
| `StratifiedKFold` | Classification, especially when some classes are rare               |
| `LeaveOneOut`     | Very small datasets (one model per sample, so slow on larger ones)  |
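
As a small illustration of why `LeaveOneOut` only suits tiny datasets, the sketch below counts the rounds it needs. `X_small` is a toy array invented for this example.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X_small = np.arange(12).reshape(6, 2)  # toy dataset with 6 samples (assumption)
loo = LeaveOneOut()

# One round per sample: each round holds out exactly one row for testing.
n_rounds = loo.get_n_splits(X_small)
test_sizes = [len(test_idx) for _, test_idx in loo.split(X_small)]

print("Rounds:", n_rounds)
print("Test size per round:", test_sizes)
```

With 6 samples that is 6 model fits; with 10,000 samples it would be 10,000 fits, which is where the cost comes from.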

---

### ⚠️ Without Cross-validation:

You may get:

* High variance (accuracy changes if you shuffle the data)

* Overfitting (you over-optimize for one test set)

With Cross-validation:

* You reduce the risk of choosing a model that just got lucky

---

## 🔷 **Step 2: Learn the Cross-validation Techniques**

We'll cover:

1. What is `KFold`

2. What is `StratifiedKFold`

3. When to use which

4. Complete code example 

---

### 🔷 **Part 1: What is KFold?**

### ✅ Definition:

`KFold` (K-Fold Cross Validation) is a technique to evaluate your ML model more **reliably** than a single train-test split.

### 📌 How it works:

* You divide your dataset into **K equal parts** (called **folds**).

* You train your model on **K-1 folds**, and test on the **remaining fold**.

* You repeat this **K times**, each time changing the test fold.

* Then you **average** the performance across the K runs.

### 📊 Example:

If you have 100 samples and choose `K=5`, then:

* Each fold = 20 samples

* The model will be trained/tested 5 times:

  * Train on 80, test on 20 → five different combinations

### ✅ Advantages:

* Gives a better idea of your model's performance

* Helps you avoid overfitting to a single test set

---

### 🔷 **Part 2: What is StratifiedKFold?**

### ✅ Definition:

`StratifiedKFold` is just like `KFold`, **but smarter** for classification problems.

### 📌 Key Difference:

It **maintains the original class distribution** in each fold.

#### Why is that important?

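If your target class is imbalanced (e.g., 90% Class A, 10% Class B), a plain `KFold` can leave some folds with few or no Class B samples. In the worst case, when the rows are sorted by class and not shuffled, **all of Class B lands in one fold**, making evaluation unfair.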

**StratifiedKFold** ensures that **each fold has the same % of Class A and Class B** as the full dataset.
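
The 90/10 scenario above can be checked directly. This is a sketch on synthetic labels: `y` is sorted by class on purpose to show the worst case for an unshuffled `KFold`, and `minority_per_fold` is a helper invented for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels: 90 samples of Class A (0) followed by 10 of Class B (1).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant for the split itself

def minority_per_fold(splitter):
    """Count how many Class B samples end up in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

plain = minority_per_fold(KFold(n_splits=5))
stratified = minority_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)       # [0, 0, 0, 0, 10]
print("StratifiedKFold:", stratified)  # [2, 2, 2, 2, 2]
```

Setting `shuffle=True` on `KFold` makes the extreme case unlikely, but only `StratifiedKFold` guarantees that every fold keeps the 90/10 ratio.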

---

### 🔷 **Part 3: When to Use Which?**

| Technique         | Keeps Class Ratio | Best Used For                          |
| ----------------- | ----------------- | -------------------------------------- |
| `KFold`           | ❌ No             | Regression, Balanced Classification    |
| `StratifiedKFold` | ✅ Yes            | Classification with Imbalanced Classes |

So:

* Use **`KFold`** when your data is balanced or you're doing regression.

* Use **`StratifiedKFold`** when you're working on classification and your target classes are imbalanced.
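
scikit-learn applies the same rule of thumb when you pass a plain integer such as `cv=5`. The sketch below uses `check_cv`, the utility that resolves an integer into a splitter, to show which one is chosen; `y_class` and `y_reg` are made-up targets.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y_class = np.array([0, 1] * 10)    # made-up class labels (assumption)
y_reg = np.linspace(0.0, 1.0, 20)  # made-up continuous target (assumption)

# An integer cv becomes StratifiedKFold for classification targets
# and plain KFold otherwise.
cv_for_classifier = check_cv(cv=5, y=y_class, classifier=True)
cv_for_regressor = check_cv(cv=5, y=y_reg, classifier=False)

print(type(cv_for_classifier).__name__)
print(type(cv_for_regressor).__name__)
```

Note that the splitter built from an integer does not shuffle, so pass an explicit `StratifiedKFold(shuffle=True, random_state=...)` when you need shuffling.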

---

### 🔷 **Part 4: Code Example with Full Line-by-Line Explanation**

We'll use the **Iris dataset** and try both KFold and StratifiedKFold.

### ✅ 1. Import the Libraries

---

In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
# KFold, StratifiedKFold → Cross-validation tools.
# cross_val_score → Automates the CV process.
from xgboost import XGBClassifier

### ✅ 2. Load the Dataset

---

In [5]:
iris = load_iris()
x = iris.data   # Features (petal length, sepal width, etc.)
y = iris.target # Labels (0, 1, 2 for 3 types of flowers)

### ✅ 3. Initialize the XGBoost Classifier

---

In [17]:
# Older XGBoost versions needed two extra arguments:
#   use_label_encoder=False → suppresses a warning
#   eval_metric="mlogloss"  → log loss, suitable for multi-class classification
# model = XGBClassifier(use_label_encoder=False, eval_metric="mlogloss")
# Newer versions do not need either argument, so the defaults are enough:
model = XGBClassifier()

### ✅ 4. Perform KFold Cross-validation

---

In [18]:
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores_k = cross_val_score(model, x, y, cv=kfold)

print("KFold CV Scores: ", scores_k)
print("Average Accuracy (KFold):", scores_k.mean())

# n_splits=5 → 5 folds (5 runs)
# shuffle=True → shuffles the rows before splitting
# random_state=42 → makes the shuffle reproducible
# cross_val_score() → automates the train/test split, training, and scoring
# mean() → averages the 5 scores

KFold CV Scores:  [1.         0.96666667 0.93333333 0.9        0.93333333]
Average Accuracy (KFold): 0.9466666666666667
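
To see what `cross_val_score` automates, here is a hand-written version of the same loop. It is a sketch rather than scikit-learn's actual implementation, and it swaps in a `DecisionTreeClassifier` as a deterministic stand-in for `XGBClassifier` so that the two results can be compared exactly.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)  # stand-in model (assumption)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop: fresh copy of the model per fold, fit on train, score on test.
manual_scores = []
for train_idx, test_idx in kfold.split(X, y):
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# Same thing in one call.
auto_scores = cross_val_score(model, X, y, cv=kfold)

print("Manual:", np.round(manual_scores, 4))
print("Auto:  ", np.round(auto_scores, 4))
```

Cloning the model for every fold matters: it keeps anything learned on one fold from carrying over into the next.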


### ✅ 5. Perform StratifiedKFold Cross-validation

---

In [20]:
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_sk = cross_val_score(model, x, y, cv=skfold)

print("StratifiedKFold CV Scores:", scores_sk)
print("Average Accuracy (StratifiedKFold):", scores_sk.mean())

# Everything is the same as above; only the CV strategy differs, as it keeps the class balance in each fold.

StratifiedKFold CV Scores: [0.96666667 0.96666667 0.9        0.96666667 0.9       ]
Average Accuracy (StratifiedKFold): 0.9400000000000001


🎯 **Conclusion:**
Both techniques are useful, but StratifiedKFold is **more reliable for classification**, especially when your data is not balanced. Iris has exactly 50 samples per class, so the two averages are close here; the gap grows when some classes are rare.

---


## 🔷 **Step 3: What is GridSearchCV?**

### ✅ Definition:

`GridSearchCV` is a **brute-force method** to find the **best hyperparameters** for your model by:

1. Trying out **all combinations** of hyperparameters you provide
2. Using **cross-validation** (like KFold/StratifiedKFold) to test each combination
3. Returning the **best parameters** based on scoring (e.g., accuracy)
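
The three steps above can be sketched by hand with a plain loop. This is an illustration of the idea, not how `GridSearchCV` is implemented; the small `grid` and the `DecisionTreeClassifier` stand-in are assumptions chosen to keep the example fast.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical search space for a stand-in model.
grid = {"max_depth": [2, 3, 4], "min_samples_leaf": [1, 5]}

best_score, best_params = -1.0, None
n_tried = 0
for values in product(*grid.values()):          # 1. every combination
    params = dict(zip(grid.keys(), values))
    model = DecisionTreeClassifier(random_state=0, **params)
    score = cross_val_score(model, X, y, cv=5).mean()  # 2. cross-validate it
    n_tried += 1
    if score > best_score:                      # 3. keep the best one
        best_score, best_params = score, params

print("Combinations tried:", n_tried)
print("Best parameters:", best_params)
print("Best CV score:", round(best_score, 4))
```

`GridSearchCV` wraps this loop and adds parallelism, result bookkeeping, and a final refit of the winning combination.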

---

### 💡 Why is this helpful?

In real projects, a model like **XGBoost** has many hyperparameters:

* `max_depth` → maximum depth of each tree
* `n_estimators` → number of trees (boosting rounds)
* `learning_rate` → how much each new tree's correction is scaled down before it is added
* `subsample`, `colsample_bytree`, `gamma`, etc.

Tuning these manually is **slow and error-prone**.
GridSearchCV automates the search 🔍

---

### ✅ Code Example: XGBoost + GridSearchCV

Let's tune 3 hyperparameters: `max_depth`, `n_estimators`, `learning_rate`
(You can try more later.)

---

In [21]:
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# GridSearchCV → performs hyperparameter tuning
# load_iris() → sample classification dataset
# train_test_split() → split into train/test
# XGBClassifier → the model we are tuning

In [22]:
# Load data
iris = load_iris()
x = iris.data
y = iris.target

# Train-test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [26]:
# Define model
xgb = XGBClassifier()

### ✅ Define Hyperparameter Grid

In [33]:
param_grid = {
    "max_depth" : [3, 4, 5],
    "n_estimators" : [50, 100, 150],
    "learning_rate" : [0.01, 0.1, 0.2]
}

# 'max_depth': how deep each tree can go (higher = more complex).
# 'n_estimators': number of trees to use.
# 'learning_rate': how strongly each new tree's correction is applied (smaller = slower, more cautious learning).

# 3 values of max_depth × 3 values of n_estimators × 3 values of learning_rate = 3 × 3 × 3 = 27 combinations to evaluate
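
If you want to check the arithmetic before launching a long search, scikit-learn's `ParameterGrid` expands a grid into the list of combinations that will be tried. The sketch below reuses the grid from this section; `n_folds` is just a local variable for the fold count.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "max_depth": [3, 4, 5],
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.01, 0.1, 0.2],
}

combos = list(ParameterGrid(param_grid))
n_folds = 5

print("Combinations:", len(combos))                   # 27
print("Fits with 5-fold CV:", len(combos) * n_folds)  # 135
print("Example combination:", combos[0])
```

Adding a fourth hyperparameter with three values would triple the count to 81 combinations, which is why grids are usually kept small.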

### ✅ Apply GridSearchCV

In [29]:
grid_search = GridSearchCV(
    estimator=xgb,
    param_grid=param_grid,
    cv=5, # 5-fold cross-validation
    scoring="accuracy",
    verbose=1, # Print a one-line summary of how many fits will run (higher values print more detail)
    n_jobs=-1 # Use all CPU cores for speed
)

# cv=5 → 5-fold cross-validation
# scoring='accuracy' → metric to optimize
# verbose=1 → shows a progress summary
# n_jobs=-1 → use all cores (faster)

In [30]:
# Fit GridSearchCV
grid_search.fit(x_train, y_train)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


### ✅ Get Best Results

In [31]:
print("Best Parameters:", grid_search.best_params_)
print("Best CV Score:", grid_search.best_score_)

# .best_params_ → gives you the best hyperparameter combo
# .best_score_ → best average accuracy across 5 folds

Best Parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50}
Best CV Score: 0.95


### ✅ Evaluate on Test Set

In [32]:
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(x_test, y_test)
print("Test Accuracy with Best Params:", test_accuracy)

Test Accuracy with Best Params: 1.0


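## 🔍 What's Happening Behind the Scenes?

* GridSearchCV evaluates 27 hyperparameter combinations (in this example)
* Each combination is trained and scored using 5-fold CV
* That's **135 total fits** (27 × 5)
* It picks the combination that worked best on average across the CV folds ✅
* Because `refit=True` is the default, it then retrains that combination once more on the full training set; this is the model stored in `best_estimator_`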
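
The per-combination scores are kept in the `cv_results_` attribute after fitting. The sketch below prints them for a deliberately tiny search; the `DecisionTreeClassifier` and its one-parameter grid are stand-ins chosen so the example runs quickly, not the XGBoost search from this section.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),  # stand-in model (assumption)
    param_grid={"max_depth": [2, 3, 4]},     # hypothetical small grid
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

# One entry per combination: mean and spread of the 5 fold scores, plus rank.
results = search.cv_results_
for params, mean, std, rank in zip(
    results["params"],
    results["mean_test_score"],
    results["std_test_score"],
    results["rank_test_score"],
):
    print(f"rank {rank}: {params} -> {mean:.4f} (+/- {std:.4f})")

print("Chosen:", search.best_params_)
```

Looking at the standard deviation next to the mean helps you judge whether the winner is clearly better or just ahead by noise.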

---

### 🔁 Summary

| Step | What You Did                             |
| ---- | ---------------------------------------- |
| 1    | Defined hyperparameter search space      |
| 2    | Ran GridSearchCV to try all combinations |
| 3    | Found best model                         |
| 4    | Tested it on unseen test data            |

---