<a href="https://colab.research.google.com/github/samiha-mahin/Data-Analysis/blob/main/KFold_Cross_Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



### 🌟 What is K-Fold Cross Validation?

K-Fold Cross Validation is a technique used to **test how well a machine learning model will perform on unseen data**.

Instead of training your model once, you:

1. Split your dataset into **K roughly equal parts** (called "folds").
2. Train your model **K times**, each time using a different fold as the **test set**, and the rest as the **training set**.
3. **Average the results** from each run to get a more **reliable score**.
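
The splitting step can be sketched by hand with NumPy. This is only an illustrative sketch, not how scikit-learn implements it; the helper name `k_fold_indices` is made up for this example. It also shows what happens when the dataset size is not divisible by K: the fold sizes differ by at most one.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k consecutive folds (hypothetical helper)."""
    indices = np.arange(n_samples)
    # array_split allows n_samples that is not divisible by k
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        # Every fold except the i-th one goes into the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# 7 samples split into 3 folds
splits = list(k_fold_indices(n_samples=7, k=3))
for run, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Run {run}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each sample lands in the test set exactly once across the runs, which is what makes the averaged score use all of the data.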

---

### 🎨 Imagine This:

You have **10 paintings** (data points) and want to check how good an art-judging AI is.

Let’s use **5-Fold Cross Validation**:

#### Step-by-step:

1. **Split the 10 paintings into 5 equal parts (folds)**

   * Fold 1: Painting 1, 2
   * Fold 2: Painting 3, 4
   * Fold 3: Painting 5, 6
   * Fold 4: Painting 7, 8
   * Fold 5: Painting 9, 10

2. **Run the model 5 times:**

   * **1st run**: Train on Folds 2–5, Test on Fold 1
   * **2nd run**: Train on Folds 1,3,4,5, Test on Fold 2
   * **3rd run**: Train on Folds 1,2,4,5, Test on Fold 3
   * **4th run**: Train on Folds 1,2,3,5, Test on Fold 4
   * **5th run**: Train on Folds 1,2,3,4, Test on Fold 5

3. **Record the accuracy** of each run.

4. **Average the 5 accuracy scores** to get the final performance.
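
The painting example can be reproduced with scikit-learn's `KFold`. This is a small sketch in which the array `paintings` simply holds the numbers 1 to 10 as stand-ins for real data; no model is trained, it only prints which paintings go to training and testing in each run.

```python
import numpy as np
from sklearn.model_selection import KFold

paintings = np.arange(1, 11)  # stand-in data: paintings numbered 1..10
kf = KFold(n_splits=5)        # no shuffling, so folds are consecutive pairs

runs = []
for run, (train_idx, test_idx) in enumerate(kf.split(paintings), start=1):
    train = paintings[train_idx].tolist()
    test = paintings[test_idx].tolist()
    runs.append((train, test))
    print(f"Run {run}: train on {train}, test on {test}")
```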

---

### ✅ Why Use It?

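* Helps **detect overfitting** (a model that scores well on its training data but poorly on new data), because every score comes from data the model did not see during training. It does not prevent overfitting by itself.
* Gives a **more reliable estimate** of real-world performance than a single train/test split, since every data point is used for testing exactly once.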
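
One way to see the gap between training performance and performance on unseen data is to compare the two directly. The sketch below is an illustration under assumptions of my own: it uses an unpruned `DecisionTreeClassifier` on the iris data (a model chosen only because it tends to memorize its training set) and a shuffled 5-fold split with an arbitrary `random_state`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Accuracy on the same data the model was trained on (optimistic)
train_acc = model.fit(X, y).score(X, y)

# Accuracy on held-out folds (closer to performance on new data)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=cv)

print("Training accuracy:", train_acc)
print("Cross-validated accuracy:", cv_scores.mean())
```

A training score noticeably higher than the cross-validated score is a sign that the model is overfitting.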

---



In [1]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
import numpy as np

In [3]:
data = load_iris()

In [4]:
data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [5]:
X = data.data
y = data.target

In [6]:
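# Set up 5-fold cross-validation.
# Note: KFold does not shuffle by default, so the folds are consecutive slices.
# The iris samples are ordered by class, so each test fold here covers only one or two classes.
kf = KFold(n_splits=5)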

In [7]:
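accuracies = []
for train_index, test_index in kf.split(X):
    # Fit a fresh model on the training folds only
    model = LogisticRegression(max_iter=200)
    model.fit(X[train_index], y[train_index])
    # Score it on the held-out fold
    predictions = model.predict(X[test_index])
    accuracies.append(accuracy_score(y[test_index], predictions))

print("Average accuracy:", np.mean(accuracies))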

Average accuracy: 0.9266666666666665
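
Because the iris samples are stored in class order and `KFold` does not shuffle by default, the folds above are not representative of the class mix. A possible variant, sketched here as a suggestion rather than as part of the original notebook, uses `StratifiedKFold` with shuffling so that each fold keeps roughly the same class proportions, and lets `cross_val_score` run the train/test loop. The `random_state` value is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Shuffle and stratify so every fold contains all three classes
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score fits a fresh clone of the model for each fold
scores = cross_val_score(
    LogisticRegression(max_iter=200), X, y, cv=skf, scoring="accuracy"
)

print("Per-fold accuracy:", scores)
print("Average accuracy:", scores.mean())
```

Looking at the per-fold scores as well as the mean gives a sense of how much the estimate varies from fold to fold.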
