# Cross-Validation

🔹 The Problem

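If you train and test a model on the same data, its score can look near-perfect even when the model has only memorized the training examples (overfitting) and will not generalize to new data.
So we split the data into a training set and a test set. But a single split can be misleading: by chance, the test set may be unusually easy or unusually hard, so the score depends on which samples happened to land in it.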
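
To see how much a single split can move the score, the sketch below repeats the train/test split with a few different random seeds. The seeds, the 80/20 split ratio, and the default `KNeighborsClassifier` are illustrative assumptions, not choices made elsewhere in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

split_scores = []
for seed in range(5):  # arbitrary seeds, chosen only for illustration
    # Each seed produces a different random 80/20 split of the same data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = KNeighborsClassifier().fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

# The gap between min and max shows how much one split can differ from another
print(split_scores)
print("min:", min(split_scores), "max:", max(split_scores))
```

If the printed scores differ noticeably from seed to seed, that spread is the uncertainty a single split hides.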

🔹 The Solution → Cross-Validation

Cross-validation reduces this dependence on one split by dividing the dataset into multiple folds (subsets) and training and testing the model on different combinations of them.

🟢 Example: k-Fold Cross-Validation

1. Split the data into k roughly equal parts (folds).
2. For each round:
   - Use k-1 folds for training
   - Use the remaining 1 fold for testing
3. Repeat k times, so that each fold is used exactly once as the test set.
4. Take the average score across all k runs.
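
The steps above can be sketched as an explicit loop. This is a minimal illustration using scikit-learn's `KFold` splitter; the shuffle seed, the unscaled features, and the default `KNeighborsClassifier` are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)  # seed is an arbitrary choice

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, test on the one held-out fold
    model = KNeighborsClassifier()
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Average the k per-fold accuracies into one estimate
mean_score = np.mean(fold_scores)
print(fold_scores)
print(mean_score)
```

A fresh model is created in every round so that nothing learned from one round's training folds carries over into the next.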

👉 Example with k=5:

- Round 1: Train on folds 2–5, test on fold 1
- Round 2: Train on folds 1 and 3–5, test on fold 2
- …
- Round 5: Train on folds 1–4, test on fold 5
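
The same rotation can be printed from code. The sketch below assumes a toy array of ten dummy samples (not the notebook's dataset) so that each of the 5 folds holds exactly two indices, and uses `KFold` without shuffling so the folds are contiguous.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)    # ten dummy samples, so each of the 5 folds holds 2
kf = KFold(n_splits=5)  # no shuffling: folds are contiguous blocks of indices

rounds = []
for round_no, (train_idx, test_idx) in enumerate(kf.split(data), start=1):
    rounds.append((train_idx.tolist(), test_idx.tolist()))
    print(f"Round {round_no}: train on {train_idx.tolist()}, test on {test_idx.tolist()}")
```

In round 1 the test indices are `[0, 1]` and in round 5 they are `[8, 9]`, matching the pattern of the held-out fold moving from first to last.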

🟢 Benefits

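- More reliable estimate of performance, because the score is averaged over k splits instead of depending on one.
- Uses the entire dataset for both training and testing: every sample is tested on exactly once and trained on in the other k-1 rounds.
- Helps detect overfitting: a training score far above the cross-validated score, or scores that vary widely between folds, suggests the model does not generalize well.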
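
One way to check for overfitting is to compare training and test scores fold by fold. The sketch below uses `cross_validate` with `return_train_score=True`; the choice of `n_neighbors=1` is an assumption made purely to provoke a gap, since a 1-nearest-neighbor model essentially memorizes its training data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# With n_neighbors=1 each training point is its own nearest neighbor,
# so training accuracy tends to be near-perfect regardless of generalization
result = cross_validate(
    KNeighborsClassifier(n_neighbors=1), X, y, cv=5, return_train_score=True
)
train_scores = result["train_score"]
test_scores = result["test_score"]

print("mean train score:", train_scores.mean())
print("mean test score: ", test_scores.mean())
```

A training mean well above the test mean is the warning sign; similar values suggest the model is not simply memorizing.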

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

In [2]:
X, y = load_breast_cancer(return_X_y=True)

# Caveat: fitting the scaler on all of X lets test folds leak into the scaling statistics
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [3]:
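from sklearn.model_selection import cross_val_score

# KNN with default settings (n_neighbors=5)
clf = KNeighborsClassifier()

# 5-fold cross-validation; for a classifier, an integer cv uses stratified folds
scores = cross_val_score(clf, X_scaled, y, cv=5)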

In [4]:
scores

array([0.96491228, 0.95614035, 0.98245614, 0.95614035, 0.96460177])

In [5]:
import numpy as np

np.mean(scores)

np.float64(0.9648501785437045)
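
The cells above scale the full dataset once before cross-validating, which means each test fold contributed to the mean and standard deviation used for scaling. A common way to avoid this is to put the scaler and the classifier in a pipeline, so the scaler is refit on the training folds in every round. The sketch below is one possible variant, not the notebook's original method; the names `pipe` and `pipe_scores` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The pipeline fits StandardScaler on the training folds only,
# then applies that same scaling to the held-out fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe_scores = cross_val_score(pipe, X, y, cv=5)

# Report the spread alongside the mean, since the mean alone hides fold-to-fold variation
print("mean:", np.mean(pipe_scores))
print("std: ", np.std(pipe_scores))
```

On a dataset of this size the difference from the pre-scaled version is likely to be small, but the pipeline keeps the estimate free of information from the test folds.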