# **11. K-Fold Cross-Validation (CV)**

K-fold Cross-Validation (CV) is a technique used to evaluate the performance of a machine learning model by dividing the dataset into **K-equally sized Subsets/Fold**. In each iteration, one fold is used for testing, while the remaining K-1 folds are used for training. This process is repeated K times, with each fold serving as a test set exactly once. The results from all iterations are then averaged to provide a more reliable estimate of the model's performance.

K-fold CV helps reduce the risk of overfitting and provides a better generalization of the model on unseen data.

The choice of K is an important factor in K-fold CV. Typically, K is set between 5 or 10, though it can vary depending on the dataset size. A higher K requires a more accurate estimate of model performance but requires more comptation. One common variation is **Stratified K-Fold** where the data is split in such a way that the class distribution remains consistent across all folds, which is particularly useful for imbalanced datasets.

## **Pseudo-Code for K-Fold Cross Validation:**

1. Split the dataset into **K** equally sized folds.

2. For each fold (i from 1 to K): \

    a. Set i-th fold as the **test** set. \
    b. Combine the remaining **K-1** folds as the **training** set. \
    c. Train the model on the training set.\
    d. Evaluate the model on the test set, and record the performance metric. \

3. Calculate the average performance across all the K-iterations.

## **1. K-Fold CV implementation on the Digits Dataset**

The `load_digits()` function from the `sklearn.datasets` module loads the Digits dataset, which is used for digit recognition tasks. This dataset is not a DataFrame but an object of type `Bunch`. This object behaves like a dictionary and contains several attributes.

Here are some key components of the `digits` object:

- **`digits.data`**: A 2D NumPy array of shape (n_samples, n_features), where each row represents a sample (a digit image) and each column represents a pixel.
- **`digits.target`**: A 1D NumPy array of shape (n_samples,) that contains the label (0-9) corresponding to each digit image.
- **`digits.images`**: A 3D NumPy array of shape (n_samples, 8, 8), representing the actual images of the digits in an 8x8 format.
- **`digits.DESCR`**: A string describing the dataset.

In [17]:
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')


In [6]:
# 1. Load digits data and set up a dataframe
digits = load_digits()
df = pd.DataFrame(digits.data)
df['target'] = digits.target
df.head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,55,56,57,58,59,60,61,62,63,target
0,0.0,0.0,5.0,13.0,9.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,6.0,13.0,10.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,12.0,13.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,11.0,16.0,10.0,0.0,0.0,1
2,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,11.0,16.0,9.0,0.0,2
3,0.0,0.0,7.0,15.0,13.0,1.0,0.0,0.0,0.0,8.0,...,0.0,0.0,0.0,7.0,13.0,13.0,9.0,0.0,0.0,3
4,0.0,0.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,16.0,4.0,0.0,0.0,4


In [14]:
# 2. Set up models for Logistic, SVM and Random Forest
logistic = LogisticRegression()
svm = SVC()
rf = RandomForestClassifier(n_estimators = 100)

In [8]:
# 3. Create train-test splits
X = df.drop('target', axis=1)
y = df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)

In [22]:
# 4. Fit the 3 different models and print the accuracy scores:
for model in (logistic, svm, rf):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{model.__class__.__name__} accuracy: {round(accuracy, 2)}")

# Recognize that the samples for test and train are not necessarily uniform
# This means that the scores will keep changing if the random_state is not initialized
# This happens because the underlying samples keep on changing

LogisticRegression accuracy: 0.97
SVC accuracy: 0.99
RandomForestClassifier accuracy: 0.97


In [25]:
# 5. Implement a K-Fold CV process to evaluate the models
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

kf = KFold(n_splits=3) # Number of folds


The line `kf = KFold(n_splits=3)` initializes a **KFold cross-validation** object in scikit-learn, with the following meaning:

- **KFold**: It is a class in scikit-learn used to split the dataset into `k` number of smaller subsets (called "folds") for cross-validation. This is a method used to evaluate the performance of a machine learning model.

- **n_splits=3**: This specifies that the data should be split into **3 subsets (or folds)**. In cross-validation, each fold gets a turn to be the test set, while the remaining folds are used for training. The process helps assess how well the model generalizes to new, unseen data.

### Breakdown:
- For `n_splits=3`, your dataset will be divided into 3 roughly equal parts.
- The model will be trained 3 times:
  - In each iteration, it will be trained on 2/3 of the data and tested on the remaining 1/3.
  - The performance score will be averaged over the 3 runs to give a more robust estimate of how the model performs on unseen data.

### Example:

If you have a dataset with 9 samples and you set `n_splits=3`, it will split the dataset into 3 folds, each containing 3 samples. Each fold will take a turn being used as the test set.

Here’s a breakdown of how it might work with an example dataset:

1. **Fold 1 (Test)**, **Fold 2 + Fold 3 (Train)**
2. **Fold 2 (Test)**, **Fold 1 + Fold 3 (Train)**
3. **Fold 3 (Test)**, **Fold 1 + Fold 2 (Train)**

In each iteration, the model will be trained on 2/3 of the data and tested on 1/3. Finally, the model's performance scores will be averaged to give an overall estimate of how well the model is performing.

kf.split creates an iterator over the dataset.
* for train_index in kf.split(dataset): loop over the training indices
* for test_index in kf.split(dataset): loop over the test indices

In [29]:
# Example: Test the test-train split on a dataset: [1,2,3,4,5,6,7,8,9]
for train_index, test_index in kf.split([1, 2, 3, 4, 5, 6, 7, 8, 9]):
    print(f"train on: {train_index}, test on: {test_index}")

train on: [3 4 5 6 7 8], test on: [0 1 2]
train on: [0 1 2 6 7 8], test on: [3 4 5]
train on: [0 1 2 3 4 5], test on: [6 7 8]


In [34]:
# 6. Write a function that returns the score by taking as input:
  # 1. model
  # 2. X_train
  # 3. X_test
  # 4. y_train
  # 5. y_test

def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return round(model.score(X_test, y_test),2)

In [35]:
# Example: Apply it on LogisticRegression
get_score(logistic, X_train, X_test, y_train, y_test)

0.97

In [36]:
# 7. Set up a Stratified K-fold CV:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=3)


We now iterate through the models. For each iteration, apply the `get_score()` method and append the scores to a set of predefined vectors:

* scores_logistc
* scores_svm
* scores_rf

In [37]:
# 8. Set up the score vectors/lists
scores_logistic = scores_svm = scores_rf = []

In [42]:
# 9. Apply the get_score() method on the splits
for train_index, test_index in skf.split(digits.data, digits.target):

    X_train = digits.data[train_index]   # Feature data for training, selected by train_index
    X_test = digits.data[test_index]     # Feature data for testing, selected by test_index
    y_train = digits.target[train_index] # Target labels for training, selected by train_index
    y_test = digits.target[test_index]   # Target labels for testing, selected by test_index

    # print(get_score(logistic, X_train, X_test, y_train, y_test))
    scores_logistic.append(get_score(logistic, X_train, X_test, y_train, y_test))
    # print(get_score(svm, X_train, X_test, y_train, y_test))
    scores_svm.append(get_score(svm, X_train, X_test, y_train, y_test))
    # print(get_score(rf, X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(rf, X_train, X_test, y_train, y_test))




In [45]:
# 10. Take the average of all the scores
scores_logistic_average = np.mean(scores_logistic)
scores_svm_average = np.mean(scores_svm)
scores_rf_average = np.mean(scores_rf)

for score in (scores_logistic_average, scores_svm_average, scores_rf_average):
    print(f"Average score: {round(score, 2)}")

Average score: 0.94
Average score: 0.94
Average score: 0.94


In [49]:
# 11. we can do all of the above using the cross_val_score() function:
# cross_val_score(logistic, digits.data, digits.target)
# cross_val_score(svm, digits.data, digits.target)
# cross_val_score(rf, digits.data, digits.target)

for i in (logistic, svm, rf):
  score = cross_val_score(i, digits.data, digits.target, cv=3)
  print(f"{i.__class__.__name__} average score: {round(np.mean(score), 2)}")

LogisticRegression average score: 0.93
SVC average score: 0.97
RandomForestClassifier average score: 0.94


## **2. Apply K-Fold CV to Iris dataset**


In [53]:
# 1. Load the Iris dataset, perform encoding
url = "https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv"
iris = pd.read_csv(url)
iris['species'] = pd.Categorical(iris['species'])
iris['species_n'] = iris.species.cat.codes
iris.head()

# Setosa = 0
# Versicolor = 1
# Virginica = 2

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_n
0,5.1,3.5,1.4,0.2,setosa,0
1,4.9,3.0,1.4,0.2,setosa,0
2,4.7,3.2,1.3,0.2,setosa,0
3,4.6,3.1,1.5,0.2,setosa,0
4,5.0,3.6,1.4,0.2,setosa,0


In [54]:
# 2. Create models
logistic = LogisticRegression()
svm = SVC()
rf = RandomForestClassifier(n_estimators = 100)

In [55]:
# 3. Create a stratified k-fold with 3 splits
skf = StratifiedKFold(n_splits=3)

In [56]:
# 4. Create a function that takes in the test train splits and calculates the score
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    return round(score,2)

In [62]:
# 5. Initialize empty lists, and create X input matrix, y target vector
logistic_scores = svm_scores = rf_scores = []
X = np.array(iris.drop(['species', 'species_n'], axis=1))
y = iris['species_n']


In [61]:
X

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [63]:
# 6. Iterate over the 3 folds, apply the get_score() method to each fold, for every model
for train_index, test_index in skf.split(X, y): # has to be np.array()
    X_train = X[train_index]
    X_test = X[test_index]
    y_train = y[train_index]
    y_test = y[test_index]

    logistic_scores.append(get_score(logistic, X_train, X_test, y_train, y_test))
    svm_scores.append(get_score(svm, X_train, X_test, y_train, y_test))
    rf_scores.append(get_score(rf, X_train, X_test, y_train, y_test))




In [65]:
# 7. Get the means for logistic_scores, svm_scores, rf_scores. Evaluate model performances
logistic_scores_average = np.mean(logistic_scores)
svm_scores_average = np.mean(svm_scores)
rf_scores_average = np.mean(rf_scores)

for score in (logistic_scores_average, svm_scores_average, rf_scores_average):
    print(f"Average score: {round(score, 2)}")

Average score: 0.97
Average score: 0.97
Average score: 0.97


In [67]:
# 8. Do the same but with cross_val_score() method
# cross_val_score(logistic, X, y, cv=3)
# cross_val_score(svm, X, y, cv=3)
# cross_val_score(rf, X, y, cv=3)

for model in (logistic, svm, rf):
    score = cross_val_score(model, X, y, cv=3)
    print(f"{model.__class__.__name__} average score: {round(np.mean(score), 2)}")

LogisticRegression average score: 0.97
SVC average score: 0.96
RandomForestClassifier average score: 0.96
