
# K-Fold Cross Validation

## Context

Diabetes is one of the most common diseases worldwide, and the number of diabetic patients keeps growing. Its main cause remains unknown, but scientists believe that both genetic factors and environmental and lifestyle factors play a major role.

Individuals with diabetes are at risk of developing secondary health problems such as heart disease and nerve damage. Early detection and treatment can therefore prevent complications and reduce the risk of severe health problems. Although diabetes is incurable, it can be managed with treatment and medication.

Researchers at the Bio-Solutions lab want to gain a better understanding of this disease among women and plan to use machine learning models to identify patients who are at risk of diabetes.

We will use the Pima Indians Diabetes dataset to see how to perform cross-validation on the train set and how to interpret the results.

## Data Dictionary

* **Pregnancies:** Number of times pregnant
* **Glucose:** Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* **BloodPressure:** Diastolic blood pressure (mm Hg)
* **SkinThickness:** Triceps skinfold thickness (mm)
* **Insulin:** 2-Hour serum insulin (mu U/ml)
* **BMI:** Body mass index (weight in kg/(height in m)^2)
* **Pedigree:** Diabetes pedigree function - A function that scores likelihood of diabetes based on family history.
* **Age:** Age in years
* **Class:** Class variable (0: the person is not diabetic or 1: the person is diabetic)

## Installing and Importing the Necessary Libraries

In [None]:
# Installing the libraries with specific versions
!pip install pandas==2.2.2 numpy==2.0.2 scikit-learn==1.6.1 -q

**Note:**

- After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab) and run all cells sequentially from the next cell.
- On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the above code installs the library versions needed to run this notebook successfully.

In [None]:
# to work with dataframes and arrays
import pandas as pd
import numpy as np

# to split data into train and test
from sklearn.model_selection import train_test_split

# to build the logistic regression model
from sklearn.linear_model import LogisticRegression

# to create k folds of data and get cross validation score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# to ignore warnings
import warnings
warnings.filterwarnings('ignore')

## Load and view the dataset

In [None]:
# uncomment and run the lines below if the dataset is stored in Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/pima-indians-diabetes.csv')

df.head()

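| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | Pedigree | Age | Class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |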


In [None]:
# separating data into X and Y
X = df.drop(['Class'], axis = 1)
Y = df['Class']

In [None]:
# creating train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=1, stratify = Y)
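
To illustrate what `stratify` does in the split above, here is a minimal sketch on hypothetical toy data (`X_toy` and `y_toy` are made up for illustration and are not the diabetes data). It suggests that a stratified 70/30 split keeps the class proportions nearly identical in the train and test subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 40% positive class (not the diabetes data)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 40 + [0] * 60)

# Same split settings as in the notebook: 30% test, fixed seed, stratified on the target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)

# With stratify, the share of positives stays close to 0.40 in both subsets
print("train size:", len(y_tr), "positive rate:", y_tr.mean())
print("test size:", len(y_te), "positive rate:", y_te.mean())
```

Without `stratify`, the positive rate in a small test set can drift away from the overall rate purely by chance.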

In [None]:
# defining kfold with 10 splits; shuffling with a fixed seed keeps the folds reproducible
kfold = KFold(n_splits=10, random_state=1, shuffle=True)
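
As a rough illustration of how `KFold` partitions the data, the sketch below splits a hypothetical 10-sample array (`X_toy`, not the diabetes data) into 5 folds and prints the index sets. The smaller fold count is an assumption chosen only to keep the output short.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
kf = KFold(n_splits=5, shuffle=True, random_state=1)

seen = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy), start=1):
    # each fold holds out a different subset of rows for validation
    print(f"fold {fold}: train={train_idx}, validation={val_idx}")
    seen.extend(val_idx.tolist())

# every sample index lands in exactly one validation fold
print(sorted(seen) == list(range(10)))
```

Each sample is used for validation once and for training in the remaining folds, which is what lets every row of the train set contribute to the accuracy estimate.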

In [None]:
# defining the model
model = LogisticRegression(random_state = 1)

# storing accuracy values of model for every fold in "results"
results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
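
To make clearer what `cross_val_score` does with the folds, here is a sketch that reproduces its per-fold accuracies with an explicit loop. It runs on hypothetical synthetic data from `make_classification` (standing in for `X_train` and `Y_train`), and `max_iter=1000` is an assumption added so the solver converges on that data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data standing in for X_train / Y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

kf = KFold(n_splits=10, shuffle=True, random_state=1)
clf = LogisticRegression(random_state=1, max_iter=1000)

manual_scores = []
for train_idx, val_idx in kf.split(X_demo):
    fold_model = clone(clf)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() returns accuracy for classifiers
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(clf, X_demo, y_demo, cv=kf, scoring="accuracy")

# the explicit loop and cross_val_score should agree fold by fold
print(np.allclose(manual_scores, library_scores))
```

The model is refit from scratch on every fold, so no fold's validation rows influence the model that is scored on them.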

In [None]:
# let's see the value of accuracy for every fold
print(results)

[0.77777778 0.83333333 0.81481481 0.83333333 0.72222222 0.85185185
 0.74074074 0.66037736 0.77358491 0.71698113]


In [None]:
# let's see the mean accuracy and its standard deviation across the folds
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 77.250% (5.891%)


## Interpretation of scores
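- Mean accuracy is 77.25% with a standard deviation of 5.89%.
- The fold accuracies therefore spread roughly between 71.4% (mean - standard deviation) and 83.1% (mean + standard deviation). If the fold scores were approximately normally distributed, about 68% of them would fall in this range. This is a rough indication of how much accuracy may vary on unseen data, not a formal confidence interval.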
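
The mean and the one-standard-deviation band can be recomputed directly from the fold scores printed earlier. The sketch below does that arithmetic with NumPy; the variable names are illustrative, and `std()` is used with its default (population) definition, as in `results.std()`.

```python
import numpy as np

# fold accuracies copied from the printed cross-validation output
fold_scores = np.array([0.77777778, 0.83333333, 0.81481481, 0.83333333, 0.72222222,
                        0.85185185, 0.74074074, 0.66037736, 0.77358491, 0.71698113])

mean_acc = fold_scores.mean()
std_acc = fold_scores.std()  # population standard deviation (ddof=0)

# band of one standard deviation around the mean
lower, upper = mean_acc - std_acc, mean_acc + std_acc
print(f"mean = {mean_acc:.4f}, std = {std_acc:.4f}")
print(f"mean ± 1 std: [{lower:.4f}, {upper:.4f}]")

# share of the observed folds that actually fall inside the band
inside = np.mean((fold_scores >= lower) & (fold_scores <= upper))
print(f"share of folds inside the band: {inside:.0%}")
```

With only 10 folds, the observed share inside the band can differ noticeably from the 68% expected under a normal distribution, which is one reason to treat the band as a heuristic.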