### Dataset Description
The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables and one target variable, Outcome.

    Pregnancies: Number of times pregnant
    Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    BloodPressure: Diastolic blood pressure (mm Hg)
    SkinThickness: Triceps skin fold thickness (mm)
    Insulin: 2-Hour serum insulin (mu U/ml)
    BMI: Body mass index (weight in kg/(height in m)^2)
    DiabetesPedigreeFunction: Diabetes pedigree function
    Age: Age (years)
    Outcome: Class variable (0 or 1)


### Importing Required Packages

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold

### Load the data

In [3]:
diabetes_data = pd.read_csv("diabetes.csv")
diabetes_data.head()

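|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |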


Extracting features and labels from the diabetes data

In [4]:
x_data = diabetes_data.iloc[:,:-1].values
y_data = diabetes_data.iloc[:,-1].values
x_data.shape, y_data.shape

((768, 8), (768,))
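
Positional slicing with `iloc` works here because `Outcome` is the last column. As a minimal sketch of an alternative, the same split can be written by column name, which keeps working if the column order changes. The `sample` frame below is a hand-typed copy of the five rows shown by `head()`, and the names `sample`, `x_sample`, and `y_sample` are illustrative only.

```python
import pandas as pd

# Hand-typed copy of the five rows shown by head(); for illustration only
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Select the label by name and treat every other column as a feature
x_sample = sample.drop(columns="Outcome").values
y_sample = sample["Outcome"].values

print(x_sample.shape, y_sample.shape)
```

On these five rows the shapes are `(5, 8)` and `(5,)`, mirroring the `(768, 8)` and `(768,)` shapes of the full dataset.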

### Apply KFold

In [5]:
def crossvalidation(kfold):
    """Print the mean test accuracy of a depth-2 decision tree over the folds of `kfold`."""
    scores_Test = []
    for train_index, test_index in kfold.split(x_data):
        # Split the data into training and test folds
        x_train, x_test = x_data[train_index], x_data[test_index]
        y_train, y_test = y_data[train_index], y_data[test_index]

        # Create a DecisionTree classifier limited to a depth of 2
        decision_tree2 = DecisionTreeClassifier(max_depth=2)

        # Fit the model on the training fold, then record its accuracy on the test fold
        decision_tree2.fit(x_train, y_train)
        scores_Test.append(decision_tree2.score(x_test, y_test))
    print("Average score of the Testing set %.2f"%np.mean(scores_Test))
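
The loop in `crossvalidation` relies on `KFold.split` yielding one pair of index arrays (training indices, test indices) per fold. The sketch below shows what those pairs look like on a small toy array. `toy_x` and `kf_demo` are hypothetical names, and the toy array is not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples with 2 features each
toy_x = np.arange(20).reshape(10, 2)

# Without shuffling, each test fold is a contiguous block of rows
kf_demo = KFold(n_splits=5)
folds = list(kf_demo.split(toy_x))

for fold_number, (train_index, test_index) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train={train_index}, test={test_index}")

# Collect the test indices of all folds to check that they cover every sample once
all_test_indices = np.concatenate([test_index for _, test_index in folds])
```

Each sample appears in exactly one test fold, so every row is used for testing once and for training in the remaining folds.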

In [12]:
# Set up KFold with 10 splits:
kf = KFold(n_splits=10)

# crossvalidation prints the average score over the test folds
crossvalidation(kf)

Average score of the Testing set 0.76
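
For comparison, scikit-learn's `cross_val_score` helper runs the same fit-and-score loop in one call. The sketch below is not the notebook's method. It uses synthetic stand-in data from `make_classification` with the same shape as the diabetes features, so the score it prints will differ from the result above. The names `demo_x`, `demo_y`, `kf_demo`, and `fold_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 768 samples, 8 features, binary labels
demo_x, demo_y = make_classification(n_samples=768, n_features=8, random_state=0)

kf_demo = KFold(n_splits=10)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)

# Refits a fresh copy of the tree on each training fold and returns
# one accuracy value per test fold
fold_scores = cross_val_score(tree, demo_x, demo_y, cv=kf_demo)

print("Average score of the testing set: %.2f" % fold_scores.mean())
```

Replacing `demo_x` and `demo_y` with `x_data` and `y_data` would apply the same procedure to the diabetes data.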
