# Applications of Cross Validation

1. Prevent the fluctuating values of **accuracy** caused by the **random state**.
2. Select the best model between different models.

## Application 1 - Prevent fluctuating values of accuracy caused by the random state

#### About the Dataset

The data was collected and made available by the “National Institute of Diabetes and Digestive and Kidney Diseases” as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old and of Pima Indian heritage (a subgroup of Native Americans).

The columns of the dataset are as follows:

1. Pregnancies
2. Glucose
3. BloodPressure
4. SkinThickness
5. Insulin
6. BMI
7. DiabetesPedigreeFunction
8. Age
9. Outcome

You can download the dataset from here: https://www.kaggle.com/kandij/diabetes-dataset

Before implementing the first application of cross validation, let us first build a **Logistic Regression** classifier to show how the **random state** affects the accuracy. We will then see how to reduce the **fluctuation in accuracy caused by the random state**.

#### 1. Import the dataset

In [106]:
import pandas as pd
import numpy as np

In [107]:
df = pd.read_csv("diabetes2.csv")

In [108]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [109]:
df.shape

(768, 9)

#### 2. Shuffling the dataset to prevent any kind of order effects 

In [110]:
from sklearn.utils import shuffle
df_shuffle = shuffle(df)

In [111]:
df_shuffle.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
496,5,110,68,0,0,26.0,0.292,30,0
656,2,101,58,35,90,21.8,0.155,22,0
609,1,111,62,13,182,24.0,0.138,23,0
325,1,157,72,21,168,25.6,0.123,24,0
571,2,130,96,0,0,22.6,0.268,21,0


#### 3. Separating the dependent and independent variable data

In [112]:
DV = 'Outcome'
X = df_shuffle.drop(DV,axis=1)
Y = df_shuffle[DV]

In [113]:
X.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
496,5,110,68,0,0,26.0,0.292,30
656,2,101,58,35,90,21.8,0.155,22
609,1,111,62,13,182,24.0,0.138,23
325,1,157,72,21,168,25.6,0.123,24
571,2,130,96,0,0,22.6,0.268,21


In [114]:
Y.head()

496    0
656    0
609    0
325    0
571    0
Name: Outcome, dtype: int64

#### 4. Splitting the data into train and test sets. Note the random_state argument passed to train_test_split

In [115]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size =0.3, random_state = 30)

In [116]:
X_train.shape

(537, 8)

In [117]:
X_test.shape

(231, 8)

#### 5. Importing the Logistic Regression model and fitting it to the training data

In [118]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()


In [119]:
model.fit(X_train,y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

#### 6. Predicting the outcome variable by passing the test set's independent features

In [120]:
predicted_score = model.predict(X_test)

In [121]:
predicted_score

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,
       1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0], dtype=int64)

#### 7. Evaluating the model accuracy based on the predicted results

If we re-run the notebook, the accuracy will change, because the rows land in different train and test sets each time: the shuffle above is unseeded, and a different `random_state` in `train_test_split` would likewise give a different split. To reduce this dependence on a single split, we will use different types of **cross validation**.

In [122]:
model.score(X_test,y_test)

0.7619047619047619
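
To see how much a single train/test split can move the accuracy, the sketch below repeats the split with several `random_state` values. It uses a synthetic dataset from `make_classification` as a stand-in for the diabetes data (an assumption, so the numbers will not match the output above), and the helper name `accuracy_for_seed` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)


def accuracy_for_seed(seed):
    """Fit on one 70/30 split chosen by `seed` and return the test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)


# Same model, same data, different splits.
accuracies = {seed: accuracy_for_seed(seed) for seed in range(5)}
for seed, acc in accuracies.items():
    print(f"random_state={seed}: accuracy={acc:.4f}")
print("spread:", max(accuracies.values()) - min(accuracies.values()))
```

The spread between the best and worst split gives a rough idea of how unreliable a single accuracy figure can be.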

## K-Fold Cross Validation

**K-Fold Cross Validation** works as illustrated in the sample figure below.

Suppose we have 10,000 records in our dataset and we choose K = 10. The figure shows how the data is split into train and test sets across 10 iterations.

The size of each fold is 10000 / 10 = 1000, so each test split contains 1,000 records.

In the first iteration, the first block of 1,000 records is used as the test split and the remaining 9,000 records form the train split. In the second iteration, the next block of 1,000 records becomes the test split, and the first block together with all the blocks after the second form the train split. This continues until the 10th iteration, and the model's accuracy is recorded for every iteration.

From the 10 accuracy values we can report the mean as the overall model performance, or report the lowest and highest values as lower and upper bounds on the performance.
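
The fold arithmetic described above can be sketched in plain Python. The function name `k_fold_indices` is hypothetical, and the sketch assumes the number of records is divisible by K; scikit-learn's own `KFold` also handles the uneven case.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_size = n_records // k  # assumes n_records is divisible by k
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        test_idx = list(range(start, stop))
        # Everything before and after the test block is used for training.
        train_idx = list(range(0, start)) + list(range(stop, n_records))
        yield train_idx, test_idx


folds = list(k_fold_indices(10_000, 10))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(
        f"iteration {i}: test rows {test_idx[0]}-{test_idx[-1]}, "
        f"train size {len(train_idx)}"
    )
```

Every record appears in exactly one test split, so each record is used for evaluation once and for training K - 1 times.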

    

<img src = 'http://ai-ml-analytics.com/wp-content/uploads/2020/07/KFold.png'>

#### 1. Importing cross_val_score from model_selection and setting the number of folds (cv) to 15

In [123]:
from sklearn.model_selection import cross_val_score
score = cross_val_score(model, X,Y,cv=15)



#### 2. Scores for the 15 folds

In [124]:
score

array([0.82692308, 0.82692308, 0.73076923, 0.73076923, 0.76923077,
       0.80392157, 0.82352941, 0.68627451, 0.74509804, 0.7254902 ,
       0.74509804, 0.82352941, 0.74509804, 0.74      , 0.78      ])

#### 3. Mean of the scores

In [125]:
score.mean()

0.7668436400201107
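
Beyond the mean, the spread of the fold scores shows how stable the estimate is. The sketch below copies the 15 rounded scores printed above into a plain list (the variable name `fold_scores` is an assumption) and summarises them with the standard library.

```python
import statistics

# The 15 fold accuracies printed above, as rounded in the notebook output.
fold_scores = [
    0.82692308, 0.82692308, 0.73076923, 0.73076923, 0.76923077,
    0.80392157, 0.82352941, 0.68627451, 0.74509804, 0.7254902,
    0.74509804, 0.82352941, 0.74509804, 0.74, 0.78,
]

mean_acc = statistics.mean(fold_scores)
std_acc = statistics.stdev(fold_scores)  # sample standard deviation

print(f"mean accuracy: {mean_acc:.4f}")
print(f"std deviation: {std_acc:.4f}")
print(f"range: {min(fold_scores):.4f} to {max(fold_scores):.4f}")
```

Reporting the mean together with the standard deviation or the range is a common way to express the lower and upper bounds of model performance.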

## Stratified K-Fold Cross Validation

In [132]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
# List to store the accuracy score of each split
accuracy = []

# Create the StratifiedKFold splitter with 15 splits; with random_state=None
# and the default shuffle=False, the folds are taken in the existing row order
str_fied = StratifiedKFold(n_splits=15, random_state=None)

# Split the independent (X) and dependent (Y) features with StratifiedKFold
# and loop through the train (train_index) and test (test_index) indices
for train_index, test_index in str_fied.split(X, Y):
    # print("Train Index: ", train_index, "Test Index: ", test_index)

    # Retrieve the training (X0_train, Y0_train) and testing (X0_test, Y0_test)
    # sets by passing the index values to X and Y
    X0_train, X0_test = X.iloc[train_index], X.iloc[test_index]
    Y0_train, Y0_test = Y.iloc[train_index], Y.iloc[test_index]

    # Fit the model on the training set
    model.fit(X0_train, Y0_train)

    # Predict on the testing set
    predict = model.predict(X0_test)

    # Evaluate the accuracy for this split (true labels first, then predictions)
    score = accuracy_score(Y0_test, predict)
    accuracy.append(score)
print(accuracy)

[0.8269230769230769, 0.8269230769230769, 0.7307692307692307, 0.7307692307692307, 0.7692307692307693, 0.803921568627451, 0.8235294117647058, 0.6862745098039216, 0.7450980392156863, 0.7254901960784313, 0.7450980392156863, 0.8235294117647058, 0.7450980392156863, 0.74, 0.78]




In [133]:
np.mean(accuracy)

0.7668436400201107

In [134]:
max(accuracy)

0.8269230769230769

In [135]:
min(accuracy)

0.6862745098039216
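
What distinguishes `StratifiedKFold` from a plain `KFold` is that each test fold keeps roughly the same class ratio as the full dataset. The sketch below illustrates this on a hypothetical, deliberately ordered label vector (65 negatives followed by 35 positives, not the diabetes data); the helper name `positive_rates` is an assumption.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 65 negatives followed by 35 positives, deliberately ordered.
y_demo = np.array([0] * 65 + [1] * 35)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself


def positive_rates(splitter):
    """Return the fraction of positive labels in each test fold."""
    return [
        float(y_demo[test_idx].mean())
        for _, test_idx in splitter.split(X_demo, y_demo)
    ]


plain = positive_rates(KFold(n_splits=5))
stratified = positive_rates(StratifiedKFold(n_splits=5))
print("KFold:          ", [round(r, 2) for r in plain])
print("StratifiedKFold:", [round(r, 2) for r in stratified])
```

With ordered labels, the unshuffled `KFold` produces test folds that contain only one class, while the stratified folds all stay close to the overall 35% positive rate.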

### Conclusion:

Both approaches give exactly the same accuracy scores. This is expected: when `cross_val_score` is given an integer `cv` and a classifier, scikit-learn uses stratified K-fold splitting internally, so we can use either approach. The results can differ if the folds are built differently, for example with a plain `KFold` or with shuffling enabled. The estimate also depends on the **number of splits**. Here we used 15; a different value such as 10 may give a slightly different mean accuracy.
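
One way to check why the two approaches agree is to run them side by side. The sketch below relies on scikit-learn's documented behaviour that an integer `cv` with a classifier is treated as an unshuffled `StratifiedKFold`. It uses synthetic data from `make_classification` as a stand-in (an assumption), so the scores differ from those above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption), not the diabetes dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Approach 1: integer cv passed to cross_val_score.
auto_scores = cross_val_score(clf, X_demo, y_demo, cv=15)

# Approach 2: explicit StratifiedKFold loop.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=15).split(X_demo, y_demo):
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    predictions = clf.predict(X_demo[test_idx])
    manual_scores.append(accuracy_score(y_demo[test_idx], predictions))

print("identical fold scores:", np.allclose(auto_scores, manual_scores))
```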

## Application 2 - Select the best model between different models

Cross validation helps us evaluate which model is the best fit for predicting results. Let's see how we can use cross validation to choose the best model.

#### 1. Implementing the K-nearest neighbour classifier with 4 nearest neighbours and 10-fold cross validation

In [130]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
knnclassifier = KNeighborsClassifier(n_neighbors=4)
print(cross_val_score(knnclassifier, X, Y, cv=10, scoring='accuracy').mean())

0.7161995898838005


#### 2. Implementing Logistic Regression with cross validation value as 10

In [131]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
print (cross_val_score(logreg, X, Y, cv=10, scoring = 'accuracy').mean())

0.7668660287081339
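
On the diabetes data, logistic regression (about 0.767) scores higher than the K-nearest neighbour classifier (about 0.716), so it would be the preferred model here. The same comparison can be written as a loop over candidate models. The sketch below uses synthetic data from `make_classification` (an assumption, so its scores will differ), and the names `candidates`, `results` and `best_name` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), not the diabetes dataset.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

candidates = {
    "knn (k=4)": KNeighborsClassifier(n_neighbors=4),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Reuse one splitter so every model is scored on exactly the same folds.
folds = StratifiedKFold(n_splits=10)

results = {
    name: cross_val_score(est, X_demo, y_demo, cv=folds, scoring="accuracy").mean()
    for name, est in candidates.items()
}
best_name = max(results, key=results.get)

for name, acc in results.items():
    print(f"{name}: {acc:.4f}")
print("best:", best_name)
```

Scoring every candidate on identical folds keeps the comparison fair, since differences in the mean then come from the models rather than from the split.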


