# Tutorial 11

## Cross-Validation
Cross-validation is a technique for evaluating ML models by training several models on subsets of the available input data and evaluating each one on the complementary subset. Use cross-validation to detect overfitting, i.e., a model that fits its training data well but fails to generalize the pattern to unseen data.
![Cross_validation](images/Cross_validation.png "Cross_validation")

The diagram above shows an example of the training subsets and complementary evaluation subsets generated for each of the four models that are created and trained during a 4-fold cross-validation. Model one uses the first 25 percent of the data for evaluation and the remaining 75 percent for training. Model two uses the second 25 percent (from 25 percent to 50 percent) for evaluation and the remaining three subsets for training, and so on.
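
The fold layout described here can be reproduced on a toy array. This is a minimal sketch, assuming scikit-learn's `KFold` with `shuffle=False` so that the folds stay contiguous; the 8-sample array `X_toy` is a hypothetical stand-in for real data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 8 samples, so each of the 4 folds holds out
# 2 samples (25 percent) for evaluation.
X_toy = np.arange(8).reshape(-1, 1)

# shuffle=False keeps every evaluation fold contiguous.
kf = KFold(n_splits=4, shuffle=False)

folds = []
for i, (train_idx, eval_idx) in enumerate(kf.split(X_toy), start=1):
    folds.append((train_idx.tolist(), eval_idx.tolist()))
    print(f"Model {i}: evaluate on {eval_idx.tolist()}, train on {train_idx.tolist()}")
```

Each sample lands in exactly one evaluation fold, so every row is used for evaluation once and for training three times.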

#### Assuming you know the basics, here is a quick tutorial on performing cross-validation

In [1]:
import ipywidgets as widgets
from IPython.display import display

style = {'description_width': 'initial'}
# Import the essential libraries
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.datasets import load_iris
iris = load_iris()

# np.c_ concatenates arrays column-wise (along the second axis);
# here it appends iris['target'] to iris['data'] as an extra column.
# The columns argument is the iris['feature_names'] list plus a
# one-element list naming the label column; you can call it anything you like.
# The original dataset would probably call this ['Species'].
dataset = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
dataset.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |


In [3]:
print(f"Dataset has {dataset.shape[0]} rows and {dataset.shape[1]} columns.")

Dataset has 150 rows and 5 columns.


In [4]:
# Separate the independent and dependent variables
X = dataset.iloc[:, :-1].values  # independent variables: the four sepal and petal measurements
y = dataset.iloc[:, -1].values   # dependent variable: the species label (0, 1, or 2)

print("\nIndependent Variable (Sepal and Petal Attributes):\n\n", X[:5])
print("\nDependent Variable (Species):\n\n", y[:5])


Independent Variable (Sepal and Petal Attributes):

 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

Dependent Variable (Species):

 [0. 0. 0. 0. 0.]


## Cross-validation with scikit-learn

#### 1. Using cross_validate

In [29]:
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC
svc = SVC()
# Single metric: with no scoring argument, cross_validate uses the
# estimator's default scorer, which is accuracy for SVC
cv_results = cross_validate(svc, X, y, cv=5)
pd.DataFrame({"test_score": cv_results["test_score"]})

|   | test_score |
|---|---|
| 0 | 0.966667 |
| 1 | 0.966667 |
| 2 | 0.966667 |
| 3 | 0.933333 |
| 4 | 1.0 |
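
Because cross-validation is used here to detect overfitting, it can help to compare training and evaluation scores for each fold. The following is a minimal sketch, assuming the same iris data and a default `SVC`; the names `X_demo`, `y_demo`, and `cv_demo` are introduced only for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Reload the data so that this snippet is self-contained.
X_demo, y_demo = load_iris(return_X_y=True)

# return_train_score=True adds a "train_score" entry next to "test_score".
cv_demo = cross_validate(SVC(), X_demo, y_demo, cv=5, return_train_score=True)

train_mean = cv_demo["train_score"].mean()
test_mean = cv_demo["test_score"].mean()

print(f"mean train accuracy: {train_mean:.3f}")
print(f"mean test accuracy:  {test_mean:.3f}")
# A training score far above the test score hints at overfitting.
print(f"gap (train - test):  {train_mean - test_mean:.3f}")
```

A small gap suggests the model generalizes about as well as it fits; how large a gap is acceptable depends on the problem.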


In [26]:
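# Multiple metrics: pass a tuple of scorer names to `scoring`.
# Note: 'r2' and 'neg_mean_squared_error' are regression metrics. They run here
# only because the class labels are numeric, so read the values as an API demo.
scores = cross_validate(svc, X, y, cv=5,
                        scoring=('r2', 'neg_mean_squared_error'),
                        return_train_score=False)
pd.DataFrame({"test_r2": scores["test_r2"],
              "test_neg_mean_squared_error": scores["test_neg_mean_squared_error"]})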

|   | test_r2 | test_neg_mean_squared_error |
|---|---|---|
| 0 | 0.95 | -0.033333 |
| 1 | 0.95 | -0.033333 |
| 2 | 0.95 | -0.033333 |
| 3 | 0.9 | -0.066667 |
| 4 | 1.0 | -0.0 |
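
Since `SVC` is a classifier, classification scorers are usually more informative than regression ones. The sketch below shows the same multi-metric call with `'accuracy'` and `'f1_macro'`; the choice of these two scorers and the names `X_clf`, `y_clf`, and `summary` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X_clf, y_clf = load_iris(return_X_y=True)

# Each scorer name yields a "test_<name>" key in the result dictionary.
clf_scores = cross_validate(SVC(), X_clf, y_clf, cv=5,
                            scoring=("accuracy", "f1_macro"))

summary = pd.DataFrame({
    "test_accuracy": clf_scores["test_accuracy"],
    "test_f1_macro": clf_scores["test_f1_macro"],
})
print(summary)
```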


#### 2. Using a for loop (recommended when you need full control over each fold)

StratifiedKFold is recommended when the target classes are imbalanced, because it preserves the class proportions in every fold.
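
To make the effect of stratification concrete, the sketch below counts the classes in each evaluation fold for both splitters. It assumes a hypothetical imbalanced label vector (90 samples of class 0 and 10 of class 1) rather than the balanced iris data, and the helper `class_counts_per_fold` is introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0 and 10 of class 1.
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((len(y_imb), 1))  # feature values do not affect the split

def class_counts_per_fold(splitter):
    """Return [count of class 0, count of class 1] for each evaluation fold."""
    return [
        np.bincount(y_imb[eval_idx], minlength=2).tolist()
        for _, eval_idx in splitter.split(X_imb, y_imb)
    ]

stratified_counts = class_counts_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)
plain_counts = class_counts_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=0)
)

print("StratifiedKFold:", stratified_counts)
print("KFold:          ", plain_counts)
```

With StratifiedKFold every evaluation fold keeps the 9:1 class ratio, while the minority count under plain KFold can vary from fold to fold.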

#### --> StratifiedKFold (each fold preserves the class distribution of the known target variable)

In [45]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
splits = 5
# shuffle=True with a fixed random_state gives shuffled but reproducible folds
kfold, scores = StratifiedKFold(n_splits=splits, shuffle=True, random_state=27), []
for train_, test_ in kfold.split(X, y):  # y is required so the folds can be stratified
    x_train, x_test = X[train_], X[test_]
    y_train, y_test = y[train_], y[test_]
    svc = SVC()  # fresh, untrained model for every fold
    svc.fit(x_train, y_train)
    preds = svc.predict(x_test)
    score = accuracy_score(y_test, preds)
    print(score)
    scores.append(score)
print("Average score:", sum(scores)/len(scores))

1.0
1.0
0.9333333333333333
1.0
0.9
Average score: 0.9666666666666668


#### --> KFold (the class distribution of the target is not preserved here)

In [46]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
splits = 5
# KFold ignores y when splitting; shuffle=True matters here because the
# iris rows are sorted by class
kfold, scores = KFold(n_splits=splits, shuffle=True, random_state=27), []
for train_, test_ in kfold.split(X):
    x_train, x_test = X[train_], X[test_]
    y_train, y_test = y[train_], y[test_]
    svc = SVC()  # fresh, untrained model for every fold
    svc.fit(x_train, y_train)
    preds = svc.predict(x_test)
    score = accuracy_score(y_test, preds)
    print(score)
    scores.append(score)
print("Average score:", sum(scores)/len(scores))

0.9333333333333333
0.9666666666666667
0.9666666666666667
1.0
0.9333333333333333
Average score: 0.96
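
When the loop body only fits a model and records one score, the same evaluation can be written more compactly by passing a splitter object to `cross_val_score`. This is a sketch under the assumption that accuracy is the only metric of interest; the names `X_cmp`, `y_cmp`, and `fold_scores` are hypothetical, and the exact scores may differ with the library version.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X_cmp, y_cmp = load_iris(return_X_y=True)

# The splitter controls how folds are built; cross_val_score clones the
# estimator for each fold, so no fitted state leaks between folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=27)
fold_scores = cross_val_score(SVC(), X_cmp, y_cmp, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", fold_scores)
print(f"Mean: {fold_scores.mean():.4f}, std: {fold_scores.std():.4f}")
```

Reporting the standard deviation alongside the mean shows how much the score depends on the particular split. The explicit loop remains the better fit when each fold needs custom steps, such as extra logging or per-fold preprocessing.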
