# Model Validation

In [1]:
import pandas
iris = pandas.read_csv('../Datasets/iris.csv')
iris.head()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [2]:
from sklearn.linear_model import LinearRegression

X = iris[['PetalWidth']]
y = iris['PetalLength']
model = LinearRegression()
model.fit(X,y)
print(model.score(X,y))

0.9271098389904927


This score is not a good way to assess the model, because the same X and y were used both to fit the model and to test it.

Testing data should be different from training data.

The only data we have is in X and y.  So, we'll need to split this data into training and testing data.

In [3]:
from sklearn.model_selection import train_test_split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1)
model.fit(X_train, y_train)
print(model.score(X_test,y_test))

0.9344965632804348


Each time this cell is run, it uses the same dataset (X and y).  The data is just split in a different way, so the score changes from run to run.

Each run *validates* the model, but a single run is biased because its score depends on one specific split.

To assess the model, we must validate it across multiple splits and average the scores.

This is called *cross validation*.

In [33]:
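def my_validate(X, y, model, n):
    """Average the test score over n random 90/10 train/test splits."""
    scores = []
    for _ in range(n):
        # A fresh random split on every pass: 10% of the rows are held out for testing.
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
        model.fit(X_train, y_train)
        # For LinearRegression, score() returns R-squared on the held-out rows.
        scores.append(model.score(X_test, y_test))
    return sum(scores) / len(scores)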

In [36]:
my_validate(X,y,model,20)

0.9311893482205884

In [37]:
my_validate(X,y,model,100)

0.9199013406971726

# Sklearn

In [7]:
from sklearn.model_selection import cross_validate
iris = pandas.read_csv('../Datasets/iris.csv')
iris = iris.sample(150)  # sampling all 150 rows without replacement shuffles the row order

In [8]:
X = iris[['PetalWidth']]
y = iris['PetalLength']

In [9]:
result = cross_validate(model,X,y)
result['test_score'].mean()

0.9223323606704591

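Without the shuffling step (`iris.sample(150)`), R-squared can come out negative, meaning the model does worse than the baseline of always predicting the mean.  Why?  By default, `cross_validate` splits the rows into consecutive folds without shuffling, so if the rows are ordered (for example, grouped by species), a test fold can look very different from the data the model was trained on.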
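
A negative R-squared means the predictions have more squared error than always predicting the mean of the test values. Here is a minimal sketch of the usual definition, R-squared = 1 - SS_res / SS_tot, in plain Python. The helper name `r_squared` and the numbers are made up for illustration; they are not the iris results.

```python
def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot (hypothetical helper, plain Python)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model's squared error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # baseline's squared error
    return 1 - ss_res / ss_tot

# Made-up test fold whose values sit close together.
y_true = [1.3, 1.4, 1.5, 1.4]

baseline = [sum(y_true) / len(y_true)] * len(y_true)  # always predict the mean
off_target = [1.9, 2.0, 2.1, 2.0]                      # predictions shifted away from the data

print(r_squared(y_true, baseline))    # zero: this is the baseline itself
print(r_squared(y_true, off_target))  # negative: worse than the baseline
```

When the test values vary little, SS_tot is small, so even a modest systematic error in the predictions can push the score far below zero.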

## Random Shuffling

In [71]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [69]:
from sklearn.model_selection import cross_validate, ShuffleSplit

In [10]:
iris = pandas.read_csv('../Datasets/iris.csv')
X = iris[['PetalWidth']]
y = iris['PetalLength']

In [82]:
validator = ShuffleSplit(n_splits=100)
result = cross_validate(model, X, y, cv=validator)
print(result['test_score'].mean().round(2))


0.91


ShuffleSplit randomly partitions the data into training and testing sets, independently for each split.

Because every split is drawn at random, it's possible that some data points are never used for training, and that some are never used for testing.

To get an accurate answer, we need to do many iterations (n_splits should be large).
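
To get a feel for how many points can be left out of testing, the sketch below mimics the idea in plain Python rather than calling ShuffleSplit itself. The helper `never_tested`, the seed, and the 10% test fraction are assumptions chosen for illustration.

```python
import random

def never_tested(n_points, n_splits, test_fraction=0.1, seed=0):
    """Count points that never land in a test set over repeated random splits.

    Hypothetical helper that imitates independent random splits in plain Python.
    """
    rng = random.Random(seed)
    test_size = int(n_points * test_fraction)
    tested = set()
    for _ in range(n_splits):
        # Each split draws its test rows independently of the previous splits.
        tested.update(rng.sample(range(n_points), test_size))
    return n_points - len(tested)

few = never_tested(150, n_splits=10)
many = never_tested(150, n_splits=100)
print(few, many)  # number of points never tested with 10 splits vs. 100 splits
```

With only a handful of splits, a sizeable share of the 150 points never reaches a test set; with many splits that share shrinks toward zero.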


## K-Fold cross validation

In [88]:
from sklearn.model_selection import cross_validate, KFold
validator = KFold(n_splits=10, shuffle=True)
result = cross_validate(model, X, y, cv=validator)
print(result['test_score'].mean().round(2))

0.92


How does k-Fold cross validation work?

This picture shows 4-fold cross validation, which runs 4 iterations.
<img src="https://upload.wikimedia.org/wikipedia/commons/b/b5/K-fold_cross_validation_EN.svg">

Every data point is used for testing exactly once.

Every data point is used for training in the other k-1 iterations.

There are only k iterations.
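
The mechanics can be sketched in a few lines of plain Python. This is an illustration of the idea, not sklearn's implementation; `k_fold_indices` is a hypothetical helper, and it splits the rows in order without shuffling.

```python
def k_fold_indices(n_points, k):
    """Yield (train, test) index lists for k consecutive folds (no shuffling)."""
    # Spread the remainder so fold sizes differ by at most one.
    sizes = [n_points // k + (1 if i < n_points % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        # Everything outside the current test fold is used for training.
        train = [i for i in range(n_points) if i < start or i >= start + size]
        yield train, test
        start += size

for train, test in k_fold_indices(8, 4):
    print("test:", test, "train:", train)
```

Each index appears in exactly one test fold and in the training set of every other iteration.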

## Stratified K-Fold

If y is skewed, we want stratified k-fold.  It tries to build the folds in a way that respects the distribution of y.

Stratification works on class labels, and here y is continuous, so we can't apply it directly.
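
One workaround sometimes used for a continuous target, and not part of this notebook, is to sort the rows by y and deal them out to the folds in turn, so that each fold covers the full range of y. The sketch below does this in plain Python. `sorted_stratified_folds` is a hypothetical helper, not an sklearn function, and the y values are made up.

```python
def sorted_stratified_folds(y, k):
    """Assign each row to one of k folds so every fold spans the range of y."""
    order = sorted(range(len(y)), key=lambda i: y[i])  # row indices, smallest y first
    folds = [[] for _ in range(k)]
    for rank, idx in enumerate(order):
        folds[rank % k].append(idx)  # deal rows out like cards
    return folds

# Made-up skewed target: mostly small values, a few large ones.
y = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 4.8, 5.1, 6.0]
for fold in sorted_stratified_folds(y, 3):
    print([y[i] for i in fold])
```

Each list in the result holds the test indices of one fold; the remaining rows would form the training set for that iteration.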