## Cross Validation
Holdout sets are a good starting point for model validation. However, a single train/test split is often not enough, because the resulting score depends on which samples happen to land in the test set. Cross-validation is widely regarded as the gold standard for validating model performance and is almost always used when tuning model hyperparameters.

In [1]:
# Load datasets and packages
from sklearn.datasets import load_diabetes 
import numpy as np
datasets = load_diabetes()
X, y = datasets['data'], datasets['target']

In [2]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [3]:
# Fit a linear regression model to the training data
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train,y_train)
y_pred = reg.predict(X_test)
score = reg.score(X_test, y_test)
# For a regressor, score() returns the R^2 coefficient of determination
print(r'Score of the model is : {}'.format(score))

Score of the model is : 0.4526027629719195
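
The score above comes from one particular split. As a rough illustration of why a single holdout split can be unreliable, the sketch below repeats the same 80/20 split with a few different `random_state` values and compares the resulting $R^2$ scores. The seed list and the names `seeds` and `holdout_scores` are arbitrary choices made for this illustration, not part of the exercise.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Arbitrary seeds, chosen only for illustration
seeds = [0, 1, 2, 3, 42]
holdout_scores = []
for seed in seeds:
    # Same 80/20 split as above, but with a different shuffle each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))  # R^2 on the held-out part

for seed, s in zip(seeds, holdout_scores):
    print('random_state={:>2}: R^2 = {:.4f}'.format(seed, s))
print('Spread (max - min): {:.4f}'.format(max(holdout_scores) - min(holdout_scores)))
```

If the spread is large relative to the differences between the models being compared, a single holdout score is not a dependable basis for choosing among them.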


We have built a linear model using a single training set and test set. Now we apply k-fold cross-validation to the training set.

In [5]:
k = 5 # 5-fold cross-validation
n_samples = len(X_train)
fold_size = n_samples // k
scores = []
masks = []
models = []
for fold in range(k):
    reg = LinearRegression()
    
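    # generate a boolean mask for the validation set in this fold
    # (note: because fold_size uses integer division, the last
    # n_samples % k samples never fall into a validation fold)
    val_mask = np.zeros(n_samples, dtype=bool)
    val_mask[fold*fold_size:(fold+1)*fold_size] = True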
    
    # store the mask for visualization
    masks.append(val_mask)
    
    # create training and validation sets using this mask
    X_val, y_val = X_train[val_mask], y_train[val_mask]
    X_trn, y_trn = X_train[~val_mask], y_train[~val_mask]
    
    # Tip: fit the regressor (see the previous example)
    reg.fit(X_trn, y_trn)
    # y_pred = reg.predict(X_val)
    # Tip: compute the validation score and record it
    score = reg.score(X_val, y_val)
    scores.append(score)
    models.append(reg)
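
For comparison, the manual loop above can be approximated with scikit-learn's built-in helpers. This is a minimal sketch, not a drop-in replacement: `KFold` without shuffling also uses contiguous folds, but it assigns the `n_samples % n_splits` leftover samples to the first folds instead of dropping them, so the per-fold scores may differ slightly from the manual version. The names `cv` and `cv_scores` are introduced here for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Contiguous, unshuffled folds, similar to the boolean masks in the manual loop
cv = KFold(n_splits=5, shuffle=False)

# With no `scoring` argument, the estimator's own score() is used (R^2 here)
cv_scores = cross_val_score(LinearRegression(), X_train, y_train, cv=cv)

print('Per-fold R^2:', cv_scores)
print('Mean R^2: {:.4f}'.format(cv_scores.mean()))
```

Note that `cross_val_score` returns only the scores. If the fitted models are needed afterwards, as in the steps below, the manual loop (or `cross_validate` with `return_estimator=True`) is the better fit.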

1. Use each of the k models to predict $y_{test}$ separately, and calculate the mean of their scores

In [6]:
y_preds = []
score_tests = []

for fold in range(k):
    reg = models[fold]
    # Tip: get the scores and predictions (see the previous examples)
    score_test = reg.score(X_test, y_test)
    y_pred = reg.predict(X_test)
    score_tests.append(score_test)
    y_preds.append(y_pred)
    
# Tip: calculate the mean score
score = np.mean(score_tests)
print('Mean score of the {} models is : {}'.format(k, score))

Mean score of the 5 models is : 0.449375724169015


2. Average the predictions of the k models into a single ensemble prediction for $y_{test}$, and calculate its score

In [7]:
y_preds = np.stack(y_preds, axis=0)
# Tip: average the predictions across the k models
y_pred = np.mean(y_preds, axis=0)

from sklearn.metrics import r2_score
score = r2_score(y_test, y_pred)
print('Score of the ensembled model by {} fold is : {}'.format(k, score))

Score of the ensembled model by 5 fold is : 0.4550976834912759


#### Question: There are two scores: (1) the mean of the scores and (2) the score of the mean prediction. Which score is correct? Why?


* The former is correct because it represents the average performance of the models from all folds. The "mean prediction", by contrast, is only a theoretical construct: it does not correspond to any single model that was actually fitted.
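
To see why the two scores differ at all, note that $R^2$ is not linear in the predictions. Squared error is convex, so on a fixed test set the mean squared error of an averaged prediction is never larger than the average of the individual mean squared errors; the score of the mean prediction is therefore at least as high as the mean of the scores. The toy sketch below illustrates this with two hypothetical prediction vectors (`pred_a` and `pred_b` are made-up values, not outputs of the models above).

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Two hypothetical models whose errors partly cancel each other out
pred_a = y_true + np.array([0.5, -0.5, 0.5, -0.5, 0.5])
pred_b = y_true + np.array([-0.5, 0.5, 0.5, 0.5, -0.5])

# (1) mean of the individual scores
mean_of_scores = np.mean([r2_score(y_true, pred_a), r2_score(y_true, pred_b)])

# (2) score of the averaged prediction
score_of_mean = r2_score(y_true, (pred_a + pred_b) / 2)

print('Mean of the scores:           {:.3f}'.format(mean_of_scores))
print('Score of the mean prediction: {:.3f}'.format(score_of_mean))
```

The gap reflects error cancellation between the models, which is the same effect that makes the ensembled score above slightly higher than the mean score.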