In [1]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
run src/preprocessing.py

## Model Selection: Cross-Validation

In the next phase of this project we move into developing our machine learning models. We have previously discussed model selection and considered managing the Bias-Variance Tradeoff as we fit our predictive model. We primarily focused on identifying the simplest possible model as a way of making sure that our model generalizes to new data. Now we expand on this by examining three new concepts in model assessment and selection.

1. using cross-validation to study model variance
1. applying regularization to help our models generalize
1. using ensembling to help our models generalize

One commonly held misconception is that cross-validation helps models to generalize. This is not the case. Rather, cross-validation can be used to identify potential issues and to optimize model hyperparameters, toward the end of choosing the best possible model.
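As a minimal sketch of using cross-validation to tune a hyperparameter, the example below searches over the `alpha` of a `Ridge` model with scikit-learn's `GridSearchCV`. The synthetic data from `make_regression` and the particular grid of `alpha` values are assumptions made for illustration; they are not the housing data or settings used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the housing data (an assumption for illustration).
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Candidate regularization strengths; this grid is arbitrary.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# Each candidate is scored by 5-fold cross-validation on the training data only.
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X_train, y_train)

best_alpha = search.best_params_["alpha"]
# The held-out test set is used once, after the hyperparameter is chosen.
test_r2 = search.score(X_test, y_test)

print("best alpha:", best_alpha)
print("held-out R^2:", test_r2)
```

Note that cross-validation only selects among the candidates here. Any improvement in generalization comes from the regularization itself.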

#### The Validation Set Approach

Cross-validation is a resampling technique: it makes creative use of data that has already been collected. We have already seen a very simple form of it, the train-test split, also called the Validation Set Approach.

![](doc/img/Chapter5/5-1.png)

In [3]:
from time import time
from sklearn.model_selection import train_test_split

In [4]:
zoning_df = pd.read_csv('data/zoning.csv')
listing_df = pd.read_csv('data/listing.csv')
sale_df = pd.read_csv('data/sale.csv')

housing_df = pd.merge(zoning_df, listing_df, left_on="Id", right_on="Id")
housing_df = pd.merge(housing_df, sale_df, left_on="Id", right_on="Id")
housing_df["SalePrice"]

0       208500
1       181500
2       223500
3       140000
4       250000
5       143000
6       307000
7       200000
8       129900
9       118000
10      129500
11      345000
12      144000
13      279500
14      157000
15      132000
16      149000
17       90000
18      159000
19      139000
20      325300
21      139400
22      230000
23      129900
24      154000
25      256300
26      134800
27      306000
28      207500
29       68500
         ...  
1430    192140
1431    143750
1432     64500
1433    186500
1434    160000
1435    174000
1436    120500
1437    394617
1438    149700
1439    197000
1440    191000
1441    149300
1442    310000
1443    121000
1444    179600
1445    129000
1446    157900
1447    240000
1448    112000
1449     92000
1450    136000
1451    287090
1452    145000
1453     84500
1454    185000
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

In [5]:
target_log_std_sc_out_rem_df

Id
1       208500
2       181500
3       223500
4       140000
5       250000
6       143000
7       307000
8       200000
9       129900
10      118000
11      129500
12      345000
13      144000
14      279500
15      157000
16      132000
17      149000
18       90000
19      159000
20      139000
21      325300
22      139400
23      230000
24      129900
25      154000
26      256300
27      134800
28      306000
29      207500
30       68500
         ...  
1431    192140
1432    143750
1433     64500
1434    186500
1435    160000
1436    174000
1437    120500
1438    394617
1439    149700
1440    197000
1441    191000
1442    149300
1443    310000
1444    121000
1445    179600
1446    129000
1447    157900
1448    240000
1449    112000
1450     92000
1451    136000
1452    287090
1453    145000
1454     84500
1455    185000
1456    175000
1457    210000
1458    266500
1459    142125
1460    147500
Name: SalePrice, Length: 1444, dtype: int64

In [6]:
(dataset_1.shape,
 dataset_2.shape,
 dataset_3.shape,
 dataset_4.shape)

((1444, 382), (1444, 390), (1444, 382), (1444, 390))

In [7]:
np.testing.assert_allclose(dataset_1.index, target_1.index)
np.testing.assert_allclose(dataset_2.index, target_2.index)
np.testing.assert_allclose(dataset_3.index, target_3.index)
np.testing.assert_allclose(dataset_4.index, target_4.index)

In [8]:
ttsplit_1 = train_test_split(dataset_1, target_1, test_size=0.4, random_state=0)
ttsplit_2 = train_test_split(dataset_2, target_1, test_size=0.4, random_state=0)
ttsplit_3 = train_test_split(dataset_3, target_1, test_size=0.4, random_state=0)
ttsplit_4 = train_test_split(dataset_4, target_1, test_size=0.4, random_state=0)

In [9]:
#ttsplit_1[1]

In [10]:
def fit_score(model, data):
    # data is the 4-tuple returned by train_test_split
    X_train, X_test, y_train, y_test = data

    start = time()
    model.fit(X_train, y_train)
    elapsed = time() - start  # fit time in seconds
    # returns (R^2 on the test set, fit time)
    return model.score(X_test, y_test), elapsed

In [11]:
from sklearn.linear_model import Lasso, Ridge

In [12]:
print(fit_score(Ridge(max_iter=1E5), ttsplit_1))
print(fit_score(Ridge(max_iter=1E5), ttsplit_2))
print(fit_score(Ridge(max_iter=1E5), ttsplit_3))
print(fit_score(Ridge(max_iter=1E5), ttsplit_4))

(0.89860862751808546, 0.03981661796569824)
(0.89858141337740194, 0.012806177139282227)
(0.89924977700919972, 0.011998176574707031)
(0.89931145722760453, 0.012408018112182617)


In [13]:
print(fit_score(Lasso(max_iter=1E4), ttsplit_1))
print(fit_score(Lasso(max_iter=1E5), ttsplit_2))
print(fit_score(Lasso(max_iter=1E4), ttsplit_3))
print(fit_score(Lasso(max_iter=1E5), ttsplit_4))

(0.87587594870369745, 1.19460129737854)
(0.87587068045006999, 7.579938888549805)
(0.87344492815283614, 0.810370922088623)
(0.87345736287380238, 4.375954627990723)


#### Leave-One-Out Cross-Validation

An alternative to using a single validation set is using **leave-one-out cross-validation** (LOOCV). 

![](doc/img/Chapter5/5-3.png)

Here, instead of creating two sets, we create $n$ train-test splits and fit $n$ models. Using this method, each data point is used as a testing point exactly once. To assess the performance we simply take the average over all models:

$$\text{CV}_n=\mathbb{E}\left[MSE(f_i)\right]$$

One drawback to this approach is the substantial time required to fit a model for each data point.

In [14]:
from sklearn.model_selection import LeaveOneOut

In [15]:
def fit_score_loo(model, dataset, target):
    loo = LeaveOneOut()
    scores = []
    for train, test in loo.split(dataset, target):
        train = dataset.index[train]
        test = dataset.index[test]

        X_train = dataset.loc[train]
        X_test  = dataset.loc[test]
        y_train = target.loc[train]
        y_test  = target.loc[test]
    
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    
    scores = np.array(scores)
    print("Mean: {} Variance: {}".format(scores.mean(), scores.var()))

In [16]:
#length 1444 scores after calculation for dataset_1

In [27]:
print(fit_score_loo(Ridge(), dataset_1, target_1))
# print(fit_score_loo(Ridge(), dataset_2, target_2))
# print(fit_score_loo(Ridge(), dataset_3, target_3))
# print(fit_score_loo(Ridge(), dataset_4, target_4))

Mean: 0.0 Variance: 0.0
None
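
The mean and variance of 0.0 above are most likely an artifact of the metric rather than of the model. For regressors, `model.score` returns $R^2$, which is not well defined on a single held-out point because the test targets have zero variance. Depending on the scikit-learn version this may show up as 0.0 or as NaN with a warning. The sketch below instead scores each left-out point by its squared error, which matches the $\text{CV}_n$ formula. The synthetic data is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic data set (an assumption), so that n fits stay cheap.
X, y = make_regression(n_samples=50, n_features=5, noise=5.0, random_state=0)

# One fit per data point. scikit-learn reports negated MSE so that
# "higher is better" holds for every scorer.
neg_mse = cross_val_score(
    Ridge(), X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
)
squared_errors = -neg_mse

# CV_n is the average of the n single-point squared errors.
cv_n = squared_errors.mean()

print("number of fits:", len(squared_errors))
print("CV_n (mean squared error):", cv_n)
```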


In [18]:
# print(fit_score_loo(Lasso(), dataset_1, target_1))
# print(fit_score_loo(Lasso(), dataset_2, target_2))
# print(fit_score_loo(Lasso(), dataset_3, target_3))
# print(fit_score_loo(Lasso(), dataset_4, target_4))

#### K-Fold Cross-Validation

It is usually not practical to use LOOCV. An acceptable alternative is to use **k-fold cross-validation** (KCV). In this method the data set is split into $k$ groups. Then, $k$ models are fit. Each model uses exactly one of the groups as a validation set and the remaining data as the training set. As before, the cross-validation score is simply the average of the scores across all of the models:

$$\text{CV}_k=\mathbb{E}\left[MSE(f_i)\right]$$

![](doc/img/Chapter5/5-5.png)

Typical values of $k$ are $k=5$ or $k=10$.
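
The procedure above can also be sketched with scikit-learn's `cross_val_score`, which runs the fit-and-score loop for a given splitter. The synthetic data is an assumption for illustration. By default the score for a regressor is $R^2$; passing `scoring="neg_mean_squared_error"` would match the MSE-based formula instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration).
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

kf = KFold(n_splits=5)

# Each of the k groups serves as the validation set exactly once.
fold_sizes = [len(test_idx) for _, test_idx in kf.split(X)]

# One score per fold; the default scorer for a regressor is R^2.
scores = cross_val_score(Ridge(), X, y, cv=kf)

print("validation fold sizes:", fold_sizes)
print("mean R^2:", scores.mean(), "variance:", scores.var())
```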

In [19]:
from sklearn.model_selection import KFold

In [20]:
def fit_score_kfold(model, dataset, target, folds=5):
    kf = KFold(n_splits=folds)
    scores = []
    start = time()
    for train, test in kf.split(dataset, target):
        train = dataset.index[train]
        test = dataset.index[test]

        X_train = dataset.loc[train]
        X_test  = dataset.loc[test]
        y_train = target.loc[train]
        y_test  = target.loc[test]
    
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    
    scores = np.array(scores)
    end = time() - start 

    print("Mean: {:6} Variance: {:6} Time: {:6}".format(scores.mean(), scores.var(), end))

In [21]:
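# Understanding the data: inspect the splits produced by KFold
kf = KFold(n_splits=5)
for train, test in kf.split(dataset_1, target_1):
    # convert positional indices to index labels
    train = dataset_1.index[train]
    test = dataset_1.index[test]
    
    X_train = dataset_1.loc[train]
    X_test  = dataset_1.loc[test]
    # targets come from target_1, not from the feature matrix
    y_train = target_1.loc[train]
    y_test  = target_1.loc[test]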

In [22]:
kftest = kf.split(dataset_1, target_1)
dftest = pd.DataFrame(list(kftest))
dftest

Unnamed: 0,0,1
0,"[289, 290, 291, 292, 293, 294, 295, 296, 297, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,..."
1,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[289, 290, 291, 292, 293, 294, 295, 296, 297, ..."
2,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[578, 579, 580, 581, 582, 583, 584, 585, 586, ..."
3,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[867, 868, 869, 870, 871, 872, 873, 874, 875, ..."
4,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[1156, 1157, 1158, 1159, 1160, 1161, 1162, 116..."


In [23]:
len(test)

288

In [24]:
#length of 5 scores: array([ 0.99847548,  0.99813734,  0.99802722,  0.99837401,  0.99761703])

In [25]:
fit_score_kfold(Ridge(), dataset_1, target_1)
fit_score_kfold(Ridge(), dataset_2, target_2)
fit_score_kfold(Ridge(), dataset_3, target_3)
fit_score_kfold(Ridge(), dataset_4, target_4)

Mean: 0.8853121172449192 Variance: 0.00022291442467840236 Time: 0.1361231803894043
Mean: 0.8853098855630306 Variance: 0.0002225420260587801 Time: 0.10610556602478027
Mean: 0.8855857104448723 Variance: 0.0002238401520438682 Time: 0.10524654388427734
Mean: 0.8855861191098917 Variance: 0.00022437950348525468 Time: 0.10732865333557129


In [26]:
fit_score_kfold(Lasso(), dataset_1, target_1)
fit_score_kfold(Lasso(), dataset_2, target_2)
fit_score_kfold(Lasso(), dataset_3, target_3)
fit_score_kfold(Lasso(), dataset_4, target_4)



Mean: 0.8705352790807519 Variance: 0.0003678074887980236 Time: 2.3268306255340576
Mean: 0.8705030078286796 Variance: 0.00037058614194948483 Time: 2.3912527561187744
Mean: 0.8676419438563763 Variance: 0.00021040580637601944 Time: 2.2835559844970703
Mean: 0.867624853279915 Variance: 0.0002079946079796495 Time: 2.022021770477295


### Bias-Variance Trade-Off for k-Fold Cross-Validation

In terms of bias, it is clear that LOOCV will have lower bias than KCV when $k < n$. This is because each LOOCV model is trained using $n-1$ points, which is nearly all of the training data. Since each KCV model is trained on less of the data, it has less ability to learn the phenomenon represented by the data and is therefore more biased than LOOCV.

On the other hand, LOOCV has more variance than KCV. This is because LOOCV involves the fitting and then averaging of performance of $n$ models, whereas KCV does this over $k$ models. Furthermore, the $n$ LOOCV models are more correlated with each other than are the $k$ KCV models. This is clear because each LOOCV model is identical to any other LOOCV model save for one point. Meanwhile each KCV model differs from any other KCV model in $n/k$ points. It can be shown that the mean of highly correlated quantities has higher variance than does the mean of quantities that are not as highly correlated. In other words, LOOCV has higher variance than does KCV.
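
The claim about the mean of correlated quantities can be made concrete under an idealized assumption: suppose $m$ quantities $Z_1,\dots,Z_m$ share a common variance $\sigma^2$ and a common pairwise correlation $\rho$. The symbols $Z_i$, $m$, $\sigma^2$ and $\rho$ are introduced here for illustration only. Then

$$\operatorname{Var}\left(\frac{1}{m}\sum_{i=1}^{m} Z_i\right)=\frac{1}{m^2}\left(m\sigma^2+m(m-1)\rho\sigma^2\right)=\rho\sigma^2+\frac{1-\rho}{m}\sigma^2$$

When $\rho$ is close to 0, the variance of the mean shrinks like $\sigma^2/m$. When $\rho$ is close to 1, it stays near $\sigma^2$ no matter how many quantities are averaged. Under this simplification, averaging over the many highly correlated LOOCV models reduces variance less than the larger number of models alone would suggest.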