# Model Selection and Evaluation
* 1 Cross-validation: evaluating estimator performance
* 2 Tuning the hyper-parameters of an estimator
* 3 Model evaluation: quantifying the quality of predictions
* 4 Model persistence
* 5 Validation curves: plotting scores to evaluate models

## 1 Cross-validation: evaluating estimator performance

In [24]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

# Load the data
iris = datasets.load_iris()
iris.data.shape, iris.target.shape

((150, 4), (150,))

In [25]:
# Split into a training set (60%) and a test set (40%)
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

print(X_train.shape, y_train.shape)
print()
print(X_test.shape, y_test.shape)

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)                           

(90, 4) (90,)

(60, 4) (60,)


0.96666666666666667

### 1 Computing cross-validated metrics

In [28]:
from sklearn.model_selection import cross_val_score

clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)  # one score per fold

print(scores)
print()
# Mean score and a rough 95% interval, taken here as the mean +/- 2 standard deviations of the fold scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[ 0.96666667  1.          0.96666667  0.96666667  1.        ]

Accuracy: 0.98 (+/- 0.03)


In [14]:
scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_macro')  # choose the metric with the scoring parameter
scores    

array([ 0.96658312,  1.        ,  0.96658312,  0.96658312,  1.        ])

In [34]:
from sklearn.model_selection import ShuffleSplit

n_samples = iris.data.shape[0]
# The model_selection version takes n_splits; the data is supplied later through split()
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

print(n_samples)
print()
print(cv)
cross_val_score(clf, iris.data, iris.target, cv=cv)  # use a different cross-validation strategy

# Inspect the indices of each split
# for train_index, test_index in cv.split(iris.data):
#     print("TRAIN:", train_index, "TEST:", test_index)

150

ShuffleSplit(n_splits=3, random_state=0, test_size=0.3, train_size=None)


array([ 0.97777778,  0.97777778,  1.        ])

* Data transformation with held-out data: the same preprocessing must be applied to the test set
> * Preprocessing (such as standardization or feature selection) and similar data transformations should be learnt from the training set and then applied to held-out data for prediction.

In [15]:
from sklearn import preprocessing

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)

clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
X_test_transformed = scaler.transform(X_test)

clf.score(X_test_transformed, y_test)  

0.93333333333333335

In [16]:
from sklearn.pipeline import make_pipeline

clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
cross_val_score(clf, iris.data, iris.target, cv=cv)

array([ 0.97777778,  0.93333333,  0.95555556])

* Obtaining predictions by cross-validation
> * cross_val_predict returns, for each element in the input, the prediction that was obtained for that element when it was in the test set.
> * Only cross-validation strategies that assign every element to a test set exactly once can be used.

In [17]:
from sklearn.model_selection import cross_val_predict
from sklearn import metrics

predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
metrics.accuracy_score(iris.target, predicted) 

0.96666666666666667

### 2 Cross validation iterators

### 3 Cross-validation iterators for i.i.d. data: different ways of splitting the data
* Independent and identically distributed (i.i.d.)
> * While i.i.d. data is a common assumption in machine learning theory, it rarely holds in practice.
> * If one knows that the samples have been generated using a time-dependent process, it is safer to use a time-series-aware cross-validation scheme.
> * Similarly, if we know that the generative process has a group structure (samples collected from different subjects, experiments, or measurement devices), it is safer to use group-wise cross-validation.

#### 3.1 K-fold

In [19]:
# 2-fold cross-validation on a dataset with 4 samples
import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))

[2 3] [0 1]
[0 1] [2 3]


* Each fold is constituted by two arrays: the first one is related to the training set, and the second one to the test set. 
* Thus, one can create the training/test sets using numpy indexing

In [38]:
X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
y = np.array([0, 1, 0, 1])
# train and test still hold the indices from the last iteration of the KFold loop above
X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
print(X_train)
print()
print(X_test)
print()
print(y_train)
y_test

[[ 0.  0.]
 [ 1.  1.]]

[[-1. -1.]
 [ 2.  2.]]

[0 1]


array([0, 1])

#### 3.2 Leave One Out (LOO)

In [39]:
from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))

[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]


#### 3.3 Leave P Out (LPO)

In [40]:
# Leave-2-Out on a dataset with 4 samples
from sklearn.model_selection import LeavePOut

X = np.ones(4)
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print("%s %s" % (train, test))

[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]
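
The six splits printed above are all the ways of choosing 2 test samples out of 4. As a small sketch (the comparison against `math.comb` is an addition here, not part of the original notebook), the number of splits produced by LeaveOneOut and LeavePOut can be checked without iterating over them:

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = np.ones(4)

# Leave-One-Out produces one split per sample.
n_loo = LeaveOneOut().get_n_splits(X)

# Leave-P-Out produces one split per subset of p samples.
n_lpo = LeavePOut(p=2).get_n_splits(X)

print(n_loo, n_lpo, comb(4, 2))
```

Because the number of Leave-P-Out splits grows combinatorially with the number of samples, it is usually only practical on small datasets.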


#### 3.4 Random permutations cross-validation a.k.a. Shuffle & Split

In [41]:
from sklearn.model_selection import ShuffleSplit

X = np.arange(5)
ss = ShuffleSplit(n_splits=3, test_size=0.25,random_state=0)
# random_state: seed the random number generator explicitly so the results are reproducible

for train_index, test_index in ss.split(X):
    print("%s %s" % (train_index, test_index))

[1 3 4] [2 0]
[1 4 3] [0 2]
[4 0 2] [1 3]


In [48]:
from sklearn.model_selection import ShuffleSplit
help(ShuffleSplit)

Help on class ShuffleSplit in module sklearn.model_selection._split:

class ShuffleSplit(BaseShuffleSplit)
 |  Random permutation cross-validator
 |  
 |  Yields indices to split data into training and test sets.
 |  
 |  Note: contrary to other cross-validation strategies, random splits
 |  do not guarantee that all folds will be different, although this is
 |  still very likely for sizeable datasets.
 |  
 |  Read more in the :ref:`User Guide <cross_validation>`.
 |  
 |  Parameters
 |  ----------
 |  n_splits : int (default 10)
 |      Number of re-shuffling & splitting iterations.
 |  
 |  test_size : float, int, or None, default 0.1
 |      If float, should be between 0.0 and 1.0 and represent the
 |      proportion of the dataset to include in the test split. If
 |      int, represents the absolute number of test samples. If None,
 |      the value is automatically set to the complement of the train size.
 |  
 |  train_size : float, int, or None (default is None)
 |      If floa

* ShuffleSplit is thus a good alternative to KFold cross validation that allows a finer control on the number of iterations and the proportion of samples on each side of the train / test split.
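
A minimal sketch of that finer control, using a toy array of 20 samples chosen only for illustration: the number of iterations and the size of the test side are set independently of each other.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20)  # toy data, 20 samples

# Ten independent random splits, each holding out 25% of the samples.
ss = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

# Record the (train size, test size) pair of every split.
split_sizes = [(len(train), len(test)) for train, test in ss.split(X)]

print(ss.get_n_splits(X))
print(set(split_sizes))
```

With KFold the test proportion is tied to the number of folds (roughly 1 / n_splits), whereas here the two are chosen separately.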

### 4 Cross-validation iterators with stratification based on class labels.
* Some classification problems exhibit a large imbalance in the distribution of the target classes.
> * Use stratified sampling, as implemented in StratifiedKFold and StratifiedShuffleSplit, to ensure that relative class frequencies are approximately preserved in each train and validation fold.

In [42]:
# stratified 3-fold cross-validation on a dataset with 10 samples from two slightly unbalanced classes
from sklearn.model_selection import StratifiedKFold   # StratifiedShuffleSplit

X = np.ones(10)
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
skf = StratifiedKFold(n_splits=3)

for train, test in skf.split(X, y):
    print("%s %s" % (train, test))

[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]
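
StratifiedShuffleSplit is mentioned above but not demonstrated. The following sketch, on made-up labels with a 40% / 60% class balance, counts the classes in each test split to show that the proportions are kept:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.zeros((20, 1))            # features are irrelevant for the split itself
y = np.array([0] * 8 + [1] * 12)  # 8 samples of class 0, 12 of class 1

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)

# Per-class counts in the test side of every split.
test_class_counts = [np.bincount(y[test]).tolist() for _, test in sss.split(X, y)]

print(test_class_counts)
```

Each test split holds 10 samples, and the class counts in it mirror the 40% / 60% balance of the full label vector.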


### 5 Cross-validation iterators for grouped data
* The i.i.d. assumption is broken if the underlying generative process yields groups of dependent samples.
* An example is medical data collected from multiple patients, with multiple samples taken from each patient.
* Such data is likely to be dependent on the individual group.
* In this case we would like to know if a model trained on a particular set of groups generalizes well to the unseen groups. 
* To measure this, we need to ensure that all the samples in the validation fold come from groups that are not represented at all in the paired training fold.

#### 5.1 Group k-fold
* GroupKFold is a variation of k-fold which ensures that the same group is not represented in both testing and training sets. 

In [43]:
# Three subjects, each identified by a number from 1 to 3
from sklearn.model_selection import GroupKFold

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]


* Each subject is in a different testing fold, and the same subject is never in both testing and training. 
* Notice that the folds do not have exactly the same size due to the imbalance in the data.
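
The statement that no subject appears on both sides of a split can be checked directly. This is a small sketch on toy data; the group labels stand in for hypothetical subject IDs.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)          # toy features
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])    # hypothetical subject IDs

gkf = GroupKFold(n_splits=3)

overlaps = []
for train, test in gkf.split(X, y, groups=groups):
    train_groups = set(groups[train].tolist())
    test_groups = set(groups[test].tolist())
    # Groups present on both sides of this split (expected to be empty).
    overlaps.append(train_groups & test_groups)
    print(sorted(train_groups), sorted(test_groups))
```

An empty intersection for every split is the property that makes the score an estimate of performance on unseen groups.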

#### 5.2 Leave One Group Out
* LeaveOneGroupOut can be used to create a cross-validation scheme based on different experiments: each training set contains the samples of all experiments except one.
* Another common application is to use time information: for instance the groups could be the year of collection of the samples and thus allow for cross-validation against time-based splits.

In [44]:
from sklearn.model_selection import LeaveOneGroupOut

X = [1, 5, 10, 50, 60, 70, 80]
y = [0, 1, 1, 2, 2, 2, 2]
groups = [1, 1, 2, 2, 3, 3, 3]
logo = LeaveOneGroupOut()
for train, test in logo.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]
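
The time-based use mentioned above can be sketched the same way. The `years` array below is invented for illustration; each split holds out all samples collected in one year.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.arange(6).reshape(6, 1)
y = np.array([0, 1, 0, 1, 0, 1])
years = np.array([2015, 2015, 2016, 2016, 2017, 2017])  # hypothetical collection years

logo = LeaveOneGroupOut()

# For every split, list which year(s) ended up in the test set.
held_out_years = [
    np.unique(years[test]).tolist()
    for _, test in logo.split(X, y, groups=years)
]

print(held_out_years)
```

Note that this holds out one year at a time but still trains on later years; it does not enforce a strict past-to-future ordering.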


#### 5.3 Leave P Groups Out

In [45]:
from sklearn.model_selection import LeavePGroupsOut

X = np.arange(6)
y = [1, 1, 1, 2, 2, 2]
groups = [1, 1, 2, 2, 3, 3]
lpgo = LeavePGroupsOut(n_groups=2) # Leave-2-Group Out
for train, test in lpgo.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]


#### 5.4 Group Shuffle Split
* The GroupShuffleSplit iterator behaves as a combination of ShuffleSplit and LeavePGroupsOut, and generates a sequence of randomized partitions in which a subset of groups are held out for each split.
* This class is useful when the behavior of LeavePGroupsOut is desired, but the number of groups is large enough that generating all possible partitions with P groups withheld would be prohibitively expensive.
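
To make the cost argument concrete, here is a sketch with 30 made-up groups: holding out every combination of 5 groups requires far more splits than a fixed budget of randomized ones.

```python
from math import comb

import numpy as np
from sklearn.model_selection import GroupShuffleSplit, LeavePGroupsOut

groups = np.repeat(np.arange(30), 2)   # 30 hypothetical groups, 2 samples each
X = np.zeros((len(groups), 1))

# Exhaustive: one split per way of choosing 5 groups out of 30.
lpgo = LeavePGroupsOut(n_groups=5)
n_exhaustive = lpgo.get_n_splits(X, groups=groups)

# Randomized: 10 splits, each holding out 5 randomly chosen groups.
gss = GroupShuffleSplit(n_splits=10, test_size=5, random_state=0)
n_random = gss.get_n_splits()
test_group_counts = [
    len(np.unique(groups[test])) for _, test in gss.split(X, groups=groups)
]

print(n_exhaustive, comb(30, 5), n_random)
```

Passing an integer `test_size` to GroupShuffleSplit is interpreted as a number of groups, not a number of samples.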

In [46]:
from sklearn.model_selection import GroupShuffleSplit

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001]
y = ["a", "b", "b", "b", "c", "c", "c", "a"]
groups = [1, 1, 2, 2, 3, 3, 4, 4]
gss = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)

for train, test in gss.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[0 1 2 3] [4 5 6 7]
[2 3 6 7] [0 1 4 5]
[2 3 4 5] [0 1 6 7]
[4 5 6 7] [0 1 2 3]


### 6 Predefined Fold-Splits / Validation-Sets
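
The notebook gives no example for this section. A minimal sketch, assuming scikit-learn's `PredefinedSplit` is the intended tool and using a made-up `test_fold` array:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# test_fold[i] is the index of the test fold that sample i belongs to;
# -1 keeps the sample in the training set of every split.
test_fold = np.array([0, 1, -1, 1])
ps = PredefinedSplit(test_fold)

splits = [(train.tolist(), test.tolist()) for train, test in ps.split()]

print(ps.get_n_splits())
for train, test in splits:
    print(train, test)
```

This is useful when the data already comes with a fixed validation set or fold assignment that should be reused as-is.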

### 7 Cross validation of time series data
* TimeSeriesSplit can be used to cross-validate time series data samples that are observed at fixed time intervals.

* Time series data is characterised by the correlation between observations that are near in time (autocorrelation). 
* However, classical cross-validation techniques such as KFold and ShuffleSplit assume the samples are independent and identically distributed, and would result in unreasonable correlation between training and testing instances (yielding poor estimates of generalisation error) on time series data. 
* Therefore, it is very important to evaluate our model for time series data on the “future” observations least like those that are used to train the model. 

In [47]:
# 3-split time series cross-validation on a dataset with 6 samples
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=3)
print(tscv)  

for train, test in tscv.split(X):
    print("%s %s" % (train, test))

TimeSeriesSplit(n_splits=3)
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]


### 8 A note on shuffling

* Some cross validation iterators, such as KFold, have an inbuilt option to shuffle the data indices before splitting them. 
* Note that:
> * This consumes less memory than shuffling the data directly.
> * By default no shuffling occurs, including for the (stratified) K-fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.
> * The random_state parameter defaults to None, meaning that the shuffling will be different every time KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for each set of parameters validated by a single call to its fit method.
> * To ensure results are repeatable (on the same platform), use a fixed value for random_state.
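
A short sketch of these points on a toy array: without shuffling the folds follow the data order, and with `shuffle=True` a fixed `random_state` makes the folds repeatable. The helper name `held_out_folds` is introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6)

def held_out_folds(cv):
    """Return the test indices of every split as plain lists."""
    return [test.tolist() for _, test in cv.split(X)]

# Default: no shuffling, folds are contiguous blocks in data order.
unshuffled = held_out_folds(KFold(n_splits=3))

# Shuffled, with the same seed used twice.
first = held_out_folds(KFold(n_splits=3, shuffle=True, random_state=0))
second = held_out_folds(KFold(n_splits=3, shuffle=True, random_state=0))

print(unshuffled)
print(first == second)
```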

### 9 Cross validation and model selection
* Cross validation iterators can also be used to directly perform model selection using Grid Search for the optimal hyperparameters of the model.
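
As a hedged sketch of that idea, a cross-validation iterator can be passed to `GridSearchCV` through its `cv` argument. The parameter grid below is arbitrary and chosen only for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, StratifiedKFold

iris = datasets.load_iris()

# Illustrative grid: 3 values of C times 2 kernels = 6 candidates.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Any cross-validation iterator can be used here; a seeded, shuffled
# StratifiedKFold keeps the folds identical for every candidate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(svm.SVC(), param_grid, cv=cv)
search.fit(iris.data, iris.target)

print(search.best_params_)
print("%0.3f" % search.best_score_)
```

The reported `best_score_` is the mean cross-validated score of the best candidate, so it is optimistically biased as an estimate of generalization; a separate held-out test set is still advisable.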