# 1 Cross-validation: evaluating estimator performance

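**Three kinds of sample splitting**  
- Samples assumed to be independent and identically distributed (i.i.d.), covering both plain and stratified splits
- Data assumed to belong to groups  
- Time series data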

## 1.1 Computing cross-validated metrics

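Computing cross-validated metrics:  
- **cross_val_score**  
- **cross_validate**
- **cross_val_predict**  
- **metrics.make_scorer**: builds a custom scoring function  

cross_val_score takes an estimator (anything with a fit method), X, y, groups, scoring (the performance evaluation function) and cv (the number of folds, or a splitter object), and returns an array of scores, one per fold.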

In [2]:
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()
print(cross_val_score(lasso, X, y, cv=3))  

[0.33150734 0.08022311 0.03531764]


**The scoring method can be specified explicitly** --> [scikit-learn: model evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)
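
As a minimal sketch of specifying the scoring method, the cell below reuses the diabetes/Lasso setup from above and passes `scoring` once as a built-in scorer name and once as a custom scorer built with `make_scorer`. The choice of mean squared error and mean absolute error is an assumption made for illustration, not something the notes prescribe.

```python
from sklearn import datasets, linear_model
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

diabetes = datasets.load_diabetes()
X, y = diabetes.data[:150], diabetes.target[:150]
lasso = linear_model.Lasso()

# Built-in scorer selected by name. Error metrics are negated so that
# "greater is better" holds for every scorer.
neg_mse = cross_val_score(lasso, X, y, cv=3, scoring="neg_mean_squared_error")

# Custom scorer built from a metric function. greater_is_better=False
# flips the sign, so the values are again "greater is better".
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
neg_mae = cross_val_score(lasso, X, y, cv=3, scoring=mae_scorer)

print(neg_mse)
print(neg_mae)
```

Both arrays hold one value per fold, and both are non-positive because they are negated errors.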

**The way cv splits the dataset can also be specified explicitly**

In [5]:
from sklearn import preprocessing
from sklearn.pipeline import make_pipeline
from sklearn import svm
iris = datasets.load_iris()
clf = make_pipeline(preprocessing.StandardScaler(),svm.SVC(C=1))
print(cross_val_score(clf,iris.data,iris.target,cv=10))

[1.         0.93333333 1.         0.93333333 1.         0.93333333
 0.86666667 1.         1.         1.        ]
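
The cell above passes an integer to `cv`. A splitter object can be passed instead, which is one way to choose the splitting strategy explicitly. The sketch below assumes `ShuffleSplit` with 5 random 70/30 splits; those values are arbitrary and chosen only for illustration.

```python
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline

iris = datasets.load_iris()
clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))

# A splitter object instead of an integer: 5 random 70/30 splits.
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
scores = cross_val_score(clf, iris.data, iris.target, cv=cv)

print(scores.shape)  # one score per split
print(scores)
```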


In [8]:
from sklearn.metrics import recall_score,precision_score,make_scorer
from sklearn.model_selection import cross_validate
scoring = {'prec_macro': 'precision_macro',
           'rec_macro': make_scorer(recall_score, average='macro')}
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
                        cv=5, return_train_score=True)

In [9]:
list(scores.keys())

['fit_time',
 'score_time',
 'test_prec_macro',
 'train_prec_macro',
 'test_rec_macro',
 'train_rec_macro']

### 1.1.2 Obtaining predictions by cross-validation

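**cross_val_predict** has the same interface as **cross_val_score**, but instead of one score per fold it returns, for each sample, the prediction made when that sample was in the test fold.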
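
A minimal sketch of `cross_val_predict` on the iris data, assuming an `SVC` classifier and 10 folds (both chosen for illustration). The result has one out-of-fold prediction per sample, so it can be compared directly against the true labels.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

iris = datasets.load_iris()
clf = svm.SVC(C=1)

# Each sample is predicted by the model that did not see it during training.
predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
accuracy = accuracy_score(iris.target, predicted)

print(predicted.shape)  # same length as iris.target
print(accuracy)
```

Note that this accuracy is computed over all pooled predictions, so it need not equal the mean of the per-fold scores from `cross_val_score`.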

## 1.2 Cross validation iterators

**Each cross-validation strategy produces a different way of splitting the dataset ---> the iterators return the indices of each split**

### 1.2.1 Cross-validation iterators for i.i.d. data

**KFold**  
- n_splits: number of folds (2, 3, 5, ...)
- shuffle: whether to shuffle the samples before splitting, e.g. 1 2 3 ---> 2 1 3
- random_state: only used when shuffle=True

In [12]:
import numpy as np
from sklearn.model_selection import KFold
kf = KFold(n_splits=2)
X = ["a", "b", "c", "d"]
for train, test in kf.split(X):  # call split(data) to generate the folds
    print(train, test)  # yields the train and test indices

[2 3] [0 1]
[0 1] [2 3]
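
The cell above uses the default `shuffle=False`. As a sketch of the `shuffle` and `random_state` parameters, the example below shuffles six samples into three folds (sizes chosen for illustration) and checks that every sample still appears in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6)
# shuffle=True permutes the samples before cutting them into folds;
# random_state makes that permutation reproducible.
kf = KFold(n_splits=3, shuffle=True, random_state=0)

test_indices = []
for train, test in kf.split(X):
    print(train, test)
    test_indices.extend(test.tolist())

# Every sample lands in exactly one test fold.
print(sorted(test_indices))  # [0, 1, 2, 3, 4, 5]
```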


**Repeated K-Fold**  
- n_splits: number of folds in each repetition  
- n_repeats: how many times K-Fold is repeated  
- random_state  


In [13]:
from sklearn.model_selection import RepeatedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=2652124)
for train_index, test_index in rkf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [0 1] TEST: [2 3]
TRAIN: [2 3] TEST: [0 1]
TRAIN: [1 2] TEST: [0 3]
TRAIN: [0 3] TEST: [1 2]
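
As a small check on how the two parameters combine, the sketch below counts the splits produced by `RepeatedKFold`: there are `n_splits * n_repeats` of them. The values 2 and 3 are assumptions picked for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=0)

splits = list(rkf.split(X))
print(len(splits))          # n_splits * n_repeats = 6
print(rkf.get_n_splits())   # same number, without iterating
```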


**Leave One Out (LOO)**  

In [14]:
from sklearn.model_selection import LeaveOneOut
X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))

[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]


**Leave P Out (LPO)**  
**LeavePOut**
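
A minimal sketch of `LeavePOut`, mirroring the Leave One Out cell above: every possible subset of `p` samples is used once as the test set. The choice of `p=2` on four samples is an assumption made for illustration.

```python
from sklearn.model_selection import LeavePOut

X = [1, 2, 3, 4]
lpo = LeavePOut(p=2)

splits = list(lpo.split(X))
for train, test in splits:
    print("%s %s" % (train, test))

# The number of splits is the number of ways to choose p samples out of n.
print(len(splits))  # C(4, 2) = 6
```

Because the number of splits grows combinatorially with the sample count, this strategy is usually only practical on small datasets.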

**Random permutations cross-validation a.k.a. Shuffle & Split**  
**ShuffleSplit**  

In [22]:
from sklearn.model_selection import ShuffleSplit
X = np.arange(10)
ss = ShuffleSplit(n_splits=5, test_size=0.25,random_state=1)
for train_index, test_index in ss.split(X):
    print("%s %s" % (train_index, test_index))

[4 0 3 1 7 8 5] [2 9 6]
[0 8 4 2 1 6 7] [9 5 3]
[9 0 6 1 7 4 2] [8 3 5]
[0 6 1 5 8 7 9] [2 4 3]
[0 1 2 3 5 9 8] [6 4 7]


### 1.2.2 Cross-validation iterators with stratification based on class labels