# Cross-Validation

Reference blog: https://www.jianshu.com/p/6b5ef5afdf14

## The ShuffleSplit Method
```python
ShuffleSplit(n_splits=10, test_size=None, train_size=None, random_state=None)
```
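**Parameters:**

First parameter (`n_splits`): the number of re-shuffling and splitting iterations, i.e. how many train/test splits are generated.

Second and third parameters (`test_size`, `train_size`): the proportions of the data used for the test set and the training set.

Fourth parameter (`random_state`):
1. int: the integer is used as the seed of the random number generator
2. RandomState instance: the given random number generator is used directly
3. None: the random number generator is the one used by `np.random`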
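
The effect of `random_state` can be checked with a minimal sketch like the one below. The toy array `X_demo` and the helper `collect_splits` are illustrative names that do not appear in the original notes.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_demo = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each


def collect_splits(splitter, data):
    """Return the (train, test) index pairs as plain lists for easy comparison."""
    return [(train.tolist(), test.tolist()) for train, test in splitter.split(data)]


# Two splitters built with the same integer seed produce identical splits.
splits_a = collect_splits(ShuffleSplit(n_splits=3, test_size=0.3, random_state=42), X_demo)
splits_b = collect_splits(ShuffleSplit(n_splits=3, test_size=0.3, random_state=42), X_demo)
print(splits_a == splits_b)

# A RandomState instance is accepted as well.
rng = np.random.RandomState(42)
splits_c = collect_splits(ShuffleSplit(n_splits=3, test_size=0.3, random_state=rng), X_demo)
print(len(splits_c))
```

With `random_state=None` the splits generally differ from run to run, which is why the examples in these notes fix the seed whenever the printed indices are meant to be reproducible.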

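**Methods:**

1. `get_n_splits`: returns the number of splitting iterations (in effect, the value passed as the first parameter)

2. `split`: generates the train and test indices, not the data itself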

**Example 01: ShuffleSplit**

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
y = np.array([1, 2, 1, 2, 1, 2])
rs = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)
rs.get_n_splits(X)
```

Output:

```
5
```

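1. `random_state` is set to 0 in every call, so the returned indices are the same on every run.
2. If only the training proportion is set, the test set defaults to 1 minus the training proportion.
3. Likewise, the training and test proportions do not have to sum to 1; the remaining samples are simply left unused.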

```python
rs = ShuffleSplit(n_splits=5, test_size=.25, train_size=None, random_state=0)
for train_index, test_index in rs.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
```

Output:

```
TRAIN: [1 3 0 4] TEST: [5 2]
TRAIN: [4 0 2 5] TEST: [1 3]
TRAIN: [1 2 4 0] TEST: [3 5]
TRAIN: [3 4 1 0] TEST: [5 2]
TRAIN: [3 5 1 0] TEST: [2 4]
```


```python
rs = ShuffleSplit(n_splits=5, test_size=.25, train_size=0.5, random_state=0)
for train_index, test_index in rs.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
```

Output:

```
TRAIN: [1 3 0] TEST: [5 2]
TRAIN: [4 0 2] TEST: [1 3]
TRAIN: [1 2 4] TEST: [3 5]
TRAIN: [3 4 1] TEST: [5 2]
TRAIN: [3 5 1] TEST: [2 4]
```
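
The way the split sizes are resolved can be explored with a small sketch. The helper `split_sizes` and the array `X_demo` are hypothetical names introduced here. The behaviour shown for `train_size` alone assumes scikit-learn 0.21 or later, where the test set complements the training set.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_demo = np.arange(40).reshape(20, 2)  # 20 toy samples


def split_sizes(**kwargs):
    """Return (n_train, n_test) for the first split produced by ShuffleSplit."""
    splitter = ShuffleSplit(n_splits=1, random_state=0, **kwargs)
    train_idx, test_idx = next(splitter.split(X_demo))
    return len(train_idx), len(test_idx)


print(split_sizes(train_size=0.5))                  # (10, 10): test set is the complement
print(split_sizes(train_size=0.5, test_size=0.25))  # (10, 5): 5 samples are left unused
print(split_sizes())                                # (18, 2): test_size falls back to 0.1
```

The test-set size is rounded up and the training-set size is rounded down, which matches the 2 test samples and 4 training samples obtained from 6 samples with `test_size=.25` above.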


**Example 02: Testing together with SVC**

```python
from sklearn.svm import SVC
from sklearn.model_selection import ShuffleSplit
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

# No random_state is set, so the scores change from run to run.
cv_split = ShuffleSplit(n_splits=5, train_size=0.7, test_size=0.25)
for train_index, test_index in cv_split.split(X):
    train_X = X[train_index]
    test_X = X[test_index]
    train_y = y[train_index]
    test_y = y[test_index]
    svc_model = SVC()
    svc_model.fit(train_X, train_y)
    score = svc_model.score(test_X, test_y)  # accuracy on the test split
    print(score)
```

Output:

```
0.9473684210526315
1.0
0.9473684210526315
0.9210526315789473
0.9736842105263158
```
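
The manual loop above can also be written with `cross_val_score`, which accepts a splitter through its `cv` argument. This is a sketch rather than part of the original notes; the fixed `random_state=0` and the names `X_iris`, `y_iris` and `cv` are additions made here for reproducibility and clarity.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Same split configuration as above, with a fixed seed (an addition).
cv = ShuffleSplit(n_splits=5, train_size=0.7, test_size=0.25, random_state=0)

# cross_val_score clones the estimator, fits it on each training split,
# and returns one accuracy score per test split.
scores = cross_val_score(SVC(), X_iris, y_iris, cv=cv)
print(scores)
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Reporting the mean and standard deviation over the splits gives a more stable picture than any single score.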




## Using Pipeline Together with GridSearchCV


The Pipeline constructor takes a list of (name, transform) tuples as its argument.
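
As a minimal sketch of the naming convention, each step name becomes a prefix for that step's parameters, written as `<step name>__<parameter name>`. The `scaler` step and the variable `demo_pipeline` are illustrative additions, not part of the original notes.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The 'scaler' step is an illustrative addition to show a multi-step pipeline.
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC()),
])

# Steps can be looked up by name.
print(demo_pipeline.named_steps['svc'])

# Parameters of a step are addressed as '<step name>__<parameter name>'.
demo_pipeline.set_params(svc__C=10, svc__kernel='linear')
print(demo_pipeline.get_params()['svc__C'])  # 10
```

The same double-underscore keys are what a parameter grid for `GridSearchCV` uses when the estimator is a pipeline.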

```python
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV

# X and y are the iris data loaded in Example 02.
pipe_steps = [
    ('svc', SVC())
]
pipeline = Pipeline(pipe_steps)
cv_split = ShuffleSplit(n_splits=5, train_size=0.7, test_size=0.25)
# Keys follow the '<step name>__<parameter name>' convention.
# Note: degree only affects the 'poly' kernel and cache_size only affects
# speed, so neither changes the scores for the kernels searched here.
param_grid = {
    'svc__cache_size' : [100, 200, 400],
    'svc__C': [1, 10, 100],
    'svc__kernel' : ['rbf', 'linear'],
    'svc__degree' : [1, 2, 3, 4],
}
grid = GridSearchCV(pipeline, param_grid, cv=cv_split)
grid.fit(X, y)
```

Output:

```
GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=None, test_size=0.25, train_size=0.7),
       error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'svc__cache_size': [100, 200, 400], 'svc__C': [1, 10, 100], 'svc__kernel': ['rbf', 'linear'], 'svc__degree': [1, 2, 3, 4]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
```

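Once training has finished, the search exposes the best parameters, i.e. the best combination of the values listed in the `param_grid` dictionary:

```python
print(grid.best_params_)
print(grid.best_score_)
```

Output:

```
{'svc__C': 1, 'svc__cache_size': 100, 'svc__degree': 1, 'svc__kernel': 'linear'}
0.9789473684210527
```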
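
Beyond the single best combination, the fitted search object also keeps per-combination scores in `cv_results_` and the refitted model in `best_estimator_`. The sketch below assumes a reduced grid and a fixed `random_state=0` to keep it quick and reproducible; `search`, `X_iris` and `y_iris` are names introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

search = GridSearchCV(
    Pipeline([('svc', SVC())]),
    # Reduced grid (an assumption made to keep the sketch fast).
    param_grid={'svc__C': [1, 10], 'svc__kernel': ['rbf', 'linear']},
    cv=ShuffleSplit(n_splits=5, train_size=0.7, test_size=0.25, random_state=0),
)
search.fit(X_iris, y_iris)

# With refit=True (the default), best_estimator_ is the pipeline refitted
# on the whole dataset using the best parameter combination.
print(search.best_estimator_)

# cv_results_ holds one entry per parameter combination.
results = search.cv_results_
for params, mean, std in zip(results['params'],
                             results['mean_test_score'],
                             results['std_test_score']):
    print(params, "mean=%.3f" % mean, "std=%.3f" % std)
```

Looking at the spread of scores helps judge whether the best combination is clearly better or only marginally ahead of the others.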


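# Showing a Detailed classification_report
Use the `classification_report` function from `sklearn.metrics` to check the classification performance of the best model obtained in the previous step. Note that the report here is computed on the full dataset `X`, which includes the samples used for training, so the numbers are optimistic.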

```python
from sklearn.metrics import classification_report

predict_y = grid.predict(X)
print(classification_report(y, predict_y))
```

Output:

```
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      0.98      0.99        50
           2       0.98      1.00      0.99        50

   micro avg       0.99      0.99      0.99       150
   macro avg       0.99      0.99      0.99       150
weighted avg       0.99      0.99      0.99       150
```
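
A report on samples the model has never seen could look like the following sketch. It is not part of the original notes: the hold-out split via `train_test_split`, the fixed seed, and the variable names are assumptions, and the SVC parameters are copied from the `best_params_` printed earlier.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris_data = load_iris()
X_iris, y_iris = iris_data.data, iris_data.target

# Hold out 25% of the data; stratify keeps the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.25, stratify=y_iris, random_state=0)

# Parameters taken from the best_params_ shown earlier; in practice the grid
# search itself would be fitted on X_train only.
model = SVC(C=1, kernel='linear').fit(X_train, y_train)

report = classification_report(y_test, model.predict(X_test),
                               target_names=iris_data.target_names)
print(report)
```

Passing `target_names` replaces the numeric class labels with the species names in the printed table.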



# Further Reading

## Using Pipeline in Sklearn

Reference blogs:

【1】https://blog.csdn.net/dss_dssssd/article/details/82840256

【2】https://www.jianshu.com/p/9c2c8c8ef42d

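Many data transformation steps have to be executed in the correct order, and sklearn provides the Pipeline class to handle such sequences of operations.
A simple example: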

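```python
from sklearn.impute import SimpleImputer  # replaces the removed Imputer class (scikit-learn >= 0.20)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# CombinedAttributesAdder (a custom transformer) and housing_num (the numeric
# columns of the housing data) are assumed to be defined elsewhere.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
```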

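How the data flows through the pipeline:

The imputer step first fills in the missing values in `housing_num`. Its output is passed to `CombinedAttributesAdder`, which adds some feature attributes. That output is then passed to `StandardScaler`, which standardizes it. This completes the data preprocessing.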
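
Since `CombinedAttributesAdder` and `housing_num` are not defined in these notes, the flow can be tried out with a reduced, runnable sketch. The toy array `toy_num` stands in for the housing data and the custom step is left out; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data with one missing value (a stand-in for housing_num).
toy_num = np.array([[1.0, 10.0],
                    [2.0, np.nan],
                    [3.0, 30.0],
                    [4.0, 50.0]])

toy_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),  # NaN -> column median (30.0)
    ('std_scaler', StandardScaler()),               # zero mean, unit variance per column
])

# fit_transform runs fit_transform on every step in order,
# feeding each step's output into the next one.
toy_tr = toy_pipeline.fit_transform(toy_num)
print(toy_tr.mean(axis=0))  # approximately 0 for each column
print(toy_tr.std(axis=0))   # approximately 1 for each column
```

Because the steps are bundled in one object, the same fitted pipeline can later be applied to new data with `toy_pipeline.transform(...)`, reusing the median and scaling statistics learned here.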