- scikit-learn website: http://scikit-learn.org/stable/index.html
- scikit-learn API reference: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

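# Getting data

Machine learning algorithms usually need a lot of data. In sklearn there are two common ways to get it: load one of the bundled datasets, or generate a synthetic one.

## Loading built-in datasets

sklearn ships with a number of datasets that are handy for testing and analyzing algorithms, so you don't have to hunt for data yourself.

They include:

- Iris: `load_iris()`
- Handwritten digits: `load_digits()`
- Diabetes: `load_diabetes()`
- Breast cancer: `load_breast_cancer()`
- Boston house prices: `load_boston()` (deprecated in scikit-learn 1.0 and removed in 1.2)
- Linnerud physical exercise: `load_linnerud()`

Here we load the Iris dataset as an example.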

In [1]:
# Load a dataset bundled with sklearn

import sklearn.datasets as sk_datasets
iris = sk_datasets.load_iris()
iris_X = iris.data  # feature matrix
iris_y = iris.target  # class labels
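
Before using a bundled dataset it can help to look at what was loaded. The following is a minimal sketch, assuming the `target_names` and `feature_names` attributes of the object returned by `load_iris()`; the variable names are only illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (number of samples, number of features)
print(y.shape)             # one label per sample
print(iris.target_names)   # names of the classes
print(iris.feature_names)  # names of the feature columns
```

The same pattern should work for the other `load_*` functions, although not every dataset exposes every attribute.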

## Generating datasets

sklearn's sample generators can create synthetic data. The `make_*` functions in `sklearn.datasets` cover many kinds of sample data (older releases also exposed them through `sklearn.datasets.samples_generator`, which newer releases no longer provide).

Here we generate sample data for a classification problem.

In [4]:
import sklearn.datasets as sk_datasets

X, y = sk_datasets.make_classification(
    n_samples=6, n_features=5, n_informative=2, n_redundant=3, n_classes=2,
    n_clusters_per_class=2, scale=1, random_state=20)

for x_, y_ in zip(X, y):
    print(y_, end=": ") 
    print(x_)

0: [ 0.64459602  0.92767918 -1.32091378 -1.25725859 -0.74386837]
0: [ 1.66098845  2.22206181 -2.86249859 -3.28323172 -1.62389676]
0: [ 0.27019475 -0.12572907  1.1003977  -0.6600737   0.58334745]
1: [-0.77182836 -1.03692724  1.34422289  1.52452016  0.76221055]
1: [-0.1407289   0.32675611 -1.41296696  0.4113583  -0.75833145]
1: [-0.76656634 -0.35589955 -0.83132182  1.68841011 -0.4153836 ]


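Parameters:

- `n_features`: total number of features, `n_informative + n_redundant + n_repeated`; any remaining features are filled with random noise
- `n_informative`: number of informative features
- `n_redundant`: number of redundant features, generated as random linear combinations of the informative features
- `n_repeated`: number of duplicated features, drawn at random from the informative and redundant features
- `n_classes`: number of classes
- `n_clusters_per_class`: number of clusters that make up each class
- `random_state`: random seed, which makes the experiment reproducible

The parameters must satisfy: `n_classes` $\times$ `n_clusters_per_class` $\leq$ $2^{\text{n_informative}}$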
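
The constraint above can be checked directly. This is a small sketch with arbitrarily chosen parameter values (not taken from the example above): the first call leaves two of the five features as noise, and the second call asks for 3 classes with 2 clusters each on only 2 informative features, which should be rejected.

```python
from sklearn.datasets import make_classification

# 2 informative + 1 redundant + 0 repeated = 3 features; the other 2 are noise
X, y = make_classification(n_samples=10, n_features=5, n_informative=2,
                           n_redundant=1, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, random_state=0)
print(X.shape)

# 3 classes * 2 clusters = 6 > 2**2 = 4, so this combination is rejected
try:
    make_classification(n_samples=10, n_features=5, n_informative=2,
                        n_redundant=1, n_classes=3,
                        n_clusters_per_class=2, random_state=0)
    rejected = False
except ValueError as err:
    rejected = True
    print("ValueError:", err)
```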

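# Splitting the dataset

Machine learning workflows usually require splitting the data, typically into a training set and a test set. sklearn's `model_selection` module provides the tools for this.

We split the Iris dataset as an example.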

In [7]:
import sklearn.model_selection as sk_model_selection
X_train, X_test, y_train, y_test = sk_model_selection.train_test_split(
    iris_X, iris_y, train_size=0.2, random_state=20)



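Parameters:

- `arrays`: the arrays to split, typically the feature matrix and the labels
- `test_size`:
  - `float`: fraction of the samples to put in the test set (0.25 when neither `test_size` nor `train_size` is given)
  - `int`: absolute number of test samples
- `train_size`: same as `test_size`, but for the training set
- `random_state`: `int`, random seed (a fixed seed makes the experiment reproducible)
- `shuffle`: whether to shuffle the data before splitting (default `True`)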
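
To make the `float` versus `int` behaviour of `test_size` concrete, here is a short sketch on the Iris data (150 samples). The variable names are illustrative, and `return_X_y=True` is used only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples

# Default: 25% of the samples go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print(len(X_tr), len(X_te))

# Integer test_size: an absolute number of test samples
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=30, random_state=0)
print(len(X_tr2), len(X_te2))
```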

sklearn also provides the following splitters: KFold, GroupKFold, StratifiedKFold, LeaveOneGroupOut, LeavePGroupsOut, LeaveOneOut, LeavePOut, ShuffleSplit, GroupShuffleSplit, StratifiedShuffleSplit, PredefinedSplit, TimeSeriesSplit.

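## Splitting methods: K-fold cross-validation (KFold, GroupKFold, StratifiedKFold)

- Split the whole training set $S$ into $k$ disjoint subsets. If $S$ contains $m$ training examples, each subset has $m/k$ examples; call the subsets $\{s_1, s_2, \cdots, s_k\}$
- In each round, take one subset as the test set and use the other $k-1$ subsets as the training set
- Train the model on the $k-1$ training subsets
- Evaluate that model on the test subset. After all $k$ rounds, the average of the $k$ scores is used as the estimate of the true accuracy of the model or hypothesis

This method makes full use of all the samples, but it is computationally heavier: the model is trained $k$ times and tested $k$ times.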
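
The procedure can be written out as an explicit loop. The sketch below is one possible illustration: the classifier (`KNeighborsClassifier` with 3 neighbours), the Iris data and `k = 5` are arbitrary choices. It also compares the result with `cross_val_score`, which runs the same loop internally when given the same splitter.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_index, test_index in kf.split(X):
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X[train_index], y[train_index])                        # train on k-1 folds
    manual_scores.append(clf.score(X[test_index], y[test_index]))  # test on the held-out fold

print("per-fold accuracy:", np.round(manual_scores, 3))
print("mean accuracy:", np.mean(manual_scores))

# cross_val_score performs the same train/test rounds with the same splitter
auto_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=kf)
print(np.allclose(manual_scores, auto_scores))
```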

### KFold

In [13]:
import numpy as np
# KFold
from sklearn.model_selection import KFold
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
kf = KFold(n_splits=2)  # number of folds
kf.get_n_splits(X)
print(kf)

for train_index, test_index in kf.split(X):
    print("Train Index:", train_index, ",Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # print(X_train,X_test,y_train,y_test)

KFold(n_splits=2, random_state=None, shuffle=False)
Train Index: [3 4 5] ,Test Index: [0 1 2]
Train Index: [0 1 2] ,Test Index: [3 4 5]


### GroupKFold

In [14]:
import numpy as np
from sklearn.model_selection import GroupKFold
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 2, 3, 4, 5, 6])
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(X, y, groups)
print(group_kfold)
for train_index, test_index in group_kfold.split(X, y, groups):
    print("Train Index:", train_index, ",Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # print(X_train,X_test,y_train,y_test)

GroupKFold(n_splits=2)
Train Index: [0 2 4] ,Test Index: [1 3 5]
Train Index: [1 3 5] ,Test Index: [0 2 4]
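
In the example above every sample has its own group, so the effect of grouping is hard to see. The sketch below uses hypothetical data in which each group (say, one subject) contributes two samples, and checks that `GroupKFold` never places the same group on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # four groups, two samples each

group_kfold = GroupKFold(n_splits=2)
n_splits_seen = 0
for train_index, test_index in group_kfold.split(X, y, groups):
    train_groups = set(groups[train_index].tolist())
    test_groups = set(groups[test_index].tolist())
    print("train groups:", sorted(train_groups), "test groups:", sorted(test_groups))
    # no group appears on both sides of the split
    assert train_groups.isdisjoint(test_groups)
    n_splits_seen += 1
```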


### StratifiedKFold

Ensures that every fold keeps roughly the same class proportions as the full dataset.

In [15]:
import numpy as np
from sklearn.model_selection import StratifiedKFold
X=np.array([[1,2],[3,4],[5,6],[7,8],[9,10],[11,12]])
y=np.array([1,1,1,2,2,2])
skf=StratifiedKFold(n_splits=3)
skf.get_n_splits(X,y)
print(skf)
for train_index,test_index in skf.split(X,y):
    print("Train Index:",train_index,",Test Index:",test_index)
    X_train,X_test=X[train_index],X[test_index]
    y_train,y_test=y[train_index],y[test_index]
    #print(X_train,X_test,y_train,y_test)

StratifiedKFold(n_splits=3, random_state=None, shuffle=False)
Train Index: [1 2 4 5] ,Test Index: [0 3]
Train Index: [0 2 3 5] ,Test Index: [1 4]
Train Index: [0 1 3 4] ,Test Index: [2 5]


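## Splitting methods: leave-one-out (LeaveOneGroupOut, LeavePGroupsOut, LeaveOneOut, LeavePOut)

Leave-one-out (LOO) validation: given $N$ samples, each sample in turn serves as the test sample while the other $N-1$ samples form the training set. This yields $N$ classifiers and $N$ test results, and the average of those $N$ results measures the model's performance.

Compared with K-fold CV, LOO builds $N$ models on $N$ samples instead of $k$, and each of those $N$ models is trained on $N-1$ samples rather than $(k-1)N/k$. Assuming $k$ is not very large and $k \ll N$, LOO is more time-consuming than K-fold CV.

Leave-$P$-out validation: given $N$ samples, every possible set of $P$ samples is used as the test set with the remaining $N-P$ samples as the training set, which yields $\binom{N}{P}$ train-test pairs. Unlike LeaveOneOut and KFold, the test sets overlap when $P>1$; when $P=1$ it reduces to leave-one-out.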
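
The number of train-test pairs grows quickly with $N$ and $P$. As a quick check, the sketch below compares the counts reported by the splitters with the binomial coefficient for $N = 6$ and $P = 3$; `math.comb` requires Python 3.8 or later.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = np.arange(12).reshape(6, 2)  # N = 6 samples

n_loo = LeaveOneOut().get_n_splits(X)    # one split per sample
n_lpo = LeavePOut(p=3).get_n_splits(X)   # one split per 3-element subset
print(n_loo, n_lpo)
print(comb(6, 3))
```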

### LeaveOneOut: a single sample is held out as the test set

In [16]:
import numpy as np
from sklearn.model_selection import LeaveOneOut
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
loo = LeaveOneOut()
loo.get_n_splits(X)
print(loo)
for train_index, test_index in loo.split(X, y):
    print("Train Index:", train_index, ",Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # print(X_train,X_test,y_train,y_test)

LeaveOneOut()
Train Index: [1 2 3 4 5] ,Test Index: [0]
Train Index: [0 2 3 4 5] ,Test Index: [1]
Train Index: [0 1 3 4 5] ,Test Index: [2]
Train Index: [0 1 2 4 5] ,Test Index: [3]
Train Index: [0 1 2 3 5] ,Test Index: [4]
Train Index: [0 1 2 3 4] ,Test Index: [5]


### LeavePOut: P samples are held out as the test set

In [17]:
import numpy as np
from sklearn.model_selection import LeavePOut
X=np.array([[1,2],[3,4],[5,6],[7,8],[9,10],[11,12]])
y=np.array([1,2,3,4,5,6])
lpo=LeavePOut(p=3)
lpo.get_n_splits(X)
print(lpo)
for train_index,test_index in lpo.split(X,y):
    print("Train Index:",train_index,",Test Index:",test_index)
    X_train,X_test=X[train_index],X[test_index]
    y_train,y_test=y[train_index],y[test_index]
    #print(X_train,X_test,y_train,y_test)

LeavePOut(p=3)
Train Index: [3 4 5] ,Test Index: [0 1 2]
Train Index: [2 4 5] ,Test Index: [0 1 3]
Train Index: [2 3 5] ,Test Index: [0 1 4]
Train Index: [2 3 4] ,Test Index: [0 1 5]
Train Index: [1 4 5] ,Test Index: [0 2 3]
Train Index: [1 3 5] ,Test Index: [0 2 4]
Train Index: [1 3 4] ,Test Index: [0 2 5]
Train Index: [1 2 5] ,Test Index: [0 3 4]
Train Index: [1 2 4] ,Test Index: [0 3 5]
Train Index: [1 2 3] ,Test Index: [0 4 5]
Train Index: [0 4 5] ,Test Index: [1 2 3]
Train Index: [0 3 5] ,Test Index: [1 2 4]
Train Index: [0 3 4] ,Test Index: [1 2 5]
Train Index: [0 2 5] ,Test Index: [1 3 4]
Train Index: [0 2 4] ,Test Index: [1 3 5]
Train Index: [0 2 3] ,Test Index: [1 4 5]
Train Index: [0 1 5] ,Test Index: [2 3 4]
Train Index: [0 1 4] ,Test Index: [2 3 5]
Train Index: [0 1 3] ,Test Index: [2 4 5]
Train Index: [0 1 2] ,Test Index: [3 4 5]


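## Splitting methods: random splits (ShuffleSplit, GroupShuffleSplit, StratifiedShuffleSplit)

- The ShuffleSplit iterator produces a **specified number of independent** train/test splits. It first shuffles all the samples and then carves out a train/test pair; the `random_state` seed controls the random number generator so that results are reproducible
- ShuffleSplit is a good alternative to KFold cross-validation because it gives finer **control over the number of iterations and the train/test proportions**
- StratifiedShuffleSplit is a variant of ShuffleSplit that returns stratified splits, i.e. it **keeps the class proportions in every split consistent with those of the full dataset**

### ShuffleSplit

Shuffles the dataset and then splits it into a training set and a test set. The proportions are set through `train_size` and `test_size`, and they may sum to less than $1$.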

In [18]:
import numpy as np
from sklearn.model_selection import ShuffleSplit
X=np.array([[1,2],[3,4],[5,6],[7,8],[9,10],[11,12]])
y=np.array([1,2,3,4,5,6])
rs=ShuffleSplit(n_splits=3,test_size=.25,random_state=0)
rs.get_n_splits(X)
print(rs)
for train_index,test_index in rs.split(X,y):
    print("Train Index:",train_index,",Test Index:",test_index)
    X_train,X_test=X[train_index],X[test_index]
    y_train,y_test=y[train_index],y[test_index]
    #print(X_train,X_test,y_train,y_test)
print("==============================")
rs=ShuffleSplit(n_splits=3,train_size=.5,test_size=.25,random_state=0)
rs.get_n_splits(X)
print(rs)
for train_index,test_index in rs.split(X,y):
    print("Train Index:",train_index,",Test Index:",test_index)

ShuffleSplit(n_splits=3, random_state=0, test_size=0.25, train_size=None)
Train Index: [1 3 0 4] ,Test Index: [5 2]
Train Index: [4 0 2 5] ,Test Index: [1 3]
Train Index: [1 2 4 0] ,Test Index: [3 5]
ShuffleSplit(n_splits=3, random_state=0, test_size=0.25, train_size=0.5)
Train Index: [1 3 0] ,Test Index: [5 2]
Train Index: [4 0 2] ,Test Index: [1 3]
Train Index: [1 2 4] ,Test Index: [3 5]


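### StratifiedShuffleSplit

Shuffles the dataset and then splits it into a training set and a test set. The proportions are set through `train_size` and `test_size` and may sum to less than $1$, but each split also preserves the class proportions of the full dataset.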

In [19]:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 1, 2, 1, 2])
sss = StratifiedShuffleSplit(n_splits=3, test_size=.5, random_state=0)
sss.get_n_splits(X, y)
print(sss)
for train_index, test_index in sss.split(X, y):
    print("Train Index:", train_index, ",Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # print(X_train,X_test,y_train,y_test)

StratifiedShuffleSplit(n_splits=3, random_state=0, test_size=0.5,
            train_size=None)
Train Index: [5 4 1] ,Test Index: [3 2 0]
Train Index: [5 2 3] ,Test Index: [0 4 1]
Train Index: [5 0 4] ,Test Index: [3 1 2]
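
Stratification is easier to see on imbalanced labels. The sketch below uses made-up data with classes 0 and 1 in a 2:1 ratio and counts the classes on each side of every split; with 12 samples and `test_size=0.25`, the 2:1 ratio can be kept exactly.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)   # imbalanced: classes 0 and 1 in a 2:1 ratio

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
class_counts = []
for train_index, test_index in sss.split(X, y):
    # count the samples of each class on both sides of the split
    train_counts = np.bincount(y[train_index])
    test_counts = np.bincount(y[test_index])
    class_counts.append((train_counts.tolist(), test_counts.tolist()))
    print("train:", train_counts, "test:", test_counts)
```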


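# Data preprocessing

Why preprocess data at all?

Real-world data usually contains a lot of useless information and sometimes outright errors. As the machine learning saying goes, "Garbage in, garbage out": the quality of the data has a huge effect on the result of an algorithm. Preprocessing turns redundant, messy raw data into something that meets the needs of the application.

Preprocessing methods alone could fill an article of several thousand words, so only a few basic ones are covered here.

sklearn provides a package for data preprocessing, `preprocessing`, which we can import directly.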

In [8]:
import sklearn.preprocessing as sk_preprocessing

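The examples below use `X`, the 6×5 matrix generated with `make_classification` in the data generation section (rerun that cell first if `X` has been reassigned since).

## Scaling the data

Standardization based on `mean` and `std`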

In [9]:
scaler = sk_preprocessing.StandardScaler().fit(X)
new_X = scaler.transform(X)
print('Standardized with mean and std:', new_X)

Standardized with mean and std: [[ 5.83777663e-01  5.78490274e-01 -4.46390200e-01 -5.78622166e-01
  -4.59626499e-01]
 [ 1.78208748e+00  1.82365655e+00 -1.49369334e+00 -1.75732374e+00
  -1.53002699e+00]
 [ 1.42364786e-01 -4.34864195e-01  1.19857094e+00 -2.31182824e-01
   1.15469921e+00]
 [-1.08616319e+00 -1.31141585e+00  1.36421794e+00  1.03980354e+00
   1.37225487e+00]
 [-3.42107391e-01  4.16121589e-04 -5.08928171e-01  3.92171245e-01
  -4.77218310e-01]
 [-1.07995935e+00 -6.56282899e-01 -1.13777169e-01  1.13515394e+00
  -6.00822798e-02]]


Scaling to a given range; `feature_range` sets the target range

In [10]:
scaler = sk_preprocessing.MinMaxScaler(feature_range=(0,1)).fit(X)
new_X=scaler.transform(X)
print('Scaled to the given range:', new_X)

Scaled to the given range: [[0.5822158  0.60282695 0.36645754 0.40750585 0.36881342]
 [1.         1.         0.         0.         0.        ]
 [0.4283196  0.27959535 0.94203914 0.52762409 0.92503979]
 [0.         0.         1.         0.96703505 1.        ]
 [0.25941101 0.41843754 0.34457514 0.74313278 0.36275205]
 [0.00216293 0.208969   0.4828408  1.         0.50647897]]
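
Both scalers apply simple column-wise formulas, which can be reproduced by hand. The sketch below uses a small made-up matrix `A` (an assumption for illustration, not the `X` used above) and compares the manual results with sklearn's. Note that `StandardScaler` divides by the population standard deviation, which is NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# small illustrative matrix (rows are samples, columns are features)
A = np.array([[1., -1., 2.],
              [0., 2., -1.],
              [0., 1., -2.]])

# standardization: subtract the column mean, divide by the column std (ddof=0)
manual_std = (A - A.mean(axis=0)) / A.std(axis=0)
sk_std = StandardScaler().fit_transform(A)

# min-max scaling to [0, 1]: (x - min) / (max - min), column by column
manual_mm = (A - A.min(axis=0)) / (A.max(axis=0) - A.min(axis=0))
sk_mm = MinMaxScaler(feature_range=(0, 1)).fit_transform(A)

print(np.allclose(manual_std, sk_std), np.allclose(manual_mm, sk_mm))
```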


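## Normalizing samples

First compute the $p$-norm of each sample, then divide every element of that sample by the norm, so that each sample ends up with a norm of $1$.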

In [11]:
new_X = sk_preprocessing.normalize(X,norm='l2')
print('L2-normalized:', new_X)

L2-normalized: [[ 0.28390667  0.40858817 -0.5817849  -0.55374853 -0.3276303 ]
 [ 0.30681812  0.4104597  -0.52876131 -0.60647922 -0.29996653]
 [ 0.18754123 -0.0872681   0.76378216 -0.45815483  0.40489941]
 [-0.30549798 -0.41042697  0.53205791  0.60342151  0.30169115]
 [-0.08310828  0.19296775 -0.83443596  0.24293007 -0.4478371 ]
 [-0.36426189 -0.16911862 -0.39503282  0.80230951 -0.19738463]]
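
The row-wise normalization can be reproduced with plain NumPy. This sketch uses a small made-up matrix `A` whose row norms are easy to verify by hand (5, 1 and 10) and compares the manual result with `normalize`.

```python
import numpy as np
from sklearn.preprocessing import normalize

A = np.array([[3., 4.],
              [1., 0.],
              [-6., 8.]])

# l2 norm of every row, kept as a column so the division broadcasts row-wise
row_norms = np.linalg.norm(A, ord=2, axis=1, keepdims=True)
manual = A / row_norms

print(manual)
print(np.allclose(manual, normalize(A, norm='l2')))
```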


# Classifiers

The classifiers can be roughly divided into two groups:

- single classifiers
- ensemble classifiers

## Single classifiers

In [20]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs

# meta-estimator
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier 

from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis


classifiers = {
    'KN': KNeighborsClassifier(3),
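    # 'SVC_linear': SVC(kernel="linear", C=0.025),  # linear-kernel variant; dict keys must be unique, so it needs its own key to be evaluated
    'SVC': SVC(gamma=2, C=1),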
    'DT': DecisionTreeClassifier(max_depth=5),
    'RF': RandomForestClassifier(n_estimators=10, max_depth=5, max_features=1),  # clf.feature_importances_
    'ET': ExtraTreesClassifier(n_estimators=10, max_depth=None),  # clf.feature_importances_
    'AB': AdaBoostClassifier(n_estimators=100),
    'GB': GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0), # clf.feature_importances_
    'GNB': GaussianNB(),
    'LD': LinearDiscriminantAnalysis(),
    'QD': QuadraticDiscriminantAnalysis()}

    
    
X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)


for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y)
    print(name,'\t--> ',scores.mean())



KN 	-->  1.0
SVC 	-->  0.21240344622697563
DT 	-->  0.2783125371360666
RF 	-->  0.966131907308378
ET 	-->  1.0
AB 	-->  0.03608437314319667
GB 	-->  0.05460487225193108
GNB 	-->  1.0
LD 	-->  1.0
QD 	-->  1.0


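## Ensemble classifiers

Four kinds are covered here: Bagging, Voting, GridSearch, and Pipeline. Strictly speaking, GridSearch is a hyperparameter search wrapper rather than an ensemble, and Pipeline is a way of chaining processing steps.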

### Bagging

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

meta_clf = KNeighborsClassifier() 
bg_clf = BaggingClassifier(meta_clf, max_samples=0.5, max_features=0.5)
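
The cell above only constructs the bagging classifier. As a possible way to try it out, the sketch below evaluates a similar configuration with 5-fold cross-validation on the Iris data; the dataset, `n_estimators=10`, `cv=5` and `random_state=0` are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# each of the 10 KNN models sees a random half of the samples and of the features
bg_clf = BaggingClassifier(KNeighborsClassifier(), n_estimators=10,
                           max_samples=0.5, max_features=0.5, random_state=0)

scores = cross_val_score(bg_clf, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
```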

### Voting

In [None]:
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

# Hard voting: each classifier casts a weighted vote for a class label
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard', weights=[2,1,2])

for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

### GridSearch

In [None]:
import numpy as np
from time import time
from scipy.stats import randint as sp_randint

from sklearn.datasets import load_digits

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Load the data
digits = load_digits()
X, y = digits.data, digits.target

# Base classifier whose hyperparameters are tuned
meta_clf = RandomForestClassifier(n_estimators=20)

# =================================================================
# Parameter distributions to sample from
# (min_samples_split must be at least 2, so its range starts at 2)
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Run the randomized search
n_iter_search = 20
rs_clf = RandomizedSearchCV(meta_clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
rs_clf.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
# Mean cross-validated score of every sampled candidate
print(rs_clf.cv_results_['mean_test_score'])

# =================================================================
# Parameter grid to search exhaustively
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Run the grid search
gs_clf = GridSearchCV(meta_clf, param_grid=param_grid)
start = time()
gs_clf.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(gs_clf.cv_results_['params'])))
# Mean cross-validated score of every grid point
print(gs_clf.cv_results_['mean_test_score'])

### Pipeline

In [None]:
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.pipeline import Pipeline

# Generate the data
X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)

# Define the pipeline: ANOVA feature selection first, then an SVM
anova_filter = SelectKBest(f_classif, k=5)
clf = svm.SVC(kernel='linear')
pipe = Pipeline([('anova', anova_filter), ('svc', clf)])

# Set k=10 for the anova step and C=0.1 for the svc step
# (step name and parameter name are joined with a double underscore "__")
pipe.set_params(anova__k=10, svc__C=.1)
pipe.fit(X, y)

prediction = pipe.predict(X)

pipe.score(X, y)

# Boolean mask of the features selected by anova_filter
s = pipe.named_steps['anova'].get_support()
print(s)
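
The double-underscore naming also lets a search tune the steps of a pipeline. The following is a minimal sketch, not part of the original notebook: it assumes the ANOVA scorer `f_classif`, and the grid values and `cv=3` are arbitrary illustrations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)

pipe = Pipeline([('anova', SelectKBest(f_classif)), ('svc', SVC(kernel='linear'))])

# keys follow the "<step name>__<parameter name>" convention
param_grid = {'anova__k': [5, 10], 'svc__C': [0.1, 1.0]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

Because the feature selection step is refit inside every cross-validation fold, the selection never sees the held-out fold, which is one reason to put preprocessing inside the pipeline.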