#### BaggingClassifier
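The base learner is k-nearest neighbors (KNN).  
max_samples=0.5: each base learner is trained on a random draw equal to 50% of the training-set size (sample perturbation).  
max_features=0.5: each base learner sees a random 50% of the features (feature perturbation).  
bootstrap=True, bootstrap_features=False (the defaults): samples are drawn with replacement, features are drawn without replacement.  
This is similar to a random forest (RF), except that RF uses decision trees as its base learners.  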

In [1]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(),
                            max_samples=0.5, max_features=0.5)
bagging

BaggingClassifier(base_estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
         bootstrap=True, bootstrap_features=False, max_features=0.5,
         max_samples=0.5, n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False)
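
The sampling settings described above can be checked on a fitted model. The following is a minimal sketch, assuming the iris dataset (chosen here only for illustration; it is not part of the original cell) and a fixed random_state. It reads the fitted attributes estimators_samples_ and estimators_features_ to show that sample indices may repeat (drawn with replacement) while feature indices do not.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data: 150 samples, 4 features (an assumption, not from the notes)
X, y = load_iris(return_X_y=True)

bagging = BaggingClassifier(KNeighborsClassifier(),
                            max_samples=0.5, max_features=0.5,
                            random_state=0).fit(X, y)

# One entry per base learner: the sample indices and feature indices it was given
for sample_idx, feature_idx in zip(bagging.estimators_samples_,
                                   bagging.estimators_features_):
    n_drawn = len(sample_idx)               # number of samples drawn
    n_unique = len(np.unique(sample_idx))   # fewer than n_drawn when indices repeat
    print(n_drawn, n_unique, sorted(feature_idx))
```

With bootstrap=True the number of unique sample indices is usually smaller than the number drawn, which is the visible effect of sampling with replacement.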

#### RandomForest and Extra-Trees
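Extra-Trees draws a random threshold for each candidate feature and uses the best of these randomly drawn thresholds as the split rule, which makes it more random than RF.  
RF instead picks a random subset of features and searches for the most discriminative threshold on each.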
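
The difference between the two split rules can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's implementation: the helper names (gini, split_impurity, rf_style_split, extra_trees_style_split), the Gini criterion, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(x, y, threshold):
    """Weighted Gini impurity after splitting on x <= threshold."""
    left = x <= threshold
    n_left, n_right = left.sum(), (~left).sum()
    if n_left == 0 or n_right == 0:
        return np.inf  # degenerate split: one side is empty
    return (n_left * gini(y[left]) + n_right * gini(y[~left])) / len(y)

def rf_style_split(X, y, features):
    """RF-style: for each candidate feature, search every midpoint threshold."""
    best = (np.inf, None, None)
    for j in features:
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2.0:
            score = split_impurity(X[:, j], y, t)
            if score < best[0]:
                best = (score, j, t)
    return best

def extra_trees_style_split(X, y, features, rng):
    """Extra-Trees-style: one random threshold per candidate feature, keep the best."""
    best = (np.inf, None, None)
    for j in features:
        t = rng.uniform(X[:, j].min(), X[:, j].max())
        score = split_impurity(X[:, j], y, t)
        if score < best[0]:
            best = (score, j, t)
    return best

# Hypothetical toy data: 200 samples, 6 features, labels driven by two features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Both methods start from a random subset of candidate features
features = rng.choice(X.shape[1], size=3, replace=False)
rf_score, rf_j, rf_t = rf_style_split(X, y, features)
et_score, et_j, et_t = extra_trees_style_split(X, y, features, rng)
print("RF-style split:          feature", rf_j, "impurity", rf_score)
print("Extra-Trees-style split: feature", et_j, "impurity", et_score)
```

On the same candidate features, the exhaustive search can never do worse than the random thresholds at a single node. Extra-Trees gives up some per-split quality in exchange for cheaper splits and less correlated trees.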

In [2]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
    random_state=0)

In [3]:
clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
    random_state=0)
scores = cross_val_score(clf, X, y)
print('DecisionTreeClassifier score: {}'.format(scores.mean()))  

DecisionTreeClassifier score: 0.9794087938205586


In [4]:
clf = RandomForestClassifier(n_estimators=10, max_depth=None,
    min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print('RandomForestClassifier score: {}'.format(scores.mean()))  

RandomForestClassifier score: 0.9996078431372549


In [5]:
clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
    min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print('ExtraTreesClassifier score: {}'.format(scores.mean()))

ExtraTreesClassifier score: 0.99989898989899


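The main parameters to tune are n_estimators and max_features.  
Model complexity can be reduced by setting the following parameters:
min_samples_split, min_samples_leaf, max_leaf_nodes, and max_depth.  
Feature importance evaluation:  
Feature importances are stored in the fitted model's feature_importances_ attribute, an array of shape (n_features,) whose values are non-negative and sum to 1.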
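
A short sketch of reading the importances from a fitted forest follows. The dataset size and the particular values of n_estimators, max_features, and min_samples_leaf are illustrative assumptions, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Smaller synthetic dataset than the cells above, to keep the example fast
X, y = make_blobs(n_samples=1000, n_features=10, centers=20, random_state=0)

# min_samples_leaf > 1 limits tree complexity; the values are illustrative only
clf = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                             min_samples_leaf=2, random_state=0).fit(X, y)

importances = clf.feature_importances_
print(importances.shape)   # (10,)
print(importances.sum())   # approximately 1

# Indices of the three most important features, highest first
for idx in np.argsort(importances)[::-1][:3]:
    print(idx, round(float(importances[idx]), 3))
```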


#### AdaBoost
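AdaBoost adjusts the sample distribution based on the previously trained base learner so that misclassified samples receive more attention in later rounds, then trains the next learner on the new distribution, repeating until T learners have been trained.  
In essence, it optimizes the exponential loss function over an additive model with a procedure similar to Newton's method.  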
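
The reweighting loop can be sketched by hand for the binary case. This is a minimal illustration of discrete AdaBoost with decision stumps, not how AdaBoostClassifier is implemented internally. The synthetic dataset, T = 20, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary problem with labels mapped to {-1, +1}
X, y01 = make_classification(n_samples=300, n_features=5, random_state=0)
y = 2 * y01 - 1

T = 20                                    # number of boosting rounds
weights = np.full(len(y), 1.0 / len(y))   # start from the uniform distribution
learners, alphas, balanced_errors = [], [], []

for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    err = np.sum(weights[pred != y])      # weighted training error
    if err == 0 or err >= 0.5:            # learner is perfect or no better than chance
        break

    alpha = 0.5 * np.log((1 - err) / err)           # weight of this learner
    weights = weights * np.exp(-alpha * y * pred)   # raise weights of mistakes
    weights = weights / weights.sum()               # renormalize to a distribution

    learners.append(stump)
    alphas.append(alpha)
    # Under the new distribution the learner just trained has weighted error 0.5
    balanced_errors.append(np.sum(weights[pred != y]))

# Final prediction: sign of the alpha-weighted sum of the base learners
agg = sum(a * h.predict(X) for a, h in zip(alphas, learners))
train_accuracy = np.mean(np.sign(agg) == y)
print(len(learners), train_accuracy)
```

After each update, the learner just trained is no better than chance on the new distribution, which forces the next learner to focus on the samples that were previously misclassified.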

In [6]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

iris = load_iris()
clf = AdaBoostClassifier(n_estimators=100)
scores = cross_val_score(clf, iris.data, iris.target)
scores.mean()   

0.95996732026143794

#### GradientBoostingClassifier 
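Gradient boosting supports both classification and regression.  
Classification: binary and multiclass.  
For multiclass problems, the total number of trees that must be induced is n_classes \* n_estimators.  
It is therefore a poor fit for problems with many classes; in that case **RF is the better choice**.  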
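
The tree count can be verified from the fitted estimators_ attribute, an array with one row per boosting iteration and one column per tree built in that iteration. The sketch below assumes iris as the multiclass example and a small make_hastie_10_2 sample as the binary one; n_estimators=20 is an arbitrary choice.

```python
from sklearn.datasets import load_iris, make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

# Multiclass: 3 classes, so 3 regression trees are built per iteration
X_multi, y_multi = load_iris(return_X_y=True)
multi = GradientBoostingClassifier(n_estimators=20, random_state=0)
multi.fit(X_multi, y_multi)
print(multi.estimators_.shape)    # (20, 3)
print(multi.estimators_.size)     # 60 trees = n_classes * n_estimators

# Binary: a single tree per iteration is enough
X_bin, y_bin = make_hastie_10_2(n_samples=500, random_state=0)
binary = GradientBoostingClassifier(n_estimators=20, random_state=0)
binary.fit(X_bin, y_bin)
print(binary.estimators_.shape)   # (20, 1)
```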

In [7]:
# A simple example
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

# Tree complexity can be controlled via the maximum depth, the per-leaf sample limit, ...
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test) 

0.91300000000000003

#### Voting Classifier
Combines several learners, each of which already performs well.  
Two modes: hard voting and soft voting.
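
The two modes can disagree on the same sample. The sketch below uses made-up class probabilities from three hypothetical classifiers (the numbers are invented for illustration): hard voting counts each classifier's predicted label, while soft voting averages the predicted probabilities, optionally with weights.

```python
import numpy as np

# Made-up predicted probabilities for ONE sample, two classes.
# Rows: three hypothetical classifiers; columns: P(class 0), P(class 1)
probas = np.array([
    [0.90, 0.10],   # classifier A is very confident in class 0
    [0.45, 0.55],   # classifier B leans slightly toward class 1
    [0.45, 0.55],   # classifier C leans slightly toward class 1
])

def hard_vote(probas):
    """Majority vote over each classifier's most likely class."""
    votes = probas.argmax(axis=1)
    return int(np.bincount(votes).argmax())

def soft_vote(probas, weights=None):
    """Argmax of the (optionally weighted) average probability per class."""
    return int(np.average(probas, axis=0, weights=weights).argmax())

print(hard_vote(probas))   # 1: two of the three classifiers vote for class 1
print(soft_vote(probas))   # 0: the averaged probabilities are [0.6, 0.4]
```

Soft voting lets a confident classifier outweigh two hesitant ones, which is why it requires base learners that can output class probabilities.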


In [8]:
# hard voting
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')
for clf, label in zip(
    [clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Accuracy: 0.90 (+/- 0.05) [Logistic Regression]
Accuracy: 0.93 (+/- 0.05) [Random Forest]
Accuracy: 0.91 (+/- 0.04) [naive Bayes]
Accuracy: 0.95 (+/- 0.05) [Ensemble]


In [9]:
# soft voting
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from itertools import product
from sklearn.ensemble import VotingClassifier

# Loading some example data
iris = datasets.load_iris()
X = iris.data[:, [0,2]]
y = iris.target

# Training classifiers
clf1 = DecisionTreeClassifier(max_depth=4)
clf2 = KNeighborsClassifier(n_neighbors=7)
clf3 = SVC(kernel='rbf', probability=True)
# soft voting: assign a weight to each learner
eclf = VotingClassifier(estimators=[('dt', clf1), ('knn', clf2), ('svc', clf3)], voting='soft', weights=[2,1,2])

for clf, label in zip(
    [clf1, clf2, clf3, eclf], ['dt', 'knn', 'svc', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Accuracy: 0.95 (+/- 0.03) [dt]
Accuracy: 0.94 (+/- 0.04) [knn]
Accuracy: 0.95 (+/- 0.03) [svc]
Accuracy: 0.95 (+/- 0.03) [Ensemble]


#### VotingClassifier + GridSearch

In [11]:
from sklearn.model_selection import GridSearchCV
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='soft')

params = {'lr__C': [1.0, 100.0], 'rf__n_estimators': [20, 200],}

grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid = grid.fit(iris.data, iris.target)

In [15]:
print(grid.best_score_)
print(grid.best_estimator_)
print(grid.best_params_)

0.96
VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=1, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)), ('rf', RandomFore...   oob_score=False, random_state=1, verbose=0, warm_start=False)), ('gnb', GaussianNB(priors=None))],
         flatten_transform=None, n_jobs=1, voting='soft', weights=None)
{'lr__C': 1.0, 'rf__n_estimators': 200}
