**<font size=4>Decision Trees</font>**  

Ref: Gavin Hackeling, Mastering Machine Learning with scikit-learn, 2014  


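Entropy: the expected value of the information in a set; it quantifies uncertainty. The higher the entropy, the more mixed the set is, and the less belonging to that set tells you about a sample's class.   
Information gain: the change in entropy caused by a split, i.e. the entropy before the split minus the entropy after it.  
* Information gain > 0: entropy has decreased, so the resulting subsets are purer and more ordered, and therefore more informative.  
* Information gain < 0: the data has become more mixed. (When the entropy after a split is taken as the size-weighted average of the child entropies, as in decision-tree learning, this case does not occur.)  
* Information gain = 0: the entropy is unchanged, which does not mean the data itself is unchanged.  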

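Gini impurity: the probability that a randomly chosen sample from a subset is misclassified when it is labeled at random according to the subset's class distribution.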
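
The three measures above can be computed directly from class labels. The following is a minimal sketch, not taken from the book: the function names `entropy`, `gini` and `information_gain` are illustrative, entropy is assumed to use base-2 logarithms (bits), and the entropy after a split is assumed to be the size-weighted average of the child entropies.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Toy data (assumed for illustration): three positives and three negatives
parent = [1, 1, 1, 0, 0, 0]

# A split that separates the classes completely removes all uncertainty
perfect_gain = information_gain(parent, [[1, 1, 1], [0, 0, 0]])

# A split whose children are as mixed as the parent gains nothing,
# even though the data has been partitioned
useless_gain = information_gain(parent, [[1, 0], [1, 0], [1, 0]])

print(entropy(parent), gini(parent), perfect_gain, useless_gain)
```

The second split illustrates the "gain = 0" case: the samples have been rearranged into three subsets, yet the entropy is the same as before.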

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

from sklearn.tree import DecisionTreeClassifier
# sklearn.model_selection provides both the train/test split and the grid search
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

FILENAME = r'D:\Code\GitHub\notebook\machine_learning\datasets\ad\ad.data'

# The file has no header row; the last column is the label
df = pd.read_csv(FILENAME, header=None)
explanatory_variable_columns = df.iloc[:,0:-1]
response_variable_column = df[len(df.columns.values)-1]

# Encode the label: 1 for 'ad.', 0 for anything else
y = [1 if e=='ad.' else 0 for e in response_variable_column]
X = df[list(explanatory_variable_columns)]

# Replace missing values (a '?', possibly preceded by spaces) with -1
X = X.replace(to_replace=r' *\?', value=-1, regex=True)

X_train, X_test, y_train, y_test = train_test_split(X, y)



In [2]:
pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy'))
])

parameters = {
    'clf__max_depth': (150, 155, 160),
    # Note: clf__min_samples_split cannot be 1; see https://stackoverflow.com/questions/43319023/min-samples-split-must-be-at-least-2-or-in-0-1-got-1
    'clf__min_samples_split': (2, 3),
    'clf__min_samples_leaf': (1, 2, 3),
}

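# cv=3 is set explicitly because the default number of folds differs across scikit-learn versions
grid_search = GridSearchCV(pipeline, parameters, n_jobs=2, verbose=1, scoring='f1', cv=3)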
grid_search.fit(X_train, y_train)
print('Best score: %.3f' %grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' %(param_name, best_parameters[param_name]))

predictions = grid_search.predict(X_test)
print(classification_report(y_test, predictions))

Fitting 3 folds for each of 18 candidates, totalling 54 fits


[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   41.3s
[Parallel(n_jobs=2)]: Done  54 out of  54 | elapsed:   46.5s finished
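
The totals in the log above (18 candidates, 54 fits) follow from the Cartesian product of the grid values multiplied by the number of folds. A minimal sketch of that arithmetic, where `count_fits` is a hypothetical helper and the fold count of 3 is read off the log rather than from the code:

```python
from itertools import product

def count_fits(param_grid, n_folds):
    """Return (number of candidates, total number of fits) for an exhaustive grid."""
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * n_folds

# Same grid as the decision tree cell
tree_grid = {
    'clf__max_depth': (150, 155, 160),
    'clf__min_samples_split': (2, 3),
    'clf__min_samples_leaf': (1, 2, 3),
}

n_candidates, n_fits = count_fits(tree_grid, n_folds=3)
print(n_candidates, n_fits)  # 18 54
```

Each extra value added to any one parameter multiplies the whole product, which is why grid search time grows quickly.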


Best score: 0.867
Best parameters set:
	clf__max_depth: 150
	clf__min_samples_leaf: 3
	clf__min_samples_split: 3
             precision    recall  f1-score   support

          0       0.98      0.99      0.98       705
          1       0.92      0.88      0.90       115

avg / total       0.97      0.97      0.97       820
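
The grid search optimizes `scoring='f1'`, which is the harmonic mean of precision and recall. As a quick check, the F1 value for class 1 can be approximately recovered from the two rounded figures in the report above. This is a sketch only: `f1_from` is a hypothetical helper, and because its inputs are already rounded to two decimals the result is approximate.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall for class 1 (ads), as printed in the report
tree_f1 = f1_from(0.92, 0.88)
print(f"{tree_f1:.2f}")  # 0.90
```

The harmonic mean is pulled toward the smaller of the two values, so a model cannot score a high F1 by trading away recall for precision or the reverse.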



In [3]:
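# Random forest
# Reference: http://www.jianshu.com/p/d90189008864
'''
Ensemble methods are a special class of machine learning algorithms that combine several base learners.
Each base learner makes its own prediction, and the final result is decided by a vote among all base learners (classification) or by averaging, possibly weighted, their outputs (regression).

- Draw samples at random with replacement; the sample can be as large as the original data set or slightly smaller
- Randomly select N features and split on the best of them
- Among the N best split features, randomly pick one to split on
'''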

from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('clf', RandomForestClassifier(criterion='entropy'))
])

parameters = {
    'clf__n_estimators': (5, 10, 20, 50),
    'clf__max_depth': (50, 150, 250),
    'clf__min_samples_split': (2, 3),
    'clf__min_samples_leaf': (1, 2, 3),
}

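# cv=3 is set explicitly because the default number of folds differs across scikit-learn versions
grid_search = GridSearchCV(pipeline, parameters, n_jobs=2, verbose=1, scoring='f1', cv=3)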
grid_search.fit(X_train, y_train)
print('Best score: %.3f' %grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' %(param_name, best_parameters[param_name]))

predictions = grid_search.predict(X_test)
print(classification_report(y_test, predictions))

Fitting 3 folds for each of 72 candidates, totalling 216 fits


[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   39.1s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:  2.4min
[Parallel(n_jobs=2)]: Done 216 out of 216 | elapsed:  2.6min finished


Best score: 0.912
Best parameters set:
	clf__max_depth: 250
	clf__min_samples_leaf: 1
	clf__min_samples_split: 2
	clf__n_estimators: 50
             precision    recall  f1-score   support

          0       0.98      0.99      0.99       705
          1       0.95      0.90      0.92       115

avg / total       0.98      0.98      0.98       820
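
The sampling-and-voting procedure described in the random forest cell can also be written out by hand. The sketch below is an illustration under stated assumptions, not a reproduction of `RandomForestClassifier` internals: it uses synthetic data from `make_classification` instead of the ad data set, 25 trees, `max_features='sqrt'` as the random feature subset at each split, and a hard majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic binary classification data (assumed for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, random_state=0)
X_fit, y_fit = X_demo[:200], y_demo[:200]
X_eval, y_eval = X_demo[200:], y_demo[200:]

n_trees = 25  # odd, so a two-class vote cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap: draw row indices with replacement, same size as the training set
    idx = rng.integers(0, len(X_fit), size=len(X_fit))
    # Each split considers only a random subset of the features
    tree = DecisionTreeClassifier(criterion='entropy', max_features='sqrt',
                                  random_state=i)
    tree.fit(X_fit[idx], y_fit[idx])
    trees.append(tree)

# Every tree predicts on its own; rows are trees, columns are samples
votes = np.stack([t.predict(X_eval) for t in trees])

# Majority vote: predict 1 where more than half of the trees say 1
majority = (votes.mean(axis=0) >= 0.5).astype(int)
accuracy = (majority == y_eval).mean()
print('Ensemble accuracy: %.3f' % accuracy)
```

Because each tree sees a different bootstrap sample and a different feature subset at each split, the trees make partly independent errors, and the vote averages some of them out.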

