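- class sklearn.tree.DecisionTreeClassifier(criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, random_state=None)
    - criterion: the split-quality measure, "gini" or "entropy". Defaults to "gini", which is slightly cheaper to compute because it needs no logarithm.
    - min_samples_split: the minimum number of samples an internal node must hold before it can be split again. Defaults to 2.
    - min_samples_leaf: the minimum number of samples in a leaf. A split is rejected if it would leave a child with fewer samples than this. Defaults to 1.
    - max_depth: the maximum depth of the tree. Defaults to None (no limit).
    - random_state: the random seed.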
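
To make the two `criterion` options concrete, here is a minimal sketch that computes both impurity measures from per-class sample counts. The helper names `gini` and `entropy` and the example counts are illustrative assumptions, not part of scikit-learn.

```python
import math

def gini(counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (base 2) of a node, given the number of samples in each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

balanced = [5, 5]  # hypothetical node: five samples of each class
pure = [10, 0]     # hypothetical node: a single class only

print("balanced:", gini(balanced), entropy(balanced))
print("pure:    ", gini(pure), entropy(pure))
```

Both measures are zero for a pure node and largest for an evenly mixed one; only entropy needs a logarithm, which is the usual explanation for Gini being a little cheaper.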

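# Titanic case study

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One reason the wreck cost so many lives was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some luck, some groups were more likely to survive than others, such as women, children, and the upper class. In this case study we ask you to analyze what kinds of people were likely to survive. In particular, we ask you to apply machine learning tools to predict which passengers survived the tragedy.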

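Case: https://www.kaggle.com/c/titanic/overview

The features in the extracted dataset include ticket, survival (survived), passenger class (pclass), age, port of embarkation (embarked), home.dest, room, boat, and sex.

Data: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt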


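Inspecting the data shows:

- 1 pclass is the passenger class (1st, 2nd, 3rd) and serves as a proxy for socio-economic status.

- 2 The age column has missing values.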

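## Workflow
- 1. Get the data
- 2. Basic data processing
    - 2.1 Choose the feature values and the target value
    - 2.2 Handle missing values
    - 2.3 Split the dataset
- 3. Feature engineering (dict feature extraction)
- 4. Machine learning (decision tree)
- 5. Model evaluation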

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

- 1. Get the data

In [5]:
titan = pd.read_csv("../data/titanic.csv")
titan.head()

Unnamed: 0,row.names,pclass,survived,name,age,embarked,home.dest,room,ticket,boat,sex
0,1,1st,1,"Allen, Miss Elisabeth Walton",29.0,Southampton,"St Louis, MO",B-5,24160 L221,2,female
1,2,1st,0,"Allison, Miss Helen Loraine",2.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female
2,3,1st,0,"Allison, Mr Hudson Joshua Creighton",30.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,(135),male
3,4,1st,0,"Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)",25.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female
4,5,1st,1,"Allison, Master Hudson Trevor",0.9167,Southampton,"Montreal, PQ / Chesterville, ON",C22,,11,male


- 2. Basic data processing

    - 2.1 Choose the feature values and the target value

In [7]:
x = titan[["pclass", "age", "sex"]]
y = titan["survived"]
x.head()

Unnamed: 0,pclass,age,sex
0,1st,29.0,female
1,1st,2.0,female
2,1st,30.0,male
3,1st,25.0,female
4,1st,0.9167,male


- 2.2 Handle missing values

In [8]:
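# Fill missing ages with the mean age. Work on a copy and assign the result back,
# because inplace=True on a column of a sliced DataFrame may not update x.
x = x.copy()
x["age"] = x["age"].fillna(x["age"].mean())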
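
The mean-imputation step can be checked on a tiny frame. This is a minimal sketch; the `toy` DataFrame is a hypothetical stand-in for the Titanic features, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic feature frame
toy = pd.DataFrame({
    "pclass": ["1st", "2nd", "3rd", "3rd"],
    "age": [20.0, np.nan, 40.0, np.nan],
    "sex": ["female", "male", "male", "female"],
})

missing_per_column = toy.isna().sum()     # number of NaN values in each column
mean_age = toy["age"].mean()              # mean() skips NaN by default
toy["age"] = toy["age"].fillna(mean_age)  # assign back instead of using inplace=True

print(missing_per_column)
print(toy)
```

Counting missing values first with `isna().sum()` shows which columns need filling before any imputation is applied.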

- 2.3 Split the dataset

In [20]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)
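
When no `test_size` is given, `train_test_split` holds out 25% of the rows, and a fixed `random_state` makes the split repeatable. A minimal sketch on made-up arrays (the data here is hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 hypothetical samples with 2 features
y = np.arange(100) % 2              # hypothetical binary labels

# Same seed twice: both calls should select the same rows
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=22)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, random_state=22)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```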

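- 3. Feature engineering (dict feature extraction)

    - The features contain categorical values, which need one-hot encoding (DictVectorizer)

    - x.to_dict(orient="records") converts the DataFrame rows into a list of dicts, the input format DictVectorizer expects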

In [21]:
x_train.to_dict(orient="records")

[{'pclass': '1st', 'age': 19.0, 'sex': 'male'},
 {'pclass': '1st', 'age': 28.0, 'sex': 'male'},
 {'pclass': '1st', 'age': 17.0, 'sex': 'male'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'female'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'male'},
 {'pclass': '3rd', 'age': 21.0, 'sex': 'female'},
 {'pclass': '1st', 'age': 25.0, 'sex': 'female'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'male'},
 {'pclass': '3rd', 'age': 19.0, 'sex': 'male'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'male'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'male'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'male'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'female'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'male'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'male'},
 {'pclass': '2nd', 'age': 31.19418104265403, 'sex': 'male'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'female'},
 {'pclass': '3rd', 'age': 26.0,

In [22]:
# Convert x to a list of dicts with x.to_dict(orient="records")
# [{"pclass": "1st", "age": 29.00, "sex": "female"}, {}]

transfer = DictVectorizer(sparse=False)

# Fit the vectorizer on the training set only, then reuse the same mapping for the test set
x_train = transfer.fit_transform(x_train.to_dict(orient="records"))
x_test = transfer.transform(x_test.to_dict(orient="records"))


In [25]:
x_train[1]

array([28.,  1.,  0.,  0.,  0.,  1.])

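- 4. Decision tree training and model evaluation

If max_depth is not specified, the decision tree keeps splitting until every leaf is pure or holds fewer than min_samples_split samples. Here we set a maximum depth to limit the size of the tree.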
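
The effect of the depth limit can be seen by comparing an unrestricted tree with a capped one. This is a rough sketch on synthetic data from `make_classification`; the dataset and its parameters are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data (hypothetical, not the Titanic set)
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, flip_y=0.1, random_state=0
)

full_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
capped_tree = DecisionTreeClassifier(
    criterion="entropy", max_depth=5, random_state=0
).fit(X, y)

print("full:  ", full_tree.get_depth(), full_tree.get_n_leaves())
print("capped:", capped_tree.get_depth(), capped_tree.get_n_leaves())
```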

In [26]:
# 4. Machine learning (decision tree)
estimator = DecisionTreeClassifier(criterion="entropy", max_depth=5)
estimator.fit(x_train, y_train)


DecisionTreeClassifier(criterion='entropy', max_depth=5)

In [27]:
# 5. Model evaluation
score = estimator.score(x_test, y_test)
print(score)
estimator.predict(x_test)

0.790273556231003


array([0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,
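
For a classifier, `score` reports mean accuracy: the fraction of test samples whose predicted label matches the true label. A minimal sketch on a hand-made toy set (all arrays here are hypothetical) shows the three equivalent ways to get it:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical one-feature toy data
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.5], [2.5], [1.4], [1.6]])
y_test = np.array([0, 1, 1, 0])

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

by_score = clf.score(X_test, y_test)        # estimator's built-in accuracy
by_metric = accuracy_score(y_test, y_pred)  # same quantity from sklearn.metrics
by_hand = np.mean(y_pred == y_test)         # fraction of matching labels

print(by_score, by_metric, by_hand)
```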

## Decision tree visualization
- sklearn.tree.export_graphviz() exports the tree in DOT format
    - tree.export_graphviz(estimator, out_file='tree.dot', feature_names=['', ''])

In [29]:
export_graphviz(estimator, out_file="../data/tree.dot", feature_names=['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male'])



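Note: the sample DOT output below illustrates the file format only. Its feature names (petal and sepal measurements) and three-class value arrays come from a tree trained on the iris dataset, not from the Titanic model above.

digraph Tree {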
node [shape=box] ;
0 [label="petal length (cm) <= 2.45\nentropy = 1.584\nsamples = 112\nvalue = [39, 37, 36]"] ;
1 [label="entropy = 0.0\nsamples = 39\nvalue = [39, 0, 0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="petal width (cm) <= 1.75\nentropy = 1.0\nsamples = 73\nvalue = [0, 37, 36]"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
3 [label="petal length (cm) <= 5.05\nentropy = 0.391\nsamples = 39\nvalue = [0, 36, 3]"] ;
2 -> 3 ;
4 [label="sepal length (cm) <= 4.95\nentropy = 0.183\nsamples = 36\nvalue = [0, 35, 1]"] ;
3 -> 4 ;
5 [label="petal length (cm) <= 3.9\nentropy = 1.0\nsamples = 2\nvalue = [0, 1, 1]"] ;
4 -> 5 ;
6 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
5 -> 6 ;
7 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 0, 1]"] ;
5 -> 7 ;
8 [label="entropy = 0.0\nsamples = 34\nvalue = [0, 34, 0]"] ;
4 -> 8 ;
9 [label="petal width (cm) <= 1.55\nentropy = 0.918\nsamples = 3\nvalue = [0, 1, 2]"] ;
3 -> 9 ;
10 [label="entropy = 0.0\nsamples = 2\nvalue = [0, 0, 2]"] ;
9 -> 10 ;
11 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
9 -> 11 ;
12 [label="petal length (cm) <= 4.85\nentropy = 0.191\nsamples = 34\nvalue = [0, 1, 33]"] ;
2 -> 12 ;
13 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
12 -> 13 ;
14 [label="entropy = 0.0\nsamples = 33\nvalue = [0, 0, 33]"] ;
12 -> 14 ;
}

## Viewing the tree structure online

http://webgraphviz.com/
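
As a local alternative to pasting DOT text into a website, the tree can also be inspected in the notebook itself. The sketch below assumes a scikit-learn version that provides `export_text` (0.21 or later) and uses the bundled iris data so that it runs on its own; it is not the Titanic model from this section.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

iris = load_iris()
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# out_file=None returns the DOT source as a string instead of writing a file
dot_source = export_graphviz(clf, out_file=None, feature_names=iris.feature_names)

# Plain-text rendering of the same tree, readable without Graphviz
text_view = export_text(clf, feature_names=list(iris.feature_names))

print(dot_source[:60])
print(text_view)
```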

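## Summary
- Workflow
- 1. Get the data
- 2. Basic data processing
    - 2.1 Choose the feature values and the target value
    - 2.2 Handle missing values
    - 2.3 Split the dataset
- 3. Feature engineering (dict feature extraction)
- 4. Machine learning (decision tree)
- 5. Model evaluation


- Decision tree visualization [for reference]
    - sklearn.tree.export_graphviz()


- Decision tree pros and cons [should know]
- Pros:
    - Simple to understand and interpret; the tree can be visualized.
- Cons:
    - A decision tree learner can build an overly complex tree that does not generalize well, so it overfits easily.
- Improvements:
    - Pruning (CART algorithm)
    - Random forest (a form of ensemble learning)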
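
The two improvements can be tried side by side. This is a rough sketch under stated assumptions: it uses synthetic data rather than the Titanic set, it takes scikit-learn's cost-complexity pruning (`ccp_alpha`) as one possible reading of "pruning", and the value 0.01 is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (hypothetical) so that an unpruned tree can overfit
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, flip_y=0.15, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Compare training accuracy with test accuracy for each model
for name, model in [("unpruned", unpruned), ("pruned", pruned), ("forest", forest)]:
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))

print("leaves:", unpruned.get_n_leaves(), pruned.get_n_leaves())
```

A large gap between training and test accuracy is the usual sign of overfitting; in practice `ccp_alpha` would be chosen by cross-validation rather than fixed by hand.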