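##### The logistic regression, linear regression, and support vector machine models covered earlier all assume, to some degree, a linear relationship between the features and the target. In many real scenarios that assumption does not hold.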

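##### A decision tree is an algorithm with a tree structure. When there is no obvious linear relationship between the features and the target, a decision tree can be used to analyze the data.
##### When a tree is built from a combination of several features, the model has to decide during training in which order the feature nodes are chosen. Two common criteria are information entropy and Gini impurity.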
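
To make the two criteria concrete, here is a minimal pure-Python sketch. The helper names (`entropy`, `gini`, `weighted_impurity`) and the toy labels are illustrative assumptions, not part of scikit-learn or of the Titanic data. It computes both impurity measures and compares how much two candidate splits reduce Gini impurity; the split with the larger reduction would be chosen first.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_impurity(groups, impurity):
    """Size-weighted average impurity of the child groups of a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * impurity(g) for g in groups)


# Hypothetical toy labels: 1 = survived, 0 = died.
parent = [1, 1, 0, 0, 0, 0]
split_a = [[1, 1, 0], [0, 0, 0]]  # separates the classes fairly well
split_b = [[1, 0, 0], [1, 0, 0]]  # children look just like the parent

gain_a = gini(parent) - weighted_impurity(split_a, gini)
gain_b = gini(parent) - weighted_impurity(split_b, gini)
print("Gini reduction, split A:", gain_a)
print("Gini reduction, split B:", gain_b)
```

Passing `entropy` instead of `gini` to `weighted_impurity` gives the information-gain variant of the same comparison.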

### The passenger survival records from the sinking of the Titanic serve as the data source for learning decision trees.

### 1. Download the data

In [1]:
import pandas as pd
df = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')

In [2]:
# Inspect basic information about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 11 columns):
row.names    1313 non-null int64
pclass       1313 non-null object
survived     1313 non-null int64
name         1313 non-null object
age          633 non-null float64
embarked     821 non-null object
home.dest    754 non-null object
room         77 non-null object
ticket       69 non-null object
boat         347 non-null object
sex          1313 non-null object
dtypes: float64(1), int64(2), object(8)
memory usage: 112.9+ KB


In [5]:
df.head()

Unnamed: 0,row.names,pclass,survived,name,age,embarked,home.dest,room,ticket,boat,sex
0,1,1st,1,"Allen, Miss Elisabeth Walton",29.0,Southampton,"St Louis, MO",B-5,24160 L221,2,female
1,2,1st,0,"Allison, Miss Helen Loraine",2.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female
2,3,1st,0,"Allison, Mr Hudson Joshua Creighton",30.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,(135),male
3,4,1st,0,"Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)",25.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female
4,5,1st,1,"Allison, Master Hudson Trevor",0.9167,Southampton,"Montreal, PQ / Chesterville, ON",C22,,11,male


### 2. Use pclass, age, and sex to predict passenger survival

In [4]:
X = df[['pclass','sex','age']]
Y = df['survived']

In [5]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
sex       1313 non-null object
age       633 non-null float64
dtypes: float64(1), object(2)
memory usage: 30.9+ KB


In [10]:
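# The summary shows that only 633 age values are non-NaN, so the gaps must be filled, using either the mean or the median. sex and pclass are categorical and must be converted to numeric features.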

In [6]:
X['age'].fillna(X['age'].mean(),inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)
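
The warning above appears because `X` was selected from `df` and then modified in place. One common way to avoid it, sketched here on a small made-up frame (`df_demo` and its values are assumptions for illustration), is to take an explicit copy of the selected columns and assign the filled column back instead of using `inplace=True`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic frame.
df_demo = pd.DataFrame({
    "pclass": ["1st", "3rd", "2nd"],
    "sex": ["female", "male", "female"],
    "age": [29.0, None, 31.0],
    "survived": [1, 0, 1],
})

# An explicit copy makes it unambiguous that X_demo is independent of df_demo.
X_demo = df_demo[["pclass", "sex", "age"]].copy()

# Assign the filled column back rather than filling in place.
X_demo["age"] = X_demo["age"].fillna(X_demo["age"].mean())

print(X_demo)
```

The original `df_demo` keeps its missing value, while `X_demo` has the mean of the known ages in its place.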


In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
sex       1313 non-null object
age       1313 non-null float64
dtypes: float64(1), object(2)
memory usage: 30.9+ KB


### 3. Split the data


In [8]:
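# age is now filled in; sex and pclass still need to be converted to numeric values (done after the split)
# train_test_split lives in sklearn.model_selection; the old sklearn.cross_validation module has been removed
from sklearn.model_selection import train_test_split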
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=124)



In [9]:
x_train.shape

(984, 3)

In [10]:
x_test.shape

(329, 3)

In [21]:
y_train.value_counts()

0    640
1    344
Name: survived, dtype: int64

In [22]:
y_test.value_counts()

0    224
1    105
Name: survived, dtype: int64

In [11]:
help(x_train.to_dict)

Help on method to_dict in module pandas.core.frame:

to_dict(orient='dict', into=<class 'dict'>) method of pandas.core.frame.DataFrame instance
    Convert DataFrame to dictionary.
    
    Parameters
    ----------
    orient : str {'dict', 'list', 'series', 'split', 'records', 'index'}
        Determines the type of the values of the dictionary.
    
        - dict (default) : dict like {column -> {index -> value}}
        - list : dict like {column -> [values]}
        - series : dict like {column -> Series(values)}
        - split : dict like
          {index -> [index], columns -> [columns], data -> [values]}
        - records : list like
          [{column -> value}, ... , {column -> value}]
        - index : dict like {index -> {column -> value}}
    
          .. versionadded:: 0.17.0
    
        Abbreviations are allowed. `s` indicates `series` and `sp`
        indicates `split`.
    
    into : class, default dict
        The collections.Mapping subclass used for all Mapping

In [12]:
# Use sklearn's feature_extraction module to transform the features
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
# orient='records' yields one {column -> value} dict per row
X_train = vec.fit_transform(x_train.to_dict(orient='records'))
vec.feature_names_

['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']

In [34]:
X_train

array([[ 21.        ,   0.        ,   0.        ,   1.        ,
          0.        ,   1.        ],
       [ 31.19418104,   0.        ,   0.        ,   1.        ,
          0.        ,   1.        ],
       [ 30.        ,   0.        ,   0.        ,   1.        ,
          1.        ,   0.        ],
       ..., 
       [ 36.        ,   1.        ,   0.        ,   0.        ,
          0.        ,   1.        ],
       [ 22.        ,   0.        ,   0.        ,   1.        ,
          0.        ,   1.        ],
       [ 23.        ,   0.        ,   1.        ,   0.        ,
          1.        ,   0.        ]])

In [35]:
# After the transformation, each category value is split out into its own new column, while numeric features are left unchanged
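
The behaviour described above can be imitated in a few lines of plain Python. The function below (`one_hot_records`, a hypothetical helper, not the scikit-learn implementation) is only a sketch of the idea: string values become `name=value` indicator columns and numeric values pass through unchanged.

```python
def one_hot_records(records):
    """One-hot encode string values in a list of dicts; keep numbers as-is."""
    # Collect column names: "key=value" for strings, plain "key" for numbers.
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    # Build one dense row per record.
    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0
            else:
                row[index[key]] = float(value)
        rows.append(row)
    return names, rows


records = [
    {'age': 21.0, 'pclass': '3rd', 'sex': 'male'},
    {'age': 30.0, 'pclass': '3rd', 'sex': 'female'},
    {'age': 36.0, 'pclass': '1st', 'sex': 'male'},
]
names, rows = one_hot_records(records)
print(names)
for row in rows:
    print(row)
```

Only categories that occur in the records get a column, which is why the vectorizer must be fitted on the training data and then reused unchanged on the test data.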

In [36]:
x_train.to_dict(orient='records')

[{'age': 21.0, 'pclass': '3rd', 'sex': 'male'},
 {'age': 31.19418104265403, 'pclass': '3rd', 'sex': 'male'},
 {'age': 30.0, 'pclass': '3rd', 'sex': 'female'},
 {'age': 31.19418104265403, 'pclass': '3rd', 'sex': 'female'},
 {'age': 31.19418104265403, 'pclass': '2nd', 'sex': 'male'},
 {'age': 32.0, 'pclass': '3rd', 'sex': 'male'},
 {'age': 31.19418104265403, 'pclass': '3rd', 'sex': 'female'},
 {'age': 42.0, 'pclass': '2nd', 'sex': 'female'},
 {'age': 22.0, 'pclass': '1st', 'sex': 'female'},
 {'age': 46.0, 'pclass': '1st', 'sex': 'male'},
 {'age': 31.19418104265403, 'pclass': '3rd', 'sex': 'male'},
 {'age': 13.0, 'pclass': '2nd', 'sex': 'female'},
 {'age': 31.19418104265403, 'pclass': '3rd', 'sex': 'female'},
 {'age': 31.19418104265403, 'pclass': '1st', 'sex': 'male'},
 {'age': 31.19418104265403, 'pclass': '3rd', 'sex': 'male'},
 {'age': 22.0, 'pclass': '3rd', 'sex': 'female'},
 {'age': 45.0, 'pclass': '3rd', 'sex': 'female'},
 {'age': 31.19418104265403, 'pclass': '3rd', 'sex': 'male'},
 

In [13]:
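# The test data must go through the same transformation, using the vectorizer already fitted on the training data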
X_test = vec.transform(x_test.to_dict(orient='records'))

### 4. Use a decision tree classifier

In [18]:
from sklearn.tree import DecisionTreeClassifier
# Initialize the decision tree model
dtc = DecisionTreeClassifier(max_depth=4)
# Fit the model on the training data
dtc.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [19]:
# Predict on the test set
dtc_predict = dtc.predict(X_test)

### 5. Performance evaluation

In [20]:
print("Decision tree accuracy:\n",dtc.score(X_test,y_test))

Decision tree accuracy:
 0.848024316109


In [21]:
from sklearn.metrics import classification_report
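# Note: classification_report expects (y_true, y_pred). The arguments here are passed in
# the reverse order, so the report below groups rows by predicted label: precision and
# recall are swapped, and support counts predictions (262/67) rather than true labels (224/105).
print("Decision tree performance metrics:\n",classification_report(dtc_predict,y_test,target_names=['died','survived']))

Decision tree performance metrics: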
              precision    recall  f1-score   support

       died       0.97      0.83      0.90       262
   survived       0.58      0.91      0.71        67

avg / total       0.89      0.85      0.86       329
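
The support column in the report above (262 and 67) does not match `y_test.value_counts()` (224 and 105), which suggests the rows are grouped by predicted rather than true labels. scikit-learn's `classification_report` takes `y_true` first and `y_pred` second. The small pure-Python sketch below, with a hypothetical `precision_recall` helper and made-up labels, shows what happens when the two are swapped: precision and recall trade places.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall of the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted_pos = sum(1 for p in y_pred if p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return tp / predicted_pos, tp / actual_pos


# Hypothetical labels, not taken from the Titanic data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

correct = precision_recall(y_true, y_pred)  # intended argument order
swapped = precision_recall(y_pred, y_true)  # arguments reversed

print("precision, recall (correct order):", correct)
print("precision, recall (swapped order):", swapped)
```

Accuracy and F1 are symmetric in the two arguments, so only the precision, recall, and support columns are affected by the order.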



### 6. Characteristics of the decision tree algorithm

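##### The inference logic of a decision tree is simple and interpretable, and the model is easy to visualize; features do not need to be scaled or standardized. A decision tree still has to be fitted, with its structure learned from the training data, so more time is spent on the training step.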
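
As a rough illustration of the interpretability point, a fitted tree can be printed as plain if/else rules. The sketch below assumes a scikit-learn version that provides `sklearn.tree.export_text` (0.21 or later) and uses a tiny made-up dataset, not the Titanic data; the variable names are illustrative.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: columns are [age, sex=female]; label 1 = survived.
X_toy = [[22, 1], [38, 1], [26, 1], [35, 0], [54, 0], [2, 0]]
y_toy = [1, 1, 1, 0, 0, 0]

toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0)
toy_tree.fit(X_toy, y_toy)

# Render the learned split as human-readable text.
rules = export_text(toy_tree, feature_names=["age", "sex=female"])
print(rules)
```

Because only the ordering of values matters for each threshold, rescaling `age` would move the threshold but not change which samples fall on each side.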