### Decision Tree


#### CART Algorithm

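CART stands for Classification and Regression Tree.
ID3 and C4.5 can produce either binary or multiway trees, whereas CART builds only binary trees. CART is also unusual in that it can serve as either a classification tree or a regression tree.
A classification tree handles discrete data, that is, data with a limited number of categories, and outputs the class of a sample; a regression tree predicts continuous values, that is, values that can fall anywhere within some interval, and outputs a number.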
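
As a rough illustration of binary splitting, the sketch below picks a single threshold on one numeric feature by minimizing weighted Gini impurity (the default criterion of scikit-learn's CART classifier), then shows the two kinds of leaf output: a majority class for classification and a mean value for regression. The toy data and the helper names `gini` and `best_binary_split` are assumptions made for illustration. A real regression tree would choose its split by squared error rather than reuse the classification threshold.

```python
from collections import Counter
from statistics import mean


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(xs, ys):
    """Threshold on one numeric feature that minimizes weighted Gini impurity."""
    best_threshold, best_score = None, float("inf")
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical toy data: one feature, a discrete label, and a continuous target.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = ["a", "a", "a", "b", "b", "b"]
prices = [10.0, 12.0, 11.0, 30.0, 31.0, 29.0]

threshold, impurity = best_binary_split(xs, labels)

# Classification leaf: majority class of the samples that fall into it.
left_class = Counter(l for x, l in zip(xs, labels) if x <= threshold).most_common(1)[0][0]
# Regression leaf: mean target of the samples that fall into it.
left_value = mean(p for x, p in zip(xs, prices) if x <= threshold)
right_value = mean(p for x, p in zip(xs, prices) if x > threshold)
```

Every split sends samples either left or right, which is why the resulting tree is always binary.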


##### Building a classification decision tree for the iris dataset with a CART classification tree

In [6]:
# encoding = utf-8

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier    #create CART classifier, criterion=gini
from sklearn.datasets import load_iris

# prepare dataset
iris = load_iris()

#  get features and classifier label
features = iris.data
labels = iris.target

# extract training set and test set
training_features, test_features, training_labels, test_labels = train_test_split(features, labels,test_size=0.33, random_state=0)

# create cart
clf = DecisionTreeClassifier(criterion='gini')

# fit cart
clf = clf.fit(training_features,training_labels)

# predict test features
test_predict = clf.predict(test_features)

# calculate accuracy rate
score = accuracy_score(test_labels, test_predict)

print("CART accuracy rate = %.4f" % score)

CART accuracy rate = 0.9600
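
The accuracy reported here is simply the fraction of test samples whose predicted label matches the true label. A minimal sketch of that calculation follows; the `accuracy` helper and the label lists are hypothetical and are not the actual iris predictions.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Hypothetical labels: 50 test samples, 2 of them misclassified.
true_labels = [0] * 17 + [1] * 17 + [2] * 16
predicted_labels = list(true_labels)
predicted_labels[20] = 2   # a class-1 sample predicted as class 2
predicted_labels[40] = 1   # a class-2 sample predicted as class 1

score = accuracy(true_labels, predicted_labels)
print("accuracy = %.4f" % score)
```

With 48 of 50 labels correct, the sketch gives 0.96.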


##### Predicting Boston house prices with a CART regression tree

In [10]:
# encoding = utf-8

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.tree import DecisionTreeRegressor    #create CART regression tree
from sklearn.datasets import load_boston

# prepare dataset
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston = load_boston()

# print features
print(boston.feature_names)

# get features and price (continuous)
features = boston.data
prices = boston.target

# extract training set and test set
training_features, test_features, training_price, test_price = train_test_split(features, prices, test_size = 0.33, random_state = 0)

# create cart regressor
dtr = DecisionTreeRegressor()

# fit training set
dtr.fit(training_features, training_price)

# predict test price
predict_price = dtr.predict(test_features)

# assess the result
print("Regression tree mean squared error:", mean_squared_error(test_price, predict_price))
print("Regression tree mean absolute error:", mean_absolute_error(test_price, predict_price))
print("Regression tree R^2 score:", r2_score(test_price, predict_price))

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
Regression tree mean squared error: 26.461556886227545
Regression tree mean absolute error: 3.2610778443113766
Regression tree R^2 score: 0.6717829942751936
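
To make the three numbers above easier to read, here is a minimal pure-Python sketch of what `mean_squared_error`, `mean_absolute_error`, and `r2_score` compute. The helper names `mse`, `mae`, and `r2` and the sample values are assumptions for illustration, not scikit-learn's implementation.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of the absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical values, chosen so the arithmetic is easy to check by hand.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

print("MSE:", mse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))
print("R^2:", r2(y_true, y_pred))
```

The first two are errors, so lower is better. R^2 is a goodness-of-fit score where 1.0 means a perfect fit, so it should not be read as a residual.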


##### Predicting Titanic survivors with a CART classification tree

In [24]:

# encoding = utf-8

import pandas as pd

# load data
train_data = pd.read_csv('./data/Titanic/train.csv')
test_data = pd.read_csv('./data/Titanic/test.csv')

# explore data
print(train_data.info())
print('-'*30)
print(test_data.info())
print('-'*30)
print(train_data.describe())
print('-'*30)
print(train_data.describe(include=['O']))
print('-'*30)
print(train_data.head())
print('-'*30)
print(train_data.tail())
print('-'*30)

# fill missing Age values with the (truncated) mean age
train_data['Age'] = train_data['Age'].fillna(int(train_data['Age'].mean()))
test_data['Age'] = test_data['Age'].fillna(int(test_data['Age'].mean()))

# fill missing Fare values in the test set with the mean fare
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  4

In [26]:
print(train_data.info())
print(test_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pcl

In [27]:
print(train_data['Embarked'].value_counts())

S    644
C    168
Q     77
Name: Embarked, dtype: int64


In [28]:
# fill the 2 missing Embarked values with the most common port, 'S'
train_data['Embarked'] = train_data['Embarked'].fillna('S')
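
The choice of 'S' follows from the value counts printed above: it is the most frequent port. A small sketch of the same "fill with the mode" idea in plain Python is shown below. The list is rebuilt from those counts plus the two missing entries, and its ordering is an assumption.

```python
from collections import Counter

# Embarked values rebuilt from the printed counts; None marks a missing entry.
embarked = ["S"] * 644 + ["C"] * 168 + ["Q"] * 77 + [None] * 2

# Count only the known values and take the most frequent one.
counts = Counter(value for value in embarked if value is not None)
most_common_port = counts.most_common(1)[0][0]

# Replace every missing entry with that value.
filled = [most_common_port if value is None else value for value in embarked]
print(most_common_port, filled.count(most_common_port))
```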

In [29]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [30]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          418 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         418 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [32]:
# choose features
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_features = train_data[features]
train_labels= train_data['Survived']
test_features = test_data[features]

In [33]:
train_features.head()

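|   | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|--------|-----|-----|-------|-------|------|----------|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |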


In [34]:
test_features.head()

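|   | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|--------|-----|-----|-------|-------|------|----------|
| 0 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q |
| 1 | 3 | female | 47.0 | 1 | 0 | 7.0 | S |
| 2 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Q |
| 3 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S |
| 4 | 3 | female | 22.0 | 1 | 1 | 12.2875 | S |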
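
The `Sex` and `Embarked` columns hold strings, which a scikit-learn tree cannot consume directly. One common remedy is one-hot encoding: each string value becomes its own 0/1 column named `feature=value`, while numeric values pass through unchanged. The sketch below illustrates the idea in plain Python. The helper `one_hot_records` is hypothetical and only approximates what a vectorizer such as `DictVectorizer` does.

```python
def one_hot_records(records):
    """One-hot encode string values in a list of dicts; numbers pass through."""
    # Column names: "key=value" for strings, plain "key" for numbers, sorted.
    names = sorted({
        f"{key}={value}" if isinstance(value, str) else key
        for record in records
        for key, value in record.items()
    })
    index = {name: i for i, name in enumerate(names)}

    rows = []
    for record in records:
        row = [0.0] * len(names)
        for key, value in record.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0
            else:
                row[index[key]] = float(value)
        rows.append(row)
    return names, rows


# Two records taken from the first rows of the training table above.
records = [
    {"Sex": "male", "Age": 22.0, "Embarked": "S"},
    {"Sex": "female", "Age": 38.0, "Embarked": "C"},
]
names, rows = one_hot_records(records)
print(names)
print(rows)
```

The column layout is learned from the data it is fitted on, so the test set should be encoded with the layout learned from the training set rather than with a fresh one.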


In [35]:
# convert string into category 

from sklearn.feature_extraction import DictVectorizer
dvec = DictVectorizer(sparse = False)

train_features = dvec.fit_transform(train_features.to_dict(orient = 'records'))

In [37]:
print(dvec.feature_names_)

['Age', 'Embarked=C', 'Embarked=Q', 'Embarked=S', 'Fare', 'Parch', 'Pclass', 'Sex=female', 'Sex=male', 'SibSp']


In [38]:
from sklearn.tree import DecisionTreeClassifier

# create the decision tree (scikit-learn implements CART; criterion='entropy' splits on information gain, as ID3 does)
clf = DecisionTreeClassifier(criterion='entropy')

# train DT
clf.fit(train_features,train_labels)

In [39]:
# predict test features, reusing the encoding fitted on the training set
test_features = dvec.transform(test_features.to_dict(orient = 'records'))
predict_labels = clf.predict(test_features)

In [None]:
# test.csv has no Survived column, so report accuracy on the training set
acc_decision_tree = round(clf.score(train_features, train_labels), 6)
print('score accuracy is %.4f' % acc_decision_tree)