# Decision Tree

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
%matplotlib inline

In [3]:
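# Use the SimHei font so that Chinese labels render correctly in plots
plt.rcParams['font.sans-serif'] = ['SimHei']
# Keep the minus sign displayable when a non-default font is used
plt.rcParams['axes.unicode_minus'] = False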

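## Classification

Classification is one of the most common supervised learning problems. The predicted result can be binary or multiclass.

Identifying spam text messages or emails relies on features such as the sender's number and the message content; this is a binary classification problem.

License plate recognition has to identify every letter and digit on the plate; this is a multiclass classification problem.

Given a dataset with several features and one class label, a model can learn to predict the class of new data.

For example, when analyzing whether bank customers can repay their loans, a classifier can be trained on past customers' property ownership, marital status, and annual income (the features) together with whether they repaid (the class label). When a new customer applies for a loan, the model takes that customer's feature values and predicts whether the customer will be able to repay.

Commonly used classification algorithms include:

- Decision trees
- Bayesian classifiers
- KNN
- Support vector machines (SVM)
- Neural networks
- Ensemble learning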

## Performance Evaluation

A confusion matrix can be used to evaluate how accurately a classification model predicts.

| Actual \ Predicted | Yes | No |
| --- | --- | --- |
| **Yes** | a | b |
| **No** | c | d |

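Accuracy: the proportion of all samples that are predicted correctly.

Accuracy = $\frac{a + d}{a + b + c + d}$

Precision: among the samples predicted as Yes, the proportion whose true class is Yes.

Precision = $\frac{a}{a + c}$

Recall: among the samples whose true class is Yes, the proportion that are predicted correctly.

Recall = $\frac{a}{a + b}$

F1: the harmonic mean of precision and recall.

F1 = $\frac{2a}{2a + b + c}$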
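The four formulas can be checked with a few lines of Python. This is a minimal sketch: the function name `classification_metrics` and the counts passed to it are made up for illustration, and `a`, `b`, `c`, `d` are the confusion-matrix cells from the table above.

```python
def classification_metrics(a, b, c, d):
    """Compute the four metrics from the confusion-matrix cells a, b, c, d."""
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + c)
    recall = a / (a + b)
    f1 = 2 * a / (2 * a + b + c)
    return accuracy, precision, recall, f1


# Hypothetical counts, chosen only for illustration
accuracy, precision, recall, f1 = classification_metrics(a=40, b=10, c=5, d=45)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)

# F1 written as the harmonic mean of precision and recall gives the same value
harmonic_mean = 2 * precision * recall / (precision + recall)
print("Harmonic mean:", harmonic_mean)
```

The last line illustrates why the F1 formula has the form $\frac{2a}{2a + b + c}$: it is what the harmonic mean of precision and recall simplifies to.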

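## Decision Trees

A decision tree is a common classification method. It mirrors the natural way people reason when they face a decision.

For example, to judge whether an apple is good, first look at its color: a green apple is certainly not good. If it is red, check for wormholes: a red apple with no wormholes is a good apple.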
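The apple example is a two-level decision tree, and it can be written as nested questions in code. This is only a sketch: the function name `is_good_apple` is hypothetical, and it assumes that apples are either red or green and that a red apple with wormholes counts as bad, which the example implies but does not state.

```python
def is_good_apple(color, has_wormhole):
    # Root node: the first question is about color; green apples are rejected
    if color == "green":
        return False
    # Second level, reached only by red apples: check for wormholes
    return not has_wormhole


print(is_good_apple("green", has_wormhole=False))
print(is_good_apple("red", has_wormhole=True))
print(is_good_apple("red", has_wormhole=False))
```

Each `if` corresponds to an internal node of the tree, and each `return` corresponds to a leaf.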

In [4]:
df = pd.read_csv('data/bankdebt.csv', header=None, names=['房产', '婚姻', '年收入（万元）', '无法偿还'])
df

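|  | 房产 | 婚姻 | 年收入（万元） | 无法偿还 |
| --- | --- | --- | --- | --- |
| 1 | Yes | Single | 12.5 | No |
| 2 | No | Married | 10.0 | No |
| 3 | No | Single | 7.0 | No |
| 4 | Yes | Married | 12.0 | No |
| 5 | No | Divorced | 9.5 | Yes |
| 6 | No | Married | 6.0 | No |
| 7 | Yes | Divorced | 22.0 | No |
| 8 | No | Single | 8.5 | Yes |
| 9 | No | Married | 7.5 | No |
| 10 | No | Single | 9.0 | Yes |

The column names stay in Chinese because the code refers to them: 房产 is property ownership, 婚姻 is marital status, 年收入（万元） is annual income in units of 10,000 yuan, and 无法偿还 means unable to repay.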


![](./img/决策树.png)

Before training the model, the features must be converted to numeric types.

In [5]:
# Encode the categorical columns as integers
df['房产'] = df['房产'].map({'Yes': 1, 'No': 0})
df['婚姻'] = df['婚姻'].map({'Single': 1, 'Married': 2, 'Divorced': 3})
# Target column: 1 means the customer could not repay
df['无法偿还'] = df['无法偿还'].map({'Yes': 1, 'No': 0})

df

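|  | 房产 | 婚姻 | 年收入（万元） | 无法偿还 |
| --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 12.5 | 0 |
| 2 | 0 | 2 | 10.0 | 0 |
| 3 | 0 | 1 | 7.0 | 0 |
| 4 | 1 | 2 | 12.0 | 0 |
| 5 | 0 | 3 | 9.5 | 1 |
| 6 | 0 | 2 | 6.0 | 0 |
| 7 | 1 | 3 | 22.0 | 0 |
| 8 | 0 | 1 | 8.5 | 1 |
| 9 | 0 | 2 | 7.5 | 0 |
| 10 | 0 | 1 | 9.0 | 1 |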
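One thing to watch with `Series.map` and a dictionary: any value that is missing from the dictionary becomes `NaN` without raising an error. The sketch below uses a toy series in which `'Widowed'` is a hypothetical category that the mapping does not cover.

```python
import pandas as pd

# Toy data; 'Widowed' is a hypothetical category absent from the mapping
marital = pd.Series(['Single', 'Married', 'Widowed'])
codes = marital.map({'Single': 1, 'Married': 2, 'Divorced': 3})

# Unmapped values turn into NaN, so count them before training
n_unmapped = int(codes.isna().sum())
print(codes)
print("Unmapped values:", n_unmapped)
```

Checking `isna().sum()` right after the mapping is a cheap way to catch typos or unexpected categories in the raw data.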


Extract the features and the class labels:

In [6]:
X = df[['房产', '婚姻', '年收入（万元）']]
y = df['无法偿还']

Create the decision tree model:

In [7]:
from sklearn.tree import DecisionTreeClassifier

In [8]:
model = DecisionTreeClassifier()
model.fit(X, y)
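`DecisionTreeClassifier()` uses Gini impurity as its default split criterion: at each node it looks for the threshold that makes the child nodes as pure as possible. The sketch below computes this by hand for the ten rows displayed above. The helper names `gini` and `weighted_gini` are hypothetical, and the sketch only illustrates the criterion, not scikit-learn's internal implementation.

```python
from collections import Counter

# Annual income and "unable to repay" label for the ten rows shown above
incomes = [12.5, 10.0, 7.0, 12.0, 9.5, 6.0, 22.0, 8.5, 7.5, 9.0]
labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]


def gini(values):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if not values:
        return 0.0
    n = len(values)
    return 1.0 - sum((count / n) ** 2 for count in Counter(values).values())


def weighted_gini(threshold):
    """Impurity after splitting on income <= threshold, weighted by child size."""
    left = [y for x, y in zip(incomes, labels) if x <= threshold]
    right = [y for x, y in zip(incomes, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


root_impurity = gini(labels)
split_impurity = weighted_gini(9.75)
print("Impurity before splitting:", root_impurity)
print("Impurity after income <= 9.75:", split_impurity)
```

Splitting at 9.75 sends the four highest-income customers, all of whom repaid, into a pure node, which is why the impurity drops.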

Load the test set:

In [9]:
df_test = pd.read_csv('data/test_bankdebt.csv', header=None, names=['房产', '婚姻', '年收入（万元）'])
df_test

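|  | 房产 | 婚姻 | 年收入（万元） |
| --- | --- | --- | --- |
| 0 | 1 | 2 | 9.368809 |
| 1 | 0 | 2 | 13.330199 |
| 2 | 0 | 2 | 22.665605 |
| 3 | 0 | 2 | 11.4869 |
| 4 | 1 | 2 | 7.441759 |
| 5 | 0 | 2 | 12.125957 |
| 6 | 0 | 3 | 23.136569 |
| 7 | 0 | 3 | 10.442645 |
| 8 | 1 | 2 | 17.953802 |
| 9 | 0 | 3 | 5.010408 |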


In [10]:
X_test = df_test[['房产', '婚姻', '年收入（万元）']]

Predict whether each customer will be able to repay:

In [11]:
y_pred = model.predict(X_test)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0], dtype=int64)

In [12]:
df_test['无法偿还'] = y_pred
df_test

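|  | 房产 | 婚姻 | 年收入（万元） | 无法偿还 |
| --- | --- | --- | --- | --- |
| 0 | 1 | 2 | 9.368809 | 0 |
| 1 | 0 | 2 | 13.330199 | 0 |
| 2 | 0 | 2 | 22.665605 | 0 |
| 3 | 0 | 2 | 11.4869 | 0 |
| 4 | 1 | 2 | 7.441759 | 0 |
| 5 | 0 | 2 | 12.125957 | 0 |
| 6 | 0 | 3 | 23.136569 | 0 |
| 7 | 0 | 3 | 10.442645 | 0 |
| 8 | 1 | 2 | 17.953802 | 0 |
| 9 | 0 | 3 | 5.010408 | 1 |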


Print the learned decision tree:

In [13]:
from sklearn.tree import export_text

In [14]:
print(export_text(model, feature_names=['房产', '婚姻', '年收入（万元）']))

|--- 年收入（万元） <= 9.75
|   |--- 婚姻 <= 2.50
|   |   |--- 婚姻 <= 1.50
|   |   |   |--- 年收入（万元） <= 9.25
|   |   |   |   |--- 年收入（万元） <= 7.75
|   |   |   |   |   |--- 年收入（万元） <= 6.25
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |--- 年收入（万元） >  6.25
|   |   |   |   |   |   |--- class: 0
|   |   |   |   |--- 年收入（万元） >  7.75
|   |   |   |   |   |--- class: 1
|   |   |   |--- 年收入（万元） >  9.25
|   |   |   |   |--- class: 0
|   |   |--- 婚姻 >  1.50
|   |   |   |--- class: 0
|   |--- 婚姻 >  2.50
|   |   |--- class: 1
|--- 年收入（万元） >  9.75
|   |--- class: 0
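To read the printed tree, follow one branch per comparison from the root until a `class:` line is reached. The sketch below transcribes the printed rules into a plain function; the name `predict_from_rules` is hypothetical, and the function is a hand copy of the output above rather than something scikit-learn generates. The 房产 feature does not appear because the printed tree never splits on it.

```python
def predict_from_rules(marriage, income):
    """Hand transcription of the printed tree (1 = unable to repay)."""
    if income > 9.75:
        return 0
    # From here on, income <= 9.75
    if marriage > 2.5:      # Divorced
        return 1
    if marriage > 1.5:      # Married
        return 0
    # marriage <= 1.5 (Single): further splits on income
    if income > 9.25:
        return 0
    if income > 7.75:
        return 1
    return 1 if income <= 6.25 else 0


# Three rows from the test table: (婚姻, 年收入)
print(predict_from_rules(2, 9.368809))
print(predict_from_rules(3, 5.010408))
print(predict_from_rules(3, 10.442645))
```

These three calls reproduce the predictions shown for rows 0, 9, and 7 of the test table.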



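## Exercises

### Car Purchase Prediction

`car_data.csv` records information about customers who inquired about buying a car. Use a decision tree to build a model that predicts whether future customers will purchase, and analyze the model's performance.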

In [24]:
df = pd.read_csv('data/car_data.csv', index_col='User ID')
df.head()

| User ID | Gender | Age | AnnualSalary | Purchased |
| --- | --- | --- | --- | --- |
| 385 | Male | 35 | 20000 | 0 |
| 681 | Male | 40 | 43500 | 0 |
| 353 | Male | 49 | 74000 | 0 |
| 895 | Male | 40 | 107500 | 1 |
| 661 | Male | 25 | 79000 | 0 |


In [29]:
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
df.head()

| User ID | Gender | Age | AnnualSalary | Purchased |
| --- | --- | --- | --- | --- |
| 385 | 0 | 35 | 20000 | 0 |
| 681 | 0 | 40 | 43500 | 0 |
| 353 | 0 | 49 | 74000 | 0 |
| 895 | 0 | 40 | 107500 | 1 |
| 661 | 0 | 25 | 79000 | 0 |


In [30]:
from sklearn.model_selection import train_test_split

In [31]:
# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    df[['Gender', 'Age', 'AnnualSalary']], df['Purchased'], test_size=0.2
)
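Without a fixed seed, `train_test_split` shuffles differently on every run, so the scores reported further down will vary. The sketch below shows two optional parameters, `random_state` and `stratify`, on synthetic data; the arrays and variable names are made up for illustration and are not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 samples with 2 features, 40 of class 0 and 10 of class 1
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0] * 40 + [1] * 10)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Test size:", len(X_te))
print("Positives in the test set:", int(y_te.sum()))
```

Stratifying is most useful when one class is much rarer than the other, because a purely random split could leave very few positive samples in the test set.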

In [32]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

In [33]:
y_pred = model.predict(X_test)
y_pred

array([0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1,
       1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1], dtype=int64)

In [34]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [36]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))

Accuracy: 0.865
Precision: 0.8160919540229885
Recall: 0.8658536585365854
F1: 0.8402366863905325
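The four printed scores can be tied back to the confusion-matrix formulas given earlier. The counts below are not taken from the notebook; they are inferred from the printed output, on the assumption that the test set holds the 200 predictions shown above.

```python
# Inferred confusion-matrix cells (an assumption consistent with the output)
a, b, c, d = 71, 11, 16, 102  # TP, FN, FP, TN

accuracy = (a + d) / (a + b + c + d)
precision = a / (a + c)
recall = a / (a + b)
f1 = 2 * a / (2 * a + b + c)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)
```

Read this way, the model missed 11 customers who did buy and wrongly predicted a purchase for 16 who did not.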


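### Airline Satisfaction Prediction

`Invistico_Airline.csv` contains an airline's satisfaction survey of its customers. Build a decision tree model to predict the satisfaction of future customers, and analyze its performance.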

In [47]:
df = pd.read_csv('data/Invistico_Airline.csv')
df.head()

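|  | satisfaction | Customer Type | Age | Type of Travel | Class | Flight Distance | Seat comfort | Departure/Arrival time convenient | Food and drink | Gate location | ... | Online support | Ease of Online booking | On-board service | Leg room service | Baggage handling | Checkin service | Cleanliness | Online boarding | Departure Delay in Minutes | Arrival Delay in Minutes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | satisfied | Loyal Customer | 65 | Personal Travel | Eco | 265 | 0 | 0 | 0 | 2 | ... | 2 | 3 | 3 | 0 | 3 | 5 | 3 | 2 | 0 | 0.0 |
| 1 | satisfied | Loyal Customer | 47 | Personal Travel | Business | 2464 | 0 | 0 | 0 | 3 | ... | 2 | 3 | 4 | 4 | 4 | 2 | 3 | 2 | 310 | 305.0 |
| 2 | satisfied | Loyal Customer | 15 | Personal Travel | Eco | 2138 | 0 | 0 | 0 | 3 | ... | 2 | 2 | 3 | 3 | 4 | 4 | 4 | 2 | 0 | 0.0 |
| 3 | satisfied | Loyal Customer | 60 | Personal Travel | Eco | 623 | 0 | 0 | 0 | 3 | ... | 3 | 1 | 1 | 0 | 1 | 4 | 1 | 3 | 0 | 0.0 |
| 4 | satisfied | Loyal Customer | 70 | Personal Travel | Eco | 354 | 0 | 0 | 0 | 3 | ... | 4 | 2 | 2 | 0 | 2 | 4 | 2 | 5 | 0 | 0.0 |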


In [48]:
df.dtypes

satisfaction                          object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
dtype: object

In [49]:
df['satisfaction'] = df['satisfaction'].map({'satisfied': 1, 'dissatisfied': 0})
df['Class'] = df['Class'].map({'Business': 1, 'Eco Plus': 2, 'Eco': 3})

df.dtypes

satisfaction                           int64
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                  int64
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
dtype: object

In [50]:
df = pd.get_dummies(df, columns=['Customer Type', 'Type of Travel'])
df.dtypes

satisfaction                           int64
Age                                    int64
Class                                  int64
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
Customer Type_Loyal Customer            bool
Customer Type_disloyal Customer         bool
Type of Tr
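To see what `pd.get_dummies` does to a categorical column, here is a small sketch on a toy frame. The rows are made up, and `'Other'` is a hypothetical category used only for illustration; the column name is borrowed from the dataset above.

```python
import pandas as pd

# Toy frame; 'Other' is a hypothetical category
toy = pd.DataFrame({
    'Type of Travel': ['Personal Travel', 'Other', 'Personal Travel'],
    'Age': [65, 47, 15],
})

# Each category becomes its own indicator column named <column>_<category>
encoded = pd.get_dummies(toy, columns=['Type of Travel'])
print(list(encoded.columns))
print(encoded)
```

Unlike the integer mapping used for `Class`, one-hot encoding does not impose an order on the categories, which suits columns whose values have no natural ranking.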

In [51]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['satisfaction']), df['satisfaction'], test_size=0.2)

In [52]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
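A tree grown with the default settings keeps splitting until its leaves are pure, so it usually fits the training data almost perfectly. One way to judge whether that hurts generalization is to compare training and test accuracy, for example against a tree limited with `max_depth`. The sketch below uses synthetic data from `make_classification` as a stand-in for a real dataset; the variable names and the depth of 3 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real dataset (an assumption for illustration)
X_syn, y_syn = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

scores = {}
for depth in (None, 3):  # None lets the tree grow without a depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # score() returns the accuracy on the data it is given
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print("max_depth =", depth, "train/test accuracy:", scores[depth])
```

A large gap between the two numbers for the unrestricted tree suggests overfitting; whether a shallower tree does better on the test set depends on the data.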

In [53]:
y_pred = model.predict(X_test)
y_pred

array([0, 0, 1, ..., 1, 1, 0], dtype=int64)

In [54]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))

Accuracy: 0.9383661841700031
Precision: 0.9478582084250906
Recall: 0.9389866291344123
F1: 0.9434015625552373
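To look at the raw counts behind these scores, scikit-learn provides `confusion_matrix`. By default it orders the labels ascending, so the top-left cell counts true negatives; passing `labels=[1, 0]` puts the positive class first and matches the Yes/No table layout used earlier. The sketch below uses small made-up label lists rather than the airline data.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (1 = Yes, 0 = No)
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 0, 0, 1]

# Rows are true classes and columns are predicted classes, both ordered Yes, No
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 0])
(a, b), (c, d) = cm.tolist()
print(cm)
print("a, b, c, d =", a, b, c, d)
```

The same call with `y_test` and `y_pred` from the cells above would give the counts for the fitted model.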
