# 构建C4.5决策树
## 目录
* C4.5决策树简介
* 实现C4.5决策树的版本1
* C4.5相对于ID3算法的改进之处

## C4.5决策树简介
C4.5决策树是在ID3算法的基础上进行改进，主要是改进两个部分
1. 使用信息增益率（Gain Ratio）作为选取切分字段的参考指标
$$Gain Ratio = \frac{Information Gain}{Information Value}$$
其中$$Information Values = -\sum\limits_{i=1}^k P(v_i)log_2P(v_i)$$
$v_i$为子节点中样本量占父节点样本量的比例

    我们选择 信息增益率最大的那一列，本质是信息增益最大，分支度又较小的列（也就是纯度提升很快，但又不是靠着把类别分 特别细 来提升 那些特征）

2. 添加连续变量处理手段
	如果输入特征字段是连续性变量，则有下列步骤：
	1. 算法首先对这一列数从小到大排序
	2. 选取相邻的两个数的中间数作为切分数据集的备选点，此时针对连续变量的处理并非将其转化为一个拥有N-1个分类水平的分类变量，而是将其转化为N-1个二分方案。

## 实现C4.5决策树

在实现过程中，基于原本的ID3决策树的代码进行修改，具体修改内容：

1. 需要对特征字段是连续性变量还是离散型变量进行判断，故在原类的基础上，添加判断特征字段类别的预处理
2. 采用信息增益率，同时，连续性变量有对应不同的操作，故bestsplit函数需要进行调整
3. 对连续性变量采用二分的方法，故splitData函数需要进行调整
4. 建树时，对于连续性变量，不再是以某个类别标签来进行划分，故createTree函数需要进行调整

### 01版本
感觉比较复杂，所以就先一点一点来。  
先完成基于信息增益率来划分的改变。

In [1]:
import numpy as np
import pandas as pd
from collections import Counter
from graphviz import Digraph
import graphviz
class Decision_tree():
    
    def __init__(self,cal = "Entropy"):
        self.tree = None
        self.cal = cal
        self.columns=None
    
    def calEnt(self,y):
        count = Counter(y)
        p = np.array(list(count.values()))/len(y)
        Ent = (-p* np.log2(p)).sum()
        return Ent
    
    def calGini(self,y):
        count = Counter(y)
        p = np.array(list(count.values()))/len(y)
        Gini = 1-(p**2).sum()
        return Gini
    
    def calimpurity(self,y):
        """根据参数来选择是使用Entropy还是Gini来计算不纯度"""
        if self.cal == "Entropy":
            return self.calEnt(y)
        else:
            return self.calGini(y)
    
    
    def fit_C45_01(self,X,y,featurename):
        
        def bestSplit(X,y):
            """让数据集根据某一个特征值进行划分，返回数据集最佳切分列索引"""
            bestFeature=-1
            bestGainRate=-1
            baseEnt = self.calimpurity(y)
            
            for i in range(X.shape[1]):
                label = list(Counter(X[:,i]).keys())
                sub_ent = 0
                sub_info = 0 #计算information value
                for j in label:
                    subData = y[X[:,i] == j]
                    childEnt = self.calimpurity(subData)
                    sub_ent += childEnt*len(subData)/len(y)
                    sub_info += -len(subData)/len(y)*np.log2(len(subData)/len(y))
                Gain = baseEnt - sub_ent
                Gainrate = Gain/sub_info
                if Gainrate > bestGainRate: 
                    bestGainRate = sub_ent
                    bestFeature = i
            return bestFeature
        
        
        def splitData(X,y,feature,label):
            """按指定的特征和标签来划分数据子集"""
            subX = X[X[:,feature] == label]
            suby = y[X[:,feature] == label]
            subX = np.delete(subX,feature,axis=1)
            return subX,suby
        
        
        def createTree(X,y,featurename):
            """用字典的形式保存最终的树"""
            if X.shape[1]==1 or len(list(Counter(y)))==1:#即没有再可以划分的特征，或者子集已经只有一列，则迭代结束
                return Counter(y).most_common(1)[0][0]#返回所占比例最多的类别
            
            bestfeature = bestSplit(X,y)
            bestfeaturename = featurename[bestfeature]
            labellist = set(Counter(X[:,bestfeature]))
            dic = {}
            for label in labellist:
                subX,suby = splitData(X,y,bestfeature,label)
                col = featurename.copy()
                del col[bestfeature]
                
                dic[label] = createTree(subX,suby,col)
            mytree = {bestfeaturename:dic}
            return mytree
         
        self.columns = featurename    
        self.tree = createTree(X,y,featurename)
        return self
        
        
    def _predict(self,test):
        """对单条测试集进行预测"""

        def __predict(tree,test,columns):
            feature = next(iter(tree))
            secondDic = tree[feature]
            index = columns.index(feature)
            content = test[index]
            for key in secondDic:
                if key == content:
                    if type(secondDic[key]) == dict :
                        return __predict(secondDic[key],test,columns)
                    else:
                        return secondDic[key]

        assert self.tree is not None,"fit before predict"
        tree = self.tree
        columns = self.columns
        return __predict(tree,test,columns)
    
    def predict(self,X_test):
        return np.array([self._predict(test) for test in X_test])
            
    def score(self,X_test,y_test):
        """计算模型的准确率"""
        y_predict = self.predict(X_test)
        return (y_test == y_predict).mean()
    
    def draw_tree(self):
        from graphviz import Digraph
        
        def export_graphviz(tree,root_index): 
            root = next(iter(tree))
            text_node.append([str(root_index),root])
            secondDic = tree[root]
            for key in secondDic:
                if type(secondDic[key]) == dict:
                    i[0]+=1
                    secondrootindex=i[0]
                    text_edge.append([str(root_index),str(secondrootindex),str(key)])
                    export_graphviz(secondDic[key],secondrootindex)
                else:
                    i[0] += 1
                    text_node.append([str(i[0]),str(secondDic[key])])
                    text_edge.append([ str(root_index) , str(i[0]) , str(key) ])
          
        
        tree = self.tree
        text_node=[]
        text_edge=[]
        i=[1]
        export_graphviz(tree,i[0])
        dot = Digraph()
        for line in text_node:
            dot.node(line[0],line[1])
        for line in text_edge:
            dot.edge(line[0],line[1],line[2])
        
        dot.view()

## 用数据集进行验证

In [2]:
data = pd.DataFrame([
    [1,"<=30","high","no","fair","no"],
    [2,"<=30","high","no","excellent","no"],
    [3,"31~40","high","no","fair","yes"],
    [4,">40","medium","no","fair","yes"],
    [5,">40","low","yes","fair","yes"],
    [6,">40","low","yes","excellent","no"],
    [7,"31~40","low","yes","excellent","yes"],
    [8,"<=30","medium","no","fair","no"],
    [9,"<=30","low","yes","fair","yes"],
    [10,">40","medium","yes","fair","yes"],
    [11,"<=30","medium","yes","excellent","yes"],
    [12,"31~40","medium","no","excellent","yes"],
    [13,"31~40","high","yes","fair","yes"],
    [14,">40","medium","no","excellent","no"]
                 ],columns=["index","age","income","student","credit_rating","Class"])

In [3]:
X = data.iloc[:,:-1]
y = data.iloc[:,-1]

In [4]:
clf = Decision_tree()
clf.fit_C45_01(np.array(X),np.array(y),list(X.columns))
print(clf.tree)

{'age': {'>40': {'credit_rating': {'excellent': 'no', 'fair': 'yes'}}, '<=30': {'credit_rating': {'excellent': {'student': {'no': 'no', 'yes': 'yes'}}, 'fair': {'student': {'no': 'no', 'yes': 'yes'}}}}, '31~40': 'yes'}}


In [5]:
clf.draw_tree()

![image](terribletree.png)

对比之前使用ID3算法构建的决策树，现在C4.5构建的树简直好太多

## 02版本
添加决策树对于连续性变量的分类  
由于这部分基本上就是在CART树的基础上，添加对离散变量的多分枝处理，较为复杂
而且对于离散型变量来说，本身也是可以全部按照二分法将其分离，故暂时就不实现这个版本的C4.5树啦

之后有兴趣了，可以在CART树的代码基础上进行修改。