
# Decision Trees: ID3 and C4.5

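## ID3
- ID3 does not handle continuous features such as length or density (support for them can be added by hand).

- ID3 builds nodes from the feature with the largest information gain, but all else being equal, a feature with more distinct values tends to have a larger information gain than one with fewer. For example, a variable with 2 equally likely values (1/2 each) and a variable with 3 equally likely values (1/3 each) are both maximally uncertain, yet the 3-valued one has the larger entropy ($\log_2 3 \approx 1.585$ vs. $\log_2 2 = 1$) and therefore the larger attainable information gain.

- ID3 does not handle missing values.

- ID3 does not address overfitting.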
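
The first limitation in the list above can be lifted by binarizing a continuous feature, which is the approach C4.5 takes: sort the distinct values, try the midpoint between each adjacent pair as a threshold, and keep the threshold with the highest information gain. The following is a minimal sketch under those assumptions; `best_threshold` is a hypothetical helper and the sample values are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split `value <= threshold`.

    Hypothetical helper for illustration; candidate thresholds are the
    midpoints between adjacent distinct sorted values.
    """
    base = entropy(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Weighted entropy of the two sides after the split.
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if base - cond > best[1]:
            best = (t, base - cond)
    return best


# Made-up densities and labels, purely for illustration.
density = [0.2, 0.3, 0.4, 0.6, 0.7, 0.8]
is_apple = ['no', 'no', 'no', 'yes', 'yes', 'yes']
threshold, gain = best_threshold(density, is_apple)
print(threshold, gain)
```

On this toy data the split between 0.4 and 0.6 separates the two classes completely, so it is the one selected.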

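## C4.5
- There are two main pruning approaches: pre-pruning, which decides whether to stop splitting while the tree is being grown, and post-pruning, which grows the full tree first and then prunes it (for example, guided by cross-validation). C4.5 prunes mainly by post-pruning, using a pessimistic error estimate computed on the training data rather than cross-validation.

- C4.5 produces a multiway tree, i.e. a parent node can have more than two children. In many cases a binary tree is more efficient to compute with than a multiway tree.

- C4.5 can only be used for classification.

- Because C4.5 is based on entropy, it involves many time-consuming logarithm computations (<font color=skyblue>although taking logs is an effective guard against numerical underflow, especially when dealing with very small probabilities</font>); continuous features also require a lot of sorting.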
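
The remark about logarithms in the last bullet can be made concrete: multiplying many small probabilities underflows to zero in floating point, whereas summing their logarithms stays representable. A small sketch with made-up probabilities:

```python
from math import log2

# 400 independent events, each with probability 1e-3 (made-up numbers).
probs = [1e-3] * 400

product = 1.0
for p in probs:
    product *= p  # 1e-1200 is far below the smallest positive double

# The same quantity kept in log space.
log_sum = sum(log2(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # roughly -3986.3
```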


In [0]:
from math import log2  # log2 is the only math function the cells below need

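## Data set
Six records for deciding whether a fruit is an apple. The first two columns are the features `round` and `red`; the third column is the class. The first record, for example, says that a fruit that is round and red is an apple.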

In [0]:
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [1, 1, 'yes'],
               [1, 1, 'no']]
    labels = ['round','red']
    return dataSet, labels

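## Computing entropy
If the items to be classified can fall into several classes, the information carried by the symbol $x_i$ is:
$$I(x_i) = -\log_2(P(x_{i}))$$
The larger $P(x_i)$ is, the smaller $I(x_i)$ is: the more probable $x_i$ is, the less information it carries.

Information entropy is the average amount of information conveyed by one event, i.e. the expected value of the information:
$$H = \sum_{i=1}^{n}P(x_{i})I(x_{i}) = -\sum_{i=1}^{n}P(x_{i})\log_2(P(x_{i}))$$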
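
As a worked instance of the two formulas, the sketch below computes the self-information of each class and the entropy of the class column of the six-record fruit data set. The variable names are chosen for illustration only.

```python
from collections import Counter
from math import log2

# Class column (last column) of the six-record fruit data set.
classes = ['yes', 'yes', 'no', 'no', 'yes', 'no']

counts = Counter(classes)
total = len(classes)

# I(x_i) = -log2 P(x_i) for each class.
info = {c: -log2(n / total) for c, n in counts.items()}
# H = sum_i P(x_i) * I(x_i), the expected information.
H = sum((counts[c] / total) * info[c] for c in counts)

print(info)  # each class has probability 1/2, so 1 bit each
print(H)     # 1.0
```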

In [0]:
# Compute the entropy of the data set's class labels (last column)
def calcEntropy(dataSet):
    totalNum = len(dataSet)
    labelNum = {}
    entropy = 0
    for data in dataSet:
        label = data[-1]
        if label in labelNum:
            labelNum[label] += 1
        else:
            labelNum[label] = 1

    for key in labelNum:
        p = labelNum[key] / totalNum
        entropy -= p * log2(p)
    return entropy


# Entropy of a feature's own value distribution, H_A(D), used by the C4.5 gain ratio
def calcEntropyForFeature(featureList):
    totalNum = len(featureList)
    dataNum = {}
    entropy = 0
    for data in featureList:
        if data in dataNum:
            dataNum[data] += 1
        else:
            dataNum[data] = 1

    for key in dataNum:
        p = dataNum[key] / totalNum
        entropy -= p * log2(p)
    return entropy

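## Choosing the best feature in ID3
First compute the entropy of the whole data set. Then, for each feature in turn, compute the entropy of the data set after splitting on that feature. The former minus the latter is the information gain. The feature with the largest information gain is chosen as the best feature.  
Entropy of the data set:
$$H(D) = -\sum^K_{k=1}\frac{|C_k|}{|D|}\log\frac{|C_k|}{|D|}$$

Conditional entropy:

$$H(D|A) = \sum_{i=1}^{n}\frac{|D_i|}{|D|}H(D_{i}) = -\sum_{i=1}^{n}\frac{|D_i|}{|D|}\sum^K_{k=1}\frac{|D_{ik}|}{|D_i|}\log\frac{|D_{ik}|}{|D_i|}$$

Information gain:
$$G(D,A) = H(D) - H(D|A)$$

Here $D$ is the training set and $|D|$ its size. There are $K$ classes $C_k$, and $|C_k|$ is the number of samples in class $C_k$. A feature $A$ takes $n$ distinct values $a_1,a_2,\cdots,a_n$, which partition $D$ into $n$ subsets $D_1,D_2,\cdots,D_n$, where $|D_i|$ is the number of samples in $D_i$. The samples in $D_i$ that belong to class $C_k$ form the set $D_{ik}$, and $|D_{ik}|$ is its size.

<font face=楷体 color=brown size=5>Drawback</font>: information gain is biased toward features with many distinct values. A variable with 2 equally likely values and one with 3 equally likely values are both maximally uncertain, but the 3-valued one yields the larger information gain.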
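
To make the formulas concrete, the sketch below evaluates $G(D,A)$ for both features of the fruit data set. It is self-contained and uses its own helper names (`entropy`, `info_gain`), which are illustrative and not the notebook's functions.

```python
from collections import Counter, defaultdict
from math import log2

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
        [0, 1, 'no'], [1, 1, 'yes'], [1, 1, 'no']]


def entropy(rows):
    """H(D): entropy of the class labels (last column) in bits."""
    counts = Counter(row[-1] for row in rows)
    return -sum((n / len(rows)) * log2(n / len(rows)) for n in counts.values())


def info_gain(rows, i):
    """G(D, A) = H(D) - H(D|A) for the feature in column i."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[i]].append(row)
    # H(D|A): subset entropies weighted by subset size.
    cond = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - cond


gains = {name: info_gain(data, i) for i, name in enumerate(['round', 'red'])}
print(gains)  # both gains are about 0.191 bits
```

The two features tie on this data set, so `chooseBestFeatureID3` keeps the first one (`round`), because it only replaces the current best feature on a strictly larger gain.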

In [0]:
# Choose the best split feature (ID3): the one with the largest information gain
def chooseBestFeatureID3(dataSet, labels):
    bestFeature = 0
    initialEntropy = calcEntropy(dataSet)
    biggestEntropyG = 0
    for i in range(len(labels)):
        currentEntropy = 0
        feature = [data[i] for data in dataSet]
        subSet = splitDataSetByFeature(i, dataSet)
        totalN = len(feature)
        for key in subSet:
            prob = len(subSet[key]) / totalN
            currentEntropy += prob * calcEntropy(subSet[key])
        entropyGain = initialEntropy - currentEntropy
        if(biggestEntropyG < entropyGain):
            biggestEntropyG = entropyGain
            bestFeature = i
    return bestFeature

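## Choosing the best feature in C4.5
C4.5 chooses the feature with the largest information gain ratio. First compute the entropy of the whole data set. Then, for each feature in turn, compute the entropy of the data set after splitting on that feature. The former minus the latter is the information gain. Dividing the information gain by the entropy of the data set with respect to the values of that feature gives the information gain ratio. The feature with the largest gain ratio is chosen as the best feature.

The information gain ratio of feature $A$ on training set $D$ is defined as the ratio of its information gain to the entropy of $D$ with respect to the values of $A$:
$$G_R(D,A) = \frac{G(D,A)}{H_A(D)}$$

where:
$$H_A(D) = -\sum^n_{i=1}\frac{|D_i|}{|D|}\log\frac{|D_i|}{|D|}$$

<font face=楷体 color=yellow size=5>Gain ratio</font>  
The gain ratio multiplies the information gain by a penalty factor $1/H_A(D)$. When a feature has many distinct values the penalty factor is small; when it has few, the penalty factor is large.

<font face=楷体 color=brown size=5>Drawback</font>: the gain ratio is biased toward features with few distinct values.    
<font face=楷体 color=green size=5>Reason</font>: when a feature has few distinct values, $H_A(D)$ is small, so its reciprocal is large, which favors such features.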
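
The effect of the penalty can be seen by comparing the `round` column of the fruit data set with a hypothetical `record_id` feature that takes a different value for every record. The sketch below is self-contained; `gain_and_ratio` and `record_id` are illustrative names, not part of the notebook.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Entropy in bits of any list of hashable values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_and_ratio(feature, classes):
    """Return (G(D, A), G_R(D, A)) for one feature column.

    Assumes the feature takes at least two distinct values, so H_A(D) > 0.
    """
    total = len(classes)
    cond = 0.0
    for v in set(feature):
        subset = [c for f, c in zip(feature, classes) if f == v]
        cond += len(subset) / total * entropy(subset)
    gain = entropy(classes) - cond
    return gain, gain / entropy(feature)  # entropy(feature) is H_A(D)


classes = ['yes', 'yes', 'no', 'no', 'yes', 'no']
round_col = [1, 1, 1, 0, 1, 1]   # the 'round' column of the data set
record_id = [0, 1, 2, 3, 4, 5]   # hypothetical feature: one value per record

print(gain_and_ratio(round_col, classes))  # gain ~0.191, ratio ~0.294
print(gain_and_ratio(record_id, classes))  # gain 1.0, ratio ~0.387
```

The unique-ID feature has about five times the information gain of `round` but only about 1.3 times its gain ratio, so the penalty shrinks the advantage of many-valued features without necessarily removing it.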

In [0]:
# Choose the best split feature (C4.5): the one with the largest gain ratio
def chooseBestFeatureC45(dataSet, labels):
    bestFeature = 0
    initialEntropy = calcEntropy(dataSet)
    biggestEntropyGR = 0
    for i in range(len(labels)):
        currentEntropy = 0
        feature = [data[i] for data in dataSet]
        entropyFeature = calcEntropyForFeature(feature)
        if entropyFeature == 0:
            # The feature takes a single value here: it cannot split the data,
            # and the gain ratio would divide by zero.
            continue
        subSet = splitDataSetByFeature(i, dataSet)
        totalN = len(feature)
        for key in subSet:
            prob = len(subSet[key]) / totalN
            currentEntropy += prob * calcEntropy(subSet[key])
        entropyGain = initialEntropy - currentEntropy
        entropyGainRatio = entropyGain / entropyFeature

        if(biggestEntropyGR < entropyGainRatio):
            biggestEntropyGR = entropyGainRatio
            bestFeature = i
    return bestFeature

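## Splitting the data set by a feature
Split the data set on one feature: collect the values the feature takes, then group the records by value.  
<font face=楷体 color=red size=5>Note</font>: the resulting subsets no longer contain that feature.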
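
For illustration, here is what such a split produces on the fruit data set. The stand-alone helper `split_by_feature` is a hypothetical name that mirrors the behaviour described above.

```python
from collections import defaultdict


def split_by_feature(rows, i):
    """Group rows by the value in column i and drop that column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[i]].append(row[:i] + row[i + 1:])
    return dict(groups)


data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
        [0, 1, 'no'], [1, 1, 'yes'], [1, 1, 'no']]

subsets = split_by_feature(data, 0)  # split on column 0 ('round')
print(subsets[1])  # [[1, 'yes'], [1, 'yes'], [0, 'no'], [1, 'yes'], [1, 'no']]
print(subsets[0])  # [[1, 'no']]
```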

In [0]:
def splitDataSetByFeature(i, dataSet):
    subSet = {}
    feature = [data[i] for data in dataSet]
    for j in range(len(feature)):
        if feature[j] not in subSet:
            subSet[feature[j]] = []

        splittedDataSet = dataSet[j][:i]
        splittedDataSet.extend(dataSet[j][i + 1:])
        subSet[feature[j]].append(splittedDataSet)
    return subSet

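## Stopping conditions
Building the ID3 tree stops splitting a subset under either of two conditions (the C4.5 code in this notebook uses the same checks):
- all records in the subset belong to the same class;
- there are no features left to split on.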
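
The two conditions can be folded into a single check, sketched below. `leaf_label` is a hypothetical helper that assumes each record is a list whose last element is the class; when no features remain it falls back to a majority vote.

```python
from collections import Counter


def leaf_label(rows):
    """Return a class label if splitting should stop, else None."""
    classes = [row[-1] for row in rows]
    if len(set(classes)) == 1:   # condition 1: a single class remains
        return classes[0]
    if len(rows[0]) == 1:        # condition 2: only the class column is left
        return Counter(classes).most_common(1)[0][0]  # majority vote
    return None                  # keep splitting


print(leaf_label([[1, 'no']]))                 # no
print(leaf_label([['yes'], ['yes'], ['no']]))  # yes
print(leaf_label([[1, 'yes'], [0, 'no']]))     # None
```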


In [0]:
def checkIsOneCateg(newDataSet):
    flag = False
    categoryList = [data[-1] for data in newDataSet]
    category = set(categoryList)
    if(len(category) == 1):
        flag = True
    return flag


def majorityCateg(newDataSet):
    categCount = {}
    categList = [data[-1] for data in newDataSet]
    for c in categList:
        if c not in categCount:
            categCount[c] = 1
        else:
            categCount[c] += 1
    sortedCateg = sorted(categCount.items(), key = lambda x:x[1], reverse = True)

    return sortedCateg[0][0]


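## Building the decision tree
Build the tree recursively:
- First choose the best split feature.
- Then split the data set on that feature.
- For each resulting subset, check whether a stopping condition is met. If so, store the class as a leaf and stop splitting that subset; otherwise recurse on the subset.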
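
The recursion stores the tree as nested dictionaries of the form `{feature: {value: subtree_or_class}}`. The sketch below renders such a structure as indented text to make the shape easier to read; `tree_lines` is a hypothetical helper, and the tree literal is the one this notebook produces for the fruit data set.

```python
def tree_lines(node, indent=0):
    """Return the lines of an indented text rendering of a nested-dict tree."""
    pad = '  ' * indent
    if not isinstance(node, dict):  # leaf: a class label
        return [pad + '-> ' + str(node)]
    lines = []
    for feature, branches in node.items():  # internal node: {feature: {value: subtree}}
        for value, subtree in branches.items():
            lines.append(pad + f'{feature} == {value}')
            lines.extend(tree_lines(subtree, indent + 1))
    return lines


tree = {'round': {1: {'red': {1: 'yes', 0: 'no'}}, 0: 'no'}}
print('\n'.join(tree_lines(tree)))
```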

In [0]:
# Build the ID3 tree in place inside `decisionTree` (a dict)
def createDecisionTreeID3(decisionTree, dataSet, labels):
    bestFeature = chooseBestFeatureID3(dataSet, labels)
    currentLabel = labels[bestFeature]
    decisionTree[currentLabel] = {}
    subSet = splitDataSetByFeature(bestFeature, dataSet)
    # Build a new list of the remaining labels instead of deleting in place,
    # so neither the caller's list nor sibling branches are affected.
    newLabels = labels[:bestFeature] + labels[bestFeature + 1:]
    for key in subSet:
        newDataSet = subSet[key]
        if checkIsOneCateg(newDataSet):
            decisionTree[currentLabel][key] = newDataSet[0][-1]
        elif len(newDataSet[0]) == 1:  # no features left to split on
            decisionTree[currentLabel][key] = majorityCateg(newDataSet)
        else:
            decisionTree[currentLabel][key] = {}
            createDecisionTreeID3(decisionTree[currentLabel][key], newDataSet, newLabels)

In [0]:
# Build the C4.5 tree in place inside `decisionTree` (a dict)
def createDecisionTreeC45(decisionTree, dataSet, labels):
    bestFeature = chooseBestFeatureC45(dataSet, labels)
    currentLabel = labels[bestFeature]
    decisionTree[currentLabel] = {}
    subSet = splitDataSetByFeature(bestFeature, dataSet)
    # Build a new list of the remaining labels instead of deleting in place,
    # so neither the caller's list nor sibling branches are affected.
    newLabels = labels[:bestFeature] + labels[bestFeature + 1:]
    for key in subSet:
        newDataSet = subSet[key]
        if checkIsOneCateg(newDataSet):
            decisionTree[currentLabel][key] = newDataSet[0][-1]
        elif len(newDataSet[0]) == 1:  # no features left to split on
            decisionTree[currentLabel][key] = majorityCateg(newDataSet)
        else:
            decisionTree[currentLabel][key] = {}
            createDecisionTreeC45(decisionTree[currentLabel][key], newDataSet, newLabels)

## Classifying test data
If a leaf is reached, return its class; otherwise keep testing the next feature until a leaf is reached, then return that class.

In [0]:
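# Classify one test record by walking the tree from the root down to a leaf
def classifyTestData(decisionTree, testData, labels=('round', 'red')):
    node = decisionTree
    while isinstance(node, dict):
        featureName = next(iter(node))  # each internal node has exactly one feature key
        node = node[featureName][testData[labels.index(featureName)]]
    return node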

In [0]:
if __name__ == '__main__':
    dataSetID3, labelsID3 = createDataSet()
    testData1 = [0, 1]
    testData2 = [1, 1]
    bestFeatureID3 = chooseBestFeatureID3(dataSetID3, labelsID3)
    decisionTreeID3 = {}
    createDecisionTreeID3(decisionTreeID3, dataSetID3, labelsID3)
    print("ID3 decision tree: ", decisionTreeID3)
    category1ID3 = classifyTestData(decisionTreeID3, testData1)
    print(testData1 , ", classified as by ID3: " , category1ID3)
    category2ID3 = classifyTestData(decisionTreeID3, testData2)
    print(testData2 , ", classified as by ID3: " , category2ID3)

    dataSetC45, labelsC45 = createDataSet()
    bestFeatureC45 = chooseBestFeatureC45(dataSetC45, labelsC45)
    decisionTreeC45 = {}
    createDecisionTreeC45(decisionTreeC45, dataSetC45, labelsC45)
    print("C4.5 decision tree: ", decisionTreeC45)
    category1C45 = classifyTestData(decisionTreeC45, testData1)
    print(testData1 , ", classified as by C4.5: " , category1C45)
    category2C45 = classifyTestData(decisionTreeC45, testData2)
    print(testData2 , ", classified as by C4.5: " , category2C45)

ID3 decision tree:  {'round': {1: {'red': {1: 'yes', 0: 'no'}}, 0: 'no'}}
[0, 1] , classified as by ID3:  no
[1, 1] , classified as by ID3:  yes
C4.5 decision tree:  {'round': {1: {'red': {1: 'yes', 0: 'no'}}, 0: 'no'}}
[0, 1] , classified as by C4.5:  no
[1, 1] , classified as by C4.5:  yes
