<a href="https://colab.research.google.com/github/yiboxu20/MachineLearning/blob/main/Resources/Module1/Decision_tree1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Part of the notebook is based on https://medium.com/@joachimvalente

## Tip of the day
- You can adjust your preferences, such as **adding line numbers** to your code cells, **showing the indentation guide**, and **changing the default indentation to 4**, in the settings->editor menu.

- Displaying functions documentation:
 - You can quickly display a function's documentation by pressing **Alt+/** while the cursor is on it.

 - You can also open a small documentation window at the bottom of the screen by running a command of the form **?{function}** in a new cell (replacing **{function}** with your function's name).

    Try opening a new cell below this one by clicking the **+code** button below the menu bar. Then type:
```python
?print
```
into it and run it.




In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from IPython.display import Image
import numpy.linalg as LA
eps = 10**-12

# Decision Tree
Advantages of decision trees:

- Simple to understand and interpret.

- Able to handle both numerical and categorical data. (today's lecture)

- Requires little data preparation.

- Uses a white box model.

- Possible to validate a model using statistical tests.

- Mirrors human decision making more closely than other approaches.


The original ID3 algorithm accepts **categorical/discrete features** only and
is prone to overfitting. ID3 also works only for classification.

We will introduce the **C4.5** algorithm, the successor to ID3. C4.5 improves on ID3 in three main ways:

- It accepts continuous features by introducing threshold nodes.

- It reduces overfitting through a pruning strategy.

- It can handle incomplete data points with missing feature values.

## C4.5 Algorithm
- One issue with using **information gain (IG)** is that it favors features with a large number of distinct values.

- For example, suppose that one is building a decision tree for some data describing the customers of a business.

  Information gain is often used to decide which of the features are the most relevant, so they can be tested near the root of the tree. One of the input features might be the customer's **credit card number**. This feature has a high IG, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before. This is a typical case of **overfitting**.

- The C4.5 algorithm chooses the feature with the **highest information gain ratio** from among the features whose **information gain is average or higher**. This biases the decision tree against considering features with a large number of distinct values, while not giving an unfair advantage to features with very low information value.

### Information gain ratio
- Recall that the **information gain (IG)** of a set $\mathcal{S}$ with respect to a feature $F$ is
$$ \text{IG}(\mathcal{S}; F)= H(\mathcal{S})- H(\mathcal{S}|F).$$

- The conditional entropy of $\mathcal{S}$ given $F$ is computed by
$$ H(\mathcal{S}|F) = \sum_{f\in F} P(F=f)H(\mathcal{S}_f)=\sum_{f\in F} \frac{|\mathcal{S}_f|}{|\mathcal{S}|}H(\mathcal{S}_f)$$
where $\mathcal{S}_f$ is the subset of $\mathcal{S}$ in which $F$ takes the value $f$.

- The entropy of the feature $F$ itself is
$$H(F) =-\sum_{f\in F}\frac{|\mathcal{S}_f|}{|\mathcal{S}|} \log_2\left(\frac{|\mathcal{S}_f|}{|\mathcal{S}|}\right) $$
which is sometimes also called the intrinsic value.

- **Information gain ratio** is the ratio between the information gain and the intrinsic value:
$$ \text{IGR}(\mathcal{S}; F) = \text{IG}(\mathcal{S}; F) /H(F)$$

- The strategy is to choose the highest $\text{IGR}$ among the features whose $\text{IG}$ is average or higher.
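
To make the selection rule concrete, here is a minimal sketch that is independent of the notebook's own functions. The eight rows, the column names (`customer_id`, `income`, `debt`) and the helper names (`H`, `info_gain`, `gain_ratio`) are made up for illustration and are not part of the credit history data used below.

```python
import math
from collections import Counter

def H(labels):
    """Shannon entropy (base 2) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, target):
    """IG(S; F) = H(S) - sum_f |S_f|/|S| * H(S_f)."""
    n = len(target)
    cond = 0.0
    for f in set(feature):
        subset = [t for x, t in zip(feature, target) if x == f]
        cond += len(subset) / n * H(subset)
    return H(target) - cond

def gain_ratio(feature, target):
    """IGR(S; F) = IG(S; F) / H(F), where H(F) is the intrinsic value."""
    return info_gain(feature, target) / H(feature)

# Hypothetical toy data: a unique id, a useful feature, and a nearly useless one.
features = {
    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'income': ['low', 'low', 'low', 'low', 'high', 'high', 'high', 'high'],
    'debt': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
}
risk = ['high', 'high', 'high', 'high', 'low', 'low', 'low', 'high']

ig = {name: info_gain(col, risk) for name, col in features.items()}
igr = {name: gain_ratio(col, risk) for name, col in features.items()}

# Keep features whose IG is average or higher, then pick the highest IGR.
mean_ig = sum(ig.values()) / len(ig)
candidates = [name for name in features if ig[name] >= mean_ig]
best = max(candidates, key=lambda name: igr[name])

for name in features:
    print(name, round(ig[name], 3), round(igr[name], 3))
print('selected:', best)
```

In this toy data `customer_id` has the largest IG (about 0.954 against 0.549 for `income`), but its intrinsic value is $\log_2 8 = 3$, so its IGR drops to about 0.318 and `income` is selected.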

In [2]:
#@title reuse functions defined before

def entropy(df): #H(S)
    target = df.keys()[-1]
    entropy_data = 0
    target_values = df[target].unique() #yes or no
    for target_value in target_values:
        fraction = df[target].value_counts()[target_value]/len(df[target])
        entropy_data += -fraction*np.log2(fraction)
    return entropy_data

# conditional entropy H(S|F) of the target given one feature
def entropy_feature(df,feature): #H(S|F)
    target = df.keys()[-1]
    target_values = df[target].unique()  #all target classes, e.g. 'Yes' and 'No'
    variables = df[feature].unique()    #distinct values f of the feature

    entropy = 0
    for variable in variables:
        subset = df[df[feature]==variable]   #S_f: rows where F = f
        den = len(subset)  #denominator |S_f|
        entropy_each_feature = 0
        for target_variable in target_values:
            num = (subset[target]==target_variable).sum() #numerator
            fraction = num/(den+eps)  #+eps guards against division by zero
            entropy_each_feature += -fraction*np.log2(fraction+eps) #accumulates H(S_f); +eps avoids log2(0)
        fraction2 = den/len(df) # P(F=f)
        entropy += fraction2*entropy_each_feature   #H(S|F)=\sum P(F=f) H(S_f)

    return abs(entropy)  #abs removes a tiny negative value that eps can introduce

# calculate Info gain of each feature
def ig(df):
    IG = []
    for feature in df.keys()[:-1]:
      IG.append(entropy(df)-entropy_feature(df,feature))
    return IG



In [3]:
url = "https://raw.githubusercontent.com/yiboxu20/MachineLearning/refs/heads/main/Resources/data/credithistory.csv"
c = pd.read_csv(url)
c= c[['collateral','income','debt','credithistory','risk']]
c

Unnamed: 0,collateral,income,debt,credithistory,risk
0,none,$0to$15K,high,bad,high
1,none,$15Kto$35K,high,unknown,high
2,none,$15Kto$35K,low,unknown,moderate
3,none,$0to$15K,low,unknown,high
4,none,over$35K,low,unknown,low
5,adequate,over$35K,low,unknown,low
6,none,$0to$15K,low,bad,high
7,adequate,over$35K,low,bad,moderate
8,none,over$35K,low,good,low
9,adequate,over$35K,high,good,low


In [5]:
def get_subtable_no_node(df, node,variable):
  tempdf = df[df[node] == variable].reset_index(drop=True)
  return tempdf.drop(node, axis = 1)


def ig_ratio(df):
  IG = ig(df)
  HF = []
  features = df.keys()[:-1]
  for feature in features:
    entropy = 0
    variables = df[feature].unique()
    for variable in variables:
      den = len(df[feature][df[feature]==variable])
      fraction = den/len(df)
      entropy += - fraction*np.log2(fraction + eps)

    HF.append(entropy)

  ratio = [IG[i]/HF[i] for i in range(len(IG))]
  return ratio



def C45(df):
  """
   Find the feature with highest information gain ratio
   from among the features whose information gain is average or higher.
  """
  features = df.keys()[:-1]
  IG = ig(df)
  IG_ratio = ig_ratio(df)
  IG_mean = np.mean(IG)
  IG_above_ave_index = [i  for i in range(len(IG)) if IG[i]>=IG_mean ]
  max_index = IG_above_ave_index[np.argmax( [IG_ratio[index] for index in IG_above_ave_index])]
  return features[max_index]

In [6]:
def buildTree(df,tree=None):
    target = df.keys()[-1]   #To make the code generic, changing target variable class name
    features = df.keys()[:-1]
    if len(features) == 0:
      target_values = df[target].unique()
      target_count = [  df[target].value_counts()[target_value]  for target_value in target_values]
      index = np.argmax(target_count)
      tree = target_values[index]
    else:
      #Here we build our decision tree

      #Get feature from C4.5
      node = C45(df)

      #Get distinct value of that feature
      variables = df[node].unique()

      #Create an empty dictionary to create tree
      if tree is None:
          tree={}
          tree[node] = {}

      #We make loop to construct a tree by calling this function recursively.
      #In this we check if the subset is pure and stops if it is pure.

      for variable in variables:

          subtable = get_subtable_no_node(df,node,variable)
          clValue,counts = np.unique(subtable[target],return_counts=True)

          if len(counts)==1:#Checking purity of subset
              tree[node][variable] = clValue[0]
          else:
              tree[node][variable] = buildTree(subtable) #Calling the function recursively

    return tree

In [9]:
import pprint
t=buildTree(c)
pprint.pprint(t)

{'income': {'$0to$15K': 'high',
            '$15Kto$35K': {'debt': {'high': {'credithistory': {'bad': 'high',
                                                               'good': 'moderate',
                                                               'unknown': 'high'}},
                                    'low': 'moderate'}},
            'over$35K': {'credithistory': {'bad': 'moderate',
                                           'good': 'low',
                                           'unknown': 'low'}}}}
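
The nested dictionary can be read as a classifier by following one branch per feature until a leaf label is reached. The sketch below assumes the tree has exactly the `{feature: {value: subtree_or_label}}` shape printed above; the function name `predict_categorical` and the sample are illustrative, not part of the original notebook.

```python
def predict_categorical(tree, sample):
    """Walk a nested-dict tree of the form {feature: {value: subtree_or_label}}."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))              # the single feature tested at this node
        node = node[feature][sample[feature]]   # follow the branch for the sample's value
    return node

# The tree printed above, copied as a literal.
tree = {'income': {'$0to$15K': 'high',
                   '$15Kto$35K': {'debt': {'high': {'credithistory': {'bad': 'high',
                                                                      'good': 'moderate',
                                                                      'unknown': 'high'}},
                                           'low': 'moderate'}},
                   'over$35K': {'credithistory': {'bad': 'moderate',
                                                  'good': 'low',
                                                  'unknown': 'low'}}}}

# Hypothetical customer
sample = {'collateral': 'none', 'income': 'over$35K', 'debt': 'low', 'credithistory': 'good'}
print(predict_categorical(tree, sample))
```

A feature value that never appeared on that branch during training raises a `KeyError` here; a fuller implementation would need a fallback such as the majority class.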


## Classification and Regression Trees (CART)
- CART is a non-parametric decision tree learning technique that produces either classification or **regression** trees, depending on whether the target variable is categorical or numeric, respectively.

- Two issues with C4.5 are that it is quite slow on large data, since it needs many logarithm calculations, and that it produces non-binary trees.

- CART uses **Gini impurity** as the metric and builds binary trees only.



### Gini impurity

- Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.

- Given a set $\mathcal{S}$ with $k$ target classes, where $p_i$ is the fraction of samples in $\mathcal{S}$ that belong to class $i$, the Gini impurity is
$$\text{gini}(\mathcal{S})=\sum_{i=1}^k p_i (1-p_i) = 1-\sum_{i=1}^k p_i^2. $$

- Like entropy, the Gini impurity is 0 for a pure node, where all samples belong to the same target class. It is largest, $1-1/k$, when the samples are spread evenly over the $k$ classes, so it gets close to 1 only when there are many classes.

- Up to a constant factor, the Gini impurity is the first-order Taylor approximation of entropy around $p_i=1$.
Since $\ln(x) = (x-1) + O\left((x-1)^2\right)$ near $x=1$, we have $\log_2(x)=\frac{\ln(x)}{\ln(2)} \approx \frac{x-1}{\ln(2)}$, and therefore

$$ H(\mathcal{S})=-\sum_{i=1}^k p_i\log_2(p_i) \approx -\frac{1}{\ln(2)}\sum_{i=1}^k p_i(p_i-1)=\frac{\text{gini}(\mathcal{S})}{\ln(2)} $$
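
As a quick numerical check of the formulas above, the following sketch evaluates both measures on a few hand-picked class distributions. The helper names `gini_impurity` and `entropy_bits` are illustrative and separate from the `gini` function defined later in the notebook.

```python
import math

def gini_impurity(p):
    """gini = 1 - sum_i p_i^2 for a list of class probabilities."""
    return 1.0 - sum(q * q for q in p)

def entropy_bits(p):
    """H = -sum_i p_i log2(p_i), skipping zero probabilities."""
    return -sum(q * math.log2(q) for q in p if q > 0)

distributions = [
    [1.0, 0.0],                # pure node
    [0.9, 0.1],                # nearly pure
    [0.5, 0.5],                # two classes, evenly mixed
    [0.25, 0.25, 0.25, 0.25],  # four classes, evenly mixed
]
for p in distributions:
    print(p, round(gini_impurity(p), 4), round(entropy_bits(p), 4))
```

Both measures are 0 for the pure node and largest for the evenly mixed ones (Gini 0.5 and 0.75, entropy 1 and 2 bits). Computing the Gini impurity needs no logarithms.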





### Optimal splitting
If the feature has only two discrete categories, the split into two branches is obvious.

If the feature has more than two discrete categories, or is a continuous variable, we have to choose where to split:

- The optimal splitting is that **the node is split so that the Gini impurity of the children (more specifically the average of the Gini of the children weighted by their size) is minimized.**

- Mathematically, the set $\mathcal{S}$ is split into two sets $\mathcal{S}_1$ and $\mathcal{S}_2$ based on some feature and threshold $A$, i.e., $\mathcal{S}_1=\{\mathbf{x}\in\mathcal{S}: \mathbf{x}\in A\}$ and $\mathcal{S}_2=\{\mathbf{x}\in\mathcal{S}: \mathbf{x}\notin A\}$.
We need to find the optimal feature and threshold $A$ such that
$$\text{gini}(\mathcal{S},A) =  \frac{|\mathcal{S}_1|}{|\mathcal{S}|}\text{gini}(\mathcal{S}_1) + \frac{|\mathcal{S}_2|}{|\mathcal{S}|}\text{gini}(\mathcal{S}_2) $$
is minimized.

- **Algorithm**:
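   - Iterate through the sorted feature values, taking the midpoints between adjacent values as candidate thresholds.
   - Keep track of the number of samples per class on the left and on the right.
   - After each threshold, move one sample from right to left by incrementing/decrementing the class counts by 1. From the counts we can compute the Gini impurity of each split without rescanning the data.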
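
The steps above can be sketched for a single numeric feature as follows. This is a minimal illustration using plain Python lists and the same five points as the example below; the function name `scan_thresholds` is an assumption, and the notebook's own `best_split` further down does the same thing on a DataFrame across all features.

```python
from collections import Counter

def scan_thresholds(values, labels):
    """Return (threshold, weighted Gini) for every midpoint between distinct sorted values."""
    pairs = sorted(zip(values, labels))
    m = len(pairs)
    left = Counter()                           # class counts left of the threshold
    right = Counter(lbl for _, lbl in pairs)   # class counts right of the threshold
    results = []
    for i in range(1, m):
        lbl = pairs[i - 1][1]
        left[lbl] += 1       # move one sample from the right side to the left side
        right[lbl] -= 1
        if pairs[i][0] == pairs[i - 1][0]:
            continue         # no threshold fits between two equal values
        gini_left = 1.0 - sum((n / i) ** 2 for n in left.values())
        gini_right = 1.0 - sum((n / (m - i)) ** 2 for n in right.values())
        weighted = (i * gini_left + (m - i) * gini_right) / m
        results.append(((pairs[i][0] + pairs[i - 1][0]) / 2, weighted))
    return results

X = [1.5, 2.3, 1.7, 2.7, 2.9]
y = [1, 2, 1, 2, 3]
splits = scan_thresholds(X, y)
for thr, g in splits:
    print(round(thr, 2), round(g, 4))

best_thr, best_gini = min(splits, key=lambda t: t[1])
print('best:', round(best_thr, 2), round(best_gini, 4))
```

The four candidate thresholds are about 1.6, 2.0, 2.5 and 2.8, with weighted Gini 0.5, 0.2667, 0.4667 and 0.4, so the split at 2.0 is chosen.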




### For example

In [3]:
X = '1.5,2.3,1.7,2.7,2.9'.split(',')
y = '1,2,1,2,3'.split(',')
dataset ={'X':X,'y':y}
df = pd.DataFrame(dataset,columns=['X','y'])
df

Unnamed: 0,X,y
0,1.5,1
1,2.3,2
2,1.7,1
3,2.7,2
4,2.9,3


In [4]:
def gini(df):
    target = df.keys()[-1]
    classes = df[target].unique()
    m= len(df[target])
    num_parent = [sum(df[target] == c) for c in classes]
    best_gini = 1.0 - sum((n / m) ** 2 for n in num_parent)
    return best_gini


gini(df)

0.6399999999999999

In [12]:
import pprint
def get_subtable(df, node, best_thr):
  return df[df[node].astype('float') < best_thr].reset_index(drop=True), df[df[node].astype('float') > best_thr].reset_index(drop=True)
t=buildTree(df)  # uses the CART buildTree defined in the iris section below; run this cell after that one
pprint.pprint(t)

{'X': {'<2.0': '1', '>2.0': {'X': {'<2.8': '2', '>2.8': '3'}}}}


In [5]:
def best_split(df):
  """Find the best split for a node.
    "Best" means that the average impurity of the two children, weighted by their
     population, is the smallest possible. Additionally it must be less than the
     impurity of the current node.
     To find the best split, we loop through all the features, and consider all the
     midpoints between adjacent training samples as possible thresholds. We compute
     the Gini impurity of the split generated by that particular feature/threshold
     pair, and return the pair with smallest impurity.
        Returns:
            best_feature: Name of the feature for the best split, or None if no split is found.
            best_thr: Threshold to use for the split, or None if no split is found.
  """
  features   = df.keys()[:-1]
  target     = df.keys()[-1]
  target_values   = df[target].unique()
  n_target_values = len(target_values)
  m          = len(df[target])
  num_parent = [sum(df[target] == c) for c in target_values]

  best_gini = 1.0 - sum((n / m) ** 2 for n in num_parent)
  best_feature, best_thr = None, None
  # Loop through all features.

  for feature in features:
    # Sort data along selected feature (as numbers, so string-valued columns are not sorted lexicographically).
    thresholds, classes = zip(*sorted(zip(df[feature].astype(float), df[target])))
    num_left = [0]*n_target_values
    num_right = num_parent.copy()
    for i in range(1, m):
      c = classes[i - 1]
      for idx, val in enumerate(target_values):
        if c==val:
          num_left[idx]  += 1
          num_right[idx] -= 1


      gini_left = 1.0 - sum((num_left[idx] / i) ** 2 for idx in range(n_target_values) )
      gini_right = 1.0 - sum((num_right[idx] / (m-i) ) ** 2 for idx in range(n_target_values) )
      # The Gini impurity of a split is the weighted average of the Gini
      # impurity of the children.
      gini = (i * gini_left + (m - i) * gini_right) / m

      if thresholds[i] == thresholds[i - 1]:
        continue

      if gini < best_gini:
        best_gini = gini
        best_feature = feature
        best_thr = (float(thresholds[i]) + float(thresholds[i - 1])) / 2

  return best_feature, best_thr

In [None]:
best_feature, best_thr = best_split(df)
print(best_feature)
print(best_thr)

X
2.0


# Let's code this up on some test data

In [13]:
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.0,1.4,0.2,0.0
2,4.7,3.2,1.3,0.2,0.0
3,4.6,3.1,1.5,0.2,0.0
4,5.0,3.6,1.4,0.2,0.0


In [14]:
def get_subtable(df, node, best_thr):
  return df[df[node] < best_thr].reset_index(drop=True), df[df[node] > best_thr].reset_index(drop=True)

In [7]:
def buildTree(df,tree=None):
  target = df.keys()[-1]
  features = df.keys()[:-1]

  best_feature, best_thr = best_split(df)
  if best_feature is not None:
    node                 = best_feature

    #Create an empty dictionary to create tree
    if tree is None:
      tree={}
      tree[node] = {}

    df_left, df_right = get_subtable(df, node, best_thr)
    clValue_left, counts_left  = np.unique(df_left[target], return_counts=True)
    clValue_right,counts_right = np.unique(df_right[target],return_counts=True)
    left_variable  = '<' + str(best_thr)
    right_variable = '>' + str(best_thr)

    if len(counts_left)==1:
      tree[node][left_variable] = clValue_left[0]
    else:
      tree[node][left_variable] = buildTree(df_left)

    if len(counts_right)==1:
      tree[node][right_variable] = clValue_right[0]
    else:
      tree[node][right_variable] = buildTree(df_right)

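  else:
    # No split lowers the impurity: fall back to a majority-class leaf instead of returning None.
    tree = df[target].mode()[0]

  return tree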


In [15]:
import pprint
t=buildTree(df)
pprint.pprint(t)

{'petal length (cm)': {'<2.45': np.float64(0.0),
                       '>2.45': {'petal width (cm)': {'<1.75': {'petal length (cm)': {'<4.95': {'petal width (cm)': {'<1.65': np.float64(1.0),
                                                                                                                     '>1.65': np.float64(2.0)}},
                                                                                      '>4.95': {'petal width (cm)': {'<1.55': np.float64(2.0),
                                                                                                                     '>1.55': {'sepal length (cm)': {'<6.95': np.float64(1.0),
                                                                                                                                                     '>6.95': np.float64(2.0)}}}}}},
                                                      '>1.75': {'petal length (cm)': {'<4.85': {'sepal length (cm)': {'<5.95': np.float64(1.0),
                    

### We still need to write the prediction function, as well as pruning.

A more comprehensive implementation is available here: https://github.com/loginaway/DecisionTree.
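
As a starting point for the prediction function mentioned above, here is a minimal sketch for trees in the nested-dict form produced by the threshold-based `buildTree`, with branch keys `'<thr'` and `'>thr'`. The function name `predict` and the rule that a value exactly equal to the threshold goes to the right branch are assumptions. Pruning is not covered.

```python
def predict(tree, sample):
    """Classify one sample with a nested-dict tree whose branch keys are '<thr' and '>thr'."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))                # feature tested at this node
        branches = node[feature]
        left_key = next(k for k in branches if k.startswith('<'))
        right_key = '>' + left_key[1:]
        thr = float(left_key[1:])                 # threshold encoded in the key
        # Assumption: values equal to the threshold go to the right branch.
        if float(sample[feature]) < thr:
            node = branches[left_key]
        else:
            node = branches[right_key]
    return node

# The tree printed earlier for the five-point example
toy_tree = {'X': {'<2.0': '1', '>2.0': {'X': {'<2.8': '2', '>2.8': '3'}}}}
for x in (1.5, 2.3, 2.9):
    print(x, predict(toy_tree, {'X': x}))
```

Since `sample` is only indexed by feature name, a dictionary or a DataFrame row such as `df.iloc[0]` should both work.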