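## What is a Decision Tree?
- A flowchart-like model: each internal node tests a feature, each branch is an outcome of that test, and each leaf holds a prediction.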
### 1. DT Structure
1) Root Node : holds the entire dataset and makes the initial decision (the first split). <br>
2) Internal Node : tests an attribute and has two or more outgoing branches. <br>
3) Branches : the outcome of a decision, leading to another node. <br>
4) Leaf Nodes : the final decision/prediction; no further split is made. <br>
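
The four parts above can be sketched as a small data structure. This is only an illustration, not the implementation used later in this notebook: the `Node` class, its field names, the `predict` helper, and the hand-picked thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical structure for illustration)."""
    feature: Optional[str] = None      # attribute tested at this node (None for a leaf)
    threshold: Optional[float] = None  # go left if value < threshold, else right
    left: Optional["Node"] = None      # branch taken when the test is true
    right: Optional["Node"] = None     # branch taken when the test is false
    prediction: Optional[str] = None   # class label stored at a leaf

    def is_leaf(self) -> bool:
        return self.prediction is not None


def predict(node: Node, row: dict) -> str:
    """Follow branches from the root until a leaf is reached."""
    while not node.is_leaf():
        node = node.left if row[node.feature] < node.threshold else node.right
    return node.prediction


# Hand-built tree: the thresholds are made up for the example, not learned from data.
root = Node(
    feature="petal length", threshold=2.5,
    left=Node(prediction="Iris-setosa"),
    right=Node(
        feature="petal width", threshold=1.7,
        left=Node(prediction="Iris-versicolor"),
        right=Node(prediction="Iris-virginica"),
    ),
)

print(predict(root, {"petal length": 1.4, "petal width": 0.2}))  # Iris-setosa
print(predict(root, {"petal length": 5.1, "petal width": 2.0}))  # Iris-virginica
```

The root and the inner `petal width` node are decision nodes, the `left`/`right` references are the branches, and the three nodes with a `prediction` are the leaves.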

### 2. Procedure
1) Choose a splitting criterion such as Gini impurity, entropy, or information gain, and find the best split among the independent variables.
2) Partition the data at the threshold that the criterion selects.
3) Repeat on each subset until every branch ends in a leaf node (the node is pure or a stopping criterion is met).

<div>
<img src="DecisionTree.png" width="700">
</div>
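
The three steps of the procedure can be sketched as a short recursive function. This is a minimal sketch under stated assumptions: numeric features only, Gini impurity as the criterion, binary splits of the form `value < threshold`, and a nested-dict tree. The names `best_split`, `build`, and `max_depth`, as well as the toy data, are made up for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Step 1: return (feature_index, threshold, weighted_gini) of the best split, or None."""
    best = None
    for f in range(len(rows[0])):
        # Skipping the smallest value guarantees both sides of the split are non-empty.
        for t in sorted({r[f] for r in rows})[1:]:
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


def build(rows, labels, depth=0, max_depth=3):
    """Steps 2-3: partition the data at the best split and recurse on each part."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or the depth limit is reached.
    if gini(labels) == 0.0 or depth == max_depth:
        return majority
    split = best_split(rows, labels)
    if split is None:  # every feature is constant, nothing left to split on
        return majority
    f, t, _ = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] >= t]
    return {
        "feature": f,
        "threshold": t,
        "left": build([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


# Toy data (made up): two numeric features, two classes.
rows = [(1.0, 5.0), (2.0, 4.0), (3.0, 7.0), (6.0, 1.0), (7.0, 2.0), (8.0, 3.0)]
labels = ["no", "no", "no", "yes", "yes", "yes"]
tree = build(rows, labels)
print(tree)  # {'feature': 0, 'threshold': 6.0, 'left': 'no', 'right': 'yes'}
```

On this toy data a single split on the first feature separates the classes, so both children are pure and the recursion stops after one level.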



### 3. Metrics for splitting
- These metrics measure impurity. We prefer splits with smaller values, because they make each node purer with respect to the target. Example) all Yes in one node and all No in the other.
- Gini is simpler to compute than entropy (no logarithm) and usually gives similar results.
1) Gini impurity : probability of misclassifying a randomly chosen instance if it were labeled at random according to the node's class distribution
- Gini : $ 1- \sum_{i=1}^{n}(p_i)^2 $ where $p_i$ is the proportion of instances in the node that belong to class $i$.
2) Entropy : amount of uncertainty or impurity.
- Entropy : $ - \sum_{i=1}^{n}p_i\log_2(p_i) $
3) Information Gain : reduction in entropy or Gini impurity after a split
- $ Entropy_{parent}-\sum_{i=1}^{k}(\frac{|D_i|}{|D|}*Entropy(D_i)) $ where $D_i$ is the $i$-th of the $k$ subsets of $D$ after splitting
4) Regression trees use an error measure such as MSE, MAPE, or MAE instead
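
To make the formulas concrete, here is a small worked example in plain Python. It is a sketch that assumes a made-up node of 5 "yes" and 5 "no" labels split into two children. The helper names mirror the formulas above and are separate from the pandas functions defined later in this notebook.

```python
import math
from collections import Counter


def class_probs(labels):
    """Proportion p_i of each class in the node."""
    n = len(labels)
    return [c / n for c in Counter(labels).values()]


def gini(labels):
    return 1.0 - sum(p ** 2 for p in class_probs(labels))


def entropy(labels):
    return -sum(p * math.log2(p) for p in class_probs(labels))


def information_gain(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(c) / n * impurity(c) for c in children)
    return impurity(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]       # mostly yes
right = ["yes"] + ["no"] * 4      # mostly no

print(round(gini(parent), 4), round(entropy(parent), 4))
print(round(gini(left), 4), round(entropy(left), 4))
print(round(information_gain(parent, [left, right]), 4))
print(round(information_gain(parent, [left, right], impurity=gini), 4))
```

For this toy split the parent has Gini 0.5 and entropy 1.0, each child has Gini 0.32 and entropy of about 0.72, so the gain is about 0.28 measured with entropy and 0.18 measured with Gini.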

### 4. Advantages and Disadvantages
- Advantage
1) Simplicity and Interpretability : Simple and easy to understand
2) Versatility : handles both regression and classification
3) No normalization or scaling of the data is required
4) Non-parametric, so it can capture non-linear relationships
- Disadvantage
1) Overfitting : a deep tree with many nodes can memorize the training data
2) Instability : small variations in the data can produce very different trees
3) Bias towards features with more levels : features with more levels can dominate the tree structure.
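
The bias towards features with many levels is easy to reproduce. The sketch below assumes a multiway split (one branch per distinct value, as in ID3-style trees) and made-up data. `multiway_gain`, `weather`, and `row_id` are hypothetical names for the example.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def multiway_gain(feature, labels):
    """Information gain when the node gets one branch per distinct feature value."""
    groups = defaultdict(list)
    for value, label in zip(feature, labels):
        groups[value].append(label)
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())


labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
weather = ["sun", "sun", "rain", "rain", "sun", "rain", "rain", "sun"]  # 2 levels
row_id = list(range(len(labels)))                                      # 8 levels, one per row

print(round(multiway_gain(weather, labels), 4))
print(round(multiway_gain(row_id, labels), 4))
```

`row_id` carries no real information about the target, yet it reaches the maximum possible gain (1.0 here, versus about 0.19 for `weather`) because every branch contains a single row and is trivially pure.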

### 5. Pruning
- Pruning is used to avoid overfitting. It reduces the size of the tree by removing nodes that contribute little predictive power.
1) Pre-pruning : stops the tree from growing once a stopping criterion is met (e.g. maximum depth or minimum number of samples in a node)
2) Post-pruning : removes branches from a fully grown tree that do not add significant predictive power.
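
Both ideas can be sketched in a few lines. The code below is an illustration under stated assumptions, not a reference implementation: trees are nested dicts with `feature`, `threshold`, `left`, and `right` keys, and any non-dict value is a leaf label. `should_stop` and its parameters are hypothetical. `prune` is a simplified reduced-error style pruning that takes the majority class from held-out validation data.

```python
from collections import Counter


def should_stop(depth, n_samples, best_gain,
                max_depth=3, min_samples_split=5, min_gain=0.01):
    """Pre-pruning: True when the node should become a leaf instead of being split."""
    return depth >= max_depth or n_samples < min_samples_split or best_gain < min_gain


def predict(tree, row):
    """Walk a nested-dict tree; any non-dict value is a leaf label."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] < tree["threshold"] else tree["right"]
    return tree


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(labels)


def prune(tree, rows, labels):
    """Post-pruning: replace a subtree by a single leaf when that does not
    lower accuracy on the validation rows that reach it."""
    if not isinstance(tree, dict) or not labels:
        return tree
    f, t = tree["feature"], tree["threshold"]
    left = [(r, y) for r, y in zip(rows, labels) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] >= t]
    # Prune bottom-up: children first, then decide about this node.
    tree = dict(
        tree,
        left=prune(tree["left"], [r for r, _ in left], [y for _, y in left]),
        right=prune(tree["right"], [r for r, _ in right], [y for _, y in right]),
    )
    leaf = Counter(labels).most_common(1)[0][0]
    return leaf if accuracy(leaf, rows, labels) >= accuracy(tree, rows, labels) else tree


# A fully grown tree whose second split (threshold 8.0) was fitted to noise.
full_tree = {
    "feature": 0, "threshold": 5.0,
    "left": "no",
    "right": {"feature": 0, "threshold": 8.0, "left": "yes", "right": "no"},
}
# Made-up validation data.
val_rows = [(1.0,), (2.0,), (6.0,), (7.0,), (9.0,), (10.0,)]
val_labels = ["no", "no", "yes", "yes", "yes", "yes"]

pruned = prune(full_tree, val_rows, val_labels)
print(pruned)  # {'feature': 0, 'threshold': 5.0, 'left': 'no', 'right': 'yes'}
```

The noisy subtree is collapsed into a single `yes` leaf because the leaf classifies the validation rows at least as well, while the root split is kept because removing it would lower accuracy.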

##### Reference Link : https://www.geeksforgeeks.org/decision-tree/

### 6. Code Implementation from Scratch
##### Reference link : https://anderfernandez.com/en/blog/code-decision-tree-python-from-scratch/

In [1]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
iris = fetch_ucirepo(id=53)['data']['original']
  
# data (as pandas dataframes) 
X = iris.drop(columns=['class'])
y = iris['class']


In [2]:
X.head(5)

| | sepal length | sepal width | petal length | petal width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [35]:
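import pandas as pd  # needed here for pd.concat

# Binarize each feature with a fixed threshold (the values look like the column means)
X2 = X.assign(
    sl = X['sepal length']>5.843333,
    sw = X['sepal width']>3.054000,
    pl = X['petal length']>3.758667,
    pw = X['petal width']>1.198667
)[['sl','sw','pl','pw']]
# Put the binary features and the target back into one frame
iris2 = pd.concat([X2,y],axis=1)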

In [36]:
# Drop setosa so that only two classes remain
iris3 = iris2.loc[iris2['class']!='Iris-setosa',:]

In [37]:
import pandas as pd
import numpy as np

def gini_impurity(y):
  '''
  Given a Pandas Series, it calculates the Gini impurity.
  y: Series of class labels for which to calculate the Gini impurity.
  '''
  if isinstance(y, pd.Series):
    # Class proportions in the node
    p = y.value_counts()/y.shape[0]
    gini = 1-np.sum(p**2)
    return gini

  else:
    # raise needs an exception instance, not a plain string
    raise TypeError('Object must be a Pandas Series.')

# Gini impurity of the True branch of each binary feature
# (the False branch is not included here)
print(
gini_impurity(iris3.loc[iris3['sl']==True,'class']),
gini_impurity(iris3.loc[iris3['sw']==True,'class']),
gini_impurity(iris3.loc[iris3['pl']==True,'class']),
gini_impurity(iris3.loc[iris3['pw']==True,'class'])
)

0.46693877551020413 0.4351999999999999 0.49716730257833275 0.49382716049382713


In [39]:
iris3_subset = iris3.loc[iris3['sw']==True,:]

print(
gini_impurity(iris3_subset.loc[iris3_subset['sl']==True,'class']),
gini_impurity(iris3_subset.loc[iris3_subset['sw']==True,'class']),
gini_impurity(iris3_subset.loc[iris3_subset['pl']==True,'class']),
gini_impurity(iris3_subset.loc[iris3_subset['pw']==True,'class'])
)


0.4351999999999999 0.4351999999999999 0.4351999999999999 0.4351999999999999


In [None]:
iris3_subset = iris3.loc[iris3['sw']==False,:]

print(
gini_impurity(iris3_subset.loc[iris3_subset['sl']==True,'class']),
gini_impurity(iris3_subset.loc[iris3_subset['sw']==True,'class']),
gini_impurity(iris3_subset.loc[iris3_subset['pl']==True,'class']),
gini_impurity(iris3_subset.loc[iris3_subset['pw']==True,'class'])
)

In [40]:
iris3_subset

| | sl | sw | pl | pw | class |
|---|---|---|---|---|---|
| 50 | True | True | True | True | Iris-versicolor |
| 51 | True | True | True | True | Iris-versicolor |
| 52 | True | True | True | True | Iris-versicolor |
| 56 | True | True | True | True | Iris-versicolor |
| 65 | True | True | True | True | Iris-versicolor |
| 70 | True | True | True | True | Iris-versicolor |
| 85 | True | True | True | True | Iris-versicolor |
| 86 | True | True | True | True | Iris-versicolor |
| 100 | True | True | True | True | Iris-virginica |
| 109 | True | True | True | True | Iris-virginica |


In [None]:
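def entropy(y):
  '''
  Given a Pandas Series, it calculates the entropy.
  y: Series of class labels for which to calculate the entropy.
  '''
  if isinstance(y, pd.Series):
    # Class proportions in the node
    a = y.value_counts()/y.shape[0]
    # The small constant 1e-9 guards against log2(0)
    result = np.sum(-a*np.log2(a+1e-9))
    return result

  else:
    # raise needs an exception instance, not a plain string
    raise TypeError('Object must be a Pandas Series.')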


print(
entropy(iris3.loc[iris3['sl']==True,'class']),
entropy(iris3.loc[iris3['sw']==True,'class']),
entropy(iris3.loc[iris3['pl']==True,'class']),
entropy(iris3.loc[iris3['pw']==True,'class'])
)

In [None]:
def variance(y):
  '''
  Function to help calculate the variance avoiding nan.
  y: variable to calculate variance to. It should be a Pandas Series.
  '''
  if(len(y) == 1):
    return 0
  else:
    return y.var()

def information_gain(y, mask, func=entropy):
  '''
  It returns the Information Gain of a split given an impurity function.
  y: target variable.
  mask: boolean Series defining the split (True rows go to one branch, False rows to the other).
  func: function to be used to calculate Information Gain in case of classification.
  '''
  
  a = sum(mask)             # rows in the True branch
  b = mask.shape[0] - a     # rows in the False branch
  
  if(a == 0 or b ==0): 
    ig = 0
  
  else:
    if y.dtypes != 'O':
      # Numeric target (regression): use variance reduction
      ig = variance(y) - (a/(a+b)* variance(y[mask])) - (b/(a+b)*variance(y[~mask]))
    else:
      # ~mask is the idiomatic way to invert a boolean mask
      ig = func(y)-a/(a+b)*func(y[mask])-b/(a+b)*func(y[~mask])
  
  return ig

In [None]:
# information_gain(data['obese'], data['Gender'] == 'Male')

In [None]:
import itertools

def categorical_options(a):
  '''
  Creates all possible combinations from a Pandas Series.
  a: Pandas Series from where to get all possible combinations. 
  '''
  a = a.unique()

  opciones = []
  for L in range(0, len(a)+1):
      for subset in itertools.combinations(a, L):
          subset = list(subset)
          opciones.append(subset)

  # Drop the empty subset (first) and the full set (last): neither splits the data
  return opciones[1:-1]

def max_information_gain_split(x, y, func=entropy):
  '''
  Given a predictor & target variable, returns the best split, the error and the type of variable based on a selected cost function.
  x: predictor variable as Pandas Series.
  y: target variable as Pandas Series.
  func: function to be used to calculate the best split.
  '''

  split_value = []
  ig = [] 

  numeric_variable = x.dtypes != 'O'

  # Create options according to variable type
  if numeric_variable:
    options = x.sort_values().unique()[1:]
  else: 
    options = categorical_options(x)

  # Calculate ig for all values
  for val in options:
    mask = x < val if numeric_variable else x.isin(val)
    val_ig = information_gain(y, mask, func)
    # Append results
    ig.append(val_ig)
    split_value.append(val)

  # If there are no candidate splits, return False as the last element
  if len(ig) == 0:
    return(None,None,None, False)

  else:
  # Get results with highest IG
    best_ig = max(ig)
    best_ig_index = ig.index(best_ig)
    best_split = split_value[best_ig_index]
    return(best_ig,best_split,numeric_variable, True)


# weight_ig, weight_slpit, _, _ = max_information_gain_split(data['Weight'], data['obese'],)  


# print(
#   "The best split for Weight is when the variable is less than ",
#   weight_slpit,"\nInformation Gain for that split is:", weight_ig
# )
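
One thing worth noticing about the categorical candidates: enumerating every proper, non-empty subset lists each partition twice (a subset and its complement define the same split), and the count grows exponentially with the number of levels. The sketch below shows this with the standard library only. `proper_subsets` is a hypothetical stand-in that follows the same idea as `categorical_options`, and the colour levels are made up.

```python
import itertools


def proper_subsets(values):
    """All non-empty subsets of `values` except the full set."""
    subsets = []
    for size in range(1, len(values)):
        subsets.extend(list(c) for c in itertools.combinations(values, size))
    return subsets


levels = ["red", "green", "blue"]
candidates = proper_subsets(levels)
print(candidates)

# A split is really a partition {subset, complement}; count the distinct ones.
partitions = {
    frozenset([frozenset(s), frozenset(set(levels) - set(s))])
    for s in candidates
}
print(len(candidates), len(partitions))  # 6 3
```

With k levels there are 2^k - 2 candidate subsets but only 2^(k-1) - 1 distinct binary partitions, so roughly half of the information gain evaluations are redundant, and high-cardinality features become expensive quickly.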

In [None]:
sepal_ig, sepal_split, _, _ = max_information_gain_split(X['sepal length'], y)  


print(
  "The best split for sepal length is when the variable is less than ",
  sepal_split,"\nInformation Gain for that split is:", sepal_ig
)

In [None]:
# Find the best split for every feature column
X.apply(max_information_gain_split, y = y)