<div style="display: flex; align-items: center;">
  
  <div style="text-align: left;">
   <h2 style="font-size: 1.8em; margin-bottom: 0;"><b>Branching out decisions in a tree</b></h2>
   <br>
   <h3 style=" font-size: 1.2em;margin-bottom: 0;">Decision Trees and Ensemble Learning</h3>
   <h3 style="font-size: 1.2em; margin-bottom: 0; color: blue;"><i>Dr. Satadisha Saha Bhowmick</i></h3>
  </div>

  <div style="margin-right: 10px;"> 
    <img src="media/images/dsi-logo-600.png" align="right" alt="UC-DSI" scale="0.7;">
  </div>

</div>

<!-- ### Learning Loop -->

<div style="display: flex; align-items: center;gap: 5px;">
  <div style="flex: 1;">
    <h3>About Me</h3>
  <h4>Satadisha Saha Bhowmick, Ph.D.</h4>

  <div class="fragment"  style="font-size: 14px;">
    <h4>Affiliation</h4>
    <ul>
      <li>Postdoctoral Teaching Fellow <br> Data Science Institute, University of Chicago</li>
    </ul>
  </div>

  <div class="fragment"  style="font-size: 14px;">
    <h4>Courses I teach</h4>
    <ul>
      <li>Introduction to Data Science</li>
      <li>Mathematical Methods for Data Science</li>
      <li>Ethics, Fairness, Responsibility, and Privacy in Data Science</li>
      <li>Object Oriented Programming with Java</li>
    </ul>
  </div>

  <div class="fragment"  style="font-size: 14px;">
    <h4>Research Interests</h4>
    <ul>
      <li>Information Extraction</li>
      <li>Short Text Mining</li>
    </ul>
  </div>
  
  </div>
  <div style="flex: 1;">
    <img src="media/images/satadisha-photo.png" alt="Self" scale="0.3">
  </div>
</div>


In [4]:
from graphviz import Digraph
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap
import ipywidgets as widgets
from IPython.display import display, clear_output
from PIL import Image
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401
from ipywidgets import interact
df = pd.read_csv("data/data_for_tree_Oct22.csv")

### Today's Learning Outcomes
Course Module from DATA 119 Introduction to Data Science II

<div style="display: flex; gap: 2px;">

  <div style="flex: 1;">

  <ul>
    <li class="fragment">General understanding of tree models</li>
    <li class="fragment">Data-driven decision making with trees</li>
    <li class="fragment">Impurity functions to build decision boundaries for tree models</li>
  </ul>

  
  </div>

  <div style="flex: 1;">
  <ul>
    <li class="fragment">Using an ensemble of tree-based learners</li> 
    <li class="fragment"> Bagging</li>
    <li class="fragment"> Boosting</li>
  </ul>
  </div>

</div>

### Setting The Scene

<div style="display:flex; gap:20px;">
  <div style="flex:1;">
  <img src="media/images/bullet1.png" alt="tab1" scale="0.35;" style="width: 20%;">
  <p>Most data that is interesting<br> enough for prediction has<br> some inherent structure.</p>
  </div>
  <div style="flex:1;">
  <img src="media/images/bullet2.png" alt="tab1" scale="0.35;" style="width: 20%;">
  <p>Tree-based models exploit structure in the data to split it into multiple homogeneous subgroups.</p>
  <p>They approximate a (typically) discrete-valued target function by repeatedly segmenting the predictor space into more homogeneous regions.</p>
  <p>A learned tree represents a disjunction of conjunctions of constraints on the attribute values of the data.</p>
  </div>
  <div style="flex:1;">
  <img src="media/images/bullet3.png" alt="tab1" scale="0.35;" style="width: 20%;">
  <p><b>Advantages</b></p>
  <p>Training data need not be stored once the tree is constructed.</p>
  <p>Prediction is very fast: a test input only needs to traverse the tree from the root down to a leaf.</p>
  <p>Decision trees require no distance metric because splits are based on feature thresholds, not distances.</p>

  </div>
</div>
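
To make the "disjunction of conjunctions" idea concrete, here is a minimal sketch of a hand-written tree over the two features used later in this module. The function names and every threshold are hypothetical, chosen only for illustration; they are not learned from the course dataset. Each root-to-leaf path is a conjunction of threshold tests, and a class reached by several leaves is the disjunction of those paths. Prediction needs only a few comparisons and never computes a distance.

```python
def predict_species(diameter, height):
    """Toy decision tree; all thresholds are made up for illustration."""
    if diameter < 0.45:
        if height < 4.0:
            return "cherry"
        return "apple"
    if height < 6.0:
        return "apple"
    return "oak"


def is_apple(diameter, height):
    """The same 'apple' decision written as a disjunction of conjunctions.

    Each parenthesised term corresponds to one root-to-leaf path that
    ends in an 'apple' leaf of predict_species.
    """
    return (diameter < 0.45 and height >= 4.0) or (
        diameter >= 0.45 and height < 6.0
    )


# A test input only walks down the tree: at most two comparisons here.
print(predict_species(0.30, 5.0))
print(is_apple(0.30, 5.0))
```

The two functions agree on every input, which is the sense in which a tree "represents" a logical formula over attribute constraints.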

### Decision Tree: Example

- Assume a toy task with a dataset that contains several attributes of trees growing in a plot of land.
- Given only the $\color{blue}{\textbf{Diameter}}$ and $\color{blue}{\textbf{Height}}$ of a tree trunk, we must determine if it's an Apple, Cherry, or Oak tree. 
- To do this, we'll use a $\color{blue}{\textbf{Decision Tree}}$.

<i>Let's start by investigating the data!</i>

In [5]:
data = df[["Diameter", "Height"]]
print("Number of rows:", len(data))

# Number of instances per class
class_counts = (
    df["Family"]
    .value_counts()
    .sort_index()
    .rename_axis("Tree type")
    .reset_index(name="Count")
)

class_counts.loc["Total"] = ["Total", class_counts["Count"].sum()]
class_counts


Number of rows: 150


|       | Tree type | Count |
|-------|-----------|-------|
| 0     | apple     | 50    |
| 1     | cherry    | 50    |
| 2     | oak       | 50    |
| Total | Total     | 150   |


<img src="media/images/tree-data.png" alt="Tree Data" scale="0.55;" style="width: 90%;">

### Decision Tree: Example

Learned trees can also be thought of as <span style="color:blue;"><i>sets of if-then rules</i></span> progressively dividing the feature space!

<img src="media/gif/decision_tree_growth.gif" alt="Decision Tree Example" scale="0.55;" style="width: 90%;">
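
One way to see the if-then rules directly is to fit a shallow tree and print it as text. The sketch below assumes scikit-learn is installed and uses synthetic stand-in data, not the course CSV: the class centres, spreads, and the `max_depth=2` setting are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the course dataset): 50 points per class,
# each drawn around a made-up (Diameter, Height) centre.
centres = {"apple": (0.3, 4.0), "cherry": (0.5, 7.0), "oak": (0.9, 12.0)}
X_parts, y = [], []
for label, (d, h) in centres.items():
    X_parts.append(rng.normal(loc=(d, h), scale=(0.05, 0.8), size=(50, 2)))
    y += [label] * 50
X = np.vstack(X_parts)

# A shallow tree keeps the printed rule set small and readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders each split as a nested threshold rule.
rules = export_text(clf, feature_names=["Diameter", "Height"])
print(rules)
```

Each indented line in the printout is one threshold test, and each leaf line names the predicted class, so reading from the top to a leaf gives one if-then rule.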