# **CIS 520: Machine Learning, Fall 2020**
# **Week 2, Worksheet 3**
## **Information Gain for Decision Trees**

- **Content Creators:** Lyle Ungar
- **Content Reviewers:** Michael Zhou, Tejas Srivastava
- **Acknowledgments/Citations:** Eric Eaton and https://www.python-course.eu/Decision_Trees.php

**Goal**: Understand how to compute information gain for a given dataset, both by hand and in a program.

**Instructions**: For the first dataset, compute the information gain by hand and fill in your answers. Then follow the functions for the second dataset to see how information gain can be computed in code.
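
For reference, the two quantities used throughout this worksheet can be written as below. The notation is chosen here for illustration and does not come from the worksheet code: $S$ is the dataset, $A$ the split attribute, $S_v$ the subset of $S$ where $A = v$, and $p_c$ the fraction of examples in a set whose target value is $c$.

```latex
H(S) = -\sum_{c} p_c \log_2 p_c
\qquad
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
```

Read this way, information gain is the entropy of the target before the split minus the size-weighted average entropy of the subsets after the split.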


In [None]:
#@title Setup
def InfoGain(data, attribute_name, target_name="class"):
    """
    Calculate the information gain obtained by splitting a dataset on one attribute.
    This function takes three parameters:
    1. data = the dataset containing the attribute and the target (pd DataFrame)
    2. attribute_name = the name of the feature to split on (string)
    3. target_name = the name of the target feature (string)
    """
    # Entropy of the target over the whole dataset
    total_entropy = entropy(data[target_name])

    # Distinct values of the split attribute and how often each occurs
    vals, counts = np.unique(data[attribute_name], return_counts=True)

    # Weighted average of the target entropy within each subset of the split
    weighted_entropy = np.sum([
        (counts[i] / np.sum(counts))
        * entropy(data.loc[data[attribute_name] == vals[i], target_name])
        for i in range(len(vals))
    ])

    # Information gain = entropy before the split - weighted entropy after it
    return total_entropy - weighted_entropy

def entropy(target):
    """
    Calculate the entropy (in bits) of a target column.
    The only parameter of this function is target, the column of class labels.
    """
    elem, count = np.unique(target, return_counts=True)
    probs = count / np.sum(count)
    return -np.sum(probs * np.log2(probs))

## Creating Data

In [None]:
import pandas as pd
import numpy as np


data = pd.DataFrame({"Ageover50":["True","False","False","True","True","True","False","False","True","False"],
                     "smoking":["True","True","False","True","True","True","False","False","True","False"],
                     "asthma":["True","True","True","True","True","True","False","True","True","True"],
                     "obese":["True","True","False","True","True","False","False","False","True","True"],
                     "diabetic":["yes","yes","no","yes","no","yes","no","no","yes","no"]}, 
                    columns=["Ageover50","smoking","asthma","obese","diabetic"])

features = data[["Ageover50","smoking","asthma","obese"]]
target = data["diabetic"]

data

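| | Ageover50 | smoking | asthma | obese | diabetic |
|---|---|---|---|---|---|
| 0 | True | True | True | True | yes |
| 1 | False | True | True | True | yes |
| 2 | False | False | True | False | no |
| 3 | True | True | True | True | yes |
| 4 | True | True | True | True | no |
| 5 | True | True | True | False | yes |
| 6 | False | False | False | False | no |
| 7 | False | False | True | False | no |
| 8 | True | True | True | True | yes |
| 9 | False | False | True | True | no |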



**Q1:** Compute the information gain (IG) for each split attribute below, with **diabetic** as the target attribute.

1. Compute IG for the split attribute **Ageover50**.
2. Compute IG for the split attribute **smoking**.
3. Compute IG for the split attribute **asthma**.
4. Compute IG for the split attribute **obese**.

Your answers should be accurate to 3 decimal places.
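
If the hand calculation is unfamiliar, the steps can be sketched on a small made-up example. The counts below are hypothetical and are not the worksheet dataset, and `binary_entropy` is a helper name introduced only for this illustration.

```python
import math

def binary_entropy(pos, neg):
    """Entropy in bits of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:  # by convention, 0 * log2(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result

# Hypothetical toy data: 8 examples, 4 "yes" and 4 "no".
# Splitting on a made-up attribute gives two subsets:
#   value A: 3 yes, 1 no      value B: 1 yes, 3 no
h_total = binary_entropy(4, 4)             # entropy before the split
h_a = binary_entropy(3, 1)                 # entropy of subset A
h_b = binary_entropy(1, 3)                 # entropy of subset B
weighted = (4 / 8) * h_a + (4 / 8) * h_b   # weight each subset by its share of the examples
info_gain = h_total - weighted

print(round(h_total, 3))    # 1.0
print(round(h_a, 3))        # 0.811
print(round(info_gain, 3))  # 0.189
```

The same three steps (entropy of the whole set, entropy of each subset, weighted difference) apply to each attribute in Q1.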

Assign your answers to the variable names below so that they can be checked.

In [None]:
Ageover50="ANSWER FOR Q1.1"
smoking="ANSWER FOR Q1.2"
asthma="ANSWER FOR Q1.3"
obese="ANSWER FOR Q1.4"

myInfoGain=[Ageover50,smoking,asthma,obese]

## Answers

**Do not edit the following cell, as it will assess your answers** (run all cells)

In [None]:
ans1= InfoGain(data,'Ageover50',target_name="diabetic")
ans2= InfoGain(data,'smoking',target_name="diabetic")
ans3= InfoGain(data,'asthma',target_name="diabetic")
ans4= InfoGain(data,'obese',target_name="diabetic")

ans=[ans1,ans2,ans3,ans4]

You don't have to compute information gain for all of the questions in order to check your answers.

In [None]:
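import math

for i in range(0, 4):
  # Accept answers within 0.001 of the reference value to allow for rounding or truncation
  try:
    correct = math.isclose(float(myInfoGain[i]), float(ans[i]), abs_tol=1e-3)
  except ValueError:
    correct = False  # the placeholder text has not been replaced with a number yet
  if correct:
    print("Q 1.%d :True- Good job" %(i+1))
  else:
    print("Q 1.%d :false-Try again" %(i+1))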

Q 1.1 :True- Good job
Q 1.2 :True- Good job
Q 1.3 :True- Good job
Q 1.4 :True- Good job


Good job! You have finished calculating the information gain manually. Now let's take a look at how to compute it in a program.

In [None]:
def InfoGain(data, attribute_name, target_name="class"):
    """
    Calculate the information gain obtained by splitting a dataset on one attribute.
    This function takes three parameters:
    1. data = the dataset containing the attribute and the target (pd DataFrame)
    2. attribute_name = the name of the feature to split on (string)
    3. target_name = the name of the target feature (string)
    """
    # Entropy of the target over the whole dataset
    total_entropy = entropy(data[target_name])

    # Distinct values of the split attribute and how often each occurs
    vals, counts = np.unique(data[attribute_name], return_counts=True)

    # Weighted average of the target entropy within each subset of the split
    weighted_entropy = np.sum([
        (counts[i] / np.sum(counts))
        * entropy(data.loc[data[attribute_name] == vals[i], target_name])
        for i in range(len(vals))
    ])

    # Information gain = entropy before the split - weighted entropy after it
    return total_entropy - weighted_entropy

In [None]:
## Calculating the functions mathematically

def entropy(target):
    """
    Calculate the entropy (in bits) of a target column.
    The only parameter of this function is target, the column of class labels.
    """
    elem, count = np.unique(target, return_counts=True)
    probs = count / np.sum(count)
    return -np.sum(probs * np.log2(probs))
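
A few boundary cases make a useful sanity check on any entropy function. The sketch below uses a dependency-free stand-in, `entropy_bits`, which is a hypothetical helper written for this illustration and not the worksheet's `entropy`. It assumes the labels arrive as a plain Python list.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Entropy in bits of a sequence of class labels (hypothetical helper)."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

pure = entropy_bits(["no", "no", "no", "no"])        # a single class: no uncertainty
balanced = entropy_bits(["yes", "no", "yes", "no"])  # two equally likely classes
four_way = entropy_bits(["a", "b", "c", "d"])        # four equally likely classes

print(pure, balanced, four_way)  # 0.0 1.0 2.0
```

Entropy is 0 for a pure set and $\log_2 k$ for $k$ equally likely classes, so a split that produces purer subsets has a lower weighted entropy and therefore a higher information gain.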

You may also want to consider the following questions:

- How would you calculate information gain for multiple split attributes?
- What are the differences between the information gain and the entropy of a dataset?
- What is the mathematical relationship between information gain and entropy for a given dataset?
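
One reading of the first question is to compute the gain for each candidate attribute in turn and compare the results, which is how a decision tree chooses its next split. The following is a minimal sketch of that idea in plain Python. The rows, the attribute names, and the helpers `entropy_bits` and `info_gain` are all hypothetical and are not part of the worksheet.

```python
import math
from collections import Counter, defaultdict

def entropy_bits(labels):
    """Entropy in bits of a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attribute, target):
    """Information gain of splitting `rows` (a list of dicts) on `attribute`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])  # target labels per attribute value
    weighted = sum(len(g) / len(rows) * entropy_bits(g) for g in groups.values())
    return entropy_bits([row[target] for row in rows]) - weighted

# Hypothetical toy data, not the worksheet dataset.
rows = [
    {"windy": "yes", "humid": "yes", "play": "no"},
    {"windy": "yes", "humid": "no",  "play": "no"},
    {"windy": "no",  "humid": "yes", "play": "yes"},
    {"windy": "no",  "humid": "no",  "play": "yes"},
]

# Compute the gain for every candidate attribute, then pick the largest.
gains = {a: info_gain(rows, a, "play") for a in ("windy", "humid")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy data `windy` separates the classes perfectly (gain of 1 bit) while `humid` tells us nothing (gain of 0), so `windy` would be chosen as the split.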