In [1]:
%load_ext nb_black

## Day 29 Lecture 2 Assignment

In this assignment, we will learn about entropy and information gain in the ID3 algorithm.

In [2]:
import numpy as np
import pandas as pd
import ssl

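# Workaround for certificate errors when downloading the CSV.
# Note: this disables HTTPS certificate verification for the whole session.
ssl._create_default_https_context = ssl._create_unverified_context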

In [3]:
tennis = pd.read_csv(
    "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/tennis_decision.csv"
)
tennis = tennis.set_index("Day")

In [4]:
print(tennis.shape)
tennis

(14, 5)


Day,Outlook,Temp.,Humidity,Wind,Decision
1,Sunny,Hot,High,Weak,No
2,Sunny,Hot,High,Strong,No
3,Overcast,Hot,High,Weak,Yes
4,Rain,Mild,High,Weak,Yes
5,Rain,Cool,Normal,Weak,Yes
6,Rain,Cool,Normal,Strong,No
7,Overcast,Cool,Normal,Strong,Yes
8,Sunny,Mild,High,Weak,No
9,Sunny,Cool,Normal,Weak,Yes
10,Rain,Mild,Normal,Weak,Yes


Write a function to compute entropy given an input of a sequence of probabilities.
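As a minimal standalone sketch of what this prompt asks for, the function below computes Shannon entropy in bits from a sequence of probabilities. The name `entropy`, the sum-to-one check, and the choice to treat a zero probability as contributing nothing are assumptions made for illustration; they are not part of the assignment.

```python
import numpy as np


def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    p = np.asarray(probs, dtype=float)
    # Assumption: the input is a full distribution, so it should sum to 1
    if not np.isclose(p.sum(), 1.0):
        raise ValueError("probabilities must sum to 1")
    # By convention 0 * log2(0) is treated as 0, so drop zero entries
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


# A fair two-way split carries exactly one bit of uncertainty
print(entropy([0.5, 0.5]))  # 1.0
```

A pure node (all probability on one class) has entropy 0, and entropy is highest when the classes are evenly mixed.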

In [25]:
class Entropy:
    """
    Weighted average entropy of one-vs-rest splits on categorical features.

    params:
        df: Categorical dataframe with no nulls
        target: The column name of the target variable of the model

    get_wae_d: Returns a dict: {feature: {value: weighted average entropy}}

    display_wae: Takes a feature name and displays the weighted average
                 entropy of each value of that feature

    display_best_questions: Displays, for each feature, the best question
                            (the value with the smallest weighted average entropy)
    """
    
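    def __init__(self, df, target):
        self.df = df
        self.target = target
        # Feature columns are every column except the target
        self.columns = df.drop(columns=[target]).columns
        # Aliases to the bound methods; call them as self.col_values_d() / self.wae_d()
        self.col_values_d = self.get_col_values_d
        self.wae_d = self.get_wae_d
        # For each feature, the value whose split has the lowest weighted average entropy
        self.best_questions = {
            k: min(v, key=v.get) for k, v in self.wae_d().items()
        }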

    def get_col_values_d(self):
        # Returns {feature: [value, value, ...], ...}
        col_values_d = dict()
        for col in self.columns:
            # Use the dataframe passed to the class, not the global `tennis`
            col_values_d[col] = self.df[col].unique()

        return col_values_d

    def get_entropy(self, probs):
        # Returns the entropy (in bits) of the given probabilities
        entropy = 0
        for prob in probs:
            # Skip zero probabilities: 0 * log2(0) would evaluate to nan
            if prob > 0:
                entropy += -prob * np.log2(prob)

        return entropy

    def get_wae_d(self):
        # Returns {feature: {value: weighted average entropy}, ...}
        entropy_d = dict()

        # For each value of each feature, split the rows into
        # "feature == value" (yes) and "feature != value" (no),
        # then average the two target entropies weighted by subset size
        for k, v in self.col_values_d().items():
            value_dict = dict()
            for value in v:
                yes = self.df[self.df[k] == value]
                no = self.df[self.df[k] != value]
                yes_probs = yes[self.target].value_counts(normalize=True)
                no_probs = no[self.target].value_counts(normalize=True)
                yes_entropy = self.get_entropy(yes_probs)
                no_entropy = self.get_entropy(no_probs)
                yes_weight = yes.shape[0] / self.df.shape[0]
                no_weight = no.shape[0] / self.df.shape[0]
                weighted_average_entropy = (
                    yes_weight * yes_entropy + no_weight * no_entropy
                )
                value_dict[value] = weighted_average_entropy
            entropy_d[k] = value_dict

        return entropy_d
    
    def display_wae(self, column):
        # Display the weighted average entropy of each value of one feature
        print(column, "\n")
        for key, value in self.wae_d()[column].items():
            print(f"{key}:", round(value, 5), "\n")

    def display_best_questions(self):
        # Display, per feature, the value with the smallest weighted average entropy
        for k, v in self.best_questions.items():
            print(k, ":", v, "\n")


In [17]:
tennis_entropy = Entropy(tennis, "Decision")

Aggregate the tennis decision table for each value of each column. Start with Outlook below. Compute the weighted mean of the entropy for outlook (the weighted mean of the yes decision and the no decision).
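The weighted mean that the `Entropy` class computes for a single question of the form "is feature $f$ equal to value $v$?" can be written out as follows. The symbols $S$, $S_v$, $H$, $p_c$ and $\mathrm{WAE}$ are notation introduced here for illustration and do not appear in the assignment.

$$
H(S) = -\sum_{c} p_c \log_2 p_c,
\qquad
\mathrm{WAE}(f = v) = \frac{|S_v|}{|S|}\, H(S_v) + \frac{|S \setminus S_v|}{|S|}\, H(S \setminus S_v)
$$

Here $S$ is the full table, $S_v$ is the subset of rows where $f = v$, and $p_c$ is the fraction of rows in a subset whose decision is class $c$. As a sanity check, the Overcast result printed below (0.71429) equals $10/14$, which is consistent with the Overcast rows all sharing one decision (entropy 0) while the remaining 10 of the 14 rows are evenly split (entropy 1). That breakdown is inferred from the output rather than read from the table shown above.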

In [18]:
tennis_entropy.display_wae("Outlook")

Outlook 

Sunny: 0.83804 

Overcast: 0.71429 

Rain: 0.9371 



Compute the weighted mean of the entropy for temperature, humidity and wind as well and decide based on these values which should be the first variable chosen for a split.

In [19]:
tennis_entropy.wae_d()

{'Outlook': {'Sunny': 0.8380423950607804,
  'Overcast': 0.7142857142857143,
  'Rain': 0.9371011056259821},
 'Temp.': {'Hot': 0.9152077851647805,
  'Mild': 0.9389462162661898,
  'Cool': 0.9253298887416583},
 'Humidity': {'High': 0.7884504573082896, 'Normal': 0.7884504573082896},
 'Wind': {'Weak': 0.8921589282623617, 'Strong': 0.8921589282623617}}

In [20]:
tennis_entropy.display_best_questions()

Outlook : Overcast 

Temp. : Hot 

Humidity : High 

Wind : Weak 
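
The results above score one-vs-rest questions (for example, "Outlook is Overcast" against every other row), which is why the two values of a two-valued feature such as Humidity or Wind receive identical scores. Textbook ID3 instead splits a feature into one branch per value and chooses the feature with the highest information gain, that is, the parent entropy minus the weighted mean of the branch entropies. The sketch below shows that variant under stated assumptions: the function names `label_entropy` and `information_gain` and the four-row `toy` frame are hypothetical and are not taken from the assignment or its data.

```python
import numpy as np
import pandas as pd


def label_entropy(labels: pd.Series) -> float:
    """Entropy, in bits, of the class distribution in a column of labels."""
    probs = labels.value_counts(normalize=True).to_numpy()
    probs = probs[probs > 0]  # guard against zero-count categories
    return float(-(probs * np.log2(probs)).sum())


def information_gain(df: pd.DataFrame, feature: str, target: str) -> float:
    """Parent entropy minus the size-weighted entropy of a multiway split."""
    parent = label_entropy(df[target])
    # Fraction of rows that fall into each branch of the split
    weights = df[feature].value_counts(normalize=True)
    children = sum(
        weight * label_entropy(df.loc[df[feature] == value, target])
        for value, weight in weights.items()
    )
    return parent - children


# Hypothetical toy data: "a" separates the labels perfectly, "b" not at all
toy = pd.DataFrame(
    {
        "a": ["x", "x", "y", "y"],
        "b": ["p", "q", "p", "q"],
        "label": ["Yes", "Yes", "No", "No"],
    }
)
gains = {col: information_gain(toy, col, "label") for col in ["a", "b"]}
best_feature = max(gains, key=gains.get)  # "a" for this toy frame
```

On the assignment data, the same function could be called as `information_gain(tennis, "Outlook", "Decision")` for each feature column, and the feature with the largest gain would be the candidate for the first split.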


