In [1]:
%reload_ext nb_black

## Day 29 Lecture 2 Assignment

In this assignment, we will learn about entropy and information gain in the ID3 algorithm.

In [2]:
import numpy as np
import pandas as pd

In [3]:
tennis = pd.read_csv(
    "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/tennis_decision.csv"
)

In [4]:
tennis

Unnamed: 0,Day,Outlook,Temp.,Humidity,Wind,Decision
0,1,Sunny,Hot,High,Weak,No
1,2,Sunny,Hot,High,Strong,No
2,3,Overcast,Hot,High,Weak,Yes
3,4,Rain,Mild,High,Weak,Yes
4,5,Rain,Cool,Normal,Weak,Yes
5,6,Rain,Cool,Normal,Strong,No
6,7,Overcast,Cool,Normal,Strong,Yes
7,8,Sunny,Mild,High,Weak,No
8,9,Sunny,Cool,Normal,Weak,Yes
9,10,Rain,Mild,Normal,Weak,Yes


#### Write a function to compute entropy given an input of a sequence of probabilities.

In [5]:
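def entropy(probs):
    """Shannon entropy (in bits) of a sequence of probabilities."""
    e = 0
    for pi in probs:
        # By convention 0 * log2(0) counts as 0; skipping it avoids a NaN
        if pi > 0:
            e += -pi * np.log2(pi)

    return e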
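
For reference, the function above follows the usual definition of Shannon entropy in bits. The symbols $p_i$ (the probability of class $i$) and $k$ (the number of classes) are notation introduced here, not names from the notebook:

```latex
H(p_1, \dots, p_k) = -\sum_{i=1}^{k} p_i \log_2 p_i,
\qquad \text{with the convention } 0 \log_2 0 := 0 .
```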

In [6]:
# Before any split, the Decision column has class proportions of...
probs = tennis["Decision"].value_counts(normalize=True)
probs

Yes    0.642857
No     0.357143
Name: Decision, dtype: float64

In [7]:
# Before any split, the Decision column has entropy of...
entropy(probs)

0.9402859586706311
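
As a sanity check, the same number can be reproduced by hand. This is a minimal sketch that assumes the printed proportions 0.642857 and 0.357143 correspond to 9/14 and 5/14; the variable names are illustrative only.

```python
import math

# Assumption: the printed proportions are 9/14 ("Yes") and 5/14 ("No").
p_yes = 9 / 14
p_no = 5 / 14

# Entropy in bits: -sum(p * log2(p)) over the two classes.
h = -(p_yes * math.log2(p_yes)) - (p_no * math.log2(p_no))

print(round(h, 4))
```

If the assumption holds, the printed value should agree with the entropy shown above to four decimal places.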

In [8]:
# It will be convenient later to have a function that calculates
# entropy directly from a pandas column (pandas calls a single column a 'Series')
def series_entropy(series):
    probs = series.value_counts(normalize=True)
    return entropy(probs)

In [9]:
# Confirm we get the same answer
series_entropy(tennis["Decision"])

0.9402859586706311

#### Aggregate the tennis decision table for each value of each column. Start with Outlook below. Compute the weighted mean of the entropy for outlook (the weighted mean of the yes decision and the no decision).

If we used the question "Is the outlook 'Sunny'?", we would have the splits of:

In [10]:
yes_sunny = tennis[tennis["Outlook"] == "Sunny"]
no_sunny = tennis[tennis["Outlook"] != "Sunny"]

The entropy for our decision when it is sunny is:

In [11]:
series_entropy(yes_sunny["Decision"])

0.9709505944546686

The entropy for our decision when it isn't sunny is:

In [12]:
series_entropy(no_sunny["Decision"])

0.7642045065086203

The weighted average of these is:

In [13]:
yes_sunny_percent = yes_sunny.shape[0] / tennis.shape[0]
no_sunny_percent = no_sunny.shape[0] / tennis.shape[0]

(
    series_entropy(yes_sunny["Decision"]) * yes_sunny_percent
    + series_entropy(no_sunny["Decision"]) * no_sunny_percent
)

0.8380423950607804

It could be convenient to make that a function too.

Below I'm making the process generic and confirming it comes to the same answer:

In [14]:
df = tennis
target_column = "Decision"
x_column = "Outlook"
value = "Sunny"

yes_value = df[df[x_column] == value]
no_value = df[df[x_column] != value]

yes_entropy = series_entropy(yes_value[target_column])
no_entropy = series_entropy(no_value[target_column])

yes_percent = yes_value.shape[0] / df.shape[0]
no_percent = no_value.shape[0] / df.shape[0]


yes_entropy * yes_percent + no_entropy * no_percent

0.8380423950607804

Rewriting the above process as a function

In [15]:
def question_entropy(df, target_column, x_column, value):
    yes_value = df[df[x_column] == value]
    no_value = df[df[x_column] != value]

    yes_entropy = series_entropy(yes_value[target_column])
    no_entropy = series_entropy(no_value[target_column])

    yes_percent = yes_value.shape[0] / df.shape[0]
    no_percent = no_value.shape[0] / df.shape[0]

    return yes_entropy * yes_percent + no_entropy * no_percent

Confirm the function gets the same answer (i.e., check that nothing was copied or pasted incorrectly):


In [16]:
question_entropy(df=tennis, target_column="Decision", x_column="Outlook", value="Sunny")

0.8380423950607804

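Now we can compare which question is best to ask of the "Outlook" column. A lower weighted entropy means the question separates the "Decision" column more cleanly.

According to the numbers below, "Is it overcast?" has the lowest weighted entropy (about 0.714), so it is the best question to ask of the "Outlook" column in order to separate out the "Decision" column.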

In [17]:
question_entropy(df=tennis, target_column="Decision", x_column="Outlook", value="Sunny")

0.8380423950607804

In [18]:
question_entropy(
    df=tennis, target_column="Decision", x_column="Outlook", value="Overcast"
)

0.7142857142857143

In [19]:
question_entropy(df=tennis, target_column="Decision", x_column="Outlook", value="Rain")

0.9371011056259821

We could go further and write a function that loops over every value the column takes:

In [20]:
# Weighted entropy of the question "Is x_column == value?" for each unique value
def series_question_entropies(df, target_column, x_column):
    entropies = {}
    for value in df[x_column].unique():
        question = f"Is {x_column}=={value}?"
        entropies[question] = question_entropy(df, target_column, x_column, value)

    return entropies

This function gets the same results and we confirm that the best question to ask is: `'Is Outlook==Overcast?'`

In [21]:
series_question_entropies(tennis, "Decision", "Outlook")

{'Is Outlook==Sunny?': 0.8380423950607804,
 'Is Outlook==Overcast?': 0.7142857142857143,
 'Is Outlook==Rain?': 0.9371011056259821}
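
The same comparison can be expressed as information gain, which is the entropy before the split minus the weighted entropy after it. The sketch below simply reuses the numbers printed in this notebook; the names `parent_entropy`, `weighted`, and `gains` are illustrative and not part of the assignment.

```python
# Entropy of the Decision column before any split (printed earlier in the notebook).
parent_entropy = 0.9402859586706311

# Weighted entropies of the three Outlook questions (printed above).
weighted = {
    "Is Outlook==Sunny?": 0.8380423950607804,
    "Is Outlook==Overcast?": 0.7142857142857143,
    "Is Outlook==Rain?": 0.9371011056259821,
}

# Information gain = entropy before the split - weighted entropy after it.
gains = {question: parent_entropy - h for question, h in weighted.items()}

# The best question is the one with the largest gain (equivalently, the lowest entropy).
best = max(gains, key=gains.get)

for question, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{question}  gain = {gain:.4f}")
print("best:", best)
```

Because the parent entropy is the same for every question, ranking by highest gain and ranking by lowest weighted entropy always give the same order.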

#### Compute the weighted mean of the entropy for temperature, humidity and wind as well and decide based on these values which should be the first variable chosen for a split.

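Across all four columns, the best question to ask is still `'Is Outlook==Overcast?'`, since it has the lowest weighted entropy. Humidity and Wind each take only two values, so both questions on those columns produce the same two groups and therefore the same weighted entropy.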

In [22]:
series_question_entropies(tennis, "Decision", "Outlook")

{'Is Outlook==Sunny?': 0.8380423950607804,
 'Is Outlook==Overcast?': 0.7142857142857143,
 'Is Outlook==Rain?': 0.9371011056259821}

In [23]:
series_question_entropies(tennis, "Decision", "Temp.")

{'Is Temp.==Hot?': 0.9152077851647805,
 'Is Temp.==Mild?': 0.9389462162661898,
 'Is Temp.==Cool?': 0.9253298887416583}

In [24]:
series_question_entropies(tennis, "Decision", "Humidity")

{'Is Humidity==High?': 0.7884504573082896,
 'Is Humidity==Normal?': 0.7884504573082896}

In [25]:
series_question_entropies(tennis, "Decision", "Wind")

{'Is Wind==Weak?': 0.8921589282623617, 'Is Wind==Strong?': 0.8921589282623617}
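
The questions above are binary ("is the column equal to this value or not?"). ID3 as it is usually described instead splits a column into one branch per value and scores the whole column at once. Below is a hedged sketch of that variant; `entropy_of`, `multiway_split_entropy`, and the four-row `toy` frame are hypothetical names and made-up data for illustration, not the tennis dataset.

```python
import numpy as np
import pandas as pd


def entropy_of(series):
    """Entropy (in bits) of the class proportions in a pandas Series."""
    p = series.value_counts(normalize=True)
    p = p[p > 0]  # treat 0 * log2(0) as 0
    return float(-(p * np.log2(p)).sum())


def multiway_split_entropy(df, target_column, x_column):
    """Weighted entropy of the target after one branch per value of x_column."""
    # Fraction of rows that fall into each branch.
    weights = df[x_column].value_counts(normalize=True)
    # Entropy of the target inside each branch.
    per_value = df.groupby(x_column)[target_column].apply(entropy_of)
    # pandas aligns both Series on the branch value before multiplying.
    return float((weights * per_value).sum())


# Hypothetical toy data (NOT the tennis table), chosen so the results are easy to see.
toy = pd.DataFrame(
    {
        "Outlook": ["Sunny", "Sunny", "Overcast", "Rain"],
        "Wind": ["Weak", "Strong", "Weak", "Strong"],
        "Decision": ["No", "No", "Yes", "Yes"],
    }
)

for column in ["Outlook", "Wind"]:
    print(column, multiway_split_entropy(toy, "Decision", column))
```

In the toy data every Outlook branch is pure, so its weighted entropy is zero, while each Wind branch is an even mix. Applying `multiway_split_entropy(tennis, "Decision", column)` to each column would give a per-column score that can be compared in the same way.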

We can go one step further and write a function that repeats this check for each column. It also sorts the results so we don't need to eyeball the numbers.

In [26]:
def question_entropies(df, target_column, x_columns):
    questions = []
    entropies = []
    for x_column in x_columns:
        # Use the function's own arguments rather than the global tennis frame
        ent_dict = series_question_entropies(df, target_column, x_column)
        questions.extend(ent_dict.keys())
        entropies.extend(ent_dict.values())

    ent_df = pd.DataFrame({"question": questions, "entropy": entropies})
    return ent_df.sort_values("entropy").reset_index(drop=True)

Confirm we get the same answers:

In [27]:
question_entropies(
    df=tennis,
    target_column="Decision",
    x_columns=["Outlook", "Temp.", "Wind", "Humidity"],
)

Unnamed: 0,question,entropy
0,Is Outlook==Overcast?,0.714286
1,Is Humidity==High?,0.78845
2,Is Humidity==Normal?,0.78845
3,Is Outlook==Sunny?,0.838042
4,Is Wind==Weak?,0.892159
5,Is Wind==Strong?,0.892159
6,Is Temp.==Hot?,0.915208
7,Is Temp.==Cool?,0.92533
8,Is Outlook==Rain?,0.937101
9,Is Temp.==Mild?,0.938946
