### Information Gain
- To pick which feature to split on, we need a way to measure how good a split is. This is where entropy and information gain come in.

$H(X)$, the Shannon entropy of a discrete random variable $X$, is $-\sum\limits_{i = 1}^{n} P(X_{i})\log_{2}P(X_{i})$ \
$P(X_{i}) = $ probability of occurrence of value $i$
- High entropy → All the classes are nearly equally likely
- Low entropy → A few classes are likely; most of the classes are rarely observed
- Assume $0 \cdot \log_{2} 0 = 0$
- For a completely homogeneous dataset (all True or all False): entropy is 0
- If the dataset is equally divided (same number of True and False): entropy is 1
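
The bullet points above can be checked with a small helper. This is a minimal sketch using only the standard library; the function name `entropy` and its counts-based signature are assumptions made for illustration, not part of the original notebook.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a distribution given as class counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:  # convention: 0 * log2(0) = 0, so empty classes are skipped
            p = c / total
            h -= p * math.log2(p)
    return h

print(entropy([4, 0]))  # completely homogeneous: entropy 0
print(entropy([2, 2]))  # equally divided: entropy 1
print(entropy([9, 5]))  # somewhere in between
```

Working from counts rather than probabilities keeps the call close to what `value_counts()` returns.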


### ID3 (Iterative Dichotomiser 3)
- The ID3 algorithm is used to build the decision tree
- It uses entropy (from information theory) to split on the attribute that gives the highest information gain
- It is a top-down greedy search of possible branches
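
The greedy top-down search described above can be sketched as a short recursive function. This is an illustrative sketch, not a reference implementation: the helper names (`entropy`, `information_gain`, `id3`) and the four-row `toy` dataset are hypothetical, and rows are plain dicts rather than a DataFrame to keep the example self-contained.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy of the target minus the weighted entropy after splitting on attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def id3(rows, attrs, target):
    """Return a nested dict {attribute: {value: subtree_or_label}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:   # pure node: make it a leaf
        return labels[0]
    if not attrs:               # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with the highest information gain
    best = max(attrs, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], remaining, target)
                   for value in sorted(set(r[best] for r in rows))}}

# Hypothetical toy data (not the golf dataset): Wind fully determines Play
toy = [
    {"Wind": "Weak", "Humidity": "High", "Play": "Yes"},
    {"Wind": "Weak", "Humidity": "Normal", "Play": "Yes"},
    {"Wind": "Strong", "Humidity": "High", "Play": "No"},
    {"Wind": "Strong", "Humidity": "Normal", "Play": "No"},
]
tree = id3(toy, ["Wind", "Humidity"], "Play")
print(tree)
```

Because the search is greedy, each split is chosen only on the gain at that node; the algorithm never revisits an earlier choice.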

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('golf.csv')

In [4]:
df

Unnamed: 0,Day,Outlook,Temperature,Humidity,Wind,Play Golf
0,D1,Sunny,Hot,High,Weak,No
1,D2,Sunny,Hot,High,Strong,No
2,D3,Overcast,Hot,High,Weak,Yes
3,D4,Rain,Mild,High,Weak,Yes
4,D5,Rain,Cool,Normal,Weak,Yes
5,D6,Rain,Cool,Normal,Strong,No
6,D7,Overcast,Cool,Normal,Strong,Yes
7,D8,Sunny,Mild,High,Weak,No
8,D9,Sunny,Cool,Normal,Weak,Yes
9,D10,Rain,Mild,Normal,Weak,Yes


In [5]:
df.columns

Index(['Day', 'Outlook', 'Temperature', 'Humidity', 'Wind', 'Play Golf'], dtype='object')

#### Step 1 → Using Shannon Entropy formula to determine H(PlayingGolf)

$H(X) = -\sum\limits_{i = 1}^{n} P(X_{i})\log_{2}P(X_{i})$ → $X$ is Play Golf

In [6]:
df['Play Golf'].value_counts()

Yes    9
No     5
Name: Play Golf, dtype: int64

In [7]:
H_PlayGolf = -9/14 * np.log2(9/14) - 5/14 * np.log2(5/14)
H_PlayGolf

0.9402859586706311

In [8]:
df.groupby('Outlook')['Play Golf'].value_counts()

Outlook   Play Golf
Overcast  Yes          4
Rain      Yes          3
          No           2
Sunny     No           3
          Yes          2
Name: Play Golf, dtype: int64

In [9]:
H_Overcast = -4/4 * np.log2(4/4)
H_Overcast

-0.0

In [10]:
H_Rain = -3/5 * np.log2(3/5) - 2/5 * np.log2(2/5)
H_Rain

0.9709505944546686

In [11]:
H_Sunny = -3/5 * np.log2(3/5) - 2/5 * np.log2(2/5)
H_Sunny

0.9709505944546686

$\text{InformationGain}(PlayingGolf, Outlook) = H(PlayingGolf) - \sum\limits_{v \in \{Sunny, Overcast, Rain\}} \frac{|S_{v}|}{|S|} H(S_{v})$

$|S| = 14$ (total number of rows)

$S_{v}$ = subset of rows where Outlook $= v$, for $v \in \{Sunny, Overcast, Rain\}$

#### Gain_Outlook = H_PlayGolf - Overcast/|S| * H_Overcast - Rain/|S| * H_Rain - Sunny/|S| * H_Sunny

In [12]:
Gain_Outlook = H_PlayGolf - 4/14 * H_Overcast - 5/14 * H_Rain - 5/14 * H_Sunny
Gain_Outlook

0.24674981977443933
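
The same subtraction of weighted branch entropies can be wrapped in a reusable helper. This is a minimal sketch; `entropy` and `information_gain` are hypothetical helper names, and the (Yes, No) counts are read off the `groupby` output for Outlook shown earlier.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) from class counts; zero counts are skipped."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, branch_counts):
    """Parent entropy minus the size-weighted entropy of each branch."""
    total = sum(parent_counts)
    weighted = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - weighted

# (Yes, No) counts per Outlook value
outlook = {"Overcast": (4, 0), "Rain": (3, 2), "Sunny": (2, 3)}
gain_outlook = information_gain((9, 5), list(outlook.values()))
print(round(gain_outlook, 4))  # 0.2467
```

Passing the counts for Temperature, Humidity, or Wind instead gives the other gains without retyping each fraction by hand.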

In [21]:
# Temperature
df.groupby('Temperature')['Play Golf'].value_counts()

Temperature  Play Golf
Cool         Yes          3
             No           1
Hot          No           2
             Yes          2
Mild         Yes          4
             No           2
Name: Play Golf, dtype: int64

In [22]:
H_Hot = -2/4 * np.log2(2/4) - 2/4 * np.log2(2/4)
print(H_Hot)

H_Mild = -4/6 * np.log2(4/6) - 2/6 * np.log2(2/6)
print(H_Mild)

# The category in the data is "Cool", so name the variable to match
H_Cool = -3/4 * np.log2(3/4) - 1/4 * np.log2(1/4)
print(H_Cool)

Gain_Temperature = H_PlayGolf - 4/14 * H_Hot - 6/14 * H_Mild - 4/14 * H_Cool
Gain_Temperature

1.0
0.9182958340544896
0.8112781244591328


0.02922256565895487

In [23]:
# Humidity
df.groupby('Humidity')['Play Golf'].value_counts()

Humidity  Play Golf
High      No           4
          Yes          3
Normal    Yes          6
          No           1
Name: Play Golf, dtype: int64

In [24]:
H_High = -3/7 * np.log2(3/7) - 4/7 * np.log2(4/7)
print(H_High)

H_Normal = -6/7 * np.log2(6/7) - 1/7 * np.log2(1/7)
print(H_Normal)

Gain_Humidity = H_PlayGolf - 7/14 * H_High - 7/14 * H_Normal
Gain_Humidity

0.9852281360342515
0.5916727785823275


0.15183550136234164

In [25]:
# Wind
df.groupby('Wind')['Play Golf'].value_counts()

Wind    Play Golf
Strong  No           3
        Yes          3
Weak    Yes          6
        No           2
Name: Play Golf, dtype: int64

In [26]:
# Wind
H_Strong = -3/6 * np.log2(3/6) - 3/6 * np.log2(3/6)
print(H_Strong)

H_Weak = -6/8 * np.log2(6/8) - 2/8 * np.log2(2/8)
print(H_Weak)

Gain_Wind = H_PlayGolf - 6/14 * H_Strong - 8/14 * H_Weak
Gain_Wind

1.0
0.8112781244591328


0.04812703040826949

## Summary of Gain_Outlook, Gain_Temperature, Gain_Humidity, Gain_Wind

#### Gain_Outlook = 0.247
#### Gain_Temperature = 0.029
#### Gain_Humidity = 0.152
#### Gain_Wind = 0.048

Outlook has the highest information gain, so it becomes the root of the tree.
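
Choosing the root is just taking the attribute with the largest gain. A small sketch of that step, using the gain values printed by the cells above; the `gains` dict and the variable names are assumptions for illustration.

```python
# Gains copied from the cell outputs above
gains = {
    "Outlook": 0.24674981977443933,
    "Temperature": 0.02922256565895487,
    "Humidity": 0.15183550136234164,
    "Wind": 0.04812703040826949,
}

root = max(gains, key=gains.get)                      # attribute with the highest gain
ranking = sorted(gains, key=gains.get, reverse=True)  # best to worst

print(root)     # Outlook
print(ranking)  # ['Outlook', 'Humidity', 'Wind', 'Temperature']
```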

<img align="left" src="dt1.png"     style=" width:400px; padding: 10px; " >

#### Next, determine the children of Sunny and Rain (Overcast has entropy 0, so it is already a leaf)

In [12]:
H_Sunny

0.9709505944546686

In [13]:
df[df['Outlook'] == 'Sunny']

Unnamed: 0,Day,Outlook,Temperature,Humidity,Wind,Play Golf
0,D1,Sunny,Hot,High,Weak,No
1,D2,Sunny,Hot,High,Strong,No
7,D8,Sunny,Mild,High,Weak,No
8,D9,Sunny,Cool,Normal,Weak,Yes
10,D11,Sunny,Mild,Normal,Strong,Yes


In [14]:
df[df['Outlook'] == 'Sunny'].groupby('Temperature')['Play Golf'].value_counts()

Temperature  Play Golf
Cool         Yes          1
Hot          No           2
Mild         No           1
             Yes          1
Name: Play Golf, dtype: int64

In [15]:
H_Cool = -1/1 * np.log2(1/1)
H_Cool

-0.0

In [16]:
H_Hot = -2/2 * np.log2(2/2)
H_Hot

-0.0

In [17]:
H_Mild = -1/2 * np.log2(1/2) - 1/2 * np.log2(1/2)
H_Mild

1.0

In [18]:
Gain_Sunny_Temp = H_Sunny - 1/5 * H_Cool - 2/5 * H_Hot - 2/5 * H_Mild
Gain_Sunny_Temp

0.5709505944546686

In [19]:
df[df['Outlook'] == 'Sunny'].groupby('Humidity')['Play Golf'].value_counts()

Humidity  Play Golf
High      No           3
Normal    Yes          2
Name: Play Golf, dtype: int64

- Both subsets are completely homogeneous (High is all No, Normal is all Yes), so the entropy of each is 0

In [20]:
H_High = 0
H_Normal = 0

In [21]:
Gain_Sunny_Humid = H_Sunny - 3/5 * H_High - 2/5 * H_Normal
Gain_Sunny_Humid

0.9709505944546686

In [22]:
df[df['Outlook'] == 'Sunny'].groupby('Wind')['Play Golf'].value_counts()

Wind    Play Golf
Strong  No           1
        Yes          1
Weak    No           2
        Yes          1
Name: Play Golf, dtype: int64

In [23]:
H_Strong = 1
H_Weak = -2/3 * np.log2(2/3) - 1/3 * np.log2(1/3) 
H_Weak

0.9182958340544896

In [24]:
Gain_Sunny_Wind = H_Sunny - 2/5 * H_Strong - 3/5 * H_Weak
Gain_Sunny_Wind

0.01997309402197489

#### Gain_Sunny_Temp = 0.57
#### Gain_Sunny_Humid = 0.97
#### Gain_Sunny_Wind = 0.02

Humidity has the highest gain within the Sunny subset, so it is the split under Sunny.

<img align="left" src="dt2.png"     style=" width:400px; padding: 10px; " >

In [25]:
df[df['Outlook'] == 'Rain']

Unnamed: 0,Day,Outlook,Temperature,Humidity,Wind,Play Golf
3,D4,Rain,Mild,High,Weak,Yes
4,D5,Rain,Cool,Normal,Weak,Yes
5,D6,Rain,Cool,Normal,Strong,No
9,D10,Rain,Mild,Normal,Weak,Yes
13,D14,Rain,Mild,High,Strong,No


In [26]:
df[df['Outlook'] == 'Rain'].groupby('Temperature')['Play Golf'].value_counts()

Temperature  Play Golf
Cool         No           1
             Yes          1
Mild         Yes          2
             No           1
Name: Play Golf, dtype: int64

- Cool is equally divided (1 Yes, 1 No), so its entropy is 1

In [27]:
H_Cool = 1

In [28]:
H_Mild = -2/3 * np.log2(2/3) - 1/3 * np.log2(1/3)
H_Mild

0.9182958340544896

In [29]:
Gain_Rain_Temp = H_Rain - 2/5 * H_Cool - 3/5 * H_Mild
Gain_Rain_Temp

0.01997309402197489

In [30]:
df[df['Outlook'] == 'Rain'].groupby('Humidity')['Play Golf'].value_counts()

Humidity  Play Golf
High      No           1
          Yes          1
Normal    Yes          2
          No           1
Name: Play Golf, dtype: int64

In [31]:
H_High = 1
H_Normal = -2/3 * np.log2(2/3) - 1/3 * np.log2(1/3)
H_Normal

0.9182958340544896

In [32]:
Gain_Rain_Humid = H_Rain - 2/5 * H_High - 3/5 * H_Normal
Gain_Rain_Humid

0.01997309402197489

In [33]:
df[df['Outlook'] == 'Rain'].groupby('Wind')['Play Golf'].value_counts()

Wind    Play Golf
Strong  No           2
Weak    Yes          3
Name: Play Golf, dtype: int64

- Both subsets are completely homogeneous (Strong is all No, Weak is all Yes), so the entropy of each is 0

In [34]:
H_Strong = 0
H_Weak = 0

In [35]:
# Weights follow the subset sizes: 2 of 5 Rain rows are Strong, 3 of 5 are Weak
Gain_Rain_Wind = H_Rain - 2/5 * H_Strong - 3/5 * H_Weak
Gain_Rain_Wind

0.9709505944546686

#### Gain_Rain_Temp = 0.02
#### Gain_Rain_Humid = 0.02
#### Gain_Rain_Wind = 0.97

Wind has the highest gain within the Rain subset, so it is the split under Rain.
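
Putting the three splits together (Outlook at the root, Humidity under Sunny, Wind under Rain, and Overcast as a pure Yes leaf), the finished tree can be written as a nested dict and used for prediction. This is a sketch of one possible representation; the `tree` layout and the `predict` helper are assumptions, not something the notebook defines.

```python
# Tree derived from the gains computed above
tree = {
    "Outlook": {
        "Overcast": "Yes",
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def predict(node, row):
    """Walk from the root to a leaf, following the row's attribute values."""
    while isinstance(node, dict):
        attr = next(iter(node))      # the attribute tested at this node
        node = node[attr][row[attr]]
    return node

# Rows D1 and D6 from the table at the top of the notebook
d1 = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}
d6 = {"Outlook": "Rain", "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong"}

print(predict(tree, d1))  # No
print(predict(tree, d6))  # No
```

Temperature never appears in the tree: at every node some other attribute had a higher gain.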

<img align="left" src="dt3.png"     style=" width:400px; padding: 10px; " >