<a href="https://colab.research.google.com/github/kplr-training/Statistics-With-Python/blob/main/Exercices/2_Entropie_Gain_dInformation_et_Arbre_de_decision.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **What is a Decision Tree?**
- A decision tree is a map of the possible outcomes of a series of related choices.
- It allows an individual or organization to weigh possible actions against one another based on their costs, probabilities, and benefits.

- As the name suggests, it uses a tree-like model of decisions.
- A decision tree can be used either to drive informal discussion or to map out an algorithm that predicts the best choice mathematically.

- A decision tree typically starts with a single node, which branches into possible outcomes.
- Each of those outcomes leads to additional nodes, which branch off into other possibilities.
- This gives it a tree-like shape.

![image.png](https://user-images.githubusercontent.com/123752166/222189308-b7ea9713-b0f3-4e7f-b19c-0ed86ab5c698.png)

**Import data**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
play_data = pd.read_csv('https://gist.githubusercontent.com/bigsnarfdude/515849391ad37fe593997fe0db98afaa/raw/f663366d17b7d05de61a145bbce7b2b961b3b07f/weather.csv')

In [None]:
play_data

Unnamed: 0,outlook,temperature,humidity,windy,play
0,overcast,hot,high,False,yes
1,overcast,cool,normal,True,yes
2,overcast,mild,high,True,yes
3,overcast,hot,normal,False,yes
4,rainy,mild,high,False,yes
5,rainy,cool,normal,False,yes
6,rainy,cool,normal,True,no
7,rainy,mild,normal,False,yes
8,rainy,mild,high,True,no
9,sunny,hot,high,False,no


**A decision tree for the above data**

![image.png](https://user-images.githubusercontent.com/123752166/222189681-ec77e040-8bf7-4365-af1e-86feb776fcd0.png)

## **2. Decision Tree Algorithm**



The algorithm can be summarized as:

1. At each stage (node), pick the best feature as the test condition.

2. Split the node into one branch per possible outcome of that test (child nodes).

3. Repeat the above steps on each child until every branch ends in a leaf node, i.e. its data is homogeneous or no test conditions remain.

- When you start to implement the algorithm, the first question is: 'How do we pick the starting test condition?'

- The answer to this question lies in the values of **'Entropy'** and **'Information Gain'**.
- Let us see what they are and how they affect the construction of our decision tree.

- **Entropy**: In a decision tree, entropy measures the impurity of a set of examples, i.e. how mixed its class labels are. If the data is completely homogeneous, the entropy is 0; if a two-class set is split evenly (50-50), the entropy is 1.

- **Information Gain**: Information gain is the decrease in entropy obtained when a node is split on an attribute.

- The attribute with the highest information gain is selected for splitting.
- Based on the computed values of entropy and information gain, we choose the best attribute at each step.
- **Entropy of play**
- Entropy(play) = - p(Yes) . log2 p(Yes) - p(No) . log2 p(No)
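
As a minimal sketch of the entropy formula above, the function below computes base-2 entropy from a list of class counts. The helper name `entropy` and the example counts are illustrative assumptions, not part of the exercise; they only reproduce the two boundary cases described above (evenly split data and homogeneous data).

```python
import math


def entropy(class_counts):
    """Base-2 entropy of a label distribution given as class counts."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count == 0:
            continue  # 0 * log2(0) is taken as 0 by convention
        p = count / total
        result -= p * math.log2(p)
    return result


print(entropy([7, 7]))   # evenly split two-class set -> 1.0
print(entropy([14, 0]))  # homogeneous set -> 0.0
```

The same function can be applied to the counts returned by `value_counts()` once they are converted to a list.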

In [None]:
play_data.play.value_counts()

yes    9
no     5
Name: play, dtype: int64

### #TO DO 
- Try to calculate the Entropy of play 

In [None]:
#Fill_here

**Information Gain**
- Information gain is the decrease in entropy after a dataset is split on an attribute.
- Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
- Gain(S, A) = Entropy(S) - ∑ [ p(A=v) . Entropy(S|A=v) ], where the sum runs over the values v of attribute A, p(A=v) is the fraction of rows in S with A = v, and Entropy(S|A=v) is the entropy of the target within those rows.
- We choose the attribute whose split yields the largest information gain.
- The next step is to calculate the information gain for every attribute.
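
The gain formula can be sketched as parent entropy minus the size-weighted entropy of the child subsets. The function names and the toy counts below are assumptions chosen for illustration (a hypothetical node with 4 "yes" and 4 "no" rows), not values from the weather data.

```python
import math


def entropy(class_counts):
    """Base-2 entropy of a label distribution given as class counts."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted


# Hypothetical node with 4 "yes" and 4 "no" rows, split two different ways.
print(information_gain([4, 4], [[4, 0], [0, 4]]))  # both branches pure -> 1.0
print(information_gain([4, 4], [[2, 2], [2, 2]]))  # branches as mixed as parent -> 0.0
```

Each inner list holds the class counts of one branch, so a split that produces pure branches recovers the full parent entropy as gain, while a split that leaves the class mix unchanged gains nothing.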

**Information Gain on splitting by Outlook**
- Gain(Play, Outlook) = Entropy(Play) - ∑ [ p(Outlook=v) . Entropy(Play|Outlook=v) ], where p(Outlook=v) is the fraction of rows with that outlook value
- Gain(Play, Outlook) = Entropy(Play) - [ p(Outlook=Sunny) . Entropy(Play|Outlook=Sunny) ] - [ p(Outlook=Overcast) . Entropy(Play|Outlook=Overcast) ] - [ p(Outlook=Rainy) . Entropy(Play|Outlook=Rainy) ]

In [None]:
play_data[play_data.outlook == 'sunny']

Unnamed: 0,outlook,temperature,humidity,windy,play
9,sunny,hot,high,False,no
10,sunny,hot,high,True,no
11,sunny,mild,high,False,no
12,sunny,cool,normal,False,yes
13,sunny,mild,normal,True,yes


### #TO DO 
- Try to calculate the Entropy of play if Outlook=Sunny 

In [None]:
# Entropy(Play|Outlook=Sunny)
#Fill_here

In [None]:
play_data[play_data.outlook == 'overcast']

Unnamed: 0,outlook,temperature,humidity,windy,play
0,overcast,hot,high,False,yes
1,overcast,cool,normal,True,yes
2,overcast,mild,high,True,yes
3,overcast,hot,normal,False,yes


In [None]:
# Entropy(Play|Outlook=overcast)
# Every overcast row has play = yes, so the subset is homogeneous and its entropy is 0

In [None]:
play_data[play_data.outlook == 'rainy']

Unnamed: 0,outlook,temperature,humidity,windy,play
4,rainy,mild,high,False,yes
5,rainy,cool,normal,False,yes
6,rainy,cool,normal,True,no
7,rainy,mild,normal,False,yes
8,rainy,mild,high,True,no


### #TO DO 
- Try to calculate the Entropy of play if Outlook=Rainy 

In [None]:
# Entropy(Play|Outlook=rainy)
#fill_here

**Gain on splitting by attribute outlook**

### #TO DO 
- Try to calculate the gain on splitting by attribute outlook:
- Gain(Play, Outlook) = Entropy(Play) - [ p(Outlook=Sunny) . Entropy(Play|Outlook=Sunny) ] - [ p(Outlook=Overcast) . Entropy(Play|Outlook=Overcast) ] - [ p(Outlook=Rainy) . Entropy(Play|Outlook=Rainy) ], where p(Outlook=v) is the fraction of rows with that outlook value

In [None]:
#Fill_here 

0.2467498197744391

**Other gains**
- Gain(Play, Temperature) = 0.029
- Gain(Play, Humidity) = 0.152
- Gain(Play, Windy) = 0.048

Conclusion: Outlook has the highest information gain (0.247), so it becomes the root of the tree.

![image.png](https://user-images.githubusercontent.com/123752166/222190048-e064451b-34e4-4392-a384-fc890e9449ef.png)

**Time to find the next splitting criterion**

In [None]:
play_data[play_data.outlook == 'overcast']

Unnamed: 0,outlook,temperature,humidity,windy,play
0,overcast,hot,high,False,yes
1,overcast,cool,normal,True,yes
2,overcast,mild,high,True,yes
3,overcast,hot,normal,False,yes


Conclusion: if outlook is overcast, play is always yes, so this branch becomes a leaf.

**Let's find the next splitting feature**

In [None]:
play_data[play_data.outlook == 'sunny']

Unnamed: 0,outlook,temperature,humidity,windy,play
9,sunny,hot,high,False,no
10,sunny,hot,high,True,no
11,sunny,mild,high,False,no
12,sunny,cool,normal,False,yes
13,sunny,mild,normal,True,yes


### #TO DO 
- Try to calculate the entropy of play if outlook=sunny : 

In [None]:
# Entropy(Play|Outlook=sunny)
#fill_here

**Information Gain for humidity**

### #TO DO 
- Try to calculate the information gain for humidity  : 


In [None]:
#fill_here

0.9709505944546686

**Information Gain for windy**
- Notation: value -> number of rows -> [count of play = yes (+), count of play = no (-)]
- False -> 3 -> [1+ 2-]
- True -> 2 -> [1+ 1-]

### #TO DO 
- Try to calculate the entropy of play when windy is False

In [None]:
#fill_here

### #TO DO 
- Try to calculate the information gain for windy  : 

In [None]:
#fill_here

0.01997309402197489

**Information Gain for temperature**
- hot -> 2 -> [0+ 2-]
- mild -> 2 -> [1+ 1-]
- cool -> 1 -> [1+ 0-]

### #TO DO 
- Try to calculate the information gain for temperature  : 

In [None]:
#fill_here

0.5709505944546686

Conclusion: Humidity is the best choice on the sunny branch
![image.png](https://user-images.githubusercontent.com/123752166/222190166-a2728fbb-38c2-4ab5-9c29-fb6a3ec5b305.png)

In [None]:
play_data[(play_data.outlook == 'sunny') & (play_data.humidity == 'high')]

Unnamed: 0,outlook,temperature,humidity,windy,play
9,sunny,hot,high,False,no
10,sunny,hot,high,True,no
11,sunny,mild,high,False,no


In [None]:
play_data[(play_data.outlook == 'sunny') & (play_data.humidity == 'normal')]

Unnamed: 0,outlook,temperature,humidity,windy,play
12,sunny,cool,normal,False,yes
13,sunny,mild,normal,True,yes


**Splitting the rainy branch**

In [None]:
play_data[play_data.outlook == 'rainy']

Unnamed: 0,outlook,temperature,humidity,windy,play
4,rainy,mild,high,False,yes
5,rainy,cool,normal,False,yes
6,rainy,cool,normal,True,no
7,rainy,mild,normal,False,yes
8,rainy,mild,high,True,no


### #TO DO 
- Try to calculate the entropy of play if outlook is rainy   : 

In [None]:
# Entropy(Play|Outlook=rainy)
# fill_here

**Information Gain for temperature**

- mild -> 3 -> [2+ 1-]
- cool -> 2 -> [1+ 1-]

### #TO DO 
- Try to calculate the information gain for temperature  : 

In [None]:
#fill_here

0.01997309402197489

**Information Gain for Windy**

### #TO DO 
- Try to calculate the information gain for windy  : 

In [None]:
#fill_here

0.9709505944546686

**Information Gain for Humidity**
- High -> 2 -> [1+ 1-]
- Normal -> 3 -> [2+ 1-]

### #TO DO 
- Try to calculate the entropy of play when humidity is normal on the rainy branch:

In [None]:
#Entropy_Play_Outlook_Rainy_Normal
#fill_here

### #TO DO 
- Try to calculate the information gain for humidity : 

In [None]:
#fill_here

0.01997309402197489

Conclusion: Windy is the best choice on the rainy branch

**Final Tree**

![image.png](https://user-images.githubusercontent.com/123752166/222190350-74511c3d-0ee8-46c3-b667-ccf475467654.png)
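
The walkthrough above can be condensed into a short recursive sketch. It is an illustration under assumptions, not the notebook's reference solution: the 14 rows are retyped from the tables shown above, the helper names `entropy`, `information_gain`, `build_tree` and `classify` are hypothetical, and plain Python is used instead of pandas so the block stays self-contained.

```python
import math
from collections import Counter

# Rows retyped from the tables above: outlook, temperature, humidity, windy, play
COLUMNS = ["outlook", "temperature", "humidity", "windy", "play"]
ROWS = [
    ("overcast", "hot", "high", False, "yes"),
    ("overcast", "cool", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("rainy", "mild", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
]
data = [dict(zip(COLUMNS, row)) for row in ROWS]


def entropy(rows, target):
    """Base-2 entropy of the target column over the given rows."""
    counts = Counter(row[target] for row in rows)
    return sum(-(c / len(rows)) * math.log2(c / len(rows)) for c in counts.values())


def information_gain(rows, attribute, target):
    """Entropy before the split minus the size-weighted entropy after it."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    weighted = sum(len(g) / len(rows) * entropy(g, target) for g in groups.values())
    return entropy(rows, target) - weighted


def build_tree(rows, attributes, target):
    """Return a label (leaf) or a nested dict {attribute: {value: subtree}}."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:   # homogeneous node: stop with a leaf
        return labels[0]
    if not attributes:          # no tests left: fall back to the majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}, key=str):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = build_tree(subset, remaining, target)
    return tree


def classify(tree, row):
    """Follow the branches matching the row until a leaf label is reached.

    Raises KeyError for attribute values that never appeared in training.
    """
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][row[attribute]]
    return tree


tree = build_tree(data, ["outlook", "temperature", "humidity", "windy"], "play")
print(tree)
print(classify(tree, {"outlook": "sunny", "temperature": "cool",
                      "humidity": "normal", "windy": False}))
```

On this data the sketch selects outlook at the root, humidity under sunny, and windy under rainy, which matches the splits derived step by step above.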