In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from IPython.display import Image

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Welcome to a full guide on Decision Trees 🌳🔎

# 1. Introduction

## Decision Trees are one of the most fundamental algorithms to understand if you want to make effective use of the more powerful methods built on top of them, such as [Random Forest](https://en.wikipedia.org/wiki/Random_forest) or [Boosting algorithms](https://en.wikipedia.org/wiki/Boosting_(machine_learning)).


#### Fortunately the idea behind Decision Trees is very intuitive and thus easy to grasp quickly. 
#### However, the actual calculation and math happening behind the scenes might be confusing to some.
#### So, I'm here to help you understand what actually happens behind this famous algorithm. 


#### Now, Decision Trees can be used for both linear and non-linear data but they shine the brightest when faced with non-linear data.
#### Additionally, they are also very easy to interpret, a bonus for users who are just beginning to use this algorithm for their work.


## Some fundamental facts about this algorithm.
#### They are a supervised learning algorithm and can be used for both [regression](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) and [classification](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) problems. 


#### In this notebook, I will focus mainly on Decision Trees for classification problems, which is their most common use. 

#### Now that I have presented you with a very brief tour of this algorithm, let us venture into the classification '*realm*' of Decision Trees!

<div style="width:100%;text-align: center;"> <img align=middle src="https://images.unsplash.com/photo-1474755032398-4b0ed3b2ae5c?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxzZWFyY2h8MXx8Y29tZSUyMHdpdGglMjBtZXxlbnwwfHwwfHw%3D&auto=format&fit=crop&w=500&q=60" alt="Heat beating" style="height:600px;margin-top:1rem;"> </div>

# 2. Classification

#### Now, a Decision Tree is a tree-based model (obviously from the name itself 😑).
#### The data that we present it with is split, level by level, using the questions or metrics which the model deems to be the **best option**. 

#### Now here is where you would have this question in mind.

<p style="text-align:center; font-size:35px; font-weight:bold; font-family:cursive;"> Best option? How does it determine which is the best option? </p>

#### Well, that is definitely a valid question and one which you should be asking.

#### To understand how the model arrives at its best option, you would need to first understand some crucial concepts.

# 2.1 Entropy

#### Now, yes I know, to some of you physics geeks out there, entropy might have a different meaning to you, such as this one.

<p style="text-align:center; font-size:20px; font-weight:bold; font-family:cursive;"> "a thermodynamic quantity representing the unavailability of a system's thermal energy for conversion into mechanical work, often interpreted as the degree of disorder or randomness in the system" </p>

#### When I first studied this algorithm, this was what I had in mind too, so don't be ashamed!
#### But I would say the meaning it has with this algorithm does share some similarities, when it comes to 'degree of disorder or randomness'.

### With respect to Decision Trees, Entropy refers to the **measure of impurity in data**.

#### You might be curious as to what this means and what part it plays in this model. Stay with me and I'll enlighten you!
#### I said that Decision Trees split the data that they receive at every level based on the metrics/features they deem to be the best fit. 
### The fact which you have to remember is that the **purpose of splitting up the data as such is to eventually arrive at an accurate conclusion**.


### So whenever we split the data the model receives, the ideal split would **put the data with the same labels into the same group**.
#### The opposite of this is what is meant by *'impurity in data'*

#### Thus, if we have a **low impurity**, it would mean that most of the data in that group belongs to the same class/label, which is a very effective split. 
#### On the other hand, if we have a **high impurity**, it would mean that the data's classes are all over the place and mixed up, which we could say is a poor split.

#### Now that the concept is out of the way, here is Entropy's formula.

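$$\LARGE Entropy = -\sum_{i=1}^{N} p_i \log_2(p_i) $$
#### Where $ N $ is the number of classes and $ p_i $ refers to the probability of randomly selecting an example in class $ i $. The logarithm is taken in base 2 here, so the entropy is measured in bits.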
#### To better understand the formula, allow me to demonstrate a very simple example.

## Example for Entropy

#### Let us say we have 20 rows of data, out of which 14 are labelled '1' and the rest are labelled '2'.
#### How would we calculate the entropy for this set of data that we have here?

$$\LARGE Entropy = -\frac{14}{20}\log_2(\frac{14}{20}) - \frac{6}{20}\log_2(\frac{6}{20}) \approx 0.881$$

#### That wasn't that hard, right?
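
#### If you would like to check the arithmetic yourself, here is a minimal sketch in plain Python. The function name `entropy` and its `base` parameter are my own choices for illustration, not part of any library, and I am assuming base-2 logarithms.

```python
import math
from collections import Counter

def entropy(labels, base=2):
    """Entropy of a list of class labels (base-2 logs by default)."""
    total = len(labels)
    counts = Counter(labels)  # how many rows fall in each class
    # Sum -p * log(p) over the classes that actually occur
    return sum(-(c / total) * math.log(c / total, base) for c in counts.values())

# 20 rows: 14 labelled '1' and 6 labelled '2', as in the example
labels = [1] * 14 + [2] * 6
print(round(entropy(labels), 2))  # about 0.88
```

#### A group where every row has the same label gives an entropy of 0, and a perfect 50/50 mix of two labels gives 1, which is the highest value possible for two classes.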

#### Now another natural question that should pop up in your head is 

<p style="text-align:center; font-size:35px; font-weight:bold; font-family:cursive;"> How is Entropy useful for Decision Trees making a decision? </p>

#### This is where I introduce another new concept.

# 2.2 Information Gain

## To start off with its definition, it is the **decrease in entropy after the dataset is split on the basis of an attribute**.

#### In simpler terms, the algorithm strives to seek the **greatest information gain**, which refers to the **largest decrease in entropy**, and as we know by now, **the lower the entropy, the better it is**.

#### Of course, we can't just ignore the mathematical definition.
#### The mathematical formula for Information Gain is

$$\LARGE IG(T,A) = Entropy(T) - \sum_{v \in A} \frac{|T_v|}{|T|} \cdot Entropy(T_v)$$

<div style="width:100%;text-align: center;"> <img align=middle src="https://images.unsplash.com/photo-1561948955-570b270e7c36?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxzZWFyY2h8N3x8c3VycHJpc2VkfGVufDB8fDB8fA%3D%3D&auto=format&fit=crop&w=500&q=60" alt="Heat beating" style="height:600px;margin-top:1rem;"> </div>

#### Don't worry, don't worry. 
#### Allow me to break that down for you.

#### IG refers to **I**nformation **G**ain. 
#### **T** refers to the set of data at the node we are deciding whether to split, and **A** is the feature (attribute) we are thinking of splitting on.
#### **v** runs over the possible values of **A**, $ T_v $ is the part of **T** where **A** takes the value **v**, and $ |T_v| / |T| $ is the fraction of the rows that end up in that part. 
#### To help you better understand, I'll present a simple example once more.

## Example for Information Gain

#### Let's assume I currently have a dataset with 100 rows, of which 50 belong to L1 and the other 50 to L2. 
#### From here, the algorithm has to decide which feature to split the data on, and the feature that I'm going to show you is called Feature 1, which has only two options, Yes or No.

#### Now, **43** out of the 100 rows are going to be split into 'Yes' and the remaining **57** will be split into 'No'.
#### Out of the **43** 'Yes', **35** belong to L1 and **8** belong to L2. 
#### Out of the **57** 'No', **15** belong to L1 and **42** belong to L2. 

<div style="width:100%;text-align: center;"> <img align=middle src='https://storage.googleapis.com/kagglesdsdata/datasets/2249576/3765708/photo_2022-06-08_22-37-38.jpg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20220608%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20220608T134351Z&X-Goog-Expires=345599&X-Goog-SignedHeaders=host&X-Goog-Signature=8524813f6a3b3c1b85c3d389d0144d53a0893e55918bb708617afefd9c9869309fa4e9fb3d9936386ff89b7af84ed68a90023aef1e7fac1c0d62bef68afe901f5d83f48f006cdff754793e23ba0ceed64163ebf0ee1442ddf52a71af542ac62488c13673d814f2d09f7b24ec8c5b1567aa3aa1a23ec43d7589618c290f700af28e66d23af8abc5817c0ec54fe800584afd52df8cad5cd056e9e85612fbb5eadc553ab9619b29fcfeea8051105a0a16da361dd901bf1bfec93e24e4a93efe4e3cc247d2580dd45ed6eee2c9b789ef4a8f872fe50a3bf01b038197e4d4bb0713c06b811f3217ee660f1e7c0b4fd08dacc584d974de6f414f2349371104b3490d81' alt="A visualization for the example above" style="height:600px;margin-top:1rem;"> </div>

#### Hope the poor quality drawing above makes the example clearer.

#### Now let us calculate $ Entropy(T) $, using base-2 logarithms.

$$\LARGE Entropy(T) = -\frac{1}{2} \cdot \log_2(\frac{1}{2}) -\frac{1}{2} \cdot \log_2(\frac{1}{2}) = 1$$

#### Alright, now let's calculate the rest.

$$\LARGE \sum_{v \in A} \frac{|T_v|}{|T|} \cdot Entropy(T_v) $$ $$\LARGE = \frac{43}{100} \cdot [-\frac{35}{43} \cdot \log_2(\frac{35}{43}) -\frac{8}{43} \cdot \log_2(\frac{8}{43})] + \frac{57}{100} \cdot [-\frac{15}{57} \cdot \log_2(\frac{15}{57}) -\frac{42}{57} \cdot \log_2(\frac{42}{57})] \approx 0.772$$

#### Now we can calculate the Information Gain which is 

$$\LARGE Entropy(T) - \sum_{v \in A} \frac{|T_v|}{|T|} \cdot Entropy(T_v) $$
$$\LARGE = 1 - 0.772 = 0.228 $$

#### There it is! How Information Gain actually works. 
#### Now by observing these calculations, I hope you got a better grasp of how the algorithm actually determines which features to split the data on, so that each split yields the most Information Gain possible.
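
#### The same calculation can be sketched in a few lines of Python. The helper names `entropy` and `information_gain` and the `base` parameter are hypothetical, chosen just for this illustration. With base-2 logarithms the gain for Feature 1 comes out at about 0.228, while passing `base=10` gives about 0.069. The base only rescales the numbers, so it does not change which feature wins.

```python
import math

def entropy(counts, base=2):
    """Entropy from per-class row counts, e.g. [35, 8]."""
    total = sum(counts)
    return sum(-(c / total) * math.log(c / total, base) for c in counts if c > 0)

def information_gain(parent_counts, child_counts, base=2):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child, base) for child in child_counts)
    return entropy(parent_counts, base) - weighted

parent = [50, 50]                # rows in L1 and L2 before the split
children = [[35, 8], [15, 42]]   # [L1, L2] counts in the 'Yes' and 'No' groups

print(round(information_gain(parent, children), 3))           # base 2
print(round(information_gain(parent, children, base=10), 3))  # base 10
```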

#### However, this is not all. 
#### There is one more concept that you need to know, alongside Information Gain, and that is called...

# 2.3 Gini Coefficient

#### Now this is a concept similar to Entropy, in the sense that it is used by the algorithm to evaluate impurity in data. In the context of Decision Trees you will more often see it called the Gini index or Gini impurity.

#### As with Entropy, the lower this coefficient, the purer the group, so lower is better. 

#### As always, do allow me to show you at least the mathematical formula for this.
#### Rest assured, you will find it way easier to understand compared to the earlier ones. 

$$\LARGE Gini = 1 - \sum_{i=1}^{N} (p_i)^2 $$
#### Where $ p_i $ refers to the probability of randomly selecting an example in class $ i $.
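
#### To see the formula in action, here is a small sketch that reuses the class counts from the earlier examples. The function name `gini` is just a label I picked for this illustration.

```python
def gini(counts):
    """Gini index from per-class row counts, e.g. [14, 6]."""
    total = sum(counts)
    # One minus the sum of squared class probabilities
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([14, 6]), 2))   # 14 rows of one class, 6 of the other
print(gini([50, 50]))            # a perfect 50/50 mix
print(gini([20, 0]))             # a pure group
```

#### For two classes, the value runs from 0 for a pure group up to 0.5 for a perfect 50/50 mix.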

#### This is significant because the [Classification and Regression Tree (CART) algorithm](https://www.analyticssteps.com/blogs/classification-and-regression-tree-cart-algorithm#:~:text=In%20the%20decision%20tree%2C%20the,of%20the%20Gini%20Index%20criterion.) uses the Gini Index to choose its binary splits.

#### One final question that might pop up is

<p style="text-align:center; font-size:35px; font-weight:bold; font-family:cursive;"> What is the difference between Gini Coefficient and Information Gain if you say they are similar? </p>

#### That is a very crucial question to ask and one which you deserve the answers to.

#### 1. The Gini Index tends to favor larger partitions and is simple to compute since it needs no logarithm, whereas Information Gain tends to favor smaller partitions with many distinct values.
#### 2. The Gini Index is used by the CART algorithm, as mentioned previously, while Information Gain is used in the [ID3](https://en.wikipedia.org/wiki/ID3_algorithm) and [C4.5](https://en.wikipedia.org/wiki/C4.5_algorithm) algorithms.

#### However, apart from these differences the link that brings all of these three concepts together can be summed up as such, which I think is very **important**!


<p style="text-align:center; font-size:35px; font-weight:bold; font-family:cursive;"> Gini Coefficient and Entropy are the criteria for calculating Information Gain. Decision Tree algorithms use Information Gain to split a node. </p>

#### This is how the algorithm knows which features to split the data on, which is what leads to an accurate and effective model.
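
#### To make the idea of picking a split concrete, here is a rough sketch that scores two candidate features under both criteria and keeps the one with the largest drop in impurity. Feature 1 uses the counts from the Information Gain example, while Feature 2 and all of the function names are made up purely for illustration. Real implementations also handle things like numeric thresholds and stopping rules, which I leave out here.

```python
import math

def entropy(counts):
    """Entropy (base 2) from per-class row counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini index from per-class row counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def impurity_decrease(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the child groups."""
    total = sum(parent)
    weighted = sum(sum(child) / total * impurity(child) for child in children)
    return impurity(parent) - weighted

parent = [50, 50]  # rows in L1 and L2 before the split
candidates = {
    "Feature 1": [[35, 8], [15, 42]],   # counts from the example above
    "Feature 2": [[30, 20], [20, 30]],  # hypothetical counts, made up for comparison
}

for name, impurity in [("entropy", entropy), ("gini", gini)]:
    best = max(candidates, key=lambda f: impurity_decrease(parent, candidates[f], impurity))
    print(name, "->", best)
```

#### Both criteria pick Feature 1 here, since it separates L1 from L2 far better than the made-up Feature 2 does.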

<div style="width:100%;text-align: center;"> <img align=middle src="https://images.unsplash.com/photo-1569974507005-6dc61f97fb5c?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxzZWFyY2h8Mnx8YWhhfGVufDB8fDB8fA%3D%3D&auto=format&fit=crop&w=500&q=60" alt="Heat beating" style="height:600px;margin-top:1rem;"> </div>

# 3. Conclusion

#### I hope the statement above really sparks a true understanding of Decision Trees inside you. 
#### I would say that if, after all of this, you truly understand the meaning of that single statement, you have understood one of the most fundamental and effective machine learning algorithms.

#### With this, I would like to conclude the guide on Decision Trees and sincerely hope you walk away more enlightened about this significant algorithm.
#### I hope that with this newly discovered knowledge, you will be able to better understand models built upon Decision Trees, such as Random Forest or Boosting algorithms. 

#### Please feel free to check out my other works, such as the one on [Linear Regression](https://www.kaggle.com/code/kimmik123/all-about-linear-regression) or [Support Vector Machines](https://www.kaggle.com/code/kimmik123/all-about-support-vector-machine). 
#### If you guys like my work, an upvote would go a long way! 
#### Till next time, cheers!

# 4. Credit

* https://thatascience.com/learn-machine-learning/gini-entropy/#:~:text=Gini%20index%20and%20entropy%20is,only%20one%20class%20is%20pure.
* https://medium.com/analytics-steps/understanding-the-gini-index-and-information-gain-in-decision-trees-ab4720518ba8
* https://www.analyticsvidhya.com/blog/2021/02/machine-learning-101-decision-tree-algorithm-for-classification/
