# Introduction

In the previous session, you learnt about the underlying concepts of decision trees and their interpretation. You also learnt how to construct a decision tree. You learnt about the advantages of decision trees and also understood how they help in solving the regression problems that cannot be handled by linear regression. In this session, you will learn about the methods for decision tree construction.

 

**In this session:**
- Splitting and homogeneity
- Impurity measures
- Best split
- Regression trees

# Splitting and homogeneity


Recall the scenario from session one where you were a doctor asking a series of questions to determine whether a person has heart disease or not. Based on your past experience with other patients, the answers to these questions would finally lead to a prediction depicting whether the person has heart disease or not. Now, if you're a doctor, you would know which questions to ask first, right? When a person might be at the risk of heart disease, the doctor would probably first ask/measure the cholesterol of the person. If it's higher than 300, the doctor can be pretty sure that the person is at risk. Then he may ask some more questions to confirm his predictions even further.

 

Now suppose instead of asking/measuring the cholesterol, the doctor first asks the age of the person. If the person says, say, 55, the doctor may not be completely sure whether the person has heart disease or not. He/She would need to ask further questions.

 

From both these scenarios, it is easy to infer that to predict the target variable (in this case, heart disease), some questions/attributes are more important for its prediction than others. In our case, **you saw that the cholesterol level is a more significant attribute than age in the prediction of heart disease**. Why is that? Because, for the doctor, it might be evident from past records that people with higher cholesterol levels have a high chance of having heart disease. Say, in his/her experience, the doctor has seen that out of every 100 patients who have cholesterol higher than 300, 90 had a heart disease while 10 did not. Also, say, out of 100 patients over the age of 54 that the doctor has consulted, 50 had a heart disease. Now, it is evident that 'cholesterol' is a better indicator of heart disease than 'age', since 90% of the patients having cholesterol greater than 300 were diagnosed with heart disease, compared with only 50% of the patients over the age of 54.

 

You can clearly see that one of the classes (disease) is significantly dominant over the other (non-disease) after splitting on the basis of cholesterol > 300 and this dominance is helping the doctor make a more confident prediction unlike 'age' which gives us a 50-50 split of both classes leaving us in a dilemma again.

 

So we basically arrive at these questions: **Given many attributes, how do you decide which rules obtained from the attributes to choose in order to split the data set?** From a single feature, you can get many rules and you may use any of these to make the split. **Do you randomly select these and split the data set or should there be a selection criterion for choosing one over the other?** What are you trying to achieve with the split?

 

All these questions will be covered in the following session and you will learn about the various algorithms and criteria that are involved in constructing a decision tree.

![9.png](attachment:a3ccaebf-a24c-464a-a79e-2d8e0044bdad.png)


If a partition contains data points with identical labels (for example, label 1), then you can classify the entire partition as that particular label (label 1). However, this is an oversimplified example. In real-world data sets, you will almost **never have completely homogeneous data sets (or even nodes) after splitting the data**. Hence, it is important that you **try to split the nodes such that the resulting nodes are as homogeneous as possible**. One important thing to remember is that homogeneity here always refers to the **homogeneity of the response (target) variable**.

**For example**, let's suppose we consider the same heart disease example in which you wanted to classify whether a person has a heart disease or not. If one of the **nodes is labelled ‘Blood Pressure’, try to split it with a rule such that all the data points that pass the rule have one label and those that do not pass the rule have a different label**. Thus, you need to ensure that the response variable's homogeneity in the resultant splits is as high as possible.

A split that results in a homogenous subset is much more desirable than the one that results in a 50-50 distribution (in the case of two labels). In a completely homogenous set, all the data points belong to one particular label. Hence, you must try to generate partitions that result in such sets.

**For classification purposes, a data set is completely homogeneous if it contains only a single class label. For regression purposes, a data set is completely homogeneous if its variance is as small as possible**. You will understand regression trees better in the upcoming segments.

Let’s take a look at the illustration given below to further understand homogeneity.

Consider a data set ‘D’ with homogeneity ‘H’ and a defined threshold value. **When homogeneity exceeds the threshold value, you need to stop splitting the node and assign the prediction to it**. As this node does not need further splitting, it becomes the leaf node.

Suppose that you keep the threshold value as 70%. The homogeneity of the node in the illustration given below is clearly above 70% (~86%). **The homogeneity value, in this case, falls above the threshold limit. Hence no splitting is required and this node becomes a leaf node.**

![10.png](attachment:2030ef19-abde-4d04-8f09-2fe62c6ffab8.png)

**But what do you do when the homogeneity ‘H’ is less than the threshold value?**

Now keeping the same threshold value as 70%, you can see from the below illustration that homogeneity of the first node is exactly 50%. The node contains an equal number of data points from both the class labels. Hence, you cannot give a prediction to this node as it does not meet the passing criterion which has been set. There is still some ambiguity existing in this node as there is no clarity on the label that can be assigned as the final prediction. This brings in the need for further splitting and we compare the values of homogeneity and threshold again to arrive at a decision. This is continued until the homogeneity value exceeds the threshold value.

![11.png](attachment:28878045-9325-44bb-833d-981822ac5e12.png)


**Till the homogeneity ‘H’ is less than the threshold, you need to continue splitting the node.** The process of splitting needs to be continued until homogeneity exceeds the threshold value and the majority data points in the node are of the same class.
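
The stopping rule described above can be sketched in a few lines of Python. This is only an illustration under one assumption: homogeneity is taken to be the share of the most common label in the node, which matches the ~86% and 50% figures in the illustrations but is not the only possible definition. The function names and the label lists are hypothetical.

```python
from collections import Counter

def homogeneity(labels):
    """Share of the most common class label in a node (1.0 means fully pure).

    Assumption: homogeneity is measured as the majority-class proportion.
    """
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def should_stop(labels, threshold=0.70):
    """Stop splitting once the node's homogeneity exceeds the threshold."""
    return homogeneity(labels) > threshold

# Hypothetical nodes; the counts are made up for illustration.
mostly_pure = [1] * 6 + [0] * 1    # 6 of 7 points share a label (~86%)
evenly_mixed = [1] * 5 + [0] * 5   # 50-50 split

print(should_stop(mostly_pure))    # True: becomes a leaf node
print(should_stop(evenly_mixed))   # False: needs further splitting
```

With a 70% threshold, the first node becomes a leaf and the second one has to be split again.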

This is an abstract example to give you an intuition on homogeneity and splitting. You will get better clarity in the next segment when you learn how to quantify and measure homogeneity to arrive at a prediction through continuous splitting. You will learn how to use specific methods to **measure homogeneity, namely the Gini index, entropy, classification error (for classification), and MSE (for regression).**

**Now, how do you find the best split for different attributes in a decision tree using the CART algorithm?**

A tree can be split based on different rules of an attribute and these attributes can be categorical or continuous in nature. If an attribute is nominal categorical, then there are ![](https://latex.upgrad.com/render?formula=2%5E%7Bk-1%7D-1) possible binary splits for this attribute, where ![](https://latex.upgrad.com/render?formula=k) is the number of categories (levels) of the attribute. In this case, each possible subset of categories is examined to determine the best split.

If an attribute is ordinal categorical or continuous in nature with n different values, there are n - 1 different possible splits for it. **The values of the attribute are sorted from the smallest to the largest, and candidate splits based on the individual values are examined to determine the best split point, which maximizes the homogeneity at a node**.

There are various **other techniques like calculating percentiles and midpoints of the sorted values** for handling continuous features in different algorithms and this process is known as discretization. Although the exact technicalities are out of the scope of this module, curious students can read more about this in detail from the additional resources given below.
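
To make the counting concrete, here is a minimal Python sketch that enumerates the candidate splits for both kinds of attribute. The helper names `nominal_binary_splits` and `midpoint_thresholds` and the sample values are hypothetical, and using midpoints between consecutive sorted values is just one of the conventions mentioned above.

```python
from itertools import combinations

def nominal_binary_splits(categories):
    """List every way to divide the categories into two non-empty groups."""
    cats = sorted(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    # Keep `first` on the left side so mirror-image splits are not counted twice.
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(cats) - left
            if right:  # skip the case where the right side would be empty
                splits.append((left, right))
    return splits

def midpoint_thresholds(values):
    """Candidate thresholds halfway between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

blood_types = ["A", "B", "AB", "O"]   # k = 4 categories
ages = [29, 35, 41, 41, 55, 63]       # n = 5 distinct values

print(len(nominal_binary_splits(blood_types)))  # 7, i.e. 2**(4 - 1) - 1
print(midpoint_thresholds(ages))                # [32.0, 38.0, 48.0, 59.0], i.e. n - 1 = 4 splits
```

The number of subsets grows exponentially with the number of categories, which is why nominal attributes with many levels are expensive to split on exhaustively.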

Now let's answer some questions based on your learnings so far.

## Nominal categorical variable


A **nominal categorical variable** is a variable whose values represent **categories with no natural order or ranking**.
The categories are simply **names or labels**, and one is not “greater” or “less” than another.

### Examples

| Variable           | Example categories        |
| ------------------ | ------------------------- |
| **Gender**         | Male, Female, Other       |
| **Marital status** | Single, Married, Divorced |
| **Blood type**     | A, B, AB, O               |
| **Type of car**    | Sedan, SUV, Hatchback     |

### Key characteristics

* **Purely qualitative**: The numbers or names assigned to the categories are just identifiers.
* **No inherent ordering**: You cannot say “SUV > Sedan” in a mathematical sense.
* **Analysis implication**: In decision trees, splits for nominal variables are based on **groupings of categories**, not on any numeric threshold.

---

**Contrast with Ordinal categorical variables:**
Ordinal categories **do have an order**, such as “Low, Medium, High” or “Freshman, Sophomore, Junior, Senior”, where ranking matters.


## Ordinal categorical variable

An **ordinal categorical variable** is a categorical variable whose categories have a **meaningful order or ranking**, but the **differences between categories are not necessarily measurable** or evenly spaced.

---

### Key properties

* **Categories can be ranked**: You can say one category represents “more” or “less” of something than another.
* **Gaps between ranks are not numerical**: The step from “Low” to “Medium” is not guaranteed to be the same as from “Medium” to “High”.

---

### Examples

| Variable              | Ordered categories                                                      |
| --------------------- | ----------------------------------------------------------------------- |
| Education level       | High school < Bachelor’s < Master’s < PhD                               |
| Customer satisfaction | Very dissatisfied < Dissatisfied < Neutral < Satisfied < Very satisfied |
| Disease severity      | Mild < Moderate < Severe                                                |
| Socioeconomic status  | Low < Middle < High                                                     |

---

### Why it matters in modeling

* In decision trees, splits on an **ordinal variable** can respect the order: e.g. “Education ≤ Bachelor’s” vs “> Bachelor’s”.
* But unlike continuous variables, you **cannot assume numeric distances** between the categories.

---

**In short:**
Ordinal categorical = **ordered labels**, where order matters but exact numeric spacing does not.


## Univariate

**Univariate** literally means **“one variable.”**

In statistics or machine-learning contexts it usually refers to:

* **Univariate analysis** – looking at or summarizing data for **a single variable at a time**
  *Example:* computing the mean and standard deviation of “height” alone.

* **Univariate split (in decision trees)** – at each node the decision is based on **just one predictor (feature)** at a time.
  *Example:* the node might check only `age ≤ 40` to decide how to branch, rather than a formula combining age and income.

---

✅ **Key idea:** Only one variable is involved in the analysis or in making the decision at that step—**not a combination of multiple variables**.


## Impurity Measures 

Now, you have narrowed down the decision tree construction problem to this: you want to split the data set such that the homogeneity of the resultant partitions is maximum. But how do you measure this homogeneity?

 

Various methods, such as the **classification error, Gini index and entropy**, can be used to **quantify homogeneity**. You will learn about each of these methods in the upcoming video.

 ![12.png](attachment:8b412146-b095-4bb3-8f9b-2cbadc2f4df8.png)

 
The classification error is calculated as follows:

- ![](https://latex.upgrad.com/render?formula=E%20%3D%201-max%5E%7B%7D%28%20p_%7Bi%7D%29)

The Gini index is calculated as follows:

- ![](https://latex.upgrad.com/render?formula=G%20%3D%20%5Csum_%7Bi%3D1%7D%5E%7Bk%7Dp_%7Bi%7D%20%281-%20p_%7Bi%7D%29)

Entropy is calculated as follows:

- ![](https://latex.upgrad.com/render?formula=D%20%3D%20-%20%5Csum_%7Bi%3D1%7D%5E%7Bk%7Dp_%7Bi%7D.log_%7B2%7D%28p_%7Bi%7D%29)

where ![](https://latex.upgrad.com/render?formula=p_%7Bi%7D) is the probability of finding a point with the label ![](https://latex.upgrad.com/render?formula=i), and ![](https://latex.upgrad.com/render?formula=k) is the number of classes.

Let's now tweak the above example and try to understand how these impurity measures change with the class distribution.

![13.png](attachment:ca820514-b737-4672-8b46-9a547df0ebb8.png)

From the above example, you understood how to calculate different impurity measures and how they change with different class distributions.

| Impurity Measures         | Case I         | Case II         | Case III        |
|--------------------------|---------------|-----------------|-----------------|
| Class distribution       | 0: 20<br>1: 80 | 0: 50<br>1: 50 | 0: 80<br>1: 20  |
| Classification Error     | 0.2           | 0.5             | 0.2             |
| Gini Impurity            | 0.32          | 0.5             | 0.32            |
| Entropy                  | 0.72          | 1               | 0.72            |

You can see that for completely **non-homogeneous data** with equal class distribution, the values of **Classification Error** and **Gini Impurity** are the **same**, i.e., 0.5, and that of **Entropy is 1**.

The **scaled version of the entropy** in the illustration shown in the video is nothing but **entropy/2**. It has been used to emphasize that the Gini index is an intermediate measure between entropy and the classification error.

In practice, classification error does not perform well. So, we generally prefer using either the Gini index or entropy over it.
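
The three formulas can be checked against the table above with a short Python sketch. The function names are illustrative; the class probabilities are the ones from Case I, II and III.

```python
import math

def classification_error(probs):
    """E = 1 - max(p_i)"""
    return 1 - max(probs)

def gini(probs):
    """G = sum of p_i * (1 - p_i)"""
    return sum(p * (1 - p) for p in probs)

def entropy(probs):
    """D = -sum of p_i * log2(p_i); terms with p_i = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

cases = {
    "Case I": [0.2, 0.8],
    "Case II": [0.5, 0.5],
    "Case III": [0.8, 0.2],
}
for name, probs in cases.items():
    print(
        name,
        round(classification_error(probs), 2),
        round(gini(probs), 2),
        round(entropy(probs), 2),
    )
```

Cases I and III give identical values because all three measures depend only on the class proportions, not on which class is the dominant one.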

### Gini Index

**Gini index** measures how likely a randomly chosen data point is to be classified incorrectly if it is labelled at random according to the class distribution in the node. The formula for the Gini index can also be written as follows:

![](https://latex.upgrad.com/render?formula=G%20%3D%20%5Csum_%7Bi%3D1%7D%5E%7Bk%7Dp_%7Bi%7D%20%281-%20p_%7Bi%7D%29%3D%20%5Csum_%7Bi%3D1%7D%5E%7Bk%7D%28p_%7Bi%7D%20-%20p_%7Bi%7D%5E%7B2%7D%29%3D%20%5Csum_%7Bi%3D1%7D%5E%7Bk%7Dp_%7Bi%7D-%5Csum_%7Bi%3D1%7D%5E%7Bk%7Dp_%7Bi%7D%5E%7B%5E%7B2%7D%7D%3D1-%5Csum_%7Bi%3D1%7D%5E%7Bk%7Dp_%7Bi%7D%5E%7B%5E%7B2%7D%7D)

where ![](https://latex.upgrad.com/render?formula=p_%7Bi%7D) is the probability of finding a point with the label ![](https://latex.upgrad.com/render?formula=i), and  ![](https://latex.upgrad.com/render?formula=k) is the number of classes.

**(Think: why is ![](https://latex.upgrad.com/render?formula=%5Csum_%7Bi%3D1%7D%5E%7Bk%7Dp_%7Bi%7D) equal to 1?)**

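**A Gini index of 0 indicates that all the data points belong to a single class**. For two classes, **a Gini index of 0.5 indicates** that the data points are **equally distributed** between the classes. In general, with k equally likely classes, the Gini index reaches its maximum of 1 - 1/k.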

Suppose you have a data set with two class labels. If the data set is completely homogeneous, i.e., all the data points belong to label 1, then the probability of finding a data point corresponding to label 2 will be 0 and that of label 1 will be 1. So, ![](https://latex.upgrad.com/render?formula=p_%7B1%7D) = 1 and ![](https://latex.upgrad.com/render?formula=p_%7B2%7D) = 0. The Gini index, which is equal to 0, will be the lowest in such a case. Hence, the **higher the homogeneity, the lower the Gini index**.

### Entropy

Entropy quantifies the degree of disorder in the given data. For two classes, its value varies from 0 to 1; with k classes, the maximum is log2(k). Entropy and the Gini index behave similarly: both are lowest for a pure node and highest for an evenly mixed one. If a data set is completely homogeneous, then the entropy of such a data set will be 0, i.e., there is no disorder in the data. If a data set contains an equal distribution of both the classes, then the entropy of that data set will be 1, i.e., there is complete disorder in the data. Hence, like the Gini index, the **higher the homogeneity, the lower the entropy.**
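
The familiar upper limits of 0.5 for the Gini index and 1 for entropy belong to the two-class case. The following sketch, using hypothetical helper functions, evaluates both measures for evenly distributed data with 2, 3 and 4 classes to show how the maximum grows with the number of classes.

```python
import math

def gini(probs):
    """G = 1 - sum of p_i squared"""
    return 1 - sum(p ** 2 for p in probs)

def entropy(probs):
    """D = -sum of p_i * log2(p_i), skipping classes with zero probability."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

for k in (2, 3, 4):
    uniform = [1 / k] * k  # every class equally likely: the least homogeneous case
    print(k, round(gini(uniform), 3), round(entropy(uniform), 3))
```

For k = 2 this prints 0.5 and 1.0; for k = 4 it prints 0.75 and 2.0, which agree with 1 - 1/k and log2(k) respectively.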

Now that you have understood the different methods to quantify the purity/impurity of a node, how do you identify the attribute that results in the best split? Let’s learn more about this in the upcoming video.


# Algorithms for Decision Trees

- **CART (Classification and Regression Trees)**: the most popular; scikit-learn's decision trees use an optimised version of CART
- C4.5
- C5.0
- ID3
- CHAID


## CART Algorithm
A CART tree is a binary decision tree that is constructed by splitting a node into two child nodes repeatedly, beginning with the root node that contains the whole learning sample.

![14.png](attachment:1ccfd629-6811-4448-91b1-7c7ecc71ad00.png)


The **change in impurity** or the **purity gain** is given by the difference of impurity post-split from impurity pre-split, i.e.,


**Δ Impurity = Impurity (pre-split) – Impurity (post-split)**

The **post-split impurity** is calculated as the **weighted average of the impurities of the two child nodes**, with each child weighted by the fraction of the parent's data points it receives. The split that results in **maximum gain** is chosen as the **best split**.
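
As a rough sketch of this idea (not a full CART implementation), the code below searches the candidate thresholds of a single numeric attribute and keeps the binary split with the largest purity gain, using the Gini index as the impurity measure. The function name `best_threshold` and the cholesterol readings are made up for illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, gain) for the split `value <= threshold` with the largest purity gain."""
    parent_impurity = gini(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    # Candidate thresholds: midpoints between consecutive distinct values.
    for a, b in zip(distinct, distinct[1:]):
        t = (a + b) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Post-split impurity: weighted average of the two child impurities.
        post_split = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        gain = parent_impurity - post_split
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical data: cholesterol readings and labels (1 = heart disease).
cholesterol = [180, 210, 240, 260, 310, 330]
disease = [0, 0, 0, 1, 1, 1]

print(best_threshold(cholesterol, disease))  # (250.0, 0.5)
```

Here the threshold 250.0 separates the two classes perfectly, so the post-split impurity is 0 and the gain equals the parent's impurity of 0.5.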

To summarise, the information gain is calculated by:

![](https://latex.upgrad.com/render?formula=Gain%20%3D%20D%20-%20D_%7BA%7D)

where ![](https://latex.upgrad.com/render?formula=D) is the entropy of the parent set (data before splitting), and ![](https://latex.upgrad.com/render?formula=D_%7BA%7D) is the weighted average entropy of the partitions obtained after splitting on attribute 'A'. Note that reduction in entropy implies information gain.

Let's understand **how to compute information gain** with an example.

Suppose you have four data points out of which two belong to the class label '1', and the other two belong to the class label '2'. You split the points such that the left partition has two data points belonging to label '1', and the right partition has the other two data points that belong to label '2'. Now let's assume that you split on some attribute called 'A'.

![15.png](attachment:9099987a-ab1b-4b29-86a7-313d0081e976.png)


1. Entropy of original/parent data set is ![](https://latex.upgrad.com/render?formula=D%20%3D-%20%5B%28%5Cfrac%7B2%7D%7B4%7D%29log_%7B2%7D%28%5Cfrac%7B2%7D%7B4%7D%29%20%2B%20%28%5Cfrac%7B2%7D%7B4%7D%29log_%7B2%7D%28%5Cfrac%7B2%7D%7B4%7D%29%5D%20%3D%201.0).
2. Entropy of the partitions after splitting is ![](https://latex.upgrad.com/render?formula=D_%7BA%7D%20%3D%20-%201%2A%7Blog_%7B2%7D%7D%28%5Cfrac%7B2%7D%7B2%7D%29%20-%201%2Alog_%7B2%7D%28%5Cfrac%7B2%7D%7B2%7D%29%20%3D%200). Each partition is pure, so its entropy is 0, and the weighted average over the two partitions is also 0.
3. Information gain after splitting is ![](https://latex.upgrad.com/render?formula=Gain%20%3D%20D%20-%20D_%7BA%7D%20%3D%201.0).

So, the information gain after splitting the original data set on attribute 'A' is **1.0**. You always try to maximise information gain by achieving maximum homogeneity and this is possible only when the value of entropy decreases from the parent set after splitting.

In case of a classification problem, you always try to **maximise purity gain** or **reduce the impurity** at a node after every split and this process is repeated till you reach the leaf node for the final prediction. 
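
The four-point example above can be reproduced with a small Python sketch. The function names are illustrative; the post-split entropy is computed as the weighted average of the partition entropies, as described earlier.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, partitions):
    """Entropy of the parent minus the weighted average entropy of its partitions."""
    n = len(parent)
    after = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(parent) - after

parent = [1, 1, 2, 2]          # two points of each class label
left, right = [1, 1], [2, 2]   # the split on attribute 'A'

print(entropy(parent))                          # 1.0
print(information_gain(parent, [left, right]))  # 1.0
```

A split that leaves both partitions as mixed as the parent, such as `[1, 2]` and `[1, 2]`, would give an information gain of 0.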


## The Gini Index

<div class="text_component"><p>Let’s consider the heart disease example that was introduced in the earlier segments to understand decision trees. Now, you will calculate the homogeneity measure for some of the features on some&nbsp;numbers using the Gini index to determine the attribute that you should split on first.</p><p>Recall that the Gini index is calculated as follows:</p><p>&nbsp;<img alt="Equation" data-latex="G =  \sum_{i=1}^{k}p_{i} (1- p_{i}) = 1- \sum_{i=1}^{k}p_{i}^2" src="https://latex.upgrad.com/render?formula=G%20%3D%20%20%5Csum_%7Bi%3D1%7D%5E%7Bk%7Dp_%7Bi%7D%20(1-%20p_%7Bi%7D)%20%3D%201-%20%5Csum_%7Bi%3D1%7D%5E%7Bk%7Dp_%7Bi%7D%5E2"></p><p>where&nbsp;<img alt="Equation" data-latex="p_{i}" src="https://latex.upgrad.com/render?formula=p_%7Bi%7D" style="vertical-align: middle;display: inline;"> is the probability of finding a point with the label <img alt="Equation" data-latex="i" src="https://latex.upgrad.com/render?formula=i" style="vertical-align: middle;display: inline;">, and <img alt="Equation" data-latex="k" src="https://latex.upgrad.com/render?formula=k" style="vertical-align: middle;display: inline;">&nbsp;is the number of classes.</p><p><br>The data set is not homogeneous, and you need to split the data such that the resulting partitions are as homogenous as possible. This is a classification problem, and there are two output classes or labels - having a heart disease or not. Here, you use the Gini index as the homogeneity measure. Let's go ahead and see how Gini index can be used to decide where to make the split on the data point.&nbsp;While making your first split, you need to choose an attribute such that the purity gain is maximum. 
You can&nbsp;calculate the Gini index of the split on ‘sex’ (gender) and compare that with the Gini index of the split on ‘cholesterol’.</p><p>Suppose you have the data for 100 patients and the target variable consists of two classes: class 0 having 60 people with no heart disease and class 1 having 40 people with a heart disease.</p><p><img data-height="63" data-width="261" height="63" src="https://images.upgrad.com/6164dd8e-bcb4-4566-8bef-e893fd9066e2-3.PNG" width="261"></p><p>Expressing this in terms of probabilities you&nbsp;get:</p><p><img data-height="55" data-width="407" height="55" src="https://images.upgrad.com/a58afad6-286d-4d5e-94ed-a0a82d683b2d-4.PNG" width="407"></p><p>Now, you can calculate the Gini index for the data before making any splits&nbsp;as follows:</p><p><img data-height="67" data-width="465" height="67" src="https://images.upgrad.com/ea7ccb1f-1f8f-471a-b8db-25a7ee14a184-5.PNG" width="465"></p><p>Let's now evaluate which split gives the maximum reduction in impurity among the possible choices. You have the following information about the target variable and the two attributes.</p><p style="text-align: center;"><strong><img data-height="355" data-width="569" height="355" src="https://images.upgrad.com/c7bfc4fb-2358-45fd-ae87-7c65bd54f3d6-Dtassessment.PNG" width="569"></strong></p><p>As you can see, the table above shows the number of diseased/non-diseased persons for each level of the two attributes, 'Sex' and 'Cholesterol'. Let's calculate the impurity reduction on each attribute individually, starting with 'Sex'.</p><p><strong>Split based on Sex</strong></p><p>Let's consider the first candidate split based on sex/gender. As you can see from the first table, of the 100 people, you&nbsp;have 70 males and 30 females. Among the 70 males, i.e., the child node containing males,<strong> 50 belong to class 0</strong>, i.e., they do not have a heart disease, and the remaining <strong>20 males&nbsp;belong to class 1</strong>, having a heart disease. 
So basically for the split on "Sex", you have something like this&nbsp;—</p><p style="text-align: center;"><img data-height="201" data-width="300" height="201" src="https://images.upgrad.com/467cdee3-0c63-490f-91a2-0f324c4c47f2-sex1.png" width="300"></p><p><strong>[Note that (x, y) on any node means (# Label 0, # Label 1)]</strong></p><p>Now the probabilities of the two classes within the male subset comes out to be:</p><p>&nbsp;<img alt="Equation" data-latex="p_{0}=\frac{50}{70}=0.714" src="https://latex.upgrad.com/render?formula=p_%7B0%7D%3D%5Cfrac%7B50%7D%7B70%7D%3D0.714">&nbsp; &nbsp; &nbsp;and&nbsp; &nbsp;<img alt="Equation" data-latex="p_{1}=\frac{20}{70}=0.286" src="https://latex.upgrad.com/render?formula=p_%7B1%7D%3D%5Cfrac%7B20%7D%7B70%7D%3D0.286"></p><p>Now using the same formula, Gini impurity for males becomes:</p><p><img alt="Equation" data-latex="0.714(1-0.714)+0.286(1-0.286)=0.41" src="https://latex.upgrad.com/render?formula=0.714%281-0.714%29%2B0.286%281-0.286%29%3D0.41" style="vertical-align: middle;display: inline;"></p><p><br>Let's now take the other case i.e. the child node containing females, where there are 30 females out of which <strong>10 belong to class 0</strong> having no&nbsp;heart disease and <strong>20 belong to class 1</strong>&nbsp;having a&nbsp;heart disease. 
The probabilities of the two classes within the female subset comes out to be:</p><p>&nbsp;<img alt="Equation" data-latex="p_{0}=\frac{10}{30}=0.333" src="https://latex.upgrad.com/render?formula=p_%7B0%7D%3D%5Cfrac%7B10%7D%7B30%7D%3D0.333">&nbsp; &nbsp; &nbsp;and&nbsp; &nbsp; &nbsp;<img alt="Equation" data-latex="p_{1}=\frac{20}{30}=0.667" src="https://latex.upgrad.com/render?formula=p_%7B1%7D%3D%5Cfrac%7B20%7D%7B30%7D%3D0.667"></p><p>Now using the formula, Gini impurity for females becomes:</p><p><img alt="Equation" data-latex="0.333(1-0.333)+0.667(1-0.667)=0.44" src="https://latex.upgrad.com/render?formula=0.333%281-0.333%29%2B0.667%281-0.667%29%3D0.44" style="vertical-align: middle;display: inline;"></p><p>Now how do you get the overall impurity for the attribute 'sex'&nbsp;after the split?&nbsp;You can aggregate the Gini impurity of these&nbsp;two nodes by&nbsp;taking a weighted average of the impurities of the male and female nodes. So, you have -</p><p>&nbsp;<img alt="Equation" data-latex="p_{male}=\frac{70}{100}=0.7" src="https://latex.upgrad.com/render?formula=p_%7Bmale%7D%3D%5Cfrac%7B70%7D%7B100%7D%3D0.7">&nbsp; &nbsp; and&nbsp; &nbsp;&nbsp;<img alt="Equation" data-latex="p_{female}=\frac{30}{100}=0.3" src="https://latex.upgrad.com/render?formula=p_%7Bfemale%7D%3D%5Cfrac%7B30%7D%7B100%7D%3D0.3"></p><p>This gives the Gini impurity after the split based on gender as:</p><p><img alt="Equation" data-latex="0.7\times0.41+0.3\times0.44=0.42" src="https://latex.upgrad.com/render?formula=0.7%5Ctimes0.41%2B0.3%5Ctimes0.44%3D0.42" style="vertical-align: middle;display: inline;"></p><p>Thus, the split based on <strong>gender</strong> gives the following insights:</p><ul><li>Gini impurity before split = 0.48</li><li>Gini impurity after split = 0.42</li><li><strong>Reduction in Gini impurity = 0.48 - 0.42 = 0.06</strong></li></ul><p>Hence, you get the following tree after splitting on 'Sex'&nbsp;—</p><p style="text-align: center;"><img data-height="392" 
data-width="500" height="392" src="https://images.upgrad.com/a89c7ae6-1e1b-4992-a0e7-ca17257441d6-sex2.png" width="500"></p><p><strong>Split based on Cholesterol</strong></p><p>Let's now take another candidate split based on cholesterol. You divide the dataset into two subsets: Low Cholesterol&nbsp;(Cholesterol &lt; 250)&nbsp;and High&nbsp;Cholesterol (Cholesterol &gt; 250). There are 60 people belonging to the low cholesterol group and 40 people belonging to the high cholesterol group.&nbsp;</p><p>If you see the second table given above, you will notice that among the 60 low cholesterol people, <strong>50 belong to class 0</strong>, i.e., they do not have a heart disease, and the remaining <strong>10 belong to class 1</strong>, having a heart disease. So basically for the split on "Cholesterol", you have something like this:</p><p style="text-align: center;"><img data-height="201" data-width="300" height="201" src="https://images.upgrad.com/cb9bf06d-ce99-489e-913c-3d3757b792ea-chol1.png" width="300"></p><p>Now the probabilities of the two classes within the low cholesterol subset come out to be:</p><p><img alt="Equation" data-latex="p_{0}=\frac{50}{60}=0.833" src="https://latex.upgrad.com/render?formula=p_%7B0%7D%3D%5Cfrac%7B50%7D%7B60%7D%3D0.833">&nbsp; &nbsp; and&nbsp; &nbsp;&nbsp;<img alt="Equation" data-latex="p_{1}=\frac{10}{60}=0.167" src="https://latex.upgrad.com/render?formula=p_%7B1%7D%3D%5Cfrac%7B10%7D%7B60%7D%3D0.167"></p><p>Now using the formula, Gini impurity for low cholesterol subset becomes:</p><p><img alt="Equation" data-latex="0.833(1-0.833)+0.167(1-0.167)\approx0.28" src="https://latex.upgrad.com/render?formula=0.833%281-0.833%29%2B0.167%281-0.167%29%5Capprox0.28" style="vertical-align: middle;display: inline;"></p><p>Let's now take the other case where there are 40&nbsp;high cholesterol (Cholesterol &gt; 250)&nbsp;people out of which 10&nbsp;belong to class 0 having no&nbsp;heart disease and 30&nbsp;belong to class 1&nbsp;having a&nbsp;heart 
disease. The probabilities of the two classes within the high cholesterol subset come out to be:</p><p>&nbsp;<img alt="Equation" data-latex="p_{0}=\frac{10}{40}=0.25" src="https://latex.upgrad.com/render?formula=p_%7B0%7D%3D%5Cfrac%7B10%7D%7B40%7D%3D0.25">&nbsp; &nbsp;and&nbsp; &nbsp;&nbsp;<img alt="Equation" data-latex="p_{1}=\frac{30}{40}=0.75" src="https://latex.upgrad.com/render?formula=p_%7B1%7D%3D%5Cfrac%7B30%7D%7B40%7D%3D0.75"></p><p>Now using the formula, Gini impurity for high cholesterol subset becomes:</p><p><img alt="Equation" data-latex="0.25(1-0.25)+0.75(1-0.75)=0.375" src="https://latex.upgrad.com/render?formula=0.25%281-0.25%29%2B0.75%281-0.75%29%3D0.375" style="vertical-align: middle;display: inline;"></p><p>The overall impurity for the data after the split based on cholesterol can be computed by&nbsp;taking a weighted average of the impurities of the high and low cholesterol&nbsp;nodes. So, you have:</p><p>&nbsp;<img alt="Equation" data-latex="p_{low-cholesterol}=\frac{60}{100}=0.6" src="https://latex.upgrad.com/render?formula=p_%7Blow-cholesterol%7D%3D%5Cfrac%7B60%7D%7B100%7D%3D0.6">&nbsp; &nbsp;&nbsp;and&nbsp;&nbsp;<img alt="Equation" data-latex="p_{high-cholesterol}=\frac{40}{100}=0.4" src="https://latex.upgrad.com/render?formula=p_%7Bhigh-cholesterol%7D%3D%5Cfrac%7B40%7D%7B100%7D%3D0.4"></p><p>This gives the Gini impurity after the split based on cholesterol as:</p><p><img alt="Equation" data-latex="0.6\times0.28+0.4\times0.375\approx0.32" src="https://latex.upgrad.com/render?formula=0.6%5Ctimes0.28%2B0.4%5Ctimes0.375%5Capprox0.32" style="vertical-align: middle;display: inline;"></p><p>Thus, the split based on <strong>cholesterol</strong> gives the following insights:</p><ul><li>Gini impurity before split = 0.48</li><li>Gini impurity after split = 0.32</li><li><strong>Reduction in Gini impurity = 0.48 - 0.32&nbsp;= 0.16</strong></li></ul><p>Hence, you get the following tree after splitting on 'Cholesterol':</p><p style="text-align: center;"><img 
data-height="353" data-width="500" height="353" src="https://images.upgrad.com/ca0905ce-3cf5-49a8-b9e1-e67ddf974b76-chol2.png" width="500"></p><p>Hence, from the above example, it is evident that you get a significantly higher reduction in Gini impurity when you split the dataset on cholesterol as compared to when you split on gender.</p><p>Let's summarise all the steps you performed.</p><ol><li>Calculate the Gini impurity before any split on the whole dataset.</li><li>Consider any one of the available attributes.</li><li>Calculate the Gini impurity after splitting on this attribute for each of the levels of the attribute. In the example above, we considered the attribute 'Sex' and then calculated the Gini impurity for both males and females separately.</li><li>Combine the Gini impurities of all the levels, as a weighted average, to get the Gini impurity of the overall attribute.</li><li>Repeat steps 2-4 with another attribute till you have exhausted all of them.</li><li>Compare the decrease in Gini impurity across all attributes and select the one which offers maximum reduction.</li></ol><p>You can also perform the same exercise using Entropy instead of Gini index as your measure.</p><p><strong>Important:&nbsp;</strong>Please note that&nbsp;Gini index is also often referred to as Gini impurity. Also, some sources/websites/books&nbsp;might mention a different formula for the Gini index. There is nothing wrong in using either of the formulas (because the ultimate interpretation regarding the impurity of the feature remains unchanged across both the formulas), but in order to avoid any confusion, we recommend that you stick to the one mentioned in this session,&nbsp;as this is the formula that we will be consistently using throughout the whole module. 
Here it is again:</p><p>&nbsp;<img alt="Equation" data-latex="G =  \sum_{i=1}^{k}p_{i} (1- p_{i})" src="https://latex.upgrad.com/render?formula=G%20%3D%20%20%5Csum_%7Bi%3D1%7D%5E%7Bk%7Dp_%7Bi%7D%20(1-%20p_%7Bi%7D)"></p><p>In the next segment, you will learn about the property of feature importance in decision trees.</p></div>
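
The comparison between the two candidate splits can also be worked through in code. The sketch below uses the class counts from the example (60 patients without and 40 with heart disease); the function names are hypothetical. It carries full precision through the calculation, so its results may differ in the second decimal place from figures that are rounded at intermediate steps.

```python
def gini_from_counts(counts):
    """Gini impurity of a node given its class counts, e.g. (# label 0, # label 1)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_reduction(parent_counts, child_counts):
    """Parent impurity minus the weighted average impurity of the child nodes."""
    n = sum(parent_counts)
    after = sum(sum(child) / n * gini_from_counts(child) for child in child_counts)
    return gini_from_counts(parent_counts) - after

parent = (60, 40)                      # (class 0, class 1) before any split
sex_children = [(50, 20), (10, 20)]    # males, females
chol_children = [(50, 10), (10, 30)]   # low cholesterol, high cholesterol

print(round(gini_from_counts(parent), 2))               # 0.48
print(round(gini_reduction(parent, sex_children), 3))   # 0.061
print(round(gini_reduction(parent, chol_children), 3))  # 0.163
```

The split on cholesterol removes far more impurity than the split on sex, so it would be chosen as the first split.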