## Content 
- **Entropy for Continuous Random Variable**

- **Comparing Entropy for Continuous R.V**

- **Gini Impurity Code walkthrough**

## Formulation - Continuous R.V


For a discrete random variable, entropy is a summation over all possible values.

For a continuous random variable, this summation turns into an integral.

The formulation becomes

$H(Y) = -\int_{-\infty}^{\infty} p(y)\log(p(y))\,dy$

where $p(y)$ is the probability density function (PDF) of $Y$.
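
To make the integral concrete, here is a minimal sketch that approximates it numerically for a standard normal density and compares the result with the known closed form $\frac{1}{2}\ln(2\pi e\sigma^2)$. The choice of a Gaussian, the natural logarithm (entropy in nats), the integration range, and the helper names `gaussian_pdf` and `differential_entropy` are all assumptions made for illustration; they are not part of the original notebook.

```python
import math

def gaussian_pdf(y, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def differential_entropy(pdf, lower, upper, steps=100_000):
    """Approximate -integral of p(y) * ln(p(y)) dy on [lower, upper] with the midpoint rule."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        y = lower + (i + 0.5) * width
        p = pdf(y)
        if p > 0.0:  # treat 0 * log(0) as 0
            total -= p * math.log(p) * width
    return total

# The infinite limits are truncated to [-10, 10]; the tails beyond that are negligible here.
h_numeric = differential_entropy(gaussian_pdf, -10.0, 10.0)
h_closed_form = 0.5 * math.log(2.0 * math.pi * math.e)  # sigma = 1

print(h_numeric, h_closed_form)
```

The two printed numbers should agree to several decimal places, which suggests the summation-to-integration step behaves as expected for this density.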




<img src='https://drive.google.com/uc?id=1L6Fn7ToAiY55yMPOjBU1FSuhD4HrEBEc'>



## Comparing entropy for continuous R.V


```
Say we have 3 continuous random variables X, Y, Z
and their distributions (PDFs) are as follows:

```


<img src='https://drive.google.com/uc?id=1dh0CWPv6eAzELWlFHRIY6iODOAdGQdSz' width = 800>

What will be the order of the random variables in increasing order of entropy?

 **Z < Y < X**

Remember that entropy is a measure of randomness.

- Here, Z has a sharply peaked curve
    - We are most likely to observe values in a very small range (low variance)
    - This means there is more certainty and less randomness
    - Hence, Z will have the lowest entropy

- Random variable Y has a larger spread than Z
    - We are likely to observe values from a wider range
    - This means there is more randomness in the values
    - Hence, it will have higher entropy than Z

In a similar fashion, X, having the largest spread, will have the maximum entropy.

**For distributions of the same shape, the more the variance, the more the entropy**

So, the order becomes Z < Y < X
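
The ordering can be illustrated with a small sketch, assuming the three curves are Gaussian (the figure does not state this). For a Gaussian, the entropy has the closed form $\frac{1}{2}\ln(2\pi e\sigma^2)$, so it grows with the standard deviation $\sigma$. The $\sigma$ values below are hypothetical and chosen only so that Z is the narrowest and X the widest.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy (in nats) of a normal distribution with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

# Hypothetical spreads: Z is the most peaked, X the most spread out.
sigmas = {"Z": 0.5, "Y": 1.0, "X": 2.0}

entropies = {name: gaussian_entropy(sigma) for name, sigma in sigmas.items()}
order = sorted(entropies, key=entropies.get)  # names sorted by increasing entropy

print(" < ".join(order))
```

Under these assumptions the sorted order matches the answer above. Note that, unlike discrete entropy, this quantity can be negative when the density is very narrow.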

## Gini Impurity

### Code Walkthrough

In [None]:
import pandas as pd
import numpy as np


In [None]:
!gdown  1l53Fgkg1G1ekCxxgaDQ00EXrnSMTeJj-

Downloading...
From: https://drive.google.com/uc?id=1l53Fgkg1G1ekCxxgaDQ00EXrnSMTeJj-
To: /content/sample_data.csv


In [None]:
sample_data = pd.read_csv('sample_data.csv')

In [None]:
sample_data

| Unnamed: 0 | Gender | Age_less_35 | JobRole | Attrition |
|---|---|---|---|---|
| 0 | Male | True | Laboratory Technician | 0 |
| 1 | Male | False | Sales Executive | 1 |
| 2 | Male | True | Sales Representative | 1 |
| 3 | Female | False | Healthcare Representative | 0 |
| 4 | Male | True | Sales Executive | 0 |
| ... | ... | ... | ... | ... |
| 995 | Male | False | Laboratory Technician | 1 |
| 996 | Female | False | Manufacturing Director | 0 |
| 997 | Female | True | Sales Executive | 0 |
| 998 | Male | False | Manager | 0 |


In [None]:
sample_data.Attrition.value_counts()

0    831
1    169
Name: Attrition, dtype: int64

In [None]:
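def gini_impurity(y):
  # Gini impurity = 1 - sum of squared class proportions
  if isinstance(y, pd.Series):
    p = y.value_counts()/y.shape[0]  # proportion of each class
    gini = 1-np.sum(p**2)
    return gini

  else:
    # A bare string cannot be raised in Python 3; raise a proper exception
    raise TypeError('Object must be a Pandas Series.')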

In [None]:
gini_impurity(sample_data.Attrition)

0.28087799999999996
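
The result can be checked by hand from the class counts printed earlier (831 zeros and 169 ones): $1 - (0.831^2 + 0.169^2) = 1 - (0.690561 + 0.028561) = 0.280878$. A minimal sketch of the same arithmetic in plain Python, using only those two counts (the variable names are illustrative):

```python
# Class counts taken from the value_counts() output above.
counts = {0: 831, 1: 169}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Gini impurity: 1 minus the sum of squared class proportions.
gini = 1.0 - sum(p ** 2 for p in proportions.values())

print(round(gini, 6))
```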

#### Weighted Gini impurity for child node

In [None]:
def calculate_weighted_gini(feature, y):
    # Weighted Gini impurity of the child nodes obtained by splitting on `feature`
    categories = feature.unique()

    weighted_gini_impurity = 0

    for category in categories:
        # Target values of the rows that fall into this child node
        y_category = y[feature == category]
        gini_impurity_category = gini_impurity(y_category)
        # Weight each child's impurity by its share of the rows
        weighted_gini_impurity += y_category.shape[0]/y.shape[0]*gini_impurity_category

    return weighted_gini_impurity

In [None]:
calculate_weighted_gini(sample_data.Age_less_35, sample_data.Attrition)

0.2724771918985819
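
To see what the weighting does, here is a small sketch on a hypothetical 8-row dataset (not the notebook's data), written with the standard library only. The names `gini`, `weighted_gini`, `feature` and `labels` are made up for this example. It computes the parent impurity, the weighted impurity of the two child nodes, and their difference.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_gini(feature, labels):
    """Impurity of each group of labels, weighted by the group's share of rows."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(g) / len(labels) * gini(g) for g in groups.values())

# Hypothetical toy data: two categories with four rows each.
feature = ["A"] * 4 + ["B"] * 4
labels = [0, 0, 0, 1,   # category A: impurity 0.375
          0, 0, 1, 1]   # category B: impurity 0.5

parent = gini(labels)                      # 1 - (5/8)^2 - (3/8)^2
children = weighted_gini(feature, labels)  # 0.5 * 0.375 + 0.5 * 0.5
reduction = parent - children

print(parent, children, reduction)
```

A split is more useful the more it lowers the weighted impurity relative to the parent. For the notebook's own numbers, the same subtraction gives roughly 0.2809 - 0.2725 ≈ 0.0084, so splitting on `Age_less_35` reduces impurity only slightly.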