
# Lesson 46: Probability

### Teacher-Student Activities

In this class, we will learn the concept of probability and binomial distribution.

Here we have a dataset of 583 patients who were tested for liver disease. Of them, 416 patients tested positive and 167 tested negative. The data was collected from the north-east of Andhra Pradesh, India. We need to find the probability that a patient has the disease. We also need to find the probability that a patient has the disease given that the patient is

- a juvenile i.e. a person whose age is less than 18 years.

- an adult i.e. a person whose age is greater than or equal to 18 years but less than 50 years.

- an elderly i.e. a person whose age is at least 50 years.

We also need to create a binomial distribution model to check what is the probability distribution of a patient having a disease.

The `Dataset` column is a class label used to divide groups into liver patients (liver disease) or not (no disease). This data set contains 441 male patient records and 142 female patient records.


Use these patient records to determine which patients have liver disease and which ones do not.

---

#### Data Description

The dataset contains the following columns (or features):

1. `Age`: Age of the patient. Any patient whose age exceeded 89 is listed as being of age "90".

2. `Gender`: Gender of the patient

3. `Total_Bilirubin`: Total Bilirubin

4. `Direct_Bilirubin`: Direct Bilirubin

5. `Alkaline_Phosphotase`: Alkaline Phosphatase

6. `Alamine_Aminotransferase`: Alanine Aminotransferase

7. `Aspartate_Aminotransferase`: Aspartate Aminotransferase

8. `Total_Protiens`: Total Proteins

9. `Albumin`: Albumin

10. `Albumin_and_Globulin_Ratio`: Ratio of Albumin to Globulin

11. `Dataset`: Whether a patient has the liver disease or not. `1` means a patient has the liver disease and `2` means a patient does not have the liver disease.

Link: https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/indian-liver-patients/indian_liver_patient.csv

#### Acknowledgements

This dataset was downloaded from the UCI ML Repository:

Lichman, M. (2013). [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)). Irvine, CA: University of California, School of Information and Computer Science.

---

#### Activity 1: Probability

*Probability of an outcome or an event is defined as the ratio of the number of favourable outcomes to the total number of possible outcomes.*

Consider throwing a die.

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/fair-die.jpg' width=200>

It has six sides labelled as 1, 2, 3, 4, 5 and 6. So whenever you throw a die you will get either 1, 2, 3, 4, 5 or 6.

Suppose you are playing a game in which you need to get $6$ when you roll a die to win the game. So $6$ is your favourable outcome to win the game. Hence, the probability of getting 6 is **one over six**, i.e. $\frac{1}{6}$ because there is only **one** favourable outcome (i.e. getting $6$ on the die) out of six possible outcomes.

Mathematically, the probability of an outcome or an event $E$ is given by $$P(E) = \frac{n(E)}{n(S)}$$

or

$$P(E) = \frac{\text{Number of favourable outcomes}}{\text{Total number of possible outcomes}}$$

where

- $E$ is a set containing favourable outcomes

- $n(E)$ is the number of items contained in the set $E$

- $S$ is a set of all possible outcomes which is also known as **sample space**

- $n(S)$ is the number of items contained in the set $S$

In the game of getting $6$ to win the game, the set of favourable outcome(s) is $E = \{6\}$ and the set of possible outcomes is $S = \{1, 2, 3, 4, 5, 6\}$

Hence, the probability of the outcome of getting $6$ is $\frac{1}{6}$

Let's change the rules of the game by saying that a player can take out their pawn only if they get a prime number. In this case, the set of favourable outcomes becomes $E = \{2, 3, 5\}$

So the probability of taking out the pawn or in other words, the probability of the outcome of getting either 2 or 3 or 5 is $\frac{3}{6} = \frac{1}{2}$ because there are three items in the set $E$.
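The two die games above can be reproduced with a few lines of Python. This is a minimal sketch, assuming all outcomes are equally likely; the helper name `probability` is made up for this illustration and is not part of the lesson.

```python
from fractions import Fraction

def probability(event, sample_space):
    """Return P(E) = n(E) / n(S), assuming equally likely outcomes."""
    # Keep only the favourable outcomes that are actually in the sample space.
    favourable = set(event) & set(sample_space)
    return Fraction(len(favourable), len(sample_space))

sample_space = {1, 2, 3, 4, 5, 6}                 # all faces of a fair die
p_six = probability({6}, sample_space)            # first game: get a 6
p_prime = probability({2, 3, 5}, sample_space)    # second game: get a prime number

print(p_six)    # 1/6
print(p_prime)  # 1/2
```

Using `Fraction` keeps the result in the same reduced form as the hand calculation, e.g. $\frac{3}{6}$ is shown as `1/2`.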

**Note:**

- Probability is just a measure of finding out which events are more likely to occur. It does not mean that the highly likely events will definitely occur. The probability of an event which definitely will occur is 1. Similarly, the probability of an event which will definitely NOT occur is 0.

- The sum of the probability of occurrence of an event and the probability of that event not occurring is always 1.

With this idea in mind, let's find out the probability of a patient having the liver disease.
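As a quick sketch, the same definition can be applied to the patient counts quoted in the introduction (416 positive and 167 negative out of 583). The variable names below are chosen for this illustration only.

```python
total_patients = 583    # number of records in the dataset
with_disease = 416      # patients who tested positive
without_disease = 167   # patients who tested negative

p_disease = with_disease / total_patients
p_no_disease = without_disease / total_patients

print(round(p_disease, 4))      # 0.7136
print(round(p_no_disease, 4))   # 0.2864
```

The two values add up to 1, as the second note above says they should for an event and its complement.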

---

#### Activity 2: Data Preparation

Let's prepare the dataset for analysis by:

- Treating the null values (if there are any)

- Renaming the `Dataset` column with the name `Disease`, the `Alkaline_Phosphotase` column with the name `Alkaline_Phosphatase` and `Alamine_Aminotransferase` column with the name `Alanine_Aminotransferase`

- Labelling each patient as a juvenile, an adult or an elderly based on their age

- Encoding (or converting) the `Male` and `Female` values for the `Gender` column to the numeric values, i.e., `0` and `1`

In [None]:
# S2.1: Import the libraries.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [None]:
# S2.2: Load the dataset
df = pd.read_csv('https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/indian-liver-patients/indian_liver_patient.csv')

In [None]:
# S2.3: Get the dataset information.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         583 non-null    int64  
 1   Gender                      583 non-null    object 
 2   Total_Bilirubin             583 non-null    float64
 3   Direct_Bilirubin            583 non-null    float64
 4   Alkaline_Phosphotase        583 non-null    int64  
 5   Alamine_Aminotransferase    583 non-null    int64  
 6   Aspartate_Aminotransferase  583 non-null    int64  
 7   Total_Protiens              583 non-null    float64
 8   Albumin                     583 non-null    float64
 9   Albumin_and_Globulin_Ratio  579 non-null    float64
 10  Dataset                     583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB


In [None]:
# S2.4: Check for the null values.
df.isnull().sum()

Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    4
Dataset                       0
dtype: int64

In [None]:
# S2.5: Treat the missing values and check for the null values again.
# Fill the missing ratios with the column median, then confirm that no null values remain.
df['Albumin_and_Globulin_Ratio'] = df['Albumin_and_Globulin_Ratio'].fillna(df['Albumin_and_Globulin_Ratio'].median())
df.isnull().sum()

**Renaming a Column**

To rename a column in a Pandas DataFrame, use the `rename()` function. It requires a dictionary as an input to the `columns` parameter. The dictionary must contain the old column names and new column names as a key-value pair respectively.

**Syntax:** `dataframe.rename(columns=dictionary_of_new_and_old_names)`

where `dictionary_of_new_and_old_names` is a dictionary containing old column names and new column names as a key-value pair respectively.

In [None]:
# S2.6: Rename the 'Dataset' column with the name 'Disease'. Also, rename the 'Alkaline_Phosphotase' column with 'Alkaline_Phosphatase'
# Also, rename the 'Alamine_Aminotransferase' column with 'Alanine_Aminotransferase'
df.rename(columns = {'Dataset': 'Disease', 'Alamine_Aminotransferase': 'Alanine_Aminotransferase' , 'Alkaline_Phosphotase' : 'Alkaline_Phosphatase'  },inplace = True )

**Note:** By setting the `inplace` parameter to `True`, we are telling pandas to modify the `df` DataFrame itself. Otherwise, `rename()` returns a new DataFrame with the new column names and leaves `df` unchanged, so the result must be assigned to a variable to keep it.
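The effect of `rename()` with and without keeping its result can be seen on a tiny DataFrame. This is only a sketch; the two-row frame `toy` is hypothetical and stands in for the real dataset.

```python
import pandas as pd

# Hypothetical two-row frame used only to demonstrate rename().
toy = pd.DataFrame({'Dataset': [1, 2], 'Age': [65, 40]})

# Without inplace=True, rename() returns a new DataFrame.
renamed = toy.rename(columns={'Dataset': 'Disease'})
print(list(toy.columns))      # ['Dataset', 'Age']
print(list(renamed.columns))  # ['Disease', 'Age']

# With inplace=True, the original DataFrame itself is modified.
toy.rename(columns={'Dataset': 'Disease'}, inplace=True)
print(list(toy.columns))      # ['Disease', 'Age']
```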

In [None]:
# S2.7: Check whether we have the desired column names.
df

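| | Age | Gender | Total_Bilirubin | Direct_Bilirubin | Alkaline_Phosphatase | Alanine_Aminotransferase | Aspartate_Aminotransferase | Total_Protiens | Albumin | Albumin_and_Globulin_Ratio | Disease |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65 | Female | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.90 | 1 |
| 1 | 62 | Male | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 |
| 2 | 62 | Male | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 |
| 3 | 58 | Male | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.00 | 1 |
| 4 | 72 | Male | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.40 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 578 | 60 | Male | 0.5 | 0.1 | 500 | 20 | 34 | 5.9 | 1.6 | 0.37 | 2 |
| 579 | 40 | Male | 0.6 | 0.1 | 98 | 35 | 31 | 6.0 | 3.2 | 1.10 | 1 |
| 580 | 52 | Male | 0.8 | 0.2 | 245 | 48 | 49 | 6.4 | 3.2 | 1.00 | 1 |
| 581 | 31 | Male | 1.3 | 0.5 | 184 | 29 | 32 | 6.8 | 3.4 | 1.00 | 1 |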


**Labelling Age Group**

We need to label each patient as

- a juvenile i.e. a person whose age is less than 18 years.

- an adult i.e. a person whose age is greater than or equal to 18 years but less than 50 years.

- an elderly i.e. a person whose age is at least 50 years.

so that later we can find out the probability of a patient having the liver disease given that the patient is

- a juvenile

- an adult

- an elderly

In [None]:
# S2.8: Create a function which takes a Pandas series as an input and returns another Pandas series as an output containing items 1, 2 and 3.
def age_group(series1):
  series2 = []
  for i in series1:
    if i < 18:
      series2.append(1)
    elif (i >= 18 and i < 50):
      series2.append(2)
    else:
      series2.append(3)
  return pd.Series(series2, index = series1.index)


In [None]:
age_group_data = age_group(df['Age'])
age_group_data.value_counts()

2    328
3    230
1     25
dtype: int64
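As an alternative to the explicit loop, the same labelling can be sketched with `pd.cut()`. This is not the lesson's method, just a more compact option; the `ages` series below is hypothetical and was chosen to hit every boundary.

```python
import numpy as np
import pandas as pd

# Hypothetical ages covering both boundaries (18 and 50).
ages = pd.Series([4, 17, 18, 49, 50, 90])

# right=False makes each bin closed on the left:
# [-inf, 18) -> 1 (juvenile), [18, 50) -> 2 (adult), [50, inf) -> 3 (elderly)
groups = pd.cut(ages, bins=[-np.inf, 18, 50, np.inf], labels=[1, 2, 3], right=False).astype(int)

print(groups.tolist())  # [1, 1, 2, 2, 3, 3]
```

The `right=False` argument matters here: with the default `right=True`, an 18-year-old would fall into the juvenile bin.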

In [None]:
# S2.9: Add a new column to the 'df' DataFrame.
df['age_group'] = age_group_data
df.head()

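| | Age | Gender | Total_Bilirubin | Direct_Bilirubin | Alkaline_Phosphatase | Alanine_Aminotransferase | Aspartate_Aminotransferase | Total_Protiens | Albumin | Albumin_and_Globulin_Ratio | Disease | age_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65 | Female | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.9 | 1 | 3 |
| 1 | 62 | Male | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 | 3 |
| 2 | 62 | Male | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 | 3 |
| 3 | 58 | Male | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.0 | 1 | 3 |
| 4 | 72 | Male | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.4 | 1 | 3 |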


**Encoding Gender**

Encode (or convert) the `Male` and `Female` values for the `Gender` column to the numeric values, i.e., `0` and `1`. For this, we can use the `replace()` function.

In [None]:
# S2.10: Find the number of male and female patients before encoding.
df['Gender'].value_counts()

Male      441
Female    142
Name: Gender, dtype: int64

In [None]:
# S2.11: Encode the 'Male' & 'Female' values for the 'Gender' column to the numeric values, i.e. 0 and 1.
df['Gender'] = df['Gender'].replace({'Male': 0, 'Female': 1})

**Note:** The `replace()` function returns a new Series and leaves the original values untouched, so we assign the result back to `df['Gender']` to keep the encoded values. Calling `replace()` with `inplace = True` on a column selected from a DataFrame may not update `df` in recent pandas versions, so the explicit assignment is the safer form.

In [None]:
# S2.12: Find the number of male and female patients after encoding.
df['Gender'].value_counts()

0    441
1    142
Name: Gender, dtype: int64

---

#### Activity 3: Computing Probabilities

Let's answer the following questions:

1. What is the probability that a patient is a juvenile?

2. What is the probability that a patient is an adult?

3. What is the probability that a patient is an elderly?

4. What is the probability that a patient has the liver disease given that the patient is a juvenile?

5. What is the probability that a patient has the liver disease given that the patient is an adult?

6. What is the probability that a patient has the liver disease given that the patient is an elderly?

In [None]:
# S3.1 Find the number of patients who are juveniles, adults and elderlies.
df['age_group'].value_counts()

2    328
3    230
1     25
Name: age_group, dtype: int64

Juveniles are very few in number.

In [None]:
# S3.2: What is the probability that a patient is a juvenile? What is the probability that a patient is an adult?
# What is the probability that a patient is an elderly?
prob_juve = sum(df['age_group'] == 1) / df.shape[0]
prob_adult = sum(df['age_group'] == 2) / df.shape[0]
prob_elderly = sum(df['age_group'] == 3) / df.shape[0]

As expected, adults form the largest group (328 of the 583 patients), so the probability that a patient is an adult is the greatest amongst the three age groups.

The sum of the above three probabilities should be one.

In [None]:
# S3.3: Calculate the sum of the above three probabilities.
print(prob_juve)
print(prob_adult)
print(prob_elderly)

sum_prob = prob_juve + prob_adult + prob_elderly
sum_prob

0.04288164665523156
0.5626072041166381
0.39451114922813035


1.0
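The three probabilities can also be obtained in one step with `value_counts(normalize=True)`. The sketch below rebuilds an `age_group` series from the counts printed above (25, 328 and 230) so that it runs without the dataset.

```python
import pandas as pd

# Rebuild the age_group column from the counts shown in S3.1.
age_group = pd.Series([1] * 25 + [2] * 328 + [3] * 230)

# normalize=True divides each count by the total number of rows.
probs = age_group.value_counts(normalize=True).sort_index()

print(probs.round(4).tolist())  # [0.0429, 0.5626, 0.3945]
```

Because every patient falls into exactly one age group, the proportions returned this way sum to 1.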

The remaining three probabilities

- What is the probability that a patient has the liver disease given that the patient is a juvenile?

- What is the probability that a patient has the liver disease given that the patient is an adult?

- What is the probability that a patient has the liver disease given that the patient is an elderly?

are **conditional probabilities** because they have a condition involved in them.

- In the first case, the condition is that a patient is a juvenile.

- In the second case, the condition is that a patient is an adult.

- In the third case, the condition is that a patient is an elderly.

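So to solve the first case, we first count the juveniles who have the disease and divide by the total number of juveniles. This ratio is the conditional probability that a patient has the liver disease given that the patient is a juvenile. Multiplying it by the probability that a patient is a juvenile then gives the joint probability that a patient is both a juvenile and has the disease, which is the value printed in the cells below.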
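In symbols, the relationship can be sketched as follows. The letters $D$ (the patient has the disease) and $J$ (the patient is a juvenile) are labels introduced here for illustration; they do not appear elsewhere in the lesson.

$$P(D \mid J) = \frac{n(D \cap J)}{n(J)} = \frac{P(D \cap J)}{P(J)}$$

Rearranging gives the product used in the code cells:

$$P(D \cap J) = P(D \mid J) \times P(J)$$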

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/prob_juve_img1.png' width=700>

We will repeat the same process for the adults and the elderlies as well.

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/prob_juve_img2.png' width=700>




In [None]:
# S3.4: Find the probability that a patient has the liver disease given that they are a juvenile.
# Also, find the probability that a patient doesn't have the liver disease given that they are a juvenile.
df_juve = df[df['age_group'] == 1]
df_adult = df[df['age_group'] == 2]
df_elderly = df[df['age_group'] == 3]

prob_disease_juve = sum(df_juve['Disease'] == 1) / df_juve.shape[0]
prob_not_disease_juve = sum(df_juve['Disease'] == 2) / df_juve.shape[0]

print(prob_disease_juve * prob_juve)
print(prob_not_disease_juve * prob_juve)

0.020583190394511147
0.022298456260720412


The probability tree now looks like this

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/prob_juve_img3.png' width=700>


In [None]:
# S3.5: Find the probability that a patient has the liver disease given that they are an adult.
# Also, find the probability that a patient doesn't have the liver disease given that they are an adult.
# In the 'Disease' column, 1 means disease and 2 means no disease; the denominator is the number of adults.
prob_disease_adult = sum(df_adult['Disease'] == 1) / df_adult.shape[0]
prob_not_disease_adult = sum(df_adult['Disease'] == 2) / df_adult.shape[0]

print(round(prob_disease_adult * prob_adult, 4))
print(round(prob_not_disease_adult * prob_adult, 4))

0.3928
0.1698


The probability tree now looks like this

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/prob_juve_img4.png' width=700>


In [None]:
# S3.6: Find the probability that a patient has the liver disease given that they are an elderly.
# Also, find the probability that a patient doesn't have the liver disease given that they are an elderly.
# In the 'Disease' column, 1 means disease and 2 means no disease.
prob_disease_elderly = sum(df_elderly['Disease'] == 1) / df_elderly.shape[0]
prob_not_disease_elderly = sum(df_elderly['Disease'] == 2) / df_elderly.shape[0]

# Cross-check for the juveniles: directly count the rows that are both juvenile and diseased.
juve = df.loc[(df['age_group'] == 1) & (df['Disease'] == 1), 'age_group'].shape[0] / df.shape[0]
print(juve)

print(prob_disease_elderly * prob_elderly)
print(prob_not_disease_elderly * prob_elderly)

0.02058319039451115
0.30017152658662094
0.09433962264150944


The probability tree now looks like this

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/prob_juve_img5.png' width=700>


You can collect the favourable cases, i.e., the juvenile patients having the disease, using the `&` operator and then calculate the probabilities.

In [None]:
# T3.1: Find the probability that a patient has the liver disease given that they are a juvenile.
# Also, find the probability that a patient doesn't have the liver disease given that they are a juvenile.


**Note:** The `&` operator simply selects the rows for which both conditions are true, so counting rows this way is valid whether or not the two events are independent of each other. Independence matters only if we want to obtain the probability of both events occurring by multiplying their individual probabilities. Statistically, the proportion of patients having the disease differs between the age groups in this dataset, so age and liver disease are associated. Still, it doesn't mean that age is the cause of liver diseases.

It is the same as saying that tall players have an advantage over short players in basketball, yet not all tall players are good basketball players.
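One way to check whether two events are independent is to compare the probability of both occurring with the product of their individual probabilities; for independent events the two are equal. The sketch below does this for juveniles. The count of 12 juveniles with the disease is not printed in the lesson; it is derived here from the S3.4 output (0.0205831... × 583).

```python
total = 583
n_disease = 416           # patients with the disease (from the introduction)
n_juvenile = 25           # juveniles (from the value_counts() output)
n_juvenile_disease = 12   # derived from the S3.4 output: 0.0205831... * 583

p_joint = n_juvenile_disease / total                     # P(juvenile and disease)
p_product = (n_disease / total) * (n_juvenile / total)   # P(disease) * P(juvenile)

print(round(p_joint, 4))     # 0.0206
print(round(p_product, 4))   # 0.0306
```

The two values differ, which suggests that being a juvenile and having the disease are not independent events in this dataset.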

In [None]:
# S3.7: Repeat the calculation with the `&` operator for the adults and the elderly.
adult = df.loc[(df['age_group'] == 2) & (df['Disease'] == 1), 'age_group'].shape[0] / df.shape[0]
print(adult)

adult_no_disease = df.loc[(df['age_group'] == 2) & (df['Disease'] == 2), 'age_group'].shape[0] / df.shape[0]
print(adult_no_disease)

elderly = df.loc[(df['age_group'] == 3) & (df['Disease'] == 1), 'age_group'].shape[0] / df.shape[0]
print(elderly)

elderly_no_disease = df.loc[(df['age_group'] == 3) & (df['Disease'] == 2), 'age_group'].shape[0] / df.shape[0]
print(elderly_no_disease)

0.3927958833619211
0.16981132075471697
0.30017152658662094
0.09433962264150944


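So these values match the ones calculated earlier. Each of them is the joint probability that a patient belongs to an age group and has (or does not have) the disease, i.e. the conditional probability multiplied by the probability of the age group.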
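Both kinds of probability can be tabulated at once with `pd.crosstab()`. This is a minimal sketch that runs without the dataset: the frame `toy` is rebuilt from counts derived from the outputs above (each printed probability multiplied by 583), so treat those counts as an assumption.

```python
import pandas as pd

# (age_group, Disease) -> number of patients, derived from the printed probabilities * 583.
counts = {(1, 1): 12, (1, 2): 13, (2, 1): 229, (2, 2): 99, (3, 1): 175, (3, 2): 55}
rows = [(group, disease) for (group, disease), n in counts.items() for _ in range(n)]
toy = pd.DataFrame(rows, columns=['age_group', 'Disease'])

# normalize='all' divides by the grand total: P(age group and disease status).
joint = pd.crosstab(toy['age_group'], toy['Disease'], normalize='all')

# normalize='index' divides each row by its own total: P(disease status | age group).
conditional = pd.crosstab(toy['age_group'], toy['Disease'], normalize='index')

print(joint.round(4))
print(conditional.round(4))
```

Every row of `conditional` sums to 1, while all six cells of `joint` together sum to 1.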

---