# Conditional Probability

The formula for conditional probability is

P(A|B) = P(A ∩ B) / P(B)

The parts:

P(A|B) = probability of A occurring, given that B occurs

P(A ∩ B) = probability of both A and B occurring

P(B) = probability of B occurring
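
To make the formula concrete before turning to the student data, here is a minimal sketch on a toy case that is not part of the dataset: one roll of a fair six-sided die, with A = "the roll is even" and B = "the roll is greater than 3". The event definitions and variable names are illustrative assumptions.

```python
from fractions import Fraction

# Toy sample space (an assumption for illustration): one roll of a fair die.
outcomes = set(range(1, 7))

event_a = {x for x in outcomes if x % 2 == 0}  # A: roll is even
event_b = {x for x in outcomes if x > 3}       # B: roll is greater than 3

# P(B) and P(A ∩ B) as exact fractions of the sample space.
p_b = Fraction(len(event_b), len(outcomes))
p_a_and_b = Fraction(len(event_a & event_b), len(outcomes))

# P(A|B) = P(A ∩ B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 2/3
```

Knowing that the roll is greater than 3 leaves three equally likely outcomes (4, 5, 6), two of which are even, which matches the 2/3 from the formula.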

Calculate the probability that a student gets an A (80%+) in math, given that they miss 10 or more classes.

In [13]:
import pandas as pd
import numpy as np
df = pd.read_csv('C:/Users/DELL/Desktop/MSC.CS/student-mat.csv')
df.head(4)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15


Check the number of records.

In [3]:
len(df)

395

We’re only concerned with two columns: absences (number of absences) and G3 (final grade from 0 to 20).

Let’s create a couple of new boolean columns based on these columns to make our lives easier.

Add a boolean column called grade_A noting if a student achieved 80% or higher as a final score.

Original values are on a 0–20 scale so we multiply by 5.



In [4]:
df['grade_A'] = np.where(df['G3']*5 >= 80, 1, 0)
df.head(4)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,grade_A
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,3,4,1,1,3,6,5,6,6,0
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,3,3,1,1,3,4,5,5,6,0
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,3,2,2,3,3,10,7,8,10,0
3,GP,F,15,U,GT3,T,4,2,health,services,...,2,2,1,1,5,2,15,14,15,0




Make another boolean column called high_absenses with a value of 1 if a student missed 10 or more classes.


In [5]:
df['high_absenses'] = np.where(df['absences'] >= 10, 1, 0)
df.head(4)


Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,goout,Dalc,Walc,health,absences,G1,G2,G3,grade_A,high_absenses
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,1,1,3,6,5,6,6,0,0
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,3,1,1,3,4,5,5,6,0,0
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,2,2,3,3,10,7,8,10,0,1
3,GP,F,15,U,GT3,T,4,2,health,services,...,2,1,1,5,2,15,14,15,0,0


Add one more column to make building a pivot table easier.



In [6]:
df['count'] = 1
df.head(4)


Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,Dalc,Walc,health,absences,G1,G2,G3,grade_A,high_absenses,count
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,1,1,3,6,5,6,6,0,0,1
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,1,1,3,4,5,5,6,0,0,1
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,2,3,3,10,7,8,10,0,1,1
3,GP,F,15,U,GT3,T,4,2,health,services,...,1,1,5,2,15,14,15,0,0,1


And drop all columns we don’t care about.



In [7]:
df = df[['grade_A','high_absenses','count']]
df.head()


Unnamed: 0,grade_A,high_absenses,count
0,0,0,1
1,0,0,1
2,0,1,1
3,0,0,1
4,0,0,1


Now we’ll create a pivot table from this.



In [8]:
table = pd.pivot_table(
    df, 
    values='count', 
    index=['grade_A'], 
    columns=['high_absenses'], 
    aggfunc=np.size, 
    fill_value=0
)
print(table)


high_absenses    0   1
grade_A               
0              277  78
1               35   5
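
As an alternative sketch, the same contingency table can be built with `pd.crosstab`, which counts co-occurrences directly and so does not need the helper `count` column. This is a swapped-in technique, not the approach used above. Since the original CSV is not assumed to be available here, the two flag columns are rebuilt from the four cell counts printed in the pivot table; the names `cells`, `flags` and `table_ct` are hypothetical.

```python
import pandas as pd

# Cell counts copied from the pivot table output above:
# keys are (grade_A, high_absenses) pairs.
cells = {(0, 0): 277, (0, 1): 78, (1, 0): 35, (1, 1): 5}

# Expand the counts back into one row per student (hypothetical frame `flags`).
rows = [
    {"grade_A": grade, "high_absenses": absent}
    for (grade, absent), n in cells.items()
    for _ in range(n)
]
flags = pd.DataFrame(rows)

# crosstab counts how often each (grade_A, high_absenses) pair occurs.
table_ct = pd.crosstab(flags["grade_A"], flags["high_absenses"])
print(table_ct)
```

The resulting counts should match the pivot table cell for cell.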


We now have all the data we need to do our calculation. Let’s start by calculating each individual part in the formula. In our case:

P(A) is the probability of a grade of 80% or greater.

P(B) is the probability of missing 10 or more classes.

P(A|B) is the probability of an 80%+ grade, given missing 10 or more classes.

Calculations of parts:

P(A) = (35 + 5) / (35 + 5 + 277 + 78) = 0.10126582278481013

P(B) = (78 + 5) / (35 + 5 + 277 + 78) = 0.21012658227848102

P(A ∩ B) = 5 / (35 + 5 + 277 + 78) = 0.012658227848101266

Per the formula, P(A|B) = P(A ∩ B) / P(B), put it together:

P(A|B) = 0.012658227848101266 / 0.21012658227848102 ≈ 0.06

There we have it. The probability of getting at least an 80% final grade, given missing 10 or more classes, is about 6%, compared with about 10% for all students (P(A)).

Conclusion: the lesson from this example is clear. Go to class if you want good grades.
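
The hand calculation can be checked with a few lines of plain Python. This is a minimal sketch that takes the four cell counts from the pivot table as given; the variable names are illustrative assumptions.

```python
# Cell counts from the pivot table: (grade_A, high_absenses) -> students.
counts = {(0, 0): 277, (0, 1): 78, (1, 0): 35, (1, 1): 5}
total = sum(counts.values())  # 395 records

# P(A): grade of 80% or greater, regardless of absences.
p_a = (counts[(1, 0)] + counts[(1, 1)]) / total

# P(B): 10 or more absences, regardless of grade.
p_b = (counts[(0, 1)] + counts[(1, 1)]) / total

# P(A ∩ B): both an A grade and 10 or more absences.
p_a_and_b = counts[(1, 1)] / total

# P(A|B) = P(A ∩ B) / P(B)
p_a_given_b = p_a_and_b / p_b

print(round(p_a, 4), round(p_b, 4), round(p_a_and_b, 4), round(p_a_given_b, 4))
```

Because the total cancels in the final division, P(A|B) is simply 5 / (78 + 5), the share of A grades within the high-absence column.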

Add the total number of records to the pivot table as a new column (named P[total], although it holds a count, not a probability).



In [9]:
table['P[total]'] = (len(df))
print (table)

high_absenses    0   1  P[total]
grade_A                         
0              277  78       395
1               35   5       395


Show the probability of a grade of 80% or greater as a new column in the table.



In [10]:
table['P[grade_80%_or_greater]'] = (df.grade_A.sum()/len(df))
print(table)

high_absenses    0   1  P[total]  P[grade_80%_or_greater]
grade_A                                                  
0              277  78       395                 0.101266
1               35   5       395                 0.101266


Show the probability of high absences as a new column in the table.



In [11]:
table['P[high_absences]'] = (df.high_absenses.sum()/len(df))
print(table)

high_absenses    0   1  P[total]  P[grade_80%_or_greater]  P[high_absences]
grade_A                                                                    
0              277  78       395                 0.101266          0.210127
1               35   5       395                 0.101266          0.210127


Show the joint probability of high absences and a grade of A as a new column in the table.



In [12]:
# Fraction of all students with both grade_A == 1 and high_absenses == 1
table['P[high_absenses and grade_A]'] = ((df['grade_A'] == 1) & (df['high_absenses'] == 1)).sum() / len(df)
print(table)

high_absenses    0   1  P[total]  P[grade_80%_or_greater]  P[high_absences]  \
grade_A                                                                       
0              277  78       395                 0.101266          0.210127   
1               35   5       395                 0.101266          0.210127   

high_absenses  P[high_absenses and grade_A]  
grade_A                                      
0                                  0.012658  
1                                  0.012658  


Now, P(getting 80% or more | student has missed 10 or more classes) = P(high_absenses and grade_A) / P(high_absences).

In [14]:
table['P[Getting 80% | student has high absenses]'] = table['P[high_absenses and grade_A]']/table['P[high_absences]']
print(table)

high_absenses    0   1  P[total]  P[grade_80%_or_greater]  P[high_absences]  \
grade_A                                                                       
0              277  78       395                 0.101266          0.210127   
1               35   5       395                 0.101266          0.210127   

high_absenses  P[high_absenses and grade_A]  \
grade_A                                       
0                                  0.012658   
1                                  0.012658   

high_absenses  P[Getting 80% | student has high absenses]  
grade_A                                                    
0                                                0.060241  
1                                                0.060241  


The final probability is 0.060241, or about 6%.
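
Each probability above is a single number, so storing it as a table column repeats the same value on both rows. A shorter route, sketched here as an alternative rather than the method used above, is to filter to the high-absence students and take the mean of the 0/1 `grade_A` flag. The frame `flags` is a hypothetical stand-in rebuilt from the pivot table counts, not the original CSV.

```python
import pandas as pd

# Rebuild the two 0/1 flag columns from the pivot table counts
# (277, 78, 35 and 5 students per cell), in that order.
flags = pd.DataFrame({
    "grade_A":       [0] * 277 + [0] * 78 + [1] * 35 + [1] * 5,
    "high_absenses": [0] * 277 + [1] * 78 + [0] * 35 + [1] * 5,
})

# Restrict to students with 10 or more absences (event B), then take the
# mean of the grade_A flag: the fraction of those students who got an A.
p_a_given_b = flags.loc[flags["high_absenses"] == 1, "grade_A"].mean()
print(round(p_a_given_b, 6))  # 0.060241
```

Conditioning on B amounts to shrinking the sample to the rows where B holds, which is why the mean over that subset equals P(A ∩ B) / P(B).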