# Simpson's Paradox
Use `admission_data.csv` for this exercise.

In [2]:
# Load and view first few lines of dataset
import pandas as pd
import numpy as np

df = pd.read_csv('admission_data.csv')
df.head()

Unnamed: 0,student_id,gender,major,admitted
0,35377,female,Chemistry,False
1,56105,male,Physics,True
2,31441,female,Chemistry,False
3,51765,male,Physics,True
4,53714,female,Physics,True


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
student_id    500 non-null int64
gender        500 non-null object
major         500 non-null object
admitted      500 non-null bool
dtypes: bool(1), int64(1), object(2)
memory usage: 12.3+ KB


In [15]:
df.shape[0], len(df)

(500, 500)

### Proportion and admission rate for each gender

In [18]:
# Proportion of students that are female
num_f = len(df[df['gender'] == 'female']['gender'])
ttl = len(df['gender'])
prop_f = num_f / ttl
prop_f

0.514

In [20]:
# Proportion of students that are male
num_m = len(df[df['gender'] == 'male']['gender'])
prop_m = num_m / ttl
prop_m

0.486

In [57]:
# Admission rate for females
admin_f = len(df[(df['gender'] == 'female') & (df['admitted'])])
admin_f / num_f, admin_f

(0.28793774319066145, 74)

In [61]:
# Admission rate for males
admin_m = len(df[(df['gender'] == 'male') & (df['admitted'])])
admin_m / num_m, admin_m

(0.48559670781893005, 118)
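
The same proportions and rates can also be computed in one pass with `value_counts` and `groupby`. The sketch below is only an illustration: `demo` is a hypothetical stand-in for `df`, rebuilt from the counts printed above (257 female students with 74 admitted, 243 male students with 118 admitted), so that the block runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for df, rebuilt from the counts printed above:
# 257 female students (74 admitted) and 243 male students (118 admitted).
demo = pd.DataFrame({
    'gender': ['female'] * 257 + ['male'] * 243,
    'admitted': [True] * 74 + [False] * 183 + [True] * 118 + [False] * 125,
})

# Share of students belonging to each gender
gender_share = demo['gender'].value_counts(normalize=True)

# The mean of a boolean column is the fraction of True values,
# so this is the admission rate within each gender.
rate_by_gender = demo.groupby('gender')['admitted'].mean()

print(gender_share)
print(rate_by_gender)
```

With the real data, replacing `demo` with `df` should give the same numbers as the four cells above.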

### Proportion and admission rate for physics majors of each gender

In [29]:
df.head(3)

Unnamed: 0,student_id,gender,major,admitted
0,35377,female,Chemistry,False
1,56105,male,Physics,True
2,31441,female,Chemistry,False


In [38]:
# What proportion of female students are majoring in physics?
py_f = len(df[(df['gender'] == 'female') & (df['major'] == 'Physics')])
py_f / num_f

0.12062256809338522

In [52]:
# What proportion of male students are majoring in physics?
# Use query() to filter rows; count() returns one value per column

df.query("gender == 'male' & major == 'Physics'").count() / \
    (df.query("gender == 'male'").count())


student_id    0.925926
gender        0.925926
major         0.925926
admitted      0.925926
dtype: float64
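
Dividing two `count()` results yields a Series with the same ratio repeated for every column. If a single number is preferred, one option is to take the mean of a boolean mask. This is a sketch, not the notebook's method: `demo` is a hypothetical stand-in for `df`, with per-gender major counts inferred from the proportions printed above (for example, 0.925926 × 243 = 225 male physics majors).

```python
import pandas as pd

# Hypothetical stand-in for df; counts inferred from the printed proportions.
demo = pd.DataFrame({
    'gender': ['male'] * 243 + ['female'] * 257,
    'major': (['Physics'] * 225 + ['Chemistry'] * 18      # male students
              + ['Physics'] * 31 + ['Chemistry'] * 226),  # female students
})

# Filter once, then average a boolean mask to get a scalar proportion
males = demo.query("gender == 'male'")
prop_py_m = (males['major'] == 'Physics').mean()

print(prop_py_m)
```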

In [67]:
# Admission rate for female physics majors
df.query("gender == 'female' & major == 'Physics' & admitted").count() / \
    df.query("gender == 'female' & major == 'Physics'").count()

student_id    0.741935
gender        0.741935
major         0.741935
admitted      0.741935
dtype: float64

In [68]:
# Admission rate for male physics majors
df.query("gender == 'male' & major == 'Physics' & admitted").count() / \
    df.query("gender == 'male' & major == 'Physics'").count()

student_id    0.515556
gender        0.515556
major         0.515556
admitted      0.515556
dtype: float64

### Proportion and admission rate for chemistry majors of each gender

In [69]:
# What proportion of female students are majoring in chemistry?
df.query("gender == 'female' & major == 'Chemistry'").count() / num_f

student_id    0.879377
gender        0.879377
major         0.879377
admitted      0.879377
dtype: float64

In [70]:
# What proportion of male students are majoring in chemistry?
df.query("gender == 'male' & major == 'Chemistry'").count() / num_m

student_id    0.074074
gender        0.074074
major         0.074074
admitted      0.074074
dtype: float64

In [78]:
# Admission rate for female chemistry majors
admin_f_ch = len(df[(df['gender'] == 'female') & (df['major'] == 'Chemistry') & df['admitted']])
ch_f = len(df[(df['gender'] == 'female') & (df['major'] == 'Chemistry')])
admin_f_ch / ch_f

0.22566371681415928

In [79]:
# Admission rate for male chemistry majors
admin_m_ch = len(df[(df['gender'] == 'male') & (df['major'] == 'Chemistry') & df['admitted']])
ch_m = len(df[(df['gender'] == 'male') & (df['major'] == 'Chemistry')])
admin_m_ch / ch_m

0.1111111111111111
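
The four per-major, per-gender admission rates can be placed side by side with a two-key `groupby`. The block below is a minimal sketch under stated assumptions: `demo` and the `counts` dictionary are hypothetical, with applicant counts inferred from the proportions and rates printed in the cells above (for example, 0.741935 × 31 = 23 admitted female physics majors).

```python
import pandas as pd

# Hypothetical stand-in for df: number of students per
# (gender, major, admitted), inferred from the outputs above.
counts = {
    ('female', 'Physics', True): 23,
    ('female', 'Physics', False): 8,
    ('female', 'Chemistry', True): 51,
    ('female', 'Chemistry', False): 175,
    ('male', 'Physics', True): 116,
    ('male', 'Physics', False): 109,
    ('male', 'Chemistry', True): 2,
    ('male', 'Chemistry', False): 16,
}
rows = [key for key, n in counts.items() for _ in range(n)]
demo = pd.DataFrame(rows, columns=['gender', 'major', 'admitted'])

# Admission rate per (major, gender), with genders as columns
rate_table = (
    demo.groupby(['major', 'gender'])['admitted']
        .mean()
        .unstack('gender')
)
print(rate_table)
```

Each row of `rate_table` is a major and each column a gender, which makes it easy to compare the two genders within the same major.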

### Admission rate for each major

In [87]:
# Admission rate for physics majors
admin_py = len(df[(df['major'] == 'Physics') & df['admitted']]) / len(df[df['major'] == 'Physics'])
admin_py

0.54296875

In [89]:
# Admission rate for chemistry majors
admin_ch = len(df[(df['major'] == 'Chemistry') & df['admitted']]) / len(df[df['major'] == 'Chemistry'])
admin_ch

0.21721311475409835

## My reflection

Looking only at the admission rate for each gender, females are admitted at a rate of 28.79% and males at 48.56%, so males have the higher overall admission rate.

However, when we break the admission rate down by major within each gender, females have the higher admission rate in both majors.
Among physics majors, 74.19% of females are admitted versus 51.56% of males. Among chemistry majors, 22.57% of females are admitted versus 11.11% of males.

This reversal is Simpson's paradox: a trend that appears in the aggregated data reverses within every subgroup.

Finally, the admission rate is 54.30% for physics majors and 21.72% for chemistry majors, so physics has the much higher admission rate. This explains the reversal: 87.94% of female students are chemistry majors, the major with the lower admission rate, while 92.59% of male students are physics majors.
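
To make the arithmetic behind the reversal concrete, each gender's overall admission rate can be written as a weighted average of its per-major rates, where the weights are the share of that gender's students in each major. The sketch below assumes counts inferred from the proportions printed earlier in the notebook; the `groups` dictionary is a hypothetical helper, not part of the dataset.

```python
# For each gender and major: (share of that gender's students, admission rate).
# Counts are inferred from the printed outputs, e.g. 31 of 257 female
# students are physics majors and 23 of those 31 were admitted.
groups = {
    'female': {'Physics': (31 / 257, 23 / 31), 'Chemistry': (226 / 257, 51 / 226)},
    'male': {'Physics': (225 / 243, 116 / 225), 'Chemistry': (18 / 243, 2 / 18)},
}

# Overall rate = sum over majors of (share in major) * (rate in major)
overall = {
    gender: sum(share * rate for share, rate in majors.values())
    for gender, majors in groups.items()
}

for gender, rate in overall.items():
    print(f"{gender}: {rate:.4f}")
# female: 0.2879
# male: 0.4856
```

The female average is pulled toward the low chemistry rate because most female students are chemistry majors, while the male average sits close to the physics rate.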
