# Simpson's Paradox: Arizona University Admissions
This notebook demonstrates Simpson's paradox using the Arizona University admissions data.

## 1. Load Data

In [13]:
import pandas as pd

df = pd.read_csv('university.csv')

df.head(20)

Unnamed: 0,Admit,Gender,Dept,Freq
0,Admitted,Male,A,512
1,Rejected,Male,A,313
2,Admitted,Female,A,89
3,Rejected,Female,A,19
4,Admitted,Male,B,353
5,Rejected,Male,B,207
6,Admitted,Female,B,17
7,Rejected,Female,B,8
8,Admitted,Male,C,120
9,Rejected,Male,C,205


## 2. Overall Admission Rates

In [5]:
# Pivot table of counts
total = df.pivot_table(
    index='Gender',
    columns='Admit',
    values='Freq',
    aggfunc='sum'
)
# Compute acceptance rate
total['Acceptance_rate'] = total['Admitted'] / (total['Admitted'] + total['Rejected'])
print(total)


Admit   Admitted  Rejected  Acceptance_rate
Gender                                     
Female       557      1278         0.303542
Male        1198      1493         0.445188


## 3. Admission Rates by Department

In [12]:
# Pivot by Dept and Gender
dept = df.pivot_table(
    index=['Dept', 'Gender'],
    columns='Admit',
    values='Freq',
    aggfunc='sum'
)
# Compute acceptance rate
dept['Acceptance_rate'] = dept['Admitted'] / (dept['Admitted'] + dept['Rejected'])
# Reshape for clarity
dept_rates = dept.reset_index().pivot(
    index='Dept',
    columns='Gender',
    values='Acceptance_rate'
)
# print(dept_rates)
for dept_name, row in dept_rates.iterrows():
    print(f"{dept_name}: Male: {row['Male']*100:.1f}%, Female: {row['Female']*100:.1f}%")


A: Male: 62.1%, Female: 82.4%
B: Male: 63.0%, Female: 68.0%
C: Male: 36.9%, Female: 34.1%
D: Male: 33.1%, Female: 34.9%
E: Male: 27.7%, Female: 23.9%
F: Male: 5.9%, Female: 7.0%
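
The department-level comparison is easier to read as a signed gap. The following is a minimal sketch that copies the rounded percentages from the printed output above into a plain dictionary (the names `rates`, `women_higher`, and `men_higher` are illustrative, not part of the original notebook) and lists the departments in which each group has the higher rate.

```python
# Acceptance rates (%) per department, copied from the printed output above.
rates = {
    'A': {'Male': 62.1, 'Female': 82.4},
    'B': {'Male': 63.0, 'Female': 68.0},
    'C': {'Male': 36.9, 'Female': 34.1},
    'D': {'Male': 33.1, 'Female': 34.9},
    'E': {'Male': 27.7, 'Female': 23.9},
    'F': {'Male': 5.9, 'Female': 7.0},
}

# Signed gap in percentage points: positive means women have the higher rate.
for dept_name, r in rates.items():
    gap = r['Female'] - r['Male']
    print(f"{dept_name}: female minus male = {gap:+.1f} points")

# Departments grouped by which gender has the higher acceptance rate.
women_higher = [d for d, r in rates.items() if r['Female'] > r['Male']]
men_higher = [d for d, r in rates.items() if r['Male'] > r['Female']]
print("Women higher:", women_higher)  # ['A', 'B', 'D', 'F']
print("Men higher:", men_higher)      # ['C', 'E']
```

Because the inputs are rounded to one decimal place, the gaps are approximate. In the notebook itself the same comparison could be made directly on `dept_rates`.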


## 4. Conclusion
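- **Simpson's Paradox:** Aggregated data shows men admitted at a higher rate (44.5% vs. 30.4%), whereas within departments the gap largely disappears or reverses: women have the higher rate in four of the six departments (A, B, D, F) and trail only in C and E, by about 2.8 and 3.8 percentage points.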

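- **Explanation Validity:** The application distribution is consistent with this explanation: women applied mostly to competitive departments with low acceptance rates, which lowers their overall acceptance rate. From the counts above, only 133 of 1,835 female applicants (about 7%) applied to departments A and B, the two with the highest acceptance rates, compared with 1,385 of 2,691 male applicants (about 51%).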
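
The application-share claim can be checked with a short calculation. The sketch below is one possible way to do it, not necessarily the analysis the notebook originally ran. The helper `application_share` is a hypothetical name and is meant to be called on the full `df`. The worked part uses only numbers printed earlier: the eight rows for departments A and B from Section 1 and the per-gender totals from the Section 2 output.

```python
import pandas as pd

def application_share(df: pd.DataFrame) -> pd.DataFrame:
    """Fraction of each gender's applicants that applied to each department.

    Hypothetical helper; expects the Admit/Gender/Dept/Freq columns shown in Section 1.
    """
    # Applicants = admitted + rejected, so sum Freq over both Admit values.
    applicants = df.pivot_table(
        index='Dept', columns='Gender', values='Freq', aggfunc='sum'
    )
    # Divide each gender column by its total so the shares sum to 1 per gender.
    return applicants / applicants.sum()

# Rows for departments A and B only, copied from the df.head(20) output.
ab_rows = pd.DataFrame({
    'Admit': ['Admitted', 'Rejected'] * 4,
    'Gender': ['Male', 'Male', 'Female', 'Female'] * 2,
    'Dept': ['A'] * 4 + ['B'] * 4,
    'Freq': [512, 313, 89, 19, 353, 207, 17, 8],
})

# Total applicants per gender (Admitted + Rejected), from the Section 2 output.
totals = pd.Series({'Female': 557 + 1278, 'Male': 1198 + 1493})

# Share of each gender's applicants who applied to A or B.
ab_applicants = ab_rows.groupby('Gender')['Freq'].sum()
share_ab = ab_applicants / totals
print(share_ab.round(3))  # Female ≈ 0.072, Male ≈ 0.515
```

With the full file loaded, `application_share(df)` would give the same kind of breakdown for all six departments at once.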

Subgroup differences in application patterns can invert apparent trends in aggregated data, which makes this a textbook example of Simpson's paradox.
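
The mechanism can be stated as a weighted average. The notation here is introduced only for illustration and does not appear in the original notebook: $n_{g,d}$ is the number of applicants of gender $g$ to department $d$, and $r_{g,d}$ is the corresponding acceptance rate.

$$
r_g = \sum_{d} w_{g,d}\, r_{g,d},
\qquad
w_{g,d} = \frac{n_{g,d}}{\sum_{d'} n_{g,d'}}
$$

Each gender's overall rate $r_g$ depends on the weights $w_{g,d}$ as well as on the department rates. One group can therefore match or exceed the other in most departments and still have the lower overall rate, if its weights are concentrated in departments where $r_{g,d}$ is small.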