In [1]:
%%capture
%run merging_data.ipynb

In [2]:
# Combining the datasets from 2016 to 2023

dataframes_list = []

for year in years:
    dataframes_list.append(merged_years[year])

combined_years = pd.concat(dataframes_list, ignore_index=True)

combined_years

Unnamed: 0,City,status,Household Income,All,African American,American Indian,Hispanic/ Latinx,Pacific Islander,Asian,White,Domestic Unknown,Int'l,Female,Male,Other,Unknown,Measure Values
0,Alameda,Adm,131116,220.0,10.0,0.0,8.0,0.0,140.0,39.0,3.0,0.0,120.0,98.0,0.0,0.0,3.949851
1,Alameda,App,131116,292.0,19.0,0.0,14.0,0.0,174.0,52.0,5.0,0.0,158.0,132.0,0.0,0.0,3.950483
2,Alhambra,Adm,72406,284.0,0.0,0.0,15.0,0.0,255.0,0.0,0.0,0.0,151.0,123.0,0.0,8.0,3.743325
3,Alhambra,App,72406,381.0,0.0,0.0,44.0,0.0,317.0,0.0,0.0,0.0,203.0,164.0,0.0,10.0,3.743359
4,Anaheim,Adm,85133,447.0,4.0,0.0,213.0,0.0,141.0,62.0,4.0,0.0,259.0,179.0,0.0,0.0,3.829754
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,Whittier,App,91457,546.0,12.0,0.0,464.0,0.0,25.0,29.0,0.0,0.0,346.0,184.0,0.0,0.0,4.002486
2220,Yorba Linda,Adm,151534,102.0,0.0,0.0,15.0,0.0,50.0,26.0,7.0,0.0,56.0,39.0,0.0,0.0,4.002059
2221,Yorba Linda,App,151534,157.0,0.0,0.0,24.0,0.0,71.0,47.0,10.0,0.0,90.0,58.0,0.0,0.0,4.002059
2222,Yuba City,Adm,59588,96.0,0.0,0.0,24.0,0.0,47.0,17.0,0.0,0.0,59.0,36.0,0.0,0.0,4.003958


#### 1. Gender

In this test, we would focus on whether the admission counts are affected by genders. 

Null hypothesis $H_0$: The observed distribution of admissions by gender (Male and Female) matches the expected distribution based on the proportions of applications by gender.

**Test combining over years:**

In [3]:
# Count the genders of applicants and addmissions
observed_adm_gender = combined_years[combined_years['status'] == 'Adm'][['Female', 'Male']].sum()
observed_app_gender = combined_years[combined_years['status'] == 'App'][['Female', 'Male']].sum()

# Gender counts for applicants
observed_app_gender

Female    299198.0
Male      215768.0
dtype: float64

In [4]:
# Gender counts for admissions
observed_adm_gender

Female    200338.0
Male      138401.0
dtype: float64

In [5]:
import scipy.stats as stats

# Calculate the total applicants
total_app_gender = observed_app_gender.sum()

# Calculate the expected admissions by gender propotions of applicants
expected_adm = (observed_app_gender / total_app_gender) * observed_adm_gender.sum()

chi2_adm, p_adm = stats.chisquare(f_obs=observed_adm_gender, f_exp=expected_adm)
print(f"Chi2 statistic: {chi2_adm}, p-value: {p_adm}")

Chi2 statistic: 151.01063276210112, p-value: 1.0424716293902935e-34


**Test over each year:**

In [6]:
for year in years:
    # Count the genders of applicants and addmissions
    observed_adm_gender = merged_years[year][merged_years[year]['status'] == 'Adm'][['Female', 'Male']].sum()
    observed_app_gender = merged_years[year][merged_years[year]['status'] == 'App'][['Female', 'Male']].sum()
    
    total_app_gender = observed_app_gender.sum()
    
    # Calculate the expected admissions by gender propotions of applicants
    expected_adm = (observed_app_gender / total_app_gender) * observed_adm_gender.sum()
    
    chi2_adm, p_adm = stats.chisquare(f_obs=observed_adm_gender, f_exp=expected_adm)
    print(f"{year} - Chi2 statistic: {chi2_adm}, p-value: {p_adm}")

2016 - Chi2 statistic: 6.6909367562669, p-value: 0.009690427294448939
2017 - Chi2 statistic: 2.0644056444608867, p-value: 0.15077370241519603
2018 - Chi2 statistic: 1.2370309908513248, p-value: 0.26604402244554654
2019 - Chi2 statistic: 6.9519884050029495, p-value: 0.008372612190307988
2020 - Chi2 statistic: 10.862275429610392, p-value: 0.0009814303302985831
2021 - Chi2 statistic: 53.62328908660018, p-value: 2.4286330005664483e-13
2022 - Chi2 statistic: 62.37158889309555, p-value: 2.8439661969341823e-15
2023 - Chi2 statistic: 59.14647319255896, p-value: 1.4636143772073543e-14


**Analysis:**

1. The combined datasets from 2016 to 2023:

> The chi-square statistic measures the discrepancy between the observed and expected frequencies. In this case, a chi-square statistic of $156.334$ is quite high, indicating a large difference between the observed and expected admission counts by gender.

> Given that the `p-value` is significantly smaller than $0.005$, we reject the null hypothesis. This means there is strong evidence that the observed distribution of admissions by gender does not match the expected distribution based on the application proportion. 

2. The individual tests for each years:

> When we observe the `p-value` for each year, we see fluctuations from each years. From 2016 and 2019, the `p-value` is above $0.005$, suggesting that the observed distribution of admissions by gender matches the expected distribution based on the application proportion. However, in most of the other years, especially entering the pandemic era, the `p-value` strongly suggests that there's a relationship between gender and admission. 

#### 2. Race

In this test, we would focus on whether the admission counts are affected by races. 

Null hypothesis $H_0$: The observed distributions of admissions by races match the expected distribution based on the proportions of applications by races.

**Test combining over years:**

In [9]:
races = ['African American', 'American Indian', 'Hispanic/ Latinx', 'Pacific Islander', 'Asian', 'White', 'Domestic Unknown', 'Int\'l']

# Count the races of applicants and addmissions
observed_adm_race = combined_years[combined_years['status'] == 'Adm'][races].sum()
observed_app_race = combined_years[combined_years['status'] == 'App'][races].sum()

# Race counts for applicants
observed_adm_race

African American     15911.0
American Indian         41.0
Hispanic/ Latinx    126780.0
Pacific Islander        17.0
Asian               125226.0
White                54830.0
Domestic Unknown      3761.0
Int'l                 1329.0
dtype: float64

In [10]:
# Race counts for admissions
observed_app_race

African American     31055.0
American Indian         85.0
Hispanic/ Latinx    211196.0
Pacific Islander        44.0
Asian               168244.0
White                85010.0
Domestic Unknown      5153.0
Int'l                 1604.0
dtype: float64

In [12]:
# Calculate the total applicants
total_app_race = observed_app_race.sum()

# Calculate the expected admissions by race propotions of applicants
expected_adm_race = (observed_app_race / total_app_race) * observed_adm_race.sum()

chi2_race, p_race = stats.chisquare(f_obs=observed_adm_race, f_exp=expected_adm_race)

print(f"Chi2 statistic: {chi2_race}, p-value: {p_race}")

Chi2 statistic: 4128.724616605838, p-value: 0.0


**Test over each year:**

In [13]:
for year in years:
    # Calculate the observed frequencies for each race in admissions
    observed_adm_race = merged_years[year][merged_years[year]['status'] == 'Adm'][races].sum()
    
    # Calculate the observed frequencies for each race in applications
    observed_adm_race = merged_years[year][merged_years[year]['status'] == 'App'][races].sum()
    
    # Calculate the total number of applicants
    total_app_race = observed_app_race.sum()
    
    # Calculate the expected admissions based on the proportions of applications
    expected_adm_race = (observed_app_race / total_app_race) * observed_adm_race.sum()
    
    # Ensure the sums match
    observed_adm_sum = observed_adm_race.sum()
    expected_adm_sum = expected_adm_race.sum()
    
    # Adjust the expected frequencies to ensure they match the observed sum
    if not observed_adm_sum == expected_adm_sum:
        expected_adm_race *= (observed_adm_sum / expected_adm_sum)
    
    # Handle zero expected frequencies by replacing them with a small value
    expected_adm_race[expected_adm_race == 0] = 1e-10
    
    chi2_race, p_race = stats.chisquare(f_obs=observed_adm_race, f_exp=expected_adm_race)
    print(f"{year} - Chi2 statistic: {chi2_race}, p-value: {p_race}")

2016 - Chi2 statistic: 254.15219797266369, p-value: 3.619888684018112e-51
2017 - Chi2 statistic: 106.39029502958431, p-value: 5.143332640731386e-20
2018 - Chi2 statistic: 86.20582178442841, p-value: 7.424327901914899e-16
2019 - Chi2 statistic: 11.934116886577225, p-value: 0.10274639344358868
2020 - Chi2 statistic: 73.07340622110175, p-value: 3.527427950188013e-13
2021 - Chi2 statistic: 32.04646317489002, p-value: 3.981820356480029e-05
2022 - Chi2 statistic: 59.55080935096963, p-value: 1.85569808006444e-10
2023 - Chi2 statistic: 110.96888978326466, p-value: 5.779820482075561e-21


**Analysis:**

1. The combined datasets from 2016 to 2023:

> The chi-square statistic measures the discrepancy between the observed and expected frequencies. In this case, a chi-square statistic of $3452.518$ is quite high, indicating a large difference between the observed and expected admission counts by races.

> Given that the `p-value` is significantly smaller than $0.005$, we reject the null hypothesis. This means there is strong evidence that the observed distribution of admissions by race does not match the expected distribution based on the application proportion. Also, since the `p-value` is $0$, which is a rare occasion, we need to take a look in test over each year.

2. The individual tests for each years:

> When we observe the `p-value` for each year, we see fluctuations from each years, but every year except 2019 the `p-value` is significant small, which explain why the `p-value` in test over combining years becomes $0$. It also strongly suggests that there's a relationship between race and admission. 

#### 3. Household Income

In this test, we would focus on whether the admission counts are affected by income level. 

Null hypothesis $H_0$: The observed distributions of admissions by income level match the expected distribution based on the proportions of applications by income level.

To group the income level, we will define the income bins and lables scaling by $10000$:

In [14]:
# Define the income bins and labels
bins = np.arange(0, combined_years['Household Income'].max() + 10000, 10000)
labels = [f'{int(b)}-{int(b+10000)}' for b in bins[:-1]]

**Test combining over years:**

In [15]:
# Group the income into bins
combined_years['Income Group'] = pd.cut(combined_years['Household Income'], bins=bins, labels=labels, right=False)

# Calculate the observed frequencies for each income group in admissions and applications
observed_adm_income = combined_years[combined_years['status'] == 'Adm']['Income Group'].value_counts().sort_index()
observed_app_income = combined_years[combined_years['status'] == 'App']['Income Group'].value_counts().sort_index()

# Calculate the total number of applicants
total_app_income = observed_app_income.sum()

# Calculate the expected admissions based on the proportions of applications
expected_adm_income = (observed_app_income / total_app_income) * observed_adm_income.sum()

# Ensure the sums match
observed_adm_sum = observed_adm_income.sum()
expected_adm_sum = expected_adm_income.sum()

# Adjust the expected frequencies to ensure they match the observed sum
if not observed_adm_sum == expected_adm_sum:
    expected_adm_income *= (observed_adm_sum / expected_adm_sum)

# Handle zero expected frequencies by replacing them with a small value
expected_adm_income[expected_adm_income == 0] = 1e-10

# Perform the chi-square goodness of fit test for income
chi2_income, p_income = stats.chisquare(f_obs=observed_adm_income, f_exp=expected_adm_income)
print(f"Chi2 statistic: {chi2_income}, p-value: {p_income}")

Chi2 statistic: 5e-10, p-value: 1.0


**Test over each year:**

In [17]:
import warnings
warnings.filterwarnings('ignore')

In [18]:
for year in years:
    # Group the income into bins
    merged_years[year]['Income Group'] = pd.cut(merged_years[year]['Household Income'], bins=bins, labels=labels, right=False)
    
    # Calculate the observed frequencies for each income group in admissions and applications
    observed_adm_income = merged_years[year][merged_years[year]['status'] == 'Adm'].groupby('Income Group').size().reindex(labels, fill_value=0)
    observed_app_income = merged_years[year][merged_years[year]['status'] == 'App'].groupby('Income Group').size().reindex(labels, fill_value=0)
    
    # Calculate the total number of applicants
    total_app_income = observed_app_income.sum()
    
    # Calculate the expected admissions based on the proportions of applications
    expected_adm_income = (observed_app_income / total_app_income) * observed_adm_income.sum()
    
    # Ensure the sums match
    observed_adm_sum = observed_adm_income.sum()
    expected_adm_sum = expected_adm_income.sum()
    
    # Adjust the expected frequencies to ensure they match the observed sum
    if not observed_adm_sum == expected_adm_sum:
        expected_adm_income *= (observed_adm_sum / expected_adm_sum)
    
    # Handle zero expected frequencies by replacing them with a small value
    expected_adm_income[expected_adm_income == 0] = 1e-10
    
    # Perform the chi-square goodness of fit test for income
    chi2_income, p_income = stats.chisquare(f_obs=observed_adm_income, f_exp=expected_adm_income)
    print(f"{year} - Chi2 statistic: {chi2_income}, p-value: {p_income}")

2016 - Chi2 statistic: 5e-10, p-value: 1.0
2017 - Chi2 statistic: 5e-10, p-value: 1.0
2018 - Chi2 statistic: 5e-10, p-value: 1.0
2019 - Chi2 statistic: 5e-10, p-value: 1.0
2020 - Chi2 statistic: 5e-10, p-value: 1.0
2021 - Chi2 statistic: 5e-10, p-value: 1.0
2022 - Chi2 statistic: 5e-10, p-value: 1.0
2023 - Chi2 statistic: 5e-10, p-value: 1.0


**Analysis:**

1. The combined datasets from 2016 to 2023:

> The chi-square statistic measures the discrepancy between the observed and expected frequencies. In this case, a chi-square statistic of $0.2523$ is quite low, indicating there is no much difference between the observed and expected admission counts by income level.

> Given that the `p-value` is larger than $0.005$, we failed to reject the null hypothesis. This means there is strong evidence that the observed distribution of admissions by income level matches the expected distribution based on the application proportion.

2. The individual tests for each years:

> When we observe the `p-value` for each year, we do not see much fluctuations from each years. Hence, It strongly suggests that there's no relationship between income level and admission. 

**Conclusion:**

> Based on our goodness-of-fit tests, we find that without considering GPA, students' household income level does not play as a factor of students' admission. However, the test suggests that there is strong evidence that the observed distribution of admissions by race does not match the expected distribution based on the application proportion, and we also find this phenomenon in tests on gender over most of the years, especially in recent years. Hence, if GPA of the students in each categories considered similar, the UC system should work on the admission differences by these factors.