## Chi-squared goodness-of-fit test

The chi-squared goodness-of-fit test checks whether the distribution of sample categorical data matches an expected distribution.
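For reference, the test statistic computed later in this notebook can be written as follows. The symbols are notation introduced here for illustration, not taken from the notebook: $O_i$ is the observed count in category $i$, $E_i$ is the expected count, and $k$ is the number of categories.

$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
$$

Under the null hypothesis that the sample follows the expected distribution, this statistic approximately follows a chi-squared distribution with $k - 1$ degrees of freedom.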

### 1. Data Generation

In [2]:
import numpy as np
import pandas as pd
import scipy.stats as stats

In [3]:
# generate fake demographic data for US and Minnesota
national = pd.DataFrame(["white"]*100000 + ["hispanic"]*60000 +\
                        ["black"]*50000 + ["asian"]*15000 + ["other"]*35000)
           

minnesota = pd.DataFrame(["white"]*600 + ["hispanic"]*300 + \
                         ["black"]*250 +["asian"]*75 + ["other"]*150)

In [6]:
national_table = pd.crosstab(index=national[0], columns='count')
minnesota_table = pd.crosstab(index=minnesota[0], columns='count')

In [11]:
print('National')
print(national_table)

National
col_0      count
0               
asian      15000
black      50000
hispanic   60000
other      35000
white     100000


In [12]:
print('Minnesota')
print(minnesota_table)

Minnesota
col_0     count
0              
asian        75
black       250
hispanic    300
other       150
white       600


### 2. Chi-squared testing

In [15]:
observed = minnesota_table
national_ratios = national_table/len(national) # get population ratios
expected = national_ratios * len(minnesota) # get expected counts
expected

col_0          count
0
asian      79.326923
black     264.423077
hispanic  317.307692
other     185.096154
white     528.846154


In [16]:
chi_squared_stats = ((observed-expected)**2/expected).sum()
chi_squared_stats

col_0
count    18.194805
dtype: float64

* Note: As a rule of thumb, the chi-squared approximation is considered reliable only when none of the expected counts is less than 5.
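The assumption in the note can be checked before running the test. The following is a minimal sketch in plain Python; the helper name `small_expected_cells` and the constant `MIN_EXPECTED` are hypothetical names introduced here, and the expected counts are copied (rounded) from the output above.

```python
# Expected counts copied from the output above (rounded to six decimals).
expected_counts = {
    "asian": 79.326923,
    "black": 264.423077,
    "hispanic": 317.307692,
    "other": 185.096154,
    "white": 528.846154,
}

MIN_EXPECTED = 5  # conventional rule-of-thumb threshold (hypothetical constant name)


def small_expected_cells(counts, threshold=MIN_EXPECTED):
    """Return the categories whose expected count falls below the threshold."""
    return [name for name, value in counts.items() if value < threshold]


too_small = small_expected_cells(expected_counts)
assumption_ok = not too_small

print("Categories below threshold:", too_small)
print("Assumption satisfied:", assumption_ok)
```

If any category were listed, one common remedy is to merge sparse categories before testing.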

In [20]:
# find the critical value for a 95% confidence level
crit = stats.chi2.ppf(q = 0.95, # quantile for 95% confidence
                      df = 4) # df = number of categories - 1
print('Critical value')
print(crit)

Critical value
9.487729036781154


In [21]:
# find the p-value: probability of a statistic at least this large under the null
p_value = 1 - stats.chi2.cdf(x=chi_squared_stats,
                            df=4)
print("P value")
print(p_value)

P value
[0.00113047]


Since the chi-squared statistic (18.19) exceeds the critical value (9.49), and equivalently the p-value (0.0011) is below 0.05, we reject the null hypothesis that the Minnesota sample follows the national distribution.
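The two decision rules (statistic versus critical value, and p-value versus significance level) always agree. The sketch below illustrates this with the counts from this notebook. It is self-contained and assumes NumPy and SciPy are installed; the variable names and `alpha = 0.05` are choices made here for illustration, and `stats.chi2.sf` (the survival function, equal to `1 - cdf`) is used in place of the manual subtraction.

```python
import numpy as np
from scipy import stats

# Counts in the order asian, black, hispanic, other, white (from the tables above).
observed_counts = np.array([75, 250, 300, 150, 600])              # Minnesota sample
national_counts = np.array([15000, 50000, 60000, 35000, 100000])  # national data

# Scale the national proportions to the Minnesota sample size.
expected_counts = national_counts / national_counts.sum() * observed_counts.sum()

statistic = ((observed_counts - expected_counts) ** 2 / expected_counts).sum()
dof = len(observed_counts) - 1
alpha = 0.05  # significance level, assumed here

critical_value = stats.chi2.ppf(1 - alpha, df=dof)
p_value = stats.chi2.sf(statistic, df=dof)  # survival function = 1 - cdf

reject_by_critical = statistic > critical_value
reject_by_p_value = p_value < alpha

print(f"statistic = {statistic:.4f}, critical value = {critical_value:.4f}")
print(f"p-value = {p_value:.5f}")
print("Reject H0:", bool(reject_by_critical and reject_by_p_value))
```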

In [24]:
# Alternatively, use scipy's chisquare function for the goodness-of-fit test
stats.chisquare(f_obs=observed, # Array of observed counts
                f_exp=expected) # Array of expected counts

Power_divergenceResult(statistic=array([18.19480519]), pvalue=array([0.00113047]))

### 3. Chi-squared Test of Independence

The chi-squared test of independence tests whether two categorical variables are independent.
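As a reference for the calculation, the expected counts and the statistic can be written as below. The notation is introduced here for illustration: $O_{ij}$ is the observed count in row $i$ and column $j$, $R_i$ is the total of row $i$, $C_j$ is the total of column $j$, $n$ is the sample size, and $r$ and $c$ are the numbers of rows and columns.

$$
E_{ij} = \frac{R_i \, C_j}{n},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$

Under the null hypothesis that the two variables are independent, the statistic approximately follows a chi-squared distribution with $(r - 1)(c - 1)$ degrees of freedom.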

In [31]:
# Randomly generate some fake voter polling data 

np.random.seed(10)

# Sample data randomly at fixed probabilities
voter_race = np.random.choice(a=['asian','black','hispanic','other','white'],
                              p=[0.05, 0.15, 0.25, 0.05, 0.5],
                              size=1000)
voter_party = np.random.choice(a=['democrat','independent','republican'],
                               p=[0.4,0.2,0.4],
                               size=1000)
voters = pd.DataFrame({'race':voter_race,
                       'party':voter_party})
voters.head()

    race        party
0  white     democrat
1  asian   republican
2  white  independent
3  white   republican
4  other     democrat


In [32]:
voter_tab = pd.crosstab(voters.race, voters.party, margins = True)
voter_tab

party     democrat  independent  republican   All
race
asian           21            7          32    60
black           65           25          64   154
hispanic       107           50          94   251
other           15            8          15    38
white          189           96         212   497
All            397          186         417  1000


In [33]:
voter_tab.columns = ['democrat','independent','republican','row_total']
voter_tab.index = ['asian','black','hispanic','other','white','col_total']
voter_tab

           democrat  independent  republican  row_total
asian            21            7          32         60
black            65           25          64        154
hispanic        107           50          94        251
other            15            8          15         38
white           189           96         212        497
col_total       397          186         417       1000


In [34]:
observed = voter_tab.iloc[0:5,0:3]
observed

          democrat  independent  republican
asian           21            7          32
black           65           25          64
hispanic       107           50          94
other           15            8          15
white          189           96         212


In [41]:
# Compute the expected counts as the outer product of the row totals and column totals, divided by the sample size

expected = np.outer(voter_tab['row_total'][0:5],
                   voter_tab.loc['col_total'][0:3])/1000
expected = pd.DataFrame(expected)

expected.columns = ['democrat','independent','republican']
expected.index = ['asian','black','hispanic','other','white']

expected

          democrat  independent  republican
asian       23.820       11.160      25.020
black       61.138       28.644      64.218
hispanic    99.647       46.686     104.667
other       15.086        7.068      15.846
white      197.309       92.442     207.249


In [43]:
# Compute the chi-squared statistic as before, now summing over all cells of the table
chi_squared_stat = ((observed-expected)**2/expected).sum().sum()
chi_squared_stat

7.169321280162059

In [44]:
crit = stats.chi2.ppf(q=0.95,
                      df=8)
print('Critical value')
print(crit)

Critical value
15.50731305586545


In [46]:
p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,
                            df=8)
print('p value')
print(p_value)

p value
0.518479392948842


The degrees of freedom for a test of independence equal (number of rows - 1) * (number of columns - 1), where the rows and columns are the categories of the two variables. In this case, df = (5-1)*(3-1) = 8.
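Rather than hard-coding the value, the degrees of freedom can be derived from the shape of the observed table. This is a minimal sketch in plain Python; the helper name `independence_dof` is hypothetical, and the counts are copied from the race-by-party table above (without the margin totals).

```python
# Observed counts copied from the race-by-party table above.
# Rows: asian, black, hispanic, other, white.
# Columns: democrat, independent, republican.
observed_table = [
    [21, 7, 32],
    [65, 25, 64],
    [107, 50, 94],
    [15, 8, 15],
    [189, 96, 212],
]


def independence_dof(table):
    """Degrees of freedom for a test of independence: (rows - 1) * (columns - 1)."""
    n_rows = len(table)
    n_cols = len(table[0])
    return (n_rows - 1) * (n_cols - 1)


dof = independence_dof(observed_table)
print("Degrees of freedom:", dof)
```

Note that the margin row and column must be excluded first; otherwise the table shape would be 6 by 4 and the result would be wrong.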

In [49]:
# Alternatively, use scipy's chi2_contingency function to run the test of independence
stats.chi2_contingency(observed=observed)

(7.169321280162059,
 0.518479392948842,
 8,
 array([[ 23.82 ,  11.16 ,  25.02 ],
        [ 61.138,  28.644,  64.218],
        [ 99.647,  46.686, 104.667],
        [ 15.086,   7.068,  15.846],
        [197.309,  92.442, 207.249]]))

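With a p-value of about 0.52, well above 0.05, the test does not detect a significant relationship between race and party. This is expected, since the two variables were generated independently of each other.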