## Chi-Squared Goodness-of-Fit Test

The chi-squared goodness-of-fit test is an analog of the one-sample t-test for categorical variables: it tests whether the distribution of sample categorical data matches an expected distribution. For example, you could use a chi-squared goodness-of-fit test to check whether the race demographics of members at your church or school match those of the entire U.S. population, or whether the browser preferences of your friends match those of Internet users as a whole.
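
For reference, the quantity this test is built on is the standard Pearson chi-squared statistic. The symbols $O_i$ (observed count in category $i$), $E_i$ (expected count in category $i$) and $k$ (number of categories) are notation introduced here, not names used elsewhere in this notebook:

$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
$$

Under the null hypothesis that the sample follows the expected distribution, this statistic approximately follows a chi-squared distribution with $k - 1$ degrees of freedom.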

In [2]:
import numpy as np
import pandas as pd
from scipy import stats

In [5]:
# One row per person, labeled by race category
national = pd.DataFrame(["white"]*100000 + ["hispanic"]*60000 +
                        ["black"]*50000 + ["asian"]*15000 + ["other"]*35000)

minnesota = pd.DataFrame(["white"]*600 + ["hispanic"]*300 +
                         ["black"]*250 + ["asian"]*75 + ["other"]*150)

Get the value counts for each category

In [8]:
national_table = pd.crosstab(index=national[0], columns=["count"])
national_table

col_0,count
0,Unnamed: 1_level_1
asian,15000
black,50000
hispanic,60000
other,35000
white,100000


In [9]:
minnesota_table = pd.crosstab(index=minnesota[0], columns=["count"])
minnesota_table

col_0,count
0,Unnamed: 1_level_1
asian,75
black,250
hispanic,300
other,150
white,600


Now define the observed values

In [10]:
# The observed counts are the Minnesota counts: we want to test whether the
# racial proportions in Minnesota match those of the entire US.
observed = minnesota_table

Now calculate the expected values

In [11]:
national_ratios = national_table / len(national)
national_ratios

col_0,count
0,Unnamed: 1_level_1
asian,0.057692
black,0.192308
hispanic,0.230769
other,0.134615
white,0.384615


In [16]:
len(minnesota), len(national)

(1375, 260000)

In [13]:
expected = national_ratios * len(minnesota)
expected

col_0,count
0,Unnamed: 1_level_1
asian,79.326923
black,264.423077
hispanic,317.307692
other,185.096154
white,528.846154
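
As an alternative to `pd.crosstab`, the same ratios and expected counts could be sketched with `Series.value_counts`. The variable names `national_races`, `minnesota_races`, `observed_counts` and `expected_counts` are illustrative and do not appear in the cells above; the counts are the ones used earlier in this notebook.

```python
import pandas as pd

# Rebuild the two samples as Series (same counts as the DataFrames above)
national_races = pd.Series(["white"]*100000 + ["hispanic"]*60000 +
                           ["black"]*50000 + ["asian"]*15000 + ["other"]*35000)
minnesota_races = pd.Series(["white"]*600 + ["hispanic"]*300 +
                            ["black"]*250 + ["asian"]*75 + ["other"]*150)

# normalize=True returns proportions instead of raw counts;
# sort_index() puts both results in the same alphabetical category order
national_ratios = national_races.value_counts(normalize=True).sort_index()
observed_counts = minnesota_races.value_counts().sort_index()

# Expected counts: national proportions scaled to the Minnesota sample size
expected_counts = national_ratios * len(minnesota_races)
print(expected_counts)
```

Sorting both results by index keeps the observed and expected categories aligned, which matters because `stats.chisquare` compares the values position by position.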


In [19]:
stats.chisquare(f_obs=observed, f_exp=expected)

Power_divergenceResult(statistic=array([18.19480519]), pvalue=array([0.00113047]))
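
To see where this result comes from, the statistic and p-value can be reproduced by hand. This is a minimal sketch that hard-codes the counts from the tables above; the 5% significance level and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Counts copied from the tables above, in the order:
# asian, black, hispanic, other, white
observed_counts = np.array([75, 250, 300, 150, 600])
national_counts = np.array([15000, 50000, 60000, 35000, 100000])

# Expected counts: national proportions scaled to the Minnesota sample size
expected_counts = national_counts / national_counts.sum() * observed_counts.sum()

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
chi_squared_stat = ((observed_counts - expected_counts) ** 2 / expected_counts).sum()

# Degrees of freedom = number of categories - 1
dof = len(observed_counts) - 1

# Critical value at an assumed 5% significance level, and the p-value
critical_value = stats.chi2.ppf(0.95, df=dof)
p_value = stats.chi2.sf(chi_squared_stat, df=dof)

print(chi_squared_stat, critical_value, p_value)
```

The statistic exceeds the critical value and the p-value is below 0.05, so at that level we would reject the null hypothesis that the Minnesota sample follows the national distribution.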

In [20]:
# Without f_exp, chisquare assumes every category is equally likely
stats.chisquare(f_obs=observed)

Power_divergenceResult(statistic=array([590.90909091]), pvalue=array([1.43773674e-126]))
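
The much larger statistic here comes from a different null hypothesis: when `f_exp` is omitted, `stats.chisquare` compares the observed counts against equal expected counts in every category. A short sketch, using illustrative variable names and the Minnesota counts from above, makes the equivalence explicit:

```python
import numpy as np
from scipy import stats

# Minnesota counts in the order: asian, black, hispanic, other, white
observed_counts = np.array([75, 250, 300, 150, 600])

# Equal expected counts: the mean of the observed counts in every category
uniform_expected = np.full(len(observed_counts), observed_counts.mean())

default_result = stats.chisquare(f_obs=observed_counts)
explicit_result = stats.chisquare(f_obs=observed_counts, f_exp=uniform_expected)

print(default_result.statistic, explicit_result.statistic)
```

Both calls give the same statistic, so this result tests whether the five categories are equally common in Minnesota, not whether Minnesota matches the national proportions.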

## Chi-Squared Test of Independence

The chi-squared test of independence tests whether two categorical variables are independent. The test of independence is commonly used to determine whether variables like education, political views and other preferences vary based on demographic factors like gender, race and religion.

**Null Hypothesis (H0):** There is no relationship between the two variables.<br>
**Alternative Hypothesis (Ha):** There is a relationship between the two variables.

In [22]:
import numpy as np
import pandas as pd
from scipy import stats

In [28]:
np.random.seed(10)

# Sample data randomly at fixed probabilities
voter_race = np.random.choice(a= ["asian","black","hispanic","other","white"],
                              p = [0.05, 0.15 ,0.25, 0.05, 0.5],
                              size=1000)

# Sample data randomly at fixed probabilities
voter_party = np.random.choice(a= ["democrat","independent","republican"],
                              p = [0.4, 0.2, 0.4],
                              size=1000)

voters = pd.DataFrame({"race":voter_race, 
                       "party":voter_party})

voter_tab = pd.crosstab(voters.race, voters.party, margins = True)

voter_tab.columns = ["democrat","independent","republican","row_totals"]

voter_tab.index = ["asian","black","hispanic","other","white","col_totals"]

observed = voter_tab.iloc[0:5,0:3]   # Get table without totals for later use
voter_tab

Unnamed: 0,democrat,independent,republican,row_totals
asian,21,7,32,60
black,65,25,64,154
hispanic,107,50,94,251
other,15,8,15,38
white,189,96,212,497
col_totals,397,186,417,1000


In [23]:
stats.chi2_contingency(observed)

Chi2ContingencyResult(statistic=7.169321280162059, pvalue=0.518479392948842, dof=8, expected_freq=array([[ 23.82 ,  11.16 ,  25.02 ],
       [ 61.138,  28.644,  64.218],
       [ 99.647,  46.686, 104.667],
       [ 15.086,   7.068,  15.846],
       [197.309,  92.442, 207.249]]))
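
The `expected_freq` array and the statistic can also be reproduced by hand, which shows what the test measures. The sketch below hard-codes the observed counts from the table above; the variable names and the 5% significance level are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Observed counts from the table above (rows: asian, black, hispanic, other, white;
# columns: democrat, independent, republican)
observed_table = np.array([[21, 7, 32],
                           [65, 25, 64],
                           [107, 50, 94],
                           [15, 8, 15],
                           [189, 96, 212]])

row_totals = observed_table.sum(axis=1)
col_totals = observed_table.sum(axis=0)
n = observed_table.sum()

# Under independence, expected count = row total * column total / grand total
expected_table = np.outer(row_totals, col_totals) / n

chi_squared_stat = ((observed_table - expected_table) ** 2 / expected_table).sum()

# Degrees of freedom = (rows - 1) * (columns - 1)
dof = (observed_table.shape[0] - 1) * (observed_table.shape[1] - 1)

p_value = stats.chi2.sf(chi_squared_stat, df=dof)
critical_value = stats.chi2.ppf(0.95, df=dof)  # assumed 5% significance level

print(chi_squared_stat, dof, p_value, critical_value)
```

The statistic is well below the critical value and the p-value is far above 0.05, so we fail to reject the null hypothesis. That is the expected outcome here, because `voter_race` and `voter_party` were generated independently of each other.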

## Summary

Chi-squared tests provide a way to investigate differences in the distributions of categorical variables with the same categories and the dependence between categorical variables. In the next lesson, we'll learn about a third statistical inference test, the analysis of variance, that lets us compare several sample means at the same time.