# CHI-SQUARE (χ²) TEST OF INDEPENDENCE WITH PYTHON

### The data used in this example comes from Stata and is 1980 U.S. census data from 956 cities.



In [2]:
import pandas as pd
import researchpy as rp
import scipy.stats as stats

# To load a sample dataset for this demonstration
import statsmodels.api as sm

df = sm.datasets.webuse("citytemp2")
df

,division,region,heatdd,cooldd,tempjan,tempjuly,agecat
0,N. Eng.,NE,,,16.600000,69.599998,19-29
1,N. Eng.,NE,7947.0,250.0,18.200001,68.000000,19-29
2,Mid Atl,NE,7480.0,424.0,18.400000,70.199997,19-29
3,N. Eng.,NE,7482.0,353.0,19.900000,69.500000,19-29
4,N. Eng.,NE,7482.0,353.0,19.900000,69.500000,19-29
...,...,...,...,...,...,...,...
951,Pacific,West,0.0,3134.0,71.400002,75.400002,35+
952,Pacific,West,0.0,4389.0,72.599998,80.099998,35+
953,Pacific,West,0.0,4389.0,72.599998,80.099998,35+
954,Pacific,West,0.0,4389.0,72.599998,80.099998,35+


### The research question is: is there a relationship between region and age category? Before testing this relationship, let's look at some basic univariate statistics.

In [5]:
df[["agecat", "region"]].head()

,agecat,region
0,19-29,NE
1,19-29,NE
2,19-29,NE
3,19-29,NE
4,19-29,NE


In [8]:
rp.summary_cat(df[["agecat", "region"]])

,Variable,Outcome,Count,Percent
0,agecat,19-29,507,53.03
1,,30-34,316,33.05
2,,35+,133,13.91
3,region,N Cntrl,284,29.71
4,,West,256,26.78
5,,South,250,26.15
6,,NE,166,17.36


### The method to use is scipy.stats.chi2_contingency; see its official SciPy documentation for details. This method requires a crosstabulation (contingency) table, which can be created with pandas.crosstab.

In [9]:
crosstab = pd.crosstab(df["region"], df["agecat"])

crosstab

agecat,19-29,30-34,35+
region,,,
NE,46,83,37
N Cntrl,162,92,30
South,139,68,43
West,160,73,23


### Now pass this contingency table to the scipy.stats method. The output isn't well formatted, but all the information is there. The results are returned in a tuple: the first value is the χ² test statistic, the second is the p-value, and the third is the degrees of freedom. The fourth element is an array containing the expected cell counts.

In [10]:
stats.chi2_contingency(crosstab)

(61.28767688406036,
 2.463382670201326e-11,
 6,
 array([[ 88.03556485,  54.87029289,  23.09414226],
        [150.61506276,  93.87447699,  39.51046025],
        [132.58368201,  82.63598326,  34.78033473],
        [135.76569038,  84.61924686,  35.61506276]]))
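
A minimal sketch of how the returned tuple could be unpacked into named variables for a more readable report. To keep it self-contained, the observed counts are typed in by hand from the crosstab above rather than loaded from the dataset, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the region x agecat crosstab
# (rows: NE, N Cntrl, South, West; columns: 19-29, 30-34, 35+).
observed = np.array([
    [46, 83, 37],
    [162, 92, 30],
    [139, 68, 43],
    [160, 73, 23],
])

# Unpack the four returned values instead of reading them from a raw tuple.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"chi2({dof}) = {chi2:.2f}, p = {p_value:.3g}")
```

Passing the pandas crosstab directly, as in the cell above, should give the same values; the array form is used here only so the snippet runs without downloading the data.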

### There is a statistically significant relationship between region and the age distribution, χ²(6) = 61.29, p < 0.0001.

## CHI-SQUARE TEST OF INDEPENDENCE WITH RESEARCHPY

Now to conduct the χ² test of independence using Researchpy. The method to use is researchpy.crosstab; see its official documentation for details.

By default, the method returns the requested objects in a tuple that is just as hard to read as the scipy.stats output. For cleaner output, assign each requested object from the tuple to its own variable and then display them separately. The expected cell counts are requested here and used later when checking the assumptions of this statistical test. Additionally, the crosstabulation is requested with cell percentages instead of cell counts.

In [11]:
crosstab, test_results, expected = rp.crosstab(df["region"], df["agecat"],
                                               test="chi-square",
                                               expected_freqs=True,
                                               prop="cell")

crosstab

agecat,19-29,30-34,35+,All
region,,,,
NE,4.81,8.68,3.87,17.36
N Cntrl,16.95,9.62,3.14,29.71
South,14.54,7.11,4.5,26.15
West,16.74,7.64,2.41,26.78
All,53.03,33.05,13.91,100.0


In [13]:
test_results

,Chi-square test,results
0,Pearson Chi-square ( 6.0) =,61.2877
1,p-value =,0.0
2,Cramer's V =,0.179


## ASSUMPTION CHECK

Checking the assumptions for the χ² test of independence is easy. Let's recall what they are:

- The two samples are independent
- The variables were collected independently of each other, i.e. the answer from one variable was not dependent on the answer of the other
- No expected cell count is equal to 0
- No more than 20% of the cells have an expected cell count < 5

The last two assumptions can be checked by looking at the expected frequency table.

In [14]:
expected

agecat,19-29,30-34,35+
region,,,
NE,88.035565,54.870293,23.094142
N Cntrl,150.615063,93.874477,39.51046
South,132.583682,82.635983,34.780335
West,135.76569,84.619247,35.615063
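
Rather than checking the expected frequency table by eye, the two expected-count assumptions could also be tested programmatically. The following is a sketch only: the observed counts are typed in by hand from the crosstab shown earlier, the expected counts are recomputed from the row and column totals, and the variable names are illustrative.

```python
import numpy as np

# Observed counts (rows: NE, N Cntrl, South, West; columns: 19-29, 30-34, 35+).
observed = np.array([
    [46, 83, 37],
    [162, 92, 30],
    [139, 68, 43],
    [160, 73, 23],
])

# Expected count for each cell = row total * column total / grand total.
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected = np.outer(row_totals, col_totals) / observed.sum()

# Assumption: no expected cell count equals 0.
any_zero = bool((expected == 0).any())

# Assumption: no more than 20% of cells have an expected count below 5.
share_below_5 = float((expected < 5).mean())

print(f"Any expected count of 0: {any_zero}")
print(f"Share of cells with expected count < 5: {share_below_5:.0%}")
```

Here the smallest expected count is about 23.09 (NE, 35+), so both assumptions are met.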


Link to the work available at
https://www.pythonfordatascience.org/chi-square-test-of-independence-python/