# scipy.stats `chi2_contingency` for the $\chi^2$ test of independence (Pearson's chi-square test)

First, we import `chi2_contingency` from `scipy.stats`, along with the pandas and NumPy libraries.

In [37]:
from scipy.stats import chi2_contingency
import pandas as pd
import numpy as np

Let's say we collected data on the favorite T-shirt color of men and women, and we want to find out whether color and gender are independent.

In [73]:
tshirts = pd.DataFrame(
    [
        [48,12,33,57],
        [35,46,42,27]
    ],
    index=["Male","Female"],
    columns=["Black","White","Red","Blue"])
tshirts

Unnamed: 0,Black,White,Red,Blue
Male,48,12,33,57
Female,35,46,42,27


`chi2_contingency` returns the $\chi^2$ statistic, the p-value, the degrees of freedom, and the expected values.

In [40]:
chi2_contingency(tshirts)

(33.76146477535758, 2.2247293911334693e-07, 3, array([[41.5, 29. , 37.5, 42. ],
        [41.5, 29. , 37.5, 42. ]]))

In the returned value above, 33.76 is the $\chi^2$ statistic, 2.22e-07 is the p-value, 3 is the degrees of freedom, and the array holds the expected values.
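To make the four returned values easier to work with, they can be unpacked into separate names and the p-value compared with a significance level. This is a minimal sketch: the names `observed`, `chi2`, `p`, `dof`, `expected` and the threshold `alpha = 0.05` are assumptions chosen for illustration, not part of the original notebook.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the T-shirt table (rows: Male, Female).
observed = np.array([[48, 12, 33, 57],
                     [35, 46, 42, 27]])

# Unpack the statistic, p-value, degrees of freedom and expected counts.
chi2, p, dof, expected = chi2_contingency(observed)

alpha = 0.05  # assumed significance level, pick your own
reject_independence = p < alpha

print(f"chi2 = {chi2:.2f}, p = {p:.3g}, dof = {dof}")
print("Reject independence" if reject_independence else "Fail to reject independence")
```

With a p-value this small, the sketch rejects the hypothesis that color and gender are independent at the assumed 5% level.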

Generally, you will want to import the data from a file. The first CSV file has the data laid out horizontally, with one row per gender.

In [74]:
tshirtshorizontal = pd.read_csv('Data/tshirts-horizontal.csv')
tshirtshorizontal

Unnamed: 0,gender,Black,White,Red,Blue
0,Male,48,12,33,57
1,Female,35,46,42,27


We need to remove the `gender` column so that only the counts remain.

In [60]:
num = tshirtshorizontal.iloc[:,1:]
num

Unnamed: 0,Black,White,Red,Blue
0,48,12,33,57
1,35,46,42,27
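As an alternative to slicing the column away with `iloc`, the `gender` column can be used as the index when the file is read, which keeps the row labels attached to the counts. The sketch below is hedged: it reads from an in-memory string instead of `Data/tshirts-horizontal.csv`, and the CSV text is an assumption reconstructed from the table displayed above.

```python
import io

import pandas as pd
from scipy.stats import chi2_contingency

# Assumed file contents, reconstructed from the displayed table.
csv_text = """gender,Black,White,Red,Blue
Male,48,12,33,57
Female,35,46,42,27
"""

# index_col moves "gender" out of the data, so only counts remain as columns.
table = pd.read_csv(io.StringIO(csv_text), index_col="gender")

# A DataFrame of counts can be passed directly, as with `tshirts` earlier.
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print(chi2, p, dof)
```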


Next, we convert the DataFrame to a NumPy array with `to_numpy()`.

In [61]:
arr1 = num.to_numpy()
arr1

array([[48, 12, 33, 57],
       [35, 46, 42, 27]])

Now we can use `chi2_contingency()`.

In [62]:
chi2_contingency(arr1)

(33.76146477535758, 2.2247293911334693e-07, 3, array([[41.5, 29. , 37.5, 42. ],
        [41.5, 29. , 37.5, 42. ]]))
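The expected values and the statistic in this output can be reproduced by hand. Under independence, each expected count is $E_{ij} = R_i C_j / N$ (row total times column total over the grand total), the statistic is $\chi^2 = \sum_{ij} (O_{ij} - E_{ij})^2 / E_{ij}$, and the degrees of freedom are $(r-1)(c-1)$. The following is a sketch with NumPy; the variable names are chosen for illustration.

```python
import numpy as np

observed = np.array([[48, 12, 33, 57],
                     [35, 46, 42, 27]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 4)
n = observed.sum()                                # grand total

# Expected counts under independence, via broadcasting.
expected = row_totals * col_totals / n

# Pearson's chi-square statistic and its degrees of freedom.
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(chi2, dof)
```

Both rows of `expected` are identical here because the two row totals happen to be equal (150 each).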

This time we use data laid out vertically, with one row per color.

In [47]:
tshirtsvertical = pd.read_csv('Data/tshirts-vertical.csv')
tshirtsvertical

Unnamed: 0,Color,Male,Female
0,Black,48,35
1,White,12,46
2,Red,33,42
3,Blue,57,27


We transpose the data from the vertical to the horizontal layout with the `T` attribute.

In [48]:
tshirtsvertical.T

Unnamed: 0,0,1,2,3
Color,Black,White,Red,Blue
Male,48,12,33,57
Female,35,46,42,27


After transposing, the colors sit in the first row, so we remove that row using `iloc`.

In [68]:
num2=tshirtsvertical.T.iloc[1:,0:]
num2

Unnamed: 0,0,1,2,3
Male,48,12,33,57
Female,35,46,42,27
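Another way to get the same table, sketched below, is to make `Color` the index before transposing, so that the color names never mix with the counts. The CSV text is an assumption reconstructed from the displayed table and is read from an in-memory string rather than `Data/tshirts-vertical.csv`. The sketch also checks that the test itself does not depend on orientation: a table and its transpose give the same statistic, p-value, and degrees of freedom.

```python
import io

import pandas as pd
from scipy.stats import chi2_contingency

# Assumed file contents, reconstructed from the displayed table.
csv_text = """Color,Male,Female
Black,48,35
White,12,46
Red,33,42
Blue,57,27
"""

# With Color as the index, every remaining column is numeric.
vertical = pd.read_csv(io.StringIO(csv_text), index_col="Color")
horizontal = vertical.T  # rows: Male, Female; columns: colors

chi2_h, p_h, dof_h, _ = chi2_contingency(horizontal)
chi2_v, p_v, dof_v, _ = chi2_contingency(vertical)

print(horizontal)
print(chi2_h, chi2_v)
```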


In [69]:
arr2=num2.to_numpy()
arr2

array([[48, 12, 33, 57],
       [35, 46, 42, 27]], dtype=object)
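The `dtype=object` in this output appears because the transposed frame mixed color names with counts, so pandas stored every value as a generic Python object. The call that follows still works on this array, but casting to integers first is a cheap safeguard. Here is a minimal sketch, which rebuilds the vertical table inline rather than reading the file; the names `vertical`, `arr_obj` and `arr_int` are illustrative.

```python
import numpy as np
import pandas as pd

# Inline stand-in for the vertical CSV shown above.
vertical = pd.DataFrame({
    "Color": ["Black", "White", "Red", "Blue"],
    "Male": [48, 12, 33, 57],
    "Female": [35, 46, 42, 27],
})

# Same steps as in the notebook: transpose, then drop the Color row.
num2 = vertical.T.iloc[1:, 0:]

arr_obj = num2.to_numpy()                  # object dtype, as in the output above
arr_int = num2.astype("int64").to_numpy()  # explicit integer counts

print(arr_obj.dtype, arr_int.dtype)
```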

In [70]:
chi2_contingency(arr2)

(33.76146477535759, 2.224729391133464e-07, 3, array([[41.5, 29. , 37.5, 42. ],
        [41.5, 29. , 37.5, 42. ]]))