# Beginner's Guide to the Chi-square Test for Independence with Jupyter Notebook

# Table of contents

- Introduction
- SciPy package
- Simple setup
- $\chi^2$ value, p-value, degrees of freedom, expected values
- Vertical and horizontal data
- Critical values
- The null and alternative hypotheses

# Introduction

The Chi-square test for independence is a form of Pearson’s chi-square test. There are three common Chi-square tests. The Chi-square test for independence checks whether two categorical variables are independent of each other. The Chi-square goodness-of-fit test shows how much your observed data differ from the expected values. The [test for homogeneity](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/test-of-homogeneity/) determines if two or more populations have the same distribution of a single categorical variable.

In this article we are going to explore the Chi-square test for independence with Jupyter Notebook. If you do not have Jupyter Notebook, please read ["Beginner’s Guide to Jupyter Notebook"](http://bit.ly/2S1yHIm) first. By the way, "Chi" is pronounced "kai", as in "kite", not "chi" as in "chili". $\chi$ is the Greek letter chi, so $\chi^2$ and Chi-square mean the same thing.

The Chi-square test for independence can be used in science, economics, marketing, and various other fields.

> The expected values asked, “Hey, observed values! How different are you from us?”

# SciPy package

To compute the Chi-square statistic, we are going to use the [SciPy](https://www.scipy.org/) package. SciPy is Python-based open-source software for mathematics, science and engineering. We will use [`scipy.stats.chi2_contingency`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html); please do not confuse it with [`scipy.stats.chisquare`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html?highlight=stats%20chisquare#scipy.stats.chisquare).

`scipy.stats.chi2_contingency` is for the Chi-square test for independence and `scipy.stats.chisquare` is for the Chi-square goodness-of-fit test.

You can find out more details [here](https://stats.stackexchange.com/questions/110718/chi-squared-test-with-scipy-whats-the-difference-between-chi2-contingency-and).
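
As a minimal sketch of the difference, the snippet below calls both functions on small made-up counts (the numbers are illustrative assumptions, not data from this article). `chi2_contingency` takes a two-dimensional table, while `chisquare` takes one-dimensional observed counts and, by default, compares them with equal expected counts.

```python
from scipy.stats import chi2_contingency, chisquare

# Independence: a 2-D contingency table (rows and columns are two variables).
table = [[10, 20, 30],
         [20, 20, 20]]
stat_ind, p_ind, dof_ind, expected = chi2_contingency(table)
print("independence:", stat_ind, p_ind, dof_ind)

# Goodness of fit: 1-D observed counts, compared with equal expected counts.
observed = [18, 22, 20, 20]
stat_gof, p_gof = chisquare(observed)
print("goodness of fit:", stat_gof, p_gof)
```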

# Simple setup

We are going to import `chi2_contingency`, pandas and NumPy, and create a small sample dataset.

You can find all the code [here](http://bit.ly/2EaqgST).

In [2]:
from scipy.stats import chi2_contingency
import pandas as pd
import numpy as np

Let’s say we collected data on the favorite T-shirt colors of men and women. We want to find out whether color and gender are independent or not. We create a dataframe and store it in `tshirts`. The pandas `index` and `columns` arguments name the rows and columns. To see what is in `tshirts`, we just write its name on the last line of the cell.

In [37]:
tshirts = pd.DataFrame(
    [
        [48,22,33,47],
        [35,36,42,27]
    ],
    index=["Male","Female"],
    columns=["Black","White","Red","Blue"])
tshirts

Unnamed: 0,Black,White,Red,Blue
Male,48,22,33,47
Female,35,36,42,27


SciPy's `chi2_contingency()` returns the $\chi^2$ value, the p-value, the degrees of freedom and the expected values.

In [38]:
chi2_contingency(tshirts)

(11.56978992417547,
 0.00901202511379703,
 3,
 array([[42.93103448, 30.        , 38.79310345, 38.27586207],
        [40.06896552, 28.        , 36.20689655, 35.72413793]]))

The same call works for a 2×2 table, for example cat and dog counts by gender:

In [39]:
pets = pd.DataFrame(
    [
        [207,282],
        [231,242]
    ],
    index=["Male","Female"],
    columns=["Cat","Dog"])
chi2_contingency(pets)

(3.8453857725970124,
 0.04988304516839146,
 1,
 array([[222.64241164, 266.35758836],
        [215.35758836, 257.64241164]]))
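
Note that for a 2×2 table such as `pets` (one degree of freedom), `chi2_contingency` applies Yates' continuity correction by default, so the statistic above is slightly smaller than the plain $\sum (O-E)^2/E$ value. The sketch below, which rebuilds the same table, compares the two settings of the `correction` parameter.

```python
import pandas as pd
from scipy.stats import chi2_contingency

pets = pd.DataFrame([[207, 282], [231, 242]],
                    index=["Male", "Female"], columns=["Cat", "Dog"])

# Default: Yates' continuity correction is applied because dof == 1.
chi2_corrected, p_corrected, dof, _ = chi2_contingency(pets)

# correction=False gives the uncorrected Pearson statistic.
chi2_plain, p_plain, _, _ = chi2_contingency(pets, correction=False)

print(chi2_corrected, p_corrected)
print(chi2_plain, p_plain)
```

If you compare SciPy's result with a hand calculation for a 2×2 table, use `correction=False` to get matching numbers.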

## Expected values

The expected values are the fourth returned value, in the form of an array. Let's print them in a friendlier way, again using a pandas dataframe.

In [43]:
expected = chi2_contingency(tshirts)[3]
pd.DataFrame(
    data=expected,
    index=["Male","Female"],
    columns=["Black","White","Red","Blue"]
)

Unnamed: 0,Black,White,Red,Blue
Male,42.931034,30.0,38.793103,38.275862
Female,40.068966,28.0,36.206897,35.724138
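
Each expected value is the row total times the column total divided by the grand total. A minimal sketch with NumPy, using the same counts as `tshirts`, reproduces the array that SciPy returns:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[48, 22, 33, 47],
                     [35, 36, 42, 27]])

row_totals = observed.sum(axis=1)   # totals per gender
col_totals = observed.sum(axis=0)   # totals per color
grand_total = observed.sum()

# expected[i, j] = row_totals[i] * col_totals[j] / grand_total
expected_manual = np.outer(row_totals, col_totals) / grand_total

expected_scipy = chi2_contingency(observed)[3]
print(np.allclose(expected_manual, expected_scipy))
```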


## $\chi^2$ value

The $\chi^2$ value is the first value returned by `chi2_contingency`. The formula for the Chi-square statistic is:

\begin{equation}
\chi^2=\sum\frac{(O-E)^2}{E}
\end{equation}

Here $O$ is the observed (actual) value and $E$ is the expected value. For each cell, we square the difference between the observed and expected values and divide by the expected value. Then we add all the results together to find the $\chi^2$ value.

\begin{equation}
\chi^2=\frac{(48-42.93)^2}{42.93}+\frac{(22-30)^2}{30}+\frac{(33-38.79)^2}{38.79}+\frac{(47-38.28)^2}{38.28}+\frac{(35-40.07)^2}{40.07}+\frac{(36-28)^2}{28}+\frac{(42-36.21)^2}{36.21}+\frac{(27-35.72)^2}{35.72}\approx 11.57
\end{equation}

This is what `chi2_contingency` does behind the scenes. Since Python uses 0-based indexing, we use `[0]` to get the first returned value, which is $\chi^2$.

In [44]:
chisquare=chi2_contingency(tshirts)[0]
chisquare

11.56978992417547
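
To check the formula, the sketch below evaluates $\sum (O-E)^2/E$ directly with NumPy for the same counts. No continuity correction is involved here because the table has more than one degree of freedom.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[48, 22, 33, 47],
                     [35, 36, 42, 27]])
chi2_scipy, _, _, expected = chi2_contingency(observed)

# Sum of (O - E)^2 / E over every cell of the table.
chi2_manual = ((observed - expected) ** 2 / expected).sum()
print(chi2_manual, chi2_scipy)
```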

## p-value

You can read more details about the p-value [here](https://towardsdatascience.com/p-values-explained-by-data-scientist-f40a746cfc8). The p-value is the second returned value.

In [14]:
pvalue=chi2_contingency(tshirts)[1]
pvalue

0.00901202511379703
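
The p-value is the probability, under the null hypothesis, of a $\chi^2$ value at least as large as the one observed. As a sketch, it can be recomputed from the statistic and the degrees of freedom with `chi2.sf`, the survival function of the chi-square distribution, using the same counts as `tshirts`:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[48, 22, 33, 47],
                     [35, 36, 42, 27]])
stat, pvalue, dof, _ = chi2_contingency(observed)

# Upper-tail probability of the chi-square distribution with `dof` degrees of freedom.
pvalue_manual = chi2.sf(stat, dof)
print(pvalue, pvalue_manual)
```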

## Degrees of freedom

The degrees of freedom are the third returned value. We are going to use them to find the critical value later.

In [15]:
dof=chi2_contingency(tshirts)[2]
dof

3
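
For a contingency table, the degrees of freedom are (number of rows − 1) × (number of columns − 1). A quick check against SciPy for the 2×4 `tshirts` counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[48, 22, 33, 47],
                     [35, 36, 42, 27]])
n_rows, n_cols = observed.shape

dof_manual = (n_rows - 1) * (n_cols - 1)
dof_scipy = chi2_contingency(observed)[2]
print(dof_manual, dof_scipy)
```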

## Horizontal data

Generally, you will import data from a file. This CSV file stores the data horizontally, with one row per gender.

In [17]:
tshirtshorizontal = pd.read_csv('https://raw.githubusercontent.com/shinokada/python-for-ib-diploma-mathematics/master/Data/tshirts-horizontal.csv')
tshirtshorizontal

Unnamed: 0,gender,Black,White,Red,Blue
0,Male,48,12,33,57
1,Female,35,46,42,27


The `gender` column holds labels, not counts, so we remove it.

In [18]:
num = tshirtshorizontal.iloc[:,1:]
num

Unnamed: 0,Black,White,Red,Blue
0,48,12,33,57
1,35,46,42,27


We then convert the dataframe to a NumPy array.

In [19]:
arr1 = num.to_numpy()
arr1

array([[48, 12, 33, 57],
       [35, 46, 42, 27]])

Now we can use `chi2_contingency()`.

In [36]:
chi2_contingency(arr1)

(33.76146477535758, 2.2247293911334693e-07, 3, array([[41.5, 29. , 37.5, 42. ],
        [41.5, 29. , 37.5, 42. ]]))
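
An alternative sketch keeps the labels by moving the `gender` column into the index with `set_index`, so the dataframe can be passed to `chi2_contingency` directly. To stay self-contained, it reads the same counts from an inline string instead of the URL above.

```python
import io
import pandas as pd
from scipy.stats import chi2_contingency

# Inline copy of the horizontal data shown above (stands in for the CSV file).
csv_text = """gender,Black,White,Red,Blue
Male,48,12,33,57
Female,35,46,42,27
"""
horizontal = pd.read_csv(io.StringIO(csv_text)).set_index("gender")

# Only numeric columns remain, so no slicing or conversion is needed.
chi2_h, p_h, dof_h, expected_h = chi2_contingency(horizontal)
print(chi2_h, p_h, dof_h)
```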

## Vertical data

This time the data is laid out vertically, with one row per color.

In [21]:
tshirtsvertical = pd.read_csv('https://raw.githubusercontent.com/shinokada/python-for-ib-diploma-mathematics/master/Data/tshirts-vertical.csv')
tshirtsvertical

Unnamed: 0,Color,Male,Female
0,Black,48,35
1,White,12,46
2,Red,33,42
3,Blue,57,27


In [31]:
arr2 = tshirtsvertical.iloc[:,1:].to_numpy()
arr2

array([[48, 35],
       [12, 46],
       [33, 42],
       [57, 27]])

In [35]:
chi2_contingency(arr2)

(33.76146477535759, 2.224729391133464e-07, 3, array([[41.5, 41.5],
        [29. , 29. ],
        [37.5, 37.5],
        [42. , 42. ]]))

If you prefer horizontal data to vertical data, you can transpose the dataframe with `.T`.

In [23]:
tshirtsvertical.T

Unnamed: 0,0,1,2,3
Color,Black,White,Red,Blue
Male,48,12,33,57
Female,35,46,42,27


We remove the first row using `iloc`.

In [24]:
num2=tshirtsvertical.T.iloc[1:,0:]
num2

Unnamed: 0,0,1,2,3
Male,48,12,33,57
Female,35,46,42,27


In [33]:
# .T on a mixed-type dataframe yields dtype=object, so cast back to integers
arr3 = num2.to_numpy().astype(int)
arr3

array([[48, 12, 33, 57],
       [35, 46, 42, 27]])

In [34]:
chi2_contingency(arr3)

(33.76146477535759, 2.224729391133464e-07, 3, array([[41.5, 29. , 37.5, 42. ],
        [41.5, 29. , 37.5, 42. ]]))
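
Whether the table is laid out horizontally or vertically does not change the result: transposing it leaves $\chi^2$, the p-value and the degrees of freedom unchanged, and only the expected array is transposed. A small sketch with the same counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

vertical = np.array([[48, 35],
                     [12, 46],
                     [33, 42],
                     [57, 27]])

chi2_v, p_v, dof_v, expected_v = chi2_contingency(vertical)
chi2_t, p_t, dof_t, expected_t = chi2_contingency(vertical.T)

# The expected array always has the same shape as the input table.
print(expected_v.shape, expected_t.shape)
print(np.isclose(chi2_v, chi2_t), np.isclose(p_v, p_t), dof_v == dof_t)
```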

## Critical values

The level of significance and the degrees of freedom can be used to find the critical value. As mentioned before, the degrees of freedom are the third value returned by `chi2_contingency`. To find critical values, import `chi2` from `scipy.stats` and define the probability from the level of significance: 1%, 5%, 10%, etc.

In [40]:
from scipy.stats import chi2
significance = 0.01
p = 1 - significance
dof = chi2_contingency(arr2)[2]
critical_value = chi2.ppf(p, dof)
critical_value

11.344866730144373

With 3 degrees of freedom and a 1% level of significance, the critical value is about 11.34.
You can confirm this value using `cdf`.

In [44]:
p = chi2.cdf(critical_value, dof)
p

0.99
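
As a sketch, the same critical value can be obtained with `chi2.isf`, the inverse survival function, which takes the level of significance directly. The loop below lists critical values for a few common levels at 3 degrees of freedom.

```python
from scipy.stats import chi2

dof = 3
for significance in (0.10, 0.05, 0.01):
    via_ppf = chi2.ppf(1 - significance, dof)   # inverse CDF at 1 - alpha
    via_isf = chi2.isf(significance, dof)       # inverse survival function at alpha
    print(f"{significance:.2f}: {via_ppf:.4f} {via_isf:.4f}")
```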

# The null and alternative hypotheses

$H_0:$ The two variables are independent.

$H_1:$ The two variables are dependent.

We decide whether to reject the null hypothesis by comparing the Chi-square value with the critical value.

In [47]:
subjects = pd.DataFrame(
    [
        [25,46,15],
        [15,44,15],
        [10,10,20]
    ],
    index=['Biology','Chemistry','Physics'],
    columns=['Math SL AA','Math SL AI','Math HL'])
subjects

Unnamed: 0,Math SL AA,Math SL AI,Math HL
Biology,25,46,15
Chemistry,15,44,15
Physics,10,10,20


If $\chi_{calc}^2 > \chi_{critical}^2$ we reject the null hypothesis.

In [76]:
chi, pval, dof, exp = chi2_contingency(subjects)

significance = 0.05
p = 1 - significance
critical_value = chi2.ppf(p, dof)

print('chi=%.6f, critical value=%.6f\n' % (chi, critical_value))

if chi > critical_value:
    print("""At %.2f level of significance, we reject the null hypothesis and accept H1.
They are not independent.""" % (significance))
else:
    print("""At %.2f level of significance, we fail to reject the null hypothesis.
There is not enough evidence that they are dependent.""" % (significance))

chi=20.392835, critical value=9.487729

At 0.05 level of significance, we reject the null hypothesis and accept H1.
They are not independent.


Alternatively, we can compare the p-value with the level of significance.
If `p-value < the level of significance`, we reject the null hypothesis.

In [74]:
chi, pval, dof, exp = chi2_contingency(subjects)
significance = 0.05

print('p-value=%.6f, significance=%.2f\n' % (pval, significance))

if pval < significance:
    print("""At %.2f level of significance, we reject the null hypothesis and accept H1.
They are not independent.""" % (significance))
else:
    print("""At %.2f level of significance, we fail to reject the null hypothesis.
There is not enough evidence that they are dependent.""" % (significance))

p-value=0.000418, significance=0.05

At 0.05 level of significance, we reject the null hypothesis and accept H1.
They are not independent.
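
The two decision rules agree, because the p-value is below the level of significance exactly when the statistic is above the critical value. The sketch below wraps both in a small helper; the function name `independence_test` and its return format are illustrative assumptions, not part of SciPy.

```python
import pandas as pd
from scipy.stats import chi2, chi2_contingency

def independence_test(table, significance=0.05):
    """Hypothetical helper: apply both decision rules to a contingency table."""
    stat, pval, dof, _ = chi2_contingency(table)
    critical_value = chi2.ppf(1 - significance, dof)
    return {
        "chi2": stat,
        "critical_value": critical_value,
        "p_value": pval,
        "reject_by_critical_value": stat > critical_value,
        "reject_by_p_value": pval < significance,
    }

subjects = pd.DataFrame(
    [[25, 46, 15], [15, 44, 15], [10, 10, 20]],
    index=["Biology", "Chemistry", "Physics"],
    columns=["Math SL AA", "Math SL AI", "Math HL"])

result = independence_test(subjects)
print(result)
```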


# Conclusion

Once you know how to extract data from a CSV file, it is reasonably easy to find the Chi-square value, p-value, degrees of freedom and critical value.