# Beginner's Guide to the Chi-square Test for Independence with Jupyter Notebook


# Table of contents

- Introduction
- What is the Chi-square test for

> The expected asked, “Hey, the observed! How different are you from us?”

When you have a large amount of data, you can use Python instead of a calculator to find the Chi-square value. This is useful for the internal assessment and the extended essay. In this article, I will show you how to do it. You can find all the code [here](http://bit.ly/2EaqgST).

The easiest way to use Python is with Jupyter Notebook. Anaconda is a free and open-source distribution of the Python programming language for scientific computing, and it includes Jupyter Notebook. You can install Anaconda from this [link](https://www.anaconda.com/distribution/).

The Chi-square test for independence is also called Pearson’s chi-square test. To find Chi-square with Python, `scipy.stats.chi2_contingency` is the tool to use. Do not confuse it with `scipy.stats.chisquare`, which performs a one-way (goodness-of-fit) chi-square test rather than a test on a contingency table. You can find out more details here.
First, we need to import `chi2_contingency`, `pandas` and `numpy`.

In [10]:
from scipy.stats import chi2_contingency
import pandas as pd
import numpy as np

Let’s say we collected data on the favorite color of T-shirts for men and women. We want to find out whether color and gender are independent or not.

In [11]:
tshirts = pd.DataFrame(
    [
        [48,12,33,57],
        [35,46,42,27]
    ],
    index=["Male","Female"],
    columns=["Black","White","Red","Blue"])
tshirts

Unnamed: 0,Black,White,Red,Blue
Male,48,12,33,57
Female,35,46,42,27


In [12]:
chi2_contingency(tshirts)

(33.76146477535758, 2.2247293911334693e-07, 3, array([[41.5, 29. , 37.5, 42. ],
        [41.5, 29. , 37.5, 42. ]]))
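
The result behaves like a tuple of four values, so a single call can be unpacked into four names instead of indexing the result each time. The following is a minimal sketch that rebuilds the same observed counts as a NumPy array; the variable names are illustrative. Newer SciPy releases may display the result as a named result object rather than a plain tuple, but unpacking should work the same way.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the T-shirt example (rows: Male, Female)
observed = np.array([[48, 12, 33, 57],
                     [35, 46, 42, 27]])

# One call, unpacked into the four returned values
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(chi2_stat)   # Chi-square statistic
print(p_value)     # p-value
print(dof)         # degrees of freedom
print(expected)    # expected frequencies, same shape as observed
```

Unpacking once also avoids running the test four separate times when you need all four values.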

## $\chi^2$ value

You can find $\chi^2$ in the first returned value.

In [13]:
chisquare=chi2_contingency(tshirts)[0]
chisquare

33.76146477535758
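
The statistic measures how far the observed counts are from the expected counts: it is the sum of $(O - E)^2 / E$ over every cell of the table. As a rough check, the sketch below recomputes it by hand, with the expected counts copied from the output of `chi2_contingency` shown earlier. The variable names are illustrative.

```python
import numpy as np

# Observed counts (rows: Male, Female)
observed = np.array([[48, 12, 33, 57],
                     [35, 46, 42, 27]], dtype=float)

# Expected counts, copied from the chi2_contingency output
expected = np.array([[41.5, 29.0, 37.5, 42.0],
                     [41.5, 29.0, 37.5, 42.0]])

# Sum of (observed - expected)^2 / expected over every cell
chi2_manual = ((observed - expected) ** 2 / expected).sum()
print(chi2_manual)
```

The result should agree with the first value returned by `chi2_contingency`, up to floating-point rounding.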

## p-value

You can find the p-value in the second returned value.

In [14]:
pvalue=chi2_contingency(tshirts)[1]
pvalue

2.2247293911334693e-07

## Degrees of freedom

You can find the degrees of freedom in the third returned value. We are going to use this to find the critical value later.

In [15]:
dof=chi2_contingency(tshirts)[2]
dof

3
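
For a contingency table, the degrees of freedom equal (number of rows - 1) × (number of columns - 1). A minimal sketch of that calculation for the 2 × 4 T-shirt table, using illustrative variable names:

```python
import numpy as np

# Observed counts: 2 rows (Male, Female) and 4 columns (colors)
observed = np.array([[48, 12, 33, 57],
                     [35, 46, 42, 27]])

n_rows, n_cols = observed.shape
dof_manual = (n_rows - 1) * (n_cols - 1)
print(dof_manual)  # 3
```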

## Expected values

You can find the expected values in the fourth returned value, as an array.

In [16]:
expected=chi2_contingency(tshirts)[3]
expected

array([[41.5, 29. , 37.5, 42. ],
       [41.5, 29. , 37.5, 42. ]])
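
Under the assumption of independence, each expected count is (row total × column total) / grand total. The sketch below reproduces the expected array from the observed counts alone; the variable names are illustrative.

```python
import numpy as np

# Observed counts (rows: Male, Female)
observed = np.array([[48, 12, 33, 57],
                     [35, 46, 42, 27]])

row_totals = observed.sum(axis=1)   # one total per gender
col_totals = observed.sum(axis=0)   # one total per color
grand_total = observed.sum()

# The outer product gives row_total * col_total for every cell
expected_manual = np.outer(row_totals, col_totals) / grand_total
print(expected_manual)
```

Both rows come out identical here because the Male and Female row totals are equal (150 each).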

## Horizontal data

In practice, you will usually import data from a file. In this CSV file the data is laid out horizontally, with one row per gender.

In [17]:
tshirtshorizontal = pd.read_csv('https://raw.githubusercontent.com/shinokada/python-for-ib-diploma-mathematics/master/Data/tshirts-horizontal.csv')
tshirtshorizontal

Unnamed: 0,gender,Black,White,Red,Blue
0,Male,48,12,33,57
1,Female,35,46,42,27


We need to remove the gender column.

In [18]:
num = tshirtshorizontal.iloc[:,1:]
num

Unnamed: 0,Black,White,Red,Blue
0,48,12,33,57
1,35,46,42,27


Next, we convert the DataFrame to a NumPy array.

In [19]:
arr1 = num.to_numpy()
arr1

array([[48, 12, 33, 57],
       [35, 46, 42, 27]])

Now we can use `chi2_contingency()`.

In [36]:
chi2_contingency(arr1)

(33.76146477535758, 2.2247293911334693e-07, 3, array([[41.5, 29. , 37.5, 42. ],
        [41.5, 29. , 37.5, 42. ]]))
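
An alternative to slicing with `iloc` is to move the label column into the index when reading the file, so that only numeric columns remain and the labels are kept. The sketch below uses an inline CSV string as a stand-in for the file so that it runs offline; this is an assumption, and the layout simply mirrors the table displayed above. The variable names are illustrative.

```python
import io

import pandas as pd
from scipy.stats import chi2_contingency

# Inline stand-in for the horizontal CSV file (assumed layout)
csv_text = """gender,Black,White,Red,Blue
Male,48,12,33,57
Female,35,46,42,27
"""

# index_col moves the label column into the index
table = pd.read_csv(io.StringIO(csv_text), index_col="gender")

chi2_stat, p_value, dof, expected = chi2_contingency(table)

# Wrap the expected values so they keep the row and column labels
expected_df = pd.DataFrame(expected, index=table.index, columns=table.columns)
print(expected_df)
```

With the real file, the same `index_col="gender"` argument could be passed to `pd.read_csv` together with the URL.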

## Vertical data

This time we use data laid out vertically.

In [21]:
tshirtsvertical = pd.read_csv('https://raw.githubusercontent.com/shinokada/python-for-ib-diploma-mathematics/master/Data/tshirts-vertical.csv')
tshirtsvertical

Unnamed: 0,Color,Male,Female
0,Black,48,35
1,White,12,46
2,Red,33,42
3,Blue,57,27


In [31]:
arr2 = tshirtsvertical.iloc[:,1:].to_numpy()
arr2

array([[48, 35],
       [12, 46],
       [33, 42],
       [57, 27]])

In [35]:
chi2_contingency(arr2)

(33.76146477535759, 2.224729391133464e-07, 3, array([[41.5, 41.5],
        [29. , 29. ],
        [37.5, 37.5],
        [42. , 42. ]]))

If you prefer horizontal data to vertical data, you can transpose the DataFrame using its `T` attribute.

In [23]:
tshirtsvertical.T

Unnamed: 0,0,1,2,3
Color,Black,White,Red,Blue
Male,48,12,33,57
Female,35,46,42,27


We remove the first row using `iloc`.

In [24]:
num2=tshirtsvertical.T.iloc[1:,0:]
num2

Unnamed: 0,0,1,2,3
Male,48,12,33,57
Female,35,46,42,27


In [33]:
arr3=num2.to_numpy()
arr3

array([[48, 12, 33, 57],
       [35, 46, 42, 27]], dtype=object)

In [34]:
chi2_contingency(arr3)

(33.76146477535759, 2.224729391133464e-07, 3, array([[41.5, 29. , 37.5, 42. ],
        [41.5, 29. , 37.5, 42. ]]))
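
Transposing the whole DataFrame mixes the text labels with the numbers, which is why the array above has `dtype=object`. One way to avoid this is to move the `Color` column into the index before transposing, so that only numeric values are transposed. The sketch below uses an inline CSV string as a stand-in for the file (an assumption; the layout mirrors the table displayed above), and the variable names are illustrative.

```python
import io

import pandas as pd
from scipy.stats import chi2_contingency

# Inline stand-in for the vertical CSV file (assumed layout)
csv_text = """Color,Male,Female
Black,48,35
White,12,46
Red,33,42
Blue,57,27
"""
vertical = pd.read_csv(io.StringIO(csv_text))

# Move the labels into the index first, then transpose the numeric part
horizontal = vertical.set_index("Color").T
arr = horizontal.to_numpy()

print(arr.dtype)  # a numeric dtype rather than object
chi2_stat, p_value, dof, expected = chi2_contingency(arr)
print(chi2_stat, p_value, dof)
```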

## Critical values

The level of significance and the degrees of freedom determine the critical value. As mentioned before, the degrees of freedom are the third value returned by `chi2_contingency`. To find critical values, import `chi2` from `scipy.stats` and define the probability from the level of significance (1%, 5%, 10%, etc.).

In [40]:
from scipy.stats import chi2
significance = 0.01
p = 1 - significance
dof = chi2_contingency(arr2)[2]
critical_value = chi2.ppf(p, dof)
critical_value

11.344866730144373

With 3 degrees of freedom and a 1% level of significance, the critical value is about 11.34.
You can confirm this value using `cdf`.

In [44]:
p = chi2.cdf(critical_value, dof)
p

0.99
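
The `chi2` distribution also provides upper-tail functions, which take the level of significance directly. As a sketch: `chi2.isf` (inverse survival function) should give the same critical value as `chi2.ppf(1 - significance, dof)`, and `chi2.sf` (survival function) turns a Chi-square statistic into a p-value. The statistic below is copied from the T-shirt example; the variable names are illustrative.

```python
from scipy.stats import chi2

significance = 0.01
dof = 3

# Two equivalent ways to get the critical value
critical_ppf = chi2.ppf(1 - significance, dof)
critical_isf = chi2.isf(significance, dof)
print(critical_ppf, critical_isf)

# Upper-tail probability of the statistic from the T-shirt example
chi2_stat = 33.76146477535758
p_value = chi2.sf(chi2_stat, dof)
print(p_value)
```

The p-value computed this way should match the second value returned by `chi2_contingency` for the same table.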

# The null and alternative hypotheses

$H_0:$ The two variables are independent.

$H_1:$ The two variables are dependent.

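We decide whether to reject the null hypothesis by comparing the Chi-square value with the critical value. As an example, consider the following table of science subjects against mathematics courses.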

In [47]:
subjects = pd.DataFrame(
    [
        [25,46,15],
        [15,44,15],
        [10,10,20]
    ],
    index=['Biology','Chemistry','Physics'],
    columns=['Math SL AA','Math SL AI','Math HL'])
subjects

Unnamed: 0,Math SL AA,Math SL AI,Math HL
Biology,25,46,15
Chemistry,15,44,15
Physics,10,10,20


If $\chi_{calc}^2 > \chi_{critical}^2$ we reject the null hypothesis.

In [76]:
chi, pval, dof, exp = chi2_contingency(subjects)

significance = 0.05
p = 1 - significance
critical_value = chi2.ppf(p, dof)

print('chi=%.6f, critical value=%.6f\n' % (chi, critical_value))

if chi > critical_value:
    print("""At %.2f level of significance, we reject the null hypothesis and accept H1.
They are not independent.""" % (significance))
else:
    print("""At %.2f level of significance, we fail to reject the null hypothesis.
There is not enough evidence that they are dependent.""" % (significance))

chi=20.392835, critical value=9.487729

At 0.05 level of significance, we reject the null hypothesis and accept H1.
They are not independent.


Alternatively, we can compare the p-value with the level of significance.
If `p-value < the level of significance`, we reject the null hypothesis.

In [74]:
chi, pval, dof, exp = chi2_contingency(subjects)
significance = 0.05

print('p-value=%.6f, significance=%.2f\n' % (pval, significance))

if pval < significance:
    print("""At %.2f level of significance, we reject the null hypothesis and accept H1.
They are not independent.""" % (significance))
else:
    print("""At %.2f level of significance, we fail to reject the null hypothesis.
There is not enough evidence that they are dependent.""" % (significance))

p-value=0.000418, significance=0.05

At 0.05 level of significance, we reject the null hypothesis and accept H1.
They are not independent.
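
The two decision rules lead to the same conclusion, so they can be wrapped in one reusable function. The sketch below is only one possible arrangement: `independence_test` is a hypothetical helper name, not part of SciPy, and the keys of the returned dictionary are illustrative.

```python
import pandas as pd
from scipy.stats import chi2, chi2_contingency


def independence_test(observed, significance=0.05):
    """Hypothetical helper: run the test and summarize the decision."""
    chi2_stat, p_value, dof, expected = chi2_contingency(observed)
    critical_value = chi2.ppf(1 - significance, dof)
    return {
        "chi2": chi2_stat,
        "p_value": p_value,
        "dof": dof,
        "critical_value": critical_value,
        # Equivalent to checking chi2_stat > critical_value
        "reject_null": bool(p_value < significance),
    }


# Same subjects table as in the example
subjects = pd.DataFrame(
    [[25, 46, 15], [15, 44, 15], [10, 10, 20]],
    index=["Biology", "Chemistry", "Physics"],
    columns=["Math SL AA", "Math SL AI", "Math HL"],
)

result = independence_test(subjects, significance=0.05)
print(result)
```

Returning the values rather than printing them makes it easier to reuse the result, for example when testing several tables at different levels of significance.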


# Conclusion

Once you know how to extract the data from a CSV file, it is reasonably easy to find the Chi-square value, p-value, degrees of freedom and critical value.