# Beginner's Guide to the Chi-square Test for Independence with Jupyter Notebook

# Table of contents

- Introduction
- Prerequisite
- SciPy package
- Setup
- Python indexing
- $\chi^2$ value
- Expected values
- Selecting data
- More about $\chi^2$ value
- p-value, Degree of freedom
- Importing data, vertical and horizontal data, iloc, Pandas.DataFrame.transpose()
- Critical values
- The null and alternative hypotheses

# Introduction

The Chi-square test for independence is also called Pearson’s chi-square test. It is used in science, economics, marketing, and various other fields. There are three ways to use the Chi-square test. The Chi-square test for independence shows whether two categorical variables are independent of each other. The Chi-square goodness of fit test shows how different your data is from the expected values. The [test for homogeneity determines](https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/test-of-homogeneity/) if two or more populations have the same distribution of a single categorical variable.

In this article we are going to explore the Chi-square test for independence with Jupyter Notebook. By the way, we pronounce "Chi" as "kai", as in "kite", NOT "chi" as in "chili". $\chi$ is the Greek letter "Chi", so $\chi^2$ and Chi-square are the same.

I will explain as much as possible so that you can understand what is going on in all the code.

> The expected asked “Hey the observed! How different are you from us?”

# Prerequisite

Even though this article is aimed at beginners, please read ["Beginner’s Guide to Jupyter Notebook"](http://bit.ly/2S1yHIm) before continuing.

# SciPy package

In order to find the Chi-square value, we are going to use the [SciPy](https://www.scipy.org/) package. SciPy is Python-based open-source software for mathematics, science and engineering. [`scipy.stats.chi2_contingency`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) is the function we need. Please do not confuse it with [`scipy.stats.chisquare`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html?highlight=stats%20chisquare#scipy.stats.chisquare).

`scipy.stats.chi2_contingency` is for the Chi-square test for independence and `scipy.stats.chisquare` is for the Chi-square goodness of fit test.

You can find out more details [here](https://stats.stackexchange.com/questions/110718/chi-squared-test-with-scipy-whats-the-difference-between-chi2-contingency-and).
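
As a rough illustration of the difference, the sketch below calls both functions. The 2 x 4 table is the T-shirt data used later in this article. The goodness of fit counts in `observed_counts` are made up for this example only.

```python
from scipy.stats import chi2_contingency, chisquare

# Test for independence: a two-way table of counts (rows x columns).
table = [[48, 22, 33, 47],
         [35, 36, 42, 27]]
chi2_ind, p_ind, dof_ind, expected = chi2_contingency(table)

# Goodness of fit: one list of observed counts compared with expected counts.
# These numbers are invented for illustration only.
observed_counts = [18, 22, 20, 20]
chi2_gof, p_gof = chisquare(observed_counts)  # expected counts default to equal frequencies

print(chi2_ind, dof_ind)  # statistic and degrees of freedom for the two-way table
print(chi2_gof)           # goodness of fit statistic
```

`chi2_contingency` works out the expected values from the table itself, whereas `chisquare` compares one set of counts with expected counts that you supply (or equal counts if you supply none).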

# Setup

Start Anaconda and launch Jupyter Notebook.

<center><img src="image/anaconda.png"></center>

Create a file by clicking New > Python 3

<center><img src="image/create-python3.png"></center>

Rename the file to "Chi-square test for independence".

<center><img src="image/rename-chi.png"></center>

In the first cell, we are going to import `chi2_contingency`, `pandas` and `numpy`.

In [42]:
from scipy.stats import chi2_contingency
import pandas as pd
import numpy as np

To run the code in a Jupyter Notebook cell, press SHIFT + RETURN.

Let’s say we collected data on the favorite color of T-shirts for men and women. We want to find out whether color and gender are independent or not. We create a small sample data set using a Pandas dataframe and store it in a variable called `tshirts`.

The `index` and `columns` arguments are used to name the rows and columns. In order to display what is in our `tshirts` variable, we just write `tshirts` on the last line and press SHIFT + RETURN.

In [43]:
tshirts = pd.DataFrame(
    [
        [48,22,33,47],
        [35,36,42,27]
    ],
    index=["Male","Female"],
    columns=["Black","White","Red","Blue"])
tshirts

| | Black | White | Red | Blue |
|---|---|---|---|---|
| Male | 48 | 22 | 33 | 47 |
| Female | 35 | 36 | 42 | 27 |


You can find what the labels in the columns are by using `columns`.

In [44]:
tshirts.columns

Index(['Black', 'White', 'Red', 'Blue'], dtype='object')

Similarly, you can use `index` to find out what the row labels are.

In [45]:
tshirts.index

Index(['Male', 'Female'], dtype='object')

# Python indexing

Python uses zero-based indexing. That means the first element has index 0, the second has index 1, and so on. If you want to access the fourth value returned by `chi2_contingency(tshirts)`, you need to use `[3]`.
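
Here is a minimal sketch of zero-based indexing. The tuple `result` is a stand-in made up for this example. It only mimics the order of the four values that `chi2_contingency()` returns.

```python
# A tuple with four items, standing in for the four returned values.
result = ("chi-square", "p-value", "degrees of freedom", "expected values")

first = result[0]    # the first item has index 0
fourth = result[3]   # the fourth item has index 3
last = result[-1]    # negative indices count from the end

print(first)
print(fourth)
print(last)
```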

# $\chi^2$ value 

SciPy's `chi2_contingency()` returns the $\chi^2$ value, the p-value, the degrees of freedom and the expected values.

In [46]:
chi2_contingency(tshirts)

(11.56978992417547,
 0.00901202511379703,
 3,
 array([[42.93103448, 30.        , 38.79310345, 38.27586207],
        [40.06896552, 28.        , 36.20689655, 35.72413793]]))

# Expected values

You can find the expected values in the fourth item of the returned value. They come as a NumPy array.
The array is a bit hard to read, so let's print the expected values in a friendlier way. We again use a Pandas dataframe.

In [47]:
df=chi2_contingency(tshirts)[3]
print(df)
pd.DataFrame(
    data=df[:,:], 
    index=["Male","Female"],
    columns=["Black","White","Red","Blue"]
)

[[42.93103448 30.         38.79310345 38.27586207]
 [40.06896552 28.         36.20689655 35.72413793]]


| | Black | White | Red | Blue |
|---|---|---|---|---|
| Male | 42.931034 | 30.0 | 38.793103 | 38.275862 |
| Female | 40.068966 | 28.0 | 36.206897 | 35.724138 |
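
If you are curious where these numbers come from, each expected value is the row total times the column total divided by the grand total. The sketch below reproduces them with NumPy. The variable names are my own and the counts are the T-shirt data from above.

```python
import numpy as np

observed = np.array([[48, 22, 33, 47],
                     [35, 36, 42, 27]])

row_totals = observed.sum(axis=1)   # totals for Male and Female
col_totals = observed.sum(axis=0)   # totals for each color
grand_total = observed.sum()

# expected[i, j] = row_totals[i] * col_totals[j] / grand_total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(2))
```

For example, the expected value for Male and White is 150 x 58 / 290 = 30.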


# Selecting data

NumPy arrays and Pandas dataframes give us different ways to select data.
In the above code, we used `df[:,:]`, where `df` is the NumPy array of expected values. The first part is for rows and the second part is for columns. A `:` on its own means select all, so `df[:,:]` selects all rows and all columns.

Let's find out more details with examples. We create an array of numbers with `numpy`. Then we use slices of it to create Pandas dataframes.

In [48]:
mydata = np.array([[1,2,3,4],
                   [5,6,7,8],
                   [9,10,11,12],
                   [13,14,15,16]]
                 )

In order to select every row after the first row and all columns, use `mydata[1:,:]`.

In [49]:
pd1= pd.DataFrame(
    data=mydata[1:,:]
)
print(pd1)

    0   1   2   3
0   5   6   7   8
1   9  10  11  12
2  13  14  15  16


In order to select all rows and every column after the second column, use `mydata[:,2:]`.

In [50]:
# Select all rows and every column after the second column.
pd2= pd.DataFrame(
    data=mydata[:,2:]
)
print(pd2)

    0   1
0   3   4
1   7   8
2  11  12
3  15  16


In order to select every row after the first row and every column after the second column, use `mydata[1:,2:]`.

In [51]:
# Select every row after the first row and every column after the second column.
pd3= pd.DataFrame(
    data=mydata[1:,2:]
)
print(pd3)

    0   1
0   7   8
1  11  12
2  15  16


There is another way to select data from a Pandas dataframe: `df["column name"]["row name"]`. This method has limitations. The column must be given by name, not by position. The row can be given by name; selecting it by position with a plain integer also works in older pandas releases, but newer ones deprecate it in favor of `iloc`.
We create a Pandas dataframe from the numpy array `mydata` for this exercise. We add row and column names using `index` and `columns`.

In [52]:
df = pd.DataFrame(mydata)
df.index=["Row 1","Row 2","Row 3","Row 4"]
df.columns=["Col 1", "Col 2", "Col 3", "Col 4"]
print(df)

       Col 1  Col 2  Col 3  Col 4
Row 1      1      2      3      4
Row 2      5      6      7      8
Row 3      9     10     11     12
Row 4     13     14     15     16


We can select the `Col 1` column.

In [53]:
print(df["Col 1"])

Row 1     1
Row 2     5
Row 3     9
Row 4    13
Name: Col 1, dtype: int64


Selecting `Col 2` and `Row 3` gives 10.

In [54]:
print(df["Col 2"]["Row 3"])

10


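Using the row position to select the same number 10. Recent pandas versions deprecate a plain integer position on a labelled Series (`df["Col 2"][2]`), so we use `iloc` for the position.

In [55]:
print(df["Col 2"].iloc[2])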

10


One more way to select data is to use `iloc` and `loc`. If you would like to know more about them, please read [this article](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/).
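
As a quick taste, here is a small sketch of `loc` and `iloc` on the same 4 x 4 data. It rebuilds the dataframe with `np.arange` so that the cell runs on its own. The variable names `by_label`, `by_position` and `block` are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(1, 17).reshape(4, 4),
    index=["Row 1", "Row 2", "Row 3", "Row 4"],
    columns=["Col 1", "Col 2", "Col 3", "Col 4"],
)

by_label = df.loc["Row 3", "Col 2"]   # select by row name and column name
by_position = df.iloc[2, 1]           # select by zero-based row and column positions
block = df.iloc[1:, 2:]               # positional slicing, like the NumPy examples

print(by_label, by_position)
print(block)
```

`loc` always takes labels and `iloc` always takes positions, so there is no ambiguity about what a number means.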

## More about $\chi^2$ value

Let's go back to the $\chi^2$. Previously I wrote that you can find $\chi^2$ in the first returned value from `chi2_contingency`. But how do you find the $\chi^2$ manually? The formula for the Chi-square is: 

\begin{equation}
\chi^2=\Sigma\frac{(O-E)^2}{E}
\end{equation}

Here O is the observed (actual) value and E is the expected value.

## Side note about Latex

I used LaTeX, pronounced 'lah-teck', to write the above equation in Jupyter Notebook. The cell you are writing in must be a Markdown cell, and this is what you need to type in the cell.

<center><img src="image/markdown.png"></center>

The $\chi^2$ equation tells us to find the square of the difference between each actual value and its expected value and divide it by the expected value. Then we add all the results together to find the $\chi^2$ value.

\begin{equation}
\frac{(48-42.93)^2}{42.93}+\frac{(22-30)^2}{30}+\frac{(33-38.79)^2}{38.79}+\frac{(47-38.28)^2}{38.28}+\frac{(35-40.07)^2}{40.07}+\frac{(36-28)^2}{28}+\frac{(42-36.21)^2}{36.21}+\frac{(27-35.72)^2}{35.72}
\end{equation}

This is what `chi2_contingency` is doing behind the scenes. Since Python uses zero-based indexing, in order to print the $\chi^2$ value we need to use `[0]`, which is the first value.

In [56]:
chisquare=chi2_contingency(tshirts)[0]
chisquare

11.56978992417547
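
To check the formula yourself, the sketch below applies it to the T-shirt table with NumPy. The name `chi_manual` is my own. Note that `chi2_contingency` applies a continuity correction by default only when the degrees of freedom is 1, which is not the case for this 2 x 4 table.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[48, 22, 33, 47],
                     [35, 36, 42, 27]])
expected = chi2_contingency(observed)[3]

# Sum of (O - E)^2 / E over every cell of the table.
chi_manual = ((observed - expected) ** 2 / expected).sum()
print(chi_manual)
```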

## p-value

You can read more details about p-value [here](https://towardsdatascience.com/p-values-explained-by-data-scientist-f40a746cfc8). You can find the p-value in the second returned value.

In [57]:
pvalue=chi2_contingency(tshirts)[1]
pvalue

0.00901202511379703
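
The p-value is the area under the $\chi^2$ distribution to the right of our $\chi^2$ value. As a sketch, we can compute it ourselves with `chi2.sf`, the survival function in `scipy.stats`. The variable names here are my own.

```python
from scipy.stats import chi2, chi2_contingency

tshirts = [[48, 22, 33, 47],
           [35, 36, 42, 27]]
chi_value, pvalue, dof, expected = chi2_contingency(tshirts)

# sf is the survival function, 1 - cdf: the area to the right of chi_value.
p_manual = chi2.sf(chi_value, dof)
print(p_manual)
```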

## Degree of freedom

You can find the degrees of freedom in the third returned value. We are going to use this to find the critical value later. The way you find the degrees of freedom for the $\chi^2$ test for independence is different from the $\chi^2$ goodness of fit test.

For $\chi^2$ for independence:
\begin{equation}
\text{dof} = \text{(the number of rows - 1)} \times \text{ (the number of columns - 1) }
\end{equation}

For example if your data has 4 rows x 3 columns, then the degree of freedom is:
\begin{equation}
\text{dof } = (4-1) \times (3-1)=6
\end{equation}

For the $\chi^2$ goodness of fit test, the categorical data has one dimension, and the degrees of freedom is:

\begin{equation}
\text{dof } = (n - 1) \text{ where n is the number of categories that the variable is divided into.}
\end{equation}


For our T-shirts, we use `[2]`, which is the third item, to find the degrees of freedom.

In [98]:
dof=chi2_contingency(tshirts)[2]
dof

3
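
As a small sketch, we can confirm this with the formula above by taking the number of rows and columns from the shape of the table. The name `dof_manual` is my own.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[48, 22, 33, 47],
                     [35, 36, 42, 27]])

rows, cols = observed.shape          # 2 rows and 4 columns
dof_manual = (rows - 1) * (cols - 1)
print(dof_manual)
```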

# Importing data

## Horizontal data

Generally, you want to import data from a file. The first CSV file has the data laid out horizontally. `pd.read_csv` reads the file and returns a Pandas dataframe.

The CSV file has the following data.

Let's store the data to a variable called `tshirtshor`.

In [99]:
tshirtshor = pd.read_csv('https://raw.githubusercontent.com/shinokada/python-for-ib-diploma-mathematics/master/Data/tshirts-horizontal.csv')
tshirtshor

| | gender | Black | White | Red | Blue |
|---|---|---|---|---|---|
| 0 | Male | 48 | 12 | 33 | 57 |
| 1 | Female | 35 | 46 | 42 | 27 |


But we cannot use `tshirtshor` yet. You can try what happens if you run `chi2_contingency(tshirtshor)`. When we used `pd.read_csv()`, Pandas added a numeric index and kept `gender` as an ordinary column of text, so the table is no longer purely numeric. There are a couple of solutions. The first one is to set the index to gender.

In [100]:
new1= tshirtshor.set_index('gender')
new1

| gender | Black | White | Red | Blue |
|---|---|---|---|---|
| Male | 48 | 12 | 33 | 57 |
| Female | 35 | 46 | 42 | 27 |


Now we can run `chi2_contingency(new1)`.

In [101]:
chi2_contingency(new1)

(33.76146477535758, 2.2247293911334693e-07, 3, array([[41.5, 29. , 37.5, 42. ],
        [41.5, 29. , 37.5, 42. ]]))

The second way is to remove the gender column.

In [92]:
new2 = tshirtshor.iloc[:,1:]
new2

| | Black | White | Red | Blue |
|---|---|---|---|---|
| 0 | 48 | 12 | 33 | 57 |
| 1 | 35 | 46 | 42 | 27 |


Now we can use `chi2_contingency()`.

In [93]:
chi2_contingency(new2)

(33.76146477535758, 2.2247293911334693e-07, 3, array([[41.5, 29. , 37.5, 42. ],
        [41.5, 29. , 37.5, 42. ]]))
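
A third option, offered here as a sketch rather than as part of the original notebook, is to tell `pd.read_csv` which column holds the row labels by using its `index_col` parameter. To keep the example runnable without a download, it reads an inline copy of the data through `io.StringIO`. The names `csv_text` and `new3` are my own.

```python
import io

import pandas as pd
from scipy.stats import chi2_contingency

# Inline copy of the horizontal data, so the example runs without a download.
csv_text = """gender,Black,White,Red,Blue
Male,48,12,33,57
Female,35,46,42,27
"""

# index_col=0 uses the first column (gender) as the row labels.
new3 = pd.read_csv(io.StringIO(csv_text), index_col=0)
chi_value, pvalue, dof, expected = chi2_contingency(new3)

print(new3)
print(chi_value, dof)
```

With a real file, you would pass the file path or URL instead of `io.StringIO(csv_text)`.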

## Vertical data

We are going to use vertically laid out data. Let's store the data to a variable called `tshirtsver`.

In [97]:
tshirtsver = pd.read_csv('https://raw.githubusercontent.com/shinokada/python-for-ib-diploma-mathematics/master/Data/tshirts-vertical.csv')
tshirtsver

| | Color | Male | Female |
|---|---|---|---|
| 0 | Black | 48 | 35 |
| 1 | White | 12 | 46 |
| 2 | Red | 33 | 42 |
| 3 | Blue | 57 | 27 |


As you can see, Pandas added a numeric index and the `Color` column contains text, so we need to remove that column or set it as the index.

In [103]:
arr2 = tshirtsver.iloc[:,1:]
arr2

| | Male | Female |
|---|---|---|
| 0 | 48 | 35 |
| 1 | 12 | 46 |
| 2 | 33 | 42 |
| 3 | 57 | 27 |


## iloc

Wait, wait! What is `iloc`? `iloc` is used to select data by position. The structure of `iloc` is `iloc[row index, column index]`. Our `iloc[:,1:]` means select all rows and every column after the first column.

Now we can use our friend `chi2_contingency()`.

In [104]:
chi2_contingency(arr2)

(33.76146477535758, 2.2247293911334693e-07, 3, array([[41.5, 41.5],
        [29. , 29. ],
        [37.5, 37.5],
        [42. , 42. ]]))

## Pandas.DataFrame.transpose()

If you prefer horizontal data to vertical data, you can transpose the data from vertical to horizontal by using `pandas.DataFrame.transpose()` or `T` for short.

In [23]:
tshirtsver.T

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Color | Black | White | Red | Blue |
| Male | 48 | 12 | 33 | 57 |
| Female | 35 | 46 | 42 | 27 |


We remove the first row using `iloc[1:,0:]`.

In [106]:
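# The transposed values are stored as generic objects, so convert them back to integers.
num2=tshirtsver.T.iloc[1:,0:].astype(int)
num2

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Male | 48 | 12 | 33 | 57 |
| Female | 35 | 46 | 42 | 27 |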


Now let's use `chi2_contingency()`.

In [109]:
chi2_contingency(num2)

(33.76146477535759, 2.224729391133464e-07, 3, array([[41.5, 29. , 37.5, 42. ],
        [41.5, 29. , 37.5, 42. ]]))
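
Notice that the vertical and horizontal layouts give the same $\chi^2$ value, p-value and degrees of freedom. Only the shape of the expected values changes. The sketch below checks this with plain NumPy arrays holding the same counts. The variable names are my own.

```python
import numpy as np
from scipy.stats import chi2_contingency

vertical = np.array([[48, 35],
                     [12, 46],
                     [33, 42],
                     [57, 27]])
horizontal = vertical.T   # 2 rows x 4 columns

chi_v, p_v, dof_v, exp_v = chi2_contingency(vertical)
chi_h, p_h, dof_h, exp_h = chi2_contingency(horizontal)

print(chi_v, chi_h)              # the same statistic either way
print(exp_v.shape, exp_h.shape)  # only the layout of the expected values differs
```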

## Critical values

The level of significance and the degrees of freedom can be used to find the critical value. As I mentioned before, you can find the degrees of freedom in the third returned value of `chi2_contingency()`. In order to find critical values, you need to import `chi2` from `scipy.stats` and define the probability from the level of significance: 1%, 5%, 10%, etc.

In [40]:
from scipy.stats import chi2
significance = 0.01
p = 1 - significance
dof = chi2_contingency(arr2)[2]
critical_value = chi2.ppf(p, dof)
critical_value

11.344866730144373

When the degrees of freedom is 3 and the level of significance is 1%, the critical value is about 11.34.
You can confirm this value using `cdf`.

In [44]:
p = chi2.cdf(critical_value, dof)
p

0.99
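
As a sketch, the loop below lists the critical values for three common levels of significance when the degrees of freedom is 3. It uses `chi2.isf`, the inverse survival function, which takes the level of significance directly instead of `1 - significance`. The dictionary `critical_values` is a name I made up for this example.

```python
from scipy.stats import chi2

dof = 3
critical_values = {}
for significance in (0.10, 0.05, 0.01):
    # isf(a, dof) gives the same value as ppf(1 - a, dof).
    critical_values[significance] = chi2.isf(significance, dof)
    print(significance, round(critical_values[significance], 3))
```

The smaller the level of significance, the larger the critical value, so the harder it is to reject the null hypothesis.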

# The null and alternative hypotheses

$H_0:$ Two variables are independent.

$H_1:$ Two variables are dependent.

We decide whether to reject the null hypothesis by comparing the Chi-square value with the critical value.

In [47]:
subjects = pd.DataFrame(
    [
        [25,46,15],
        [15,44,15],
        [10,10,20]
    ],
    index=['Biology','Chemistry','Physics'],
    columns=['Math SL AA','Math SL AI','Math HL'])
subjects

| | Math SL AA | Math SL AI | Math HL |
|---|---|---|---|
| Biology | 25 | 46 | 15 |
| Chemistry | 15 | 44 | 15 |
| Physics | 10 | 10 | 20 |


If $\chi_{calc}^2 > \chi_{critical}^2$ we reject the null hypothesis.

In [76]:
chi, pval, dof, exp = chi2_contingency(subjects)

significance = 0.05
p = 1 - significance
critical_value = chi2.ppf(p, dof)

print('chi=%.6f, critical value=%.6f\n' % (chi, critical_value))

if chi > critical_value:
    print("""At %.2f level of significance, we reject the null hypothesis and accept H1. 
They are not independent.""" % (significance))
else:
    print("""At %.2f level of significance, we fail to reject the null hypothesis. 
There is not enough evidence that they are dependent.""" % (significance))

chi=20.392835, critical value=9.487729

At 0.05 level of significance, we reject the null hypothesis and accept H1. 
They are not independent.


Alternatively, we can compare the p-value and the level of significance.
If `p-value < the level of significance`, we reject the null hypothesis.

In [74]:
chi, pval, dof, exp = chi2_contingency(subjects)
significance = 0.05

print('p-value=%.6f, significance=%.2f\n' % (pval, significance))

if pval < significance:
    print("""At %.2f level of significance, we reject the null hypothesis and accept H1. 
They are not independent.""" % (significance))
else:
    print("""At %.2f level of significance, we fail to reject the null hypothesis. 
There is not enough evidence that they are dependent.""" % (significance))

p-value=0.000418, significance=0.05

At 0.05 level of significance, we reject the null hypothesis and accept H1. 
They are not independent.
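
If you run this test often, you may want to wrap the steps in a small function. The sketch below is one possible way to do it. `independence_test` is a hypothetical helper of my own, not part of SciPy, and it uses the p-value comparison shown above.

```python
from scipy.stats import chi2_contingency

def independence_test(table, significance=0.05):
    """Return (chi-square, p-value, dof, reject_h0) for a table of counts.

    This is a hypothetical helper written for this article, not a SciPy function.
    """
    chi_value, pvalue, dof, expected = chi2_contingency(table)
    reject_h0 = pvalue < significance   # True means we reject the null hypothesis
    return chi_value, pvalue, dof, reject_h0

subjects = [[25, 46, 15],
            [15, 44, 15],
            [10, 10, 20]]
chi_value, pvalue, dof, reject_h0 = independence_test(subjects)
print(chi_value, dof, reject_h0)
```

Returning a plain `True` or `False` keeps the decision separate from the printed message, so you can reuse the function with other tables and levels of significance.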


# Conclusion

Once you know how to extract data from a CSV file, it is reasonably easy to find the Chi-square value, p-value, degrees of freedom and critical value.