# Chi Square Test

The chi-square test is used for categorical variables such as hair color or academic degree.


**Contingency Table or Cross tabulations**

We may wish to look at a summary of one categorical variable as it pertains to another categorical variable. For example, sex and interest, where interest may have the labels 'science', 'math', or 'art'. We can collect observations from people with regard to these two categorical variables.

We can summarize the collected observations in a table with one variable corresponding to columns and another variable corresponding to rows. Each cell in the table corresponds to the count or frequency of observations that correspond to the row and column categories.

Historically, a table summarization of two categorical variables in this form is called a contingency table.
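
In practice, a contingency table is usually built from raw observations rather than typed in by hand. The sketch below assumes a small, made-up survey (the `responses` frame and its values are hypothetical, not part of the original data) and uses `pd.crosstab` to count each combination of categories.

```python
import pandas as pd

# Hypothetical raw survey responses: one row per person (made-up data).
responses = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "interest": ["Science", "Art", "Math", "Art", "Math", "Science"],
})

# Count how many people fall into each (sex, interest) combination.
table = pd.crosstab(responses["sex"], responses["interest"])
print(table)
```

Each cell of `table` holds the number of respondents with that row and column label, which is exactly the layout the chi-square tests below expect.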

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# For example, the Sex=rows and Interest=columns table with contrived counts might look as follows:
data = [[20, 30, 15], [20, 15, 30]]

In [3]:
interests = pd.DataFrame(data = data, columns=["Science", "Math", "Art"], index=["Male", "Female"])
interests

| | Science | Math | Art |
|---|---|---|---|
| Male | 20 | 30 | 15 |
| Female | 20 | 15 | 30 |


# Goodness of Fit Test

Generally, statistical tests take the form:
<br>*test statistic = (observed value - what we expect if the null is true) / average variation*

The chi-square test is slightly different.
<br>The numerator is still built from the difference between observed and expected values, but the denominator (the average variation) is a little different.
<br>Let's see why with an example.

**Example**

A new game has come out called League of Lemurs (LOL). It has hundreds of unique characters you can play, of four different types: healers, tanks, assassins and fighters.
<br>The official League of Lemurs development team says that on average they see 15% of players choosing healers, 20% choosing tanks, 20% choosing assassins and 45% choosing fighters, but you wonder whether that distribution is true and you wish to perform a statistical test.

*Null hypothesis: the percentages that LOL gave you are correct*<br>
*Alternate hypothesis: at least one of these percentages is incorrect*

Now you record 20 games with 10 players each and count the number of healers, tanks, assassins and fighters.
<br>The total number of players in 20 games would be 20 × 10 = 200.

In [4]:
# This is the data you've collected
data = [25, 35, 50, 90]
columns = ["Healer", "Tanks", "Assasins", "Fighter"]
# one row of observed counts, one column per character type
lol = pd.DataFrame([data], columns=columns, index=["count"])

In [5]:
lol

| | Healer | Tanks | Assasins | Fighter |
|---|---|---|---|---|
| count | 25 | 35 | 50 | 90 |


In [6]:
# if the percentages given by the LOL developers are correct,
# the expected counts for 200 players would be
data1 = [200 * 0.15, 200 * 0.20, 200 * 0.20, 200 * 0.45]
lol_dev = pd.DataFrame([data1], columns=columns, index=["count"])

In [7]:
lol_dev

| | Healer | Tanks | Assasins | Fighter |
|---|---|---|---|---|
| count | 30.0 | 40.0 | 40.0 | 90.0 |


So you can see that these numbers aren't exactly the same.
<br>But you have to ask yourself whether they're different enough to be considered statistically significant.
<br>We need a test statistic.

Using our general formula, the numerator would be $\text{observed value} - \text{expected value}$,
<br>but if you add up all these differences, you always get zero,
<br>since the total count is the same for the observed and the expected counts.

In [8]:
# let's add the differences
sum(lol.loc["count"] - lol_dev.loc["count"]) # it's equal to zero

0.0

As we can see, we need a better measure.
<br>For the chi-square statistic, we square the differences before adding them up.

Now, for the denominator, instead of a standard error, we use the expected counts again.

**Why?**
<br>*Because the amount that a count deviates from its expected frequency should be scaled by the expected frequency.*

*For example,*
<br>A deviation of 1 isn't a big deal if the expected count is 2000. But if the expected count is 10, that deviation of 1 matters more.
<br>Hence the need to scale each squared deviation.
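
To make the scaling argument concrete, here is a minimal sketch. The helper `scaled_squared_deviation` is a hypothetical name introduced only for this illustration; it computes a single term of the chi-square sum.

```python
def scaled_squared_deviation(observed, expected):
    """One term of the chi-square sum: squared deviation scaled by the expected count."""
    return (observed - expected) ** 2 / expected


# The same deviation of 1, measured against two different expected counts.
small_expected = scaled_squared_deviation(11, 10)
large_expected = scaled_squared_deviation(2001, 2000)

print(small_expected, large_expected)
```

The deviation is identical in both cases, but its contribution is far larger when the expected count is small.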

Now, the test statistic is:
<br>$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \dots$
<br>where $O_i$ is the observed count and $E_i$ is the expected count for category $i$.


In [9]:
# let's calculate the statistic
st = (lol.loc["count"]-lol_dev.loc["count"])**2/lol_dev.loc["count"]
st.sum()

3.9583333333333335

Like a t-statistic, a chi-square statistic has a distribution we can use to find the p value.
<br>And like t distributions, chi-square distributions change their shape as the degrees of freedom change.
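
One way to see how the distribution shifts with the degrees of freedom is to compare its 95th percentile, which is the critical value at a 0.05 significance level (a level chosen here only for illustration). This sketch uses `scipy.stats.chi2.ppf`; the variable names are illustrative.

```python
from scipy import stats

# 95th percentile of the chi-square distribution for several degrees of freedom.
critical_values = {df: stats.chi2.ppf(0.95, df) for df in (1, 3, 5, 10)}

for df, value in critical_values.items():
    print(f"df={df}: critical value = {value:.3f}")
```

The critical value grows with the degrees of freedom, so the same statistic can be significant for a small table and unremarkable for a large one.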

*To find the degrees of freedom, we have to think about what kind of independent information we have.*

A frequency table, like the one we just used for our LOL example, has a certain number of cells. We have 4 cells in this case.

That means we start with 4 pieces of information: each of the 4 counts.
<br>But as soon as we know the total count (200 in this case), the four values aren't ALL independent anymore,
<br>because if you know three of the values and the total, you can find the fourth one.

So, in this case, the degrees of freedom are the number of categories minus one:
<br>4 - 1 = 3

Using the chi-square distribution with 3 degrees of freedom, we can find our p value.
<br>Our p value here is about 0.27, which is larger than common significance levels such as 0.05. Hence, we fail to reject the null.
<br>The sample we took didn't give us statistically significant evidence that the game developers' percentages were wrong.
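
As a sketch of where this p value comes from: it is the area under the chi-square density to the right of the statistic, which `scipy.stats.chi2.sf` (the survival function) returns. The variable names below are illustrative.

```python
from scipy import stats

statistic = 3.9583333333333335  # chi-square statistic computed above
dof = 4 - 1                     # number of categories minus one

# Probability of seeing a statistic at least this large if the null is true.
p_value = stats.chi2.sf(statistic, dof)
print(round(p_value, 3))
```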

In [10]:
# let's implement this in python

from scipy import stats
 
stats.chisquare(lol.loc["count"], f_exp=lol_dev.loc["count"]) # it's the same

Power_divergenceResult(statistic=3.9583333333333335, pvalue=0.26599866994096394)

All chi-square tests follow the same formula we just worked through.

The one we just did is called the goodness of fit test, because we tested how well certain proportions fit our sample.
<br>One way to recognize a goodness of fit chi-square test is that its table has only one row.
<br>We can have many categories, but we're only looking at one variable, like character class in our case.

**Note:** One thing we should always check when doing a chi-square test is whether the expected frequency for every cell is at least 5.
<br>If any expected frequency is lower than 5, the chi-square approximation can be unreliable and the results can be quite off.
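
A quick way to run this check, sketched here for the League of Lemurs example using the common "at least 5" rule of thumb (variable names are illustrative):

```python
import numpy as np

total_players = 200
claimed_proportions = np.array([0.15, 0.20, 0.20, 0.45])

# Expected counts under the null hypothesis.
expected_counts = total_players * claimed_proportions

# True only if every category has an expected count of at least 5.
all_large_enough = bool((expected_counts >= 5).all())
print(expected_counts, all_large_enough)
```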

Chi-square tests aren't limited to analyzing just ONE categorical variable.
<br>They can also handle two.

This second type of chi-square test is called the test of independence.

# Test of Independence

Tests of independence check whether one categorical variable is independent of another.
<br>It is a hypothesis test that answers the question: do the values of one categorical variable depend on the value of another categorical variable?

The Null and alternate hypotheses would be:
<br>**Null hypothesis:** There are no relationships between the categorical variables. If you know the value of one variable, it does not help you predict the value of another variable.
<br>**Alternative hypothesis:** There are relationships between the categorical variables. Knowing the value of one variable does help you predict the value of another variable.

**Example**
<br>Let's ask a group of Harry Potter fans which house they like and whether they like pineapple on pizza.
<br>What we want to know is whether the pineapple on pizza preference is independent of the Hogwarts house.
<br>In other words, does liking pineapple on pizza affect the probability of identifying with each of the houses?

In [11]:
# let's look at the data

count = [[79, 122, 204, 74], [82, 130, 240, 69]]
pizza = pd.DataFrame(data=count, columns=["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"],
                    index = ["No", "Yes"])
pizza.index.rename("Like Pineapple Pizza?", inplace=True)
pizza

| Like Pineapple Pizza? | Gryffindor | Hufflepuff | Ravenclaw | Slytherin |
|---|---|---|---|---|
| No | 79 | 122 | 204 | 74 |
| Yes | 82 | 130 | 240 | 69 |


Unlike in our chi-square goodness of fit test, we're not specifying an exact distribution for the Hogwarts houses. Instead, we're comparing our two groups: yes pineapple and no pineapple.
<br>In this situation, we're not too concerned about the exact distribution.
<br>We just want to know whether it's different for people who like pineapple and for those who don't.

A chi-square test of independence can test whether or not one variable (pineapple preference) is independent of another (Hogwarts house).

To calculate our chi-square statistic, we need our observed frequencies which we already have, and our expected frequencies which we need to calculate.

In [12]:
# let's take the sum of the counts
pizza.loc["Count"] = [pizza.iloc[:, i].sum() for i in range(4)]

In [13]:
pizza["Count"] = [pizza.iloc[i, :].sum() for i in range(3)]
pizza

| Like Pineapple Pizza? | Gryffindor | Hufflepuff | Ravenclaw | Slytherin | Count |
|---|---|---|---|---|---|
| No | 79 | 122 | 204 | 74 | 479 |
| Yes | 82 | 130 | 240 | 69 | 521 |
| Count | 161 | 252 | 444 | 143 | 1000 |


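Next, we need to calculate our degrees of freedom.
<br>In general, the degrees of freedom for the chi-squared distribution are calculated from the size of the contingency table as:
<br>$\text{degrees of freedom} = (\text{rows} - 1) \times (\text{cols} - 1)$
<br>Our table has 2 rows and 4 columns (ignoring the totals), so the degrees of freedom are $(2 - 1) \times (4 - 1) = 3$.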
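
The same calculation can be sketched directly from the shape of the observed table (totals excluded). The array below repeats the pineapple counts; the variable names are illustrative.

```python
import numpy as np

# Observed counts only, without the row and column totals.
observed = np.array([[79, 122, 204, 74],
                     [82, 130, 240, 69]])

n_rows, n_cols = observed.shape
dof = (n_rows - 1) * (n_cols - 1)
print(dof)
```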

**Calculation of expected frequencies**
<br>Remember, the expected counts are what we would expect if the null hypothesis were true, i.e., no relationship exists between the categorical variables.

Now, the expected count of people who do not like pineapple on pizza and belong to Gryffindor is:
<br>Total number of people × (probability that a person belongs to Gryffindor and doesn't like pineapple on pizza)

It can be represented as:
<br>$\text{Total count} \times P(\text{Gryffindor} \cap \text{No})$

Now, since we assumed that the null hypothesis is true and the two variables are independent, the joint probability is the product of the individual probabilities:
<br>$P(\text{Gryffindor} \cap \text{No}) = P(\text{Gryffindor}) \times P(\text{No})$

Now, in our example:
<br>$P(\text{Gryffindor})$ = people in Gryffindor / total number of people
<br>161/1000 = 0.161
<br>$P(\text{No})$ = people who do not like pineapple on pizza / total number of people
<br>479/1000 = 0.479

Hence,
<br>$P(\text{Gryffindor} \cap \text{No})$ = 0.161 × 0.479 ≈ 0.0771

Hence, the expected count of people who do not like pineapple on pizza and belong to Gryffindor is:
<br>1000 × 0.0771 = 77.1

Using the same math, we can calculate the expected frequency for all our cells.
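
Applied to every cell at once, this rule reduces to (row total × column total) / grand total. A minimal NumPy sketch, assuming the observed counts are held in an array named `observed` (an illustrative name), uses an outer product of the margins:

```python
import numpy as np

# Observed counts only, without the totals.
observed = np.array([[79, 122, 204, 74],
                     [82, 130, 240, 69]])

row_totals = observed.sum(axis=1)   # totals for "No" and "Yes"
col_totals = observed.sum(axis=0)   # totals for each house
grand_total = observed.sum()

# Expected count for each cell under independence.
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

The expected table keeps the same row and column totals as the observed one; only the way the counts are spread across cells changes.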

In [14]:
# let's calculate the expected frequencies
pizza_exp = pizza.copy()
# share of all respondents in each house: column totals divided by the grand total
pizza_exp.loc["ratio"] = pizza_exp.loc["Count"]/pizza_exp.iloc[2, :-1].sum()

In [15]:
pizza_exp

| Like Pineapple Pizza? | Gryffindor | Hufflepuff | Ravenclaw | Slytherin | Count |
|---|---|---|---|---|---|
| No | 79.0 | 122.0 | 204.0 | 74.0 | 479.0 |
| Yes | 82.0 | 130.0 | 240.0 | 69.0 | 521.0 |
| Count | 161.0 | 252.0 | 444.0 | 143.0 | 1000.0 |
| ratio | 0.161 | 0.252 | 0.444 | 0.143 | 1.0 |


In [16]:
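# expected count = (house ratio) * (row total); .iloc makes the positional lookups explicit
for row in range(2):
    for column in range(4):
        pizza_exp.iloc[row, column] = pizza_exp.loc["ratio"].iloc[column]*pizza_exp["Count"].iloc[row]
pizza_exp # this is the expected frequency table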

| Like Pineapple Pizza? | Gryffindor | Hufflepuff | Ravenclaw | Slytherin | Count |
|---|---|---|---|---|---|
| No | 77.119 | 120.708 | 212.676 | 68.497 | 479.0 |
| Yes | 83.881 | 131.292 | 231.324 | 74.503 | 521.0 |
| Count | 161.0 | 252.0 | 444.0 | 143.0 | 1000.0 |
| ratio | 0.161 | 0.252 | 0.444 | 0.143 | 1.0 |


Once we have the expected frequencies, we just have to apply the chi-square formula to each cell and then add the results up to get our chi-square statistic.

In [24]:
# calculation of chi-square statistic

x = pizza.drop(["Count"])
x.drop(["Count"], axis =1, inplace=True)
x

| Like Pineapple Pizza? | Gryffindor | Hufflepuff | Ravenclaw | Slytherin |
|---|---|---|---|---|
| No | 79 | 122 | 204 | 74 |
| Yes | 82 | 130 | 240 | 69 |


In [27]:
y = pizza_exp.drop(["ratio", "Count"])
y.drop(["Count"], axis=1, inplace=True)
y

| Like Pineapple Pizza? | Gryffindor | Hufflepuff | Ravenclaw | Slytherin |
|---|---|---|---|---|
| No | 77.119 | 120.708 | 212.676 | 68.497 |
| Yes | 83.881 | 131.292 | 231.324 | 74.503 |


In [36]:
# calculation of chi-square statistic
count = 0
for i in x.columns:
    count+=sum(((x - y)**2/y)[i])
count

1.6425103571002855

And with our chi-square distribution with 3 degrees of freedom, our p value of about 0.65 is very large, so we fail to reject the null hypothesis that the distribution of Hogwarts houses is the same regardless of pizza preference.

In [17]:
# let's see this in python
x = pizza.drop(["Count"])
x.drop(["Count"], axis=1, inplace=True)
stat, p, dof, expected = stats.chi2_contingency(x)

In [18]:
# chi square statistic
stat

1.6425103571002833

In [19]:
# p value
p # hence we fail to reject the null

0.6497897497574125

In [20]:
# degrees of freedom
dof

3

In [21]:
# expected values
expected

array([[ 77.119, 120.708, 212.676,  68.497],
       [ 83.881, 131.292, 231.324,  74.503]])

All our values match!

# Test of Homogeneity

The test of homogeneity checks whether it's likely that different samples come from the same population.

For example,
<br>you might want to know whether two samples of water are likely from the same lake, based on the counts of fish, algae and bacteria found in them.

The calculation is the same as for the test of independence.
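
Because the arithmetic is identical, `stats.chi2_contingency` can be reused as is. The sketch below uses made-up counts for two water samples (the numbers are hypothetical and only illustrate the call); each row is one sample and each column is one kind of organism.

```python
import numpy as np
from scipy import stats

# Hypothetical counts, for illustration only.
# Columns: fish, algae, bacteria
samples = np.array([
    [30, 50, 20],  # water sample A
    [35, 45, 20],  # water sample B
])

stat, p, dof, expected = stats.chi2_contingency(samples)
print(stat, p, dof)
```

A large p value here would mean the two samples give no statistically significant evidence of coming from different populations.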

*References*

https://www.youtube.com/watch?v=7_cs1YlZoug&pbjreload=10
<br>https://machinelearningmastery.com/chi-squared-test-for-machine-learning/<br>
https://statisticsbyjim.com/hypothesis-testing/chi-square-test-independence-example/