
In [1]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, chi2
import statistics as stat
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
sns.set(style="white", color_codes=True, rc={'figure.figsize':(15,8)})

In [2]:
def chi2_test_of_independence(data, significance=0.05):
    """
    Perform a chi-square test of independence to determine whether there is a
    significant association between two categorical variables, and print the decision.

    data : numpy array or pandas DataFrame of observed counts (rows x columns)
    significance : float, the significance level for the test

    returns : tuple (chi-square statistic, critical value, p-value)
    """
    # Convert to a NumPy array so positional indexing also works for DataFrames
    data = np.asarray(data)
    n_rows, n_cols = data.shape
    chisquare = 0
    # Calculate the degrees of freedom and the chi-square critical value
    dof = (n_rows - 1) * (n_cols - 1)
    chi2_critical = chi2.ppf(1 - significance, dof)

    # Expected frequencies under independence, same shape as the observed table
    expected = np.zeros((n_rows, n_cols))

    for i in range(n_rows):
        for j in range(n_cols):
            # Expected count = (row total * column total) / grand total
            expected[i, j] = ( data[i, :].sum() * data[:, j].sum() ) / data.sum()
            chisquare += (data[i, j] - expected[i, j])**2/expected[i, j]

    p_value = 1 - chi2.cdf(chisquare, dof)

    if chisquare > chi2_critical:
        print("Reject the null hypothesis (there is a significant association).")
    else:
        print("Fail to reject the null hypothesis (no significant association).")

    return chisquare, chi2_critical, p_value








# Chi-Square Test of Independence
Suppose you conduct a survey to see if there is an association between gender (male/female) and preference for a type of music (pop/rock). You collect the following data:<br />


| Gender/Music Type | Pop | Rock | Row Total |
|-------------------|-----|------|-----------|
| Male              | 30  |  20  |    50     |
| Female            | 25  |  25  |    50     |
| Column Total      | 55  |  45  |    100    |

---
Null hypothesis $H_0$: There is no association between gender and music preference. <br />
Alternative hypothesis $H_a$: There is an association between gender and music preference.

α = 0.05
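
For reference, the quantities computed in the cells that follow can be written out as below. The symbols are notation assumed here, not taken from the notebook: $O_{ij}$ is the observed count in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, $N$ is the grand total, and $r$ and $c$ are the numbers of rows and columns.

$$
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\text{dof} = (r - 1)(c - 1)
$$

For example, the expected count for the (Male, Pop) cell is $E = 50 \times 55 / 100 = 27.5$.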

In [3]:
df = pd.DataFrame({'Pop':[30,25],'Rock':[20,25]},index=['Male','Female'])
df.head()

        Pop  Rock
Male     30    20
Female   25    25


In [4]:
observed = df.values
observed

array([[30, 20],
       [25, 25]])

In [5]:
dof = (df.shape[0]-1)*(df.shape[1]-1)
dof

1

In [6]:
alpha = 0.05

In [7]:
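# Expected frequency of each cell = (row total * column total) / grand total
E00 = (df.iloc[0,:].sum()*df.iloc[:,0].sum())/df.sum().sum()
E01 = (df.iloc[0,:].sum()*df.iloc[:,1].sum())/df.sum().sum()
E10 = (df.iloc[1,:].sum()*df.iloc[:,0].sum())/df.sum().sum()
E11 = (df.iloc[1,:].sum()*df.iloc[:,1].sum())/df.sum().sum()
# Note: despite its name, Contingency_Table holds the expected frequencies, not the observed counts
Contingency_Table=pd.DataFrame({'Pop':[E00,E10],'Rock':[E01,E11]},index=['Male','Female'])
Contingency_Table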

         Pop  Rock
Male    27.5  22.5
Female  27.5  22.5
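
The cell-by-cell calculation above can also be done in one step for a table of any size. The following is a minimal sketch, assuming the observed counts are held in a NumPy array; the names `row_totals`, `col_totals` and `grand_total` are introduced here for illustration and do not appear elsewhere in the notebook.

```python
import numpy as np

# Observed counts for the gender / music example
observed = np.array([[30, 20],
                     [25, 25]])

row_totals = observed.sum(axis=1)   # one total per row
col_totals = observed.sum(axis=0)   # one total per column
grand_total = observed.sum()

# The outer product gives row_total[i] * col_total[j] for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

Because `np.outer` builds the full grid of row-total times column-total products, the same three lines work unchanged for larger tables.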


In [8]:
chi_square = 0
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        chi_square += (df.iloc[i,j] - Contingency_Table.iloc[i,j])**2/Contingency_Table.iloc[i,j]
chi_square

1.0101010101010102

In [9]:
chi2_critical = chi2.ppf(1-alpha, dof)
chi2_critical

3.841458820694124

In [10]:
if chi_square > chi2_critical:
    print("Reject the null hypothesis (there is a significant association).")
else:
    print("Fail to reject the null hypothesis (no significant association).")

Fail to reject the null hypothesis (no significant association).


In [11]:
p = 1 - chi2.cdf(chi_square, dof)
p

0.31487864133641974
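
As a side note, `scipy.stats.chi2` also provides the survival function `sf`, which returns the upper-tail probability directly. The sketch below, which hard-codes the statistic and degrees of freedom from the cells above, compares it with `1 - chi2.cdf(...)`. For a statistic of this size the two should agree to floating-point precision; `sf` tends to matter only for very large statistics, where `1 - cdf` loses precision.

```python
from scipy.stats import chi2

chi_square = 1.0101010101010102  # statistic computed earlier for the 2x2 table
dof = 1

p_cdf = 1 - chi2.cdf(chi_square, dof)  # approach used in this notebook
p_sf = chi2.sf(chi_square, dof)        # upper-tail probability computed directly

print(p_cdf, p_sf)
```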

In [12]:
# Interpretation
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis (there is a significant association).")
else:
    print("Fail to reject the null hypothesis (no significant association).")

Fail to reject the null hypothesis (no significant association).


We want to determine if there is a significant association between education level and employment status. We survey 200 people and collect the following data:

---

| Education Level | Employed | Unemployed | Row Total |
|-----------------|----------|------------|-----------|
| High School     | 50       | 30         | 80        |
| Bachelor's      | 60       | 20         | 80        |
| Master's        | 30       | 10         | 40        |
| Column Total    | 140      | 60         | 200       |

---
Null hypothesis $H_0$: There is no association between education level and employment status. <br />
Alternative hypothesis $H_a$: There is an association between education level and employment status.

α = 0.05

In [13]:
df = pd.DataFrame({'Employed':[50,60,30],'Unemployed':[30,20,10]},index=['High School','Bachelor','Master'])
print(df)
data = df.values
dof = (data.shape[0]-1)*(data.shape[1]-1)
data

             Employed  Unemployed
High School        50          30
Bachelor           60          20
Master             30          10


array([[50, 30],
       [60, 20],
       [30, 10]])

In [14]:
expected = np.zeros((data.shape[0], data.shape[1]))
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        expected[i, j] = ( data[i, :].sum() * data[:, j].sum() ) / data.sum()
print(expected)

chisquare = 0
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        chisquare += (data[i, j] - expected[i, j])**2/expected[i, j]
print(chisquare)

chi2_critical = chi2.ppf(1-0.05, dof)
chi2_critical

[[56. 24.]
 [56. 24.]
 [28. 12.]]
3.571428571428571


5.991464547107979

In [15]:
chi2_test_of_independence(data, significance=0.05)

Fail to reject the null hypothesis (no significant association).


(3.571428571428571, 5.991464547107979, 0.16767724875179713)
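
The manual results can be cross-checked against `scipy.stats.chi2_contingency`, which is imported at the top of the notebook but not otherwise used. This is a sketch rather than part of the original analysis. One detail worth noting: SciPy applies Yates' continuity correction by default when there is one degree of freedom, so `correction=False` is passed here to reproduce the uncorrected statistic computed by hand for the 2x2 table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the two examples above
music = np.array([[30, 20], [25, 25]])
jobs = np.array([[50, 30], [60, 20], [30, 10]])

# correction=False switches off Yates' continuity correction, which SciPy
# applies by default when dof == 1 (that is, for the 2x2 table)
stat_music, p_music, dof_music, exp_music = chi2_contingency(music, correction=False)
stat_jobs, p_jobs, dof_jobs, exp_jobs = chi2_contingency(jobs, correction=False)

print(stat_music, p_music, dof_music)
print(stat_jobs, p_jobs, dof_jobs)
```

The statistics, p-values and degrees of freedom should match the manual values above up to floating-point rounding. With the default `correction=True`, the statistic for the 2x2 table would come out smaller than the hand-computed one.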