
# IMPORT ALL REQUIRED LIBRARIES

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, chi2
import statistics as stat
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
sns.set(style="white", color_codes=True, rc={'figure.figsize':(15,8)})

# CHI-SQUARE TEST OF INDEPENDENCE

## ALL REQUIRED FUNCTIONS

In [2]:
def chi2_test_of_independence(data, significance=0.05):
    """
    Perform a chi-square test of independence to determine whether there is a
    significant association between two categorical variables, and print the decision.

    data : numpy array or pandas DataFrame of observed counts (rows x columns)
    significance : float, the significance level for the test

    returns : chisquare, chi2_critical, p_value

    """
    # Convert to an ndarray so the positional indexing below also works for DataFrames
    data = np.asarray(data)
    n_rows, n_cols = data.shape

    chisquare = 0
    # Calculate the degree of freedom and chi2 critical
    dof = (n_rows - 1) * (n_cols - 1)
    chi2_critical = chi2.ppf(1 - significance, dof)

    # Compute the expected frequencies under independence and accumulate the statistic
    expected = np.zeros((n_rows, n_cols))

    for i in range(n_rows):
        for j in range(n_cols):
            expected[i, j] = ( data[i, :].sum() * data[:, j].sum() ) / data.sum()
            chisquare += (data[i, j] - expected[i, j])**2/expected[i, j]

    print("Expected values:\n", pd.DataFrame(expected))

    p_value = 1 - chi2.cdf(chisquare, dof)

    if chisquare > chi2_critical:
        print("Reject the null hypothesis (there is a significant association).")
    else:
        print("Fail to reject the null hypothesis (no significant association).")

    return chisquare, chi2_critical, p_value
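
The double loop above can also be written without explicit iteration. The following is a minimal sketch, not a replacement for the function above: the name `chi2_independence_vectorized` is hypothetical, and the sample counts are the education/employment table used in Example 2. It builds the expected frequencies as the outer product of the row and column totals divided by the grand total.

```python
import numpy as np
from scipy.stats import chi2


def chi2_independence_vectorized(table):
    """Hypothetical helper: same statistic as the loop version, computed with array operations."""
    observed = np.asarray(table, dtype=float)
    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    # expected[i, j] = row_total[i] * col_total[j] / grand_total
    expected = np.outer(row_totals, col_totals) / observed.sum()
    statistic = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    p_value = chi2.sf(statistic, dof)
    return statistic, p_value, dof, expected


stat_value, p_value, dof, expected = chi2_independence_vectorized(
    [[50, 30], [60, 20], [30, 10]]
)
print(stat_value, p_value, dof)
print(expected)
```

The result should agree with the loop-based function up to floating-point rounding.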








## EXAMPLE 1


Suppose you conduct a survey to see if there is an association between gender (male/female) and preference for a type of music (pop/rock). You collect the following data:<br />


| Gender/Music Type | Pop | Rock | Row Total |
|-------------------|-----|------|-----------|
| Male              | 30  |  20  |    50     |
| Female            | 25  |  25  |    50     |
| Column Total      | 55  |  45  |    100    |

---
Null Hypothesis:  $ H_0 $  : There is no significant association between gender and preference for music <br />
Alternate hypothesis:  $ H_a $ : There is a significant association between gender and preference for music <br />

α = 0.05

In [3]:
df = pd.DataFrame({'Pop':[30,25],'Rock':[20,25]},index=['Male','Female'])
df.head()

        Pop  Rock
Male     30    20
Female   25    25


In [4]:
observed = df.values
observed

array([[30, 20],
       [25, 25]])

In [5]:
dof = (df.shape[0]-1)*(df.shape[1]-1)
dof

1

In [6]:
alpha = 0.05

In [7]:
E00 = (df.iloc[0,:].sum()*df.iloc[:,0].sum())/df.sum().sum()
E01 = (df.iloc[0,:].sum()*df.iloc[:,1].sum())/df.sum().sum()
E10 = (df.iloc[1,:].sum()*df.iloc[:,0].sum())/df.sum().sum()
E11 = (df.iloc[1,:].sum()*df.iloc[:,1].sum())/df.sum().sum()
Contingency_Table=pd.DataFrame({'Pop':[E00,E10],'Rock':[E01,E11]},index=['Male','Female'])
Contingency_Table

         Pop  Rock
Male    27.5  22.5
Female  27.5  22.5


In [8]:
chi_square = 0
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        chi_square += (df.iloc[i,j] - Contingency_Table.iloc[i,j])**2/Contingency_Table.iloc[i,j]
chi_square

1.0101010101010102

In [9]:
chi2_critical = chi2.ppf(1-alpha, dof)
chi2_critical

3.841458820694124

In [10]:
if chi_square > chi2_critical:
    print("Reject the null hypothesis (there is a significant association).")
else:
    print("Fail to reject the null hypothesis (no significant association).")

Fail to reject the null hypothesis (no significant association).


In [11]:
p = 1 - chi2.cdf(chi_square, dof)
p

0.31487864133641974
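
The p-value and the critical value are two views of the same tail area. As a small sketch (variable names are illustrative), `chi2.sf` returns the upper-tail probability directly and `chi2.isf` returns the critical value for a given tail area. For very large statistics `1 - chi2.cdf(...)` can lose precision, which is the usual reason to prefer `sf`.

```python
from scipy.stats import chi2

chi_square = 1.0101010101010102  # statistic computed above for Example 1
dof = 1
alpha = 0.05

p_from_cdf = 1 - chi2.cdf(chi_square, dof)
p_from_sf = chi2.sf(chi_square, dof)      # survival function = 1 - cdf
critical_from_isf = chi2.isf(alpha, dof)  # same value as chi2.ppf(1 - alpha, dof)

print(p_from_cdf, p_from_sf, critical_from_isf)

# The two decision rules always agree: p < alpha exactly when the statistic exceeds the critical value
print((p_from_sf < alpha) == (chi_square > critical_from_isf))
```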

In [12]:
# Interpretation
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis (there is a significant association).")
else:
    print("Fail to reject the null hypothesis (no significant association).")

Fail to reject the null hypothesis (no significant association).
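
The manual result can be cross-checked against `scipy.stats.chi2_contingency`, which is imported at the top of the notebook. One caveat, shown in this sketch: for a 2x2 table SciPy applies Yates' continuity correction by default, so `correction=False` is needed to reproduce the uncorrected statistic computed by hand above.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 20],
                     [25, 25]])

# Uncorrected Pearson statistic, comparable to the manual calculation
stat_plain, p_plain, dof, expected = chi2_contingency(observed, correction=False)

# Default behaviour for 2x2 tables: Yates' continuity correction is applied
stat_yates, p_yates, _, _ = chi2_contingency(observed)

print(stat_plain, p_plain, dof)
print(expected)
print(stat_yates, p_yates)
```

The corrected statistic is smaller than the uncorrected one, so the conclusion for this example (fail to reject) does not change.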


## EXAMPLE 2

We want to determine if there is a significant association between education level and employment status. We survey 200 people and collect the following data:

---

| Education Level | Employed | Unemployed | Row Total |
|-----------------|----------|------------|-----------|
| High School     | 50       | 30         | 80        |
| Bachelor's      | 60       | 20         | 80        |
| Master's        | 30       | 10         | 40        |
| Column Total    | 140      | 60         | 200       |

---
Null Hypothesis  $ H_0 $ : There is no significant association between education level and employment status <br />
Alternate hypothesis $ H_a $ : There is a significant association between education level and employment status

α = 0.05

In [13]:
df = pd.DataFrame({'Employed':[50,60,30],'Unemployed':[30,20,10]},index=['High School','Bachelor','Master'])
print(df)
data = df.values
dof = (data.shape[0]-1)*(data.shape[1]-1)
data

             Employed  Unemployed
High School        50          30
Bachelor           60          20
Master             30          10


array([[50, 30],
       [60, 20],
       [30, 10]])

In [14]:
expected = np.zeros((data.shape[0], data.shape[1]))
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        expected[i, j] = ( data[i, :].sum() * data[:, j].sum() ) / data.sum()
print(expected)

chisquare = 0
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        chisquare += (data[i, j] - expected[i, j])**2/expected[i, j]
print(chisquare)

chi2_critical = chi2.ppf(1-0.05, dof)
chi2_critical

[[56. 24.]
 [56. 24.]
 [28. 12.]]
3.571428571428571


5.991464547107979

In [15]:
chi2_test_of_independence(data, significance=0.05)

Expected values:
       0     1
0  56.0  24.0
1  56.0  24.0
2  28.0  12.0
Fail to reject the null hypothesis (no significant association).


(3.571428571428571, 5.991464547107979, 0.16767724875179713)
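
The function prints the expected values with positional labels (0, 1, 2). As an illustrative sketch, the expected table can keep the original row and column names, and the per-cell contributions $(O - E)^2 / E$ show which cells drive the statistic. It uses `scipy.stats.contingency.expected_freq`; the variable names are assumptions made for this example.

```python
import pandas as pd
from scipy.stats.contingency import expected_freq

df = pd.DataFrame(
    {'Employed': [50, 60, 30], 'Unemployed': [30, 20, 10]},
    index=['High School', 'Bachelor', 'Master'],
)

# Expected counts under independence, labelled like the observed table
expected = pd.DataFrame(expected_freq(df.values), index=df.index, columns=df.columns)

# Contribution of each cell to the chi-square statistic
contributions = (df - expected) ** 2 / expected

print(expected)
print(contributions.round(3))
print(contributions.values.sum())
```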

## EXAMPLE 3

Imagine you are a marketing manager who wants to understand if the type of advertisement (TV, online, or print) influences whether people will buy your product. You collect data on the number of purchases and non-purchases associated with each type of advertisement.

---


| Advertisement Type | Bought Product | Didn't Buy Product | Row Total |
|--------------------|----------------|--------------------|-----------|
| TV                 | 45             | 25                 | 70        |
| Online             | 70             | 35                 | 105       |
| Print              | 25             | 15                 | 40        |
| Column Total       | 140            | 75                 | 215       |


---

Null Hypothesis  $ H_0 $ : The distribution of purchases and non-purchases is the same across different types of advertisements. <br />
Alternate hypothesis $ H_a $ : The distribution of purchases and non-purchases is different across different types of advertisements.

α = 0.05




In [16]:
df = pd.DataFrame({'Bought':[45,70,25],'Didn\'t Buy':[25,35,15]},index=['TV','Online','Print'])
df


        Bought  Didn't Buy
TV          45          25
Online      70          35
Print       25          15


In [17]:
data = df.values
data

array([[45, 25],
       [70, 35],
       [25, 15]])

In [18]:
chi2_test_of_independence(data, significance=0.05)

Expected values:
            0          1
0  45.581395  24.418605
1  68.372093  36.627907
2  26.046512  13.953488
Fail to reject the null hypothesis (no significant association).


(0.25290532879818595, 5.991464547107979, 0.881215861438609)
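
The very small statistic is easier to interpret after looking at the purchase share within each advertisement type, since the null hypothesis says these shares are the same. A quick sketch (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {'Bought': [45, 70, 25], "Didn't Buy": [25, 35, 15]},
    index=['TV', 'Online', 'Print'],
)

# Share of buyers and non-buyers within each advertisement type (each row sums to 1)
row_shares = df.div(df.sum(axis=1), axis=0)

# Pooled share across all advertisement types, i.e. what H0 predicts for every row
overall_share = df.sum(axis=0) / df.values.sum()

print(row_shares.round(3))
print(overall_share.round(3))
```

The three purchase shares sit close to the pooled share, which is consistent with failing to reject the null hypothesis.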

# CHI-SQUARE GOODNESS OF FIT TEST

## ALL REQUIRED FUNCTIONS

In [19]:
def chi2_goodness_of_fit_test(data, expected_freq, significance=0.05):
    """
    Perform a chi-square goodness-of-fit test to determine whether the observed
    frequencies match the expected frequencies, and print the decision.

    data : 2-D numpy array of (category, observed count) rows; counts are read from column 1
    expected_freq : 2-D numpy array of (category, expected count) rows, in the same order as data
    significance : float, the significance level for the test

    returns : chisquare, chi2_critical, p_value

    """
    # Number of categories (one row per category)
    n = data.shape[0]

    # Calculate the degree of freedom
    dof = n - 1

    # Calculate the chi-square statistic
    chisquare = 0
    for i in range(n):
        chisquare += (data[i, 1] - expected_freq[i, 1])**2/expected_freq[i, 1]

    # Calculate chi-square critical
    chi2_critical = chi2.ppf(1-significance, dof)

    # Calculate p-value
    p_value = 1 - chi2.cdf(chisquare, dof)

    # Print the results
    if chisquare > chi2_critical:
        print("Reject the null hypothesis (the observed frequencies do not match the expected frequencies).")
    else:
        print("Fail to reject the null hypothesis (the observed frequencies match the expected frequencies).")


    return chisquare, chi2_critical, p_value

## EXAMPLE 1

**Candy Color Distribution** <br />
Imagine you are a quality control manager at a candy factory. You want to check if the color distribution of the candies in a bag matches the company's stated proportions. The company claims the candies come in the following distribution:

| Color of Candy | Expected percentage |
|----------------|---------------------|
| Red            | 30%                 |
| Blue           | 20%                 |
| Green          | 20%                 |
| Yellow         | 20%                 |
| Orange         | 10%                 |

You take a random sample of 100 candies from a production batch and count the number of candies of each color.


| Color of Candy | Observed Quantity |
|----------------|-------------------|
| Red            |  35               |
| Blue           |  15               |
| Green          |  20               |
| Yellow         |  20               |
| Orange         |  10               |
| <b>Total :</b> |  <b>100</b>       |

---
Null Hypothesis( $ H_0 $ ) : The observed frequencies match the expected frequencies based on the company's stated proportions. <br />
Alternate hypothesis ($ H_a $) : The observed frequencies do not match the expected frequencies. <br />

α = 0.05
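
The expected counts used in this example follow from the stated proportions and the sample size: expected count = sample size × stated proportion. A small sketch of that arithmetic (variable names are illustrative):

```python
import numpy as np

colors = ['Red', 'Blue', 'Green', 'Yellow', 'Orange']
stated_proportions = np.array([0.30, 0.20, 0.20, 0.20, 0.10])
sample_size = 100

# The proportions must describe a full distribution
assert np.isclose(stated_proportions.sum(), 1.0)

expected_counts = sample_size * stated_proportions
for color, count in zip(colors, expected_counts):
    print(color, count)
```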



In [20]:
alpha = 0.05

In [21]:
df = pd.DataFrame({'Color of Candy':['Red','Blue','Green','Yellow','Orange'],'Observed Quantity':[35,15,20,20,10]})
df

   Color of Candy  Observed Quantity
0             Red                 35
1            Blue                 15
2           Green                 20
3          Yellow                 20
4          Orange                 10


In [22]:
data = df.values
data

array([['Red', 35],
       ['Blue', 15],
       ['Green', 20],
       ['Yellow', 20],
       ['Orange', 10]], dtype=object)

In [23]:
expectedQuantity = pd.DataFrame({'Color of Candy':['Red','Blue','Green','Yellow','Orange'],'Expected Quantity':[30,20,20,20,10]})
expectedQuantity

   Color of Candy  Expected Quantity
0             Red                 30
1            Blue                 20
2           Green                 20
3          Yellow                 20
4          Orange                 10


In [24]:
chi_square = 0
for i in range(df.shape[0]):
    chi_square += (df.iloc[i,1] - expectedQuantity.iloc[i,1])**2/expectedQuantity.iloc[i,1]
chi_square

2.0833333333333335

In [25]:
dof = df.shape[0]-1
dof

4

In [26]:
chi2_critical = chi2.ppf(1-alpha, dof)
chi2_critical

9.487729036781154

In [27]:
if chi_square > chi2_critical:
    print("Reject the null hypothesis (the observed frequencies do not match the expected frequencies).")
else:
    print("Fail to reject the null hypothesis (the observed frequencies match the expected frequencies).")

Fail to reject the null hypothesis (the observed frequencies match the expected frequencies).


In [28]:
chi2_goodness_of_fit_test(data, expectedQuantity.values, significance=0.05)

Fail to reject the null hypothesis (the observed frequencies match the expected frequencies).


(2.0833333333333335, 9.487729036781154, 0.7204349163118164)
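
As a cross-check, SciPy provides `scipy.stats.chisquare` for goodness-of-fit tests. This sketch passes the observed and expected counts as plain 1-D lists rather than the two-column arrays used by the function above; the observed and expected totals must agree, as they do here.

```python
from scipy.stats import chisquare

observed = [35, 15, 20, 20, 10]
expected = [30, 20, 20, 20, 10]

result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic, result.pvalue)
```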

## EXAMPLE 2

A company wants to know if the distribution of its sales across four regions (North, South, East, and West) is uniform. They collected the following sales data over the past year. If sales were uniform, each region would account for 500 / 4 = 125 sales:


| Region       | Observed Sales | Expected Sales |
|--------------|----------------|----------------|
| North        | 120            | 125            |
| South        | 150            | 125            |
| East         | 130            | 125            |
| West         | 100            | 125            |
| <b>Total</b> | <b>500</b>     | <b>500</b>     |

---
Null Hypothesis( $ H_0 $ ) : The sales distribution is uniform across the four regions. <br />
Alternate hypothesis ($ H_a $) : The sales distribution is not uniform across the four regions. <br />

α = 0.05



In [29]:
df = pd.DataFrame({'Region':['North','South','East','West'],'Observed Data':[120,150,130,100]})
df

   Region  Observed Data
0   North            120
1   South            150
2    East            130
3    West            100


In [30]:
expected = pd.DataFrame({'Region':['North','South','East','West'],'Expected Data':[125,125,125,125]})
expected

   Region  Expected Data
0   North            125
1   South            125
2    East            125
3    West            125


In [31]:
alpha = 0.05

In [32]:
chi2_goodness_of_fit_test(df.values, expected.values, alpha)

Reject the null hypothesis (the observed frequencies do not match the expected frequencies).


(10.4, 7.814727903251179, 0.015454827216857758)
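
Because the null hypothesis here is a uniform distribution, `scipy.stats.chisquare` can reproduce this result without an explicit expected array: when `f_exp` is omitted, it assumes all categories are equally likely. A short sketch (variable names are illustrative):

```python
from scipy.stats import chisquare

sales = [120, 150, 130, 100]  # North, South, East, West

# With f_exp omitted, the expected count for each category is the mean of the observed counts (125)
result = chisquare(sales)
print(result.statistic, result.pvalue)

# Per-region contributions (O - E)^2 / E show that South and West drive the statistic
expected_per_region = sum(sales) / len(sales)
region_contributions = [(s - expected_per_region) ** 2 / expected_per_region for s in sales]
print(region_contributions)
```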