In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency

In [4]:
df = pd.read_excel("2. Case 1 - Chi-square test of independence.xlsx")

In [5]:
# Print data head
df.head()

Unnamed: 0,Respondent,Uniqueness,Purchase Likelihood
0,1,Extremely unique,Extremely likely
1,2,Extremely unique,Extremely likely
2,3,Extremely unique,Extremely likely
3,4,Extremely unique,Extremely likely
4,5,Extremely unique,Extremely likely


In [6]:
# The Pandas crosstab function can be used to produce the contingency table from the raw data of responses
df_crosstab = pd.crosstab(df["Uniqueness"], df["Purchase Likelihood"])
df_crosstab

Uniqueness \ Purchase Likelihood,Extremely likely,Not at all likely,Not so likely,Somewhat likely,Very likely
Extremely unique,12,10,10,13,25
Not at all unique,5,11,5,4,8
Not so unique,7,9,10,8,16
Somewhat unique,15,16,15,16,28
Very unique,52,30,30,64,104
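
Note that `pd.crosstab` sorts the category labels alphabetically, so the table does not follow the natural order of the rating scales. The sketch below shows one way to reorder it with `reindex`. The table is rebuilt from the counts printed above so the example runs without the Excel file, and the two ordinal orderings are assumptions about how the scales are meant to be read. Reordering rows and columns does not change the chi-square statistic.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the crosstab output above (alphabetical label order)
uniqueness_alpha = ["Extremely unique", "Not at all unique", "Not so unique",
                    "Somewhat unique", "Very unique"]
likelihood_alpha = ["Extremely likely", "Not at all likely", "Not so likely",
                    "Somewhat likely", "Very likely"]
table = pd.DataFrame(
    [[12, 10, 10, 13, 25],
     [5, 11, 5, 4, 8],
     [7, 9, 10, 8, 16],
     [15, 16, 15, 16, 28],
     [52, 30, 30, 64, 104]],
    index=pd.Index(uniqueness_alpha, name="Uniqueness"),
    columns=pd.Index(likelihood_alpha, name="Purchase Likelihood"),
)

# Assumed ordinal order of each scale, from lowest to highest
uniqueness_order = ["Not at all unique", "Not so unique", "Somewhat unique",
                    "Very unique", "Extremely unique"]
likelihood_order = ["Not at all likely", "Not so likely", "Somewhat likely",
                    "Very likely", "Extremely likely"]

# Reorder rows and columns; the counts themselves are unchanged
ordered = table.reindex(index=uniqueness_order, columns=likelihood_order)
print(ordered)

# The test statistic is the same for both layouts
stat_alpha = chi2_contingency(table, correction=False)[0]
stat_ordered = chi2_contingency(ordered, correction=False)[0]
print(stat_alpha, stat_ordered)
```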


The **chi2_contingency()** function in the **scipy.stats** module tests the independence of categorical variables in a contingency table. It returns the chi-square test statistic, the p-value, the degrees of freedom, and the expected frequencies.

Key parameters of the **chi2_contingency()** function include: 
1.	 observed: This is a 2D array or table of observed frequencies known as a contingency table, with each row and column representing different variable categories.
2.	 correction: A boolean that defaults to True. If True and the table has one degree of freedom (a 2x2 table), Yates' continuity correction is applied. For larger tables, such as the 5x5 table used here, it has no effect.
3.	 lambda_: A string or float that selects the test statistic from the Cressie-Read power divergence family; the default, None, computes Pearson's chi-squared statistic. Accepted strings are:
<br>•	'pearson'
<br>•	'log-likelihood'
<br>•	'freeman-tukey'
<br>•	'mod-log-likelihood'
<br>•	'neyman'
<br>•	'cressie-read'

<br>You can also pass a float, which is used as the power in the Cressie-Read power divergence statistic.
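
As a minimal sketch of how `lambda_` changes the statistic, the example below runs the test three ways on the same table: the default Pearson statistic, the log-likelihood ratio (G-test), and a float power of 2/3, which corresponds to the 'cressie-read' string. The `observed` array is copied from the crosstab output above so the example runs without the Excel file.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the crosstab output above
observed = np.array([
    [12, 10, 10, 13, 25],
    [5, 11, 5, 4, 8],
    [7, 9, 10, 8, 16],
    [15, 16, 15, 16, 28],
    [52, 30, 30, 64, 104],
])

# Default (lambda_=None): Pearson's chi-squared statistic
stat_pearson, p_pearson, dof_pearson, _ = chi2_contingency(observed, correction=False)

# Log-likelihood ratio statistic (G-test)
stat_g, p_g, dof_g, _ = chi2_contingency(
    observed, correction=False, lambda_="log-likelihood"
)

# A float is used directly as the Cressie-Read power
stat_cr, p_cr, dof_cr, _ = chi2_contingency(observed, correction=False, lambda_=2 / 3)

print("Pearson:       ", stat_pearson, p_pearson)
print("Log-likelihood:", stat_g, p_g)
print("Cressie-Read:  ", stat_cr, p_cr)
```

The degrees of freedom are the same in all three cases; only the statistic and its p-value differ.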

For the following test, the null and alternative hypotheses can be formulated as:
- H0: 'Uniqueness' is not associated with 'Purchase Likelihood'
- H1: 'Uniqueness' is associated with 'Purchase Likelihood'
    
The significance level alpha is set to 0.05.

We're now ready to conduct the Chi-squared test of independence.

In [7]:
# Store the output of the test function into variables
stat, p, dof, expected = chi2_contingency(df_crosstab, correction=False)

In [8]:
# Print results
print("chi2: " + str(stat))
print("p: " + str(p))
print("df: " + str(dof))
print("expected: " + str(expected))

chi2: 21.388569942634078
p: 0.1640533250506747
df: 16
expected: [[12.17973231 10.17208413  9.36902486 14.05353728 24.22562141]
 [ 5.7418738   4.79541109  4.416826    6.62523901 11.4206501 ]
 [ 8.6998088   7.26577438  6.69216061 10.03824092 17.3040153 ]
 [15.65965583 13.07839388 12.0458891  18.06883365 31.14722753]
 [48.71892925 40.68833652 37.47609943 56.21414914 96.90248566]]


The *chi2_contingency* function returns the following values:

1.	 **chi2**: The chi-squared test statistic (21.39). It’s the sum, over all cells of the table, of the squared difference between the observed and expected frequencies divided by the expected frequency.

2.	 **p**: The test’s p-value (0.1641). The p-value is the probability of observing a test statistic at least as extreme as the one obtained, assuming the null hypothesis (that the variables are independent) is true.

3.	 **df**: The degrees of freedom for the test. For the chi-squared test, they are calculated as df = (r-1)(c-1), where r is the number of rows in the contingency table and c is the number of columns. Here, df = (5-1)(5-1) = 16.

4.	 **expected**: The frequencies anticipated under the null hypothesis of independence. Each cell's expected frequency is its row total multiplied by its column total, divided by the overall sample size.
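
To make the four return values concrete, the sketch below recomputes each of them by hand with NumPy and compares the results with the output of `chi2_contingency`. The `observed` array is copied from the crosstab output above; the `*_manual` names are illustrative only.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts copied from the crosstab output above
observed = np.array([
    [12, 10, 10, 13, 25],
    [5, 11, 5, 4, 8],
    [7, 9, 10, 8, 16],
    [15, 16, 15, 16, 28],
    [52, 30, 30, 64, 104],
])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (5, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 5)
n = observed.sum()                                # overall sample size

# Expected frequency per cell: row total * column total / n
expected_manual = row_totals @ col_totals / n

# Pearson statistic: sum of (observed - expected)^2 / expected over all cells
stat_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: upper tail of the chi-squared distribution
p_manual = chi2.sf(stat_manual, dof_manual)

# Compare with SciPy's results
stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(np.allclose(expected_manual, expected))
print(stat_manual, stat)
print(dof_manual, dof)
print(p_manual, p)
```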

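The p-value (0.1641) is greater than the pre-set alpha level of 0.05. Therefore, we fail to reject the null hypothesis: at the 0.05 significance level, the data do not provide sufficient evidence of an association between Uniqueness and Purchase Likelihood. This is not proof that the variables are independent.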
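
The same decision can be reached by comparing the test statistic with the critical value of the chi-squared distribution. The sketch below is one way to do that, with the statistic and degrees of freedom copied from the printed output above; `critical_value` and `reject_null` are illustrative names.

```python
from scipy.stats import chi2

alpha = 0.05
stat = 21.388569942634078  # chi2 value from the printed output above
dof = 16                   # degrees of freedom from the printed output above

# Critical value: the (1 - alpha) quantile of the chi-squared distribution
critical_value = chi2.ppf(1 - alpha, dof)

# Upper-tail probability of the observed statistic (the p-value)
p_value = chi2.sf(stat, dof)

# Reject H0 only if the statistic exceeds the critical value
reject_null = stat > critical_value

print("critical value:", critical_value)
print("p-value:", p_value)
print("reject H0:", reject_null)
```

Both rules always agree: the statistic exceeds the critical value exactly when the p-value is below alpha.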