
# Hypothesis Testing - Chi Square Test

# Contents

The focus of this notebook is the Chi Square Test.

The notebook will cover the following:

1.   Description
2.   Manual Calculation
3.   Practical Examples using SciPy Library
4.   Assumptions

# 1. Description

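A Chi Square test is used to compare two or more groups of categorical data, that is, to check whether the distribution of outcomes across categories differs between the groups.

It is often used in A/B testing, where each user's outcome falls into a category such as clicked or did not click.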

Example uses:



*   An A/B test where half of the users were shown a blue submit button and the other half a purple submit button. Which group was more likely to click on the button?  
*   People from Europe and people from South America were both given a survey asking “Which of the following three foods is your favorite?” Was there a significant difference between the answers from Europe and those from South America?  



# 2. Manual Calculation 

The manual calculation uses the same problem as the SciPy practical example in section 3.

### Problem Definition

The management of an e-commerce company want to try to improve conversion on their landing page through to a product page. They want to test a new design for the landing page.  

An A/B test is set up where users are shown either the original landing page or the new landing page. Using the data below, determine whether there is a significant difference in click-through between the two pages.


***Original Landing Page***

Number of people clicked through to product page: **30**  
Number of people who did not click through to product page: **10**  

***New Landing Page***

Number of people clicked through to product page: **28**  
Number of people who did not click through to product page: **12** 

### Step 1 - Define Null and Alternative Hypothesis

$H_0$: The rate of conversion through to the product page is the same for the original landing page and the new landing page.

$H_A$: The rate of conversion through to the product page differs between the original landing page and the new landing page.

### Step 2 - Prepare Observed Values

Performing this hypothesis test by hand uses the chi-squared formula:

$X^2 = \sum \frac{(O-E)^2}{E}$  

In words: for each cell of the table, subtract the expected value $E$ from the observed value $O$, square the difference and divide it by the expected value, then sum the results over all cells.
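
As a rough illustration of the formula, the sketch below wraps it in a small helper. The function name `chi_square_statistic` is an assumption made for this example and is not part of SciPy; the counts are the observed values from the problem definition and the expected values that Step 3 derives.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over paired observed and expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Cell order: original clicked, original not clicked, new clicked, new not clicked
observed = [30, 10, 28, 12]
expected = [29, 11, 29, 11]

statistic = chi_square_statistic(observed, expected)
print('Chi-square statistic: {0:.3f}'.format(statistic))
```

The steps that follow perform the same calculation one cell at a time.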

In [0]:
# Number of people who clicked and didn't for original landing page
original_clicked = 30
original_not_clicked = 10
original_total = 40

# Number of people who clicked and didn't for new landing page
new_clicked = 28
new_not_clicked = 12
new_total = 40

The observed values are the values that are provided in the above cell for both the original and the new landing page.

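### Step 3 - Calculate the Expected Values for Original and New Landing Page

Under the null hypothesis the click-through rate is the same for both pages, so each expected value is the overall proportion for that outcome (clicked or not clicked) multiplied by the number of users shown that page.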

In [2]:
total = original_total + new_total
print('Total: {0:.3f}'.format(total))

Total: 80.000


In [3]:
clicked_total = original_clicked + new_clicked
print('Clicked Total: {0:.3f}'.format(clicked_total))

Clicked Total: 58.000


In [4]:
not_clicked_total = original_not_clicked + new_not_clicked
print('Not Clicked Total: {0:.3f}'.format(not_clicked_total))

Not Clicked Total: 22.000


In [5]:
expected_original_clicked = (clicked_total/total) * original_total
expected_original_not_clicked = (not_clicked_total/total) * original_total
print('Expected Original Clicked: {0:.3f}'.format(expected_original_clicked))
print('Expected Original Not Clicked: {0:.3f}'.format(expected_original_not_clicked))

Expected Original Clicked: 29.000
Expected Original Not Clicked: 11.000


In [6]:
expected_new_clicked = (clicked_total/total) * new_total
expected_new_not_clicked = (not_clicked_total/total) * new_total
print('Expected New Clicked: {0:.3f}'.format(expected_new_clicked))
print('Expected New Not Clicked: {0:.3f}'.format(expected_new_not_clicked))

Expected New Clicked: 29.000
Expected New Not Clicked: 11.000
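
The expected values can also be computed for a table of any shape as row total × column total / grand total. The following is a minimal sketch in plain Python; the variable names are illustrative assumptions, and the table layout (one row per landing page, columns for clicked and not clicked) is the one used later in the SciPy example.

```python
# Rows: original page, new page. Columns: clicked, not clicked.
observed = [[30, 10],
            [28, 12]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell = row total * column total / grand total
expected = [[r * c / grand_total for c in column_totals] for r in row_totals]

for row in expected:
    print(row)
```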


### Step 4 - Apply Formula

In [7]:

formula_original_clicked = ((original_clicked - expected_original_clicked)**2)/expected_original_clicked
print('Formula Original Clicked: {0:.4f}'.format(formula_original_clicked))

formula_original_not_clicked = ((original_not_clicked - expected_original_not_clicked)**2)/expected_original_not_clicked
print('Formula Original Not Clicked: {0:.4f}'.format(formula_original_not_clicked))

formula_new_clicked = ((new_clicked - expected_new_clicked)**2)/expected_new_clicked
print('Formula New Clicked: {0:.4f}'.format(formula_new_clicked))

formula_new_not_clicked = ((new_not_clicked - expected_new_not_clicked)**2)/expected_new_not_clicked
print('Formula New Not Clicked: {0:.4f}'.format(formula_new_not_clicked))

Formula Original Clicked: 0.0345
Formula Original Not Clicked: 0.0909
Formula New Clicked: 0.0345
Formula New Not Clicked: 0.0909


In [8]:
formula_total = formula_original_clicked + formula_original_not_clicked + formula_new_clicked + formula_new_not_clicked
print('formula total: {0:.3f}'.format(formula_total))

formula total: 0.251


### Step 5 - Find the critical value from the Chi-Square Distribution Table

Using the following table: https://people.smp.uq.edu.au/YoniNazarathy/stat_models_B_course_spring_07/distributions/chisqtab.pdf

We use a significance level of 0.05 and 1 degree of freedom. The degrees of freedom for a chi square test are $(r-1) \times (c-1)$, where $r$ is the number of rows and $c$ is the number of columns; here $(2-1) \times (2-1) = 1$.

The critical value is: 3.84
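
Instead of reading the value from a printed table, the critical value can be computed. The sketch below assumes SciPy is installed and uses `scipy.stats.chi2.ppf`, the inverse of the cumulative distribution function; the variable names are illustrative.

```python
from scipy.stats import chi2

significance_level = 0.05
rows, columns = 2, 2

# Degrees of freedom for a contingency table: (r - 1) * (c - 1)
degrees_of_freedom = (rows - 1) * (columns - 1)

# The critical value leaves `significance_level` of the distribution in the upper tail
computed_critical_value = chi2.ppf(1 - significance_level, degrees_of_freedom)
print('Critical value: {0:.2f}'.format(computed_critical_value))
```

This should agree with the table value of 3.84 to two decimal places.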

### Step 6 - Results Analysis

In [9]:
critical_value = 3.84

if formula_total > critical_value:
  print('Result is statistically significant!')
else:
  print('Result is not statistically significant')

Result is not statistically significant


The test statistic (0.251) is well below the critical value (3.84), so we fail to reject the null hypothesis: the data do not provide enough evidence of a difference between the two landing pages.
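
An equivalent way to reach the same decision is to convert the test statistic into a p-value and compare it with the significance level. The sketch below assumes SciPy is installed and uses the survival function `scipy.stats.chi2.sf`, the upper-tail probability; the statistic is rebuilt from the four terms calculated in Step 4.

```python
from scipy.stats import chi2

# Sum of the four (O - E)^2 / E terms from Step 4: each difference is 1,
# with expected values of 29 (clicked) and 11 (not clicked) for both pages
manual_statistic = 2 * (1 / 29) + 2 * (1 / 11)
degrees_of_freedom = 1

# Probability of a statistic at least this large if the null hypothesis is true
manual_p_value = chi2.sf(manual_statistic, degrees_of_freedom)
print('p-value: {0:.3f}'.format(manual_p_value))

if manual_p_value < 0.05:
    print('Result is statistically significant!')
else:
    print('Result is not statistically significant')
```

The p-value comes out at roughly 0.62, far above 0.05, which matches the conclusion from the critical value.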

# 3. Practical Examples using SciPy Library

The management of an e-commerce company want to try to improve conversion on their landing page through to a product page. They want to test a new design for the landing page.  

An A/B test is set up where users are shown either the original landing page or the new landing page. Using the data below, determine whether there is a significant difference in click-through between the two pages.


***Original Landing Page***

Number of people clicked through to product page: **30**  
Number of people who did not click through to product page: **10**  

***New Landing Page***

Number of people clicked through to product page: **28**  
Number of people who did not click through to product page: **12** 

### Step 1 - Define Null and Alternative Hypothesis

$H_0$: The rate of conversion through to the product page is the same for the original landing page and the new landing page.

$H_A$: The rate of conversion through to the product page differs between the original landing page and the new landing page.

### Step 2 - Prepare Data and Run Test

The test is run at a significance level of 0.05.

In [0]:
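# chi2_contingency is the SciPy function that runs the chi-squared test
# on a contingency table
from scipy.stats import chi2_contingency

# Contingency table: one row per landing page (original, then new).
# The first column is the number that clicked through to the product page,
# the second column is the number that did not.
X = [[30, 10],
     [28, 12]]

# Note: for a 2x2 table chi2_contingency applies Yates' continuity correction
# by default, so chi2 here is smaller than the 0.251 calculated by hand
chi2, p_val, dof, expected = chi2_contingency(X)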

### Step 3 - Results Analysis

In [11]:
print('p_val: {0:.3f}'.format(p_val))

p_val: 0.802


In [12]:
if p_val < 0.05:
  print('Result is statistically significant!')
else:
  print('Result is not statistically significant')

Result is not statistically significant


Because the p-value (0.802) is greater than the significance level (0.05), we fail to reject the null hypothesis and cannot say that there is a significant difference between the two pages.
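
The p-value of 0.802 does not correspond to the statistic of 0.251 from the manual calculation. This is because `chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom. The sketch below, using the same table, runs the test both ways. Passing `correction=False` should reproduce the manual statistic; the variable names are illustrative.

```python
from scipy.stats import chi2_contingency

X = [[30, 10],
     [28, 12]]

# Default behaviour: Yates' continuity correction is applied for a 2x2 table
chi2_corrected, p_corrected, _, _ = chi2_contingency(X)

# Without the correction, which matches the formula used in the manual calculation
chi2_uncorrected, p_uncorrected, _, _ = chi2_contingency(X, correction=False)

print('With correction:    chi2 = {0:.3f}, p = {1:.3f}'.format(chi2_corrected, p_corrected))
print('Without correction: chi2 = {0:.3f}, p = {1:.3f}'.format(chi2_uncorrected, p_uncorrected))
```

Yates' correction moves each observed value 0.5 towards its expected value before the formula is applied, which makes the test more conservative. Both versions lead to the same conclusion for this data.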