# What Hypothesis Test Should I Use for My A/B Experiment? (Quick-Reference)

The choice of hypothesis test after running an A/B test can be overwhelming for any data practitioner. The popular Python libraries, [scipy](https://docs.scipy.org/doc/scipy/reference/stats.html#statistical-tests) and [statsmodels](https://www.statsmodels.org/stable/stats.html), have implemented hundreds of hypothesis tests for all types of use cases. 

This article is written to be your quick reference guide to the most common types of hypothesis tests you'll need to analyse your A/B(/C) experiments. I use these tests regularly in my role as a Data Scientist at [Movember](http://movember.com/). 

We will cover parametric and non-parametric tests for continuous and categorical variables in two or more samples. Each test will include a brief description, the assumptions, how the hypotheses are formulated and sample Python code. You are encouraged to read the documentation when you are implementing the test as each method has several different input parameters. 

If you are looking for information about [why you should run A/B experiments](https://medium.com/@rtkilian/how-a-b-testing-helps-microsoft-and-why-you-should-consider-it-too-c975f2922ffe) and [how you should set up an A/B test](https://medium.com/towards-data-science/a-quick-reference-checklist-for-a-b-testing-40f533cfb523), check out my other articles. Otherwise, if you are after an in-depth explanation of the tests, read [here](https://towardsdatascience.com/a-b-testing-a-complete-guide-to-statistical-testing-e3f1db140499), [here](https://towardsdatascience.com/the-ultimate-guide-to-a-b-testing-part-3-parametric-tests-2c629e8d98f8) and [here](https://towardsdatascience.com/the-ultimate-guide-to-a-b-testing-part-4-non-parametric-tests-4db7b4b6a974).

## Preparation

In [1]:
%conda install pandas
%conda install numpy
%conda install scipy
%conda install -c conda-forge statsmodels


In [2]:
import pandas as pd

import numpy as np
from numpy.random import default_rng
rng = default_rng(42)

## Hypothesis Tests

### Student's t-test
The t-test is a parametric test used to determine whether there is a significant difference between the means of two continuous samples. 

**Assumptions**
* Observations in each sample are independent
* Observations in each sample are approximately normally distributed
* Observations in each sample have the same variance

**Hypotheses**
* H0: the means of the two samples are equal
* H1: the means of the two samples are not equal

**Resources**
* [Student's t-test, Wikipedia](https://en.wikipedia.org/wiki/Student%27s_t-test)
* [scipy.stats.ttest_ind](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html)
* [Welch's t-test for samples with unequal variance, Wikipedia](https://en.wikipedia.org/wiki/Welch%27s_t-test)

In [3]:
from scipy.stats import ttest_ind

# Randomly generate data
x1 = rng.normal(loc=0.25, scale=1, size=100)
x2 = rng.normal(loc=0.00, scale=1, size=100)

# Calculate test statistic and p-value
stat, p = ttest_ind(x1, x2)

# Interpretation
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Do not reject the null hypothesis and conclude the means of the samples are the same.')
else:
    print('Reject the null hypothesis and conclude the means of the samples are not the same.')

stat=1.683, p=0.094
Do not reject the null hypothesis and conclude the means of the samples are the same.
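
If the equal-variance assumption looks doubtful, Welch's t-test (linked in the resources above) is a common alternative. The block below is a minimal sketch, assuming `ttest_ind` is called with `equal_var=False`; the seed, sample sizes and spreads are illustrative and not taken from a real experiment.

```python
import numpy as np
from scipy.stats import ttest_ind

# Illustrative data only: the second sample has a larger spread and size
rng_welch = np.random.default_rng(0)
a = rng_welch.normal(loc=0.25, scale=1.0, size=100)
b = rng_welch.normal(loc=0.00, scale=2.0, size=150)

# equal_var=False switches ttest_ind from Student's to Welch's t-test
stat_welch, p_welch = ttest_ind(a, b, equal_var=False)

# Default behaviour (equal_var=True) is Student's t-test, shown for comparison
stat_student, p_student = ttest_ind(a, b)

print('Welch:   stat=%.3f, p=%.3f' % (stat_welch, p_welch))
print('Student: stat=%.3f, p=%.3f' % (stat_student, p_student))
```

When the two samples differ in both size and variance, the two variants can give noticeably different p-values, so it is worth deciding which one to use before looking at the results.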


### Mann-Whitney U test
The Mann-Whitney U test is a non-parametric test to determine whether the distributions of two continuous samples are the same. The Mann-Whitney U test is the non-parametric version of the t-test for independent samples.

**Assumptions**
* Observations in each sample are independent
* Observations in each sample are continuous or ordinal and can be ranked
* The distribution of each sample is approximately the same shape

**Hypotheses**
* H0: the distributions of each sample are the same
* H1: the distributions of each sample are not the same

**Resources**
* [Mann-Whitney U test, Wikipedia](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test)
* [scipy.stats.mannwhitneyu](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html)

In [4]:
from scipy.stats import mannwhitneyu

# Randomly generate the data
x1 = rng.normal(loc=0.25, scale=1, size=100)
x2 = rng.normal(loc=0.00, scale=1, size=100)

# Calculate test statistic and p-value
stat, p = mannwhitneyu(x1, x2)

# Interpretation
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Do not reject the null hypothesis and conclude the distributions of the samples are the same.')
else:
    print('Reject the null hypothesis and conclude the distributions of the samples are not the same.')

stat=4999.000, p=0.999
Do not reject the null hypothesis and conclude the distributions of the samples are the same.
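
One of the input parameters worth knowing about is `alternative`, which controls whether the test is two-sided or one-sided. Below is a small sketch, assuming you want to check specifically whether the treatment tends to produce larger values than the control; the names `treatment` and `control` and the seed are illustrative.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Illustrative data only
rng_mwu = np.random.default_rng(1)
treatment = rng_mwu.normal(loc=0.25, scale=1, size=100)
control = rng_mwu.normal(loc=0.00, scale=1, size=100)

# Two-sided: do the distributions differ in either direction?
stat_two, p_two = mannwhitneyu(treatment, control, alternative='two-sided')

# One-sided: does treatment tend to produce larger values than control?
stat_gt, p_gt = mannwhitneyu(treatment, control, alternative='greater')

print('two-sided: stat=%.3f, p=%.3f' % (stat_two, p_two))
print('greater:   stat=%.3f, p=%.3f' % (stat_gt, p_gt))
```

Passing `alternative` explicitly makes the hypothesis being tested visible in the code rather than relying on the library default.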


### Paired Student's t-test
The paired t-test is a parametric test used to determine whether there is a significant difference between the means of two paired continuous samples. 

**Assumptions**
* Observations in each sample are independent
* Differences between the paired observations are approximately normally distributed
* Observations across each sample are paired

**Hypotheses**
* H0: the means of the two paired samples are equal
* H1: the means of the two paired samples are not equal

**Resources**
* [Student's t-test for paired samples, Wikipedia](https://en.wikipedia.org/wiki/Student%27s_t-test#Dependent_t-test_for_paired_samples)
* [scipy.stats.ttest_rel](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html)

In [5]:
from scipy.stats import ttest_rel

# Randomly generate the data
x1 = rng.normal(loc=0.00, scale=1, size=100)
x2 = x1 + rng.normal(loc=0.25, scale=1, size=100)

# Calculate test statistic and p-value
stat, p = ttest_rel(x1, x2)

# Interpretation
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Do not reject the null hypothesis and conclude the means of the paired samples are the same.')
else:
    print('Reject the null hypothesis and conclude the means of the paired samples are not the same.')

stat=-1.521, p=0.131
Do not reject the null hypothesis and conclude the means of the paired samples are the same.
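
It can help to remember that the paired t-test is the same as a one-sample t-test on the pairwise differences against a mean of zero. The sketch below illustrates this with `ttest_1samp`; the variable names and seed are illustrative.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Illustrative paired data: 'after' is 'before' plus a noisy shift
rng_pair = np.random.default_rng(2)
before = rng_pair.normal(loc=0.00, scale=1, size=100)
after = before + rng_pair.normal(loc=0.25, scale=1, size=100)

# Paired t-test on the two samples
stat_rel, p_rel = ttest_rel(before, after)

# One-sample t-test on the pairwise differences against a mean of 0
diff = before - after
stat_one, p_one = ttest_1samp(diff, popmean=0)

print('ttest_rel:   stat=%.3f, p=%.3f' % (stat_rel, p_rel))
print('ttest_1samp: stat=%.3f, p=%.3f' % (stat_one, p_one))
```

Both calls should print the same statistic and p-value, which is why the normality assumption applies to the differences rather than to each sample separately.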


### Wilcoxon signed-rank test
The Wilcoxon signed-rank test is a non-parametric test to determine whether the distributions of two paired continuous samples are the same. It is the non-parametric version of the paired t-test.

**Assumptions**
* Observations in each sample are independent
* Observations in each sample can be ranked
* Observations across each sample are paired

**Hypotheses**
* H0: the distributions of the paired samples are the same
* H1: the distributions of the paired samples are not the same

**Resources**
* [Wilcoxon signed-rank test, Wikipedia](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test)
* [scipy.stats.wilcoxon](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html)

In [6]:
# Example of the Wilcoxon signed-rank test
from scipy.stats import wilcoxon

# Randomly generate the data
x1 = rng.normal(loc=0.00, scale=1, size=100)
x2 = x1 + rng.normal(loc=0.25, scale=1, size=100)

# Calculate test statistic and p-value
stat, p = wilcoxon(x1, x2)

# Interpretation
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Do not reject the null hypothesis and conclude the distributions of the paired samples are the same.')
else:
    print('Reject the null hypothesis and conclude the distributions of the paired samples are not the same.')

stat=1688.000, p=0.004
Reject the null hypothesis and conclude the distributions of the paired samples are not the same.


### Analysis of Variance Test (ANOVA)
The one-way ANOVA test is a parametric test used to determine whether there is a significant difference between the means of two or more continuous samples. 

**Assumptions**
* Observations in each sample are independent
* Observations in each sample are approximately normally distributed
* Observations in each sample have the same variance

**Hypotheses**
* H0: the means of the two or more samples are equal
* H1: one or more of the means of the samples are not equal

**Resources**
* [Analysis of variance (ANOVA), Wikipedia](https://en.wikipedia.org/wiki/Analysis_of_variance)
* [scipy.stats.f_oneway](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html)

In [7]:
# Example of the Analysis of Variance Test
from scipy.stats import f_oneway

# Randomly generate the data
x1 = rng.normal(loc=0.25, scale=1, size=100)
x2 = rng.normal(loc=0.00, scale=1, size=100)
x3 = rng.normal(loc=0.00, scale=1, size=100)

# Calculate test statistic and p-value
stat, p = f_oneway(x1, x2, x3)

# Interpretation
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Do not reject the null hypothesis and conclude the means of the samples are the same.')
else:
    print('Reject the null hypothesis and conclude that one or more of the means of the samples are not the same.')

stat=0.075, p=0.927
Do not reject the null hypothesis and conclude the means of the samples are the same.
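
The normality and equal-variance assumptions listed for the parametric tests can themselves be checked with hypothesis tests. The block below is a rough sketch, assuming the Shapiro-Wilk test (`scipy.stats.shapiro`) for normality and Levene's test (`scipy.stats.levene`) for equal variances; these are one common choice, not the only one, and the data and seed are illustrative.

```python
import numpy as np
from scipy.stats import shapiro, levene

# Illustrative data: three groups with the same spread
rng_check = np.random.default_rng(3)
groups = [rng_check.normal(loc=m, scale=1, size=100) for m in (0.25, 0.0, 0.0)]

# Shapiro-Wilk: H0 is that the sample was drawn from a normal distribution
for i, g in enumerate(groups, start=1):
    w_stat, w_p = shapiro(g)
    print('group %d: Shapiro-Wilk stat=%.3f, p=%.3f' % (i, w_stat, w_p))

# Levene: H0 is that all groups have equal variances
lev_stat, lev_p = levene(*groups)
print('Levene stat=%.3f, p=%.3f' % (lev_stat, lev_p))
```

A small p-value from either check suggests the corresponding assumption may not hold, in which case the non-parametric alternative is worth considering.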


### Kruskal-Wallis H-test
The Kruskal-Wallis H-test is a non-parametric test to determine whether there is a significant difference between the medians of two or more continuous samples. It is the non-parametric equivalent of the one-way ANOVA test.

**Assumptions**
* Observations in each sample are independent
* Observations in each sample have the same variance

**Hypotheses**
* H0: the medians of the two or more samples are equal
* H1: one or more of the medians of the samples are not equal

**Resources**
* [Kruskal-Wallis one-way analysis of variance, Wikipedia](https://en.wikipedia.org/wiki/Kruskal%E2%80%93Wallis_one-way_analysis_of_variance)
* [scipy.stats.kruskal](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kruskal.html)

In [8]:
from scipy.stats import kruskal

# Randomly generate the data
x1 = rng.normal(loc=0.25, scale=1, size=100)
x2 = rng.normal(loc=0.00, scale=1, size=100)
x3 = rng.normal(loc=0.00, scale=1, size=100)

# Calculate test statistic and p-value
stat, p = kruskal(x1, x2, x3)

# Interpretation
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Do not reject the null hypothesis and conclude the medians of the samples are the same.')
else:
    print('Reject the null hypothesis and conclude that one or more of the medians of the samples are not the same.')

stat=3.304, p=0.192
Do not reject the null hypothesis and conclude the medians of the samples are the same.


### Chi-squared Test
The Chi-squared test is used to test the independence of two or more categorical variables in a contingency table.

**Assumptions**
* Observations in the sample are independent
* The observed and expected frequencies in each cell in the contingency table are at least 5

**Hypotheses**
* H0: the variables are independent
* H1: the variables are not independent

**Resources**
* [Chi-squared test, Wikipedia](https://en.wikipedia.org/wiki/Chi-squared_test)
* [scipy.stats.chi2_contingency](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html)
* [pandas.crosstab](https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html)

In [9]:
from scipy.stats import chi2_contingency

# Example contingency table
table = [[100, 80, 70],[150,  20,  80]]

# Calculate test statistic and p-value
stat, p, dof, expected = chi2_contingency(table)

# Interpretation
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Do not reject the null hypothesis and conclude the variables are independent.')
else:
    print('Reject the null hypothesis and conclude that the variables are dependent.')

stat=46.667, p=0.000
Reject the null hypothesis and conclude that the variables are dependent.
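
In practice you will usually start from one row per user rather than a ready-made table. The sketch below shows how `pandas.crosstab` (linked in the resources above) can build the contingency table; the column names `variant` and `plan` and their labels are hypothetical, and the counts are chosen to reproduce the example table.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical per-user records; column names and labels are illustrative
df = pd.DataFrame({
    'variant': ['A'] * 250 + ['B'] * 250,
    'plan': (['free'] * 100 + ['basic'] * 80 + ['pro'] * 70
             + ['free'] * 150 + ['basic'] * 20 + ['pro'] * 80),
})

# Rows are variants, columns are plans (sorted alphabetically by crosstab)
table_df = pd.crosstab(df['variant'], df['plan'])
print(table_df)

# chi2_contingency accepts the crosstab directly
stat_ct, p_ct, dof_ct, expected_ct = chi2_contingency(table_df)
print('stat=%.3f, p=%.3f, dof=%d' % (stat_ct, p_ct, dof_ct))

# Check the minimum expected frequency against the rule of thumb of 5
print('min expected frequency: %.1f' % expected_ct.min())
```

The column order differs from the hand-written table, but the Chi-squared statistic does not depend on the order of rows or columns.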


### Fisher's exact test
Fisher's exact test is used to test the independence of two categorical variables in a contingency table. It is used instead of a Chi-squared test when the sample sizes are small. 

**Assumptions**
* Observations in the sample are independent

**Hypotheses**
* H0: the variables are independent
* H1: the variables are not independent

**Resources**
* [Fisher's test, Wikipedia](https://en.wikipedia.org/wiki/Fisher%27s_exact_test)
* [scipy.stats.fisher_exact](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html)

In [10]:
from scipy.stats import fisher_exact

# Example contingency table
table = [[100, 80],[150,  20]]

# Calculate test statistic and p-value
stat, p = fisher_exact(table)

# Interpretation
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Do not reject the null hypothesis and conclude the variables are independent.')
else:
    print('Reject the null hypothesis and conclude that the variables are dependent.')

stat=0.167, p=0.000
Reject the null hypothesis and conclude that the variables are dependent.
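
To decide between the two tests, one option is to look at the expected frequencies first. The block below is a minimal sketch, assuming `scipy.stats.contingency.expected_freq` for the expected counts and the usual rule of thumb of 5; the small table is made up for illustration. Note that the first value returned by `fisher_exact` is the sample odds ratio of the table.

```python
import numpy as np
from scipy.stats import fisher_exact
from scipy.stats.contingency import expected_freq

# Made-up small 2x2 table for illustration
small_table = np.array([[8, 2], [1, 5]])

# Expected frequencies under independence
expected_small = expected_freq(small_table)
print(expected_small)

# Rule of thumb: prefer Fisher's exact test if any expected count is below 5
use_fisher = bool((expected_small < 5).any())
print('Any expected frequency below 5:', use_fisher)

# First return value is the sample odds ratio: (8 * 5) / (2 * 1)
odds_ratio, p_small = fisher_exact(small_table)
print('odds ratio=%.3f, p=%.3f' % (odds_ratio, p_small))
```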


### Poisson E-test
The Poisson exact test (E-test) is used to test whether there is a significant difference between two Poisson rates. 

**Assumptions**
* Observations in the sample are independent

**Hypotheses**
* H0: the Poisson rates are the same
* H1: the Poisson rates are not the same

**Resources**
* [Gu, Ng, Tang, Schucany 2008: Testing the Ratio of Two Poisson Rates, Biometrical Journal 50 (2008) 2, 2008](https://onlinelibrary.wiley.com/doi/10.1002/bimj.200710403)
* [statsmodels.stats.rates.test_poisson_2indep](https://www.statsmodels.org/dev/generated/statsmodels.stats.rates.test_poisson_2indep.html)

In [11]:
from statsmodels.stats.rates import test_poisson_2indep

# Example inputs taken from Gu, Ng, Tang, Schucany 2008: Testing the Ratio of Two Poisson Rates
count1 = 60
exposure1 = 51477.5
count2 = 30
exposure2 = 54308.7

# Calculate test statistic and p-value
stat, p = test_poisson_2indep(count1, exposure1, count2, exposure2, method='etest-wald')

# Interpretation
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Do not reject the null hypothesis and conclude the Poisson rates are the same.')
else:
    print('Reject the null hypothesis and conclude that the Poisson rates are not the same.')

stat=3.385, p=0.001
Reject the null hypothesis and conclude that the Poisson rates are not the same.
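
The p-value tells you whether the rates differ, but not by how much. As a quick sanity check, the rates and their ratio can be computed directly from the same inputs; this is plain arithmetic rather than part of the test itself.

```python
# Same example inputs as above
count1, exposure1 = 60, 51477.5
count2, exposure2 = 30, 54308.7

# A Poisson rate is the number of events per unit of exposure
rate1 = count1 / exposure1
rate2 = count2 / exposure2

# Ratio of the two rates (1.0 would mean identical rates)
rate_ratio = rate1 / rate2

print('rate1=%.6f, rate2=%.6f, ratio=%.3f' % (rate1, rate2, rate_ratio))
```

Here the first group's rate is roughly twice that of the second, which is consistent with the small p-value reported by the test.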


## Conclusion
This article reviewed the hypothesis tests you will most likely use when analysing your A/B experiments. We covered what situations the tests are most suitable for, what assumptions need to be met, how to interpret the results, and provided the code and resources you will need to implement the tests.

Thank you to [Machine Learning Mastery](https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/) for providing the inspiration for this article. 

Have I missed anything? Let me know, and I will update the list.

Liked what you read? Follow me on [Medium](https://medium.com/@rtkilian). Otherwise, [tweet me](https://twitter.com/rtkilian) or add me on [LinkedIn](https://www.linkedin.com/in/rtkilian/).