# The Ultimate Guide To AB Testing

### AB testing helps us understand change 

If we have a product or service that works in a certain way, we often want to change it in order to improve it. When we make a change, we want to be able to say that its impact is real and in the right direction. AB testing lets us measure this.

In an AB test we split users into groups: the control group keeps the existing experience or product, and each of the other groups receives a variation of it (often there is only one variant). We then measure key metrics for each group and compare the results. AB testing can be used in many fields but is often used in product development.

To make this more concrete for the rest of the article, imagine we are running a website and want to improve the proportion of users who make a purchase. We want to run an AB test to see whether changing the styling of a key product page increases the number of users converting. Users are randomly split into two groups: the control group sees the old website and the variant group sees the modified styling. We run the test for a period of time and observe how many users from each group go on to purchase. We then calculate the statistics to determine whether the conversion rate has improved with the new style.
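
As a rough illustration of the random split, the sketch below assigns each user to a group by hashing their id, which is one way to get a stable, effectively random allocation. The helper name `assign_group`, the experiment name and the 50/50 split are all assumptions made for this example, not part of any particular tool.

```python
import hashlib


def assign_group(user_id: str, experiment: str = "product-page-styling") -> str:
    """Hypothetical helper: deterministically place a user in 'control' or 'variant'."""
    # Hash the experiment name together with the user id so that the same
    # user always lands in the same group for this experiment.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # a bucket between 0 and 99
    return "control" if bucket < 50 else "variant"


# Assign 10,000 made-up users and check the split is close to 50/50
groups = [assign_group(f"user-{i}") for i in range(10_000)]
control_share = groups.count("control") / len(groups)
print(f"share in control: {control_share:.1%}")
```

Hashing rather than drawing a fresh random number means a returning user keeps seeing the same version of the page.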

You can use AB testing for other metrics such as average spend but the analysis is a little different.

When calculating the statistics for an AB test there are two main schools of thought - `Frequentist` and `Bayesian`.

### Some notation

Before we get into the detail let's define the number of users entering the test as $n$ and the number of users who convert as $s$ (I chose s for success).

We can then split these users between the two variants. We use $n_C$ and $n_V$ to represent the number of users in the control and variant respectively. We use $s_C$ and $s_V$ to represent the number of users who convert in the control and variant respectively. Note $n = n_C + n_V$ and $s = s_C + s_V$.

Now we can write the conversion rate $c$ as $c = \frac{s}{n}$ and for control $c_C = \frac{s_C}{n_C}$ and for the variant $c_V = \frac{s_V}{n_V}$.
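
To see the notation in action, here is a minimal sketch using illustrative counts (the same counts the code later in this article uses). The variable names simply mirror the symbols above.

```python
# Illustrative counts only
n_C, s_C = 100, 21  # control: users, conversions
n_V, s_V = 100, 25  # variant: users, conversions

# Totals across both groups
n = n_C + n_V
s = s_C + s_V

# Conversion rates: overall, control and variant
c = s / n
c_C = s_C / n_C
c_V = s_V / n_V

print(f"overall: {c:.1%}, control: {c_C:.1%}, variant: {c_V:.1%}")
# prints: overall: 23.0%, control: 21.0%, variant: 25.0%
```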

### Frequentist Vs Bayesian thinking

`Frequentist` and `Bayesian` approaches to testing differ in how they calculate test statistics but also in how they think about the problem. It is worth describing these two different ways of thinking about testing before going into the calculations.

#### Frequentist

`Frequentists` view the `true` theoretical conversion rate as a fixed unique value - for instance in our example it could be 10%. This underlying value is also called the `population conversion rate` but it can't be directly observed as we can't realistically have everyone in the world try out our website. Instead we observe the `sample conversion rate` from the sample we do observe. This may be slightly different from the underlying conversion rate due to the randomness in the sample.

#### Bayesian

In a `Bayesian` setting instead of considering the `population conversion rate` as a single unobservable value it is considered as a distribution. For instance it could be a distribution centered around 10% as shown below:

In [114]:
bayesian_example_distribution()

[Figure: Example Conversion Rate Distribution; x-axis: conversion rate, y-axis: probability density]
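
The plotting helper at the end of this article draws a Beta distribution with `a=10` and `b=100`. As a small sketch of what treating the conversion rate as a distribution means in practice, we can ask that same distribution for its mean and for a range that holds 95% of the probability. The variable names here are chosen for illustration.

```python
from scipy.stats import beta

# The same distribution the example plot uses
belief = beta(a=10, b=100)

mean = belief.mean()               # a / (a + b)
low, high = belief.interval(0.95)  # central range holding 95% of the probability
prob_above_10 = belief.sf(0.10)    # probability that the rate exceeds 10%

print(f"mean: {mean:.1%}")
print(f"95% interval: {low:.1%} to {high:.1%}")
print(f"P(rate > 10%): {prob_above_10:.1%}")
```

The mean is $10/110 \approx 9.1\%$, so the distribution is centered a little below 10%.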

### Calculating Test Significance - the idea

Now that we understand how the frequentist and Bayesian frameworks view conversion rates, we can explain how the test statistics are calculated.

#### Frequentist

In a frequentist setting we initially assume that both the control and variant groups have the same underlying conversion rate (we assume that they are sampled from the same population). This assumption or hypothesis is known as the `null hypothesis`, denoted $H_0$.

The `alternative hypothesis`, $H_1$, is just the opposite of the null hypothesis: that the underlying conversion rates are different.

As we have initially assumed the underlying conversion rates are the same, we expect the conversion rates of the control sample and variant sample to be close to each other and to the underlying true value.

For example, if the underlying conversion rate was 10% (remember, in practice we don't know this) then under the null hypothesis we would expect roughly 10% of users to convert in both the control and the variant.

When each user has a fixed probability $p$ of converting (the underlying conversion rate), the distribution of the conversion rate of a sample of $n$ users approaches a normal distribution as we increase $n$. This is due to the central limit theorem.
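
A quick simulation can make this tangible. The sketch below assumes an underlying conversion rate of 10% and repeats an imaginary experiment many times for a few sample sizes. It then compares the spread of the simulated sample conversion rates with the spread a normal approximation predicts, $\sqrt{p(1-p)/n}$. The seed, sample sizes and number of repeats are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
p = 0.10  # assumed underlying conversion rate
n_experiments = 20_000

for n in (10, 100, 1_000):
    # Each draw is the number of conversions among n users;
    # dividing by n turns it into a sample conversion rate.
    sample_rates = rng.binomial(n=n, p=p, size=n_experiments) / n
    predicted_sd = np.sqrt(p * (1 - p) / n)
    # Share of experiments within 1.96 predicted standard deviations of p;
    # a normal distribution would put about 95% of them there.
    within = np.mean(np.abs(sample_rates - p) <= 1.96 * predicted_sd)
    print(f"n={n:>5}: mean={sample_rates.mean():.4f}, "
          f"sd={sample_rates.std():.4f}, predicted sd={predicted_sd:.4f}, "
          f"within 1.96 sd={within:.1%}")
```

The sample conversion rates scatter around 10%, and the scatter narrows as $n$ grows.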

If the variant sample has a conversion rate significantly different from the control, we reject the null hypothesis.

To decide whether the difference is significant we work out the $p$ value. The $p$ value is the probability, assuming the null hypothesis is true, of seeing a difference between the variant and control conversion rates at least as large as the one we observed. If this $p$ value is smaller than the preset significance level (also known as $\alpha$, often 5%) then we reject the null hypothesis.

### A detailed look at the $p$ value

We have said above that when we look at the conversion rate for a particular group (either variant or control) we are sampling from an underlying distribution. We know that the mean of a sample tends towards a normal distribution with a certain mean and variance.

The mean can be estimated by the mean of the sample (i.e. the observed conversion rate).

The standard deviation of the sampling distribution of the mean (also known as the standard error) is equal to the population standard deviation divided by the square root of the size of the sample.

We can estimate the population standard deviation using the sample standard deviation.

So we assume that the control sample conversion rate $\mu(C,n_C)\sim\mathcal{N}(\mu_C,\sigma_C^2)$ where 

$$\sigma_C^2 = \frac{\text{SD}_C^2}{n_C}$$

where $\text{SD}_C$ is the standard deviation of the control sample.

Similarly, we assume the variant sample conversion rate $\mu(V,n_V)\sim\mathcal{N}(\mu_V,\sigma_V^2)$ where

$$\sigma_V^2 = \frac{\text{SD}_V^2}{n_V}$$

where $\text{SD}_V$ is the standard deviation of the variant sample
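
Because each user either converts (1) or does not (0), the sample standard deviation has a simple closed form: $\text{SD}^2 = c(1-c)$, where $c$ is the sample conversion rate. This is why the code later in this article uses `rate * (1 - rate)` as the variance. The sketch below checks this on an illustrative control sample of 21 conversions out of 100 users.

```python
import numpy as np

# Illustrative control sample: 21 conversions out of 100 users
n_C, s_C = 100, 21
outcomes = np.array([1] * s_C + [0] * (n_C - s_C))

c_C = outcomes.mean()                 # sample conversion rate
sd_C = outcomes.std()                 # SD with ddof=0, i.e. dividing by n
shortcut = np.sqrt(c_C * (1 - c_C))   # closed form for 0/1 outcomes
standard_error = sd_C / np.sqrt(n_C)  # sigma_C in the notation above

print(f"SD from data: {sd_C:.4f}, sqrt(c(1-c)): {shortcut:.4f}")
print(f"standard error: {standard_error:.4f}")
```

With `ddof=1` (dividing by $n-1$) the two values differ by a factor of $\sqrt{n/(n-1)}$, which is negligible for samples of this size.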

We are interested in $\mu(C,n_C) - \mu(V,n_V)$, the difference between the two conversion rates. As we have assumed they are normally distributed, and the two samples are independent of each other, the difference is also normally distributed, with the variances adding:

$$
\mu(C,n_C) - \mu(V,n_V) \sim\mathcal{N}(\mu_C-\mu_V,\sigma_C^2+\sigma_V^2) = \mathcal{N}(\mu_C-\mu_V,\frac{\text{SD}_C^2}{n_C}+\frac{\text{SD}_V^2}{n_V})
$$

Under $H_0$ we assume $\mu_C=\mu_V$ hence

$$
\mu(C,n_C) - \mu(V,n_V) \sim\mathcal{N}(0,\frac{\text{SD}_C^2}{n_C}+\frac{\text{SD}_V^2}{n_V})
$$
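
As a sanity check on this result, the sketch below simulates many experiments in which control and variant really do share the same underlying conversion rate, then compares the spread of the simulated differences with the formula. The shared rate of 20%, the group sizes and the seed are assumptions for the example, and the true $p(1-p)$ stands in for $\text{SD}_C^2$ and $\text{SD}_V^2$.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
p = 0.2          # assumed shared underlying conversion rate under H0
n_C = n_V = 100  # users per group
n_experiments = 50_000

# Simulate the sample conversion rate of each group in every experiment
control_rates = rng.binomial(n_C, p, size=n_experiments) / n_C
variant_rates = rng.binomial(n_V, p, size=n_experiments) / n_V
differences = control_rates - variant_rates

# Standard deviation predicted by adding the two variances
predicted_sd = np.sqrt(p * (1 - p) / n_C + p * (1 - p) / n_V)

print(f"mean difference: {differences.mean():.4f}")
print(f"simulated sd: {differences.std():.4f}, predicted sd: {predicted_sd:.4f}")
```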


Specifically, we want to know the probability, under $H_0$, of a difference at least as extreme as the one we observed, $c_C - c_V$. That is,

$$
P(|\mu(C,n_C) - \mu(V,n_V)| \geq |c_C - c_V|) = 2 \cdot P\left(Z > \frac{|c_C - c_V|}{\sqrt{\frac{\text{SD}_C^2}{n_C}+\frac{\text{SD}_V^2}{n_V}}}\right)
$$

where $Z$ is a standard normal random variable.

In [115]:
true_conversion_rate = 0.2

control_users = 100
control_user_conversions = 21

variant_users = 100
variant_user_conversions = 25

# Observed (sample) conversion rates
control_conversion_rate = control_user_conversions / control_users
variant_conversion_rate = variant_user_conversions / variant_users

# Each user either converts or not, so the sample variance is c * (1 - c)
control_variance = control_conversion_rate * (1 - control_conversion_rate)
variant_variance = variant_conversion_rate * (1 - variant_conversion_rate)

# Variance of the difference between the two sample conversion rates
diffence_variance = (control_variance / control_users) + (variant_variance / variant_users)
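
Continuing with the same counts, here is a minimal sketch of the two-sided $p$ value calculation described above. It is self-contained, so it repeats the counts, and uses `scipy.stats.norm.sf` for the upper tail probability of a standard normal. The 5% significance level is the conventional choice mentioned earlier.

```python
import numpy as np
from scipy.stats import norm

n_C, s_C = 100, 21  # control: users, conversions
n_V, s_V = 100, 25  # variant: users, conversions

c_C = s_C / n_C
c_V = s_V / n_V

# Variance of the difference, using c * (1 - c) as each sample variance
variance_of_difference = c_C * (1 - c_C) / n_C + c_V * (1 - c_V) / n_V

# Standardise the observed difference and take both tails
z = abs(c_C - c_V) / np.sqrt(variance_of_difference)
p_value = 2 * norm.sf(z)

alpha = 0.05
print(f"z = {z:.3f}, p value = {p_value:.3f}, reject H0: {p_value < alpha}")
```

With these counts the $p$ value comes out at roughly 0.5, far above 5%, so a difference of 21% versus 25% with 100 users per group is not enough to reject the null hypothesis.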

In [117]:
frequentist_normal_difference(alpha=0.1, two_tale=False)

[Figure: normal distribution with the rejection region shaded; x-axis: variant conversion rate, y-axis: probability density]

#### Bayesian

### Calculating Test Significance - the detail

#### Frequentist

#### Bayesian

### What can possibly go wrong ... ?

#### Frequentist

#### Bayesian

### Plotting

In [5]:
import numpy as np
import plotly.graph_objs as go
from scipy.stats import distributions
from scipy.stats import norm
from ipywidgets import widgets

In [2]:
def bayesian_example_distribution():
    figure = go.FigureWidget()
    
    # Beta values to plot
    n = 200
    step = 1 / n
    x = np.arange(0, 1, step)
    beta = distributions.beta(a=10, b=100)
    y = beta.pdf(x)

    figure.add_scatter(x=x, y=y, fill='tozeroy',opacity=0.5)
    
    # labels
    figure.layout.title = 'Example Conversion Rate Distribution'
    figure.layout.xaxis.title = 'Conversion rate'
    figure.layout.yaxis.title = 'Probability density'
    figure.layout.xaxis.tickformat = '%'
    figure.layout.yaxis.showticklabels = False

    return figure

In [112]:
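def frequentist_normal_difference(alpha=0.05, two_tale=True):
    """Plot the normal distribution used for the test and shade the rejection region(s).

    Relies on the notebook globals control_conversion_rate and diffence_variance.
    """
    # Centre the distribution on the observed control rate and use the standard
    # deviation of the difference between the two sample rates as its spread.
    difference_dist = norm(loc=control_conversion_rate, scale=diffence_variance ** 0.5)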

    start_prob = 0.0001
    end_prob = 1 - start_prob
    start, end  = difference_dist.ppf([start_prob, end_prob])

    n_points=500
    step = (end-start)/n_points
    plot_conversion_rates = np.arange(start, end, step )
    plot_conversion_rates_probs = difference_dist.pdf(plot_conversion_rates)

    fig = go.FigureWidget()
    fig.add_scatter(x=plot_conversion_rates, y=plot_conversion_rates_probs, name="Distribution")
    fig.layout.xaxis.title = 'Variant conversion rate'
    fig.layout.yaxis.title = 'Probability density'

    if two_tale:
        alpha = alpha / 2
    plot_alpha_right_x = np.arange(difference_dist.ppf(1 - alpha), end, step)
    plot_alpha_right_y = difference_dist.pdf(plot_alpha_right_x)
    plot_alpha_left_x = np.arange(start, difference_dist.ppf(alpha), step)
    plot_alpha_left_y = difference_dist.pdf(plot_alpha_left_x)
    right = fig.add_scatter(x=plot_alpha_right_x, y=plot_alpha_right_y, fill='tozeroy', name=f"{alpha:.2%} prob")
    left = fig.add_scatter(x=plot_alpha_left_x, y=plot_alpha_left_y, fill='tozeroy', name=f"{alpha:.2%} prob",
                           visible=two_tale)

    return fig