# A/B Test

- A controlled experiment, usually in the context of a website

- You test the performance of some change to your website (the variant) and measure conversion relative to your unchanged site (the control)

<img src="https://upload.wikimedia.org/wikipedia/commons/2/2e/A-B_testing_example.png" alt="Drawing" style="width: 500px;"/>

"As the name suggests, two versions are compared that are identical except for one variation that may affect the user's behavior."

Examples of things to test: design changes, UI flow, algorithmic changes, pricing changes...

How do you measure the change?

- Ideally, choose the metric you are trying to influence: order amounts, profit, ad clicks, order quantity... *(talk/align with business people)*

- Attributing actions downstream from your change can be hard, especially if you're running more than one experiment

Common mistake (**variance is your enemy**):

- Run a test for some small period of time that results in a few purchases to analyze

- You take the mean order amount from A and B, and declare victory or defeat

- But there's so much random variation in order amounts to begin with that your result was just based on chance
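To make the variance problem concrete, here is a minimal simulation sketch. It assumes order amounts are normally distributed with mean 25 and standard deviation 5 (illustrative values, matching the synthetic data used later in these notes), and draws both groups from the same distribution, so any observed difference is pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def mean_difference(n_orders):
    """Difference in mean order amount between two groups drawn
    from the SAME distribution (so the true difference is zero)."""
    a = rng.normal(25.0, 5.0, n_orders)
    b = rng.normal(25.0, 5.0, n_orders)
    return a.mean() - b.mean()

# Repeat the "experiment" many times at two sample sizes
small = np.array([mean_difference(20) for _ in range(500)])
large = np.array([mean_difference(2000) for _ in range(500)])

print("spread of A-B difference, 20 orders per group:  ", small.std())
print("spread of A-B difference, 2000 orders per group:", large.std())
```

With only a few purchases per group, the difference in means swings widely from run to run even though nothing changed, which is why a quick look at two means is not enough to declare victory or defeat.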

Tests to evaluate/measure results:

- **Z-tests** are appropriate for comparing means under stringent conditions regarding normality and a known standard deviation. 

- **Student's t-tests** are appropriate for comparing means under relaxed conditions when less is assumed. 

- **Welch's t-test** assumes the least and is therefore the most commonly used test in a two-sample hypothesis test where the mean of a metric is to be optimized. While the mean of the variable to be optimized is the most common choice of estimator, others are regularly used.

- To compare two binomial distributions, such as click-through rates, one would use **Fisher's exact test**.
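As a rough illustration of the last two tests in the list above, the sketch below runs Welch's t-test and Fisher's exact test with SciPy. All data and counts are made up for the example: the order amounts are synthetic and the click table is hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Welch's t-test: compare mean order amounts without assuming equal variances
orders_a = rng.normal(25.0, 5.0, 5000)  # hypothetical variant order amounts
orders_b = rng.normal(25.0, 8.0, 5000)  # hypothetical control, with a different spread
welch = stats.ttest_ind(orders_a, orders_b, equal_var=False)  # equal_var=False selects Welch's test
print("Welch t-test p-value:", welch.pvalue)

# Fisher's exact test: compare click-through rates from a 2x2 contingency table
#         clicked  not clicked
table = [[120, 880],   # variant (hypothetical counts)
         [100, 900]]   # control (hypothetical counts)
odds_ratio, fisher_p = stats.fisher_exact(table)
print("Fisher exact odds ratio:", odds_ratio)
print("Fisher exact p-value:", fisher_p)
```

Note that `stats.ttest_ind` defaults to `equal_var=True`, which is the Student's t-test; Welch's variant has to be requested explicitly.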

## How long do I run an experiment?

Stop when any of the following is true:

- You have achieved significance (positive or negative)

- You no longer observe meaningful trends in your p-value. In other words, you don't see any indication that your experiment will 'converge' on a result over time

- You reach some pre-established upper bound on time
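The three stopping conditions above could be combined into a simple check like the one below. This is only a sketch: the function name `stop_reason`, the thresholds, and especially the "no trend" rule (the p-value has barely moved over a recent window) are assumptions made for illustration, not a standard procedure.

```python
def stop_reason(p_values, alpha=0.05, max_days=30, window=7, min_change=0.01):
    """Decide whether to stop an experiment.

    p_values: one p-value per day the test has run, oldest first.
    alpha, max_days, window and min_change are illustrative assumptions.
    Returns a short reason string, or None to keep running.
    """
    latest = p_values[-1]
    if latest < alpha:
        return "significant"          # condition 1: significance achieved
    if len(p_values) >= max_days:
        return "time limit"           # condition 3: pre-established upper bound
    if len(p_values) >= window:
        recent = p_values[-window:]
        if max(recent) - min(recent) < min_change:
            return "no trend"         # condition 2: p-value is no longer moving
    return None

print(stop_reason([0.50, 0.20, 0.01]))
print(stop_reason([0.90, 0.60, 0.30]))
```

The first call stops because the latest p-value is below `alpha`; the second keeps running because the p-value is still moving and the time limit has not been reached.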

## Novelty Effect

"The novelty effect is an effect of introducing new elements on some activity or behavior."

Changes to a website will catch the attention of returning users who are accustomed to the old version:

- They might click on something simply because it is new, but this attention won't last forever

It is a good idea to re-run experiments much later and validate their impact. Often the 'old' website will outperform the 'new' one after a while, simply because it is a change.

## Seasonal effects

## Selection Bias

"Selection bias is the bias introduced by the selection of individuals, groups, or data for analysis in such a way that proper randomization is not achieved, thereby failing to ensure that the sample obtained is representative of the population intended to be analyzed"

- Run an A/A test periodically to check
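A small simulation can show what a healthy A/A test looks like. The sketch below assumes synthetic, normally distributed order amounts and a 0.05 significance threshold (both assumptions for illustration). Since both groups come from the same distribution, roughly 5% of runs should still come out "significant" by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_tests = 2000
false_positives = 0
for _ in range(n_tests):
    # Both groups see the same "site", so there is no real difference
    a1 = rng.normal(25.0, 5.0, 1000)
    a2 = rng.normal(25.0, 5.0, 1000)
    if stats.ttest_ind(a1, a2).pvalue < 0.05:
        false_positives += 1

rate = false_positives / n_tests
print("share of A/A tests flagged as significant:", rate)
```

If A/A tests on real traffic are flagged far more often than the chosen threshold suggests, that is a hint that users are not being assigned to groups randomly.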

## Data Pollution

- Are robots affecting your experiment? (crawlers, for example)

- Are outliers skewing the result?
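To see how a handful of outliers can skew a mean, consider the sketch below. The data are synthetic, the three extreme orders are invented for the example, and capping at the 99th percentile is just one possible way to limit their influence, not a recommendation from these notes.

```python
import numpy as np

rng = np.random.default_rng(7)

orders = rng.normal(25.0, 5.0, 1000)
orders[:3] = [5000.0, 7500.0, 10000.0]  # a few hypothetical extreme orders

# Cap every order at the 99th percentile so extreme values cannot dominate the mean
cap = np.percentile(orders, 99)
capped = np.minimum(orders, cap)

print("mean with outliers:  ", orders.mean())
print("mean after capping:  ", capped.mean())
print("median (for contrast):", np.median(orders))
```

Three orders out of a thousand are enough to pull the raw mean far away from the typical order amount, while the capped mean and the median stay close to it.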

## Attribution Errors

Often there are errors in how conversions are attributed to an experiment.

In [1]:
import numpy as np
from scipy import stats

In [3]:
A = np.random.normal(25.0,5.0,10000) # treatment group
B = np.random.normal(26.0,5.0,10000) # control group
stats.ttest_ind(A,B)

TtestResult(statistic=-13.220987829816442, pvalue=9.7472602578049e-40, df=19998.0)

Student's t-test: the null hypothesis is that **the means of the two populations are equal**.

As p-value = 9.7472602578049e-40, we reject the null hypothesis.

In [4]:
B = np.random.normal(25.0,5.0,10000) # control group
stats.ttest_ind(A,B)

TtestResult(statistic=1.0259277964371356, pvalue=0.30493802842786183, df=19998.0)

As p-value = 0.30493802842786183, we DON'T reject the null hypothesis.

Changing sample size:

In [5]:
A = np.random.normal(25.0,5.0,100000) # treatment group
B = np.random.normal(25.0,5.0,100000) # control group
stats.ttest_ind(A,B)

TtestResult(statistic=-0.6432036752631257, pvalue=0.5200926862804233, df=199998.0)

As p-value = 0.5200926862804233, we DON'T reject the null hypothesis.

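Notice that the p-value differs between the two sample sizes (0.30 versus 0.52). Because both groups are drawn from the same distribution here, the p-value varies randomly from run to run, and neither result is significant.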

In [9]:
# sanity test
print(stats.ttest_ind(A,A))
print(stats.ttest_ind(B,B))

TtestResult(statistic=0.0, pvalue=1.0, df=1999998.0)
TtestResult(statistic=0.0, pvalue=1.0, df=1999998.0)
