# A/B Testing

In [None]:
import numpy as np
import pandas as pd
from scipy.special import comb
from scipy import stats

## Agenda

Students will be able to (SWBAT):

- Describe what an A/B Test is:
    - when it is used;
    - what forms it may take;
- Conduct an A/B Test in Python using `scipy`.


A/B Testing is hypothesis testing applied to a business problem, so it can take [many forms](https://en.wikipedia.org/wiki/A/B_testing).

The classic form of A/B Testing is exposing customers to two different versions of a website (the A and B versions) and then conducting a hypothesis test to see if their behavior is significantly different between the two versions.

Even limited to this context, there are [many](https://neilpatel.com/blog/19-obvious-ab-tests/) tests one might run!

We'll try a couple examples of A/B Testing here.

To start we'll try two examples with fake data:

## Example 1a: Canadian Winter Temperatures

In [None]:
a1 = pd.read_csv('fake_data/a1.csv', usecols=['target'])
b1 = pd.read_csv('fake_data/b1.csv', usecols=['target'])

In [None]:
a1.head()

In [None]:
b1.head()

The data here represent average winter temperatures at various locations in Canada, and the question is whether the temperatures in Group A are lower than temperatures in Group B.

**Null hypothesis**: The temperatures in Group A are not lower than the temperatures in Group B.

**Alternative hypothesis**: The temperatures in Group A are lower than the temperatures in Group B.

### By Hand
First let's try this by hand, assuming a $t$-distribution with equal variances in the two groups, and setting an $\alpha$ threshold of 0.02:

We'll start by calculating the pooled variance, a weighted average of the two sample variances:
- $\Large s^2_P=\frac{(n_1-1)s^2_1 + (n_2-1)s^2_2}{n_1+n_2-2}$.

In [None]:
# The two weighted sample variances are added, not subtracted
var_pooled = ((len(a1)-1) * a1.var() + (len(b1)-1) * b1.var()) / (len(a1) + len(b1) - 2)
var_pooled
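
As a quick sanity check on the pooled-variance calculation, here is a minimal sketch on two tiny made-up samples (the arrays `x` and `y` are hypothetical and are not the notebook's data). The pooled value should always land between the two individual sample variances.

```python
import numpy as np

# Hypothetical toy samples, chosen so the arithmetic is easy to follow.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

n1, n2 = len(x), len(y)

# Sample variances with the n-1 denominator (ddof=1),
# which is also the default for pandas' .var().
s1_sq = x.var(ddof=1)   # 5/3
s2_sq = y.var(ddof=1)   # 10

# Pooled variance: each sample variance weighted by its degrees of freedom.
pooled_var = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
print(pooled_var)  # (3 * 5/3 + 4 * 10) / 7 = 45/7, about 6.43
```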

Now we calculate the $t$-statistic, dividing by the pooled standard deviation $s_P = \sqrt{s^2_P}$ rather than the pooled variance:
- $\Large t=\frac{\bar{x}_1 - \bar{x}_2}{s_P\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

In [None]:
# Use the pooled standard deviation (square root of the pooled variance)
t = (a1.mean() - b1.mean()) / (np.sqrt(var_pooled) * np.sqrt(1/len(a1) + 1/len(b1)))
t
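
To check a hand-computed $t$-statistic, one option is to compare it with the statistic that `scipy` reports for a pooled-variance test. The sketch below does this on synthetic data (the arrays `group_a` and `group_b` and their parameters are made up for illustration), and also reads off a lower-tail $p$-value from the $t$-distribution's CDF.

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic "temperatures"; not the notebook's a1/b1 data.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=-12.0, scale=3.0, size=40)
group_b = rng.normal(loc=-10.0, scale=3.0, size=50)

n1, n2 = len(group_a), len(group_b)
dof = n1 + n2 - 2

# Pooled variance, then the t-statistic with the pooled standard deviation.
pooled_var = ((n1 - 1) * group_a.var(ddof=1)
              + (n2 - 1) * group_b.var(ddof=1)) / dof
t_manual = (group_a.mean() - group_b.mean()) / (
    np.sqrt(pooled_var) * np.sqrt(1 / n1 + 1 / n2)
)

# Lower-tail p-value for the alternative "mean of A is lower than mean of B".
p_manual = stats.t.cdf(t_manual, df=dof)

# scipy's pooled-variance test should report the same statistic.
t_scipy = stats.ttest_ind(group_a, group_b, equal_var=True).statistic
print(t_manual, t_scipy, p_manual)
```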

And the critical $t$-stat for a lower-tailed test at $\alpha = 0.02$ (we reject the null hypothesis if $t$ falls below it) is:

In [None]:
stats.t.ppf(0.02, df=len(a1)+len(b1)-2)
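
The choice of quantile depends on whether the test is one-sided or two-sided. The sketch below contrasts the two for $\alpha = 0.02$; the degrees of freedom and the helper `reject_lower_tail` are hypothetical, chosen only for illustration.

```python
from scipy import stats

alpha = 0.02
dof = 88  # hypothetical: n1 + n2 - 2 with n1 = 40 and n2 = 50

# Lower-tailed test: the critical value cuts off probability alpha on the left,
# so it is negative.
t_crit_lower = stats.t.ppf(alpha, df=dof)

# Two-sided test at the same alpha: alpha/2 in each tail.
t_crit_two_sided = stats.t.ppf(1 - alpha / 2, df=dof)


def reject_lower_tail(t_stat, t_crit=t_crit_lower):
    """Return True when t_stat falls in the lower-tail rejection region."""
    return t_stat < t_crit


print(t_crit_lower, t_crit_two_sided)
```

Note that `stats.t.ppf(0.99, ...)` is the two-sided critical value for $\alpha = 0.02$, which is stricter than the one-sided threshold.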

### In One Line!

In [None]:
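# Pooled-variance, one-sided test, to match the by-hand calculation
# (the `alternative` argument requires SciPy 1.6 or later)
stats.ttest_ind(a1, b1, equal_var=True, alternative='less').pvalue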
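
Two arguments of `stats.ttest_ind` change what the one-liner computes: `equal_var` switches between the pooled-variance test and Welch's test, and `alternative` (available in SciPy 1.6 and later) switches between two-sided and one-sided $p$-values. A minimal sketch on synthetic data, with made-up sample parameters:

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic samples with unequal spreads and sizes.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=-12.0, scale=2.0, size=30)
group_b = rng.normal(loc=-10.0, scale=4.0, size=60)

# Pooled-variance test, one-sided: H1 is "mean of A is less than mean of B".
pooled = stats.ttest_ind(group_a, group_b, equal_var=True, alternative='less')

# Welch's test drops the equal-variance assumption.
welch = stats.ttest_ind(group_a, group_b, equal_var=False, alternative='less')

# Default two-sided version of the pooled test, for comparison.
two_sided = stats.ttest_ind(group_a, group_b, equal_var=True)

print(pooled.pvalue, welch.pvalue, two_sided.pvalue)
```

When the observed $t$-statistic is negative, the one-sided `'less'` $p$-value is half of the two-sided one.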

## Example 1b: Russian Winter Temperatures

In [None]:
a2 = pd.read_csv('fake_data/a2.csv', usecols=['target'])
b2 = pd.read_csv('fake_data/b2.csv', usecols=['target'])

In [None]:
a2.head()

In [None]:
b2.head()

The data here represent average winter temperatures at various locations in Russia, and the question is whether the temperatures in Group A are lower than temperatures in Group B.

**Null hypothesis**: The temperatures in Group A are not lower than the temperatures in Group B.

**Alternative hypothesis**: The temperatures in Group A are lower than the temperatures in Group B.

In [None]:
# One-sided Welch test (unequal variances); `alternative` requires SciPy 1.6+
stats.ttest_ind(a2, b2, equal_var=False, alternative='less')

Wow! Looks like there's a real difference here.

## Example 2: Online Sales

Now let's try a binomial A/B Test (where the variable of interest is binary). We can use [Fisher's exact test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test).

### Question

We have data about whether customers completed sales transactions, segregated by the type of ad banners to which the customers were exposed.

The question we want to answer is whether there was any difference in sales "conversions" between desktop customers who saw the sneakers banner and desktop customers who saw the accessories banner in the month of May 2019.

### Getting the Data

First let's download the data from [kaggle](https://www.kaggle.com/podsyp/how-to-do-product-analytics).

In [None]:
#!unzip /Users/gdamico/Downloads/product.csv.zip

In [None]:
#!mkdir data

In [None]:
#!mv /Users/gdamico/Downloads/product.csv data

Let's go ahead and amend the `.gitignore` file now so that we don't accidentally add the data to our next commit.

In [None]:
#!(echo ; echo "# data"; echo "product.csv") >> .gitignore

In [None]:
df = pd.read_csv('data/product.csv')

In [None]:
df.head()

### EDA

Let's look at the different banner types:

In [None]:
df['product'].value_counts()

In [None]:
df.groupby('product')['target'].value_counts()

Let's look at the range of time-stamps on these data:

In [None]:
df['time'].min()

In [None]:
df['time'].max()

Let's check the counts of the different site_version values:

In [None]:
df['site_version'].value_counts()

### Experimental Setup

We need to filter by `site_version`, `time`, and `product`, keeping only timestamps that fall within May 2019:

In [None]:
df_AB = df[(df['site_version'] == 'desktop') &
           (df['time'] >= '2019-05-01') &
           (df['time'] < '2019-06-01') &
           ((df['product'] == 'accessories') | (df['product'] == 'sneakers'))].reset_index()
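
To see how a filter like this behaves at the edges of the month, here is a sketch on a tiny made-up frame that reuses the notebook's column names (the rows themselves are hypothetical). It parses `time` into datetimes and bounds the month on both sides, which is one way to guard against rows from neighboring months slipping in.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the notebook's data.
toy = pd.DataFrame({
    'time': ['2019-04-30 23:59:59', '2019-05-01 00:00:00', '2019-05-15 12:30:00',
             '2019-05-31 23:59:59', '2019-06-01 00:00:00', '2019-05-10 08:00:00'],
    'site_version': ['desktop', 'desktop', 'mobile',
                     'desktop', 'desktop', 'desktop'],
    'product': ['sneakers', 'accessories', 'sneakers',
                'sneakers', 'accessories', 'clothes'],
    'target': [0, 1, 0, 1, 0, 0],
})
toy['time'] = pd.to_datetime(toy['time'])

mask = (
    (toy['site_version'] == 'desktop')
    & (toy['time'] >= pd.Timestamp('2019-05-01'))   # start of May, inclusive
    & (toy['time'] < pd.Timestamp('2019-06-01'))    # start of June, exclusive
    & toy['product'].isin(['accessories', 'sneakers'])
)
toy_AB = toy[mask].reset_index(drop=True)
print(toy_AB)
```

Only the 1 May and 31 May desktop rows survive: the April and June rows fail the time bounds, the mobile row fails the `site_version` check, and the clothes row fails the product check.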

In [None]:
df_AB.tail()

### The Hypotheses

NULL: Customers who saw the sneakers banner were no more or less likely to buy than customers who saw the accessories banner.

ALTERNATIVE: Customers who saw the sneakers banner were more or less likely to buy than customers who saw the accessories banner.

### Setting a Threshold

We'll set a false-positive rate of $\alpha = 0.05$.

### Preparing Fisher's Test

Fisher's Test is an exact calculation of a $p$-value that requires four quantities: the respective numbers of 1's and 0's for each class.

In [None]:
df_A = df_AB[df_AB['product'] == 'accessories']
df_B = df_AB[df_AB['product'] == 'sneakers']

In [None]:
a = sum(df_A['target'])
b = sum(df_B['target'])

c = len(df_A['target']) - a
d = len(df_B['target']) - b

a, b, c, d
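
The same four counts can also be read off a contingency table. A minimal sketch using `pd.crosstab` on a made-up frame (the rows are hypothetical; the column names follow the notebook):

```python
import pandas as pd

# Hypothetical data: 5 accessories customers and 5 sneakers customers.
toy_AB = pd.DataFrame({
    'product': ['accessories'] * 5 + ['sneakers'] * 5,
    'target':  [1, 0, 0, 0, 0,
                1, 1, 1, 0, 0],
})

# Rows are the outcome (0 = no purchase, 1 = purchase), columns are the banner.
table = pd.crosstab(toy_AB['target'], toy_AB['product'])
print(table)

a = table.loc[1, 'accessories']  # purchases, accessories banner
b = table.loc[1, 'sneakers']     # purchases, sneakers banner
c = table.loc[0, 'accessories']  # non-purchases, accessories banner
d = table.loc[0, 'sneakers']     # non-purchases, sneakers banner
print(a, b, c, d)
```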

### Calculation

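Fisher's Test is built on the hypergeometric distribution. With the row and column totals held fixed, the probability of observing exactly our table is given by the formula below, where $n = a+b+c+d$. The full $p$-value adds up this probability over every table at least as extreme as ours, so it is never smaller than the value computed here.

$\Large p = \frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!n!}$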

In [None]:
ab_choose_a = comb(a+b, a, exact=True)

In [None]:
cd_choose_c = comb(c+d, c, exact=True)

In [None]:
n_choose_ac = comb(a+b+c+d, a+c, exact=True)

In [None]:
p = ab_choose_a * cd_choose_c / n_choose_ac
p
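
As a cross-check, the sketch below compares the binomial-coefficient calculation with `scipy.stats.hypergeom` and with `scipy.stats.fisher_exact` on a small made-up table (the counts are hypothetical, not taken from the sales data). It shows that the single-table probability and the two-sided Fisher $p$-value are related but not the same number.

```python
from scipy import stats
from scipy.special import comb

# Hypothetical counts: successes (a, b) and failures (c, d) for groups A and B.
a, b, c, d = 3, 1, 1, 3
n = a + b + c + d

# Probability of exactly this table, via binomial coefficients.
p_table = (comb(a + b, a, exact=True) * comb(c + d, c, exact=True)
           / comb(n, a + c, exact=True))

# Same quantity from the hypergeometric pmf:
# population n, a+b successes in the population, a+c draws, a successes drawn.
p_hypergeom = stats.hypergeom.pmf(a, n, a + b, a + c)

# Two-sided Fisher p-value: sums the probabilities of all tables
# no more likely than the observed one.
odds_ratio, p_two_sided = stats.fisher_exact([[a, b], [c, d]],
                                             alternative='two-sided')

print(p_table, p_hypergeom, p_two_sided)
```

For this table the single-table probability is 16/70, while the two-sided $p$-value is 34/70, so comparing the latter to $\alpha$ is the safer decision rule.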

This extremely low $p$-value suggests that these two groups are genuinely performing differently. In particular, the desktop customers who saw the sneakers banner in May 2019 bought at a higher rate than the desktop customers who saw the accessories banner in May 2019.

## Exercise

Same question as before, but this time for April 2019 instead of May! Use a threshold of 0.05.