# Case Studies A/B-TEST

The purpose of this notebook is to demonstrate and test some functions I have written for hypothesis testing, especially for A/B-tests (on proportions). The custom function files can be found in the codebook folder / repository.

In [1]:
# load libraries

import numpy as np
import pandas as pd
import scipy.stats as stats

# custom functions - see codebook folder / repository
import cleaning_functions as cleaning
import hypothesis_functions as exp

import matplotlib.pyplot as plt
%matplotlib inline

---

## Case Study 1: Increase of Software Downloads and License Purchases

__Scenario:__ A productivity software company is looking for ways to increase the number of people who pay for their software. Users can download and use it free of charge, for a 7-day trial. After the end of the trial, users are required to pay for a license to continue using the software. The company wants to try to change the layout of the homepage to emphasize more prominently and higher up on the page that there is a 7-day trial available for the company's software. 

## General Workflow


1. Build a user funnel
2. Decide on metrics
3. Perform experiment sizing
4. Analyze results
5. Draw conclusions

### 1. Define user funnel

A straightforward flow might include the following steps:
- Visit homepage
- Visit download page
- Sign up for an account
- Download software
- After 7-day trial, software takes user to license-purchase page
- Purchase license

We'll assume that the potentially muddying effects of visits across multiple days, established user visits (e.g. for support or additional information), and 'lost' cookie tracking will be ignorable, at least unless we find reason to doubt our findings.

### 2. Define Metrics


__General thoughts:__ 

From our user funnel, we should consider two things: 
1. where and how we should split users into experiment groups, and 
2. what metrics we will use to track the success or failure of the experimental manipulation. 

The choice of unit of diversion (the point at which we divide observations into groups) may affect what metrics we can use, and whether the metrics we record should be considered invariant or evaluation metrics.

__Eval metrics decision:__

First: A cookie-based diversion is chosen. 

Second: Metrics we want to keep track of (somewhat simplified here):
* the number of cookies recorded at the homepage as a whole (invariant metric)
* the number of clicks on the download button at the download page (evaluation metric)
* the number of licenses purchased through the user accounts, each of which can be linked back to a particular condition (evaluation metric).

Notes on evaluation metrics: 
- Ratios are preferable to raw counts here, because ratios account for small differences in size between the two groups.
- Because we test two evaluation metrics, we have to correct the significance level (alpha) of each individual test.

### 3. Experiment Sizing

__Background info:__ Recent history shows that there are 
- about 3250 unique visitors per day, with slightly more visitors on Friday through Monday than during the rest of the week. 
- about 520 software downloads per day (a .16 rate) 
- about 65 licenses purchased each day (a .02 rate). 

In an ideal case, both the _download rate_ and the _license purchase rate_ should increase with the new homepage; a statistically significant negative change should be a sign to not deploy the homepage change. However, if only one of our metrics shows a statistically significant positive change we should be happy enough to deploy the new homepage.

This means: if we want to keep the overall Type I error rate (falsely deploying a homepage that has no actual effect) at a maximum of 5%, we should apply a Bonferroni correction: alpha_individual = alpha_overall / n, where n is the number of tests (here n = 2). Keep in mind that these are one-sided tests, and that we need enough data for both groups.
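A minimal sketch of the correction, just to make the arithmetic explicit. The variable names are my own, and the family-wise error line assumes the two tests are independent, which is an assumption and not something the scenario guarantees.

```python
# Bonferroni correction: split the overall Type I error budget across the tests.
alpha_overall = 0.05  # maximum overall Type I error rate from the text
n_tests = 2           # two evaluation metrics: download rate and license rate

alpha_individual = alpha_overall / n_tests

# ASSUMPTION: independent tests. The chance of at least one false positive
# then stays just below the overall budget.
familywise_error = 1 - (1 - alpha_individual) ** n_tests

print(f"alpha per test: {alpha_individual}")
print(f"family-wise error rate (independent tests): {familywise_error:.4f}")
```

The per-test alpha of 0.025 is the value passed to the sizing and results functions in this case study.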

__Question 1:__ We want to detect an increase of 50 downloads per day (up to 570 per day, or a .175 rate). How many days of data do we have to collect in order to get enough visitors to detect this new rate at an overall 5% Type I error rate and at 80% power?

__Question 2:__ We want to detect an increase of 10 license purchases per day (up to 75 per day, or a .023 rate). How many days of data do we have to collect in order to get enough visitors to detect this new rate at an overall 5% Type I error rate and at 80% power?
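The rates quoted in the background info and in the two questions can be re-derived from the daily counts. This is only a quick sanity check; the dictionary layout and names are my own choice.

```python
daily_visitors = 3250
daily_counts = {"downloads": 520, "licenses": 65}      # baseline counts per day
desired_increase = {"downloads": 50, "licenses": 10}   # increases we want to detect

rates = {}
for metric, count in daily_counts.items():
    p_baseline = count / daily_visitors
    p_target = (count + desired_increase[metric]) / daily_visitors
    rates[metric] = (p_baseline, p_target)
    print(f"{metric}: baseline {p_baseline:.3f} -> target {p_target:.3f}")
```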

In [2]:
# input
daily_visitors_per_group = 3250 / 2
alpha_ind = 0.05 / 2

# calculate sample size for each group according to question 1
size = exp.calc_experiment_size(0.16, 0.175, alpha=alpha_ind, beta=0.2, two_tails=False)

# calculate number of days according to question 1
days = size / daily_visitors_per_group
print("days: ", np.ceil(days))

min number of samples per group to achieve desired power: 9481.0
days:  6.0


In [3]:
# calculate size according to question 2
size = exp.calc_experiment_size(0.02, 0.023, alpha=alpha_ind, beta=0.2, two_tails=False)

# calculate number of days
days = size / daily_visitors_per_group
print("days: ", np.ceil(days))

min number of samples per group to achieve desired power: 34930.0
days:  22.0
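The implementation of `exp.calc_experiment_size` is not shown in this notebook. As a rough sketch of what such a function might compute, the following uses the common normal-approximation formula for comparing two proportions. The function name `sketch_experiment_size` is hypothetical and the formula is an assumption on my side, although it does reproduce the sample sizes printed above.

```python
import math
from statistics import NormalDist


def sketch_experiment_size(p_null, p_alt, alpha=0.05, beta=0.2, two_tails=True):
    """Approximate per-group sample size for comparing two proportions."""
    std_normal = NormalDist()
    # critical value under H0 and quantile for the desired power (1 - beta)
    z_null = std_normal.inv_cdf(1 - alpha / 2 if two_tails else 1 - alpha)
    z_power = std_normal.inv_cdf(1 - beta)
    # standard deviation of the difference for one observation per group,
    # under H0 (both groups at p_null) and under the alternative
    sd_null = math.sqrt(2 * p_null * (1 - p_null))
    sd_alt = math.sqrt(p_null * (1 - p_null) + p_alt * (1 - p_alt))
    n = ((z_null * sd_null + z_power * sd_alt) / (p_alt - p_null)) ** 2
    return math.ceil(n)


daily_visitors_per_group = 3250 / 2
for p_null, p_alt in [(0.16, 0.175), (0.02, 0.023)]:
    size = sketch_experiment_size(p_null, p_alt, alpha=0.05 / 2, two_tails=False)
    days = math.ceil(size / daily_visitors_per_group)
    print(f"{p_null} -> {p_alt}: {size} samples per group, {days} days")
```

The smaller the absolute difference we want to detect, the more samples are needed, which is why the license metric drives the runtime of the experiment.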



__Final decision:__ One thing that isn't accounted for in the base experiment length calculations is that there is going to be a delay between when users download the software and when they actually purchase a license. Any purchases observed within the first week might not be attributable to either experimental condition. As a way of accounting for this, we'll run the experiment for about one week longer to allow those users who come in during the third week a chance to come back and be counted in the license purchases tally. 

--> __Runtime 30 days__, the number of license purchases will only include purchases by users who joined after the start of the experiment.

## 4. Analyze Results

In [4]:
# load data
data = pd.read_csv('data/homepage-experiment-data.csv', sep=';')
data = cleaning.edit_column_names(data)

### Basic EDA

In [5]:
display(data.head())
display(data.shape)

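|   | day | control_cookies | control_downloads | control_licenses | experiment_cookies | experiment_downloads | experiment_licenses |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1764 | 246 | 1 | 1850 | 339 | 3 |
| 1 | 2 | 1541 | 234 | 2 | 1590 | 281 | 2 |
| 2 | 3 | 1457 | 240 | 1 | 1515 | 274 | 1 |
| 3 | 4 | 1587 | 224 | 1 | 1541 | 284 | 2 |
| 4 | 5 | 1606 | 253 | 2 | 1643 | 292 | 3 |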


(29, 7)

In [6]:
data.sum()

day                       435
control_cookies         46851
control_downloads        7554
control_licenses          710
experiment_cookies      47346
experiment_downloads     8548
experiment_licenses       732
dtype: int64

In [7]:
# define inputs for results calculations

# for downloads (full period)
n_exp_downloads = data['experiment_cookies'].sum()
n_cont_downloads = data['control_cookies'].sum()
n_pool_downloads = n_exp_downloads + n_cont_downloads

x_cont_downloads = data['control_downloads'].sum() 
x_exp_downloads = data['experiment_downloads'].sum()

# p_exp_downloads = data['experiment_downloads'].sum() / data['experiment_cookies'].sum()
# p_cont_downloads = data['control_downloads'].sum() / data['control_cookies'].sum()

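# for licenses: cookies of the first three weeks only (rows 0-20, .loc slicing is inclusive),
# license purchases summed over the whole period, since purchases lag behind visits by about a week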
n_exp_licenses = data.loc[:20, 'experiment_cookies'].sum()
n_cont_licenses = data.loc[:20, 'control_cookies'].sum()
n_pool_licenses = n_exp_licenses + n_cont_licenses

x_exp_licenses =  data.loc[:, 'experiment_licenses'].sum()
x_cont_licenses =  data.loc[:, 'control_licenses'].sum()

### Check invariant metric

- H0: control_cookies = experiment_cookies
- H1: control_cookies <> experiment_cookies (2-sided test)

In [8]:
# check if population sizing for both groups is within expectations

exp.calc_invariant_population(n_exp_downloads, n_cont_downloads)

OK: Observed difference in sizes is within expectations.
proportion exp: 0.503 within lower bound : 0.497 and upper bound: 0.503
p-value: 0.107 > alpha: 0.050


### Check eval metrics

- H0: experiment_x = control_x
- H1: experiment_x > control_x (1-sided test)

In [9]:
# calculate effect on downloads, NOTE: alpha is halved (Bonferroni correction because we do 2 tests)

exp.calc_experiment_results(x_exp_downloads, x_cont_downloads, n_exp_downloads, n_cont_downloads, alpha=0.025, two_tails=False)

Observed difference: 0.0193 with lower bound: 0.0145 and upper bound: 0.0241
p-value: 0.0000 < alpha: 0.025 (z-score: 7.8708)
STATISTICALLY HO CAN BE REJECTED.


In [10]:
# calculate effect on licenses, NOTE: alpha is halved (Bonferroni correction because we do 2 tests)

exp.calc_experiment_results(x_exp_licenses, x_cont_licenses, n_exp_licenses, n_cont_licenses, alpha=0.025, two_tails=False)

Observed difference: 0.0003 with lower bound: -0.0019 and upper bound: 0.0024
p-value: 0.3979 >= alpha: 0.025 (z-score: 0.2587)
STATISTICALLY HO CAN NOT BE REJECTED.
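Since `exp.calc_experiment_results` lives outside this notebook, here is a hedged sketch of a pooled two-proportion z-test that reproduces the printed download results. The function name `sketch_experiment_results` is hypothetical, and using the pooled standard error for both the z-score and the interval is an assumption. The inputs are the totals from the `data.sum()` output above.

```python
import math
from statistics import NormalDist


def sketch_experiment_results(x_exp, x_cont, n_exp, n_cont, alpha=0.05, two_tails=True):
    """Pooled two-proportion z-test: returns (diff, lower, upper, z_score, p_value)."""
    std_normal = NormalDist()
    diff = x_exp / n_exp - x_cont / n_cont
    # pooled proportion and standard error of the difference under H0
    p_pool = (x_exp + x_cont) / (n_exp + n_cont)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_exp + 1 / n_cont))
    z_score = diff / se
    if two_tails:
        p_value = 2 * (1 - std_normal.cdf(abs(z_score)))
        z_crit = std_normal.inv_cdf(1 - alpha / 2)
    else:
        # one-sided test for H1: experiment rate > control rate
        p_value = 1 - std_normal.cdf(z_score)
        z_crit = std_normal.inv_cdf(1 - alpha)
    margin = z_crit * se
    return diff, diff - margin, diff + margin, z_score, p_value


# download totals over the full period
diff, lower, upper, z_score, p_value = sketch_experiment_results(
    8548, 7554, 47346, 46851, alpha=0.025, two_tails=False
)
print(f"difference: {diff:.4f} [{lower:.4f}, {upper:.4f}], z-score: {z_score:.4f}")
print("reject H0" if p_value < 0.025 else "cannot reject H0")
```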


### 5. Draw Conclusions

Despite the fact that statistical significance wasn't obtained for the number of licenses purchased, the new homepage appeared to have a strong effect on the number of downloads made. _Based on our goals, this seems enough to suggest replacing the old homepage with the new homepage._ 

Establishing whether there was a significant increase in the number of license purchases, either through the rate or the increase in the number of homepage visits, will need further experiments or data collection.

---

## Case Study 2: Click-through probability

__Scenario:__ We want to change a webpage layout to increase the CTP (click-through probability = number of unique visitors that click at least once / number of unique visitors that visit the page) for a specific button. First we observe the actual CTP and calculate the confidence interval to see what fluctuation we can typically expect.

### Assess initial situation

In [11]:
# collect observations 

n = 2000 # total unique visitors to page
clicks = 300 # unique visitors who click
p = clicks / n # success rate as proportion

# return initial CTP
print("Initial CTP (old layout): ", p)

Initial CTP (old layout):  0.15


In [12]:
# safety check if we can assume a normal distribution for our calculations

assert (n * p > 5) and (n * (1 - p) > 5), "n too small for assumption of normal distribution"
print("assumption of normal distribution ok (rule of thumb)")

assumption of normal distribution ok (rule of thumb)


In [13]:
# calculate lower and upper bounds of 95% confidence interval (with custom function)

exp.calc_confidence_bounds_binomial(p=p, n=n, alpha=0.05)

lower bound: 0.1344, upper bound: 0.1656
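For reference, a minimal sketch of how such bounds can be computed with the normal approximation (Wald interval). The function name `sketch_confidence_bounds` is hypothetical; I am only assuming that the custom function does something equivalent, since this version yields the same bounds.

```python
import math
from statistics import NormalDist


def sketch_confidence_bounds(p, n, alpha=0.05):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    margin = z * math.sqrt(p * (1 - p) / n)   # z-score times standard error
    return p - margin, p + margin


lower, upper = sketch_confidence_bounds(p=300 / 2000, n=2000, alpha=0.05)
print(f"lower bound: {lower:.4f}, upper bound: {upper:.4f}")
```

With the old layout, an observed CTP anywhere between these bounds would not be surprising.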


## Design Experiment

### 1. Formulate Hypotheses

- H0: p_exp - p_cont = 0
- H1: p_exp - p_cont <> 0    _# reject null, in this case as two-sided test_ 
    
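Hypothesis testing: calculate the probability of observing a difference at least as extreme as the measured one if the null hypothesis is true, i.e. the p-value P(|d| >= |d_obs| | H0) with d = p_exp - p_cont.
> If that probability is small enough (p-value < alpha), we can reject the null hypothesis.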

### 2. Decide on Practical Significance

Practical significance is the minimum effect size we want to achieve so that the investment in the change is justified. This threshold can be higher than the smallest effect that is statistically significant.

### 3. Calculate Size of Experiment

The size is interrelated with the sensitivity / statistical power of the experiment (1 - beta): the larger the sample size (and the effect), the higher the power and the lower beta. The sensitivity is tricky to calculate by hand, so it is easiest to use an [online calculator](http://www.evanmiller.org/ab-testing/sample-size.html).

In [14]:
exp.calc_experiment_size(0.15, 0.17, alpha=0.05, two_tails=True)

min number of samples per group to achieve desired power: 5084.0


5084.0
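To illustrate how size and power hang together, here is a rough sketch that approximates the power for a given number of samples per group. The function `sketch_power` is hypothetical and relies on the normal approximation; for the two-sided case it ignores the (negligible) probability of rejecting in the wrong direction.

```python
import math
from statistics import NormalDist


def sketch_power(p_null, p_alt, n, alpha=0.05, two_tails=True):
    """Approximate power for detecting p_alt vs. p_null with n samples per group."""
    std_normal = NormalDist()
    z_null = std_normal.inv_cdf(1 - alpha / 2 if two_tails else 1 - alpha)
    # standard errors of the difference under H0 and under the alternative
    se_null = math.sqrt(2 * p_null * (1 - p_null) / n)
    se_alt = math.sqrt((p_null * (1 - p_null) + p_alt * (1 - p_alt)) / n)
    # smallest observed difference that would lead to rejecting H0
    critical_diff = z_null * se_null
    # probability of exceeding it when the true difference is p_alt - p_null
    return 1 - std_normal.cdf((critical_diff - (p_alt - p_null)) / se_alt)


for n in (1000, 2500, 5084, 10000):
    print(n, round(sketch_power(0.15, 0.17, n), 3))
```

At the calculated size of 5084 per group, the approximate power lands at roughly 80%; smaller groups fall short of it.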

## Check invariants

In this case only the population sizing is checked:
1. Compute standard error (standard deviation of the sampling distribution for the proportion) of binomial with p = 0.5
2. Multiply by z-score to get the margin of error m
3. Compute confidence interval around 0.5
4. Check whether observed fraction is within interval

In [15]:
exp.calc_invariant_population(61454, 61818)

OK: Observed difference in sizes is within expectations.
proportion exp: 0.499 within lower bound : 0.497 and upper bound: 0.503
p-value: 0.300 > alpha: 0.050
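The four steps listed above can be sketched as follows. The function name `sketch_invariant_population` is hypothetical and the two-sided p-value calculation is my assumption about what the custom function does; with the group sizes used in this notebook it gives the same rounded numbers.

```python
import math
from statistics import NormalDist


def sketch_invariant_population(n_exp, n_cont, p=0.5, alpha=0.05):
    """Check whether the group split is compatible with an expected proportion p."""
    std_normal = NormalDist()
    n_total = n_exp + n_cont
    # 1. standard error of the sample proportion under the expected split
    se = math.sqrt(p * (1 - p) / n_total)
    # 2. margin of error
    margin = std_normal.inv_cdf(1 - alpha / 2) * se
    # 3. confidence interval around the expected proportion
    lower, upper = p - margin, p + margin
    # 4. compare the observed fraction with the interval
    p_obs = n_exp / n_total
    z_score = (p_obs - p) / se
    p_value = 2 * (1 - std_normal.cdf(abs(z_score)))
    return p_obs, lower, upper, p_value


p_obs, lower, upper, p_value = sketch_invariant_population(61454, 61818)
print(f"proportion exp: {p_obs:.3f}, bounds: [{lower:.3f}, {upper:.3f}], p-value: {p_value:.3f}")
print("OK" if lower <= p_obs <= upper else "check the diversion mechanism")
```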


## Calculate Results

In [16]:
# collect observations 

x_exp = 1242 # unique visitors in experiment group who click
x_cont = 974 # unique visitors in control group who click
n_exp = 9886 # total unique visitors to page in experiment group 
n_cont = 10072 # total unique visitors to page in control group

In [17]:
exp.calc_experiment_results(x_exp=x_exp, x_cont=x_cont, n_exp=n_exp, n_cont=n_cont, alpha=0.05, two_tails=True)

Observed difference: 0.0289 with lower bound: 0.0202 and upper bound: 0.0376
p-value: 0.0000 < alpha: 0.050 (z-score: 6.5038)
STATISTICALLY HO CAN BE REJECTED.


## Draw Conclusions

As the lower bound of the calculated effect is above the practical significance threshold, the change should be made.
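The decision rule behind this conclusion can be written down in a few lines. Note that the notebook never states the practical significance threshold explicitly; the value `d_min = 0.02` below is an assumption derived from the sizing step (0.15 to 0.17), and the three-way rule is just one common convention.

```python
# interval bounds as printed by the results cell above
lower_bound, upper_bound = 0.0202, 0.0376

# ASSUMPTION: practical significance threshold taken from the sizing step (0.17 - 0.15)
d_min = 0.02

if lower_bound > d_min:
    decision = "launch"            # even the pessimistic estimate is worth it
elif upper_bound < d_min:
    decision = "do not launch"     # even the optimistic estimate is too small
else:
    decision = "inconclusive, consider collecting more data"

print(decision)
```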

---