## Import libraries

In [56]:
import numpy as np
import pandas as pd
from scipy import stats

## Load baseline data

In [33]:
baseline_df = pd.read_csv('./data/baseline_values.csv',header=None,names = ['Metric','Baseline_Value'])

In [44]:
baseline_df

| | Metric | Baseline_Value |
|---|---|---|
| 0 | Unique cookies to view course overview page pe... | 40000.0 |
| 1 | Unique cookies to click "Start free trial" per... | 3200.0 |
| 2 | Enrollments per day: | 660.0 |
| 3 | Click-through-probability on "Start free trial": | 0.08 |
| 4 | Probability of enrolling, given click: | 0.20625 |
| 5 | Probability of payment, given enroll: | 0.53 |
| 6 | Probability of payment, given click | 0.109313 |


## Metrics
Given the experiment (see README for details and discussion), we decide to consider the following **invariant** metrics:
- **Number of cookies** (number of unique cookies to view the course overview)
- **Number of clicks** (number of unique cookies to click the "Start free trial" button daily, which happens before the free trial screener is triggered)
- **Click-through-probability** (number of unique cookies to click the "Start free trial" button divided by the number of unique cookies to view the course overview page)

and the following **evaluation** metrics:
- **Gross conversion** (number of user-ids to complete checkout and enroll in the free trial divided by the number of unique cookies to click the "Start free trial" button)
- **Net conversion** (number of user-ids to remain enrolled past the 14-day boundary divided by the number of unique cookies to click the "Start free trial" button)
- **Retention** (number of user-ids to remain enrolled past the 14-day boundary divided by the number of user-ids to complete checkout).

Recall that cookie "uniqueness" is determined by day, whereas user-ids are automatically unique.
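The metric definitions above can be sketched as simple ratios of daily counts. The helper name `compute_metrics` is hypothetical (it is not part of the notebook), and the example assumes that the `Payments` count for a day refers to user-ids who enrolled that day and remained enrolled past the 14-day boundary. The counts are those of the first row of the control data (Sat, Oct 11).

```python
def compute_metrics(pageviews, clicks, enrollments, payments):
    """Return the click-through-probability and the three evaluation metrics.

    Hypothetical helper for illustration; all arguments are raw counts.
    """
    return {
        "click_through_probability": clicks / pageviews,
        "gross_conversion": enrollments / clicks,
        "net_conversion": payments / clicks,
        "retention": payments / enrollments,
    }


# Counts taken from the first row of the control data (Sat, Oct 11).
day_metrics = compute_metrics(pageviews=7723, clicks=687, enrollments=134, payments=70)

for name, value in day_metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that net conversion equals gross conversion multiplied by retention, which is one reason the three evaluation metrics are correlated.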

We also define the practical significance boundary $d_{\text{min}}$ for each evaluation metric, i.e. the smallest absolute change that would be meaningful for the business:

In [129]:
# number of evaluation metrics
num_eval_metrics = 3

# gross conversion
d_min_gross_conv = 0.01

# net conversion
d_min_net_conv = 0.0075

# retention
d_min_ret = 0.01

## Metric variability
We estimate the standard deviation of the chosen evaluation metrics analytically with a sample size of 5000 unique cookies that visit the course overview page.

In [36]:
sample_size = 5000

### Gross conversion
The number of enrollments among $n$ clicks follows a binomial distribution, with $\hat{p}$ the "probability of enrolling, given click" and $n$ the number of unique cookies to click the "Start free trial" button (the sample size multiplied by the click-through-probability on "Start free trial"). Before any changes, the expected count is $n \hat{p}$ and its variance is $n \hat{p} (1-\hat{p})$. Dividing the count by $n$ gives the metric itself, with expected value $\hat{p}$ and standard deviation $\sqrt{\hat{p} (1-\hat{p})/n}$.

In [37]:
p_hat_gross_conv = baseline_df.iloc[4]['Baseline_Value']
n_gross_conv = sample_size*baseline_df.iloc[3]['Baseline_Value']

In [55]:
# compute the std analytically
std_gross_conv = np.sqrt(p_hat_gross_conv*(1-p_hat_gross_conv)/n_gross_conv)
round(std_gross_conv,4)

0.0202
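As a sanity check on the analytic value, the standard deviation can also be estimated by simulation. This is a minimal sketch, not part of the original analysis: the seed and the number of simulated samples are arbitrary choices, and the baseline values are hard-coded so that the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility

p_hat = 0.20625        # baseline probability of enrolling, given click
n = int(5000 * 0.08)   # expected clicks in a sample of 5000 pageviews (400)

# Analytic standard deviation of the proportion
analytic_std = np.sqrt(p_hat * (1 - p_hat) / n)

# Simulate many samples of n clicks and record the observed proportion in each
num_simulations = 100_000  # arbitrary; larger values give a tighter estimate
simulated_props = rng.binomial(n, p_hat, size=num_simulations) / n
empirical_std = simulated_props.std(ddof=1)

print(round(float(analytic_std), 4), round(float(empirical_std), 4))
```

The two values should agree closely if the binomial model is appropriate. This holds here because the unit of diversion (cookie) is also the denominator of the metric.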

### Net conversion
Similarly, the metric is defined as the "probability of payment, given click" ($\hat{p}$) and $n$ is the same as for the gross conversion case.

In [40]:
p_hat_net_conv = baseline_df.iloc[6]['Baseline_Value']
n_net_conv = n_gross_conv

In [54]:
# compute the std analytically
std_net_conv = np.sqrt(p_hat_net_conv*(1-p_hat_net_conv)/n_net_conv)
round(std_net_conv,4)

0.0156

### Retention
In this case, the metric is defined as the "probability of payment, given enroll" ($\hat{p}$) and $n$ is the number of user-ids to complete checkout and enroll, which corresponds to the sample size multiplied by "enrollments per day" and divided by "number of unique cookies to visit course overview page per day".

In [45]:
p_hat_ret = baseline_df.iloc[5]['Baseline_Value']
n_ret = sample_size*baseline_df.iloc[2]['Baseline_Value']/baseline_df.iloc[0]['Baseline_Value']

In [53]:
# compute the std analytically
std_ret = np.sqrt(p_hat_ret*(1-p_hat_ret)/n_ret)
round(std_ret,4)

0.0549

Before proceeding, we need to check that $n$ is large enough for each of the three metrics to treat the point estimates as normally distributed (the metrics are probabilities, i.e. sample means, so the central limit theorem applies for large $n$). Only then can we build confidence intervals from normal critical values $z^*$.
**TO DO**
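One common way to carry out this check is the rule of thumb that both $n \hat{p}$ and $n (1-\hat{p})$ should exceed a small threshold. The sketch below assumes a threshold of 5 (some texts use 10). The helper `normal_approx_ok` is hypothetical, and the values of $n$ and $\hat{p}$ are those computed above for a sample of 5000 pageviews.

```python
# Rule-of-thumb threshold: an assumption, 5 is a common choice (some texts use 10)
MIN_EXPECTED_COUNT = 5


def normal_approx_ok(n, p, threshold=MIN_EXPECTED_COUNT):
    """Return True if both expected successes and expected failures reach the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold


# (n, p_hat) for each evaluation metric, for a sample of 5000 pageviews
checks = {
    "gross conversion": (400.0, 0.20625),
    "net conversion": (400.0, 0.109313),
    "retention": (82.5, 0.53),
}

results = {}
for name, (n, p) in checks.items():
    results[name] = normal_approx_ok(n, p)
    print(f"{name}: n*p = {n * p:.1f}, n*(1-p) = {n * (1 - p):.1f}, ok = {results[name]}")
```

Under this rule of thumb all three metrics pass, even with the smaller $n$ available for retention.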

## Size
We have two groups (sometimes called "variations"): experiment and control. We choose $\alpha = 0.05$ and power $1-\beta = 0.80$. The sample size per group required to reach this power is given by eq. (2.33) of [this reference](http://vanbelle.org/chapters%5Cwebchapter2.pdf):

$$ n = \frac{\left (z_{1-\alpha/2} \sqrt{2 \hat{p} (1-\hat{p})}+z_{1-\beta} \sqrt{\hat{p} (1-\hat{p})+(\hat{p}+d_{\text{min}}) (1-(\hat{p}+d_{\text{min}}))} \right)^2}{d^2_{\text{min}}} \, .$$
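
As a worked illustration of the formula, substitute the gross conversion baseline ($\hat{p} = 0.20625$, $d_{\text{min}} = 0.01$) together with $z_{0.975} \approx 1.95996$ and $z_{0.80} \approx 0.84162$. All intermediate values are rounded, so the last digits are approximate:

$$ \sqrt{2 \hat{p} (1-\hat{p})} \approx 0.57221 \, , \qquad \sqrt{\hat{p} (1-\hat{p})+(\hat{p}+d_{\text{min}}) (1-(\hat{p}+d_{\text{min}}))} \approx 0.57723 \, , $$

$$ n \approx \frac{\left( 1.95996 \times 0.57221 + 0.84162 \times 0.57723 \right)^2}{0.01^2} \approx \frac{1.60732^2}{0.0001} \approx 2.58 \times 10^4 \, . $$

This gives roughly 25,800 clicks per group, or about 51,700 in total for the two groups.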

Since we are tracking and testing multiple metrics, we might apply the Bonferroni correction to limit the probability of false positives. However, the chosen evaluation metrics are correlated, so the Bonferroni correction might be overly conservative.

Define chosen $\alpha$, $\beta$, and whether to apply the Bonferroni correction:

In [130]:
alpha=0.05
beta=0.20
apply_bonferroni=False

In [131]:
if apply_bonferroni:
    ind_alpha=alpha/num_eval_metrics
else:
    ind_alpha=alpha
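
To illustrate what the correction changes, the sketch below compares the two-sided critical value $z_{1-\alpha/2}$ with and without the Bonferroni correction for three metrics. It uses the standard library's `statistics.NormalDist` in place of `stats.norm.ppf` so that it runs without dependencies. The variable names are illustrative only.

```python
from statistics import NormalDist

alpha = 0.05
num_eval_metrics = 3

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Two-sided critical value without correction
z_uncorrected = std_normal.inv_cdf(1 - alpha / 2)

# Two-sided critical value with the per-metric alpha = alpha / num_eval_metrics
z_bonferroni = std_normal.inv_cdf(1 - (alpha / num_eval_metrics) / 2)

print(round(z_uncorrected, 3), round(z_bonferroni, 3))
```

The larger critical value enters the numerator of the sample size formula, so applying the correction increases the required group size for every metric.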

In [132]:
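def compute_group_size(alpha,beta,p,dmin):
    """Per-group sample size for a two-sided test comparing two proportions."""
    # for p > 0.5, use the alternative p - dmin: it is closer to 0.5, has
    # the larger variance and therefore gives the larger (conservative) size
    if p>0.5:
        dmin=-dmin

    z_alpha = stats.norm.ppf(1-alpha/2)
    z_beta = stats.norm.ppf(1-beta)
    return (z_alpha*np.sqrt(2*p*(1-p))
            + z_beta*np.sqrt(p*(1-p)+(p+dmin)*(1-(p+dmin))))**2/dmin**2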

### Gross conversion
Determine required number of unique cookies (total) to click "Start free trial" button:

In [133]:
total_size_gross_conv = int(2*round(compute_group_size(ind_alpha,beta,p_hat_gross_conv,d_min_gross_conv),0))
total_size_gross_conv

51670

Determine required number of unique cookies to visit course overview page:

In [134]:
total_pageviews_gross_conv = int(total_size_gross_conv/baseline_df.iloc[3]['Baseline_Value'])
total_pageviews_gross_conv

645875

### Net conversion
Determine required number of unique cookies (total) to click "Start free trial" button:

In [135]:
total_size_net_conv = int(2*round(compute_group_size(ind_alpha,beta,p_hat_net_conv,d_min_net_conv),0))
total_size_net_conv

54826

Determine required number of unique cookies to visit course overview page:

In [136]:
total_pageviews_net_conv = int(total_size_net_conv/baseline_df.iloc[3]['Baseline_Value'])
total_pageviews_net_conv

685325

### Retention
Determine required number of user-ids to enroll:

In [137]:
total_size_ret = int(2*round(compute_group_size(ind_alpha,beta,p_hat_ret,d_min_ret),0))
total_size_ret

78230

Determine required number of unique cookies to visit course overview page:

In [138]:
total_pageviews_ret = int(total_size_ret/baseline_df.iloc[2]['Baseline_Value']*baseline_df.iloc[0]['Baseline_Value'])
total_pageviews_ret

4741212

If we want to track all evaluation metrics, we need to consider the maximum of all the required sizes:

In [139]:
max([total_pageviews_gross_conv,total_pageviews_net_conv,total_pageviews_ret])

4741212

The number of unique cookies needed to test retention is much higher than the number needed to test gross and net conversion. We might consider dropping the retention metric if it requires too many cookies, too high a percentage of diverted traffic, or too long an experiment.

## Duration vs Exposure
From the baseline data, our traffic consists of approximately 40,000 unique cookies that visit the course overview page per day. If we consider all three metrics, with 100% of traffic diverted it would take 4,741,212/40,000 $\approx$ 118.5, i.e. 119 days, to run the experiment. This is definitely too long, and it leaves no room for any other experiment.

If we drop retention from the set of evaluation metrics and divert 50% of traffic (20,000 unique cookies per day), we would need 685,325/20,000 $\approx$ 34.3, i.e. 35 days, to run the experiment, which is acceptable (685,325 is the larger of the two pageview requirements for gross and net conversion). In addition, the experiment does not seem very risky for users or for the business, so diverting 50% of traffic seems reasonable.
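
The trade-off between duration and exposure can be sketched with a small helper. The function name `days_needed` is hypothetical, and the fractions other than 50% and 100% are shown only for comparison.

```python
import math


def days_needed(required_pageviews, daily_pageviews, fraction_diverted):
    """Number of whole days needed to collect the required pageviews."""
    return math.ceil(required_pageviews / (daily_pageviews * fraction_diverted))


daily_pageviews = 40000

# All three metrics, 100% of traffic diverted
days_all_metrics = days_needed(4741212, daily_pageviews, 1.0)   # 119

# Gross and net conversion only, 50% of traffic diverted
days_without_retention = days_needed(685325, daily_pageviews, 0.5)  # 35

# Duration without retention for a few diversion fractions
for fraction in (0.25, 0.5, 0.75, 1.0):
    print(fraction, days_needed(685325, daily_pageviews, fraction))
```

Halving the diverted fraction roughly doubles the duration, so the choice of exposure is a direct trade-off between the risk of the change and the time until a decision.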

## Load experiment and control data

In [140]:
control_df = pd.read_csv('./data/control_results.csv')
experiment_df = pd.read_csv('./data/experiment_results.csv')

In [142]:
control_df.head()

| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7723 | 687 | 134.0 | 70.0 |
| 1 | Sun, Oct 12 | 9102 | 779 | 147.0 | 70.0 |
| 2 | Mon, Oct 13 | 10511 | 909 | 167.0 | 95.0 |
| 3 | Tue, Oct 14 | 9871 | 836 | 156.0 | 105.0 |
| 4 | Wed, Oct 15 | 10014 | 837 | 163.0 | 64.0 |


In [143]:
experiment_df.tail()

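| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 32 | Wed, Nov 12 | 10042 | 802 | NaN | NaN |
| 33 | Thu, Nov 13 | 9721 | 829 | NaN | NaN |
| 34 | Fri, Nov 14 | 9304 | 770 | NaN | NaN |
| 35 | Sat, Nov 15 | 8668 | 724 | NaN | NaN |
| 36 | Sun, Nov 16 | 8988 | 710 | NaN | NaN |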


## Analysis