# Import libraries

In [56]:
import numpy as np
import pandas as pd
from scipy import stats

# Load baseline data

In [190]:
baseline_dict = {'Metric': ['Unique cookies to view course overview page daily',
                            'Unique cookies to click "Start free trial" button daily',
                            'Enrollments per day',
                            'Probability of payment, given enroll'],
                 'Baseline_Value': [40000, 3200, 660, 0.53]}

baseline_df = pd.DataFrame.from_dict(baseline_dict)

In [191]:
derived_baseline_dict = {'Metric': ['Click-through probability on "Start free trial"',
                                    'Probability of enrolling, given click',
                                    'Probability of payment, given click'],
                         'Baseline_Value': [3200/40000, 660/3200, 0.53*660/3200]}

derived_baseline_df = pd.DataFrame.from_dict(derived_baseline_dict)

In [192]:
baseline_df = pd.concat([baseline_df,derived_baseline_df],ignore_index=True)

In [193]:
baseline_df

Unnamed: 0,Metric,Baseline_Value
0,Unique cookies to view course overview page daily,40000.0
1,"Unique cookies to click ""Start free trial"" but...",3200.0
2,Enrollments per day,660.0
3,"Probability of payment, given enroll",0.53
4,"Click-through probability on ""Start free trial""",0.08
5,"Probability of enrolling, given click",0.20625
6,"Probability of payment, given click",0.109313


# Metrics
Given the experiment (see README for details and discussion), we decide to consider the following **invariant** metrics:
- **Number of cookies** (number of unique cookies to view the course overview)
- **Number of clicks** (number of unique cookies to click the "Start free trial" button daily, which happens before the free trial screener is triggered)
- **Click-through-probability** (number of unique cookies to click the "Start free trial" button divided by the number of unique cookies to view the course overview page)

and the following **evaluation** metrics:
- **Gross conversion** (number of user-ids to complete checkout and enroll in the free trial divided by the number of unique cookies to click the "Start free trial" button)
- **Net conversion** (number of user-ids to remain enrolled past the 14-day boundary divided by the number of unique cookies to click the "Start free trial" button)
- **Retention** (number of user-ids to remain enrolled past the 14-day boundary divided by the number of user-ids to enroll).

Recall that cookie "uniqueness" is determined by day, whereas user-ids are automatically unique.

We also define the practical significance boundary $d_{\text{min}}$ for each evaluation metric, i.e. the smallest observed difference that would count as a meaningful change for the business, given as an absolute change:

In [194]:
# number of evaluation metrics
num_eval_metrics = 3

# gross conversion
d_min_gross_conv = 0.01

# net conversion
d_min_net_conv = 0.0075

# retention
d_min_ret = 0.01

# Metric variability
We estimate the standard deviation of the chosen evaluation metrics analytically with an example sample size of 5000 unique cookies that visit the course overview page.

In [195]:
sample_size = 5000

### Gross conversion
The number of enrollments follows a binomial distribution with success probability $\hat{p}$, the "probability of enrolling, given click", and $n$ trials, the number of unique cookies to click the "Start free trial" button (the sample size multiplied by the click-through probability on "Start free trial"). Before any changes, its estimated expected value is $n \hat{p}$ and its variance is $n \hat{p} (1-\hat{p})$. Dividing by $n$ rescales the count so that the metric is exactly $\hat{p}$ and its standard deviation is $\sqrt{\hat{p} (1-\hat{p})/n}$.

In [196]:
p_hat_gross_conv = baseline_df.iloc[5]['Baseline_Value']
n_gross_conv = sample_size*baseline_df.iloc[4]['Baseline_Value']

In [197]:
# compute the std analytically
std_gross_conv = np.sqrt(p_hat_gross_conv*(1-p_hat_gross_conv)/n_gross_conv)
round(std_gross_conv,4)

0.0202
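As a rough cross-check of the analytic value, one could also simulate the metric directly. The sketch below is not part of the original analysis: it relies only on the standard library, and the seed and the number of simulated experiments (`num_trials`) are arbitrary choices made for illustration.

```python
import math
import random
import statistics

# Baseline values taken from the table above
p_hat = 660 / 3200              # probability of enrolling, given click
n = int(5000 * 3200 / 40000)    # clicks expected from 5000 pageviews (400)

# Analytic standard deviation of the estimated proportion
analytic_std = math.sqrt(p_hat * (1 - p_hat) / n)

# Simulate many experiments of n clicks each and record the observed proportion
rng = random.Random(0)   # arbitrary seed, for reproducibility only
num_trials = 5000        # arbitrary number of simulated experiments
simulated = [
    sum(rng.random() < p_hat for _ in range(n)) / n
    for _ in range(num_trials)
]
simulated_std = statistics.stdev(simulated)

print(round(analytic_std, 4), round(simulated_std, 4))
```

The two numbers should agree closely; a large gap would suggest that the binomial assumption (independent clicks with a common enrollment probability) is not a good description of the data.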

### Net conversion
Similarly, the metric is defined as the "probability of payment, given click" ($\hat{p}$) and $n$ is the same as for the gross conversion case.

In [198]:
p_hat_net_conv = baseline_df.iloc[6]['Baseline_Value']
n_net_conv = n_gross_conv

In [199]:
# compute the std analytically
std_net_conv = np.sqrt(p_hat_net_conv*(1-p_hat_net_conv)/n_net_conv)
round(std_net_conv,4)

0.0156

### Retention
In this case, the metric is defined as the "probability of payment, given enroll" ($\hat{p}$) and $n$ is the number of user-ids to complete checkout and enroll, which corresponds to the sample size multiplied by "enrollments per day" and divided by "number of unique cookies to visit course overview page per day".

In [200]:
p_hat_ret = baseline_df.iloc[3]['Baseline_Value']
n_ret = sample_size*baseline_df.iloc[2]['Baseline_Value']/baseline_df.iloc[0]['Baseline_Value']

In [201]:
# compute the std analytically
std_ret = np.sqrt(p_hat_ret*(1-p_hat_ret)/n_ret)
round(std_ret,4)

0.0549

Before proceeding, we need to check that $n$ is large enough for each of the three metrics. The chosen metrics are probabilities and therefore inherently sample means, so by the central limit theorem their point estimates can be treated as normally distributed when constructing confidence intervals (critical values $z^*$, etc.), provided $n$ is sufficiently large. For the example sample size of 5000 unique cookies, $n$ is large enough ($n>30$) in all three cases:

In [204]:
print(n_gross_conv)
print(n_net_conv)
print(n_ret)

400.0
400.0
82.5
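The $n>30$ threshold is a rule of thumb. Another commonly used check for proportions is that both the expected number of successes $n\hat{p}$ and of failures $n(1-\hat{p})$ are reasonably large. The cutoff of 10 used below is a convention and an assumption of this sketch, not something prescribed by the analysis; the values of $n$ and $\hat{p}$ are the ones computed above.

```python
# (n, p_hat) for each evaluation metric at the example sample size of 5000 pageviews
metrics = {
    "gross conversion": (400.0, 660 / 3200),
    "net conversion": (400.0, 0.53 * 660 / 3200),
    "retention": (82.5, 0.53),
}

threshold = 10  # conventional cutoff; an assumption, not part of the original analysis

checks = {}
for name, (n, p_hat) in metrics.items():
    successes = n * p_hat        # expected number of successes
    failures = n * (1 - p_hat)   # expected number of failures
    checks[name] = successes >= threshold and failures >= threshold
    print(f"{name}: n*p = {successes:.1f}, n*(1-p) = {failures:.1f}, ok = {checks[name]}")
```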


# Size
We have two groups (experiment and control) and we choose a significance level $\alpha = 0.05$ and a power $1-\beta = 0.80$. The sample size per group required to reach this power can be computed using the following formula (see eq. (2.33) of [this reference](http://vanbelle.org/chapters%5Cwebchapter2.pdf)):

$$ n = \frac{\left (z_{1-\alpha/2} \sqrt{2 \hat{p} (1-\hat{p})}+z_{1-\beta} \sqrt{\hat{p} (1-\hat{p})+(\hat{p}+d_{\text{min}}) (1-(\hat{p}+d_{\text{min}}))} \right)^2}{d^2_{\text{min}}} \, .$$
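As an illustration, plugging in the baseline gross conversion ($\hat{p} = 0.20625$, $d_{\text{min}} = 0.01$) with $z_{0.975} \approx 1.95996$ and $z_{0.80} \approx 0.84162$ gives, with intermediate values rounded:

$$ n \approx \frac{\left(1.95996 \sqrt{0.32742} + 0.84162 \sqrt{0.33320}\right)^2}{0.01^2} \approx \frac{(1.12151 + 0.48581)^2}{0.0001} \approx 25{,}835 \, ,$$

where $0.32742 \approx 2 \hat{p}(1-\hat{p})$ and $0.33320 \approx \hat{p}(1-\hat{p}) + (\hat{p}+d_{\text{min}})(1-(\hat{p}+d_{\text{min}}))$. This is the number of clicks needed in each group, not the number of pageviews.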

Since we are tracking and testing multiple metrics, we might consider applying the Bonferroni correction to limit the possibility of false positives (we know however that there is some correlation between the chosen evaluation metrics so applying the Bonferroni correction might be overly conservative).
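To get a feel for how much more stringent the correction is, the sketch below compares the two-sided critical value with and without dividing $\alpha$ by the number of evaluation metrics. It uses `statistics.NormalDist` from the standard library instead of `scipy.stats`, purely to keep the snippet dependency-free.

```python
from statistics import NormalDist

alpha = 0.05
num_eval_metrics = 3

# Two-sided critical values of the standard normal distribution
z_uncorrected = NormalDist().inv_cdf(1 - alpha / 2)
z_bonferroni = NormalDist().inv_cdf(1 - (alpha / num_eval_metrics) / 2)

print(round(z_uncorrected, 3), round(z_bonferroni, 3))
```

A larger critical value widens every confidence interval and increases the sample size required for the same power.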

Define chosen $\alpha$, $\beta$, and whether to apply the Bonferroni correction:

In [205]:
alpha=0.05
beta=0.20
apply_bonferroni=False

In [206]:
if apply_bonferroni:
    ind_alpha=alpha/num_eval_metrics
else:
    ind_alpha=alpha

In [207]:
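def compute_group_size(alpha,beta,p,dmin):
    # Per-group sample size for a two-sided test on two proportions
    # (eq. (2.33) of the reference above).
    # For p > 0.5 the shift is applied downwards, so that the alternative
    # p+dmin lies closer to 0.5, where p*(1-p) is larger; this yields the
    # larger (more conservative) of the two possible sizes.
    if p>0.5:
        dmin=-dmin
    
    return (stats.norm.ppf(1-alpha/2)*np.sqrt(2*p*(1-p))
    + stats.norm.ppf(1-beta)*np.sqrt(p*(1-p)+(p+dmin)*(1-(p+dmin))))**2/dmin**2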

### Gross conversion
Determine required number of unique cookies (per variation) to click "Start free trial" button:

In [209]:
group_size_gross_conv = int(round(compute_group_size(ind_alpha,beta,p_hat_gross_conv,d_min_gross_conv),0))
group_size_gross_conv

25835

Determine required number of unique cookies to visit course overview page (per variation):

In [214]:
group_pageviews_gross_conv = int(round(group_size_gross_conv/baseline_df.iloc[4]['Baseline_Value'],0))
group_pageviews_gross_conv

322938

### Net conversion
Determine required number of unique cookies (per variation) to click "Start free trial" button:

In [216]:
group_size_net_conv = int(round(compute_group_size(ind_alpha,beta,p_hat_net_conv,d_min_net_conv),0))
group_size_net_conv

27413

Determine required number of unique cookies to visit course overview page (per variation):

In [220]:
group_pageviews_net_conv = int(round(group_size_net_conv/baseline_df.iloc[4]['Baseline_Value'],0))
group_pageviews_net_conv

342662

### Retention
Determine required number of user-ids to enroll (per variation):

In [221]:
group_size_ret = int(round(compute_group_size(ind_alpha,beta,p_hat_ret,d_min_ret),0))
group_size_ret

39115

Determine required number of unique cookies to visit course overview page (per variation):

In [224]:
group_pageviews_ret = int(round(group_size_ret/baseline_df.iloc[2]['Baseline_Value']*baseline_df.iloc[0]['Baseline_Value'],0))
group_pageviews_ret

2370606

If we want to track all evaluation metrics, we need to consider the maximum of all the required sizes:

In [226]:
max([group_pageviews_gross_conv,group_pageviews_net_conv,group_pageviews_ret])

2370606

We observe that the number of unique cookies needed to test retention is much higher than the number needed to test gross and net conversion. We might consider dropping the retention metric if it requires too many cookies, too high a percentage of diverted traffic, or too long an experiment. In that case, the number of unique cookies needed per group is:

In [227]:
max([group_pageviews_gross_conv,group_pageviews_net_conv])

342662

# Duration vs Exposure
We know from the baseline data that our traffic consists of approximately 40,000 unique cookies visiting the course overview page per day. If we consider all three metrics, then even with 100% of traffic diverted the experiment would take 2*2,370,606/40,000 ≈ 118.5, i.e. 119 days, which is definitely too long (and leaves no room for any other experiment).

If we drop retention from the set of evaluation metrics and divert 50% of traffic (20,000 unique cookies per day), we would need 2*342,662/20,000 ≈ 34.3, i.e. 35 days to run the experiment, which is acceptable. In addition, the experiment does not seem very risky for users or for the business, so diverting 50% of traffic seems reasonable.
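The duration arithmetic can be wrapped in a small helper to explore other diversion fractions. This is a minimal sketch; the function name `experiment_duration_days` and its parameters are hypothetical and introduced only for illustration.

```python
import math

def experiment_duration_days(pageviews_per_group, daily_pageviews, fraction_diverted):
    """Days needed to collect the required pageviews in both groups (rounded up)."""
    total_pageviews = 2 * pageviews_per_group  # experiment + control
    return math.ceil(total_pageviews / (daily_pageviews * fraction_diverted))

# All three metrics, 100% of traffic diverted
days_all_metrics = experiment_duration_days(2370606, 40000, 1.0)
# Retention dropped, 50% of traffic diverted
days_no_retention = experiment_duration_days(342662, 40000, 0.5)

print(days_all_metrics, days_no_retention)
```

Rounding up reflects that a partial day of traffic still requires the experiment to run for that whole day.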

# Load experiment and control data

Meaning of each column:
- **Pageviews**: number of unique cookies to view the course overview page that day
- **Clicks**: number of unique cookies to click the "Start free trial" button that day
- **Enrollments**: number of user-ids to complete checkout and enroll in the free trial that day
- **Payments**: number of user-ids who enrolled on that day and remained enrolled for at least 14 days (thus making a payment).

Because of the 14-day free-trial window, enrollments and payments are tracked for 14 fewer days than the other columns.

In [140]:
control_df = pd.read_csv('./data/control_results.csv')
experiment_df = pd.read_csv('./data/experiment_results.csv')

In [142]:
control_df.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0


In [143]:
experiment_df.tail()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
32,"Wed, Nov 12",10042,802,,
33,"Thu, Nov 13",9721,829,,
34,"Fri, Nov 14",9304,770,,
35,"Sat, Nov 15",8668,724,,
36,"Sun, Nov 16",8988,710,,


# Analysis

In order to track gross and net conversion, we need to consider only the days for which we have enrollments and payments data.

In [343]:
final_control_df = control_df[control_df['Enrollments'].notnull()]
final_experiment_df = experiment_df[experiment_df['Enrollments'].notnull()]

# final_control_df = control_df.copy()
# final_experiment_df = experiment_df.copy()

In [344]:
final_control_summary_df = pd.DataFrame(final_control_df.sum()).drop('Date').rename(columns={0:'Control'})
final_experiment_summary_df = pd.DataFrame(final_experiment_df.sum()).drop('Date').rename(columns={0:'Experiment'})

In [345]:
# join on index
final_results_df = final_control_summary_df.join(final_experiment_summary_df)
final_results_df

Unnamed: 0,Control,Experiment
Pageviews,212163,211362
Clicks,17293,17260
Enrollments,3785,3423
Payments,2033,1945


Before proceeding, we need to look at the above sample sizes and verify that they are large enough to draw conclusions on the chosen evaluation metrics.

### Sanity checks on invariant metrics
Check whether invariant metrics are equivalent between experiment and control groups.

**Pageviews** (unique cookies to visit course overview page) and **clicks** (unique cookies to click the "Start free trial" button) should be randomly split between the two groups with p=0.50, so the number of cookies assigned to the control group follows a binomial distribution (we define "control" as the success outcome). For both metrics the sample size is large enough to use the normal approximation when constructing a confidence interval around the expected fraction of cookies in the control group (p=0.50); the check passes if the observed fraction falls inside that interval. We choose $\alpha=0.05$ ($z^*=1.96$).

In [346]:
sanity_check_df = final_results_df.drop(['Enrollments','Payments'])
sanity_check_df

Unnamed: 0,Control,Experiment
Pageviews,212163,211362
Clicks,17293,17260


In [347]:
sanity_check_df['Total'] = sanity_check_df['Control']+sanity_check_df['Experiment']
sanity_check_df['Observed_Fraction'] = sanity_check_df['Control']/sanity_check_df['Total']
sanity_check_df['Expected_Fraction'] = 0.5
sanity_check_df['Std'] = sanity_check_df['Expected_Fraction']*(1-sanity_check_df['Expected_Fraction'])/sanity_check_df['Total']
sanity_check_df['Std'] = sanity_check_df['Std'].apply(lambda x:np.sqrt(x))
sanity_check_df['Margin_Error'] = 1.96*sanity_check_df['Std']
sanity_check_df['Lower_CI'] = sanity_check_df['Expected_Fraction']-sanity_check_df['Margin_Error']
sanity_check_df['Upper_CI'] = sanity_check_df['Expected_Fraction']+sanity_check_df['Margin_Error']
sanity_check_df['Pass_Test'] = sanity_check_df.apply(lambda x: x['Observed_Fraction']>=x['Lower_CI'] and x['Observed_Fraction']<=x['Upper_CI'], axis=1)

In [348]:
sanity_check_df

Unnamed: 0,Control,Experiment,Total,Observed_Fraction,Expected_Fraction,Std,Margin_Error,Lower_CI,Upper_CI,Pass_Test
Pageviews,212163,211362,423525,0.500946,0.5,0.000768,0.001506,0.498494,0.501506,True
Clicks,17293,17260,34553,0.500478,0.5,0.00269,0.005272,0.494728,0.505272,True
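An equivalent way to express the same check is through a two-sided p-value: the split passes at $\alpha=0.05$ when the p-value exceeds 0.05. The sketch below uses the same normal approximation as the confidence intervals above; the helper name `split_p_value` is hypothetical, and `statistics.NormalDist` is used only to keep the snippet free of third-party dependencies.

```python
import math
from statistics import NormalDist

def split_p_value(control_count, experiment_count, expected_fraction=0.5):
    """Two-sided p-value for the observed control fraction (normal approximation)."""
    total = control_count + experiment_count
    observed_fraction = control_count / total
    std = math.sqrt(expected_fraction * (1 - expected_fraction) / total)
    z = (observed_fraction - expected_fraction) / std
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Counts taken from the summary table above
p_pageviews = split_p_value(212163, 211362)
p_clicks = split_p_value(17293, 17260)

print(round(p_pageviews, 3), round(p_clicks, 3))
```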


For the **click-through probability**, we construct the 95% confidence interval around the expected difference in click-through probabilities ($d_{\text{exp}}=0$) and check whether the observed difference is within the confidence interval.
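For reference, the pooled quantities used for this interval are the standard ones for a difference of two proportions. Writing $X$ for the number of clicks and $N$ for the number of pageviews in each group (notation introduced here for illustration only):

$$ \hat{p}_{\text{pool}} = \frac{X_{\text{cont}}+X_{\text{exp}}}{N_{\text{cont}}+N_{\text{exp}}} \, , \qquad SE_{\text{pool}} = \sqrt{\hat{p}_{\text{pool}} (1-\hat{p}_{\text{pool}}) \left(\frac{1}{N_{\text{cont}}}+\frac{1}{N_{\text{exp}}}\right)} \, ,$$

and the 95% confidence interval is $d_{\text{exp}} \pm z^* \, SE_{\text{pool}}$ with $z^*=1.96$.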

In [349]:
add_sanity_check_df = pd.DataFrame({'Control':pd.Series([final_results_df.loc['Clicks','Control']/final_results_df.loc['Pageviews','Control']], index=['Click-through Probability']),
                                    'Experiment':pd.Series([final_results_df.loc['Clicks','Experiment']/final_results_df.loc['Pageviews','Experiment']], index=['Click-through Probability'])})
add_sanity_check_df

Unnamed: 0,Control,Experiment
Click-through Probability,0.081508,0.081661


In [350]:
add_sanity_check_df['Observed_Diff'] = add_sanity_check_df['Experiment']-add_sanity_check_df['Control']
add_sanity_check_df['Expected_Diff'] = 0.0
add_sanity_check_df['Pooled_Prob'] = (final_results_df.loc['Clicks','Control']+final_results_df.loc['Clicks','Experiment'])/(final_results_df.loc['Pageviews','Control']+final_results_df.loc['Pageviews','Experiment'])
add_sanity_check_df['Pooled_SE'] = np.sqrt(add_sanity_check_df['Pooled_Prob']*(1-add_sanity_check_df['Pooled_Prob'])*(1/final_results_df.loc['Pageviews','Control']+1/final_results_df.loc['Pageviews','Experiment']))
add_sanity_check_df['Margin_Error'] = 1.96*add_sanity_check_df['Pooled_SE']
add_sanity_check_df['Lower_CI'] = add_sanity_check_df['Expected_Diff']-add_sanity_check_df['Margin_Error']
add_sanity_check_df['Upper_CI'] = add_sanity_check_df['Expected_Diff']+add_sanity_check_df['Margin_Error']
add_sanity_check_df['Pass_Test'] = add_sanity_check_df.apply(lambda x: x['Observed_Diff']>=x['Lower_CI'] and x['Observed_Diff']<=x['Upper_CI'], axis=1)

In [351]:
add_sanity_check_df

Unnamed: 0,Control,Experiment,Observed_Diff,Expected_Diff,Pooled_Prob,Pooled_SE,Margin_Error,Lower_CI,Upper_CI,Pass_Test
Click-through Probability,0.081508,0.081661,0.000153,0.0,0.081584,0.000841,0.001649,-0.001649,0.001649,True


Since all sanity checks pass, there is no need for further investigation (e.g. looking at the day-to-day breakdown).

### Evaluation metrics