# Import libraries

In [56]:
import numpy as np
import pandas as pd
from scipy import stats

# Load baseline data
We start from baseline estimates for the metrics, derived from historical data collected before the experiment.

In [458]:
baseline_dict = {'Metric': ['Unique cookies to view course overview page per day',
                            'Unique cookies to click "Start free trial" button per day',
                            'Enrollments per day',
                            'Probability of payment, given enroll'],
                 'Baseline_Value': [40000, 3200, 660, 0.53]}

baseline_df = pd.DataFrame.from_dict(baseline_dict)

In [459]:
derived_baseline_dict = {'Metric': ['Click-through probability on "Start free trial" button',
                                    'Probability of enroll, given click',
                                    'Probability of payment, given click'],
                         'Baseline_Value': [3200/40000, 660/3200, 0.53*660/3200]}

derived_baseline_df = pd.DataFrame.from_dict(derived_baseline_dict)

In [460]:
baseline_df = pd.concat([baseline_df,derived_baseline_df],ignore_index=True)

In [461]:
pd.set_option('display.max_colwidth', None)
baseline_df

Unnamed: 0,Metric,Baseline_Value
0,Unique cookies to view course overview page per day,40000.0
1,"Unique cookies to click ""Start free trial"" button per day",3200.0
2,Enrollments per day,660.0
3,"Probability of payment, given enroll",0.53
4,"Click-through probability on ""Start free trial"" button",0.08
5,"Probability of enroll, given click",0.20625
6,"Probability of payment, given click",0.109313


# Metrics
We consider the following **invariant** metrics:
- **Number of cookies** (number of unique cookies to view the course overview page)
- **Number of clicks** (number of unique cookies to click the "Start free trial" button)
- **Click-through probability** (number of unique cookies to click the "Start free trial" button divided by the number of unique cookies to view the course overview page)

and the following **evaluation** metrics:
- **Gross conversion** (number of user-ids to enroll in the free trial divided by the number of unique cookies to click the "Start free trial" button)
- **Net conversion** (number of user-ids to remain enrolled past the 14-day boundary divided by the number of unique cookies to click the "Start free trial" button)
- **Retention** (number of user-ids to remain enrolled past the 14-day boundary divided by the number of user-ids to enroll).

Recall that cookie uniqueness is determined per day, whereas user-ids are unique by construction.

We also define the practical significance boundary $d_{\text{min}}$ for each evaluation metric, i.e. the smallest difference between experiment and control that would be meaningful for the business:
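As a rough illustration of the definitions above, the three evaluation metrics can be written as ratios of counts. The sketch below is not part of the original analysis: the helper name `evaluation_metrics` and its arguments are hypothetical, and the payment count is an assumed value obtained from the baseline figures (0.53 × 660).

```python
def evaluation_metrics(clicks, enrollments, payments):
    """Return (gross conversion, net conversion, retention) from raw counts.

    Hypothetical helper: `clicks` counts unique cookies, while
    `enrollments` and `payments` count user-ids.
    """
    gross_conversion = enrollments / clicks
    net_conversion = payments / clicks
    retention = payments / enrollments
    return gross_conversion, net_conversion, retention


# Baseline daily counts; the payment count is an assumption (0.53 * 660).
gross, net, ret = evaluation_metrics(clicks=3200, enrollments=660, payments=0.53 * 660)
print(gross, net, ret)
```

Note that net conversion is the product of gross conversion and retention, which is one reason the three metrics are correlated.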

In [522]:
# number of evaluation metrics
num_eval_metrics = 3

# gross conversion
d_min_gross_conv = -0.013

# net conversion
d_min_net_conv = 0.01

# retention
d_min_ret = 0.01

# Metric variability
As an intermediate step, we can estimate the variability (standard deviation) of the chosen evaluation metrics analytically with an example sample size of 5,000 unique cookies that visit the course overview page. This exercise allows us to discuss what distribution the evaluation metrics follow.

In [523]:
sample_size = 5000

### Gross conversion
The number of enrollments follows a binomial distribution, with $n$ the number of unique cookies to click the "Start free trial" button and $\hat{p}$ the "probability of enroll, given click". Its expected value is $n \hat{p}$ and its variance is $n \hat{p} (1-\hat{p})$. Dividing the count by $n$ gives the gross conversion itself, whose expected value is $\hat{p}$ and whose standard deviation is $\sqrt{\hat{p} (1-\hat{p})/n}$.

In [493]:
p_hat_gross_conv = baseline_df.iloc[5]['Baseline_Value']
n_gross_conv = sample_size*baseline_df.iloc[4]['Baseline_Value']

In [494]:
# compute the std analytically
std_gross_conv = np.sqrt(p_hat_gross_conv*(1-p_hat_gross_conv)/n_gross_conv)
round(std_gross_conv,4)

0.0202
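One way to sanity-check the analytic standard deviation is a small simulation. The following is only a sketch using the Python standard library: the seed and the number of simulated samples (`n_replicates`) are arbitrary assumptions, and $n = 400$ is the number of clicks implied by 5,000 pageviews at the 8% baseline click-through probability.

```python
import math
import random

p = 660 / 3200   # baseline probability of enrolling, given click
n = 400          # clicks expected from 5,000 pageviews at 8% click-through

# Analytic standard deviation of the rescaled binomial
analytic_std = math.sqrt(p * (1 - p) / n)

rng = random.Random(0)   # fixed seed (arbitrary choice) for reproducibility
n_replicates = 2000      # assumed number of simulated samples

# Each replicate draws n Bernoulli(p) outcomes and records the observed proportion
simulated = []
for _ in range(n_replicates):
    enrollments = sum(rng.random() < p for _ in range(n))
    simulated.append(enrollments / n)

mean = sum(simulated) / n_replicates
empirical_std = math.sqrt(sum((x - mean) ** 2 for x in simulated) / (n_replicates - 1))

print(round(analytic_std, 4), round(empirical_std, 4))
```

The two values should agree to within simulation noise, which supports using the analytic formula when the unit of diversion and the unit of analysis coincide (both are cookies here).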

### Net conversion
Similarly, this metric is defined as the "probability of payment, given click" ($\hat{p}$) and $n$ is the same as for the gross conversion case.

In [495]:
p_hat_net_conv = baseline_df.iloc[6]['Baseline_Value']
n_net_conv = n_gross_conv

In [496]:
# compute the std analytically
std_net_conv = np.sqrt(p_hat_net_conv*(1-p_hat_net_conv)/n_net_conv)
round(std_net_conv,4)

0.0156

### Retention
In this case, this metric is defined as the "probability of payment, given enroll" ($\hat{p}$) and $n$ is the number of user-ids to enroll.

In [497]:
p_hat_ret = baseline_df.iloc[3]['Baseline_Value']
n_ret = sample_size*baseline_df.iloc[2]['Baseline_Value']/baseline_df.iloc[0]['Baseline_Value']

In [498]:
# compute the std analytically
std_ret = np.sqrt(p_hat_ret*(1-p_hat_ret)/n_ret)
round(std_ret,4)

0.0549

The final check in this intermediate exercise is that, for our example sample size of 5,000 pageviews, $n p$ and $n q$ (with $q = 1-p$) are both larger than 5 for the three metrics. If these two conditions are met, we can approximate the binomial distribution $B(n,p)$ with the normal distribution $N(np, npq)$, and we can therefore construct normal-based confidence intervals around the metric point estimates (the chosen metrics are probabilities, i.e. sample means, so the central limit theorem applies). Both conditions are met for all three metrics:

In [555]:
print(n_gross_conv*p_hat_gross_conv,n_gross_conv*(1-p_hat_gross_conv))
print(n_net_conv*p_hat_net_conv,n_net_conv*(1-p_hat_net_conv))
print(n_ret*p_hat_ret, n_ret*(1-p_hat_ret))

82.5 317.5
43.725 356.275
43.725 38.775
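To get a feel for how good the normal approximation is at this sample size, one can compare it with the exact binomial distribution. The sketch below (standard library only, gross conversion only, with a hypothetical helper `binom_pmf`) computes the exact probability that the observed proportion falls within $\pm z \cdot \text{SE}$ of $\hat{p}$; if the approximation holds, this should be close to the nominal 95%.

```python
import math
from statistics import NormalDist

n = 400          # clicks for 5,000 pageviews
p = 660 / 3200   # baseline probability of enrolling, given click

se = math.sqrt(p * (1 - p) / n)
z = NormalDist().inv_cdf(0.975)   # two-sided 95% critical value


def binom_pmf(k, n, p):
    """Exact binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Exact probability mass of all outcomes inside the normal-based interval
exact_coverage = sum(
    binom_pmf(k, n, p) for k in range(n + 1) if abs(k / n - p) <= z * se
)
print(round(exact_coverage, 3))
```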


# Size
We have two groups (experiment and control), and we choose a significance level $\alpha = 0.05$ and a power $1-\beta = 0.80$. The sample size required in each group can be computed with the following formula (see eq. (2.33) of [this reference](http://vanbelle.org/chapters%5Cwebchapter2.pdf)), where $\hat{p}$ is the baseline value of the metric:

$$ n = \frac{\left (z_{1-\alpha/2} \sqrt{2 \hat{p} (1-\hat{p})}+z_{1-\beta} \sqrt{\hat{p} (1-\hat{p})+(\hat{p}+d_{\text{min}}) (1-(\hat{p}+d_{\text{min}}))} \right)^2}{d^2_{\text{min}}} \, .$$

Since we are tracking and testing multiple metrics, we might consider applying the Bonferroni correction to reduce the possibility of observing false positives. However, given that the three evaluation metrics are to some extent correlated, applying the Bonferroni correction might be overly conservative.

We define the chosen $\alpha$ and $\beta$ and whether to apply the Bonferroni correction:

In [500]:
alpha=0.05
beta=0.20
apply_bonferroni=False

In [501]:
if apply_bonferroni:
    ind_alpha=alpha/num_eval_metrics
else:
    ind_alpha=alpha

In [502]:
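def compute_group_size(alpha,beta,p,dmin):
    """Per-group sample size for a two-sided test on the difference of two proportions."""
    # for p > 0.5, flip the sign of dmin (for a positive dmin this moves
    # p+dmin toward 0.5, where the binomial variance is larger)
    if p>0.5:
        dmin=-dmin
    
    z_alpha = stats.norm.ppf(1-alpha/2)
    z_beta = stats.norm.ppf(1-beta)
    return (z_alpha*np.sqrt(2*p*(1-p))
            + z_beta*np.sqrt(p*(1-p)+(p+dmin)*(1-(p+dmin))))**2/dmin**2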
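As an independent cross-check of the formula, here is a minimal sketch that uses only the standard library (`statistics.NormalDist` instead of `scipy.stats`). The function name `group_size` is hypothetical, and this version deliberately omits the sign flip for $\hat{p} > 0.5$, so it is only meant for metrics with a baseline below 0.5 such as gross conversion.

```python
import math
from statistics import NormalDist


def group_size(alpha, beta, p, d_min):
    """Per-group sample size to detect a change d_min from baseline p."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = NormalDist().inv_cdf(1 - beta)         # quantile for the chosen power
    p_alt = p + d_min                               # proportion under the alternative
    numerator = (z_alpha * math.sqrt(2 * p * (1 - p))
                 + z_beta * math.sqrt(p * (1 - p) + p_alt * (1 - p_alt))) ** 2
    return numerator / d_min ** 2


# Gross conversion: baseline 660/3200, practical significance boundary -0.013
n_clicks = group_size(alpha=0.05, beta=0.20, p=660 / 3200, d_min=-0.013)

# Convert clicks to pageviews with the baseline click-through probability (8%)
n_pageviews = n_clicks / (3200 / 40000)
print(round(n_clicks), round(n_pageviews))
```

Because the required size scales with $1/d^2_{\text{min}}$, halving the practical significance boundary roughly quadruples the number of clicks needed.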

### Gross conversion
Determine required number of unique cookies (per variation) to click "Start free trial" button:

In [524]:
group_size_gross_conv = int(round(compute_group_size(ind_alpha,beta,p_hat_gross_conv,d_min_gross_conv),0))
group_size_gross_conv

15097

Determine required number of unique cookies to visit course overview page (per variation):

In [525]:
group_pageviews_gross_conv = int(round(group_size_gross_conv/baseline_df.iloc[4]['Baseline_Value'],0))
group_pageviews_gross_conv

188712

### Net conversion
Determine required number of unique cookies (per variation) to click "Start free trial" button:

In [526]:
group_size_net_conv = int(round(compute_group_size(ind_alpha,beta,p_hat_net_conv,d_min_net_conv),0))
group_size_net_conv

15464

Determine required number of unique cookies to visit course overview page (per variation):

In [527]:
group_pageviews_net_conv = int(round(group_size_net_conv/baseline_df.iloc[4]['Baseline_Value'],0))
group_pageviews_net_conv

193300

### Retention
Determine required number of user-ids to enroll (per variation):

In [528]:
group_size_ret = int(round(compute_group_size(ind_alpha,beta,p_hat_ret,d_min_ret),0))
group_size_ret

39115

Determine required number of unique cookies to visit course overview page (per variation):

In [529]:
group_pageviews_ret = int(round(group_size_ret/baseline_df.iloc[2]['Baseline_Value']*baseline_df.iloc[0]['Baseline_Value'],0))
group_pageviews_ret

2370606

If we want to track all evaluation metrics, we need to consider the maximum of all the required sizes:

In [530]:
max([group_pageviews_gross_conv,group_pageviews_net_conv,group_pageviews_ret])

2370606

We observe that the number of unique cookies needed to test retention is much higher than the number needed to test gross and net conversions. We might consider dropping the retention metric if it requires too many cookies, too high a percentage of diverted traffic, or too long an experiment. In that case, the number of unique cookies needed (per group) is:

In [531]:
max([group_pageviews_gross_conv,group_pageviews_net_conv])

193300

# Duration vs Exposure
We know from the historical data that our traffic consists of approximately 40,000 unique cookies visiting the course overview page per day. If we consider all three metrics, then even with 100% of traffic diverted the experiment would take $2 \times 2{,}370{,}606 / 40{,}000 \approx 118.5$ days, which is definitely too long (and leaves no room for any other experiment).

If we drop retention from the set of evaluation metrics and divert 50% of traffic (20,000 unique cookies per day), we would need $2 \times 193{,}300 / 20{,}000 \approx 19.3$ days, which is acceptable. In addition, the experiment does not seem very risky for the users involved or for the business itself, so diverting 50% of traffic seems reasonable.
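The duration arithmetic above can be sketched as a small helper for comparing diversion levels. The function name `experiment_days` and the extra 75% scenario are illustrative assumptions, not part of the original analysis.

```python
import math


def experiment_days(pageviews_per_group, daily_pageviews, diverted_fraction):
    """Days needed to collect the pageviews for both groups."""
    total_pageviews = 2 * pageviews_per_group   # experiment + control
    return total_pageviews / (daily_pageviews * diverted_fraction)


# Gross and net conversion only (193,300 pageviews per group)
for fraction in (1.0, 0.75, 0.5):
    days = experiment_days(193300, 40000, fraction)
    print(f"{fraction:.0%} of traffic: {math.ceil(days)} days")
```

Rounding up with `math.ceil` reflects that the experiment runs for whole days.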

# Load experiment and control data

Meaning of each column in the dataframes below:
- **Pageviews**: number of unique cookies to view the course overview page that day
- **Clicks**: number of unique cookies to click the "Start free trial" button that day
- **Enrollments**: number of user-ids to enroll in the free trial that day
- **Payments**: number of user-ids who enrolled on that day and remained enrolled past the 14-day boundary (thus making a payment).

Because of the 14-day free-trial window, enrollments and payments are tracked for 14 fewer days than the other columns.

In [551]:
control_df = pd.read_csv('./data/control_results.csv')
experiment_df = pd.read_csv('./data/experiment_results.csv')

In [472]:
control_df.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0


In [473]:
experiment_df.tail()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
32,"Wed, Nov 12",10042,802,,
33,"Thu, Nov 13",9721,829,,
34,"Fri, Nov 14",9304,770,,
35,"Sat, Nov 15",8668,724,,
36,"Sun, Nov 16",8988,710,,


# Analysis

We need to consider only the days for which we have enrollment and payment information as well, since the evaluation metrics depend on these.

In [532]:
final_control_df = control_df[control_df['Enrollments'].notnull()]
final_experiment_df = experiment_df[experiment_df['Enrollments'].notnull()]

In [533]:
# create summary dataframes with totals
final_control_summary_df = pd.DataFrame(final_control_df.sum()).drop('Date').rename(columns={0:'Control'})
final_experiment_summary_df = pd.DataFrame(final_experiment_df.sum()).drop('Date').rename(columns={0:'Experiment'})

In [534]:
# join on index
final_results_df = final_control_summary_df.join(final_experiment_summary_df)
final_results_df

Unnamed: 0,Control,Experiment
Pageviews,212163,211362
Clicks,17293,17260
Enrollments,3785,3423
Payments,2033,1945


The totals for pageviews and clicks are above the minimum sample sizes required to test gross and net conversions, but below the minimum size required for retention. For the purpose of this analysis, we decide to drop retention from the set of evaluation metrics.

### Sanity checks on invariant metrics
Check whether invariant metrics are equivalent between experiment and control groups.

**Pageviews** (unique cookies to visit course overview page) and **clicks** (unique cookies to click the "Start free trial" button) should be split randomly between the two groups with $p=0.50$, so the number of cookies assigned to the control group follows a binomial distribution (we define "control" as the success outcome). The sample size is large enough to use the normal approximation, so for both metrics we construct a confidence interval around the expected fraction of cookies in the control group ($p=0.50$) and check whether the observed fraction falls inside it. We choose $\alpha=0.05$ ($z^*=1.96$).

In [535]:
sanity_check_df = final_results_df.drop(['Enrollments','Payments'])
sanity_check_df

Unnamed: 0,Control,Experiment
Pageviews,212163,211362
Clicks,17293,17260


In [536]:
sanity_check_df['Total'] = sanity_check_df['Control']+sanity_check_df['Experiment']
sanity_check_df['Observed_Fraction'] = sanity_check_df['Control']/sanity_check_df['Total']
sanity_check_df['Expected_Fraction'] = 0.5
sanity_check_df['Std'] = np.sqrt(sanity_check_df['Expected_Fraction']*(1-sanity_check_df['Expected_Fraction'])/sanity_check_df['Total'])
sanity_check_df['Margin_Error'] = 1.96*sanity_check_df['Std']
sanity_check_df['Lower_CI'] = sanity_check_df['Expected_Fraction']-sanity_check_df['Margin_Error']
sanity_check_df['Upper_CI'] = sanity_check_df['Expected_Fraction']+sanity_check_df['Margin_Error']
sanity_check_df['Pass_Test'] = sanity_check_df.apply(lambda x: x['Observed_Fraction']>=x['Lower_CI'] and x['Observed_Fraction']<=x['Upper_CI'], axis=1)

In [537]:
sanity_check_df

Unnamed: 0,Control,Experiment,Total,Observed_Fraction,Expected_Fraction,Std,Margin_Error,Lower_CI,Upper_CI,Pass_Test
Pageviews,212163,211362,423525,0.500946,0.5,0.000768,0.001506,0.498494,0.501506,True
Clicks,17293,17260,34553,0.500478,0.5,0.00269,0.005272,0.494728,0.505272,True
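The same check can be written as a standalone function. This is a minimal sketch with a hypothetical helper `split_check`; it uses the exact normal quantile from `statistics.NormalDist` rather than the rounded 1.96, so the bounds may differ from the table in the last digits.

```python
import math
from statistics import NormalDist


def split_check(control, experiment, expected_fraction=0.5, confidence=0.95):
    """Check that the control share of a count is consistent with the expected split."""
    total = control + experiment
    observed = control / total
    std = math.sqrt(expected_fraction * (1 - expected_fraction) / total)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower = expected_fraction - z * std
    upper = expected_fraction + z * std
    return observed, lower, upper, lower <= observed <= upper


pageviews_check = split_check(212163, 211362)
clicks_check = split_check(17293, 17260)
print(pageviews_check)
print(clicks_check)
```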


For the **click-through probability**, we construct the 95% confidence interval around the expected difference in click-through probabilities ($d_{\text{exp}}=0$) and check whether the observed difference is within the confidence interval.

In [538]:
add_sanity_check_df = pd.DataFrame({'Control':pd.Series([final_results_df.loc['Clicks','Control']/final_results_df.loc['Pageviews','Control']], index=['Click-through Probability']),
                                    'Experiment':pd.Series([final_results_df.loc['Clicks','Experiment']/final_results_df.loc['Pageviews','Experiment']], index=['Click-through Probability'])})
add_sanity_check_df

Unnamed: 0,Control,Experiment
Click-through Probability,0.081508,0.081661


In [539]:
add_sanity_check_df['Observed_Diff'] = add_sanity_check_df['Experiment']-add_sanity_check_df['Control']
add_sanity_check_df['Expected_Diff'] = 0.0
add_sanity_check_df['Pooled_Prob'] = (final_results_df.loc['Clicks','Control']+final_results_df.loc['Clicks','Experiment'])/(final_results_df.loc['Pageviews','Control']+final_results_df.loc['Pageviews','Experiment'])
add_sanity_check_df['Pooled_SE'] = np.sqrt(add_sanity_check_df['Pooled_Prob']*(1-add_sanity_check_df['Pooled_Prob'])*(1/final_results_df.loc['Pageviews','Control']+1/final_results_df.loc['Pageviews','Experiment']))
add_sanity_check_df['Margin_Error'] = 1.96*add_sanity_check_df['Pooled_SE']
add_sanity_check_df['Lower_CI'] = add_sanity_check_df['Expected_Diff']-add_sanity_check_df['Margin_Error']
add_sanity_check_df['Upper_CI'] = add_sanity_check_df['Expected_Diff']+add_sanity_check_df['Margin_Error']
add_sanity_check_df['Pass_Test'] = add_sanity_check_df.apply(lambda x: x['Observed_Diff']>=x['Lower_CI'] and x['Observed_Diff']<=x['Upper_CI'], axis=1)

In [540]:
add_sanity_check_df

Unnamed: 0,Control,Experiment,Observed_Diff,Expected_Diff,Pooled_Prob,Pooled_SE,Margin_Error,Lower_CI,Upper_CI,Pass_Test
Click-through Probability,0.081508,0.081661,0.000153,0.0,0.081584,0.000841,0.001649,-0.001649,0.001649,True
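For reference, the pooled standard error check on a difference of two proportions can be condensed into one function. This is a sketch with a hypothetical helper `pooled_diff_check`, using the click and pageview totals from the results table.

```python
import math


def pooled_diff_check(x_control, n_control, x_experiment, n_experiment, z=1.96):
    """Test whether the difference of two proportions is consistent with zero."""
    observed_diff = x_experiment / n_experiment - x_control / n_control
    pooled_prob = (x_control + x_experiment) / (n_control + n_experiment)
    pooled_se = math.sqrt(pooled_prob * (1 - pooled_prob)
                          * (1 / n_control + 1 / n_experiment))
    margin = z * pooled_se
    # The check passes when the observed difference lies within +/- margin of zero
    return observed_diff, margin, abs(observed_diff) <= margin


diff, margin, passed = pooled_diff_check(17293, 212163, 17260, 211362)
print(round(diff, 6), round(margin, 6), passed)
```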


Since all sanity checks pass, no further investigation (e.g. looking at the day-by-day breakdown) is needed.

### Evaluation metrics

#### Effect Size Test

For each evaluation metric (gross and net conversion) we construct the 95% confidence interval around the observed difference between experiment and control. If the interval does not include zero, the difference is statistically significant. If, in addition, the whole interval lies beyond $d_{\text{min}}$ (above it for a positive $d_{\text{min}}$, below it for a negative one), the difference is also practically significant.

In [541]:
eval_metrics_df = pd.DataFrame({'Control':pd.Series([final_results_df.loc['Enrollments','Control']/final_results_df.loc['Clicks','Control'],
                                                     final_results_df.loc['Payments','Control']/final_results_df.loc['Clicks','Control']],
                                                    index=['Gross conversion','Net conversion']),
                                'Experiment':pd.Series([final_results_df.loc['Enrollments','Experiment']/final_results_df.loc['Clicks','Experiment'],
                                                        final_results_df.loc['Payments','Experiment']/final_results_df.loc['Clicks','Experiment']],
                                                       index=['Gross conversion','Net conversion'])})
eval_metrics_df

Unnamed: 0,Control,Experiment
Gross conversion,0.218875,0.19832
Net conversion,0.117562,0.112688


In [542]:
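def pass_practical_test(row):
    """True if the whole confidence interval lies beyond the practical significance boundary."""
    # positive boundary: the interval must sit entirely above it
    if row['Minimum_Diff']>=0:
        return row['Minimum_Diff']<row['Lower_CI']
    # negative boundary: the interval must sit entirely below it
    else:
        return row['Minimum_Diff']>row['Upper_CI']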

In [543]:
eval_metrics_df['Observed_Diff'] = eval_metrics_df['Experiment']-eval_metrics_df['Control']
eval_metrics_df['Expected_Diff'] = 0.0
eval_metrics_df['Minimum_Diff'] = [d_min_gross_conv,d_min_net_conv]

pooled_prob_gross_conv = (final_results_df.loc['Enrollments','Control']+final_results_df.loc['Enrollments','Experiment'])/(final_results_df.loc['Clicks','Control']+final_results_df.loc['Clicks','Experiment'])
pooled_prob_net_conv = (final_results_df.loc['Payments','Control']+final_results_df.loc['Payments','Experiment'])/(final_results_df.loc['Clicks','Control']+final_results_df.loc['Clicks','Experiment'])
eval_metrics_df['Pooled_Prob'] = [pooled_prob_gross_conv,pooled_prob_net_conv]
eval_metrics_df['Pooled_SE'] = np.sqrt(eval_metrics_df['Pooled_Prob']*(1-eval_metrics_df['Pooled_Prob'])*(1/final_results_df.loc['Clicks','Control']+1/final_results_df.loc['Clicks','Experiment']))

eval_metrics_df['Margin_Error'] = 1.96*eval_metrics_df['Pooled_SE']
eval_metrics_df['Lower_CI'] = eval_metrics_df['Observed_Diff']-eval_metrics_df['Margin_Error']
eval_metrics_df['Upper_CI'] = eval_metrics_df['Observed_Diff']+eval_metrics_df['Margin_Error']

eval_metrics_df['Pass_Stat_Test'] = eval_metrics_df.apply(lambda x: not (x['Lower_CI']<=x['Expected_Diff']<=x['Upper_CI']), axis=1)
eval_metrics_df['Pass_Prac_Test'] = eval_metrics_df.apply(pass_practical_test, axis=1)

In [544]:
eval_metrics_df

Unnamed: 0,Control,Experiment,Observed_Diff,Expected_Diff,Minimum_Diff,Pooled_Prob,Pooled_SE,Margin_Error,Lower_CI,Upper_CI,Pass_Stat_Test,Pass_Prac_Test
Gross conversion,0.218875,0.19832,-0.020555,0.0,-0.013,0.208607,0.004372,0.008568,-0.029123,-0.011986,True,False
Net conversion,0.117562,0.112688,-0.004874,0.0,0.01,0.115127,0.003434,0.006731,-0.011605,0.001857,False,False
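The two decisions in the table can be reproduced directly from the interval bounds. The sketch below uses a hypothetical helper `classify_effect` and the rounded bounds printed above; it mirrors the logic of `pass_practical_test` without pandas.

```python
def classify_effect(lower_ci, upper_ci, d_min):
    """Return (statistically significant, practically significant) for a CI."""
    # Statistically significant when zero is outside the interval
    statistically_significant = not (lower_ci <= 0.0 <= upper_ci)
    # Practically significant when the whole interval lies beyond d_min
    if d_min >= 0:
        practically_significant = lower_ci > d_min
    else:
        practically_significant = upper_ci < d_min
    return statistically_significant, practically_significant


gross = classify_effect(-0.029123, -0.011986, d_min=-0.013)
net = classify_effect(-0.011605, 0.001857, d_min=0.01)
print(gross, net)
```

For gross conversion the upper bound (-0.011986) sits just above the boundary of -0.013, which is why the decrease is statistically but not practically significant under this criterion.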


#### Sign Test

In [545]:
sign_test_df = final_control_df.merge(final_experiment_df,on='Date')

sign_test_df['Gross_Conversion_Control'] = sign_test_df['Enrollments_x']/sign_test_df['Clicks_x']
sign_test_df['Net_Conversion_Control'] = sign_test_df['Payments_x']/sign_test_df['Clicks_x']

sign_test_df['Gross_Conversion_Experiment'] = sign_test_df['Enrollments_y']/sign_test_df['Clicks_y']
sign_test_df['Net_Conversion_Experiment'] = sign_test_df['Payments_y']/sign_test_df['Clicks_y']

In [546]:
sign_test_n = len(sign_test_df)

We count the number of days on which the control gross/net conversion is higher than the experiment's:

In [547]:
gross_conv_k = sum(sign_test_df['Gross_Conversion_Control']>sign_test_df['Gross_Conversion_Experiment'])
gross_conv_k

19

In [548]:
net_conv_k = sum(sign_test_df['Net_Conversion_Control']>sign_test_df['Net_Conversion_Experiment'])
net_conv_k

13

What is the probability of obtaining results at least as extreme as these (p-value)? Under the null hypothesis of no difference, each day is equally likely to show a positive or a negative difference, so the number of days $k$ on which control exceeds experiment follows a binomial distribution with $n$ the number of days in the dataframe and $p=0.5$. The two-sided p-value is twice the probability of obtaining *at least* $k$ successes out of $n$ trials (doubling the upper tail is valid here because both counts exceed $n/2$):

In [549]:
gross_conv_cumul_prob=0

# multiply by 2 because it is a two-sided test
for k in range(gross_conv_k,sign_test_n+1):
    gross_conv_cumul_prob+=2*stats.binom.pmf(k, sign_test_n, 0.5)
    
gross_conv_cumul_prob

0.002599477767944336

In [550]:
net_conv_cumul_prob=0

# multiply by 2 because it is a two-sided test
for k in range(net_conv_k,sign_test_n+1):
    net_conv_cumul_prob+=2*stats.binom.pmf(k, sign_test_n, 0.5)
    
net_conv_cumul_prob

0.6776394844055196
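The loops above double the upper tail, which works only when $k \geq n/2$. A slightly more general sketch, using only the standard library, doubles the smaller of the two tails and caps the result at 1. The helper name `sign_test_p_value` is hypothetical, and $n = 23$ is an assumption inferred from the printed p-values rather than a value shown in the notebook.

```python
import math


def sign_test_p_value(k, n):
    """Two-sided sign test p-value for k successes in n fair trials."""

    def pmf(i):
        return math.comb(n, i) / 2 ** n

    upper_tail = sum(pmf(i) for i in range(k, n + 1))   # P(X >= k)
    lower_tail = sum(pmf(i) for i in range(0, k + 1))   # P(X <= k)
    return min(1.0, 2 * min(upper_tail, lower_tail))


# n = 23 days is an assumed value (see the note above)
p_gross = sign_test_p_value(19, 23)
p_net = sign_test_p_value(13, 23)
print(p_gross, p_net)
```

Compared with $\alpha = 0.05$, these p-values point the same way as the effect size test: the change in gross conversion is statistically significant, while the change in net conversion is not.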