##0. Experiment Design

###Metric Choice

###Evaluation Metrics:

####Gross Conversion
The number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. In order to launch the product change, we'd need an absolute minimum difference of 0.01.

Why an evaluation metric: Since the two branches of traffic would see/not see the screener, the metric would tell us how the new screener affects enrollment in the class.

Why not an invariant metric: Since the two branches of traffic would see/not see the screener, we'd not expect the metric to be the same between the two branches.

####Net Conversion
The number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. In order to launch the product change, we'd need an absolute minimum difference of 0.0075.

Why an evaluation metric: Since the two branches of traffic would see/not see the screener, the metric would tell us if the screener has an effect on the retention of students in the class.

Why not an invariant metric: Since the two branches of traffic would see/not see the screener, we'd not expect the metric to be the same between the two branches.


###Invariant Metrics:

####Number of cookies
The number of unique cookies to view the course overview page. Since the traffic is randomly split, we'd expect the same number of cookies.

Why not an evaluation metric: Since the two branches of traffic are randomly split and have not reached the screener (or the lack thereof), we'd not expect there to be a difference.

Why an invariant metric: Since the two branches of traffic are randomly split and have not reached the screener (or the lack thereof), we'd expect them to be the same.

####Number of clicks
The number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). Since this step occurs before the free trial screener, we'd expect the same number of cookies in both traffic streams.

Why not an evaluation metric: Since the two branches of traffic are randomly split and have not reached the screener (or the lack thereof), we'd not expect there to be a difference.

Why an invariant metric: Since the two branches of traffic are randomly split and have not reached the screener (or the lack thereof), we'd expect them to be the same.

###Neither Evaluation nor Invariant Metrics:

####Number of user-ids
The number of users who enroll in the free trial.

Why not an evaluation metric: Since Gross Conversion already captures enrollment, this metric is extraneous.

Why not an invariant metric: Since the two branches of traffic would see/not see the screener, we'd not expect this metric to be the same between the two branches.

####Click-through-probability 
The number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page.

Why not an evaluation metric: Since the two branches of traffic are randomly split and have not reached the screener (or the lack thereof), we'd not expect there to be a difference.

Why not an invariant metric: Since the metric is the ratio of the number of clicks to the number of cookies, both of which are already invariant metrics, this metric is extraneous.

####Retention
The number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. 

Why not an evaluation metric: Since Net Conversion already captures retention, this metric is extraneous.

Why not an invariant metric: Since the two branches of traffic would see/not see the screener, we'd not expect this metric to be the same between the two branches.
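
To make the definitions above concrete, here is a minimal sketch (written for Python 3) that computes each ratio metric from the baseline daily counts reported in the baseline table of this write-up. The variable names are illustrative only. It also shows that Net Conversion is the product of Gross Conversion and Retention, which is why Retention carries little extra information once the other two are tracked.

```python
# Baseline daily counts and rates, copied from the baseline table in this write-up.
cookies = 40000.0               # unique cookies to view the course overview page
clicks = 3200.0                 # unique cookies to click "Start free trial"
enrollments = 660.0             # user-ids to enroll in the free trial
retention = 0.53                # probability of payment, given enroll

# Ratio metrics as defined in the Metric Choice section.
click_through_probability = clicks / cookies
gross_conversion = enrollments / clicks
net_conversion = gross_conversion * retention  # payments / clicks

print("click-through-probability:", round(click_through_probability, 6))
print("gross conversion:", round(gross_conversion, 6))
print("net conversion:", round(net_conversion, 6))
```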

##1. Measuring Standard Deviation

Standard deviation of evaluation metrics:

In [1]:
import pandas as pd
import numpy as np
from math import sqrt

In [4]:
baseline_values = pd.read_csv('Final Project Baseline Values - Sheet1.csv', header=None)

In [12]:
baseline_values = baseline_values.set_index(0)

In [26]:
# Scale the daily counts to a sample of 5000 pageviews (only the count rows are meaningful here)
baseline_values['sample_size'] = baseline_values[1] * (5000/40000.0)

In [27]:
baseline_values

                                                                1  sample_size
0
Unique cookies to view page per day:                      40000.0       5000.0
Unique cookies to click "Start free trial" per day:        3200.0        400.0
Enrollments per day:                                        660.0         82.5
Click-through-probability on "Start free trial":             0.08         0.01
Probability of enrolling, given click:                    0.20625     0.025781
Probability of payment, given enroll:                        0.53      0.06625
Probability of payment, given click                      0.109313     0.013664


In [50]:
# Gross Conversion

p_hat = baseline_values.loc['Probability of enrolling, given click:', 1]
N = baseline_values.loc['Unique cookies to click "Start free trial" per day:', 'sample_size']
SE_p = sqrt(p_hat*(1-p_hat)/(N))
print p_hat
print SE_p
print p_hat * N

0.20625
0.020230604137
82.5


In [49]:
# Net Conversion

p_hat = baseline_values.loc['Probability of payment, given click', 1]
N = baseline_values.loc['Unique cookies to click "Start free trial" per day:', 'sample_size']
SE_p = sqrt(p_hat*(1-p_hat)/(N))
print p_hat
print SE_p
print p_hat * N

0.1093125
0.0156015445825
43.725


For both evaluation metrics, the analytical estimate is likely to be comparable to the empirical estimate. Each metric is a binomial proportion (enrollments or payments out of N clicks), and since p_hat * N > 5 and (1-p_hat) * N > 5, the normal approximation that the analytical estimate relies on is reasonable. If the two estimates do differ slightly, the analytical estimate would be the one that underestimates the variance.

Furthermore, the unit of analysis (the denominator, i.e. cookies) and the unit of diversion for the experiment (also cookies) are the same, so we can be confident in using the analytical estimate of the variance.
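
As a rough illustration of the claim that the analytical and empirical estimates should agree, the sketch below simulates independent clicks and compares the spread of the simulated Gross Conversion values with the analytical standard error. This is a simulation under the assumption that each click converts independently with probability p_hat, not an empirical estimate from real traffic; the function names, the number of replicates and the seed are arbitrary choices made for the example.

```python
import random
from math import sqrt

def analytic_se(p, n):
    """Standard error of a binomial proportion with n trials."""
    return sqrt(p * (1 - p) / n)

def empirical_se(p, n, n_replicates=2000, seed=0):
    """Standard deviation of the estimated proportion over simulated samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_replicates):
        # Each of the n clicks converts independently with probability p.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        estimates.append(successes / n)
    mean = sum(estimates) / n_replicates
    variance = sum((x - mean) ** 2 for x in estimates) / (n_replicates - 1)
    return sqrt(variance)

# Gross Conversion baseline: p_hat = 0.20625 with 400 clicks per 5000 pageviews.
p_hat, n_clicks = 0.20625, 400
se_analytic = analytic_se(p_hat, n_clicks)
se_empirical = empirical_se(p_hat, n_clicks)
print("analytic SE: ", se_analytic)
print("empirical SE:", se_empirical)
```

If the unit of diversion were coarser than the unit of analysis, the independence assumption in the simulation would no longer hold and the two numbers could diverge.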

##2. Sizing

Since we have two evaluation metrics, the Bonferroni correction would be the conservative way to keep the overall false positive rate at alpha across both hypothesis tests. However, with the correction the experiment would take too long to achieve enough power, so it is not applied here; the sizing uses alpha = 0.05 and beta = 0.2 for each metric.
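
For reference, this is a small sketch of what the Bonferroni correction would change, assuming two metrics tested at an overall alpha of 0.05. The variable names are illustrative. A larger critical value means a wider confidence interval at the same sample size, which is why more pageviews would be needed to keep the same power.

```python
from statistics import NormalDist  # Python 3.8+

alpha = 0.05
n_metrics = 2

# Bonferroni: split the overall alpha evenly across the tests.
alpha_per_test = alpha / n_metrics

# Two-sided critical values with and without the correction.
z_uncorrected = NormalDist().inv_cdf(1 - alpha / 2)
z_bonferroni = NormalDist().inv_cdf(1 - alpha_per_test / 2)

print("alpha per test:", alpha_per_test)
print("z without correction:", round(z_uncorrected, 3))
print("z with Bonferroni:   ", round(z_bonferroni, 3))
```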

To calculate the number of pageviews required, the baseline p_hat and the absolute minimum difference are plugged into this [calculator](http://www.evanmiller.org/ab-testing/sample-size.html). The results are below:

In [51]:
# Net Conversion

sample_size_per_branch = 27345
pageviews = baseline_values.loc['Unique cookies to view page per day:', 1]
clicks = baseline_values.loc['Unique cookies to click "Start free trial" per day:', 1]
print sample_size_per_branch*2 * (pageviews / float(clicks)) 

683625.0


In [87]:
# Gross Conversion

sample_size_per_branch = 25812
pageviews = baseline_values.loc['Unique cookies to view page per day:', 1]
clicks = baseline_values.loc['Unique cookies to click "Start free trial" per day:', 1]
print sample_size_per_branch*2 * (pageviews / float(clicks)) 

645300.0


We take the larger of the two, 683625 pageviews, so that both metrics are adequately powered.
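
The per-branch sample sizes above were read off an online calculator. As a hedged sketch of the kind of formula behind such calculators, the function below uses one common normal-approximation formula for comparing two proportions. The calculator may use a different variant or differently rounded inputs, so its output need not match 27345 and 25812 exactly. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist  # Python 3.8+

def sample_size_per_branch(p_baseline, d_min, alpha=0.05, beta=0.2):
    """Approximate clicks needed per branch to detect an absolute change d_min."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    p_alt = p_baseline + d_min
    # Standard deviation of the difference under the null and the alternative.
    sd_null = sqrt(2 * p_baseline * (1 - p_baseline))
    sd_alt = sqrt(p_baseline * (1 - p_baseline) + p_alt * (1 - p_alt))
    return ceil(((z_alpha * sd_null + z_beta * sd_alt) / d_min) ** 2)

def pageviews_needed(clicks_per_branch, click_through_probability=0.08):
    """Convert clicks per branch into total pageviews across both branches."""
    return 2 * clicks_per_branch / click_through_probability

print("gross conversion:", sample_size_per_branch(0.20625, 0.01))
print("net conversion:  ", sample_size_per_branch(0.1093125, 0.0075))
print("pageviews for 27345 clicks per branch:", pageviews_needed(27345))
```

Dividing by the click-through-probability of 0.08 is the same conversion as multiplying by pageviews / clicks in the cells above.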

###Duration vs. Exposure
Since this is a relatively low-risk experiment, I'd split the traffic 50-50 to conduct it. A 50-50 split puts the largest possible amount of traffic into each of the experiment and control groups, which is needed because of the high number of pageviews required to establish statistical power for the hypothesis test. Even so, the experiment would take 35 days to complete given the number of pageviews needed and the daily traffic.

However, given the benefit of better retaining the users enrolled in the course (i.e. reducing churn), it would be worthwhile to run it for that long. The cost of acquiring a customer (in this case, online marketing campaigns, etc.) is often much higher than the cost of retaining one (in this case, understanding whether anything can be done to influence the customer to stay), so such a long experiment would be worth running.
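
The duration arithmetic can be sketched as follows. The parameter `fraction_diverted` is a hypothetical name for the share of daily pageviews that enters the experiment (control plus experiment group together). Under the assumption that the 35-day figure corresponds to half of the baseline 40000 daily pageviews entering the experiment, the numbers work out as shown; sending all traffic into the experiment would roughly halve the duration.

```python
from math import ceil

def days_needed(total_pageviews, daily_pageviews, fraction_diverted):
    """Whole days required to collect total_pageviews in the experiment."""
    return ceil(total_pageviews / (daily_pageviews * fraction_diverted))

print(days_needed(683625, 40000, 0.5))  # -> 35
print(days_needed(683625, 40000, 1.0))  # -> 18
```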

##3. Experiment Analysis
###Sanity Checks
For each invariant metric, we check that the observed fraction of traffic in the control group falls within the 95% confidence interval around the expected value of 0.5. Both pageviews and clicks do, so both metrics pass the sanity check.

In [53]:
control = pd.read_csv('Final Project Results - Control.csv')
experiment = pd.read_csv('Final Project Results - Experiment.csv')

In [56]:
print control.head()

          Date  Pageviews  Clicks  Enrollments  Payments
0  Sat, Oct 11       7723     687          134        70
1  Sun, Oct 12       9102     779          147        70
2  Mon, Oct 13      10511     909          167        95
3  Tue, Oct 14       9871     836          156       105
4  Wed, Oct 15      10014     837          163        64


In [88]:
# The sanity checks use pageviews and clicks from every day of the experiment;
# enrollment and payment data (available for the first 23 days only) are not needed here.

pageviews_control = sum(control['Pageviews'])
pageviews_experiment = sum(experiment['Pageviews'])
clicks_control = sum(control['Clicks'])
clicks_experiment = sum(experiment['Clicks'])

In [89]:
# pageviews

SD = sqrt(0.5*0.5/float(pageviews_control + pageviews_experiment))
margin_of_error = 1.96 * SD # assuming 95% C.I
CI_LB = 0.5 - margin_of_error
CI_UB = 0.5 + margin_of_error
print 'interval: ', (CI_LB, CI_UB)
print 'observed: ', (pageviews_control) / float(pageviews_control + pageviews_experiment)

interval:  (0.49882039214902313, 0.5011796078509769)
observed:  0.500639666881


In [90]:
# clicks

SD = sqrt(0.5*0.5/float(clicks_control + clicks_experiment))
margin_of_error = 1.96 * SD # assuming 95% C.I
CI_LB = 0.5 - margin_of_error
CI_UB = 0.5 + margin_of_error
print 'interval: ', (CI_LB, CI_UB)
print 'observed: ', (clicks_control) / float(clicks_control + clicks_experiment)

interval:  (0.49588449572378945, 0.5041155042762105)
observed:  0.500467347407


##4. Result Analysis

####Effect Size Tests

In [73]:
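# Gross Conversion

# Enrollment data exist only for the first 23 days, so the click counts
# must be restricted to the same 23 days.
enrollments_control = sum(control['Enrollments'][:23])
enrollments_experiment = sum(experiment['Enrollments'][:23])

N_cont = float(sum(control['Clicks'][:23]))
N_exp = float(sum(experiment['Clicks'][:23]))
X_cont = enrollments_control
X_exp = enrollments_experiment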

In [74]:
p_hat = (X_cont + X_exp) / (N_cont + N_exp)
SE_pool = sqrt((p_hat)*(1 - p_hat)*(1/N_cont + 1/N_exp))
d_hat = X_exp/N_exp - X_cont/N_cont
margin_of_error = SE_pool * 1.96 
CI_LB = d_hat - margin_of_error
CI_UB = d_hat + margin_of_error
print 'interval: ', (CI_LB, CI_UB)

# Since the CI does not contain 0, the effect is statistically significant. Since the entire CI
# lies below -0.01 (the minimum desirable effect), it is also practically significant.

interval:  (-0.029123358335404401, -0.01198639082531873)


In [75]:
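# Net Conversion

# Payment data exist only for the first 23 days, so the click counts
# must be restricted to the same 23 days.
payments_control = sum(control['Payments'][:23])
payments_experiment = sum(experiment['Payments'][:23])

N_cont = float(sum(control['Clicks'][:23]))
N_exp = float(sum(experiment['Clicks'][:23]))
X_cont = payments_control
X_exp = payments_experiment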

In [76]:
p_hat = (X_cont + X_exp) / (N_cont + N_exp)
SE_pool = sqrt((p_hat)*(1 - p_hat)*(1/N_cont + 1/N_exp))
d_hat = X_exp/N_exp - X_cont/N_cont
margin_of_error = SE_pool * 1.96 
CI_LB = d_hat - margin_of_error
CI_UB = d_hat + margin_of_error
print 'interval: ', (CI_LB, CI_UB)

# Since the CI contains 0, the effect is not statistically significant, and therefore
# also not practically significant.

interval:  (-0.011604624359891718, 0.001857179010803383)
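
The two effect size cells above repeat the same steps, so they could be folded into one helper. The sketch below does that and makes the decision rules explicit. It assumes that "practically significant" means the whole interval lies beyond the minimum desirable effect `d_min` in either direction, which is one reasonable reading of the comments above. The function name and the counts in the usage lines are hypothetical, not taken from the experiment data.

```python
from math import sqrt

def effect_size_ci(x_cont, n_cont, x_exp, n_exp, d_min, z=1.96):
    """Pooled-SE confidence interval for the difference in two proportions."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d_hat = x_exp / n_exp - x_cont / n_cont
    lower = d_hat - z * se_pool
    upper = d_hat + z * se_pool
    # Statistically significant: the interval excludes 0.
    statistically_significant = not (lower <= 0 <= upper)
    # Practically significant (assumed rule): the interval lies entirely beyond d_min.
    practically_significant = upper < -d_min or lower > d_min
    return lower, upper, statistically_significant, practically_significant

# Hypothetical counts: 220 of 1000 in control vs. 180 of 1000 in experiment.
print(effect_size_ci(220, 1000, 180, 1000, d_min=0.01))
```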


####Sign Tests
Based on the [sign test](http://graphpad.com/quickcalcs/binomial2/), at alpha = 0.05 the gross conversion effect is statistically significant, since its two-tailed p-value is less than 0.05, whereas the net conversion effect is not, since its two-tailed p-value is greater than 0.05.

In [79]:
control['enrollments_clicks'] = control['Enrollments'] / control['Clicks']
control['payments_clicks'] = control['Payments'] / control['Clicks']

experiment['enrollments_clicks'] = experiment['Enrollments'] / experiment['Clicks']
experiment['payments_clicks'] = experiment['Payments'] / experiment['Clicks']

In [84]:
# Gross Conversion

n_positives = sum((experiment['enrollments_clicks'] > control['enrollments_clicks'])[:23])
n_days = 23
print 'number of positives: ', n_positives

# 2-tailed p-value = 0.0026

number of positives:  4


In [85]:
# Net Conversion

n_positives = sum((experiment['payments_clicks'] > control['payments_clicks'])[:23])
print 'number of positives: ', n_positives

# 2-tailed p-value = 0.6776

number of positives:  10
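
The two-tailed p-values quoted in the comments above came from an online calculator. They can also be reproduced directly from the binomial distribution, as in the sketch below. It assumes that under the null hypothesis each day is equally likely to favor either group (p = 0.5) and doubles the smaller tail, which is the usual two-sided convention for a symmetric binomial; the function name is hypothetical.

```python
from math import comb  # Python 3.8+

def sign_test_p_value(n_positives, n_days):
    """Two-tailed sign test p-value under a fair-coin null hypothesis."""
    def tail(k_from, k_to):
        # Probability of observing between k_from and k_to positives, inclusive.
        return sum(comb(n_days, k) for k in range(k_from, k_to + 1)) / 2 ** n_days

    lower_tail = tail(0, n_positives)
    upper_tail = tail(n_positives, n_days)
    return min(1.0, 2 * min(lower_tail, upper_tail))

print(round(sign_test_p_value(4, 23), 4))   # Gross Conversion -> 0.0026
print(round(sign_test_p_value(10, 23), 4))  # Net Conversion   -> 0.6776
```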


##5. Summary
As stated above, the Bonferroni correction is not used because the experiment would simply take too long with it. The sign tests and the effect size tests agree with each other.

##6. Recommendation
Since the Gross Conversion effect is both statistically and practically significant, we can conclude that the extra free trial screener has a visible business impact on enrollment. Specifically, it reduces the fraction of people who enroll in the course, which is what we'd expect (the screener is meant to reduce the number of students who sign up without having the time to commit).

As for Net Conversion, the result is not statistically significant, meaning that the students who sign up after learning about the time commitment do not necessarily remain enrolled. This requires additional thought and discussion with the product team.

Before planning additional experiments, it might be worthwhile to add an exit survey that asks students why they are cancelling when they decide to cancel their subscription. This would lead to a better understanding of the students and better intuition for designing features that reduce churn.