In [1]:
import pandas as pd
import numpy as np
import math

[A/B Testing to Determine an Effective Approach to Reduce Early Udacity Course Cancellation](https://github.com/shubhamlal11/Udacity-AB-Testing-Final-Project)
==========================================================================================

Note: Another resource from Kaggle:
[AB Testing With Python - Walkthrough Udacity's Course Final Project](https://www.kaggle.com/tammyrotem/ab-tests-with-python/notebook#AB-Testing-With-Python---Walkthrough-Udacity's-Course-Final-Project)

Note: Another resource from Towards Data Science:
[A Summary of Udacity A/B Testing Course](https://towardsdatascience.com/a-summary-of-udacity-a-b-testing-course-9ecc32dedbb1)

Experiment Description
----------------------

At the time of this experiment, Udacity courses currently have two
options on the home page: "start free trial", and "access course
materials". If the student clicks "start free trial", they will be asked
to enter their credit card information, and then they will be enrolled
in a free trial for the paid version of the course. After 14 days, they
will automatically be charged unless they cancel first. If the student
clicks "access course materials", they will be able to view the videos
and take the quizzes for free, but they will not receive coaching
support or a verified certificate, and they will not submit their final
project for feedback.

In the experiment, Udacity tested a change where if the student clicked
"start free trial", they were asked how much time they had available to
devote to the course. If the student indicated 5 or more hours per week,
they would be taken through the checkout process as usual. If they
indicated fewer than 5 hours per week, a message would appear indicating
that Udacity courses usually require a greater time commitment for
successful completion, and suggesting that the student might like to
access the course materials for free. At this point, the student would
have the option to continue enrolling in the free trial, or access the
course materials for free instead. This screenshot shows the experiment:

![Experiment Screenshot](exp_screenshot.png)

The primary aim of Udacity is to improve the overall student experience
and improve coaches' capacity to support students who are likely to
complete the course.

**Null Hypothesis:** The free trial screener makes no significant
difference; it is not effective in reducing early Udacity course
cancellation.

**Alternative Hypothesis:** The free trial screener reduces the number
of frustrated students who leave the free trial because they don't have
enough time, without significantly reducing the number of students who
continue past the free trial and eventually complete the course.

Experimental Design
-------------------

The unit of diversion is a cookie, although if the student enrolls in
the free trial, they are tracked by user-id from that point forward. The
same user-id cannot enroll in the free trial twice. For users that do
not enroll, their user-id is not tracked in the experiment, even if they
were signed in when they visited the course overview page.

### Metric Choice

_**Invariant Metrics :** number of cookies, number of clicks, click-through-probability._

_**Evaluation Metrics :** gross conversion, retention, net conversion._

#### Invariant Metrics

Invariant metrics are those that should not be affected by the
experiment. One would expect a similar distribution of such metrics on
both the control and experiment sides. In the given experiment, the
invariant metrics are as follows:

**Number of cookies:** That is, number of unique cookies to view the
course overview page. This is the unit of diversion and even
distribution amongst the control and experiment groups is expected.

**Number of clicks:** That is, number of unique cookies to click the
"Start free trial" button (which happens before the free trial screener
is triggered). Equal distribution amongst the experiment and control
groups would be expected since at this point in the funnel the
experience is the same for all users and therefore elements of the
experiment would not be expected to impact clicking the "start free
trial" button.

**Click-through-probability:** That is, number of unique cookies to
click the "Start free trial" button divided by number of unique cookies
to view the course overview page. Until the user clicks the "start free
trial" button, the experience is the same for all users. Hence, we
expect equal distribution in both groups.

#### Evaluation Metrics

Evaluation metrics are those expected to differ between the experiment
and control groups as a result of the change. Each evaluation metric is
associated with a minimum difference (dmin) that must be observed for
the metric to count toward the decision to launch. The ultimate goal is
to minimize student frustration and use the limited coaching resources
most efficiently. With this in mind, the following conditions must be
satisfied:

-   Increased retention, i.e., the ratio of users who remained enrolled
    past the 14-day boundary to the number of users who completed
    checkout should increase.

-   Decreased gross conversion coupled with increased net conversion,
    i.e., fewer students enrolling in the free trial but more students
    staying beyond the free trial.

**Gross conversion:** That is, number of user-ids to complete checkout
and enroll in the free trial divided by number of unique cookies to
click the "Start free trial" button.

**Retention:** That is, number of user-ids to remain enrolled past the
14-day boundary (and thus make at least one payment) divided by number
of user-ids to complete checkout.

**Net conversion:** That is, number of user-ids to remain enrolled past
the 14-day boundary (and thus make at least one payment) divided by the
number of unique cookies to click the "Start free trial" button.
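
The three definitions above can be sketched as one small helper. This is an illustrative sketch, not part of the original analysis: the function name `evaluation_metrics` is hypothetical, the click and enrollment counts are the daily baseline figures used later in this notebook, and the payment count of 350 is made up purely for the example.

```python
def evaluation_metrics(clicks, enrollments, payments):
    """Return (gross conversion, retention, net conversion) from raw counts."""
    gross_conversion = enrollments / clicks   # enrolled, given click
    retention = payments / enrollments        # paid, given enrolled
    net_conversion = payments / clicks        # paid, given click
    return gross_conversion, retention, net_conversion

# 3,200 clicks and 660 enrollments are the daily baseline counts;
# payments=350 is a hypothetical value chosen only for illustration.
gc, ret, nc = evaluation_metrics(clicks=3200, enrollments=660, payments=350)
print(gc, ret, nc)
```

Note that net conversion is the product of gross conversion and retention, which is why a drop in gross conversion has to be offset by a rise in retention for net conversion to hold steady.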

#### Unused Metrics

**Number of user-ids:** The number of users who enroll in the free
trial. User-ids are tracked only after enrolling in the free trial and
equal distribution between the control and experimental branches would
not be expected. User-id count could be used to evaluate how many
enrollments stayed beyond the 14 day free trial boundary, but since it
isn't normalized, I have elected not to use it.

Measuring Standard Deviation
----------------------------

***Analytical Estimate of Standard Deviation of Evaluation Metrics***

For each of the metrics, the standard deviation is calculated for a
sample size of 5000 unique cookies visiting the course overview page.
The standard deviations are calculated using the [Baseline
Values](/data/baseline_vals.csv).

In [2]:
pageviews=5000

In [3]:
df_basevals = pd.read_csv('data/baseline_vals.csv', index_col=False, header=None, names=['metric','baseline_val'])
df_basevals.metric = df_basevals.metric.map(lambda x: x.lower())
df_basevals

Unnamed: 0,metric,baseline_val
0,unique cookies to view page per day:,40000.0
1,"unique cookies to click ""start free trial"" per...",3200.0
2,enrollments per day:,660.0
3,"click-through-probability on ""start free trial"":",0.08
4,"probability of enrolling, given click:",0.20625
5,"probability of payment, given enroll:",0.53
6,"probability of payment, given click",0.109313


$sd = \sqrt{\frac{\widehat{p}(1 - \widehat{p})}{N}}$

In [4]:
# Gross Conversion Standard Deviation = sqrt(phat * (1 - phat) / N),
# where N is the number of clicks expected from the 5000-pageview sample
phat = 660/3200
print('phat =', phat)
print('sd =', round(np.sqrt((phat*(1-phat))/(pageviews*3200/40000)), 4))

phat = 0.20625
sd = 0.0202


In [5]:
# Retention Standard Deviation
print('sd =', round(np.sqrt((0.53*(1-0.53))/(5000*660/40000)), 4))

sd = 0.0549


In [6]:
# Net Conversion Standard Deviation
print('sd =', round(np.sqrt((0.109313*(1-0.109313))/(5000*3200/40000)), 4))

sd = 0.0156


<table>
<thead>
<tr class="header">
<th align="center">Evaluation Metric</th>
<th align="center">Standard Deviation</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="center">Gross Conversion</td>
<td align="center">.0202</td>
</tr>
<tr class="even">
<td align="center">Retention</td>
<td align="center">.0549</td>
</tr>
<tr class="odd">
<td align="center">Net Conversion</td>
<td align="center">.0156</td>
</tr>
</tbody>
</table>
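
The three cells above follow the same pattern: scale the 5,000-pageview sample down to the metric's denominator unit, then apply the binomial standard deviation formula. A minimal sketch of that pattern, using only the standard library, is shown here. The helper name `analytic_sd` and its parameter names are hypothetical; the baseline numbers are the ones listed in the table of baseline values.

```python
import math

def analytic_sd(p, pageviews, units_per_day, pageviews_per_day=40000):
    """Binomial standard deviation of a probability metric.

    n is the expected number of denominator units (clicks or enrollments)
    produced by the given number of pageviews.
    """
    n = pageviews * units_per_day / pageviews_per_day
    return math.sqrt(p * (1 - p) / n)

sd_gross = analytic_sd(0.20625, 5000, units_per_day=3200)    # denominator: clicks
sd_retention = analytic_sd(0.53, 5000, units_per_day=660)    # denominator: enrollments
sd_net = analytic_sd(0.109313, 5000, units_per_day=3200)     # denominator: clicks
print(round(sd_gross, 4), round(sd_retention, 4), round(sd_net, 4))
```

Retention has the largest standard deviation because its denominator (enrollments) is the rarest event, so the same 5,000 pageviews yield only 82.5 expected units.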

Sizing
------

The following calculation is based on [baseline conversion data](/data/baseline_vals.csv).

### Number of Samples vs Power

Pageviews required for each evaluation metric are calculated separately using the [online calculator](http://www.evanmiller.org/ab-testing/sample-size.html). An alpha value of 0.05 and a beta value of 0.20 are used in all cases. The Bonferroni correction will not be used.

$n = \frac{\left(Z_{1 - \frac{\alpha}{2}}\,sd_{1} + Z_{1 - \beta}\,sd_{2}\right)^{2}}{d^{2}}$, where

$sd_{1} = \sqrt{p(1 - p)+p(1 - p)}$

$sd_{2} = \sqrt{p(1 - p)+(p+d)(1 - (p+d))}$
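
As a worked check of the expression the sizing code below implements (the weighted sum of standard deviations is squared before dividing by $d^{2}$), take the gross conversion inputs $p = 0.20625$ and $d = 0.01$, with $Z_{0.975} \approx 1.960$ and $Z_{0.80} \approx 0.8416$. The intermediate values are rounded for display only:

$sd_{1} = \sqrt{2(0.20625)(0.79375)} \approx 0.5722$

$sd_{2} = \sqrt{(0.20625)(0.79375) + (0.21625)(0.78375)} \approx 0.5772$

$n = \frac{(1.960 \times 0.5722 + 0.8416 \times 0.5772)^{2}}{0.01^{2}} \approx 25{,}835$

Carrying unrounded intermediates gives about 25,835 clicks per group, which agrees with the value reported for gross conversion below.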

In [7]:
from scipy.stats import norm

# Inputs: cumulative probability (e.g. 1 - alpha/2 for a two-sided test, or 1 - beta)
# Returns: z-score (standard normal quantile) for that probability
def get_z_score(alpha):
    return norm.ppf(alpha)

# Inputs: p: baseline conversion rate (estimated p)
#         d: minimum detectable effect (d_min)
# Returns: sd for the baseline and sd for the expected change
def get_sds(p, d):
    sd1 = np.sqrt(2*p*(1 - p))
    sd2 = np.sqrt(p*(1-p) + (p+d)*(1-(p+d)))
    sds = [sd1, sd2]
    return sds

# Inputs: sds: [sd1, sd2] as returned by get_sds
#              (sd for the baseline and sd for the expected change)
#         alpha: significance level
#         beta: type II error rate (power = 1 - beta)
#         d: minimum detectable effect (d_min)
# Returns: minimum sample size required per group, in units of the metric's denominator
def get_sampSize(sds, alpha, beta, d):
    n = pow((get_z_score(1-alpha/2)*sds[0] + get_z_score(1-beta)*sds[1]), 2) / pow(d, 2)
    return n

***Pageviews for Each Evaluation Metric to Achieve Target Statistical Power***

#### Gross Conversion

* Baseline Conversion: 20.625%
* Minimum Detectable Effect: 1%
* $\alpha$ = P(reject null | null true) = 0.05
* $\beta$ = P(fail to reject | null false) = 0.20
* Sensitivity = (1 - $\beta$) = 0.80 power
* Sample Size = 25,835 clicks/group
* Number of Groups = 2 (experiment and control)
* Total Sample Size = 51,670 clicks
* Clicks/Pageview: 3,200/40,000 = 0.08 clicks/pageview
* Pageviews Required = (51,670/0.08) = 645,875

In [8]:
GC_SampSize = round(get_sampSize(get_sds(0.20625, 0.01), 0.05, 0.2, 0.01))
GC_SampSize

25835.0

#### Retention

* Baseline Conversion: 53%
* Minimum Detectable Effect: 1%
* $\alpha$ = P(reject null | null true) = 0.05
* $\beta$ = P(fail to reject | null false) = 0.20
* Sensitivity = (1 - $\beta$) = 0.80 power
* Sample Size = 39,115 enrollments/group (value from the online calculator; the formula in the cell below gives 39,087)
* Number of Groups = 2 (experiment and control)
* Total Sample Size = 78,230 enrollments
* Enrollments/Pageview: 660/40,000 = 0.0165 enrollments/pageview
* Pageviews = (78,230/0.0165) = 4,741,212

In [9]:
R_SampSize = round(get_sampSize(get_sds(0.53, 0.01), 0.05, 0.2, 0.01))
R_SampSize

39087.0

#### Net Conversion

* Baseline Conversion: 10.9313%
* Minimum Detectable Effect: 0.75%
* $\alpha$ = P(reject null | null true) = 0.05
* $\beta$ = P(fail to reject | null false) = 0.20
* Sensitivity = (1 - $\beta$) = 0.80 power
* Sample Size = 27,413 clicks/group
* Number of Groups = 2 (experiment and control)
* Total Sample Size = 54,826 clicks
* Clicks/Pageview: 3,200/40,000 = 0.08 clicks/pageview
* Pageviews = (54,826/0.08) = 685,325

In [10]:
NC_SampSize = round(get_sampSize(get_sds(0.109313, 0.0075), 0.05, 0.2, 0.0075))
NC_SampSize

27413.0

*Pageviews required is maximum of pageviews required for Gross Conversion, Retention, Net Conversion. Therefore, the required pageviews is 4,741,212.*
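
The conversion from a per-group sample size to required pageviews, used in each of the three bullet lists above, can be sketched as follows. The function name `pageviews_required` and its parameters are hypothetical; the daily counts (40,000 pageviews, 3,200 clicks, 660 enrollments) are the baseline values, and rounding to the nearest integer is an assumption made to match the figures quoted above.

```python
def pageviews_required(n_per_group, units_per_day, pageviews_per_day=40000, groups=2):
    """Pageviews needed so that both groups together reach the required
    number of denominator units (clicks or enrollments)."""
    total_units = n_per_group * groups
    return round(total_units * pageviews_per_day / units_per_day)

pv_gross = pageviews_required(25835, units_per_day=3200)      # clicks
pv_retention = pageviews_required(39115, units_per_day=660)   # enrollments
pv_net = pageviews_required(27413, units_per_day=3200)        # clicks
print(max(pv_gross, pv_retention, pv_net))
```

Retention dominates because only 660 of every 40,000 pageviews become enrollments, so each required enrollment costs roughly 60 pageviews.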

### Duration and Exposure
If we divert 100% of traffic, given 40,000 page views per day, the
experiment would take ~ 119 days.  That is a long time.

In [11]:
# 100% diversion
math.ceil(4741212/(40000*1.0))

119

If we eliminate Retention, we are left with Gross Conversion and Net Conversion. This reduces the number of required pageviews to 685,325, and an ~ 18 day experiment with 100% diversion and ~ 35 days given 50% diversion. There may be other experiments to run, so let's say 50% diversion for 35 days.

In [12]:
# 100% diversion
math.ceil(685325/(40000*1.0))

18

In [13]:
# 50% diversion
math.ceil(685325/(40000*0.5))

35

A 119 day experiment with 100% diversion of traffic presents both a
business risk (potential for: frustrated students, lower conversion and
retention, and inefficient use of coaching resources) and an opportunity
risk (performing other experiments). However, in general, this is not a
risky experiment as the change would not be expected to cause a
precipitous drop in enrollment. In terms of timing, an 18 day experiment
is more reasonable, but % diversion may be scaled down depending on
other experiments of interest to be performed concurrently.
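
To compare diversion rates quickly, the duration arithmetic from the cells above can be wrapped in a small function. This is a sketch with a hypothetical name, `experiment_days`, and it assumes traffic stays at the baseline 40,000 pageviews per day for the whole run.

```python
import math

def experiment_days(pageviews_needed, diversion, pageviews_per_day=40000):
    """Whole days needed to collect the pageviews at a given traffic fraction."""
    if not 0 < diversion <= 1:
        raise ValueError("diversion must be in (0, 1]")
    return math.ceil(pageviews_needed / (pageviews_per_day * diversion))

days_with_retention = experiment_days(4741212, diversion=1.0)
days_full = experiment_days(685325, diversion=1.0)
days_half = experiment_days(685325, diversion=0.5)
print(days_with_retention, days_full, days_half)
```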

Experiment Analysis
-------------------

The experiment data can be found in the following links :
* [Control Group Data](/data/Final%20Project%20Results%20-%20Control) 
* [Experiment Group Data](/data/Final%20Project%20Results%20-%20Experiment)

In [14]:
df_control = pd.read_csv('data/Final Project Results - Control.csv')
df_experiment = pd.read_csv('data/Final Project Results - Experiment.csv')

In [15]:
results = {'Control': pd.Series([df_control.Pageviews.sum(), df_control.Clicks.sum(),
                                 df_control.Enrollments.sum(), df_control.Payments.sum()],
                                index = ['cookies','clicks', 'enrollments','payments']),
           'Experiment': pd.Series([df_experiment.Pageviews.sum(), df_experiment.Clicks.sum(),
                                    df_experiment.Enrollments.sum(), df_experiment.Payments.sum()],
                                   index = ['cookies', 'clicks', 'enrollments', 'payments'])}
df_results = pd.DataFrame(results)
df_results

Unnamed: 0,Control,Experiment
cookies,345543.0,344660.0
clicks,28378.0,28325.0
enrollments,3785.0,3423.0
payments,2033.0,1945.0


### Sanity Checks

This check is primarily for the invariant metrics. For invariant metrics
we expect equal diversion into the experiment and control group. We will
test this at the 95% confidence level.

#### Count Metrics

For a count, calculate a confidence interval around the fraction of events you expect to be assigned to the control group, and the observed value should be the actual fraction that was assigned to the control group.

In [16]:
df_results['Total'] = df_results.Control + df_results.Experiment
df_results['Prob'] = 0.5
df_results['StdErr'] = np.sqrt((df_results.Prob * (1 - df_results.Prob))/df_results.Total)
df_results["MargErr"] = 1.96 * df_results.StdErr
df_results["CI_Lower"] = df_results.Prob - df_results.MargErr
df_results["CI_Upper"] = df_results.Prob + df_results.MargErr
df_results["Obs_Val"] = df_results.Control/df_results.Total
df_results["Pass_Sanity"] = df_results.apply(lambda x: (x.Obs_Val > x.CI_Lower) and (x.Obs_Val < x.CI_Upper), axis=1)
df_results['Rel_Diff'] = abs((df_results.Experiment - df_results.Control)/df_results.Total)

In [17]:
df_results

Unnamed: 0,Control,Experiment,Total,Prob,StdErr,MargErr,CI_Lower,CI_Upper,Obs_Val,Pass_Sanity,Rel_Diff
cookies,345543.0,344660.0,690203.0,0.5,0.000602,0.00118,0.49882,0.50118,0.50064,True,0.001279
clicks,28378.0,28325.0,56703.0,0.5,0.0021,0.004116,0.495884,0.504116,0.500467,True,0.000935
enrollments,3785.0,3423.0,7208.0,0.5,0.005889,0.011543,0.488457,0.511543,0.525111,False,0.050222
payments,2033.0,1945.0,3978.0,0.5,0.007928,0.015538,0.484462,0.515538,0.511061,True,0.022122
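
For a single count, the same check can be written without a DataFrame. The sketch below uses a hypothetical helper, `count_sanity_check`, with the same z value of 1.96 and expected fraction of 0.5 as the cell above. Only cookies and clicks are invariant metrics; enrollments are included in the usage lines just to show what a failing check looks like.

```python
import math

def count_sanity_check(n_control, n_experiment, z=1.96, expected=0.5):
    """Return (ci_lower, ci_upper, observed_fraction, passed) for a count metric."""
    total = n_control + n_experiment
    se = math.sqrt(expected * (1 - expected) / total)
    lower, upper = expected - z * se, expected + z * se
    observed = n_control / total
    return lower, upper, observed, lower < observed < upper

cookies_check = count_sanity_check(345543, 344660)
clicks_check = count_sanity_check(28378, 28325)
enrollments_check = count_sanity_check(3785, 3423)   # not an invariant metric
print(cookies_check, clicks_check, enrollments_check)
```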


#### Other Metrics

In [18]:
# Control Values
cookies_cont = df_results.loc['cookies', 'Control']
clicks_cont = df_results.loc['clicks', 'Control']

print('Control cookies:', cookies_cont)
print('Control clicks:', clicks_cont)

# Experiment Values
cookies_exp = df_results.loc['cookies', 'Experiment']
clicks_exp = df_results.loc['clicks', 'Experiment']

print('Experiment cookies:', cookies_exp)
print('Experiment clicks:', clicks_exp)

Control cookies: 345543.0
Control clicks: 28378.0
Experiment cookies: 344660.0
Experiment clicks: 28325.0


$H_{0}: p_{cont} = p_{exp}\ (p_{exp} - p_{cont} = 0)$ <br>
$H_{a}: p_{cont} \neq p_{exp}$

$\hat{p_{pool}} = \frac{x_{cont}+x_{exp}}{N_{cont}+N_{exp}}$ <br>
$\hat{se_{pool}} = \sqrt{\hat{p_{pool}}(1 - \hat{p_{pool}})(\frac{1}{N_{cont}}+\frac{1}{N_{exp}})}$

$\widehat{d} = \hat{p_{exp}} - \hat{p_{cont}}$ <br>
$H_{0}: d = 0 \ where \ \widehat{d}\sim N(0, \hat{se_{pool}})$

$If \ \widehat{d} > 1.96*\hat{se_{pool}} \ or \ \widehat{d} < -1.96*\hat{se_{pool}}, \ reject \ H_{0} \ (i.e., outside \ of \ CI)$ <br>
$95\% \ CI: \ (\widehat{d} - ME, \ \widehat{d} + ME)$

In [19]:
# Click Through Probability (clicks/cookies)

# Control Proportion 
cont_p_hat = clicks_cont/cookies_cont
print('Control Proportion:', round(cont_p_hat, 4))

# Experiment Proportion
exp_p_hat = clicks_exp/cookies_exp
print('Experiment Proportion:', round(exp_p_hat, 4))

# Observed Difference d_hat
d_hat = exp_p_hat - cont_p_hat
print('Observed Difference:', round(d_hat, 4))
print()

# Pooled Estimated Proportion
p_hat_pooled = (clicks_cont + clicks_exp)/(cookies_cont + cookies_exp)
print('Pooled Estimated Proportion:', round(p_hat_pooled, 4))

# Pooled Standard Error
SE_pooled = np.sqrt((p_hat_pooled * (1 - p_hat_pooled))*((1/cookies_cont) + (1/cookies_exp)))
print('Pooled Standard Error:', round(SE_pooled, 4))

# Margin of Error for 95% Confidence Interval (z = 1.96)
ME = 1.96 * SE_pooled
print('Margin of Error:', round(ME, 4))

# 95% Confidence Interval
lower_CI = d_hat - ME
print('CI Lower Bound:', round(lower_CI, 4))
upper_CI = d_hat + ME
print('CI Upper Bound:', round(upper_CI, 4))

Control Proportion: 0.0821
Experiment Proportion: 0.0822
Observed Difference: 0.0001

Pooled Estimated Proportion: 0.0822
Pooled Standard Error: 0.0007
Margin of Error: 0.0013
CI Lower Bound: -0.0012
CI Upper Bound: 0.0014


In [20]:
print('95% CI: (', round(lower_CI, 4), ',', round(upper_CI, 4), '); Is 0 within this range?')

95% CI: ( -0.0012 , 0.0014 ); Is 0 within this range?


<table style="width:100%;">
<colgroup>
<col width="10%" />
<col width="20%" />
<col width="20%" />
<col width="20%" />
<col width="20%" />
<col width="10%" />
</colgroup>
<thead>
<tr class="header">
<th align="center">Metric</th>
<th align="center">Expected Value</th>
<th align="center">Observed Value</th>
<th align="center">CI Lower Bound</th>
<th align="center">CI Upper Bound</th>
<th align="center">Result</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="center">Number of Cookies</td>
<td align="center">0.5000</td>
<td align="center">0.5006</td>
<td align="center">0.4988</td>
<td align="center">0.5012</td>
<td align="center">Pass</td>
</tr>
<tr class="even">
<td align="center">Number of clicks on &quot;start free trial&quot;</td>
<td align="center">0.5000</td>
<td align="center">0.5005</td>
<td align="center">0.4959</td>
<td align="center">0.5041</td>
<td align="center">Pass</td>
</tr>
<tr class="odd">
<td align="center">Click-through-probability (experiment - control)</td>
<td align="center">0.0000</td>
<td align="center">0.0001</td>
<td align="center">-0.0012</td>
<td align="center">0.0014</td>
<td align="center">Pass</td>
</tr>
</tbody>
</table>

### Result Analysis

95% Confidence interval for the difference between the experiment and
control group for evaluation metrics.

In [21]:
# Keep only the days that have enrollment data; each frame is filtered by its own column
df_control_notnull = df_control[df_control.Enrollments.notnull()]
df_experiment_notnull = df_experiment[df_experiment.Enrollments.notnull()]

In [22]:
results_notnull = {'Control': pd.Series([df_control_notnull.Pageviews.sum(), df_control_notnull.Clicks.sum(),
                                         df_control_notnull.Enrollments.sum(), df_control_notnull.Payments.sum()],
                                        index = ['cookies', 'clicks', 'enrollments', 'payments']),
                   'Experiment': pd.Series([df_experiment_notnull.Pageviews.sum(), df_experiment_notnull.Clicks.sum(),
                                            df_experiment_notnull.Enrollments.sum(), df_experiment_notnull.Payments.sum()],
                                           index = ['cookies', 'clicks', 'enrollments', 'payments'])}
df_results_notnull = pd.DataFrame(results_notnull)
df_results_notnull['Total'] = df_results_notnull.Control + df_results_notnull.Experiment
df_results_notnull

Unnamed: 0,Control,Experiment,Total
cookies,212163.0,211362.0,423525.0
clicks,17293.0,17260.0,34553.0
enrollments,3785.0,3423.0,7208.0
payments,2033.0,1945.0,3978.0


In [23]:
# Control Values
enrollments_cont = df_results_notnull.loc['enrollments'].Control
clicks_cont = df_results_notnull.loc['clicks'].Control
payments_cont = df_results_notnull.loc['payments'].Control

print('Control enrollments:', enrollments_cont)
print('Control clicks:', clicks_cont)
print('Control payments:', payments_cont)

# Experiment Values
enrollments_exp = df_results_notnull.loc['enrollments'].Experiment
clicks_exp = df_results_notnull.loc['clicks'].Experiment
payments_exp = df_results_notnull.loc['payments'].Experiment

print('Experiment enrollments:', enrollments_exp)
print('Experiment clicks:', clicks_exp)
print('Experiment payments:', payments_exp)

Control enrollments: 3785.0
Control clicks: 17293.0
Control payments: 2033.0
Experiment enrollments: 3423.0
Experiment clicks: 17260.0
Experiment payments: 1945.0


#### Metrics

$H_{0}: p_{cont} = p_{exp}\ (p_{exp} - p_{cont} = 0)$ <br>
$H_{a}: p_{cont} \neq p_{exp}$

$\hat{p_{pool}} = \frac{x_{cont}+x_{exp}}{N_{cont}+N_{exp}}$ <br>
$\hat{se_{pool}} = \sqrt{\hat{p_{pool}}(1 - \hat{p_{pool}})(\frac{1}{N_{cont}}+\frac{1}{N_{exp}})}$

$\widehat{d} = \hat{p_{exp}} - \hat{p_{cont}}$ <br>
$H_{0}: d = 0 \ where \ \widehat{d}\sim N(0, \hat{se_{pool}})$

$If \ \widehat{d} > 1.96*\hat{se_{pool}} \ or \ \widehat{d} < -1.96*\hat{se_{pool}}, \ reject \ H_{0} \ (i.e., outside \ of \ CI)$ <br>
$95\% \ CI: \ (\widehat{d} - ME, \ \widehat{d} + ME)$

In [24]:
def effect_size_test(x_cont, x_exp, N_cont, N_exp, d_min):
    
    ## Control Proportion 
    cont_p_hat = x_cont/N_cont
    print('Control Proportion:', round(cont_p_hat, 4))

    # Experiment Proportion
    exp_p_hat = x_exp/N_exp
    print('Experiment Proportion:', round(exp_p_hat, 4))

    # Observed Difference d_hat
    d_hat = exp_p_hat - cont_p_hat
    print('Observed Difference:', round(d_hat, 4))

    print('Minimum Detectable Effect (d_min):', d_min)
    print()
    
    # Pooled Estimated Proportion
    p_hat_pooled = (x_cont + x_exp)/(N_cont + N_exp)
    print('Pooled Estimated Proportion:', round(p_hat_pooled, 4))

    # Pooled Standard Error
    SE_pooled = np.sqrt((p_hat_pooled * (1 - p_hat_pooled))*((1/N_cont) + (1/N_exp)))
    print('Pooled Standard Error:', round(SE_pooled, 4))

    # Margin of Error for 95% Confidence Interval (z = 1.96)
    ME = 1.96 * SE_pooled
    print('Margin of Error:', round(ME, 4))

    # 95% Confidence Interval
    lower_CI = d_hat - ME
    print('CI Lower Bound:', round(lower_CI, 4))
    upper_CI = d_hat + ME
    print('CI Upper Bound:', round(upper_CI, 4))
    print()
    print('95% CI: (', round(lower_CI, 4), ',', round(upper_CI, 4), ')')
    print('The change is statistically significant if the CI does not include 0. \nIn that case, it is practically significant if d_min =', d_min, 'is not in the CI as well.')

In [25]:
# Gross Conversion (enrollments/clicks)
effect_size_test(enrollments_cont, enrollments_exp, clicks_cont, clicks_exp, 0.01)

Control Proportion: 0.2189
Experiment Proportion: 0.1983
Observed Difference: -0.0206
Minimum Detectable Effect (d_min): 0.01

Pooled Estimated Proportion: 0.2086
Pooled Standard Error: 0.0044
Margin of Error: 0.0086
CI Lower Bound: -0.0291
CI Upper Bound: -0.012

95% CI: ( -0.0291 , -0.012 )
The change is statistically significant if the CI does not include 0. 
In that case, it is practically significant if d_min = 0.01 is not in the CI as well.


In [26]:
# Net Conversion (payments/clicks)
effect_size_test(payments_cont, payments_exp, clicks_cont, clicks_exp, 0.0075)

Control Proportion: 0.1176
Experiment Proportion: 0.1127
Observed Difference: -0.0049
Minimum Detectable Effect (d_min): 0.0075

Pooled Estimated Proportion: 0.1151
Pooled Standard Error: 0.0034
Margin of Error: 0.0067
CI Lower Bound: -0.0116
CI Upper Bound: 0.0019

95% CI: ( -0.0116 , 0.0019 )
The change is statistically significant if the CI does not include 0. 
In that case, it is practically significant if d_min = 0.0075 is not in the CI as well.


#### Significance definitions
A metric is statistically significant if the confidence interval does not include 0 (that is, you can be confident there was a change), and it is practically significant if the confidence interval does not include the practical significance boundary (that is, you can be confident there is a change that matters to the business.)
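
These two definitions can be turned into a small decision helper. The sketch below is one possible reading, not the notebook's own method: it treats a change as practically significant only when the whole confidence interval lies beyond `d_min` in either direction, which is an assumption about how the practical significance boundary should be applied. The function name `classify_significance` is hypothetical.

```python
def classify_significance(ci_lower, ci_upper, d_min):
    """Return (statistically_significant, practically_significant)."""
    # Statistically significant: the interval excludes 0.
    statistically = ci_lower > 0 or ci_upper < 0
    # Practically significant (assumed reading): the interval lies entirely
    # beyond the boundary, i.e. above +d_min or below -d_min.
    practically = ci_lower > d_min or ci_upper < -d_min
    return statistically, practically

gross_result = classify_significance(-0.0291, -0.0120, d_min=0.01)
net_result = classify_significance(-0.0116, 0.0019, d_min=0.0075)
print(gross_result, net_result)
```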


<table style="width:100%;">
<colgroup>
<col width="10%" />
<col width="20%" />
<col width="20%" />
<col width="20%" />
<col width="20%" />
<col width="10%" />
</colgroup>
<thead>
<tr class="header">
<th align="center">Metric</th>
<th align="center">dmin</th>
<th align="center">Observed Difference</th>
<th align="center">CI Lower Bound</th>
<th align="center">CI Upper Bound</th>
<th align="center">Result</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="center">Gross Conversion</td>
<td align="center">0.01</td>
<td align="center">-0.0206</td>
<td align="center">-0.0291</td>
<td align="center">-0.0120</td>
<td align="center">Statistically and Practically Significant</td>
</tr>
<tr class="even">
<td align="center">Net Conversion</td>
<td align="center">0.0075</td>
<td align="center">-0.0049</td>
<td align="center">-0.0116</td>
<td align="center">0.0019</td>
<td align="center">Neither Statistically nor Practically Significant</td>
</tr>
</tbody>
</table>
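
As a cross-check on the confidence intervals, the same pooled standard error can be used to compute a z statistic and a two-sided p-value. This is a sketch using the standard library's `statistics.NormalDist`; the function name `pooled_z_test` is hypothetical, and the counts are the control and experiment totals printed above.

```python
import math
from statistics import NormalDist

def pooled_z_test(x_cont, x_exp, n_cont, n_exp):
    """Two-proportion z-test with a pooled standard error.
    Returns (z, two_sided_p_value)."""
    d_hat = x_exp / n_exp - x_cont / n_cont
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    z = d_hat / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z_gc, p_gc = pooled_z_test(3785, 3423, 17293, 17260)   # gross conversion
z_nc, p_nc = pooled_z_test(2033, 1945, 17293, 17260)   # net conversion
print(z_gc, p_gc)
print(z_nc, p_nc)
```

A p-value below 0.05 corresponds to a 95% confidence interval that excludes 0, so the two views should lead to the same conclusion for each metric.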

### Sign Tests

The sign test is another method for validating the results obtained
above. It is less sensitive than the test used above.

In [27]:
df_SignTest = pd.merge(df_control_notnull, df_experiment_notnull, on='Date')
df_SignTest.head()

Unnamed: 0,Date,Pageviews_x,Clicks_x,Enrollments_x,Payments_x,Pageviews_y,Clicks_y,Enrollments_y,Payments_y
0,"Sat, Oct 11",7723,687,134.0,70.0,7716,686,105.0,34.0
1,"Sun, Oct 12",9102,779,147.0,70.0,9288,785,116.0,91.0
2,"Mon, Oct 13",10511,909,167.0,95.0,10480,884,145.0,79.0
3,"Tue, Oct 14",9871,836,156.0,105.0,9867,827,138.0,92.0
4,"Wed, Oct 15",10014,837,163.0,64.0,9793,832,140.0,94.0


In [28]:
df_SignTest['GrossConversion_cont'] = df_SignTest.Enrollments_x/df_SignTest.Clicks_x
df_SignTest['GrossConversion_exp'] = df_SignTest.Enrollments_y/df_SignTest.Clicks_y
df_SignTest['NetConversion_cont'] = df_SignTest.Payments_x/df_SignTest.Clicks_x
df_SignTest['NetConversion_exp'] = df_SignTest.Payments_y/df_SignTest.Clicks_y

In [29]:
cols = ['Date', 'GrossConversion_cont', 'GrossConversion_exp', 'NetConversion_cont', 'NetConversion_exp']
df_SignTest = df_SignTest[cols]

In [30]:
df_SignTest.head()

Unnamed: 0,Date,GrossConversion_cont,GrossConversion_exp,NetConversion_cont,NetConversion_exp
0,"Sat, Oct 11",0.195051,0.153061,0.101892,0.049563
1,"Sun, Oct 12",0.188703,0.147771,0.089859,0.115924
2,"Mon, Oct 13",0.183718,0.164027,0.10451,0.089367
3,"Tue, Oct 14",0.186603,0.166868,0.125598,0.111245
4,"Wed, Oct 15",0.194743,0.168269,0.076464,0.112981


In [31]:
df_SignTest['GC_Sign'] = df_SignTest.GrossConversion_cont - df_SignTest.GrossConversion_exp
df_SignTest['NC_Sign'] = df_SignTest.NetConversion_cont - df_SignTest.NetConversion_exp

In [32]:
df_SignTest

Unnamed: 0,Date,GrossConversion_cont,GrossConversion_exp,NetConversion_cont,NetConversion_exp,GC_Sign,NC_Sign
0,"Sat, Oct 11",0.195051,0.153061,0.101892,0.049563,0.04199,0.05233
1,"Sun, Oct 12",0.188703,0.147771,0.089859,0.115924,0.040933,-0.026065
2,"Mon, Oct 13",0.183718,0.164027,0.10451,0.089367,0.019691,0.015144
3,"Tue, Oct 14",0.186603,0.166868,0.125598,0.111245,0.019735,0.014353
4,"Wed, Oct 15",0.194743,0.168269,0.076464,0.112981,0.026474,-0.036517
5,"Thu, Oct 16",0.167679,0.163706,0.099635,0.077411,0.003974,0.022224
6,"Fri, Oct 17",0.195187,0.162821,0.101604,0.05641,0.032367,0.045194
7,"Sat, Oct 18",0.174051,0.144172,0.110759,0.095092,0.029879,0.015667
8,"Sun, Oct 19",0.18958,0.172166,0.086831,0.110473,0.017414,-0.023643
9,"Mon, Oct 20",0.191638,0.177907,0.11266,0.113953,0.013731,-0.001294


In [33]:
# Number of trials
n = len(df_SignTest)

In [34]:
# Number of "successes"
GC_x = n - len(df_SignTest[df_SignTest.GC_Sign > 0])
GC_x

4

In [35]:
# Number of "successes"
NC_x = n - len(df_SignTest[df_SignTest.NC_Sign > 0])
NC_x

10

[online sign test calculator](https://www.graphpad.com/quickcalcs/binomial1/)

In [36]:
# Probability of exactly x successes in n trials under Binomial(n, 0.5)
def get_prob(x, n):
    # No rounding here: rounding each term would accumulate error in the p-value
    return math.comb(n, x) * 0.5**n

# Two-sided p-value: double the lower-tail probability P(X <= x).
# This assumes x is the smaller of the two sign counts (x <= n/2).
def get_2side_pvalue(x, n):
    p = 0
    for i in range(0, x + 1):
        p = p + get_prob(i, n)
    return 2*p

In [37]:
print('GC Change is significant if', round(get_2side_pvalue(GC_x, n), 4), 'is smaller than 0.05')
print('NC Change is significant if', round(get_2side_pvalue(NC_x, n), 4), 'is smaller than 0.05')

GC Change is significant if 0.0026 is smaller than 0.05
NC Change is significant if 0.6776 is smaller than 0.05


<table>
<thead>
<tr class="header">
<th align="center">Metric</th>
<th align="center">p-value for sign test</th>
<th align="center">Statistically Significant @ alpha .05?</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="center">Gross Conversion</td>
<td align="center">0.0026</td>
<td align="center">Yes</td>
</tr>
<tr class="even">
<td align="center">Net Conversion</td>
<td align="center">0.6776</td>
<td align="center">No</td>
</tr>
</tbody>
</table>
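
A variant of the sign test that does not depend on which sign is counted as a "success" is sketched below. The function name `sign_test_pvalue` is hypothetical, and the value n = 23 in the usage lines is an assumption about the number of paired days; in the notebook, `len(df_SignTest)` should be used instead.

```python
import math

def sign_test_pvalue(successes, n):
    """Exact two-sided sign test p-value under Binomial(n, 0.5)."""
    # Use the smaller of the two sign counts so the result is symmetric.
    k = min(successes, n - successes)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    # Doubling the tail can exceed 1 when k is close to n/2, so cap it.
    return min(1.0, 2 * tail)

# n = 23 is assumed here for illustration only.
p_gross = sign_test_pvalue(4, 23)
p_net = sign_test_pvalue(10, 23)
print(p_gross, p_net)
```

Taking the minimum of the two counts means the function gives the same answer whether 4 or 19 is passed for gross conversion.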

Summary
-------

An experiment was conducted in which potential Udacity students were
diverted by cookie into two groups, experiment and control. The
experiment group was asked to input the amount of time they were willing
to devote to study after clicking the "start free trial" button, whereas
the control group was not. Three invariant metrics (Number of Cookies,
Number of clicks on "start free trial", and Click-Through-Probability)
were chosen for validation and sanity checking, while Gross Conversion
(enrollments/clicks) and Net Conversion (payments/clicks) served as
evaluation metrics. The null hypothesis is that there is no difference
in the evaluation metrics between the groups. Furthermore, a practical
significance threshold was set for each metric. The requirement for
launching the experiment is that the null hypothesis must be rejected
for ALL evaluation metrics and that the difference between branches must
meet or exceed the practical significance threshold.

Because our acceptance criteria require statistically significant
differences for ALL evaluation metrics, the Bonferroni correction is not
appropriate. The Bonferroni correction is a method for controlling type
I errors (false positives) when multiple metrics are used and a
significant result on ANY of them would support the hypothesis. In that
case the risk of type I errors increases as the number of metrics
increases (significance by random chance). In our case, in which ALL
metrics must be significant to launch, the risk of type II errors (false
negatives) increases as the number of metrics increases, so it stands to
reason that controlling for false positives is not consistent with our
acceptance criteria.

Sanity checks on the invariant metrics showed the expected equal split
between the control and experiment groups: each observed value fell
within its 95% confidence interval. The difference in Gross Conversion
was statistically significant at the 95% confidence level, so the null
hypothesis was rejected for that metric. Gross Conversion also met the
practical significance threshold. The difference in Net Conversion was
neither statistically nor practically significant at the 95% confidence
level.
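
For reference, the kind of check summarized above can be sketched as
follows. The function names, the counts, and the threshold `d_min=0.01`
in the example call are hypothetical and are not the experiment's data.
The sketch uses a pooled standard error and a normal approximation, and
it takes "practically significant" to mean that the observed difference
meets or exceeds `d_min` in magnitude. A stricter convention would
require the whole interval to lie beyond `d_min`.

```python
import math


def diff_confidence_interval(x_cont, n_cont, x_exp, n_exp, z=1.96):
    """CI for (experiment - control) difference in proportions, pooled SE."""
    p_cont = x_cont / n_cont
    p_exp = x_exp / n_exp
    # Pooled probability and standard error under the null of no difference.
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    diff = p_exp - p_cont
    margin = z * se_pool
    return diff, diff - margin, diff + margin


def assess(diff, lower, upper, d_min):
    """Return (statistically significant, practically significant)."""
    # Statistically significant: the interval excludes zero.
    stat_sig = lower > 0 or upper < 0
    # Practically significant (assumed convention): |diff| >= d_min.
    pract_sig = abs(diff) >= d_min
    return stat_sig, pract_sig


# Hypothetical counts, for illustration only.
diff, lower, upper = diff_confidence_interval(
    x_cont=200, n_cont=1000, x_exp=150, n_exp=1000
)
stat_sig, pract_sig = assess(diff, lower, upper, d_min=0.01)
print(f"diff={diff:.4f}, CI=({lower:.4f}, {upper:.4f})")
print(f"statistically significant: {stat_sig}, practically significant: {pract_sig}")
```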

Recommendation
--------------

This experiment was designed to determine whether filtering students as
a function of study time commitment would improve the overall student
experience and the coaches' capacity to support students who are likely
to complete the course, without significantly reducing the number of
students who continue past the free trial. A statistically and
practically significant decrease in Gross Conversion was observed, but
with no significant difference in Net Conversion. This translates to a
decrease in enrollment that is not coupled to an increase in students
staying for the requisite 14 days to trigger payment. Considering this,
my recommendation is not to launch, but rather to pursue other
experiments.
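
The launch rule stated in the summary (every evaluation metric must be
both statistically and practically significant) can be written as a
small helper. This is only a sketch: the function and dictionary keys
are hypothetical, and the boolean values simply restate the qualitative
results reported above.

```python
def launch_decision(results):
    """Launch only if ALL metrics are statistically and practically significant.

    `results` maps a metric name to a dict with two boolean flags.
    Returns (launch, list of metrics that failed the criteria).
    """
    failing = [
        name
        for name, flags in results.items()
        if not (flags["statistically_significant"]
                and flags["practically_significant"])
    ]
    return len(failing) == 0, failing


# Qualitative outcomes as reported in the summary above.
results = {
    "Gross Conversion": {
        "statistically_significant": True,
        "practically_significant": True,
    },
    "Net Conversion": {
        "statistically_significant": False,
        "practically_significant": False,
    },
}

launch, failing = launch_decision(results)
print(f"launch: {launch}, metrics failing the criteria: {failing}")
```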

Follow-up Experiment
--------------------

The construct of student frustration could be given the operational
definition "cancels early," where a convenient definition and measure of
early cancellation is cancellation before the end of the 14-day trial
period, at which point payment is triggered. An early cancellation is
not necessarily indicative of frustration; it could have other causes,
such as a course not being aligned with the student's needs or
expectations in terms of content. For preventing early cancellation
there are two logical points for intervention: (1) pre-enrollment, and
(2) post-enrollment but pre-payment.

The first opportunity for intervention was explored above, wherein a
poll regarding time commitment was used to filter out students likely to
become frustrated. This filter focused only on time commitment to the
class and did not address other reasons why a student might become
frustrated and cancel early. Even if the student was sincere in their
response and diligent in their study, they may become frustrated if
they don't have the suggested pre-requisite skills and experience. That
is, their committed time may not be enough if they don't come in with
the pre-requisite skill set. Adding a checklist of pre-requisite skills
to the popup regarding time commitment may be informative. This
experiment would leverage the infrastructure and data pipeline of the
original experiment and be set up in the same way as the original,
including the unit of diversion. The only difference would be the
information in the form. If the student's answers meet the time and
pre-requisite requirements (radiobox checklist), they are directed to
enroll in the free trial; otherwise they are encouraged to access the
free version. This experiment would be low cost in terms of resources
and may increase the selectivity of the pre-enrollment filter. A
successful experiment would be one in which there is a significant
decrease in Gross Conversion coupled to a significant increase in Net
Conversion.

A variety of approaches could be used to intervene post-enrollment but
pre-payment, and these could be deployed concurrently with a
pre-enrollment intervention. An ideal approach would be one that
minimizes the use of additional coaching resources, to best meet the
original intent of the intervention. An effective approach may be to
employ peer coaching/guidance by means of team formation. If a student
has a team of other students whom they can consult, discuss coursework
and frustrations with, and be accountable to, they may be more likely to
stick out the growing pains and stay for the long term. The experiment
would function as follows.

**Setup:** Upon enrollment students will either be randomly assigned to
a control group in which they are not funnelled into a team, or an
experiment group in which they are.

**Null Hypothesis:** Participation in a team will not increase the
number of students enrolled beyond the 14-day free trial period by a
significant amount.

**Unit of Diversion:** The unit of diversion will be user-id as the
change takes place after a student creates an account and enrolls in a
course.

**Invariant Metrics:** The invariant metric will be the number of
user-ids, since an equal distribution between experiment and control
would be expected as a property of the setup.

**Evaluation Metrics:** The evaluation metric will be Retention. A
statistically and practically significant increase in Retention would
indicate that the change is successful.

If a statistically and practically significant positive change in
Retention is observed, and assuming an acceptable impact on overall
Udacity resources (setting up and maintaining teams will require
resources), the change will be launched.
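
As a sketch of how the invariant metric for this follow-up could be
sanity checked, the share of user-ids landing in the control group can
be compared against a 95% confidence interval around the expected 0.5.
The function name and the counts below are hypothetical, and the normal
approximation to the binomial is an assumption that is reasonable only
for large samples.

```python
import math


def sanity_check_split(n_cont, n_exp, p_expected=0.5, z=1.96):
    """Check that the control share of units is consistent with p_expected."""
    n_total = n_cont + n_exp
    # Standard error of a proportion under the expected split.
    se = math.sqrt(p_expected * (1 - p_expected) / n_total)
    lower = p_expected - z * se
    upper = p_expected + z * se
    observed = n_cont / n_total
    # The check passes when the observed share lies inside the interval.
    return observed, lower, upper, lower <= observed <= upper


# Hypothetical user-id counts, for illustration only.
observed, lower, upper, passed = sanity_check_split(n_cont=5050, n_exp=4950)
print(f"observed={observed:.4f}, CI=({lower:.4f}, {upper:.4f}), passed={passed}")

# A more lopsided hypothetical split fails the same check.
_, _, _, passed_lopsided = sanity_check_split(n_cont=5200, n_exp=4800)
print(f"lopsided split passed: {passed_lopsided}")
```

If the check fails, the assignment mechanism should be investigated
before any Retention results are interpreted.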

