# A/B Testing Final Project - Udacity Course

## 1. Project Introduction

### 1.1. Experiment Overview: Free Trial Screener
At the time of this experiment, Udacity courses have two options on the course overview page: "start free trial", and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. This is what the experiment looks like.
![image.png](attachment:image.png)

The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

### 1.2. Metric choice
Any place "unique cookies" are mentioned, the uniqueness is determined by day. (That is, the same cookie visiting on different days would be counted twice.) User-ids are automatically unique since the site does not allow the same user-id to enroll twice.

 - __Number of cookies__: That is, number of unique cookies to view the course overview page. (dmin=3000)
 - __Number of user-ids__: That is, number of users who enroll in the free trial. (dmin=50)
 - __Number of clicks__: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
 - __Click-through-probability__: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)
 - __Gross conversion__: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)
 - __Retention__: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
 - __Net conversion__: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)
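
The ratio metrics in the list above can be written down directly from the four raw counts. The following is a minimal sketch; the `FunnelCounts` class, its field names, and the example counts are illustrative assumptions and do not come from the project data.

```python
from dataclasses import dataclass


@dataclass
class FunnelCounts:
    """Raw funnel counts; names are illustrative, not taken from the project files."""
    cookies: int      # unique cookies viewing the course overview page
    clicks: int       # unique cookies clicking "Start free trial"
    enrollments: int  # user-ids completing checkout
    payments: int     # user-ids still enrolled past the 14-day boundary


def funnel_metrics(c: FunnelCounts) -> dict:
    """Return the four ratio metrics defined in the metric list."""
    return {
        "click_through_probability": c.clicks / c.cookies,
        "gross_conversion": c.enrollments / c.clicks,
        "retention": c.payments / c.enrollments,
        "net_conversion": c.payments / c.clicks,
    }


# Made-up counts, chosen only to exercise the formulas.
example = FunnelCounts(cookies=1000, clicks=100, enrollments=20, payments=10)
metrics = funnel_metrics(example)
print(metrics)
```

Note that net conversion is the product of gross conversion and retention, since the enrollment count cancels out.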

### 1.3. Measuring variability
[This spreadsheet](https://docs.google.com/spreadsheets/d/1MYNUtC47Pg8hdoCjOXaHqF-thheGpUshrFA21BAJnNc/edit#gid=0) contains rough estimates of the baseline values for these metrics (again, these numbers have been changed from Udacity's true numbers).

For each metric you selected as an evaluation metric, estimate its standard deviation analytically. Do you expect the analytic estimates to be accurate? That is, for which metrics, if any, would you want to collect an empirical estimate of the variability if you had time?

### 1.4. Sizing
__Choosing Number of Samples given Power__

Using the analytic estimates of variance, how many pageviews total (across both groups) would you need to collect to adequately power the experiment? Use an alpha of 0.05 and a beta of 0.2. Make sure you have enough power for each metric.

__Choosing Duration vs. Exposure__

 - What percentage of Udacity's traffic would you divert to this experiment (assuming there were no other experiments you wanted to run simultaneously)? 
 - Is the change risky enough that you wouldn't want to run on all traffic?

Given the percentage you chose, how long would the experiment take to run, using the analytic estimates of variance? If the answer is longer than a few weeks, then this is unreasonably long, and you should reconsider an earlier decision.

### 1.5. Analysis
The data for you to analyze is [here](https://docs.google.com/spreadsheets/d/1Mu5u9GrybDdska-ljPXyBjTpdZIUev_6i7t4LRDfXM8/edit#gid=0). This data contains the raw information needed to compute the above metrics, broken down day by day. Note that there are two sheets within the spreadsheet - one for the experiment group, and one for the control group.

The meaning of each column is:
 - __Pageviews__: Number of unique cookies to view the course overview page that day.
 - __Clicks__: Number of unique cookies to click the "Start free trial" button that day.
 - __Enrollments__: Number of user-ids to enroll in the free trial that day.
 - __Payments__: Number of user-ids who enrolled on that day to remain enrolled for 14 days and thus make a payment. (Note that the date for this column is the start date, that is, the date of enrollment, rather than the date of the payment. The payment happened 14 days later. Because of this, the enrollments and payments are tracked for 14 fewer days than the other columns.)
 
___Sanity Checks___

Start by checking whether your invariant metrics are equivalent between the two groups. If the invariant metric is a simple count that should be randomly split between the 2 groups, you can use a binomial test as demonstrated in Lesson 5. Otherwise, you will need to construct a confidence interval for a difference in proportions using a similar strategy as in Lesson 1, then check whether the difference between group values falls within that confidence interval.
If your sanity checks fail, look at the day by day data and see if you can offer any insight into what is causing the problem.

___Check for Practical and Statistical Significance___

Next, for your evaluation metrics, calculate a confidence interval for the difference between the experiment and control groups, and check whether each metric is statistically and/or practically significant. A metric is statistically significant if the confidence interval does not include 0 (that is, you can be confident there was a change), and it is practically significant if the confidence interval does not include the practical significance boundary (that is, you can be confident there is a change that matters to the business).
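
One way to make these two rules concrete is a pooled-standard-error interval for the difference in proportions. This is a sketch under assumptions: the function names and the counts in the example are hypothetical, and the pooled normal approximation is one common choice rather than the only valid one.

```python
import math


def diff_confidence_interval(x_cont, n_cont, x_exp, n_exp, z=1.96):
    """CI for (experiment rate - control rate) using a pooled standard error."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    diff = x_exp / n_exp - x_cont / n_cont
    margin = z * se_pool
    return diff, diff - margin, diff + margin


def classify(lower, upper, d_min):
    """Statistical: CI excludes 0. Practical: CI also excludes +d_min and -d_min."""
    statistically = not (lower <= 0 <= upper)
    practically = (
        statistically
        and not (lower <= d_min <= upper)
        and not (lower <= -d_min <= upper)
    )
    return statistically, practically


# Hypothetical counts: 2000/10000 conversions in control, 1700/10000 in experiment.
diff, lower, upper = diff_confidence_interval(2000, 10000, 1700, 10000)
print(round(diff, 4), round(lower, 4), round(upper, 4), classify(lower, upper, d_min=0.01))
```

With these made-up counts the whole interval lies below -0.01, so the change counts as both statistically and practically significant under the definitions above.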

If you have chosen multiple evaluation metrics, you will need to decide whether to use the Bonferroni correction. When deciding, keep in mind the results you are looking for in order to launch the experiment. Will the fact that you have multiple metrics make those results more likely to occur by chance than the alpha level of 0.05?

___Run Sign Tests___

For each evaluation metric, do a sign test using the day-by-day breakdown. If the sign test does not agree with the confidence interval for the difference, see if you can figure out why.
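
A sign test only needs the number of days on which the experiment group beat the control group. The sketch below computes an exact two-sided binomial p-value with the standard library; the function names and the daily values are made up for illustration and are not the project data.

```python
from math import comb


def sign_test_p_value(successes: int, trials: int) -> float:
    """Two-sided exact binomial p-value under H0: P(success) = 0.5."""
    k = min(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k + 1)) / 2 ** trials
    return min(1.0, 2 * tail)


def count_wins(control_daily, experiment_daily):
    """Count days where experiment > control; tied days are dropped."""
    pairs = [(c, e) for c, e in zip(control_daily, experiment_daily) if c != e]
    wins = sum(1 for c, e in pairs if e > c)
    return wins, len(pairs)


# Hypothetical daily metric values for eight days.
control_daily = [0.20, 0.22, 0.19, 0.21, 0.23, 0.20, 0.22, 0.21]
experiment_daily = [0.18, 0.19, 0.21, 0.17, 0.20, 0.18, 0.19, 0.20]

wins, days = count_wins(control_daily, experiment_daily)
p_value = sign_test_p_value(wins, days)
print(wins, days, p_value)
```

A p-value below the chosen alpha suggests the day-by-day direction of the difference is unlikely to be due to chance alone.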

___Make a Recommendation___

Finally, make a recommendation. Would you launch this experiment, not launch it, dig deeper, run a follow-up experiment, or is it a judgment call? If you would dig deeper, explain what area you would investigate. If you would run follow-up experiments, briefly describe that experiment. If it is a judgment call, explain what factors would be relevant to the decision.

### 1.6. Follow-Up Experiment: How to Reduce Early Cancellations
If you wanted to reduce the number of frustrated students who cancel early in the course, what experiment would you try? Give a brief description of the change you would make, what your hypothesis would be about the effect of the change, what metrics you would want to measure, and what unit of diversion you would use. Include an explanation of each of your choices.

__Other resources__:

 - __Project template__: [Template format](https://docs.google.com/document/d/16OX2KDSHI9mSCriyGIATpRGscIW2JmByMd0ITqKYvNg/edit)
 - __Project rubric__: [Rubric](https://docs.google.com/document/u/1/d/1Hga00A4258wSJ9dir0Td_ZPLeAtJ079UGbM3C0ykUtE/pub)
 
## 2. Experiment Design

The initial unit of diversion to the control and experiment groups is a unique cookie. However, once a student enrolls in the free trial, they are tracked by user-id. The same user-id can't enroll more than once. Users who don't enroll are not tracked by user-id. Note that the uniqueness of a cookie is determined per day.

### 2.1. Metric choice
__Invariant metrics__: The metrics below were chosen as invariants because they are measured before the screener appears and so should not be affected by the experiment. One would expect a similar distribution into control and experiment for these metrics. If this turns out not to be the case, it could indicate a problem with the experiment setup, such as faulty diversion or data capture.
 - Number of cookies: The number of unique cookies to visit the course overview page. This is the unit of diversion and even distribution amongst the control and experiment groups is expected. It is therefore appropriate as an invariant metric.
 - Number of clicks: The number of users (tracked as unique cookies at this stage) to click the free trial button. This is appropriate as an invariant metric but not an evaluation metric. Equal distribution amongst the experiment and control groups would be expected since at this point in the funnel the experience is the same for all users and therefore elements of the experiment would not be expected to impact clicking the "start free trial" button.
 - Click-through-probability: Unique cookies to click the "start free trial" button per unique cookies to view the course overview page. Equal distribution amongst the experiment and control groups would be expected since at this point in the funnel the experience is the same for all users and therefore elements of the experiment would not be expected to impact clicking the "start free trial" button.
 
 | Invariant metric | Formula | Notation | dmin |
 | :- | :- | :- | :- |
 | Number of cookies | # unique daily cookies on page | $C_k$ | 3000 cookies |
 | Number of clicks | # unique daily cookies to click the "free trial" button | $C_l$ | 240 clicks |
 | Click-through-probability | $\frac{C_l}{C_k}$ | $CTP$ | 0.01 |

 
__Evaluation metrics:__ Evaluation metrics were chosen because their values may differ between the experiment and control groups as a result of the change. Each evaluation metric is associated with a minimum difference (dmin) that must be observed for consideration in the decision to launch the experiment. The ultimate goal is to minimize student frustration, maximize satisfaction, and use limited coaching resources as effectively as possible. Cancelling early may be one indication of frustration or low satisfaction, and the more enrolled students who do not make at least one payment, much less finish the course, the less effectively coaching resources are being used. With this in mind, in order to consider launching the experiment either of the following must be observed:

 - Increased retention (more students staying beyond the free trial in the experiment group)
 - Decreased Gross Conversion coupled to increased Net Conversion (fewer students enrolling in the free trial but more students staying beyond the free trial)

 | Evaluation metrics | Formula | Notation | dmin |
 | :- | :- | :- | :- |
 | Gross conversion | $\frac{enrolled}{C_l}$ | Conversion_gross | 0.01 |
 | Retention | $\frac{paid}{enrolled}$ | Retention | 0.01 |
 | Net conversion | $\frac{paid}{C_l}$ | Conversion_net | 0.0075 |
 
__Unused metrics__: Number of user-ids
The number of users to enroll in the free trial. This is not a suitable invariant metric and while it could be used as an evaluation metric, it is not ideal. User-ids are tracked only after enrolling in the free trial and equal distribution between the control and experimental branches would not be expected. User-id count could be used to evaluate how many enrollments stayed beyond the 14 day free trial boundary, but since it isn't normalized, I have chosen not to use it.

### 2.2. Measuring standard deviations
Make an analytic estimate of each evaluation metric's standard deviation, given a sample size of 5000 cookies visiting the course overview page.
In order to estimate variance analytically, we can assume metrics which are probabilities (p) are binomially distributed, so we can use this formula for the standard deviation:

$SD$ = $\sqrt{\frac{p * (1-p)}{n}}$

where:
 - $p$ = baseline probabilities of the event to occur
 - $n$ = sample size

This assumption is only valid when the unit of diversion of the experiment is equal to the unit of analysis (the denominator of the metric formula). When it does not hold, the actual variance might differ from the analytic estimate, and it is recommended to estimate it empirically.

__Gross Conversion__: The baseline probability for Gross Conversion can be calculated by the number of users to enroll in a free trial divided by the number of cookies clicking the free trial. In other words, the **probability of enrollment given a click**. 
In this case, the unit of diversion (Cookies), that is the element by which we differentiate samples and assign them to control and experiment groups, is equal to the unit of analysis (cookies who click), that is the denominator of the formula to calculate Gross Conversion (GC). When this is the case, this analytic estimate of variance is sufficient.

In [4]:
# import libraries
import pandas as pd
import numpy as np

In [5]:
# Import Baseline dataframe
baseline = pd.read_csv('final_project_baseline_values.csv', index_col = False, header = None, names = ['metrics','estimator'])
baseline.metrics = baseline.metrics.map(lambda x : x.lower())

# Calculate p and n
sample_size = 5000
total_cookies = baseline.iloc[0]['estimator']
scale = sample_size/total_cookies

p_gc = baseline.iloc[4]['estimator'] # baseline probability of Gross Conversion
p_rt = baseline.iloc[5]['estimator'] # baseline probability of Retention
p_nc = baseline.iloc[6]['estimator'] # baseline probability of Net Conversion

n_clicks = baseline.iloc[1]['estimator'] * scale
n_enrollments = baseline.iloc[2]['estimator'] * scale

baseline

Unnamed: 0,metrics,estimator
0,unique cookies to view course overview page pe...,40000.0
1,"unique cookies to click ""start free trial"" per...",3200.0
2,enrollments per day:,660.0
3,"click-through-probability on ""start free trial"":",0.08
4,"probability of enrolling, given click:",0.20625
5,"probability of payment, given enroll:",0.53
6,"probability of payment, given click",0.109313


In [6]:
# Calculate standard deviation
sd_Gconversion = round(np.sqrt((p_gc * (1- p_gc) / n_clicks)), 4)
print('Standard deviation of Gross Conversion:', sd_Gconversion)

sd_Retention = round(np.sqrt((p_rt * (1 - p_rt) / n_enrollments)), 4)
print('Standard deviation of Retention:', sd_Retention)

sd_Nconversion = round(np.sqrt((p_nc * (1 - p_nc) / n_clicks)), 4)
print('Standard deviation of Net Conversion:', sd_Nconversion)

Standard deviation of Gross Conversion: 0.0202
Standard deviation of Retention: 0.0549
Standard deviation of Net Conversion: 0.0156


### 2.3. Sizing
#### Number of Samples vs. Power
Indicate whether you will use the Bonferroni correction during your analysis phase, and give the number of pageviews you will need to power your experiment appropriately.

In this case, with $d_{min}$ = {0.01, 0.01, 0.0075} for Gross Conversion, Retention, and Net Conversion and $\alpha_{individual}$ = {0.0172, 0.0464, 0.0132}, the Bonferroni correction is not applied.

At this point, once we have estimated our metrics from the baseline values (most importantly, their estimated variance), we can calculate the minimal number of samples we need so that our experiment will have enough statistical power and significance.

With $\alpha$ = 0.05 (significance level) and $\beta$ = 0.2 (type II error rate, giving a power of 1 - $\beta$ = 0.8), we want to estimate how many total pageviews (cookies who viewed the course overview page) we need in the experiment. This amount will be divided into the two groups: control and experiment. This calculation can be done using an [online calculator](https://www.evanmiller.org/ab-testing/sample-size.html) or by calculating directly using the required formula. 
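
For the direct calculation, one common two-proportion formula is sketched below. It is an assumption, not something confirmed by this notebook, that the linked calculator uses this exact form, including the step that mirrors baselines above 0.5. The function names are hypothetical. Under those assumptions the results should come out close to the per-group sample sizes listed in this section.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p, d_min, alpha=0.05, beta=0.2):
    """Approximate units per group needed to detect an absolute change d_min."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    p = min(p, 1 - p)  # assumption: mirror baselines above 0.5
    sd_null = math.sqrt(2 * p * (1 - p))
    sd_alt = math.sqrt(p * (1 - p) + (p + d_min) * (1 - p - d_min))
    return (z_alpha * sd_null + z_beta * sd_alt) ** 2 / d_min ** 2


def pageviews_needed(n_per_group, units_per_pageview, groups=2):
    """Convert a per-group sample size in clicks or enrollments to total pageviews."""
    return groups * n_per_group / units_per_pageview


# Baseline probabilities and dmin values from this notebook.
n_gross = sample_size_per_group(0.20625, 0.01)
n_retention = sample_size_per_group(0.53, 0.01)
n_net = sample_size_per_group(0.109313, 0.0075)

print(round(n_gross), round(n_retention), round(n_net))
print(round(pageviews_needed(n_retention, 660 / 40000)))
```

Gross and net conversion are converted with clicks per pageview (3200/40000), while retention uses enrollments per pageview (660/40000), which is why retention dominates the pageview requirement.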

Pageviews for Each Evaluation Metric to Achieve Target Statistical Power:

Gross Conversion
 - Baseline Conversion: 20.625%
 - Minimum Detectable Effect: 1%
 - $\alpha$: 5%
 - $\beta$: 20%
 - 1 - $\beta$: 80%
 - Sample size = 25,835 clicks/group
 - Number of groups = 2 (experiment and control)
 - Total sample size = 51,670 clicks
 - clicks/pageview = 3200/40000 = .08 clicks/pageview
 - Pageviews = 645,875
 
Retention
 - Baseline Conversion: 53%
 - Minimum Detectable Effect: 1% (as dmin = 0.01)
 - $\alpha$: 5%
 - $\beta$: 20%
 - 1 - $\beta$: 80%
 - Sample size = 39,115 enrollments/group
 - Number of groups = 2 (experiment and control)
 - Total sample size = 78,230 enrollments
 - enrollments/pageview = 660/40000 = .0165 enrollments/pageview
 - Pageviews = 4,741,212
 
Net Conversion
 - Baseline Conversion: 10.93%
 - Minimum Detectable Effect: .75%
 - $\alpha$: 5%
 - $\beta$: 20%
 - 1 - $\beta$: 80%
 - Sample size = 27,413 clicks/group
 - Number of groups = 2 (experiment and control)
 - Total sample size = 54,826 clicks
 - clicks/pageview = 3200/40000 = .08 clicks/pageview
 - Pageviews = 685,325
 
=> Pageviews required: 4,741,212

#### Duration vs. Exposure
Indicate what fraction of traffic you would divert to this experiment and, given this, how many days you would need to run the experiment.


In [7]:
# Number of days to run
4741212/40000

118.5303

With 100% diversion, it takes about 119 days to complete the experiment.

In [8]:
645875/40000

16.146875

In [9]:
685325/40000

17.133125

If we divert 100% of traffic, given 40,000 page views per day, the experiment would take ~ 119 days. If we eliminate retention, we are left with Gross Conversion and Net Conversion. This reduces the number of required pageviews to 685,325, and an ~ 18 day experiment with 100% diversion and ~ 35 days given 50% diversion.

A 119 day experiment with 100% diversion of traffic presents both a business risk (potential for: frustrated students, lower conversion and retention, and inefficient use of coaching resources) and an opportunity risk (performing other experiments). However, in general, this is not a risky experiment as the change would not be expected to cause a precipitous drop in enrollment. In terms of timing, an 18 day experiment is more reasonable, but % diversion may be scaled down depending on other experiments of interest to be performed concurrently.

In short, we need 18 days and 685,325 pageviews for the experiment.
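
The duration arithmetic generalizes to any diversion fraction. The helper below is a small sketch; `days_to_run` is a hypothetical name, and it assumes a constant 40,000 pageviews per day as in the baseline values.

```python
import math


def days_to_run(pageviews_required, daily_pageviews=40_000, diversion=1.0):
    """Whole days needed to collect the required pageviews at a given diversion."""
    return math.ceil(pageviews_required / (daily_pageviews * diversion))


print(days_to_run(4_741_212))               # all three metrics, 100% diversion
print(days_to_run(685_325))                 # without retention, 100% diversion
print(days_to_run(685_325, diversion=0.5))  # without retention, 50% diversion
```

Rounding up matters here because a partial day still has to be run in full to reach the required sample.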

## 3. Experiment Analysis
### 3.1. Sanity check
For each metric that you chose as an invariant metric, compute a 95% confidence interval for the value you expect to observe. In this part, we use the data from sheet "Final Project Result"

In [10]:
control = pd.read_csv("final_project_results_control.csv")
experiment = pd.read_csv("final_project_results_experiment.csv")
control.head()
experiment.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",9793,832,140.0,94.0


First thing we have to do before even beginning to analyze this experiment's results is sanity checks. These checks help verify that the experiment was conducted as expected and that other factors did not influence the data which we collected. This also makes sure that data collection was correct.

We have 3 invariant metrics:

 - Number of Cookies in Course Overview Page
 - Number of Clicks on Free Trial Button
 - Free Trial button Click-Through-Probability

Two of these metrics are simple counts (number of cookies and number of clicks) and the third is a probability (CTP). We will use two different ways of checking whether these observed values are what we expect (if in fact the experiment was not damaged).

In [11]:
results = {"Control": pd.Series([control.Pageviews.sum(), control.Clicks.sum(), control.Enrollments.sum(), control.Payments.sum()], 
                                index = ["cookies", "clicks", "enrollments", "payments"]), 
           "Experiment":pd.Series([experiment.Pageviews.sum(), experiment.Clicks.sum(), experiment.Enrollments.sum(), experiment.Payments.sum()], 
                                  index = ["cookies","clicks", "enrollments", "payments"])}
df_results = pd.DataFrame(results)
df_results

Unnamed: 0,Control,Experiment
cookies,345543.0,344660.0
clicks,28378.0,28325.0
enrollments,3785.0,3423.0
payments,2033.0,1945.0


Now we will check the count metrics.

In [14]:
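# Under an even split, each unit should land in the experiment group with probability 0.5
df_results["Total"] = df_results.Control + df_results.Experiment
df_results["Prob"] = 0.5
# Normal approximation to the binomial: standard error and 95% margin of error
df_results["SE"] = np.sqrt((df_results.Prob * (1 - df_results.Prob))/ df_results.Total)
df_results["MarginErr"] = 1.96 * df_results.SE
df_results["CI_lower"] = df_results.Prob - df_results.MarginErr
df_results["CI_upper"] = df_results.Prob + df_results.MarginErr
# Observed share of each count in the experiment group (Expected_Vals holds the control share)
df_results["Observed_Vals"] = df_results.Experiment/df_results.Total
df_results["Expected_Vals"] = df_results.Control/df_results.Total
# The check passes when the observed share lies inside the interval;
# only the cookies and clicks rows are invariant metrics
df_results["Results"] = df_results.apply(lambda x: (x.Observed_Vals > x.CI_lower) and (x.Observed_Vals < x.CI_upper), axis =1)
df_results["Difference"] = (df_results.Experiment - df_results.Control)/ df_results.Total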

df_results

Unnamed: 0,Control,Experiment,Total,Prob,SE,MarginErr,CI_lower,CI_upper,Observed_Vals,Expected_Vals,Results,Difference
cookies,345543.0,344660.0,690203.0,0.5,0.000602,0.00118,0.49882,0.50118,0.49936,0.50064,True,-0.001279
clicks,28378.0,28325.0,56703.0,0.5,0.0021,0.004116,0.495884,0.504116,0.499533,0.500467,True,-0.000935
enrollments,3785.0,3423.0,7208.0,0.5,0.005889,0.011543,0.488457,0.511543,0.474889,0.525111,False,-0.050222
payments,2033.0,1945.0,3978.0,0.5,0.007928,0.015538,0.484462,0.515538,0.488939,0.511061,True,-0.022122


### 3.2. Results Analysis
#### Effect Size Tests
For each of your evaluation metrics, give a 95% confidence interval around the difference between the experiment and control groups. Indicate whether each metric is statistically and practically significant.

The next step is looking at the changes between the control and experiment groups with regard to our evaluation metrics to make sure the difference is there, that it is statistically significant and most importantly practically significant (the difference is "big" enough to make the experimented change beneficial to the company).

Now, all that is left is to measure for each evaluation metric, the difference between the values from both groups. Then, we compute the confidence interval for that difference and test whether or not this confidence interval is both statistically and practically significant.

In [17]:
# Calculate the remaining invariant metric
# Click-through-probability (clicks/cookies)

control_clicks = df_results.loc["clicks", "Control"]
control_cookies = df_results.loc["cookies","Control"]
exp_clicks = df_results.loc["clicks","Experiment"]
exp_cookies = df_results.loc["cookies","Experiment"]

control_p_hat = control_clicks/control_cookies
exp_p_hat = exp_clicks/exp_cookies

SE_ClickProb = np.sqrt((control_p_hat * (1 - control_p_hat))/control_cookies)

ME_ClickProb = 1.96 * SE_ClickProb

upper_ClickProb = exp_p_hat + ME_ClickProb
lower_ClickProb = exp_p_hat - ME_ClickProb

In [23]:
df_control_notnull = control[control.Enrollments.notnull()]
df_experiment_notnull = experiment[experiment.Enrollments.notnull()]