# A/B Testing Final Project

This project analyzes an A/B test on the fictional "Audacity" educational website. The change asks students how much time they can devote to a course before they choose to begin a free trial.

## Experiment overview [[1]](https://docs.google.com/document/u/1/d/1aCquhIqsUApgsxQ8-SQBAigFDcfWVVohLEXcV6jWbdI/pub?embedded=True)

At the time of this experiment, Udacity courses had two options on the course
overview page: "start free trial", and "access course materials". If the student clicks "start free
trial", they will be asked to enter their credit card information, and then they will be enrolled in a
free trial for the paid version of the course. After 14 days, they will automatically be charged
unless they cancel first. If the student clicks "access course materials", they will be able to view
the videos and take the quizzes for free, but they will not receive coaching support or a verified
certificate, and they will not submit their final project for feedback.
In the experiment, Udacity tested a change where if the student clicked "start free trial", they
were asked how much time they had available to devote to the course. If the student indicated 5
or more hours per week, they would be taken through the checkout process as usual. If they
indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses
usually require a greater time commitment for successful completion, and suggesting that the
student might like to access the course materials for free. At this point, the student would have
the option to continue enrolling in the free trial, or access the course materials for free
instead. This screenshot shows what the experiment looks like.
The hypothesis was that this might set clearer expectations for students upfront, thus reducing
the number of frustrated students who left the free trial because they didn't have enough time,
without significantly reducing the number of students to continue past the free trial and
eventually complete the course. If this hypothesis held true, Udacity could improve the overall
student experience and improve coaches' capacity to support students who are likely to
complete the course.
The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked
by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For
users that do not enroll, their user-id is not tracked in the experiment, even if they were signed
in when they visited the course overview page. 

Getting started by importing required libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
baseline = pd.read_csv('Data/baseline.csv')

In [3]:
baseline

| | MeasureName | Value |
|---|---|---|
| 0 | Unique cookies to view course overview page pe... | 40000.0 |
| 1 | Unique cookies to click Start free trial per day | 3200.0 |
| 2 | Enrollments per day | 660.0 |
| 3 | Click-through-probability on Start free trial | 0.08 |
| 4 | Probability of enrolling given click | 0.20625 |
| 5 | Probability of payment given enroll | 0.53 |
| 6 | Probability of payment given click | 0.109313 |


In [4]:
baseline.dtypes

MeasureName     object
Value          float64
dtype: object
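
As a quick consistency check, the three probabilities in the baseline table can be recomputed from the counts in the same table. This is a minimal sketch with the values typed in by hand; the variable names are illustrative and not taken from the project files.

```python
# Baseline counts copied by hand from the table above.
cookies_per_day = 40000.0      # unique cookies viewing the course overview page
clicks_per_day = 3200.0        # unique cookies clicking "Start free trial"
enrollments_per_day = 660.0
p_pay_given_enroll = 0.53

# Each probability is a ratio of two counts further up the funnel.
ctp = clicks_per_day / cookies_per_day
p_enroll_given_click = enrollments_per_day / clicks_per_day
p_pay_given_click = p_enroll_given_click * p_pay_given_enroll

print(ctp, p_enroll_given_click, round(p_pay_given_click, 6))
```

The printed values can be compared against rows 3, 4 and 6 of the baseline table.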

## Section 1 Question: Metric Choice

Which of the following metrics would you choose to measure for this experiment and why? For each metric you choose, indicate whether you would use it as an invariant metric or an evaluation metric. The practical significance boundary for each metric, that is, the difference that would have to be observed before that was a meaningful change for the business, is given in parentheses. All practical significance boundaries are given as absolute changes.


Any place "unique cookies" are mentioned, the uniqueness is determined by day. (That is, the same cookie visiting on different days would be counted twice.) User-ids are automatically unique since the site does not allow the same user-id to enroll twice.

Number of cookies: That is, number of unique cookies to view the course overview page. (dmin=3000)<br><br>
Number of user-ids: That is, number of users who enroll in the free trial. (dmin=50)<br><br>
Number of clicks: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)<br><br>
Click-through-probability: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)<br><br>
Gross conversion: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)<br><br>
Retention: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)<br><br>
Net conversion: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)<br><br>

You should also decide now what results you will be looking for in order to launch the experiment. Would a change in any one of your evaluation metrics be sufficient? Would you want to see multiple metrics all move or not move at the same time in order to launch? This decision will inform your choices while designing the experiment.

## Section 1 Answers: Choosing invariant and evaluation metrics

#### 1. Invariant Metrics
Of the metrics provided, **"Unique cookies to visit the page per day"**, **"Unique cookies to click start free trial per day"** and **"Click through probability on the start free trial"** are all measured before the user sees any change to the page. Therefore these 3 values should not change as a result of this experiment.

#### 2. Evaluation Metrics
There are 3 candidates for evaluation metrics: **"Probability of enrolling given click"** (gross conversion), **"Probability of payment given enroll"** (retention) and **"Probability of payment given click"** (net conversion). After completing the sizing section below, I found that the number of pageviews (PVs) needed to reliably estimate the effect of the change would be very large for "Probability of payment given enroll". Therefore the final 2 metrics used to measure this change are:
1. **"Probability of enrolling given click"**
2. **"Probability of payment given click"** 

### Measuring Variability
[This spreadsheet](https://www.google.com/url?q=https://docs.google.com/a/knowlabs.com/spreadsheets/d/1MYNUtC47Pg8hdoCjOXaHqF-thheGpUshrFA21BAJnNc/edit%23gid%3D0&sa=D&ust=1521269700957000&usg=AFQjCNHpp0jBBaRuv1Hm6K9O5QCC2MwAcw) contains rough estimates of the baseline values for these metrics (again, these numbers have been changed from Udacity's true numbers).


For each metric you selected as an evaluation metric, estimate its standard deviation analytically. Do you expect the analytic estimates to be accurate? That is, for which metrics, if any, would you want to collect an empirical estimate of the variability if you had time?

**Following the project instructions, I first calculate the baseline values of the evaluation metrics and their standard errors (SEs) for a sample of 5000 unique cookies viewing the course overview page.**

In [5]:
baseline_piv = baseline.pivot_table(columns='MeasureName', aggfunc='sum')
baseline_piv.columns = ['CTR', 'Enrollments', 'P_enroll_click', 'P_pay_click', 'P_pay_enroll', 'StartTrial', 'PVs']

In [6]:
# DataFrame.append was removed in pandas 2.0, so add the 5000-pageview row with pd.concat.
# Columns missing from the new row are filled with NaN and computed in the next cell.
baseline_piv = pd.concat([baseline_piv, pd.DataFrame([{'PVs': 5000}])],
                         ignore_index=True)

In [7]:
baseline_piv.loc[1, 'StartTrial'] = (baseline_piv.loc[0, 'StartTrial'] *\
                                      baseline_piv.loc[1, 'PVs'])/ baseline_piv.loc[0,'PVs']
baseline_piv.loc[1, 'Enrollments'] = (baseline_piv.loc[0, 'Enrollments'] *\
                                      baseline_piv.loc[1, 'StartTrial']) \
/ baseline_piv.loc[0,'StartTrial']
baseline_piv.loc[1, 'P_pay_enroll'] = baseline_piv.loc[0, 'P_pay_enroll']
baseline_piv.loc[1, 'P_enroll_click'] = baseline_piv.loc[0, 'P_enroll_click']
baseline_piv.loc[1, 'P_pay_click'] = baseline_piv.loc[0, 'P_pay_click']
baseline_piv.loc[1, 'CTR'] = baseline_piv.loc[1, 'StartTrial'] / baseline_piv.loc[1, 'PVs']

In [8]:
baseline_piv

| | CTR | Enrollments | P_enroll_click | P_pay_click | P_pay_enroll | StartTrial | PVs |
|---|---|---|---|---|---|---|---|
| 0 | 0.08 | 660.0 | 0.20625 | 0.109313 | 0.53 | 3200.0 | 40000.0 |
| 1 | 0.08 | 82.5 | 0.20625 | 0.109313 | 0.53 | 400.0 | 5000.0 |


In [9]:
# Analytic standard error of a proportion (two-sample form when n2 is given)
def calc_se(prob, n1, n2=0):
    if n2 == 0:
        se = np.sqrt(prob*(1-prob)*(1/n1))
    else:
        se = np.sqrt(prob*(1-prob)*(1/n1+1/n2))
    return round(se,4)

In [10]:
print('P_enroll_click baseline SE using unique cookies as unit of diversion: ',\
      calc_se(baseline_piv.loc[1, 'P_enroll_click'],\
              baseline_piv.loc[1, 'StartTrial']))
print('P_pay_click baseline SE using unique cookies as unit of diversion: ',\
     calc_se(baseline_piv.loc[1, 'P_pay_click'],\
              baseline_piv.loc[1, 'StartTrial']))

P_enroll_click baseline SE using unique cookies as unit of diversion:  0.0202
P_pay_click baseline SE using unique cookies as unit of diversion:  0.0156
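
Written out, the analytic estimate treats each metric as a binomial proportion whose denominator is the number of clicks, N = 5000 × 0.08 = 400, rather than the number of pageviews. The worked values below use the baseline probabilities from the table above.

```latex
SE = \sqrt{\frac{p\,(1-p)}{N}}, \qquad N = 5000 \times 0.08 = 400

SE_{\text{enroll}\mid\text{click}} = \sqrt{\frac{0.20625 \times 0.79375}{400}} \approx 0.0202

SE_{\text{pay}\mid\text{click}} = \sqrt{\frac{0.109313 \times 0.890687}{400}} \approx 0.0156
```

Both metrics use cookies in the denominator, which is also the unit of diversion, so the analytic estimates are likely to be close to empirical ones.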


## Section 2 Questions: Sizing
#### Choosing Number of Samples given Power
Using the analytic estimates of variance, how many pageviews total (across both groups) would you need to collect to adequately power the experiment? Use an alpha of 0.05 and a beta of 0.2. Make sure you have enough power for each metric.

Using the sample size calculator [here](https://www.evanmiller.org/ab-testing/sample-size.html) for probability of payment given click, we get the number of clicks needed per variation (33020 in the cell below). Dividing by the click-through probability (0.08) converts clicks to pageviews, and multiplying by 2 covers both groups, so the total number of pageviews needed is:

In [11]:
33020/0.08 * 2

825500.0

#### Choosing Duration vs. Exposure
What percentage of Udacity's traffic would you divert to this experiment (assuming there were no other experiments you wanted to run simultaneously)? Is the change risky enough that you wouldn't want to run on all traffic?


Given the percentage you chose, how long would the experiment take to run, using the analytic estimates of variance? If the answer is longer than a few weeks, then this is unreasonably long, and you should reconsider an earlier decision.

Assuming we show this change to 90% of the traffic we will need to run this experiment for:

In [12]:
825500/ (0.9*40000)

22.930555555555557

roughly 23 days
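
The two calculations above can be combined into one helper, which makes it easier to try other traffic fractions. This is a sketch only; the function name and signature are illustrative, and the inputs are the figures already used in the cells above.

```python
import math

def pageviews_and_days(clicks_per_group, ctp, daily_pageviews, traffic_fraction, groups=2):
    """Convert a per-group click requirement into total pageviews and run time in days."""
    # Clicks -> pageviews via the click-through probability, then cover all groups.
    total_pageviews = clicks_per_group / ctp * groups
    # Round up, since a partial day still has to be run.
    days = math.ceil(total_pageviews / (traffic_fraction * daily_pageviews))
    return total_pageviews, days

# 33020 clicks per group, CTP 0.08, 40000 pageviews per day, 90% of traffic diverted.
total, days = pageviews_and_days(33020, 0.08, 40000, 0.9)
print(total, days)
```

Calling the helper with a different `traffic_fraction` shows how the duration changes with exposure.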

## Section 3: Analyzing results
 
### Analysis

The data for you to analyze is here. This data contains the raw information needed to compute
the above metrics, broken down day by day. Note that there are two sheets within the
spreadsheet - one for the experiment group, and one for the control group.
The meaning of each column is:

* Pageviews: Number of unique cookies to view the course overview page that day.
* Clicks: Number of unique cookies to click the course overview page that day.
* Enrollments: Number of user-ids to enroll in the free trial that day.
* Payments: Number of user-ids who enrolled on that day to remain enrolled for 14
days and thus make a payment. (Note that the date for this column is the start date, that
is, the date of enrollment, rather than the date of the payment. The payment happened
14 days later. Because of this, the enrollments and payments are tracked for 14 fewer
days than the other columns.)

#### Sanity Checks
Start by checking whether your invariant metrics are equivalent between the two groups. If the
invariant metric is a simple count that should be randomly split between the 2 groups, you can
use a binomial test as demonstrated in Lesson 5. Otherwise, you will need to construct a
confidence interval for a difference in proportions using a similar strategy as in Lesson 1, then
check whether the difference between group values falls within that confidence level.
If your sanity checks fail, look at the day by day data and see if you can offer any insight into
what is causing the problem.

#### Check for Practical and Statistical Significance

Next, for your evaluation metrics, calculate a confidence interval for the difference between the
experiment and control groups, and check whether each metric is statistically and/or practically
significant. A metric is statistically significant if the confidence interval does not include 0 (that
is, you can be confident there was a change), and it is practically significant if the confidence
interval does not include the practical significance boundary (that is, you can be confident there
is a change that matters to the business.)
If you have chosen multiple evaluation metrics, you will need to decide whether to use the
Bonferroni correction. When deciding, keep in mind the results you are looking for in order to
launch the experiment. Will the fact that you have multiple metrics make those results more
likely to occur by chance than the alpha level of 0.05?
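
For reference, a Bonferroni correction would split alpha evenly across the evaluation metrics, which widens each confidence interval. The sketch below shows the two-sided critical z-value with and without the correction; the helper name is illustrative, and whether the correction is appropriate depends on the launch criteria, which this sketch does not decide.

```python
from statistics import NormalDist

def z_critical(alpha=0.05, n_metrics=1):
    """Two-sided critical z-value with alpha split evenly across n_metrics (Bonferroni)."""
    alpha_each = alpha / n_metrics
    return NormalDist().inv_cdf(1 - alpha_each / 2)

print(round(z_critical(0.05, 1), 2))  # no correction
print(round(z_critical(0.05, 2), 2))  # alpha shared by two evaluation metrics
```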

#### Run Sign Tests
For each evaluation metric, do a sign test using the day-by-day breakdown. If the sign test does
not agree with the confidence interval for the difference, see if you can figure out why.

#### Make a Recommendation
Finally, make a recommendation. Would you launch this experiment, not launch it, dig deeper,
run a follow-up experiment, or is it a judgment call? If you would dig deeper, explain what area
you would investigate. If you would run follow-up experiments, briefly describe that experiment.
If it is a judgment call, explain what factors would be relevant to the decision.

In [17]:
control = pd.read_csv('Data/ObservedControl.csv')
experiment = pd.read_csv('Data/ObservedExperiment.csv')

In [18]:
control.head()

| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat Oct 11 | 7723 | 687 | 134.0 | 70.0 |
| 1 | Sun Oct 12 | 9102 | 779 | 147.0 | 70.0 |
| 2 | Mon Oct 13 | 10511 | 909 | 167.0 | 95.0 |
| 3 | Tue Oct 14 | 9871 | 836 | 156.0 | 105.0 |
| 4 | Wed Oct 15 | 10014 | 837 | 163.0 | 64.0 |


In [19]:
experiment.head()

| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat Oct 11 | 7716 | 686 | 105.0 | 34.0 |
| 1 | Sun Oct 12 | 9288 | 785 | 116.0 | 91.0 |
| 2 | Mon Oct 13 | 10480 | 884 | 145.0 | 79.0 |
| 3 | Tue Oct 14 | 9867 | 827 | 138.0 | 92.0 |
| 4 | Wed Oct 15 | 9793 | 832 | 140.0 | 94.0 |


Now, as the first sanity check, let's build the 95% confidence interval around the expected 0.5 split for two of the invariant metrics, pageviews and clicks, and compare it with the observed fraction in the control group.

In [31]:
num_cookies_control = control.Pageviews.sum()
num_cookies_expt = experiment.Pageviews.sum()

clicks_control = control.Clicks.sum()
clicks_expt = experiment.Clicks.sum()

total_PVs = num_cookies_control + num_cookies_expt
total_clicks = clicks_control + clicks_expt

observed_PV_control = num_cookies_control/total_PVs

observed_clicks_control = clicks_control/total_clicks

#building a binomial distribution to calculate SE
p_split = 0.5

SE_PV = np.sqrt(p_split*(1-p_split)*(1/total_PVs))
SE_clicks = np.sqrt(p_split*(1-p_split)*(1/total_clicks))

m_pv = 1.96 * SE_PV
m_clicks = 1.96 * SE_clicks

print('confidence interval around the fraction of PVs expected in the control group: ['
      , round(p_split-m_pv,4), ', '
      , round(observed_PV_control,4), ', '
      , round(p_split+m_pv,4),']')

print('confidence interval around the fraction of clicks expected in the control group: ['
      , round(p_split-m_clicks,4), ', '
      , round(observed_clicks_control,4), ', '
      , round(p_split+m_clicks,4),']')

confidence interval around the fraction of PVs expected in the control group: [ 0.4988 ,  0.5006 ,  0.5012 ]
confidence interval around the fraction of clicks expected in the control group: [ 0.4959 ,  0.5005 ,  0.5041 ]
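
Click-through probability is a ratio rather than a count, so the binomial split check above does not apply to it directly. One way to check it, sketched here under the assumption of a pooled standard error, is a confidence interval for the difference in proportions. The function name and the example counts are hypothetical and are not the experiment's totals.

```python
import math

def diff_in_proportions_ci(x_cont, n_cont, x_expt, n_expt, z=1.96):
    """Return (lower, diff, upper) for p_expt - p_cont using a pooled standard error."""
    p_pool = (x_cont + x_expt) / (n_cont + n_expt)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_expt))
    diff = x_expt / n_expt - x_cont / n_cont
    return diff - z * se, diff, diff + z * se

# Hypothetical counts for illustration only: clicks and pageviews per group.
lower, diff, upper = diff_in_proportions_ci(800, 10000, 830, 10000)
passes = lower <= 0 <= upper  # the sanity check passes if 0 lies inside the interval
print(round(lower, 4), round(diff, 4), round(upper, 4), passes)
```

With the notebook's data, the inputs would be `control.Clicks.sum()`, `control.Pageviews.sum()`, `experiment.Clicks.sum()` and `experiment.Pageviews.sum()`.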


Now it is time to calculate the observed values for the 2 evaluation metrics and the confidence interval for the difference between control and experiment, to see whether there was a statistically significant change. Since we need 23 days' worth of data to run a valid experiment, that's how much data I use below.

In [38]:
click_cont = control.Clicks[:23].sum()
click_expt = experiment.Clicks[:23].sum()

enroll_cont = control.Enrollments[:23].sum()
enroll_expt = experiment.Enrollments[:23].sum()

pay_cont = control.Payments[:23].sum()
pay_expt = experiment.Payments[:23].sum()

GR_cont = enroll_cont/click_cont
GR_expt = enroll_expt/click_expt
GR_pool = (enroll_cont+enroll_expt)/(click_cont+click_expt)

NR_cont = pay_cont/click_cont
NR_expt = pay_expt/click_expt
NR_pool = (pay_cont+pay_expt)/(click_cont+click_expt)

SE_GR = np.sqrt(GR_pool*(1-GR_pool)*((1/click_cont)+(1/click_expt)))
m_GR = 1.96 * SE_GR
diff_GR = GR_expt - GR_cont

SE_NR = np.sqrt(NR_pool*(1-NR_pool)*((1/click_cont)+(1/click_expt)))
m_NR = 1.96 * SE_NR
diff_NR = NR_expt - NR_cont

print('Lower bound, observed value and upper bound for the difference in gross conversion (experiment - control): ['
      , round(diff_GR-m_GR,4), ', '
      , round(diff_GR,4), ', '
      , round(diff_GR+m_GR,4) ,']')

# The whole interval lies below -dmin = -0.01
print('Statistically and practically significant')
print()
print('Lower bound, observed value and upper bound for the difference in net conversion (experiment - control): ['
      , round(diff_NR-m_NR,4), ', '
      , round(diff_NR,4), ', '
      , round(diff_NR+m_NR,4) ,']')
# The interval contains both 0 and -dmin = -0.0075
print('Neither statistically nor practically significant')

Lower bound, observed value and upper bound for the difference in gross conversion (experiment - control): [ -0.0291 ,  -0.0206 ,  -0.012 ]
Statistically and practically significant

Lower bound, observed value and upper bound for the difference in net conversion (experiment - control): [ -0.0116 ,  -0.0049 ,  0.0019 ]
Neither statistically nor practically significant
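
The significance calls can also be made mechanically from each interval and its practical significance boundary (dmin). This is a sketch of one reading of the rule: a difference is statistically significant when the interval excludes 0, and practically significant when the whole interval lies beyond ±dmin. The helper name is illustrative.

```python
def classify_interval(lower, upper, dmin):
    """Return (statistically_significant, practically_significant) for a CI on a difference."""
    statistically = not (lower <= 0 <= upper)
    # Practical significance: every value in the interval is at least dmin away from 0.
    practically = lower > dmin or upper < -dmin
    return statistically, practically

# Intervals printed above, with dmin taken from the metric definitions.
print('P_enroll_click:', classify_interval(-0.0291, -0.012, 0.01))
print('P_pay_click:   ', classify_interval(-0.0116, 0.0019, 0.0075))
```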


Now let's perform the sign test to see whether the change was positive or negative on a significant number of days. The p-values come from [this tool](https://www.graphpad.com/quickcalcs/binomial1.cfm) referenced in the course.

In [44]:
click_cont = control.Clicks[:23]
click_expt = experiment.Clicks[:23]

enroll_cont = control.Enrollments[:23]
enroll_expt = experiment.Enrollments[:23]

pay_cont = control.Payments[:23]
pay_expt = experiment.Payments[:23]

GR_cont = enroll_cont/click_cont
GR_expt = enroll_expt/click_expt

sign_GR = GR_expt - GR_cont
print('gross conversion, days the change was positive: ', sum([1 for i in sign_GR if i >= 0]))
print('gross conversion, days the change was negative: ', sum([1 for i in sign_GR if i < 0]))

NR_cont = pay_cont/click_cont
NR_expt = pay_expt/click_expt

sign_NR = NR_expt - NR_cont
print('net conversion, days the change was positive: ', sum([1 for i in sign_NR if i >= 0]))
print('net conversion, days the change was negative: ', sum([1 for i in sign_NR if i < 0]))

gross conversion, days the change was positive:  4
gross conversion, days the change was negative:  19
net conversion, days the change was positive:  10
net conversion, days the change was negative:  13


The two-tailed p-value for the gross conversion sign test (4 positive days out of 23) is 0.0026. <br>
The two-tailed p-value for the net conversion sign test (10 positive days out of 23) is 0.6776.
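
The same p-values can be computed without the online tool. The sketch below is an exact two-tailed binomial test under the null hypothesis that a positive day is as likely as a negative one; the function name is illustrative. For 4 and 10 positive days out of 23 it should reproduce the two values quoted above.

```python
from math import comb

def sign_test_p_value(successes, n):
    """Two-tailed exact binomial p-value under H0: P(positive day) = 0.5."""
    # Work with the smaller tail; the distribution is symmetric when p = 0.5.
    k = min(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(round(sign_test_p_value(4, 23), 4))   # 4 positive days out of 23
print(round(sign_test_p_value(10, 23), 4))  # 10 positive days out of 23
```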

**Recommendation:** The drop in gross conversion is statistically significant in both the confidence interval and the sign test, but I would not recommend launching this change. It does seem to succeed in discouraging students from starting a trial when they don't have time to study, but it does not increase the eventual course payment rate, which is likely the goal.

### References:
[[1]](https://docs.google.com/document/u/1/d/1aCquhIqsUApgsxQ8-SQBAigFDcfWVVohLEXcV6jWbdI/pub?embedded=True) - Udacity project description document <br>
[[2]](https://www.evanmiller.org/ab-testing/sample-size.html) - Pageviews calculator <br>
[[3]](https://www.graphpad.com/quickcalcs/binomial1.cfm) - P-value calculator <br>