# Udacity AB Test (by Google) - Final Project 
## Can new approach effectively reduce online course cancellations? 

### Problem Statement / Overview
This analysis aims to help Udacity determine whether adopting a new user flow can help the company provide better service and support to students. Currently, when students visit a Udacity course page, there are two options: "start free trial" and "access course materials".
- "Start free trial": after clicking this, students are asked to enter their payment information and are then enrolled in a 14-day free trial; after 14 days, they are charged automatically unless they cancel the course first.
- "Access course materials": students can access course videos and quizzes, but not the coaching service, a verified certificate, or project submission.

However, students who feel they cannot commit enough time to the course tend to cancel. Therefore, Udacity wants to test a change in which, **after students click "Start free trial", they are asked how much time they can commit to studying per week**; students who don't have enough time (under 5 hours/week) would be told "Udacity courses usually require a greater time commitment for successful completion" and asked whether they would prefer to access the course materials for free instead.

If students know what to expect upfront, they can make a wiser decision, and Udacity can improve its overall user experience. In other words, Udacity wants to reduce the total number of users who enroll without significantly reducing other metrics such as pageviews or course completion.

More details here: [Instructions](Instructions.pdf)


### User flows and Metrics Choice


![User flows and metrics](User_flows.png)


- **Unit of diversion**: the unit used to assign traffic to the treatment and control groups (given by Udacity). The unit of diversion here is **a cookie**.

**Invariant Metrics**: metrics that are not affected by the change 

- Num of cookies: number of unique daily cookies that view the course overview page
- Num of clicks: number of unique daily cookies that click "start free trial"
- CTR (click-through rate): num of clicks / num of cookies, i.e. the share of cookies that click "start free trial"

**Evaluation Metrics**: metrics that are affected by the change

- Gross conversion ($GC$): number of user_ids enrolled / number of clicks
- Retention ($RR$): number of user_ids who paid / number of user_ids enrolled
- Net conversion ($NC$): number of payments / number of clicks
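To make the three definitions concrete, the sketch below computes them from raw counts. The function name `evaluation_metrics` is hypothetical; the counts are those of the first day of the control group (Sat, Oct 11) in the results data loaded later in this notebook.

```python
def evaluation_metrics(clicks, enrollments, payments):
    """Return (gross conversion, retention, net conversion) from raw counts."""
    gc = enrollments / clicks      # enrolled user_ids per click
    rr = payments / enrollments    # paying user_ids per enrolled user_id
    nc = payments / clicks         # paying user_ids per click
    return gc, rr, nc

# First day of the control group: 687 clicks, 134 enrollments, 70 payments
gc, rr, nc = evaluation_metrics(clicks=687, enrollments=134, payments=70)
print(f"GC = {gc:.4f}, RR = {rr:.4f}, NC = {nc:.4f}")
```

Note that, by construction, $NC = GC \times RR$, so the three metrics are not independent of each other.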

### Hypothesis 



- $H_0$: asking students about their time commitment after they click 'start free trial' **WILL NOT** affect $GC$, $RR$, and $NC$;
- $H_a$: asking students about their time commitment after they click 'start free trial' **WILL** affect $GC$, $RR$, and $NC$;




In [6]:
import numpy as np 
import pandas as pd

import math as m

from scipy import stats 
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.proportion import binom_test

In [7]:

# some baseline numbers given by Udacity

baseline = pd.DataFrame({
    'Metrics': ["Num of Cookies", "Num of Clicks", "Num of User-ids", "CTR", "GC", "RR", "NC"],
    'Estimator': [40000, 3200, 660, 0.08, 0.20625, 0.53, 0.109313],
    'dmin': [3000, 240, -50, 0.01, -0.01, 0.01, 0.0075]
})

baseline 

| | Metrics | Estimator | dmin |
|---|---|---|---|
| 0 | Num of Cookies | 40000.0 | 3000.0 |
| 1 | Num of Clicks | 3200.0 | 240.0 |
| 2 | Num of User-ids | 660.0 | -50.0 |
| 3 | CTR | 0.08 | 0.01 |
| 4 | GC | 0.20625 | -0.01 |
| 5 | RR | 0.53 | 0.01 |
| 6 | NC | 0.109313 | 0.0075 |


## Scaling & Measuring variability

- Given a sample size of 5000 cookies, what is the standard error (SE) of each evaluation metric?


In [8]:
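# find estimated size for number of clicks and number of user ids

baseline['Scaled'] = np.nan

def get_scale(data, sample, index):
    # ratio between the daily baseline pageviews and the chosen sample size
    multiple = data.loc[0, 'Estimator'] / sample
    return data.loc[index, 'Estimator'] / multiple

# use .loc rather than chained indexing so the assignment writes to baseline itself
baseline.loc[0, 'Scaled'] = 5000
for i in [1, 2]:
    baseline.loc[i, 'Scaled'] = get_scale(baseline, 5000, i)

baseline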

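| | Metrics | Estimator | dmin | Scaled |
|---|---|---|---|---|
| 0 | Num of Cookies | 40000.0 | 3000.0 | 5000.0 |
| 1 | Num of Clicks | 3200.0 | 240.0 | 400.0 |
| 2 | Num of User-ids | 660.0 | -50.0 | 82.5 |
| 3 | CTR | 0.08 | 0.01 | NaN |
| 4 | GC | 0.20625 | -0.01 | NaN |
| 5 | RR | 0.53 | 0.01 | NaN |
| 6 | NC | 0.109313 | 0.0075 | NaN |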




$SE = \sqrt{(\hat{p} *(1-\hat{p}) / N)}$

- ... where $\hat{p}$ is the baseline estimate of each evaluation metric
- and $N$ is the estimated number of units in that metric's denominator (clicks for $GC$ and $NC$, enrolled user_ids for $RR$), given 5000 cookies
- **Assumption**: with a large sample, we can use the normal distribution to estimate each metric's variability.
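As a worked example, take gross conversion with $\hat{p} = 0.20625$ and $N = 400$ clicks (the 3200 daily clicks scaled down to 5000 cookies):

$$SE_{GC} = \sqrt{\frac{0.20625 \times (1 - 0.20625)}{400}} \approx 0.0202$$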



In [9]:

baseline['SE'] = np.nan

def get_se(N, p):
    return np.sqrt(p*(1-p)/N)

# use .loc rather than chained indexing so the assignment writes to baseline itself
# Gross conversion: number of user_ids enrolled / number of clicks
baseline.loc[4, 'SE'] = get_se(baseline.loc[1, 'Scaled'], baseline.loc[4, 'Estimator'])
# Retention: number of user_ids who paid / number of user_ids enrolled
baseline.loc[5, 'SE'] = get_se(baseline.loc[2, 'Scaled'], baseline.loc[5, 'Estimator'])
# Net conversion: number of user_ids who paid / number of clicks
baseline.loc[6, 'SE'] = get_se(baseline.loc[1, 'Scaled'], baseline.loc[6, 'Estimator'])

baseline

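| | Metrics | Estimator | dmin | Scaled | SE |
|---|---|---|---|---|---|
| 0 | Num of Cookies | 40000.0 | 3000.0 | 5000.0 | NaN |
| 1 | Num of Clicks | 3200.0 | 240.0 | 400.0 | NaN |
| 2 | Num of User-ids | 660.0 | -50.0 | 82.5 | NaN |
| 3 | CTR | 0.08 | 0.01 | NaN | NaN |
| 4 | GC | 0.20625 | -0.01 | NaN | 0.020231 |
| 5 | RR | 0.53 | 0.01 | NaN | 0.054949 |
| 6 | NC | 0.109313 | 0.0075 | NaN | 0.015602 |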


## Sizing 
### Choosing Number of Samples given Power - how many pageviews/cookies we need?


As given by Udacity, $\alpha$ (Type I error rate) is 0.05 and $\beta$ (Type II error rate) is 0.2, so the power of the experiment is 1 - 0.2 = 0.8.

#### Sample size of each group, assuming different standard deviations for the two groups

> An online calculator we can apply here: https://www.evanmiller.org/ab-testing/sample-size.html 
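The calculator's result can also be approximated in code. The sketch below uses a common closed-form approximation for a two-sided test of two proportions; it is not necessarily the exact method the calculator implements, so results may differ slightly from the figures used in this notebook. The function name `sample_size_per_group` and its parameters are hypothetical, and `d_min` is passed as a positive absolute change.

```python
from math import sqrt
from statistics import NormalDist

def sample_size_per_group(p, d_min, alpha=0.05, beta=0.2):
    """Approximate per-group sample size to detect an absolute change d_min from baseline p."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = NormalDist().inv_cdf(1 - beta)         # quantile matching the desired power
    sd_null = sqrt(2 * p * (1 - p))                 # both groups at the baseline rate
    sd_alt = sqrt(p * (1 - p) + (p + d_min) * (1 - (p + d_min)))  # one group shifted by d_min
    return (z_alpha * sd_null + z_beta * sd_alt) ** 2 / d_min ** 2

# Gross conversion: baseline 0.20625, minimum detectable change 0.01
n_gc = sample_size_per_group(p=0.20625, d_min=0.01)
print(round(n_gc))
```

For gross conversion this lands at roughly 25,835 units per group, in line with the calculator figure used below. Here a "unit" is a click, not a pageview, which is why the group sizes still need to be converted into pageviews.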





In [10]:

a = 0.05
b = 0.2

baseline['Sample'] = np.nan

# With the help of the calculator, for GC, RR, and NC, we have...

group_sample = [25835, 39155, 27413]

# However, these group sizes are counted in clicks (GC, NC) or enrollments (RR), not pageviews.
# Use the click-through rate or the enrollment rate (see the user flow diagram) to convert
# them into the number of pageviews required across both groups.

def get_sample(n, index):
    if index in [4, 6]:
        # GC and NC are measured per click: divide by CTR
        return round((n * 2) / baseline.loc[3, 'Estimator'])
    else:
        # RR is measured per enrollment: divide by enrollments per pageview
        enroll_rate = baseline.loc[2, 'Estimator'] / baseline.loc[0, 'Estimator']
        return round((n * 2) / enroll_rate)

for i in [4, 5, 6]:
    baseline.loc[i, 'Sample'] = get_sample(group_sample[i - 4], i)

print(f'The required maximum sample size is {baseline["Sample"].max()}')

baseline

The required maximum sample size is 4746061.0


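| | Metrics | Estimator | dmin | Scaled | SE | Sample |
|---|---|---|---|---|---|---|
| 0 | Num of Cookies | 40000.0 | 3000.0 | 5000.0 | NaN | NaN |
| 1 | Num of Clicks | 3200.0 | 240.0 | 400.0 | NaN | NaN |
| 2 | Num of User-ids | 660.0 | -50.0 | 82.5 | NaN | NaN |
| 3 | CTR | 0.08 | 0.01 | NaN | NaN | NaN |
| 4 | GC | 0.20625 | -0.01 | NaN | 0.020231 | 645875.0 |
| 5 | RR | 0.53 | 0.01 | NaN | 0.054949 | 4746061.0 |
| 6 | NC | 0.109313 | 0.0075 | NaN | 0.015602 | 685325.0 |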


### Choosing Duration vs. Exposure

$D = N_{required} / N_{traffic}$
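Written out, $N_{traffic}$ is the daily baseline of 40000 pageviews multiplied by the fraction of traffic diverted to the experiment (called $f$ here for illustration), and the result is rounded up to whole days. For example, testing net conversion ($N_{required} = 685325$) with $f = 0.5$:

$$D = \left\lceil \frac{685325}{40000 \times 0.5} \right\rceil = \lceil 34.27 \rceil = 35 \text{ days}$$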

In [11]:

def get_d(index, traffic):

    if index == 5:
        # if test all three evaluation metrics or retention rate only
        d = m.ceil(baseline["Sample"].max()/(40000*traffic))
    elif index == 6:
        # if test net conversion rate only (or together with gross conversion)
        d = m.ceil(baseline["Sample"][6]/(40000*traffic))
    elif index == 4:
        # if test gross conversion rate only
        d = m.ceil(baseline["Sample"][4]/(40000*traffic))
    return d

def get_d_traffic(traffic):
    for i in [4,5,6]:
        print(f'Test {baseline["Metrics"][i]} with {traffic*100}% traffic requires: {get_d(i, traffic)} days.')

In [12]:
# assume we have 100% traffic (if less than 100%, there might be other experiments running on the website)

get_d_traffic(1)

Test GC with 100% traffic requires: 17 days.
Test RR with 100% traffic requires: 119 days.
Test NC with 100% traffic requires: 18 days.


In [13]:
# assume we have 70% traffic (if less than 100%, there might be other experiments running on the website)

get_d_traffic(0.7)


Test GC with 70.0% traffic requires: 24 days.
Test RR with 70.0% traffic requires: 170 days.
Test NC with 70.0% traffic requires: 25 days.


In [14]:
# assume we have 50% traffic (if less than 100%, there might be other experiments running on the website)

get_d_traffic(0.5)


Test GC with 50.0% traffic requires: 33 days.
Test RR with 50.0% traffic requires: 238 days.
Test NC with 50.0% traffic requires: 35 days.


#### Based on this analysis, retention rate is clearly not a good metric choice because it takes too long to test. Therefore, choosing Gross Conversion (GC) and Net Conversion (NC) is more practical.

## Sanity Check - do the invariant metrics stay the same for both groups?

 - Number of cookies (pageviews)
 - Number of clicks 
 

In [15]:
# load the data

cont = pd.read_excel('Final Project Results.xlsx', sheet_name='Control')
exp = pd.read_excel('Final Project Results.xlsx', sheet_name='Experiment')

cont.head()

| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7723 | 687 | 134.0 | 70.0 |
| 1 | Sun, Oct 12 | 9102 | 779 | 147.0 | 70.0 |
| 2 | Mon, Oct 13 | 10511 | 909 | 167.0 | 95.0 |
| 3 | Tue, Oct 14 | 9871 | 836 | 156.0 | 105.0 |
| 4 | Wed, Oct 15 | 10014 | 837 | 163.0 | 64.0 |


In [16]:
cont.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         37 non-null     object 
 1   Pageviews    37 non-null     int64  
 2   Clicks       37 non-null     int64  
 3   Enrollments  23 non-null     float64
 4   Payments     23 non-null     float64
dtypes: float64(2), int64(2), object(1)
memory usage: 1.6+ KB


In [43]:
exp.head()

| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7716 | 686 | 105.0 | 34.0 |
| 1 | Sun, Oct 12 | 9288 | 785 | 116.0 | 91.0 |
| 2 | Mon, Oct 13 | 10480 | 884 | 145.0 | 79.0 |
| 3 | Tue, Oct 14 | 9867 | 827 | 138.0 | 92.0 |
| 4 | Wed, Oct 15 | 9793 | 832 | 140.0 | 94.0 |


In [18]:
exp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         37 non-null     object 
 1   Pageviews    37 non-null     int64  
 2   Clicks       37 non-null     int64  
 3   Enrollments  23 non-null     float64
 4   Payments     23 non-null     float64
dtypes: float64(2), int64(2), object(1)
memory usage: 1.6+ KB


In [24]:
exp_sample = exp.Pageviews.sum()
cont_sample = cont.Pageviews.sum()
total_sample = exp_sample+cont_sample
print(f'Total samples: {total_sample}\nTreatment group sample: {exp_sample}\ncontrol group samples: {cont_sample}')


Total samples: 690203
Treatment group sample: 344660
control group samples: 345543


In [25]:
exp_clicks = exp.Clicks.sum()
cont_clicks = cont.Clicks.sum()
total_clicks = exp_clicks+cont_clicks
print(f'Total clicks: {total_clicks}\nTreatment group clicks: {exp_clicks}\ncontrol group clicks: {cont_clicks}')


Total clicks: 56703
Treatment group clicks: 28325
control group clicks: 28378


#### Number of cookies (pageviews), number of clicks, and CTR


- Standard error: $SE = \sqrt{(\hat{p} *(1-\hat{p}) / N)}$
- Confidence interval: $p \pm Z_{\alpha/2}*SE$

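... where $\alpha = 0.05$ and $p = 0.5$ is the expected share of pageviews (or clicks) assigned to the experiment group. For CTR, the interval is centered on the experiment group's CTR, and the control group's CTR is checked against it.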
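For a concrete instance of this interval, the sketch below checks the pageview split using the totals printed above (344660 in the experiment group, 345543 in the control group). It computes the SE from the expected share $p = 0.5$, which is an assumption of this sketch, and the function name `sanity_check_count` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def sanity_check_count(n_exp, n_cont, p=0.5, alpha=0.05):
    """Check whether the experiment group's share of a count is consistent with p."""
    total = n_exp + n_cont
    se = sqrt(p * (1 - p) / total)                      # SE of the share under the expected split
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se   # half-width of the interval
    lower, upper = p - margin, p + margin
    observed = n_exp / total
    return observed, lower, upper, lower <= observed <= upper

observed, lower, upper, ok = sanity_check_count(344660, 345543)
print(f"observed = {observed:.4f}, C.I. = [{lower:.4f}, {upper:.4f}], pass = {ok}")
```

The observed share of about 0.4994 falls inside the interval of roughly 0.4988 to 0.5012, so the pageview split is consistent with an even assignment.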




In [38]:
def passed(metric, p, alpha):
    if metric == 'Pageviews':
        p_hat = exp_sample/total_sample
        se = np.sqrt(p_hat*(1-p_hat)/total_sample)
    elif metric == 'Clicks':
        p_hat = exp_clicks/total_clicks
        se = np.sqrt(p_hat*(1-p_hat)/total_clicks)
    elif metric == 'CTR':
        p = exp_clicks/exp_sample
        p_hat = cont_clicks / cont_sample
        se = np.sqrt(p_hat*(1-p_hat)*((1/cont_sample)+(1/exp_sample)))

    upper = p+(stats.norm.ppf(1-alpha/2)*se)
    lower = p-(stats.norm.ppf(1-alpha/2)*se)

    if p_hat>upper or p_hat <lower:
        result = 'Not pass'
    else:
        result = 'Pass'

    return p_hat, upper, lower, result

for i in ['Pageviews', 'Clicks','CTR']:
    p_hat, upper, lower, result = passed(i, 0.5, 0.05)
    print(f'Test: {i}, p_hat: {round(p_hat,4)}, upper bound ci: {round(upper,4)}, lower bound: {round(lower,4)}; Test result: {result}')


Test: Pageviews, p_hat: 0.4994, upper bound ci: 0.5012, lower bound: 0.4988; Test result: Pass
Test: Clicks, p_hat: 0.4995, upper bound ci: 0.5041, lower bound: 0.4959; Test result: Pass
Test: CTR, p_hat: 0.0821, upper bound ci: 0.0835, lower bound: 0.0809; Test result: Pass


According to the sanity check results, all invariant metrics passed.

### Test Analysis

Since we will only use gross conversion and net conversion as evaluation metrics, we can further refine our null and alternative hypotheses.

For gross conversion ($GC$)
- $H_{0}: GC_{experiment} = GC_{control}$
- $H_{a}: GC_{experiment} \ne GC_{control}$

For net conversion ($NC$)
- $H_{0}: NC_{experiment} = NC_{control}$
- $H_{a}: NC_{experiment} \ne NC_{control}$

In the dataset given by Udacity, there are only 23 days of observations for both payments and enrollments. Therefore, we need to restrict the clicks used in this test to the same 23 days.


In [229]:
exp_clicks_adj = exp.Clicks[:23].sum()
cont_clicks_adj = cont.Clicks[:23].sum()
total_clicks_adj = exp_clicks_adj+cont_clicks_adj

def test_result(metric, diff_min, alpha = 0.05, diff_null = 0):
    n_exp = exp_clicks_adj
    n_cont = cont_clicks_adj
    # .sum() skips the missing days, so only the 23 observed days are counted
    if metric == 'GC':
        x_exp = exp.Enrollments.sum()
        x_cont = cont.Enrollments.sum()
    elif metric == 'NC':
        x_exp = exp.Payments.sum()
        x_cont = cont.Payments.sum()

    p_exp = x_exp / n_exp
    p_cont = x_cont / n_cont

    # pooled probability and pooled standard error of the difference
    p_pool = (x_exp + x_cont) / (n_exp + n_cont)
    se_pool = np.sqrt(p_pool*(1-p_pool)*((1/n_exp)+(1/n_cont)))

    diff = p_exp - p_cont

    # confidence interval around the observed difference
    me = stats.norm.ppf(1-(alpha)/2) * se_pool
    upper = diff + me
    lower = diff - me

    # statistically significant: the C.I. excludes the null difference
    if diff_null > upper or diff_null < lower:
        result_ss = 'Statistically significant: Yes, 0 does not fall within C.I.'
    else:
        result_ss = 'Statistically significant: No, 0 falls within C.I.'

    # practically significant: the whole C.I. lies beyond -|d_min| or +|d_min|
    d = abs(diff_min)
    if upper < -d or lower > d:
        result_ps = 'Practically significant: Yes, the whole C.I. lies beyond d_min.'
    else:
        result_ps = 'Practically significant: No, the C.I. does not lie entirely beyond d_min.'

    print(f'The test results for {metric} are: \n{result_ss}\n{result_ps}')
    print(f'Observed diff: {round(diff,4)}\nd_min: {round(diff_min,4)}')
    print(f'Lower bound {round(lower,4)}; Upper bound {round(upper,4)}\n')


In [230]:
# the loop variable is named `metric` so it does not shadow the `math` alias `m`
for metric, i in zip(['GC', 'NC'],[4,6]):
    test_result(metric, baseline['dmin'][i])

The test results for GC are: 
Statistically significant: Yes, 0 does not fall within C.I.
Practically significant: Yes, the whole C.I. lies beyond d_min.
Observed diff: -0.0206
d_min: -0.01
Lower bound -0.0291; Upper bound -0.012

The test results for NC are: 
Statistically significant: No, 0 falls within C.I.
Practically significant: No, the C.I. does not lie entirely beyond d_min.
Observed diff: -0.0049
d_min: 0.0075
Lower bound -0.0116; Upper bound 0.0019



## Summary

Only Gross Conversion is significantly affected by the change made by Udacity. Its 95% C.I. (-0.0291 to -0.012) does not include 0, so the decrease is statistically significant, and the whole interval lies below the minimal required difference of -0.01, so the decrease is also practically significant.

Net Conversion, on the other hand, is neither statistically nor practically significant. Its C.I. (-0.0116 to 0.0019) includes 0 and does not lie entirely beyond the minimal required difference (0.0075). The interval does extend below -0.0075, so a practically meaningful drop in payments cannot be ruled out.

This test is designed to determine whether asking about time commitment can improve student experience without significantly reducing the number of students who continue the course and pay. We can therefore conclude that the new feature reduces the number of students who enroll, while the number of students who pay does not change significantly.

#### Overall, it is still a good choice to launch the feature, but this decision partly depends on how much risk to Net Conversion Udacity's business and cost structure can tolerate. Without further information this is hard to analyze, but it is a good direction for the next step.
