In [2]:
import numpy as np
from scipy.stats import norm
import pandas as pd

# Experiment Design

## Calculating Standard Deviation Analytically

Udacity provided the following baseline values:

In [3]:
# Given baseline values
baseline = {"Cookies":40000,"Clicks":3200,"Enrollments":660,"CTP":0.08,"GrossConversion":0.20625,
           "Retention":0.53,"NetConversion":0.109313}

In [4]:
baseline

{'Cookies': 40000,
 'Clicks': 3200,
 'Enrollments': 660,
 'CTP': 0.08,
 'GrossConversion': 0.20625,
 'Retention': 0.53,
 'NetConversion': 0.109313}

Assuming a sample size of 5,000 cookies visiting the course overview page per day, I calculated the standard deviations for the chosen evaluation metrics. Because each metric has its own unit of analysis, I scaled the baseline counts down from 40,000 to 5,000 pageviews to find how many clicks and enrollments correspond to that sample.

In [5]:
# Scale baseline values for sample size of 5000 cookies
baseline["Cookies"] = 5000
baseline["Clicks"]=baseline["Clicks"]*(5000/40000)
baseline["Enrollments"]=baseline["Enrollments"]*(5000/40000)
baseline

{'Cookies': 5000,
 'Clicks': 400.0,
 'Enrollments': 82.5,
 'CTP': 0.08,
 'GrossConversion': 0.20625,
 'Retention': 0.53,
 'NetConversion': 0.109313}

The three evaluation metrics are assumed to follow a binomial distribution with probability p. In this case, the standard deviation is
<center><font size="3">$SD=\sqrt{\frac{p(1-p)}{n}}$</font></center>

where p = baseline probability of the metric event occurring and n = sample size (the number of units of analysis for that metric).
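The same formula is applied to each of the three metrics, so it can be wrapped in one helper. This is a minimal sketch rather than part of the original notebook: `binomial_sd` is a hypothetical name, and it uses the standard-library `math` module instead of NumPy so that it runs on its own.

```python
import math

def binomial_sd(p, n):
    """Analytic standard deviation of a proportion: sqrt(p * (1 - p) / n)."""
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1 - p) / n)

# Baseline probabilities, with unit-of-analysis counts scaled to 5,000 pageviews
print(round(binomial_sd(0.20625, 400), 4))   # gross conversion, n = clicks
print(round(binomial_sd(0.53, 82.5), 4))     # retention, n = enrollments
print(round(binomial_sd(0.109313, 400), 4))  # net conversion, n = clicks
```

Note that n differs by metric: gross and net conversion are measured per click, while retention is measured per enrollment.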

In [6]:
# Standard deviation for Gross Conversion
GC={}
GC["d_min"]=0.01
GC["p"]=baseline["GrossConversion"]
GC["n"]=baseline["Clicks"]
GC["sd"]=np.round(np.sqrt((GC["p"]*(1-GC["p"]))/GC["n"]),4)
GC["sd"]

0.0202

In [7]:
# Standard deviation for Retention
Ret={}
Ret["d_min"]=0.01
Ret["p"]=baseline["Retention"]
Ret["n"]=baseline["Enrollments"]
Ret["sd"]=np.round(np.sqrt((Ret["p"]*(1-Ret["p"]))/Ret["n"]),4)
Ret["sd"]

0.0549

In [8]:
# Standard deviation for Net Conversion
NetConv={}
NetConv["d_min"]=0.0075
NetConv["p"]=baseline["NetConversion"]
NetConv["n"]=baseline["Clicks"]
NetConv["sd"]=np.round(np.sqrt((NetConv["p"]*(1-NetConv["p"]))/NetConv["n"]),4)
NetConv["sd"]

0.0156

## Sizing

In order to calculate the total number of samples to adequately power the experiment, I used the following assumptions and formulas: 

 $\alpha=0.05$ and  $\beta=0.2$ where $\alpha$ represents the probability of Type 1 error and 1-$\beta$ is the power of the experiment.
 
 
The required sample size for either the control or the experiment group, for minimum practical significance $d$ and baseline value $p$ is:

<center> <font size="4"> $n = \frac{(Z_{1-\frac{\alpha}{2}}sd_1 + Z_{1-\beta}sd_2)^2}{d^2}$</font> where <br><br>
$sd_1 = \sqrt{2p(1-p)}$<br><br>
$sd_2 = \sqrt{p(1-p)+(p+d)(1-(p+d))}$ </center><br>
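As a worked example of the formula, here is the calculation for gross conversion with $p=0.20625$ and $d=0.01$. The intermediate values are rounded for display, so treat them as approximate:

$$sd_1=\sqrt{2(0.20625)(0.79375)}\approx 0.5722, \qquad sd_2=\sqrt{(0.20625)(0.79375)+(0.21625)(0.78375)}\approx 0.5772$$

$$n\approx\frac{(1.96\times 0.5722+0.8416\times 0.5772)^2}{0.01^2}\approx 25{,}835$$

This is the number of clicks needed per group. Doubling it for the two groups gives 51,670 clicks, and dividing by the baseline click-through probability of 0.08 converts that to about 645,875 pageviews.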
 
 

### Number of Samples vs. Power

In [9]:
# Helper functions for required standard deviations & sample size
def standard_devs(p,d):
    sd1=np.sqrt(2*p*(1-p))
    sd2=np.sqrt(p*(1-p)+(p+d)*(1-(p+d)))
    return [sd1,sd2]

def z_score(q):
    # Quantile of the standard normal at cumulative probability q
    return norm.ppf(q)

def sample_size(alpha,beta,sd,d):
    n=((z_score(1-alpha/2)*sd[0]+z_score(1-beta)*sd[1])**2)/(d**2)
    return np.round(n,0)

In [10]:
# alpha and beta values for experiment
alpha = 0.05
beta = 0.2

# For gross conversion
sds = standard_devs(GC['p'],GC['d_min'])
# Multiply by 2 for control & treatment groups
GC['SampleSize'] = sample_size(alpha,beta,sds,GC['d_min'])*2
# This number represents number of clicks needed
# Find number of pageviews needed
clicksPerPageview = baseline['Clicks']/baseline['Cookies']
# Divide # clicks / (clicks/cookies) to get cookies(pageview)
GC['SampleSize'] = round(GC['SampleSize']/clicksPerPageview)
GC['SampleSize']

645875.0

In [11]:
# For retention
sds = standard_devs(Ret['p'],Ret['d_min'])
# Multiply by 2 for control & treatment groups
Ret['SampleSize'] = sample_size(alpha,beta,sds,Ret['d_min'])*2
# This number represents number of enrollments needed
# Find number of pageviews needed
EnrollPerPageview = baseline['Enrollments']/baseline['Cookies']
Ret['SampleSize'] = round(Ret['SampleSize']/EnrollPerPageview)
Ret['SampleSize']

4737818.0

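This number is too large: even with all 40,000 daily pageviews diverted to the experiment, collecting 4,737,818 pageviews would take roughly 118 days. I therefore dropped retention as an evaluation metric.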

In [12]:
# For net conversion
sds = standard_devs(NetConv['p'],NetConv['d_min'])
# Multiply by 2 for control & treatment groups
NetConv['SampleSize'] = sample_size(alpha,beta,sds,NetConv['d_min'])*2
# This number represents number of clicks needed
# Find number of pageviews needed
ClicksPerPageview = baseline['Clicks']/baseline['Cookies']
NetConv['SampleSize'] = round(NetConv['SampleSize']/ClicksPerPageview)
NetConv['SampleSize']

685325.0

This is the larger of the two requirements for the evaluation metrics I will use, so the experiment needs 685,325 pageviews. The site gets about 40,000 unique cookie pageviews a day. Assuming that 75% of daily traffic is diverted to the experiment, it would take approximately 23 days to complete.

In [13]:
685325/(40000*0.75)

22.844166666666666
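The duration depends heavily on how much traffic is diverted. The sketch below repeats the same division for a few diversion fractions and rounds up to whole days. `days_needed` is a hypothetical helper, and the fractions other than 0.75 are illustrative choices, not values from the analysis.

```python
import math

def days_needed(pageviews_required, daily_pageviews, fraction_diverted):
    """Whole days needed to collect the required pageviews."""
    if not 0 < fraction_diverted <= 1:
        raise ValueError("fraction_diverted must be in (0, 1]")
    return math.ceil(pageviews_required / (daily_pageviews * fraction_diverted))

# 685,325 pageviews required, about 40,000 pageviews available per day
for fraction in (0.5, 0.75, 1.0):
    print(fraction, days_needed(685325, 40000, fraction))
```

Diverting less traffic lowers the exposure to a potentially harmful change but lengthens the experiment, so the fraction is a trade-off rather than a fixed rule.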

# Analyzing Experiment Results

## Sanity Checks

Before analyzing the results from the experiment, I performed sanity checks on the invariant metrics to verify that the experiment was conducted appropriately.

In [14]:
# Load experiment data
control = pd.read_csv('/kaggle/input/abtest/Control')
treat = pd.read_csv('/kaggle/input/abtest/Treatment')
control.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0


For each invariant metric, I want to confirm that the difference between the treatment and control groups is not statistically significant.

For the number of cookies and the number of clicks, I would expect about half to be in the control group and the other half in the treatment group. Therefore, I created a 95% confidence interval around the expected proportion in the control group (0.5) using the binomial distribution, and checked whether the observed proportion falls inside it. The standard deviation is:
$\sqrt{\frac{p(1-p)}{N}}$ where p = 0.5 is the expected proportion of samples in the control group and N = total number of samples across both groups.
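Since the same check is applied to both cookies and clicks, it can be written once as a function. The following is a sketch under stated assumptions: `split_sanity_check` is a hypothetical name, and the counts passed to it are made up for illustration rather than taken from the experiment data.

```python
import math

def split_sanity_check(n_control, n_treatment, expected=0.5, z=1.96):
    """Check whether the observed control share lies inside the
    confidence interval built around the expected share."""
    total = n_control + n_treatment
    sd = math.sqrt(expected * (1 - expected) / total)
    lower, upper = expected - z * sd, expected + z * sd
    observed = n_control / total
    return lower, upper, observed, lower <= observed <= upper

# Hypothetical counts, for illustration only
lower, upper, observed, passed = split_sanity_check(5020, 4980)
print(round(lower, 4), round(upper, 4), round(observed, 4), passed)
```

The interval is centered on the expected value, not the observed one, because the question is whether the observation is plausible under an even split.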

In [15]:
# Number of Cookies
total_pageview = control['Pageviews'].sum() + treat['Pageviews'].sum()
# Observed value
p_hat = control['Pageviews'].sum()/total_pageview
# 95% Confidence interval
# We expect about half in the control group
p = 0.5
std = np.sqrt(0.5*(1-0.5)/total_pageview)
CI = [round(p-1.96*std,4),round(p+1.96*std,4)]
CI

[0.4988, 0.5012]

In [16]:
round(p_hat,4)

0.5006

Because the observed value is inside of the confidence interval for the expected value, this sanity test passes.

In [17]:
# Number of Clicks
total_clicks = control['Clicks'].sum() + treat['Clicks'].sum()
# Observed value
p_hat = control['Clicks'].sum()/total_clicks
# 95% Confidence interval
# We expect about half in the control group
p = 0.5
std = np.sqrt(0.5*(1-0.5)/total_clicks)
CI = [round(p-1.96*std,4),round(p+1.96*std,4)]
CI

[0.4959, 0.5041]

In [18]:
round(p_hat,4)

0.5005

For click-through probability, I want to make sure that the CTPs in control and treatment groups are the same. In order to create a 95% confidence interval, I used the following formulas:

<center><font size="4">$SD_{pool}=\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{N_{cont}}+\frac{1}{N_{exp}}\right)}$</font></center>
and <br> <center><font size="5"> $\hat{p}_{pool}=\frac{x_{cont}+x_{exp}}{N_{cont}+N_{exp}}$ </font></center>
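The pooled formulas are reused for click-through probability and for both conversion metrics, so a small helper keeps the steps in one place. This is a sketch, not the notebook's own code: `pooled_margin` is a hypothetical name, and the counts in the example are invented for illustration.

```python
import math

def pooled_margin(x_cont, n_cont, x_exp, n_exp, z=1.96):
    """Return the pooled probability, the pooled standard deviation,
    and the margin of error (z times the pooled standard deviation)."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    sd_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    return p_pool, sd_pool, z * sd_pool

# Hypothetical counts: 100 of 1000 events in control, 120 of 1000 in treatment
p_pool, sd_pool, margin = pooled_margin(100, 1000, 120, 1000)
diff = 120 / 1000 - 100 / 1000
print(round(p_pool, 4), round(sd_pool, 4), round(margin, 4), round(diff, 4))
```

For a sanity check the interval is centered on 0 (`-margin` to `+margin`), while for an effect size test it is centered on the observed difference.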

In [19]:
# Click-through probability
# Expected difference is 0.
N_cont = control['Pageviews'].sum()
N_exp = treat['Pageviews'].sum()
X_cont = control['Clicks'].sum()
X_exp = treat['Clicks'].sum()
CTP_cont = X_cont/N_cont
CTP_exp = X_exp/N_exp
diff_hat = CTP_exp-CTP_cont
p_pool = (X_cont+X_exp)/(N_cont+N_exp)
sd_pool = np.sqrt(p_pool*(1-p_pool)*((1/N_cont)+(1/N_exp)))
# Margin of error at 95% confidence (1.96 standard errors)
margin = 1.96*sd_pool
CI = [round(0-margin,4),round(0+margin,4)]
CI

[-0.0013, 0.0013]

In [20]:
round(diff_hat,4)

0.0001

The observed difference is within the confidence interval, so this sanity check also passes.

## Result Analysis

### Effect Size Tests

In this step, I looked at the difference in evaluation metrics between the control group and the treatment group. I wanted to make sure that the difference was both statistically and practically significant.

First, I noticed that the data was missing enrollments and payments data for certain days. I decided to use data with no missing entries only.

In [21]:
# Data missing enrollments & payments for certain days
control.tail()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
32,"Wed, Nov 12",10134,801,,
33,"Thu, Nov 13",9717,814,,
34,"Fri, Nov 14",9192,735,,
35,"Sat, Nov 15",8630,743,,
36,"Sun, Nov 16",8970,722,,


In [22]:
# Excluding missing values
control1= control[~control['Enrollments'].isna()]
treat1= treat[~treat['Enrollments'].isna()]

I built a 95% confidence interval around the observed difference for each metric. A difference is statistically significant if its confidence interval does not include 0, and practically significant if the interval also excludes the practical significance boundary ($d_{min}$).
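The two criteria can be expressed as a short function. This is one interpretation, stated as an assumption: a difference counts as practically significant only when the whole interval lies beyond the boundary, that is, above `d_min` or below `-d_min`. The name `classify_interval` is hypothetical, and the intervals passed in are the ones reported by the cells in this section.

```python
def classify_interval(lower, upper, d_min):
    """Return (statistically_significant, practically_significant)
    for a confidence interval around an observed difference."""
    statistically_significant = not (lower <= 0 <= upper)
    # Assumption: the entire interval must lie beyond +/- d_min
    practically_significant = lower > d_min or upper < -d_min
    return statistically_significant, practically_significant

print(classify_interval(-0.0291, -0.012, 0.01))    # gross conversion
print(classify_interval(-0.0116, 0.0019, 0.0075))  # net conversion
```

An interval such as (-0.008, -0.002) with `d_min = 0.01` would be statistically but not practically significant under this rule.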

In [23]:
# Evaluation metrics
# Gross conversion
X_cont = control1['Enrollments'].sum()
X_exp = treat1['Enrollments'].sum()
N_cont = control1['Clicks'].sum()
N_exp = treat1['Clicks'].sum()
GC_cont = X_cont/N_cont
GC_exp = X_exp/N_exp
diff_hat = GC_exp-GC_cont
p_pool = (X_cont+X_exp)/(N_cont+N_exp)
sd_pool = np.sqrt(p_pool*(1-p_pool)*((1/N_cont)+(1/N_exp)))
# Margin of error at 95% confidence (1.96 standard errors)
margin = 1.96*sd_pool
# CI around observed value
CI = [round(diff_hat-margin,4),round(diff_hat+margin,4)]
# CI doesn't include 0 or -0.01
CI

[-0.0291, -0.012]

In [25]:
round(diff_hat,4)

-0.0206

The confidence interval for gross conversion includes neither 0 nor the practical significance boundary, -0.01, so the change is both statistically and practically significant. Gross conversion fell by 2.06 percentage points; in other words, fewer people enrolled in the course per click after the Free Trial Screener feature was implemented.

In [25]:
# Net conversion
X_cont = control1['Payments'].sum()
X_exp = treat1['Payments'].sum()
N_cont = control1['Clicks'].sum()
N_exp = treat1['Clicks'].sum()
NC_cont = X_cont/N_cont
NC_exp = X_exp/N_exp
diff_hat = NC_exp-NC_cont
p_pool = (X_cont+X_exp)/(N_cont+N_exp)
sd_pool = np.sqrt(p_pool*(1-p_pool)*((1/N_cont)+(1/N_exp)))
# Margin of error at 95% confidence (1.96 standard errors)
margin = 1.96*sd_pool
CI = [round(diff_hat-margin,4),round(diff_hat+margin,4)]
# CI includes 0 and -0.0075
CI

[-0.0116, 0.0019]

In [26]:
print(round(diff_hat,4))

-0.0049


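The difference in net conversion between the two groups is neither statistically nor practically significant: the confidence interval includes both 0 and the practical significance boundary, -0.0075. The observed difference was -0.49 percentage points.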

### Sign Tests

In this section, I performed sign tests using the day-by-day data. By checking the trends in day-by-day data, I can verify the findings in the previous section.

In [27]:
# Combine control and treatment datasets
data = control1.join(treat1,how='inner',lsuffix = '_control',rsuffix='_treat')
data.head()

Unnamed: 0,Date_control,Pageviews_control,Clicks_control,Enrollments_control,Payments_control,Date_treat,Pageviews_treat,Clicks_treat,Enrollments_treat,Payments_treat
0,"Sat, Oct 11",7723,687,134.0,70.0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9102,779,147.0,70.0,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10511,909,167.0,95.0,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9871,836,156.0,105.0,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",10014,837,163.0,64.0,"Wed, Oct 15",9793,832,140.0,94.0


In [28]:
# See on how many of the days the experiment group had a higher value
# For Gross Conversion
GC_control = data['Enrollments_control']/data['Clicks_control']
GC_treat = data['Enrollments_treat']/data['Clicks_treat']
print("Number of days when gross conversion for treatment group is higher:",
      np.sum(GC_treat>GC_control))
# For Net Conversion
NC_control = data['Payments_control']/data['Clicks_control']
NC_treat = data['Payments_treat']/data['Clicks_treat']
print("Number of days when net conversion for treatment group is higher:",
      np.sum(NC_treat>NC_control))

Number of days when gross conversion for treatment group is higher: 4
Number of days when net conversion for treatment group is higher: 10


In [29]:
# Number of days in dataset
print(len(data))

23


A sign test asks how likely it is to see a split at least as lopsided as the one counted above if there were truly no difference between the two groups. I calculated this using the binomial distribution with p=0.5 and N=number of days. A value of 0.5 is reasonable because, with no difference, I would expect the treatment group to have the higher value on about half of the days.
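The two-sided p-value can also be computed from first principles with exact binomial counts. The sketch below is an alternative illustration, not the notebook's method: `sign_test_p_value` is a hypothetical name, it needs Python 3.8+ for `math.comb`, and it doubles the smaller of the two tails so that it also works when the count is above half the number of days.

```python
from math import comb

def sign_test_p_value(successes, n_days):
    """Two-sided sign-test p-value under Binomial(n_days, 0.5)."""
    def cdf(k):
        # P(X <= k); an empty sum gives 0 when k is negative
        return sum(comb(n_days, i) for i in range(k + 1)) / 2 ** n_days
    lower_tail = cdf(successes)          # P(X <= successes)
    upper_tail = 1 - cdf(successes - 1)  # P(X >= successes)
    return min(1.0, 2 * min(lower_tail, upper_tail))

print(round(sign_test_p_value(4, 23), 4))   # gross conversion: 4 of 23 days
print(round(sign_test_p_value(10, 23), 4))  # net conversion: 10 of 23 days
```

Because the null distribution is symmetric at p=0.5, 4 of 23 days and 19 of 23 days give the same p-value.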

In [32]:
# Two-sided sign-test p-value: double the lower-tail binomial cdf
# (valid here because k is below n/2 for both metrics)
# For GC
from scipy.stats import binom
p = 0.5
n = len(data)
k = np.sum(GC_treat>GC_control)
round(binom.cdf(k,n,p)*2,4)

0.0026

In [33]:
# For NC
p = 0.5
n = len(data)
k = np.sum(NC_treat>NC_control)
round(binom.cdf(k,n,p)*2,4)

0.6776

A change is significant if its sign-test p-value is less than 0.05. According to the sign test, the change in gross conversion is significant but the change in net conversion is not, which agrees with the effect size tests.

## Conclusion & Recommendations

The goal of the Free Trial Screener feature was to see whether filtering students based on time commitment could improve student experience and allow coaches to focus on students who are likely to complete the course. If the feature achieved this without significantly reducing the number of students who continue past the free trial, it would be successful.

Based on the results above, I recommend not launching the feature and pursuing other experiments instead. The feature caused a significant change in gross conversion, which represents the probability of enrolling per click, but not in net conversion, which is more closely related to Udacity's business goals. This means that enrollment decreased, but there was no significant change in the number of students staying past the 14-day boundary to pay for the course.
