# Project: A/B Test - Free Trial Screener
Author: Wei Chong Ong

## Table of Contents
<ul>
<li><a href="#design">Experiment Design</a></li>
<li><a href="#analysis">Experiment Analysis</a></li>
<li><a href="#follow-up">Follow-Up Experiment</a></li>
</ul>

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import norm, binom_test
import statsmodels.api as sm

### Import Data

In [2]:
df_control = pd.read_csv('Final Project Results - Control.csv')
df_experiment = pd.read_csv('Final Project Results - Experiment.csv')

In [3]:
df = df_control.merge(df_experiment, on='Date', suffixes=('_cont', '_exp'))
df

| | Date | Pageviews_cont | Clicks_cont | Enrollments_cont | Payments_cont | Pageviews_exp | Clicks_exp | Enrollments_exp | Payments_exp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7723 | 687 | 134.0 | 70.0 | 7716 | 686 | 105.0 | 34.0 |
| 1 | Sun, Oct 12 | 9102 | 779 | 147.0 | 70.0 | 9288 | 785 | 116.0 | 91.0 |
| 2 | Mon, Oct 13 | 10511 | 909 | 167.0 | 95.0 | 10480 | 884 | 145.0 | 79.0 |
| 3 | Tue, Oct 14 | 9871 | 836 | 156.0 | 105.0 | 9867 | 827 | 138.0 | 92.0 |
| 4 | Wed, Oct 15 | 10014 | 837 | 163.0 | 64.0 | 9793 | 832 | 140.0 | 94.0 |
| 5 | Thu, Oct 16 | 9670 | 823 | 138.0 | 82.0 | 9500 | 788 | 129.0 | 61.0 |
| 6 | Fri, Oct 17 | 9008 | 748 | 146.0 | 76.0 | 9088 | 780 | 127.0 | 44.0 |
| 7 | Sat, Oct 18 | 7434 | 632 | 110.0 | 70.0 | 7664 | 652 | 94.0 | 62.0 |
| 8 | Sun, Oct 19 | 8459 | 691 | 131.0 | 60.0 | 8434 | 697 | 120.0 | 77.0 |
| 9 | Mon, Oct 20 | 10667 | 861 | 165.0 | 97.0 | 10496 | 860 | 153.0 | 98.0 |


<a id='design'></a>
# Experiment Design

## Metric Choice

**Invariant Metrics**  
Invariant metrics are the metrics that shouldn’t change across our experiment and control when we run our experiment.
- **Number of cookies**: The number of unique cookies that view the course overview page. This is a population sizing metric. Since the unit of diversion is cookie, the cookies are being randomly assigned to the experiment and control groups. So, we should definitely have roughly the same number of cookies in each group. 
- **Number of clicks**: The number of unique cookies that click the "Start free trial" button, which happens before the free trial screener is triggered. Since cookies click the "Start free trial" button before they are asked how much time they have available to devote to the course, this metric shouldn't be affected by the change.
- **Click-through-probability**: The number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. Since this metric is calculated from the two invariant metrics above, this metric shouldn't change as well.

**Evaluation Metrics**  
Evaluation metrics are the metrics in which we expect to see a change, and are relevant to the business goals we aim to achieve. In this case, the business goal of Udacity is to improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.
- **Gross conversion**: The number of user-ids to complete checkout and enroll in the free trial divided by the number of unique cookies to click the "Start free trial" button. This metric should change if the screener affects the number of enrollments, and we would expect it to decrease because the screener discourages users who are likely to leave during the free trial from enrolling.
- **Retention**: The number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of user-ids to complete checkout. If the screener reduces the number of enrolled users who leave during the free trial, we would expect this metric to increase, because a larger share of enrolled users would stay past the free trial and make a payment.
- **Net conversion**: The number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. In contrast to gross conversion, we would expect this metric to remain the same or increase, because we do not want the screener to significantly reduce the number of students who continue past the free trial and eventually complete the course.

Number of user-ids: This is not a good invariant metric because it is expected to decrease as a result of the experiment. It could serve as an evaluation metric because it tracks the number of students who enroll in the free trial, but it is not the best choice because it is not normalized.

`Any place "unique cookies" are mentioned, the uniqueness is determined by day. (That is, the same cookie visiting on different days would be counted twice.) User-ids are automatically unique since the site does not allow the same user-id to enroll twice.`
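To make the three definitions concrete, here is a minimal sketch that computes them from raw counts. The helper name `evaluation_metrics` is hypothetical, and the counts are taken from the first control-group day in the data shown earlier (Sat, Oct 11), purely for illustration.

```python
def evaluation_metrics(clicks, enrollments, payments):
    """Return the three evaluation metrics from raw counts (hypothetical helper)."""
    return {
        'gross_conversion': enrollments / clicks,   # enrollments per click
        'retention': payments / enrollments,        # payments per enrollment
        'net_conversion': payments / clicks,        # payments per click
    }


# Counts for the control group on Sat, Oct 11
metrics = evaluation_metrics(clicks=687, enrollments=134, payments=70)
for name, value in metrics.items():
    print('{}: {:.4f}'.format(name, value))
```

Note that net conversion equals gross conversion multiplied by retention, which is why a drop in gross conversion has to be offset by a rise in retention for net conversion to hold steady.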

Would a change in any one of your evaluation metrics be sufficient? Would you want to see multiple metrics all move or not move at the same time in order to launch? 

> There are two gates to pass in order to be sure that the experiment is a success and launch the change:
1. The first criterion is that the experiment group must have a statistically and practically significant **lower gross conversion** than the control group. This would show that the screener does reduce the number of enrollments. 
2. The second criterion is that there is a significantly **higher retention** in the experiment group and that the **net conversion** at least remains the **same** or **does not decrease significantly**. Because the experiment group has fewer enrollments, similar retention in both groups would lead to a decrease in net conversion. Therefore, the control and experiment groups should have at least a similar number of payments, that is, the screener should not significantly reduce the number of students who continue past the free trial, so that net conversion does not decrease. A significant increase in net conversion would be welcome, but it is not required.

### Hypotheses

The hypothesis was that the free trial screener sets clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time, without significantly reducing the number of students to continue past the free trial and eventually complete the course. Since this statement involves several evaluation metrics, we can define the null and alternative hypotheses for each of the metrics:

**Gross Conversion:** 
**$$H_0: CVR_{exp}-CVR_{cont} \ge 0$$**
**$$H_1: CVR_{exp}-CVR_{cont} \lt 0$$**

**Retention:** 
**$$H_0: CVR_{exp}-CVR_{cont} \le 0$$**
**$$H_1: CVR_{exp}-CVR_{cont} \gt 0$$**

**Net Conversion:** 
**$$H_0: CVR_{exp}-CVR_{cont} \le 0$$**
**$$H_1: CVR_{exp}-CVR_{cont} \gt 0$$**


## Measuring Variability of Evaluation Metrics

This dataframe contains rough estimates of the baseline values for the following metrics.

In [4]:
df_baseline_values = pd.read_csv('Final Project Baseline Values.csv', header=None, names=['Metric', 'Value'])
pd.options.display.max_colwidth = 100
df_baseline_values

| | Metric | Value |
|---|---|---|
| 0 | Unique cookies to view course overview page per day: | 40000.0 |
| 1 | Unique cookies to click "Start free trial" per day: | 3200.0 |
| 2 | Enrollments per day: | 660.0 |
| 3 | Click-through-probability on "Start free trial": | 0.08 |
| 4 | Probability of enrolling, given click: | 0.20625 |
| 5 | Probability of payment, given enroll: | 0.53 |
| 6 | Probability of payment, given click | 0.109313 |


In [5]:
page_views = df_baseline_values['Value'][0]
clicks = df_baseline_values['Value'][1]
enroll = df_baseline_values['Value'][2]
gross_cvr = df_baseline_values['Value'][4]
retent = df_baseline_values['Value'][5]
net_cvr = df_baseline_values['Value'][6]

### Analytic Estimate of Variability
Estimate the standard error of the evaluation metrics using a sample size of 5000 cookies visiting the course overview page and the baseline values. Since the evaluation metrics are probabilities, we can assume a binomial distribution.

$$SE = \sqrt{\frac{p(1-p)}{n}}$$

In [6]:
sample_page_views = 5000
sample_clicks = clicks / page_views * sample_page_views
sample_enroll = enroll / page_views * sample_page_views

# Compute standard error (Binomial)
SE_gross = np.sqrt(gross_cvr * (1 - gross_cvr) / sample_clicks)
SE_retent = np.sqrt(retent * (1 - retent) / sample_enroll)
SE_net = np.sqrt(net_cvr * (1 - net_cvr) / sample_clicks)

print('Standard error of gross conversion = {:.4f}'.format(SE_gross))
print('Standard error of retention = {:.4f}'.format(SE_retent))
print('Standard error of net conversion = {:.4f}'.format(SE_net))

Standard error of gross conversion = 0.0202
Standard error of retention = 0.0549
Standard error of net conversion = 0.0156
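
As a worked check of these figures: scaling the baseline daily counts down to 5000 pageviews gives $n_{clicks} = 5000 \times 3200 / 40000 = 400$ and $n_{enroll} = 5000 \times 660 / 40000 = 82.5$. Substituting the baseline probabilities into the binomial standard error formula then gives:

$$SE_{gross} = \sqrt{\frac{0.20625 \times 0.79375}{400}} \approx 0.0202$$

$$SE_{retention} = \sqrt{\frac{0.53 \times 0.47}{82.5}} \approx 0.0549$$

$$SE_{net} = \sqrt{\frac{0.109313 \times 0.890687}{400}} \approx 0.0156$$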


Since the unit of analysis for gross conversion and net conversion (cookie) is the same as the unit of diversion, their analytic variability should be close to the empirical variability. For retention, however, the analytic estimate is likely to underestimate the true variability because its unit of analysis (user-id) differs from the unit of diversion (cookie). With cookie-based diversion, every cookie is an independent random draw, so the independence assumption behind the analytic calculation holds for cookie-based metrics. It no longer holds for retention, because a single user-id can be associated with multiple cookies, and those cookies are correlated with each other, which can increase the variability considerably. Therefore, we would want to collect an empirical estimate of the variability of retention if we had time.
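
The effect of correlated observations on variability can be illustrated with a small simulation. This is a toy model under stated assumptions, not the experiment's data: the probability `p`, the number of units `n_users`, and the cluster size `k` are all hypothetical, and observations within a unit are made perfectly correlated to show the extreme case.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.5          # assumed success probability (hypothetical)
n_users = 200    # independent units per simulated sample (hypothetical)
k = 4            # correlated observations per unit (hypothetical)
n_sims = 2000    # number of simulated samples

# Each unit's outcome is drawn once and repeated k times, so the k
# observations belonging to one unit are perfectly correlated.
user_outcomes = rng.binomial(1, p, size=(n_sims, n_users))
proportions = np.repeat(user_outcomes, k, axis=1).mean(axis=1)

# Analytic SE that (wrongly) treats all n_users * k observations as independent
analytic_se = np.sqrt(p * (1 - p) / (n_users * k))
# Empirical SE measured across the simulated samples
empirical_se = proportions.std(ddof=1)

print('Analytic SE : {:.4f}'.format(analytic_se))
print('Empirical SE: {:.4f}'.format(empirical_se))
print('Ratio       : {:.2f}'.format(empirical_se / analytic_se))
```

In this setup the empirical standard error comes out at roughly $\sqrt{k}$ times the analytic one, which is the kind of under-estimate an empirical variability check is meant to catch.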

## Sizing

### Number of Samples vs. Power

Calculate the required sample size, i.e., the total number of pageviews (across both groups) needed to adequately power the experiment. We want enough power for each metric, using an alpha of 0.05 and a beta of 0.2. 

### Minimum Detectable Effect for Each Evaluation Metric
The minimum detectable effect is the smallest difference that would have to be observed for the change to be meaningful to the business.

In [7]:
dmin_gross = 0.01
dmin_retent = 0.01
dmin_net = 0.0075

### Option 1: Calculate required sample size iteratively using the analytic estimate

In [8]:
# Compute critical z score
def get_z_crit(alpha):
    return -norm.ppf(alpha / 2)


# s is the pooled standard error for N=1 in each group, which is sqrt(p*(1-p)*(1/1 + 1/1))
def get_beta(z_crit, s, d_min, N):
    SE = s / np.sqrt(N)
    return norm.cdf(z_crit * SE, loc=d_min, scale=SE)


# Compute the required sample size (per group) for the experiment
def required_size(s, d_min, alpha=0.05, beta=0.2):
    z_crit = get_z_crit(alpha)
    # Start at N = 1 and keep the N whose achieved beta is closest to the target beta
    size = 1
    best_gap = abs(get_beta(z_crit, s, d_min, size) - beta)
    for N in range(2, 40000):
        gap = abs(get_beta(z_crit, s, d_min, N) - beta)
        if gap < best_gap:
            best_gap = gap
            size = N
    return size

In [9]:
sample_clicks_gross_conversion_per_group = required_size(s=SE_gross * np.sqrt(sample_clicks) * np.sqrt(2), d_min=dmin_gross)
sample_page_views_gross_conversion_per_group = sample_clicks_gross_conversion_per_group * page_views / clicks
print('Gross Conversion')
print('clicks: {}' .format(int(sample_clicks_gross_conversion_per_group) * 2))
print('pageviews: {}' .format(int(sample_page_views_gross_conversion_per_group) * 2))

print('')

sample_enroll_retention_per_group = required_size(s=SE_retent * np.sqrt(sample_enroll) * np.sqrt(2), d_min=dmin_retent)
sample_page_views_retention_per_group = sample_enroll_retention_per_group * page_views / enroll
print('Retention')
print('enrollments: {}' .format(int(sample_enroll_retention_per_group) * 2))
print('pageviews: {}' .format(int(sample_page_views_retention_per_group) * 2))

print('')

sample_clicks_net_conversion_per_group = required_size(s=SE_net * np.sqrt(sample_clicks) * np.sqrt(2), d_min=dmin_net)
sample_page_views_net_conversion_per_group = sample_clicks_net_conversion_per_group * page_views / clicks
print('Net Conversion')
print('clicks: {}' .format(int(sample_clicks_net_conversion_per_group) * 2))
print('pageviews: {}' .format(int(sample_page_views_net_conversion_per_group) * 2))

Gross Conversion
clicks: 51398
pageviews: 642474

Retention
enrollments: 78206
pageviews: 4739756

Net Conversion
clicks: 54342
pageviews: 679274


### Option 2: Calculate required sample size using [Online Calculator](https://www.evanmiller.org/ab-testing/sample-size.html) or [Equation (2.33)](http://vanbelle.org/chapters%5Cwebchapter2.pdf)

**Gross Conversion:**  
clicks = 51670  
pageviews = 645875

**Retention:**  
enrollments = 78230  
pageviews = 4741212

**Net Conversion:**  
clicks = 54826  
pageviews = 685325

-> Pageviews required for the experiment: 4741212
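
For reference, here is a sketch of a closed-form sample size calculation for comparing two proportions. It is a standard textbook-style formula, and it is an assumption on my part that the linked calculator works the same way; the function name `sample_size_per_group` is hypothetical. For the gross conversion baseline it gives a per-group click count in line with the figure listed above.

```python
import numpy as np
from scipy.stats import norm


def sample_size_per_group(p, d_min, alpha=0.05, beta=0.2):
    """Approximate per-group sample size for a two-sided two-proportion test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(1 - beta)
    # Standard deviation of the difference under the null (both groups at p)
    sd_null = np.sqrt(2 * p * (1 - p))
    # Standard deviation under the alternative (one group shifted by d_min)
    sd_alt = np.sqrt(p * (1 - p) + (p + d_min) * (1 - p - d_min))
    return (z_alpha * sd_null + z_beta * sd_alt) ** 2 / d_min ** 2


# Gross conversion: baseline probability 0.20625, minimum detectable effect 0.01
n_clicks = sample_size_per_group(p=0.20625, d_min=0.01)
# Convert clicks to pageviews using the baseline click-through rate (3200 / 40000)
n_pageviews = n_clicks * 40000 / 3200

print('clicks per group   : {:.0f}'.format(np.ceil(n_clicks)))
print('pageviews (2 groups): {:.0f}'.format(np.ceil(2 * n_pageviews)))
```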

### Duration vs. Exposure

What fraction of traffic would you divert to this experiment and, given this, how many days would you need to run the experiment?

> Assuming there were no other experiments we wanted to run simultaneously, we could divert 100% of the traffic to our experiment. Given 40,000 pageviews per day, the experiment would take about 119 days, which is unreasonably long. This could cause problems for Udacity. First, we could not perform any other experiment during this period (opportunity cost). Second, if the change harms the user experience (frustrated students, inefficient use of coaching resources) and decreases conversion rates, we would not be able to confirm it for about four months (business risk). Therefore, it seems more reasonable to test only gross conversion and net conversion and to drop retention as an evaluation metric. The required sample size is then much smaller, and the experiment would take about 17 days with 100% diversion.

Is the change risky enough that you wouldn't want to run on all traffic?

> Since the experiment does not involve a feature that is sensitive with regard to potential media coverage, it does not seem very risky, and we could divert a high percentage of traffic to it. Still, since something could always go wrong during implementation, we may not want to divert all of our traffic. Diverting 80% of the traffic (22 days) seems reasonable. The rest of the traffic could then be used to run other comparable experiments at the same time.

In [10]:
# fraction of daily traffic -> 1 = All daily traffic
exposure = 1        

# Retention
duration = 4741212 / (page_views * exposure)
print('Duration (Retention, 100% traffic) : {:.1f} days' .format(duration))

# Net Conversion
duration = 685325 / (page_views * exposure)
print('Duration (Net Conversion, 100% traffic): {:.1f} days' .format(duration))

Duration (Retention, 100% traffic) : 118.5 days
Duration (Net Conversion, 100% traffic): 17.1 days


In [11]:
# 80% of the daily traffic
exposure = 0.8

# Net Conversion
duration = 685325 / (page_views * exposure)
print('Duration (Net Conversion, 80% traffic): {:.1f} days' .format(duration))

Duration (Net Conversion, 80% traffic): 21.4 days
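
Since an experiment runs for whole days, the fractional durations above are rounded up in practice. A small sketch of that rounding for a few exposure levels follows; the helper name `days_needed` and the 50% exposure level are hypothetical additions for illustration.

```python
import math


def days_needed(required_pageviews, daily_pageviews, exposure):
    """Whole days needed to collect the required pageviews at a given exposure."""
    return math.ceil(required_pageviews / (daily_pageviews * exposure))


# Net conversion requirement (685325 pageviews) at different exposure levels
for exposure in (1.0, 0.8, 0.5):
    days = days_needed(685325, 40000, exposure)
    print('Exposure {:.0%}: {} days'.format(exposure, days))
```

Lower exposure limits the risk of the change but lengthens the experiment in inverse proportion.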


<a id='analysis'></a>
# Experiment Analysis

## Sanity Checks

To ensure the experiment was run properly, we first need to make sure that all invariant metrics pass the sanity check. We would expect these metrics not to differ significantly between the control and experiment groups. Otherwise, this would imply that something is wrong with the experiment setup and that our results are biased. We might then have to look at the day-by-day data to see whether it offers any insight into what is causing the problem.

#### Sanity Check: Population sizing invariants
Check whether the population sizing invariant metrics are equivalent between the two groups. Since the invariant metrics (number of cookies and number of clicks) are a simple count that should be randomly split between the two groups, we can use a binomial test. As the sample size is large, we can assume that the sampling distribution of the sample proportion approximates a normal distribution.

In [12]:
# Expected proportion (evenly split between the two groups)
actual_prop = 0.5

def population_sizing_invariant(N_cont, N_exp, p, alpha=0.05):
    
    # Compute standard deviation of the sampling distribution for the proportion (standard error) of 0.5
    SE = np.sqrt(p * (1 - p) / (N_cont + N_exp))
    
    # Compute margin of error with 95% confidence interval
    MOE = SE * get_z_crit(alpha)
    lb, ub = p - MOE, p + MOE
    
    # Observed proportion
    phat = N_cont / (N_cont + N_exp)
    print('Observed proportion: {:.4f}'.format(phat))
    print('Confidence Interval:[{:.4f},{:.4f}]'.format(lb, ub))
    
    # Check if the observed proportion falls within the confidence interval
    if (phat >= lb) & (phat <= ub):
        return print('-> Sanity check passed')
    else:
        return print('-> Sanity check failed')

In [13]:
print('Number of cookies that views the page')
population_sizing_invariant(df['Pageviews_cont'].sum(), df['Pageviews_exp'].sum(), actual_prop)

print('')

print('Number of clicks on "Start free trial"')
population_sizing_invariant(df['Clicks_cont'].sum(), df['Clicks_exp'].sum(), actual_prop)

Number of cookies that views the page
Observed proportion: 0.5006
Confidence Interval:[0.4988,0.5012]
-> Sanity check passed

Number of clicks on "Start free trial"
Observed proportion: 0.5005
Confidence Interval:[0.4959,0.5041]
-> Sanity check passed


### Alternative 1: compute p-value using z test

In [14]:
stat, pval_ztest = sm.stats.proportions_ztest(df['Pageviews_cont'].sum(), df['Pageviews_exp'].sum() + df['Pageviews_cont'].sum(), value = 0.5, alternative = 'two-sided')

if pval_ztest > 0.05:
    print('-> Sanity check passed')
else:
    print('-> Sanity check failed')

-> Sanity check passed


### Alternative 2: compute p-value using the exact binomial test

In [15]:
pval_binom = binom_test(df['Pageviews_cont'].sum(), df['Pageviews_exp'].sum() + df['Pageviews_cont'].sum(), p=0.5, alternative='two-sided')

if pval_binom > 0.05:
    print('-> Sanity check passed')
else:
    print('-> Sanity check failed')

-> Sanity check passed


#### Sanity Check: Click-Through Probability invariants
Check whether the observed difference in click-through probability between the control and experiment groups falls within the 95% confidence interval around the expected difference of zero.
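
Written out, the check uses the pooled probability and pooled standard error, where $X$ denotes the number of clicks and $N$ the number of pageviews in each group:

$$\hat{p}_{pool} = \frac{X_{cont} + X_{exp}}{N_{cont} + N_{exp}}, \qquad SE_{pool} = \sqrt{\hat{p}_{pool}\,(1 - \hat{p}_{pool})\left(\frac{1}{N_{cont}} + \frac{1}{N_{exp}}\right)}$$

$$\hat{d} = \frac{X_{exp}}{N_{exp}} - \frac{X_{cont}}{N_{cont}}, \qquad \text{pass if } |\hat{d} - 0| \le z_{1-\alpha/2} \cdot SE_{pool}$$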

In [16]:
# Expected difference in click-through probability between the two groups
actual_diff = 0

def CTP_invariant(N_cont, N_exp, X_cont, X_exp, d, alpha=0.05):
    
    # Compute pooled standard error assuming that both of the variances from the two samples are equal
    p_pool = (X_cont + X_exp) / (N_cont + N_exp)
    SE_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / N_cont + 1 / N_exp))
    
    # Compute margin of error with 95% confidence interval
    MOE = SE_pool * get_z_crit(alpha)
    lb, ub = d - MOE, d + MOE
    
    # Observed difference in click-through probability
    dhat = (X_exp / N_exp) - (X_cont / N_cont)
    print('Observed difference in proportion: {:.4f}'.format(dhat))
    print('Confidence Interval:[{:.4f},{:.4f}]'.format(lb, ub))
    
    # Check if the observed difference in click-through probability falls within the confidence interval
    if (dhat >= lb) & (dhat <= ub):
        return print('-> Sanity check passed')
    else:
        return print('-> Sanity check failed')

In [17]:
print('Click-through probability on "Start free trial"')
CTP_invariant(df['Pageviews_cont'].sum(), df['Pageviews_exp'].sum(), df['Clicks_cont'].sum(), df['Clicks_exp'].sum(), actual_diff)

Click-through probability on "Start free trial"
Observed difference in proportion: 0.0001
Confidence Interval:[-0.0013,0.0013]
-> Sanity check passed


### Alternative: compute p-value using z test

In [18]:
n_clicks = np.array([df['Clicks_cont'].sum(), df['Clicks_exp'].sum()])
n = np.array([df['Pageviews_cont'].sum(), df['Pageviews_exp'].sum()])
stat, pval_ztest = sm.stats.proportions_ztest(n_clicks, n, alternative = 'two-sided')

if pval_ztest > 0.05:
    print('-> Sanity check passed')
else:
    print('-> Sanity check failed')

-> Sanity check passed


## Result Analysis and Interpretation

#### Check for Practical and Statistical Significance
Next, for our evaluation metrics, calculate a 95% confidence interval for the difference between the experiment and control groups, and check whether each metric is statistically and/or practically significant. 

A metric is statistically significant if the confidence interval does not include 0 (that is, we can be confident there was a change), and it is practically significant if the confidence interval does not include the practical significance boundary (that is, we can be confident there is a change that matters to the business.)
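
The two definitions can be sketched as a small decision helper. The function name `classify_interval` is hypothetical, and it uses a slightly stricter reading of practical significance (the whole interval must lie beyond the boundary on one side), which is an assumption rather than the exact check used elsewhere in this analysis. The intervals passed in are the ones reported in the results of this analysis.

```python
def classify_interval(lb, ub, d_min):
    """Classify a confidence interval for a difference (hypothetical helper).

    Statistically significant: the interval excludes 0.
    Practically significant: the whole interval lies beyond -d_min or +d_min.
    """
    statistically_significant = not (lb <= 0 <= ub)
    practically_significant = ub < -d_min or lb > d_min
    return statistically_significant, practically_significant


# Gross conversion interval with d_min = 0.01
print(classify_interval(-0.0291, -0.0120, 0.01))
# Net conversion interval with d_min = 0.0075
print(classify_interval(-0.0116, 0.0019, 0.0075))
```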

#### Effect Size Tests

In [19]:
def effect_size_test(N_cont, N_exp, X_cont, X_exp, dtrue, dmin, alpha=0.05):
    
    # Compute pooled standard error assuming that both of the variances from the two samples are equal
    p_pool = (X_cont + X_exp) / (N_cont + N_exp)
    SE_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / N_cont + 1 / N_exp))
    
    # Compute margin of error with 95% confidence interval
    MOE = SE_pool * get_z_crit(alpha)
    
    # Observed difference in proportion
    dhat = (X_exp / N_exp) - (X_cont / N_cont)
    lb, ub = dhat - MOE, dhat + MOE
    print('Observed difference in proportion: {:.4f}'.format(dhat))
    print('Confidence Interval:[{:.4f},{:.4f}]'.format(lb, ub))
    
    # Check if it is statistically and practically significant
    if (dtrue >= lb) & (dtrue <= ub):
        print('-> Not statistical significant')
    else:
        print('-> Statistical significant')
    
    if (dmin >= lb) & (dmin <= ub):
        print('-> Not practical significant')
    elif (-dmin >= lb) & (-dmin <= ub):
        print('-> Not practical significant')
    else:
        print('-> Practical significant')

In [20]:
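print('Gross Conversion')
# Clicks are restricted to the first 23 days, presumably the days for which enrollments and payments are recorded
effect_size_test(df['Clicks_cont'][:23].sum(), df['Clicks_exp'][:23].sum(), df['Enrollments_cont'].sum(), df['Enrollments_exp'].sum(), actual_diff, dmin_gross)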

print('')

print('Net Conversion')
effect_size_test(df['Clicks_cont'][:23].sum(), df['Clicks_exp'][:23].sum(), df['Payments_cont'].sum(), df['Payments_exp'].sum(), actual_diff, dmin_net)

Gross Conversion
Observed difference in proportion: -0.0206
Confidence Interval:[-0.0291,-0.0120]
-> Statistical significant
-> Practical significant

Net Conversion
Observed difference in proportion: -0.0049
Confidence Interval:[-0.0116,0.0019]
-> Not statistical significant
-> Not practical significant


#### Sign Tests
For each evaluation metric, do a sign test using the day-by-day breakdown and report the p-value of the sign test and whether the result is statistically significant.

In [21]:
# Compute the difference in gross conversion between two groups
df['diff_gross'] = df['Enrollments_exp'] / df['Clicks_exp'] - df['Enrollments_cont'] / df['Clicks_cont']

# Compute the difference in net conversion between two groups
df['diff_net'] = df['Payments_exp'] / df['Clicks_exp'] - df['Payments_cont'] / df['Clicks_cont']

In [22]:
def sign_test(n, k, p):
    # n = total number of days
    # k = total number of days with negative change (experiment - control)
    p_value = binom_test(k, n, p, alternative='two-sided')
    
    # Check if the result is statistically significant
    print('p-value: {:.4f}'.format(p_value))
    if (p_value < 0.05):
        print('-> Statistically significant')
    else:
        print('-> Not statistically significant')

In [23]:
# Both observed differences (gross and net conversion) are negative
# If there was no difference, then there would be a 50% chance of a negative change on each day.

print('Gross Conversion')
sign_test(df['Enrollments_cont'].count(), (df['diff_gross']<0).sum(), 0.5)

print('')

print('Net Conversion')
sign_test(df['Payments_cont'].count(), (df['diff_net']<0).sum(), 0.5)

Gross Conversion
p-value: 0.0026
-> Statistically significant

Net Conversion
p-value: 0.6776
-> Not statistically significant
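
The p-values above can be cross-checked with an exact sign test written from first principles. The counts used here are assumptions inferred from the reported p-values rather than values printed in this analysis: 23 days in total, with 19 negative days for gross conversion and 10 days in the smaller tail for net conversion. The helper name `sign_test_p_value` is hypothetical.

```python
from math import comb


def sign_test_p_value(n, k):
    """Two-sided exact sign test p-value under a 50% chance of a negative day."""
    # Under p = 0.5 the distribution is symmetric, so double the smaller tail
    tail = min(k, n - k)
    p_one_tail = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * p_one_tail)


print('Gross conversion: {:.4f}'.format(sign_test_p_value(23, 19)))
print('Net conversion  : {:.4f}'.format(sign_test_p_value(23, 10)))
```

With these assumed counts the sketch reproduces the reported values of 0.0026 and 0.6776.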


### Results Interpretation 
The Bonferroni correction was not used because we require both evaluation metrics to meet their criteria in order to launch the change. The Bonferroni correction is useful for reducing Type I errors (deciding to launch the change when there is actually no significant difference), which matter most when a change in any single metric would be enough to launch. It does not help with Type II errors (deciding not to launch the change when there is actually a significant difference), which are the greater concern when all metrics must pass.

Both the effect size tests and the sign tests produced the same results. The difference in gross conversion is both statistically and practically significant, whereas the difference in net conversion is neither. For net conversion, the lower bound of the 95% confidence interval (-0.0116) is negative and lies beyond the negative practical significance boundary (-0.0075). 
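
For comparison, here is a sketch of what the Bonferroni correction would have changed had it been applied. It assumes two evaluation metrics (gross and net conversion) and the same overall alpha of 0.05; the variable names are hypothetical.

```python
from scipy.stats import norm

alpha = 0.05
n_metrics = 2  # gross conversion and net conversion

# Bonferroni splits the overall alpha evenly across the metrics
alpha_bonferroni = alpha / n_metrics

z_unadjusted = norm.ppf(1 - alpha / 2)
z_bonferroni = norm.ppf(1 - alpha_bonferroni / 2)

print('Per-metric alpha: {:.3f}'.format(alpha_bonferroni))
print('Critical z (unadjusted): {:.4f}'.format(z_unadjusted))
print('Critical z (Bonferroni): {:.4f}'.format(z_bonferroni))
```

The larger critical value widens every confidence interval, which makes each metric harder to pass and so raises the chance of a Type II error when all metrics must pass.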

### Recommendation
Gross conversion decreased significantly, both statistically and practically. This means the free trial screener succeeded in its first aim, which was to set clearer expectations for students upfront about the course load so that they may reconsider joining the free trial, thereby improving coaches' capacity to support students who are likely to complete the course. However, the confidence interval and the observed difference for net conversion suggest that the screener is more likely to decrease the number of payments than to increase it, which would eventually decrease revenue. Hence, I would recommend not launching the change for now. 

Nevertheless, the free trial screener may be beneficial in the long term. It may increase the total number of students who opt for the freely available materials. Those students may want to access the free course materials first and then upgrade to the paid course, so the screener may still eventually increase net conversion. However, it would take a long time for this effect to materialize.

<a id='follow-up'></a>
# Follow-Up Experiment

Since the previous experiment has successfully reduced the number of enrollments without significantly reducing the number of students who continue past the free trial, we may next run another experiment with a new feature focused on the free trial period to test whether retention increases. Optimizing the free trial is a long-term process, so we need to invest time and resources in different tactics to see whether they really work.

**Possible features:** 
- Not asking for a credit card at signup
- Personalized e-mails
- Experiment with different trial periods
- Provide an early discount if they purchase the course before the free trial expiration

**Null Hypothesis:** The feature does not reduce the number of frustrated students who cancel early in the course

**Alternative Hypothesis:** The feature does reduce the number of frustrated students who cancel early in the course

**Unit of diversion:** User-id  

**Invariant metric:** Number of user-ids  
Number of users who enroll in the free trial. This is a population sizing metric. Since the unit of diversion is user-id, we should definitely have roughly the same number of user-ids in each group.

**Evaluation metric:** Retention   
The number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. We would expect this metric to increase if the feature successfully decreases the number of frustrated students leaving the course early.

**Final thought:**  
We could also try to increase the practical significance boundary, the significance level $\alpha$, or $\beta$ to reduce the size as well as the duration of the experiment. The practical significance boundary of retention should be higher than in the prior experiment because the follow-up experiment will be more expensive and risky. If the change is risky enough that we don't want to expose a large percentage of our traffic to it, we should run the experiment for a longer period with less traffic.
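
To see how each of these levers affects the required size, here is a sketch using a simplified equal-variance approximation for a two-proportion test. The function name `approx_size_per_group` and the alternative values (a boundary of 0.02, an alpha of 0.10, a beta of 0.30) are hypothetical; the baseline retention of 0.53 and boundary of 0.01 come from the earlier sizing.

```python
from scipy.stats import norm


def approx_size_per_group(p, d_min, alpha=0.05, beta=0.2):
    """Simplified per-group sample size, assuming equal variance in both groups."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)
    return 2 * z ** 2 * p * (1 - p) / d_min ** 2


scenarios = {
    'baseline (d_min=0.01, alpha=0.05, beta=0.2)': dict(d_min=0.01),
    'larger boundary (d_min=0.02)': dict(d_min=0.02),
    'larger alpha (alpha=0.10)': dict(d_min=0.01, alpha=0.10),
    'larger beta (beta=0.30)': dict(d_min=0.01, beta=0.30),
}

for label, kwargs in scenarios.items():
    size = approx_size_per_group(p=0.53, **kwargs)
    print('{:<45s}: {:>8.0f} enrollments per group'.format(label, size))
```

Because the size scales with $1/d_{min}^2$, doubling the practical significance boundary cuts the required sample to about a quarter, which is a much stronger lever than relaxing $\alpha$ or $\beta$.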