# Bayes Factor Experiment with Optional Stopping

In [1]:
using ProgressMeter: @showprogress
using DataFrames

using Plots
using Random
using BayesianExperiments

# number of columns in a dataframe to show 
ENV["COLUMNS"] = 200;

Optional stopping refers to the practice of peeking at data and make decision whether or not to continue an experiment. Such practice is usually prohibited in the frequentist AB testing framework. By using simulation-based result, *Rouder (2014)*[2] showed that a Bayes factor experiment with optional stopping can be valid with proper interpretation of the Bayesian quantities. 

This notebook follows the examples in *Schönbrodt et al. (2016)*[1] to conduct the error analysis of Bayes factor based experiment with optional stopping.

The simulation will be conducted by following steps:

1. Choose a threshold of Bayes factor for decision making. For example, if the threshold is set to 10, when a Bayes factor of $\text{BF}_{10}$ is larger than 10, or less than 1/10, we decide we have collected enough evidence and stop the experiment.
2. Choose a prior distribituion for the effect size under $H_1$. We will use the `StudentTEffectSize` model in the package. You can check the definition of `NormalEffectSize` model from the docstring by typing `?NormalEffectSize`.
3. Run a minimum number of steps (20 as the same in the paper), increase the sample size. Compute the bayes factor at each step.
4. As soon as the bayes factor value reached or exceeded the one of the thresholds as set in (1), or the maximum number of steps is reached, we will stop the experiment.

Some constants used in the simulation:

* Number of simulations: 5000
* Minimum number of steps: 20

The simulation function can be quickly created based on our package:

In [2]:
function simulate(δ, n, σ0; r=0.707, thresh=9, minsample=20)
    # we will use two-sided decision rule for bayes factor
    rule = TwoSidedBFThresh(thresh)
    
    # the prior distribution of effect size,
    # r is the standard deviation
    model = StudentTEffectSize(r=r)
    
    # setup the experiment
    experiment = ExperimentBF(model=model, rule=rule)
    
    # create a sample with size n, the effect size is 
    # specified as δ
    xs = rand(Normal(δ, 1), n)
    
    i = 0
    # specify the stopping condition
    while (i < n) & (experiment.winner === nothing)
        i += 1
        
        # if minimum number of sample is not reached, 
        # keep collecting data
        if i < minsample
            continue
        end
        
        stats = NormalStatistics(xs[1:i])
        experiment.stats = stats
        decide!(experiment)
    end
    experiment
end


simulate (generic function with 1 method)

## Case when alternative $\delta = 0$

When alternative $\delta > 0$, the error rate relates to the false positive rate. 

In [3]:
#deltas = collect(range(0, 1.5, step=0.2));
delta = 0.0
rs = [0.707, 1.0, 1.414];
threshs = [3, 5, 7, 10];
totalnum = length(rs)*length(threshs);

paramsgrid = reshape(collect(Base.Iterators.product(rs, threshs)), (totalnum, 1));
paramsgrid = [(r=r, thresh=thresh) for (r, thresh) in paramsgrid];
@show length(paramsgrid);

length(paramsgrid) = 12


In [4]:
n =  1000
ns = 5000
minsample = 20

sim_result1 = DataFrame(
    delta=Float64[], 
    r=Float64[], 
    thresh=Float64[], 
    num_sim=Int64[], 
    num_null=Int64[], 
    num_alt=Int64[],
    err_rate=Float64[], 
    avg_sample_size=Int64[])

@showprogress for params in paramsgrid
    delta = 0
    r = params.r
    thresh = params.thresh
    winners = []
    samplesizes = []
    for _ in 1:ns
        experiment = simulate(delta, n, r, thresh=thresh, minsample=minsample)
        push!(winners, experiment.winner)
        push!(samplesizes, experiment.stats.n)
    end
    
    num_null = sum(winners .== "null")
    num_alt = sum(winners .== "alternative")
    
    err_rate = num_alt/ns
    avg_sample_size = mean(samplesizes)
    push!(sim_result1, (delta, r, thresh, ns, num_null, num_alt, err_rate, convert(Int64, round(avg_sample_size))))
end

[32mProgress: 100%|█████████████████████████████████████████| Time: 0:01:00[39m


In [5]:
sim_result1

Unnamed: 0_level_0,delta,r,thresh,num_sim,num_null,num_alt,err_rate,avg_sample_size
Unnamed: 0_level_1,Float64,Float64,Float64,Int64,Int64,Int64,Float64,Int64
1,0.0,0.707,3.0,5000,4666,334,0.0668,24
2,0.0,1.0,3.0,5000,4691,309,0.0618,24
3,0.0,1.414,3.0,5000,4714,286,0.0572,24
4,0.0,0.707,5.0,5000,4727,273,0.0546,48
5,0.0,1.0,5.0,5000,4743,256,0.0512,48
6,0.0,1.414,5.0,5000,4724,276,0.0552,47
7,0.0,0.707,7.0,5000,4750,244,0.0488,99
8,0.0,1.0,7.0,5000,4779,215,0.043,100
9,0.0,1.414,7.0,5000,4733,259,0.0518,102
10,0.0,0.707,10.0,5000,4755,181,0.0362,207


## Case when alternative $\delta > 0$

We create a grid of combinations of all parameters.

In [6]:
deltas = collect(range(0.1, 1.0, step=0.2));
rs = [0.707, 1.0, 1.414];
threshs = [3, 5, 7, 10];
totalnum = length(deltas)*length(rs)*length(threshs);

paramsgrid = reshape(collect(Base.Iterators.product(deltas, rs, threshs)), (totalnum, 1));
paramsgrid = [(delta=delta, r=r, thresh=thresh) for (delta, r, thresh) in paramsgrid]
@show length(paramsgrid);
@show paramsgrid[1:5];

length(paramsgrid) = 60
paramsgrid[1:5] = NamedTuple{(:delta, :r, :thresh),Tuple{Float64,Float64,Int64}}[(delta = 0.1, r = 0.707, thresh = 3), (delta = 0.3, r = 0.707, thresh = 3), (delta = 0.5, r = 0.707, thresh = 3), (delta = 0.7, r = 0.707, thresh = 3), (delta = 0.9, r = 0.707, thresh = 3)]


The simulation is similar to the $\delta=0$ case. When alternative $\delta > 0$, the error rate relates to the false negative evidence.

In [7]:
n =  1000
ns = 5000
minsample = 20

sim_result2 = DataFrame(
    delta=Float64[], 
    r=Float64[], 
    thresh=Float64[], 
    num_sim=Int64[], 
    num_null=Int64[], 
    num_alt=Int64[],
    err_rate=Float64[], 
    avg_sample_size=Int64[])

@showprogress for params in paramsgrid
    delta=params.delta
    r = params.r
    thresh = params.thresh
    winners = []
    samplesizes = []
    for _ in 1:ns
        experiment = simulate(delta, n, r, thresh=thresh, minsample=minsample)
        push!(winners, experiment.winner)
        push!(samplesizes, experiment.stats.n)
    end
    
    num_null = sum(winners .== "null")
    num_alt = sum(winners .== "alternative")
    err_rate = 1-num_alt/ns
    avg_sample_size = mean(samplesizes)
    push!(sim_result2, (delta, r, thresh, ns, num_null, num_alt, 
            err_rate, convert(Int64, round(avg_sample_size))))
end

[32mProgress: 100%|█████████████████████████████████████████| Time: 0:02:35[39m


Simulation result when $\delta=0.5$

In [8]:
sim_result2 |>
    df -> filter(x->x.delta==0.5, df)|>
    df -> sort(df, [:delta, :r])

Unnamed: 0_level_0,delta,r,thresh,num_sim,num_null,num_alt,err_rate,avg_sample_size
Unnamed: 0_level_1,Float64,Float64,Float64,Int64,Int64,Int64,Float64,Int64
1,0.5,0.707,3.0,5000,785,4215,0.157,26
2,0.5,0.707,5.0,5000,86,4914,0.0172,33
3,0.5,0.707,7.0,5000,1,4999,0.0002,37
4,0.5,0.707,10.0,5000,0,5000,0.0,40
5,0.5,1.0,3.0,5000,731,4269,0.1462,26
6,0.5,1.0,5.0,5000,81,4919,0.0162,33
7,0.5,1.0,7.0,5000,1,4999,0.0002,36
8,0.5,1.0,10.0,5000,0,5000,0.0,40
9,0.5,1.414,3.0,5000,744,4256,0.1488,25
10,0.5,1.414,5.0,5000,86,4914,0.0172,34


## Evaluate the simulation result with Type I & II Error and FDR

As pointed out by [2], we can evaluate the simulation result from the perspective of false discovery rate. Here we assume there is a 50-50 chance that the data is from either the null model or alternative model. 

We can merge the two simulations results by the prior standard deviation $r$ and threshold of bayes factor. In the merged dataframe, each row represents a simulation with the 5000 samples from the null model and 5000 samples from the alternative model with the corresponding parameters ($r$, $thresh$, $\delta_1$).

In [9]:
sim_result = leftjoin(sim_result1, sim_result2, 
    on=[:r, :thresh, :num_sim],
    renamecols= "_0" => "_1"
);

sim_result.num_dis = sim_result.num_alt_0 + sim_result.num_alt_1;
sim_result.num_false_dis = sim_result.num_alt_0;
sim_result.fdr = sim_result.num_false_dis ./ sim_result.num_dis;

sim_result.type1_error = sim_result.num_alt_0 ./ sim_result.num_sim;
#sim_result.type2_error = 1 .- sim_result.num_alt_1 ./ sim_result.num_sim;
sim_result.power = sim_result.num_alt_1 ./ sim_result.num_sim;

sim_result = sim_result |>
    df -> select(df, [:delta_1, :r, :thresh, :num_sim, :num_null_0, :num_alt_0, 
        :num_null_1, :num_alt_1, :type1_error, :power, :fdr]);

In [10]:
sim_result.num_dis = sim_result.num_alt_0 + sim_result.num_alt_1;
sim_result.num_false_dis = sim_result.num_alt_0;
sim_result.fdr = sim_result.num_false_dis ./ sim_result.num_dis;

sim_result.type1_error = sim_result.num_alt_0 ./ sim_result.num_sim;
sim_result.type2_error = 1 .- sim_result.num_alt_1 ./ sim_result.num_sim;

In [11]:
sim_result = sim_result |>
    df -> select(df, [:delta_1, :r, :thresh, :num_sim, :num_null_0, :num_alt_0, 
        :num_null_1, :num_alt_1, :type1_error, :power, :fdr]);

Examples from merged dataframe:

In [12]:
sim_result |>
    df -> filter(
        x -> ((x.delta_1 == 0.1) .& (x.r == 0.707)) .|
             ((x.delta_1 == 0.1) .& (x.r == 1.0)) .|
             ((x.delta_1 == 0.3) .& (x.r == 1.0))
            , df) |>
    df -> sort(df, [:delta_1, :r, :thresh])

Unnamed: 0_level_0,delta_1,r,thresh,num_sim,num_null_0,num_alt_0,num_null_1,num_alt_1,type1_error,power,fdr
Unnamed: 0_level_1,Float64?,Float64,Float64,Int64,Int64,Int64,Int64?,Int64?,Float64,Float64,Float64
1,0.1,0.707,3.0,5000,4666,334,4451,549,0.0668,0.1098,0.378256
2,0.1,0.707,5.0,5000,4727,273,4197,799,0.0546,0.1598,0.254664
3,0.1,0.707,7.0,5000,4750,244,3595,1346,0.0488,0.2692,0.153459
4,0.1,0.707,10.0,5000,4755,181,2533,2042,0.0362,0.4084,0.0814215
5,0.1,1.0,3.0,5000,4691,309,4453,547,0.0618,0.1094,0.360981
6,0.1,1.0,5.0,5000,4743,256,4192,807,0.0512,0.1614,0.240828
7,0.1,1.0,7.0,5000,4779,215,3649,1301,0.043,0.2602,0.141821
8,0.1,1.0,10.0,5000,4776,168,2569,2053,0.0336,0.4106,0.0756416
9,0.3,1.0,3.0,5000,4691,309,2584,2416,0.0618,0.4832,0.113394
10,0.3,1.0,5.0,5000,4743,256,1136,3864,0.0512,0.7728,0.0621359


## References

1. Schönbrodt, Felix D., Eric-Jan Wagenmakers, Michael Zehetleitner, and Marco Perugini. "Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences." Psychological methods 22, no. 2 (2017): 322.
2. Deng, Alex, Jiannan Lu, and Shouyuan Chen. "Continuous monitoring of A/B tests without pain: Optional stopping in Bayesian testing." In 2016 IEEE international conference on data science and advanced analytics (DSAA), pp. 243-252. IEEE, 2016.
3. Rouder, Jeffrey N. "Optional stopping: No problem for Bayesians." Psychonomic bulletin & review 21, no. 2 (2014): 301-308.