
# PARLA

## Problem
This task is similar to the previous one, except now the decision is not made independently for each experiment.

For example, suppose we have **5 text versions for a marketing email campaign**,
and we want to test **which one performs best, and whether any of them performs better at all**.

The algorithm is as follows:
1. Create non-overlapping control and experimental groups for each of the 5 variants.
2. Run 5 experiments in parallel.
3. Use **Holm’s method** to determine in which experiments there were statistically significant differences.
4. If no significant differences are found, conclude that there is no effect, and reject all variants.
5. If significant differences are found, select the variant with the **smallest p-value** among those with significant effects — this one will be used going forward.
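
Steps 3 to 5 can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's implementation (which relies on `statsmodels`): the helper names `holm_adjust` and `select_variant` and the example p-values are hypothetical, and the strict `< alpha` comparison is an assumption that mirrors the comparison used in the notebook code.

```python
def holm_adjust(pvalues):
    """Return Holm-adjusted p-values in the original order (illustrative helper)."""
    m = len(pvalues)
    # indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # the k-th smallest p-value (k = rank + 1) is multiplied by (m - k + 1)
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # adjusted p-values must not decrease along the sorted order
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def select_variant(pvalues, alpha=0.05):
    """Return the index of the winning variant, or None if nothing is significant."""
    adjusted = holm_adjust(pvalues)
    significant = [i for i, p in enumerate(adjusted) if p < alpha]
    if not significant:
        return None  # step 4: no effect, reject all variants
    # step 5: among significant variants, pick the smallest raw p-value
    return min(significant, key=lambda i: pvalues[i])


# made-up p-values for 5 parallel experiments
example_pvalues = [0.30, 0.004, 0.02, 0.65, 0.011]
print(holm_adjust(example_pvalues))
print(select_variant(example_pvalues))
```

In this made-up example the variants at indices 1 and 4 stay significant after the correction, and index 1 is selected because its raw p-value is the smallest.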

We consider a **Type I error** to have occurred if significant differences are found when in fact there were none in any variant.

We consider a **Type II error** to have occurred if:
* No significant differences are found when in fact there were some; or
* The variant selected for further use actually had no effect, even though other variants did have an effect.
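
The two error definitions above can be written as a small decision function. This is only a sketch: the function name `classify_run` and its arguments are hypothetical, and it assumes the procedure returns either the index of the selected variant or `None` when nothing was significant.

```python
def classify_run(selected, true_effects):
    """
    Classify the outcome of one set of parallel experiments.

    :param selected: index of the variant chosen by the procedure, or None if nothing was significant
    :param true_effects: list of booleans, True where the variant really has an effect
    :return: 'type I error', 'type II error' or 'correct'
    """
    if not any(true_effects):
        # no variant has an effect: any significant finding is a type I error
        return 'type I error' if selected is not None else 'correct'
    if selected is None:
        # effects exist, but none was found
        return 'type II error'
    if not true_effects[selected]:
        # a variant without an effect was selected although others do have one
        return 'type II error'
    return 'correct'


print(classify_run(None, [False] * 5))
print(classify_run(2, [False] * 5))
print(classify_run(3, [True, False, False, False, False]))
```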

**Experiment parameters:**
* We are testing the hypothesis of equal means;
* Significance level — 0.05;
* Acceptable probability of Type II error — 0.1;
* Expected effect — a 3% increase in values;
* Method for adding the effect in synthetic A/B tests — multiplication by a constant.

**Note:** When evaluating the probability of a Type II error,
consider the worst-case scenario where only **one** of the experiments has an effect.
The more experiments that have an effect, the lower the probability of a Type II error.
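
As a rough cross-check of the worst case, the Type II error can also be approximated analytically. The sketch below rests on several simplifying assumptions that are not part of the task: a normal approximation instead of the t-test, the same variance in both groups, the single real effect being tested at the strictest Holm threshold `alpha / n_experiments`, and a total of `10 ** 4` observations (the value used in the notebook code). It also ignores the chance that an experiment without an effect produces the smallest p-value, so it is expected to be somewhat optimistic compared with the simulation. The function name `approx_type_2_error` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_type_2_error(n_experiments, population_size=10 ** 4,
                        mu=100.0, sigma=10.0, effect=0.03, alpha=0.05):
    """Approximate worst-case type II error when exactly one experiment has an effect."""
    norm = NormalDist()
    # observations per group when the population is split evenly
    sample_size = population_size // (2 * n_experiments)
    # standard error of the difference in means between two groups
    standard_error = sigma * sqrt(2 / sample_size)
    # expected absolute effect expressed in standard errors
    z_effect = mu * effect / standard_error
    # two-sided critical value at the strictest Holm threshold alpha / m
    z_crit = norm.inv_cdf(1 - alpha / (2 * n_experiments))
    power = norm.cdf(z_effect - z_crit)
    return 1 - power


for m in range(10, 15):
    print(m, round(approx_type_2_error(m), 3))
```

The approximation grows with the number of parallel experiments for two reasons: each group gets fewer observations, and the Holm threshold becomes stricter.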

Assume that the distribution of the measured values follows a **normal distribution** with a **mean of 100** and a **standard deviation of 10**.

As your answer, enter the **maximum number of experiments** that can be run with the parameters described above.

## Action
- calculated the minimal required sample size per group ( 234 )
- calculated the upper bound on the number of parallel experiments ( 21 ) for a population of 10 000 observations
- for each number of parallel experiments from 10 to 14, calculated:
    - empirically assessed probability of type-I error:
        - used AA-tests to assess the type-I error
        - at least one false positive among the parallel experiments counts as a type-I error
        - p-values of the parallel experiments were adjusted using Holm's correction
    - empirically assessed probability of type-II error:
        - used AB-tests to assess the type-II error
        - used the worst-case scenario, in which only one of the parallel tests has an effect
        - p-values of the parallel experiments were adjusted using Holm's correction
- calculated confidence intervals for the error probabilities using the asymptotic normal approximation
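
The first two numbers can be reproduced by hand. The worked calculation below uses the standard sample size formula for comparing two means with equal variances, which is the formula the notebook code implements. The symbols $\varepsilon$ (absolute effect, $0.03 \cdot 100 = 3$), $N$ (total number of observations, $10^4$ in the notebook code) and $k_{\max}$ are notation introduced here for illustration.

$$
n = \left\lceil \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \cdot 2\sigma^2}{\varepsilon^2} \right\rceil
  = \left\lceil \frac{(1.960 + 1.282)^2 \cdot 2 \cdot 10^2}{3^2} \right\rceil
  = \lceil 233.5 \rceil = 234
$$

$$
k_{\max} = \left\lfloor \frac{N}{2n} \right\rfloor
         = \left\lfloor \frac{10^4}{2 \cdot 234} \right\rfloor
         = \lfloor 21.4 \rfloor = 21
$$

This bound ignores the multiple-testing correction, which is why the error probabilities still have to be checked for each candidate number of experiments.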

## Result
- For each tested number of parallel experiments, successfully calculated:
    - empirical probability of type I error
    - confidence interval for type I error
    - empirical probability of type II error
    - confidence interval for type II error
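- Found the maximal number of parallel experiments ( 12 ) that can be run while using Holm's correction for p-values: with 13 experiments the empirical probability of type II error ( 0.129 ) exceeds the acceptable 0.1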
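
The confidence intervals for the error probabilities use the normal (Wald) approximation for a proportion, $\hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$. The sketch below recomputes one interval with the standard library only. The function name `wald_interval` is hypothetical, and the input of 46 errors in 1000 repetitions is taken from the reported type I error of 0.0460 for 10 parallel experiments.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(n_errors, n_trials, alpha=0.05):
    """Normal-approximation confidence interval for a proportion, clipped to [0, 1]."""
    p_hat = n_errors / n_trials
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n_trials)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


low, high = wald_interval(46, 1000)
print(f'[{low:0.4f}, {high:0.4f}]')  # prints [0.0330, 0.0590]
```

With 1000 repetitions the half-width is about 0.013 to 0.02 for the proportions seen here, so point estimates close to 0.1 cannot be separated from the threshold with confidence unless the number of repetitions is increased.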

## Learning
- I revised relevant Python, NumPy, SciPy, and statsmodels functionality
- I learned how to calculate:
    - empirical probability of type I error
    - empirical probability of type II error
    - confidence interval for type I and type II errors
- I learned how to apply Holm's correction to p-values

## Application
- I can apply relevant Python, NumPy, SciPy, and statsmodels functionality to similar data-related problems
- Using real-world data, I can:
    - calculate empirical probability of type I and type II errors
    - calculate confidence intervals for type I and type II errors
    - apply Holm's correction to p-values


In [2]:

import numpy as np
import scipy as sp
import scipy.stats  # load the stats submodule explicitly so that sp.stats is always available
import statsmodels.api as sm


In [3]:

# experiment parameters:
alpha = 0.05  # probability of type-I error
beta = 0.1  # probability of type-II error
confidence_level = 1 - alpha  # probability of correctly NOT REJECTING null hypothesis
power = 1 - beta  # probability of correctly REJECTING null hypothesis
effect = 0.03  # expected effect size
mu = 100  # mean of normal distribution in synthetic tests
sigma = 10  # standard deviation of normal distribution in synthetic tests
population_size = 10 ** 4  # total number of observations available for all parallel experiments
n_repetitions = 10 ** 3  # number of repetitions of the simulated set of parallel experiments

def get_minimal_sample_size(effect, std, alpha, beta):
    """
    Get minimal sample size for each of the groups A and B
    :param effect: expected absolute effect (difference in means)
    :param std: standard deviation
    :param alpha: probability of type-I error
    :param beta: probability of type-II error
    :return: minimal sample size per group
    """
    ppf_alpha = sp.stats.norm.ppf(1 - alpha / 2, loc=0, scale=1)
    ppf_beta = sp.stats.norm.ppf(1 - beta, loc=0, scale=1)
    var = 2 * std ** 2
    sample_size = np.ceil((ppf_alpha + ppf_beta) ** 2 * var / (effect ** 2))
    return sample_size

# calculate minimal required sample size
sample_size = get_minimal_sample_size(mu * effect, sigma, alpha, beta)
print(f'sample_size = {sample_size}')

# calculate maximal amount of parallel experiments
max_experiments = population_size / (sample_size * 2)
print(f'max_experiments = {max_experiments:0.1f}')
max_experiments = int(max_experiments)

# estimate:
# - empirical probability of type-I error
# - empirical probability of type-II error
# - confidence intervals for errors
for n_experiments in range(10, 15):
    aa_type_1_errors = []
    ab_type_2_errors = []
    sample_size = int(population_size / (n_experiments * 2))

    for _ in range(n_repetitions):
        # AA-TESTS ################################################################################
        # we use AA-tests to empirically assess type I error
        # at least one false positive in all parallel experiments is considered type-I error
        parallel_pvalues = []
        for _ in range(n_experiments):
            # generate groups A and B from the same normal distribution (no effect is added in AA-tests)
            a = np.random.normal(loc=mu, scale=sigma, size=sample_size)
            b = np.random.normal(loc=mu, scale=sigma, size=sample_size)
            parallel_pvalues.append(sp.stats.ttest_ind(a, b).pvalue)

        # update p-values using Holm-correction
        parallel_pvalues = sm.stats.multipletests(parallel_pvalues, alpha=alpha, method='holm')[1]

        # check if there is a type-I error
        parallel_errors = parallel_pvalues < alpha
        aa_type_1_errors.append(np.sum(parallel_errors) > 0)

        # AB-TESTS ################################################################################
        # we use AB-tests to empirically assess type II error
        # we use worst-case scenario:
        #  - only one of parallel tests has an effect
        #  - because the more variants have an effect, the lower the probability of type-II error
        parallel_effects = []
        parallel_pvalues = []
        for i in range(n_experiments):
            # generate groups A and B from normal distribution
            a = np.random.normal(loc=mu, scale=sigma, size=sample_size)
            b = np.random.normal(loc=mu, scale=sigma, size=sample_size)
            b_effect = b * (1 + effect)

            # only the first experiment (i == 0) has a real effect, the rest are AA-tests
            if i == 0:
                pvalue = sp.stats.ttest_ind(a, b_effect).pvalue
            else:
                pvalue = sp.stats.ttest_ind(a, b).pvalue
            parallel_pvalues.append(pvalue)

        # adjust p-values using Holm's correction (keep the raw p-values for ranking)
        adjusted_pvalues = sm.stats.multipletests(parallel_pvalues, alpha=alpha, method='holm')[1]
        parallel_effects = adjusted_pvalues < alpha

        # no effect found => type-II error (experiment 0 is known to have a real effect)
        if np.sum(parallel_effects) == 0:
            ab_type_2_errors.append(True)

        # effects found => type-II error unless the smallest p-value belongs to experiment 0
        else:
            # rank by raw p-values: Holm-adjusted values can tie, and np.argmin would then
            # return the lowest index, which favours experiment 0
            min_index = np.argmin(parallel_pvalues)
            ab_type_2_errors.append(bool(min_index != 0))

    # estimate empirical probabilities of errors:
    # each repetition has a binary outcome (True = error, False = no error), i.e. it is a Bernoulli random variable,
    # so the number of errors follows a Binomial distribution
    # and, by the central limit theorem, the proportion 'n_errors / n' (the mean) is approximately normally distributed
    p_type_1_error = np.mean(aa_type_1_errors)
    p_type_2_error = np.mean(ab_type_2_errors)

    # calculate confidence intervals, using asymptotic normal approximation
    ci_type_1_error = sm.stats.proportion_confint(
        np.sum(aa_type_1_errors),
        len(aa_type_1_errors),
        alpha=alpha,
        method='normal'
    )
    ci_type_2_error = sm.stats.proportion_confint(
        np.sum(ab_type_2_errors),
        len(ab_type_2_errors),
        alpha=alpha,
        method='normal'
    )

    # print
    print('')
    print(f'n_experiments = {n_experiments}')
    print(f'sample_size = {sample_size}')
    print(f'empirical probability of type I error = {p_type_1_error:0.4f}')
    print(f'confidence interval = [{ci_type_1_error[0]:0.4f}, {ci_type_1_error[1]:0.4f}]')
    print(f'empirical probability of type II error = {p_type_2_error:0.4f}')
    print(f'confidence interval = [{ci_type_2_error[0]:0.4f}, {ci_type_2_error[1]:0.4f}]')


sample_size = 234.0
max_experiments = 21.4

n_experiments = 10
sample_size = 500
empirical probability of type I error = 0.0460
confidence interval = [0.0330, 0.0590]
empirical probability of type II error = 0.0280
confidence interval = [0.0178, 0.0382]

n_experiments = 11
sample_size = 454
empirical probability of type I error = 0.0530
confidence interval = [0.0391, 0.0669]
empirical probability of type II error = 0.0510
confidence interval = [0.0374, 0.0646]

n_experiments = 12
sample_size = 416
empirical probability of type I error = 0.0410
confidence interval = [0.0287, 0.0533]
empirical probability of type II error = 0.0830
confidence interval = [0.0659, 0.1001]

n_experiments = 13
sample_size = 384
empirical probability of type I error = 0.0390
confidence interval = [0.0270, 0.0510]
empirical probability of type II error = 0.1290
confidence interval = [0.1082, 0.1498]

n_experiments = 14
sample_size = 357
empirical probability of type I error = 0.0390
confidence interval = [0.027