# Power Analysis

Power is the probability of rejecting the null hypothesis when it is in fact false.

1. Determine the power of the test:
- probability of correctly rejecting the null hypothesis
- probability of not making a type II error (1 - beta)
- Beta: probability of a type II error
- it is common to pick 80% as the power of an A/B test

This means we accept failing to detect a true treatment effect 20% of the time (that is, failing to reject the null when an effect exists).
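
To make the 80% figure concrete, here is a minimal sketch of the approximate power of a two-sided two-proportion z-test under the normal approximation. The function name `approx_power`, the example rates and the sample sizes are illustrative assumptions, and the standard library's `statistics.NormalDist` is used so the snippet has no dependencies.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_con, p_exp, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Unpooled standard error of the difference under the alternative
    se = sqrt(p_con * (1 - p_con) / n_per_group + p_exp * (1 - p_exp) / n_per_group)
    shift = abs(p_exp - p_con) / se
    # Probability that the test statistic lands in either rejection region
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Hypothetical example: 10% baseline rate, 12% rate in the experiment group
power_small = approx_power(0.10, 0.12, n_per_group=1000)
power_large = approx_power(0.10, 0.12, n_per_group=4000)
print(f"power with 1,000 users per group: {power_small:.2f}")
print(f"power with 4,000 users per group: {power_large:.2f}")
```

For a fixed effect, power rises with the sample size, which is why the sample size is chosen to reach the target power before the test starts.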

# Significance Level

2. Determine the significance level of the test:
- probability of rejecting the null when the null is true
- detecting statistical significance when there is no real effect
- probability of making a type I error (alpha)
- it is common to pick 5% as the significance level of the test

The significance level is the probability of making a false discovery, that is, the false positive rate (saying there is an effect when there isn't). Using 5% means that, when there is no real difference, we accept a 5% risk of saying there is one. This corresponds to a 95% confidence level for any effect found.
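
One way to see what the false positive rate means is to simulate many A/A tests, where both groups share the same true rate, and count how often the test still reports significance. This is a rough sketch: the helper `two_proportion_p_value`, the seed, the rate and the sizes are all illustrative assumptions, and it uses only the standard library.

```python
import random
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(x_con, x_exp, n):
    """Two-sided p-value of a pooled two-proportion z-test with equal group sizes."""
    p_pooled = (x_con + x_exp) / (2 * n)
    se = sqrt(p_pooled * (1 - p_pooled) * (2 / n))
    if se == 0:
        return 1.0  # no variation at all, so no evidence against the null
    z = (x_exp / n - x_con / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(42)  # fixed seed so the run is repeatable
alpha, true_rate, n, n_sims = 0.05, 0.10, 500, 1000

false_positives = 0
for _ in range(n_sims):
    # Both groups share the same true rate, so the null hypothesis is true
    x_con = sum(rng.random() < true_rate for _ in range(n))
    x_exp = sum(rng.random() < true_rate for _ in range(n))
    if two_proportion_p_value(x_con, x_exp, n) < alpha:
        false_positives += 1

false_positive_rate = false_positives / n_sims
print(f"Share of A/A tests flagged as significant: {false_positive_rate:.3f}")
```

The printed share should sit close to alpha, since every rejection in this setup is a type I error by construction.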

You can vary the significance level. If running the A/B test itself carries high engineering costs, the business might choose a higher alpha so that it is easier to detect a treatment effect. However, if the cost of implementing the proposed change is high, we want a small alpha to be very sure that rejecting the null reflects a real change (making it harder to reject the null).

#  Minimum Detectable Effect

3. Determine the minimum detectable effect:

- What is the smallest effect that makes the investment worthwhile for the business, so that the feature is worth launching into production? This is practical (substantive) significance, as opposed to statistical significance
- a proxy for the smallest effect that would matter in practice
- No common level here - this depends on the business ask - usually set by product / 

What is the % increase we want to see in the product to make this new release or new product / change worth it to the business?
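
Because the business ask is often phrased as a relative lift while sample size formulas take an absolute difference, a small conversion helps. The sketch below assumes a hypothetical 5% baseline and a 20% relative lift; `absolute_mde` is an illustrative helper name.

```python
def absolute_mde(baseline_rate, relative_lift):
    """Convert a relative lift (0.20 means +20%) into an absolute change in the rate."""
    return baseline_rate * relative_lift

baseline_rate = 0.05   # hypothetical current conversion rate
relative_lift = 0.20   # hypothetical smallest lift worth launching for
delta = absolute_mde(baseline_rate, relative_lift)
target_rate = baseline_rate + delta
print(f"absolute MDE: {delta:.3f}, target rate: {target_rate:.3f}")
```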

# Power Analysis

1. Beta - Probability of type II error
2. (1-Beta) - Power of the test
3. Alpha - Probability of type I error, significance level
4. Delta - the minimum desired effect


# Calculating the minimum sample size

To make sure our results are repeatable, robust, and generalisable to the entire population, we need to avoid p-hacking, so that any statistical significance is real and the results are not biased.

So we want to collect enough observations and run the test for a minimum, predetermined period of time.

Therefore, before running the test we need to determine the sample size of the control and experiment groups, as well as the test duration.

This is calculated from the chosen:
- power of the test
- significance level
- minimum desired effect
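
For a continuous metric with equal group sizes and a common standard deviation, these three inputs combine in the usual normal-approximation formula. This is a sketch under those assumptions: $\sigma$ has to be estimated (for example from historical data), $\delta$ is the minimum desired effect, and $z_q$ denotes the $q$-quantile of the standard normal distribution.

```latex
n_{\text{per group}} = \frac{2\,\sigma^{2}\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{\delta^{2}}
```

A smaller alpha, a higher power or a smaller delta all increase the required sample size, and delta has the strongest influence because it enters squared.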

There are two possibilities here:
1. The primary metric is a binary variable
2. The primary metric is a proportion or an average

An example of case 1 is conversion vs no conversion, click vs no click, etc.

An example of case 2 is mean click-through rate, mean completion rate, etc.

For means, the hypotheses are:
H0 (null hypothesis): mean(con) = mean(exp)
H1 (alternative hypothesis): mean(con) != mean(exp)

This relies on the central limit theorem: with enough observations, the sample means of both groups approximately follow a normal distribution.
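
The central limit theorem claim can be checked with a quick simulation: even for heavily skewed data, the spread of the sample means matches sigma / sqrt(n). The distribution, seed and sizes below are illustrative assumptions.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)                    # fixed seed for repeatability
true_mean, n, n_samples = 1.0, 100, 2000  # illustrative values

# Exponential data is heavily skewed; with rate 1 / true_mean its mean and
# standard deviation are both equal to true_mean.
sample_means = [
    sum(rng.expovariate(1 / true_mean) for _ in range(n)) / n
    for _ in range(n_samples)
]

observed_se = stdev(sample_means)
expected_se = true_mean / n ** 0.5        # sigma / sqrt(n), as the CLT predicts
print(f"mean of sample means: {mean(sample_means):.3f}")
print(f"observed SE: {observed_se:.3f}, CLT prediction: {expected_se:.3f}")
```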



https://clincalc.com/stats/samplesize.aspx

Plug in your values for the means, alpha and power, and choose whether the outcome is continuous or dichotomous.

# A/B test Duration

The duration needs to be calculated before the experiment is run. Otherwise you risk stopping the test as soon as you detect statistical significance, which is known as p-hacking and is not what we want.

Duration = Minimum sample size / # visitors per day
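
A minimal sketch of this calculation follows. It assumes the minimum sample size is given per group, that there are two groups, and that every daily visitor enters the experiment; the function name and the numbers are hypothetical.

```python
from math import ceil

def estimate_duration_days(n_per_group, daily_visitors, n_groups=2):
    """Days needed to collect the minimum sample across all groups."""
    if daily_visitors <= 0:
        raise ValueError("daily_visitors must be positive")
    total_needed = n_per_group * n_groups
    # Round up: a partial day still has to be run in full
    return ceil(total_needed / daily_visitors)

days = estimate_duration_days(n_per_group=1565, daily_visitors=400)
print(f"Run the test for at least {days} days")
```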

Dangers:
- Test duration too short -> novelty effect: users tend to react positively to new changes / new products, and this reaction wears off over time
- Test duration too long -> maturation effects

In [None]:
alpha = 0.05  # probability of a type I error: rejecting the null hypothesis when it is true; we accept making this mistake 5% of the time
delta = 0.1   # minimum desired effect
power = 0.8   # 1 - beta, where beta is the probability of a type II error (false negative rate: failing to reject the null when it is false)




# Test Analysis

In [13]:
!pip install numpy
!pip install pandas
!pip install scipy

import numpy as np
import pandas as pd
from scipy.stats import norm

N_exp = 10000
N_con = 10000

click_exp = pd.Series(np.random.binomial(1,0.5, size = N_exp))
click_con = pd.Series(np.random.binomial(1,0.2, size = N_con))

exp_id = pd.Series(np.repeat("exp", N_exp))
con_id = pd.Series(np.repeat("con", N_con))

df_exp = pd.concat(([click_exp, exp_id]), axis = 1)
df_con = pd.concat(([click_con, con_id]), axis = 1)

df_exp.columns = ["click", "group"]
df_con.columns = ["click", "group"]

df_ab_test = pd.concat([df_exp, df_con], axis = 0).reset_index(drop=True)

X_con = df_ab_test.groupby("group")["click"].sum().loc["con"]
X_exp = df_ab_test.groupby("group")["click"].sum().loc["exp"]

p_con_hat = X_con / N_con
p_exp_hat = X_exp / N_exp
print("Click probability in control group: ", p_con_hat)
print("Click probability in Experimental group: ", p_exp_hat)

p_pooled_hat = (X_con+X_exp)/(N_con+N_exp)
print("p_pooled hat is", p_pooled_hat)

pooled_variance = p_pooled_hat * (1-p_pooled_hat) * (1/N_con + 1/N_exp)
print("p pooled variance is:", pooled_variance)

SE = np.sqrt(pooled_variance)
print("Standard Error is: ", SE)

Test_stat = (p_con_hat - p_exp_hat)/SE
print("Test Statistics for 2-sample Z-test is:", Test_stat)

alpha = 0.05
print("Alpha: significance level is: ", alpha)

# Calculating the rejection region
Z_crit = norm.ppf(1-alpha/2)
print("Z-critical value from the standard normal distribution: ", Z_crit)

p_value = 2 * norm.sf(abs(Test_stat))
print("P-value of the 2-sample Z-test: ", round(p_value, 3))

Click probability in control group:  0.1992
Click probability in Experimental group:  0.4915
p_pooled hat is 0.34535
p pooled variance is: 4.52166755e-05
Standard Error is:  0.006724334576744378
Test Statistics for 2-sample Z-test is: -43.46898517080014
Alpha: significance level is:  0.05
Z-critical value from the standard normal distribution:  1.959963984540054
P-value of the 2-sample Z-test:  0.0
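
The cell above computes `Z_crit` but stops short of the decision. As a sketch, the decision rule and a 95% confidence interval for the difference can be derived from the printed values. The interval uses the unpooled standard error, which is an assumption about how one would report it rather than part of the original analysis.

```python
from math import sqrt
from statistics import NormalDist

# Values copied from the output above
p_con_hat, p_exp_hat = 0.1992, 0.4915
n_con = n_exp = 10000
test_stat = -43.46898517080014
alpha = 0.05

z_crit = NormalDist().inv_cdf(1 - alpha / 2)
# Two-sided test: reject when the statistic falls outside [-z_crit, z_crit]
reject_null = abs(test_stat) > z_crit

# Confidence interval for the difference, using the unpooled standard error
diff = p_exp_hat - p_con_hat
se_unpooled = sqrt(p_con_hat * (1 - p_con_hat) / n_con + p_exp_hat * (1 - p_exp_hat) / n_exp)
ci_lower = diff - z_crit * se_unpooled
ci_upper = diff + z_crit * se_unpooled

print("Reject the null hypothesis:", reject_null)
print(f"95% CI for the difference (exp - con): [{ci_lower:.4f}, {ci_upper:.4f}]")
```

An interval that excludes zero agrees with the rejection of the null, and its lower bound can be compared with the minimum desired effect to judge practical significance.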


In [None]:

# Baseline conversion rate (the current rate at which users take the desired action)
baseline_conversion_rate = 0.05

# Standard deviation of the population. The original value was left blank;
# as an assumption, use the Bernoulli standard deviation at the baseline rate.
variation_of_population = np.sqrt(baseline_conversion_rate * (1 - baseline_conversion_rate))

# Minimum desired effect: the smallest improvement worth the time and effort to detect.
# A change from 5% to 5.01% probably does not matter; 5% to 6% is a 20% relative increase.
# Here we want to use the absolute increase instead
delta = 0.1

# Significance level (alpha): the probability of rejecting the null when there is in fact no change
alpha = 0.05

# Statistical power (1 - beta): the probability that the test finds a real difference if there is one
power = 0.8

# Critical values for the significance level and the power
Z_crit_sig = norm.ppf(1 - alpha / 2)
Z_crit_power = norm.ppf(power)
# Factor of 2 because two equally sized groups with a common variance are compared
sample_size_per_variation = 2 * (variation_of_population ** 2) * (Z_crit_sig + Z_crit_power) ** 2 / delta ** 2


In [21]:
!pip install statsmodels
import statsmodels.stats.api as sms
from math import ceil

baseline_rate = 0.45
mde = 0.05
alpha = 0.05
power = 0.80

effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate+mde)

required_n = sms.NormalIndPower().solve_power(
    effect_size = effect_size,
    power=power,
    alpha=alpha,
    ratio=1
)

required_n = ceil(required_n)
print(f"You need {required_n} users per group")
print(f"total traffic needed: {required_n *2 }")

You need 1565 users per group
total traffic needed: 3130


Known sigma (standard deviation of the population)

In [23]:
from scipy import stats as st
# Mean of group 1, μ_1
mu_1 = 5
# Mean of group 2, μ_2
mu_2 = 10
# Sampling ratio, κ = n_1 / n_2
kappa = 1
# Population standard deviation, σ
sigma = 10
# Type I error rate, α
alpha = 0.05
# Type II error rate, β
beta = 0.20

# Sample size estimation
n_2 = (1 + 1 / kappa) * (
    sigma * (
        st.norm.ppf(1 - alpha / 2) + st.norm.ppf(1 - beta)
    ) / (mu_1 - mu_2)
)**2
n_2 = np.ceil(n_2)

print(f'Sample size = {n_2:n} per group')

Sample size = 63 per group


Unknown sigma (standard deviation)

In [25]:
import numpy as np
from scipy import stats as st
import math

def calculate_sample_size_means(mu_1, mu_2, sigma, alpha=0.05, beta=0.20, kappa=1):
    '''
    Calculate sample size for comparing two means (continuous data)
    '''
    # 1. Get the critical z-values
    # For 95% confidence (alpha=0.05), we need the 97.5th percentile (two tailed)
    z_alpha = st.norm.ppf(1 - alpha / 2)

    # For 80% power (beta=0.20), we need the 80th percentile (one tailed)
    z_beta = st.norm.ppf(1 - beta)

    # 2. The numerator: (Sigma * (Z_alpha + Z_beta))
    # This represents the total spread we need to cover to distinguish the signal
    numerator = sigma * (z_alpha + z_beta)

    # 3. The denominator: the minimal detectable difference
    denominator = mu_1 - mu_2

    # 4. The formula:
    n_2 = (1 + 1 / kappa) * (numerator / denominator) ** 2

    # 5. Round up to the ceiling because you can't have half a user
    return math.ceil(n_2)

# The output below assumes the inputs 5, 10 and 10, as in the known-sigma cell
mu_1 = float(input("What is the mean of group 1: "))
mu_2 = float(input("What is the mean of group 2: "))
sigma = float(input("What is the standard deviation: "))

required_n = calculate_sample_size_means(mu_1=mu_1, mu_2=mu_2, sigma=sigma)

print(f"Required sample size per group (means): {required_n}")

Required sample size per group (means): 63


# Proportions / Conversion rates

In [27]:
def calculate_sample_size_proportions(baseline_rate, mde, alpha=0.05, beta=0.20, kappa=1):
    '''
    Calculate the sample size for comparing two proportions (conversion rates)
    Calculated from scratch using the pooled variance method'''

    # 1. Define p1 and p2
    p1 = baseline_rate
    p2 = baseline_rate + mde

    # 2. Pooled Probability (average of the two rates)
    p_avg = (p1 + p2) / 2

    # 3. Get critical z-values
    z_alpha = st.norm.ppf(1 - alpha / 2)
    z_beta = st.norm.ppf(1 - beta)

    # 4. Calculate the variance components
    # The standard error depends on the variance of both groups
    # Variance of a single group = p * (1-p)
    # We sum the variances of the two groups (hence the factor of 2 for equal groups)
    # A common approximation is 2 * p_avg * (1 - p_avg)
    pooled_variance = 2 * p_avg * (1 - p_avg)

    # 5. The formula:
    numerator = pooled_variance * (z_alpha + z_beta)**2
    denominator = (p2 - p1) ** 2

    n = numerator / denominator

    return math.ceil(n)

baseline_rate = float(input("please enter the baseline rate of conversion: "))
mde = float(input("please enter the minimal desired absolute change / effect: "))
n_proportions = calculate_sample_size_proportions(baseline_rate=baseline_rate, mde=mde)
print(f"The required sample size per group (proportions): {n_proportions}")




The required sample size per group (proportions): 321
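
To see how strongly the required sample size depends on the minimum desired effect, the same pooled-variance approximation can be tabulated for a few values. This is a self-contained sketch: the helper `sample_size_proportions` mirrors the formula above but uses the standard library's `NormalDist` in place of scipy, and the 10% baseline is a hypothetical value.

```python
from math import ceil
from statistics import NormalDist

def sample_size_proportions(baseline_rate, mde, alpha=0.05, beta=0.20):
    """Per-group sample size using the pooled-variance approximation."""
    p_avg = baseline_rate + mde / 2          # average of p1 and p2 = p1 + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    return ceil(2 * p_avg * (1 - p_avg) * (z_alpha + z_beta) ** 2 / mde ** 2)

baseline_rate = 0.10  # hypothetical baseline conversion rate
sizes = {mde: sample_size_proportions(baseline_rate, mde) for mde in (0.04, 0.02, 0.01)}
for mde, n in sizes.items():
    print(f"MDE {mde:.2f}: {n} users per group")
```

Halving the minimum desired effect roughly quadruples the required sample size, because the effect enters the denominator squared.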
