# A Practical Guide To A/B Tests in Python

source: https://towardsdatascience.com/a-practical-guide-to-a-b-tests-in-python-66666f5c3b02

Best practices that data scientists should follow before, during, and after experiments

A/B tests are effective and rely only on mild assumptions. The most important one is the Stable Unit Treatment Value Assumption (SUTVA), which states that the treatment and control units don’t interact with each other; otherwise, the interference leads to biased estimates.

An A/B test can be roughly divided into three stages:
- Stage 1 Pre-Test: run a power analysis to decide the sample size.
- Stage 2 At-Test: keep an eye on the key metrics. Be aware of sudden drops.
- Stage 3 Post-Test: analyze data and reach conclusions.

Business Scenario:
TikTok develops a new animal filter and wants to assess its effects on users. They are interested in two key metrics:
1. How does the filter affect user engagement (e.g., time spent on the app)?
2. How does the filter affect user retention (e.g., whether the user is still active)?

The company decides to hire a small group of very talented data scientists, and you are the team leader in charge of model selection and research design. After consulting with multiple stakeholders, you propose an A/B test and suggest the following best practices.



**Stage 1 Pre-Test: Goal, Metrics, and Sample Size**
- What is the goal of the test?
- How to measure success?
- How long should we run it?

As a first step, we want to clarify the goal of the test and relay it back to the team. As mentioned, the study aims to measure user engagement and retention after rolling out the filter.

Next, we move to the metrics and decide how to measure success. Because TikTok is a social networking app, we use the time spent on the app to measure user engagement, and two boolean variables, metric 1 and metric 2 (described below), to indicate whether the user is active after 1 day and 7 days, respectively.



The remaining question is: how long should we run the test? A common but flawed strategy is to stop the experiment as soon as we observe a statistically significant result (e.g., a small p-value). This practice is a form of p-hacking, and established data scientists strongly oppose it because it leads to biased results (Kohavi et al. 2020). On a related note, Airbnb encountered the same problem, where p-hacking led to false positives (Experiments at Airbnb).
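To make the cost of early stopping concrete, here is a minimal simulation sketch. It runs A/A experiments (no true effect) and compares a single analysis at the planned sample size with a "stop at the first significant peek" rule. The number of simulations, the peek schedule, and the sample size are illustrative assumptions, not values from the case study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_sims = 500        # hypothetical number of simulated A/A experiments
n_per_arm = 1000    # hypothetical planned sample size per arm
peeks = np.linspace(100, n_per_arm, 10, dtype=int)  # ten interim looks
alpha = 0.05

fixed_hits = 0      # significant at the single, final analysis
peeking_hits = 0    # significant at ANY look (stop at first p < alpha)

for _ in range(n_sims):
    # Both arms share one distribution, so every "effect" is pure noise.
    a = rng.normal(30, 10, n_per_arm)
    b = rng.normal(30, 10, n_per_arm)
    p_values = [stats.ttest_ind(a[:n], b[:n]).pvalue for n in peeks]
    fixed_hits += int(p_values[-1] < alpha)
    peeking_hits += int(min(p_values) < alpha)

fixed_rate = fixed_hits / n_sims
peeking_rate = peeking_hits / n_sims
print(f"False positive rate, single final analysis: {fixed_rate:.3f}")
print(f"False positive rate, stop at first p < {alpha}: {peeking_rate:.3f}")
```

The single final analysis should stay near the nominal 5%, while the peeking rule flags noticeably more false positives, because each extra look is another chance for noise to cross the threshold.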

**Instead, we should run a power analysis and decide a minimum sample size, according to three parameters:**
- The significance level, also denoted as alpha or α: the probability of rejecting a null hypothesis when it is true. By rejecting a true null hypothesis, we falsely claim there is an effect when there is no actual effect. Thus, it is also called the false positive rate.


- Statistical Power: the probability of correctly identifying the effect when there is indeed an effect. Power = 1 − β, where β is the probability of a Type II error (a false negative).

- The Minimum Detectable Effect, MDE: to find a widely agreed upon MDE, our data team sits down with the PM and decides the smallest acceptable difference is 0.1. In other words, the difference between the two groups scaled by the standard deviation needs to be at least 0.1. Otherwise, the release won’t compensate for the business costs incurred (e.g., engineers’ time, product lifecycle, etc.). For example, it won’t make any sense to roll out a new design if it only brings in a 0.000001% lift, even if it is statistically significant.


Here is the relationship between these three parameters and the required sample size:

- Significance Level decreases → Larger Sample Size
- Statistical Power increases → Larger Sample Size
- The Minimum Detectable Effect decreases → Larger Sample Size
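These three directions can be checked numerically. The sketch below uses `TTestIndPower` from statsmodels (the same class this guide uses for its power analysis) and changes one parameter at a time from a baseline. The helper name `n_per_group` and the alternative values (α = 0.01, power = 0.9, MDE = 0.05) are illustrative assumptions.

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

def n_per_group(effect_size=0.1, alpha=0.05, power=0.8):
    """Minimum users per variant for a two-sided, two-sample t-test."""
    n = analysis.solve_power(effect_size=effect_size, alpha=alpha,
                             power=power, ratio=1.0)
    return math.ceil(n)

baseline = n_per_group()                     # alpha=0.05, power=0.8, MDE=0.1
stricter_alpha = n_per_group(alpha=0.01)     # lower significance level
higher_power = n_per_group(power=0.9)        # higher power
smaller_mde = n_per_group(effect_size=0.05)  # smaller detectable effect

print(f"baseline:       {baseline}")
print(f"alpha = 0.01:   {stricter_alpha}")
print(f"power = 0.90:   {higher_power}")
print(f"MDE   = 0.05:   {smaller_mde}")
```

Each change should require more users than the baseline. Halving the MDE is the most expensive, because the sample size scales with the inverse square of the effect size.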

Typically, we set the significance level at 5% (or alpha = 5%) and statistical power at 80%. Thus, the sample size is calculated by the following formula:

<img src="1.webp" alt="Alternative text" />

where:
- σ²: sample variance.
- 𝛿: the difference between the treatment and control groups (in percentage).
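The formula image is not reproduced in this text. As a point of reference, and assuming the image shows the standard two-sample approximation (an assumption, not confirmed by the source), the required size per variant can be written as:

$$
n \approx \frac{2\sigma^2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2} \approx \frac{16\,\sigma^2}{\delta^2}
\quad \text{for } \alpha = 0.05,\ 1-\beta = 0.8,
$$

since $z_{0.975} \approx 1.96$, $z_{0.80} \approx 0.84$, and $2\,(1.96 + 0.84)^2 \approx 15.7$. In this form, δ is expressed in the same units as σ. With δ/σ = 0.1, the rule of thumb gives n ≈ 1600 per variant, close to the 1571 reported by the power analysis in this guide.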

To obtain the sample variance (σ²), we typically run an A/A test, which follows the same design as an A/B test except that both groups receive the same treatment.

Split the users into two groups and then assign the same treatment to both.
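As a rough sketch of an A/A test, the example below randomly splits one population into two groups, estimates the pooled variance of the metric, and confirms that the groups do not differ beyond noise. The metric values are simulated stand-ins (hypothetical mean 30, standard deviation 10, and 4,000 users), not real traffic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical metric: minutes on the app for 4,000 users,
# none of whom see the new filter.
minutes = rng.normal(loc=30, scale=10, size=4000)

# Randomly split the users into two groups that get the SAME experience.
in_group_a = rng.permutation(len(minutes)) < len(minutes) // 2
group_a, group_b = minutes[in_group_a], minutes[~in_group_a]

# Pooled sample variance (equal group sizes): the estimate of sigma^2
# that feeds the power analysis.
pooled_var = (group_a.var(ddof=1) + group_b.var(ddof=1)) / 2

# With no real difference, the p-value should usually be large
# (it still falls below 0.05 in about 5% of A/A runs by chance).
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"Pooled variance: {pooled_var:.2f}, A/A p-value: {p_value:.3f}")
```

If A/A tests come out significant far more often than the chosen α, that usually points to a problem in the assignment or logging pipeline rather than a real effect.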

In [4]:
from statsmodels.stats.power import TTestIndPower
# parameters for power analysis
# effect_size has to be positive
effect = 0.1
alpha = 0.05
power = 0.8
# perform power analysis: nobs1=None tells solve_power to solve for the per-group sample size
analysis = TTestIndPower()
result = analysis.solve_power(effect_size=effect, power=power, nobs1=None, ratio=1.0, alpha=alpha)
print('Sample Size: %.3f' % round(result))

Sample Size: 1571.000


We need 1571 users for each variant. How long we should run the test depends on how much traffic the app receives: we divide the daily traffic equally between the two variants and wait until each variant has collected a sufficiently large sample (≥1571).
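The duration then follows from simple arithmetic. In this sketch, `daily_eligible_users` is a hypothetical traffic figure chosen only for illustration.

```python
import math

n_per_variant = 1571         # minimum sample size per variant from the power analysis
n_variants = 2               # control and treatment
daily_eligible_users = 500   # hypothetical number of new users entering the test per day

# Traffic is split equally, so the total requirement is n_per_variant * n_variants.
days_needed = math.ceil(n_per_variant * n_variants / daily_eligible_users)
print(f"Run the test for at least {days_needed} days")
```

Many teams also round the duration up to whole weeks so that every day of the week is represented equally.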

**Best Practices**
- Understand the goal of the experiment and how to measure the success.
- Run an A/A test to estimate the variance of the metric. Check out my latest post on how to run and interpret A/A tests in Python.
- Run a power analysis to obtain the minimum sample size.

We roll out the test and initiate the data collection process. Here, we simulate the Data Generation Process (DGP) and artificially create variables that follow specific distributions. The true parameters are known to us, which comes in handy when comparing the estimated treatment effect to the true effects. In other words, we can evaluate the effectiveness of A/B tests and check to what extent they lead to unbiased results.



There are five variables to be simulated in our case study:
1. userid
2. version
3. minutes of play
4. whether the user is active after 1 day (metric_1)
5. whether the user is active after 7 days (metric_2)

In [6]:
# Variables 1 and 2: userid and version
# We intentionally create 1600 control units and 1749 treated units to signal a potential Sample Ratio Mismatch, SRM.

# variable 1: userid
user_id_control = list(range(1, 1601))  # 1600 control
user_id_treatment = list(range(1601, 3350))  # 1749 treated
# variable 2: version
import numpy as np
# one group label per user, so the column holds a single value per row
control_status = ['control'] * 1600
treatment_status = ['treatment'] * 1749

In [8]:
# Variable 3: minutes of play
# We simulate variable 3 ("minutes of play") as a normal distribution. For the control group, the mean μ is 30 minutes and the standard deviation σ is 10 (variance σ² = 100).
# To recap, the effect size for the MDE is the difference between the two group means divided by the standard deviation: (μ_2 - μ_1)/σ = 0.1. With σ = 10, we obtain μ_2 = 31. The standard deviation is also 10.

# for control group
μ_1 = 30
σ_1 = 10  # standard deviation: the `scale` argument of np.random.normal is σ, not σ²

np.random.seed(123)
minutes_control = np.random.normal(loc = μ_1, scale = σ_1, size = 1600)
# for treatment group, which increases the user engagement by 1 minute on average
# according to the formula (μ_2 - μ_1)/σ = 0.1, we obtain μ_2 = 31
μ_2 = 31
σ_2 = 10
# note: re-seeding with the same value replays the same random draws,
# so the first 1600 treatment values equal the control values plus 1
np.random.seed(123)
minutes_treat = np.random.normal(loc = μ_2, scale = σ_2, size = 1749)



In [9]:
# variable 4: whether the user is active after 1 day, metric_1
# Our simulation shows that the control group has 30% active (True) and 70% inactive (False) users after 1 day (metric_1), while the treatment has 35% active and 65% inactive users, respectively.
Active_status = [True,False]
# control 
day_1_control = np.random.choice(Active_status, 1600, p=[0.3,0.7])
# treatment
day_1_treatment = np.random.choice(Active_status, 1749, p=[0.35,0.65])



In [10]:
# variable 5: whether the user is active after 7 days, metric_2
# The simulated data shows the control group has a 35% active user rate after 7 days, while the treatment has 25%.

# control 
day_7_control = np.random.choice(Active_status, 1600, p=[0.35,0.65])
# treatment
day_7_treatment = np.random.choice(Active_status, 1749, p=[0.25,0.75])

The true data contains a reversed pattern: the treatment performs better in the short term but the control group comes back and stands out after one week.

Let’s check if the A/B test picks up the reversed signal.

In [11]:
import pandas as pd

In [12]:
final_data = pd.DataFrame({'userid':user_id_control+user_id_treatment,
                            'version':control_status+treatment_status,
                            'minutes_play':list(minutes_control)+list(minutes_treat),
                            'day_1':list(day_1_control)+list(day_1_treatment),
                            'day_7':list(day_7_control)+list(day_7_treatment)})

In [13]:
final_data.head()

| | userid | version | minutes_play | day_1 | day_7 |
|---|---|---|---|---|---|
| 0 | 1 | control | 19.143694 | False | True |
| 1 | 2 | control | 39.973454 | True | False |
| 2 | 3 | control | 32.829785 | False | False |
| 3 | 4 | control | 14.937053 | False | True |
| 4 | 5 | control | 24.213997 | False | True |


**Stage 3 Post-Test: Data Analysis**
After collecting enough data, we move to the last stage of the experiment: data analysis. As a first step, it is useful to check how many users fell into each variant.
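A chi-square goodness-of-fit test is one way to turn that check into a formal Sample Ratio Mismatch (SRM) test. The sketch below uses the group sizes from the simulation (1600 control, 1749 treated), assumes the intended split was 50/50, and applies the strict α = 0.001 threshold from this guide's best practices.

```python
from scipy import stats

observed = [1600, 1749]              # users in control and treatment
expected = [sum(observed) / 2] * 2   # counts under the intended 50/50 split

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
srm_threshold = 0.001                # strict threshold used for SRM checks in this guide

print(f"chi-square = {chi2:.3f}, p-value = {p_value:.4f}")
if p_value < srm_threshold:
    print("Possible SRM: investigate the assignment before analyzing metrics.")
else:
    print("No SRM flagged at this threshold.")
```

With these counts the p-value is close to 0.01: it would look suspicious at a conventional 5% level but does not cross the stricter 0.001 threshold, which shows how much the SRM verdict depends on the threshold chosen.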

**A closer look at the 1-day metric shows that 29.7% of control users are active, compared with 35% of treated users.**

Naturally, we are interested in the following questions:

- Is the higher retention rate in the treatment group statistically significant?
- What is its variability?
- If we repeat the process 10,000 times, how often do we observe values at least as extreme?

Bootstrap can answer these questions. It is a resampling strategy that repeatedly samples from the original data with replacement. According to the Central Limit Theorem, the distribution of the resampled means is approximately normal (check my other posts on Bootstrap, in R or Python).
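A minimal bootstrap sketch for the difference in 1-day active rates might look like the following. The two boolean arrays are stand-ins generated with the simulation settings of this guide (30% vs. 35%); in a real analysis they would be the `day_1` values of each group. The percentile interval is one of several possible bootstrap intervals and is used here as an assumption.

```python
import numpy as np

rng = np.random.default_rng(123)

# Stand-in data mirroring the simulated 1-day active rates (30% vs. 35%).
day_1_control = rng.random(1600) < 0.30
day_1_treatment = rng.random(1749) < 0.35

observed_diff = day_1_treatment.mean() - day_1_control.mean()

n_boot = 10_000  # number of bootstrap resamples
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    # Resample each group with replacement, keeping the original group sizes.
    c = rng.choice(day_1_control, size=day_1_control.size, replace=True)
    t = rng.choice(day_1_treatment, size=day_1_treatment.size, replace=True)
    boot_diffs[i] = t.mean() - c.mean()

# Percentile 95% interval and the share of resamples where treatment wins.
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
share_treatment_better = (boot_diffs > 0).mean()

print(f"Observed difference: {observed_diff:.4f}")
print(f"95% bootstrap CI: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"Share of resamples with treatment > control: {share_treatment_better:.3f}")
```

The spread of `boot_diffs` answers the variability question, and an interval that excludes zero suggests the difference is unlikely to be noise.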



The reversed pattern between the 1-day and 7-day metrics is consistent with a novelty effect: users are initially intrigued by the new design and become more active, not because the change actually improves engagement. The novelty effect is common in consumer-facing A/B tests.
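One way to put the two horizons side by side is a two-proportion z-test per metric. The sketch below is a hand-rolled version using only the standard library. The active-user counts are hypothetical, chosen to match the simulated rates (30% vs. 35% after 1 day, 35% vs. 25% after 7 days), and `two_proportion_ztest` is an illustrative helper, not a library function.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(x_control, n_control, x_treat, n_treat):
    """Return (difference, z, two-sided p-value) for treatment minus control."""
    p_c, p_t = x_control / n_control, x_treat / n_treat
    pooled = (x_control + x_treat) / (n_control + n_treat)
    se = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treat))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_t - p_c, z, p_value

# Hypothetical counts: (active control, n control, active treated, n treated)
counts = {
    "day_1": (480, 1600, 612, 1749),
    "day_7": (560, 1600, 437, 1749),
}
results = {horizon: two_proportion_ztest(*c) for horizon, c in counts.items()}

for horizon, (diff, z, p) in results.items():
    print(f"{horizon}: lift = {diff:+.3f}, z = {z:.2f}, p = {p:.4f}")

# A positive short-term lift that turns negative later is the novelty pattern.
novelty_pattern = results["day_1"][0] > 0 > results["day_7"][0]
print(f"Novelty pattern detected: {novelty_pattern}")
```

Reporting both horizons together keeps a short-lived lift from being mistaken for a lasting improvement.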

**Best Practices**
- SRM is a real concern. We apply a chi-square test to formally test for SRM. If the p-value is smaller than the threshold (α = 0.001), the randomization process did not work as expected. An SRM introduces selection bias that invalidates any test results.
- Three fundamental statistical concepts to master: SRM, chi-square test, and bootstrap.
- Compare short-term and long-term metrics to evaluate the novelty effect.