# A/B Test Bias Correction: A Practical Guide

This notebook demonstrates the **common pitfalls and correct methods** for handling pre-existing bias (pre-bias) in A/B testing.

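We'll simulate an experiment in which the test group starts about 5 KPI points above control (the pre-bias) and receives a true treatment effect of +3, then compare four estimators: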
- ❌ Naive Post-Only Comparison
- ❌ Manual Bias Correction
- ✅ Difference-in-Differences (DiD)
- ✅ Regression Adjustment

This is intended as a practical reference for analysts and data scientists who run controlled experiments.
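
For reference, the four estimators can be written out as below. The notation is introduced here for illustration and is not part of the original notebook: $\bar{Y}$ is a group mean of the KPI, $T$ and $C$ denote test and control, and $D_i$ is 1 for test units and 0 for control units.

```latex
\begin{aligned}
\hat{\tau}_{\text{naive}}  &= \bar{Y}^{\text{post}}_{T} - \bar{Y}^{\text{post}}_{C} \\
\hat{\tau}_{\text{manual}} &= \hat{\tau}_{\text{naive}} - \left(\bar{Y}^{\text{pre}}_{T} - \bar{Y}^{\text{pre}}_{C}\right) \\
\hat{\tau}_{\text{DiD}}    &= \left(\bar{Y}^{\text{post}}_{T} - \bar{Y}^{\text{pre}}_{T}\right) - \left(\bar{Y}^{\text{post}}_{C} - \bar{Y}^{\text{pre}}_{C}\right) \\
Y^{\text{post}}_{i}        &= \alpha + \tau D_{i} + \beta\, Y^{\text{pre}}_{i} + \varepsilon_{i}
\end{aligned}
```

Rearranging the terms shows that $\hat{\tau}_{\text{manual}}$ and $\hat{\tau}_{\text{DiD}}$ are the same number whenever the same units are observed in both periods. The last line is the regression-adjustment model, where $\tau$ is the coefficient on the group indicator.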


In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)

# Sample size
n = 1000

# Simulate pre-period KPI
control_pre = np.random.normal(loc=100, scale=15, size=n)
test_pre = np.random.normal(loc=105, scale=15, size=n)  # pre-bias +5

# Simulate post-period KPI: both groups shift by +2 on average
control_post = control_pre + np.random.normal(loc=2, scale=10, size=n)
test_post = test_pre + np.random.normal(loc=5, scale=10, size=n)  # +2 shared shift plus +3 treatment effect

# Combine data
df = pd.DataFrame({
    "group": ["control"] * n + ["test"] * n,
    "pre_kpi": np.concatenate([control_pre, test_pre]),
    "post_kpi": np.concatenate([control_post, test_post])
})
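
Before estimating anything, it can help to confirm that the pre-period imbalance is actually visible in the data. The sketch below is one possible check, not part of the original notebook: it regenerates the pre-period draws with the same seed and reports the raw gap, a standardized mean difference, and a z statistic based on a normal approximation (an assumption that seems reasonable at n = 1000 per group). The names `pre_gap`, `smd`, and `p_two_sided` are illustrative.

```python
import math
import numpy as np

np.random.seed(42)  # same seed and draw order as the simulation cell
n = 1000
control_pre = np.random.normal(loc=100, scale=15, size=n)
test_pre = np.random.normal(loc=105, scale=15, size=n)

# Raw gap in pre-period means (test minus control)
pre_gap = test_pre.mean() - control_pre.mean()

# Standardized mean difference: gap divided by the pooled standard deviation
pooled_sd = math.sqrt((control_pre.var(ddof=1) + test_pre.var(ddof=1)) / 2)
smd = pre_gap / pooled_sd

# Unequal-variance z statistic and two-sided p-value (normal approximation)
se_gap = math.sqrt(control_pre.var(ddof=1) / n + test_pre.var(ddof=1) / n)
z = pre_gap / se_gap
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"pre-period gap: {pre_gap:.2f}, SMD: {smd:.2f}, z: {z:.2f}, p: {p_two_sided:.2g}")
```

A gap that is large relative to its standard error signals that a post-only comparison would mix the pre-existing difference with the treatment effect.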


In [15]:
# Preview the first rows of the combined data
df.head()

Unnamed: 0,group,pre_kpi,post_kpi
0,control,107.450712,102.69893
1,control,97.926035,98.480849
2,control,109.715328,103.791129
3,control,122.845448,121.765833
4,control,96.487699,79.551553


In [19]:
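# Post-period group means and their raw difference, defined here so the next cell runs in order
mean_control_post = df[df.group == "control"]["post_kpi"].mean()
mean_test_post = df[df.group == "test"]["post_kpi"].mean()
naive_diff = mean_test_post - mean_control_post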


In [31]:
mean_control_post, mean_test_post, naive_diff

(102.34832298289511, 110.87535138086962, 8.527028397974505)

In [17]:
# Naive Post-Only Comparison
mean_control_post = df[df.group == "control"]["post_kpi"].mean()
mean_test_post = df[df.group == "test"]["post_kpi"].mean()
naive_diff = mean_test_post - mean_control_post

# Manual Bias Correction
mean_control_pre = df[df.group == "control"]["pre_kpi"].mean()
mean_test_pre = df[df.group == "test"]["pre_kpi"].mean()
pre_bias = mean_test_pre - mean_control_pre
manual_adjusted_diff = naive_diff - pre_bias

# Difference-in-Differences (DiD): mean per-unit change in test minus mean per-unit change in control
control_diff = control_post - control_pre
test_diff = test_post - test_pre
did_effect = test_diff.mean() - control_diff.mean()

# Regression Adjustment
df["group_binary"] = (df["group"] == "test").astype(int)
X = sm.add_constant(df[["group_binary", "pre_kpi"]])
model = sm.OLS(df["post_kpi"], X).fit()
regression_effect = model.params["group_binary"]
p_value = model.pvalues["group_binary"]

# Summary Table
summary_df = pd.DataFrame({
    "Scenario": [
        "Naive (Post KPI Only)",
        "Manual Bias Correction (Subtract Pre-Bias)",
        "Difference-in-Differences (DiD)",
        "Regression Adjustment"
    ],
    "Estimated Effect": [
        naive_diff,
        manual_adjusted_diff,
        did_effect,
        regression_effect
    ],
    "p-value": [
        "❌ Not Reliable", 
        "❌ Not Reliable", 
        "✅ Via DiD Test", 
        f"✅ {p_value:.4f}"
    ]
})
summary_df

Unnamed: 0,Scenario,Estimated Effect,p-value
0,Naive (Post KPI Only),8.527028,❌ Not Reliable
1,Manual Bias Correction (Subtract Pre-Bias),2.754466,❌ Not Reliable
2,Difference-in-Differences (DiD),2.754466,✅ Via DiD Test
3,Regression Adjustment,2.822871,✅ 0.0000
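
The table labels the DiD p-value as "Via DiD Test" without showing a test. One possible choice, sketched here as an assumption rather than the notebook's own method, is a Welch two-sample t-test on the per-unit changes (post minus pre). It assumes each unit is observed in both periods and that units are independent. The block regenerates the simulated data so it runs on its own.

```python
import numpy as np
from scipy import stats

np.random.seed(42)  # same seed and draw order as the simulation cell
n = 1000
control_pre = np.random.normal(loc=100, scale=15, size=n)
test_pre = np.random.normal(loc=105, scale=15, size=n)
control_post = control_pre + np.random.normal(loc=2, scale=10, size=n)
test_post = test_pre + np.random.normal(loc=5, scale=10, size=n)

# Per-unit change from pre to post in each group
control_change = control_post - control_pre
test_change = test_post - test_pre

# DiD point estimate: difference in mean changes
did_effect = test_change.mean() - control_change.mean()

# Welch t-test (unequal variances) comparing the two sets of changes
result = stats.ttest_ind(test_change, control_change, equal_var=False)
did_t, did_p = result.statistic, result.pvalue

print(f"DiD effect: {did_effect:.3f}, t: {did_t:.2f}, p: {did_p:.2g}")
```

Because the test works on unit-level changes, it yields a standard error, which is what the subtraction of four group means alone does not provide.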


In [None]:
# 95% confidence intervals: normal approximation for DiD, model-based for the regression
se_control = np.std(control_diff, ddof=1) / np.sqrt(n)
se_test = np.std(test_diff, ddof=1) / np.sqrt(n)
se_did = np.sqrt(se_control**2 + se_test**2)
ci_did_lower = did_effect - 1.96 * se_did
ci_did_upper = did_effect + 1.96 * se_did

conf_int_reg = model.conf_int().loc["group_binary"]

# Plot
methods = ["Naive", "Manual Correction", "DiD", "Regression"]
effects = [naive_diff, manual_adjusted_diff, did_effect, regression_effect]
ci_lowers = [None, None, ci_did_lower, conf_int_reg[0]]
ci_uppers = [None, None, ci_did_upper, conf_int_reg[1]]
colors = ["red", "orange", "green", "blue"]

plt.figure(figsize=(10, 6))
for i, (method, effect, low, high, color) in enumerate(zip(methods, effects, ci_lowers, ci_uppers, colors)):
    plt.scatter(i, effect, color=color, label=method, s=100)
    if low is not None and high is not None:
        plt.plot([i, i], [low, high], color=color, linestyle='-', linewidth=2)

plt.axhline(0, color='gray', linestyle='--')
plt.xticks(range(len(methods)), methods)
plt.ylabel("Estimated Effect")
plt.title("Comparison of Bias Correction Methods with Confidence Intervals")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

## 🧠 Conclusion

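- **Naive post-only comparisons** absorb any pre-existing gap into the estimate. Here the naive difference is 8.53, while the simulated treatment effect is 3.
- **Manual bias correction** yields the same point estimate as DiD (2.75 in this run) because the two are algebraically identical. Computed from four group means alone, however, it comes with no standard error, so it supports no p-value or confidence interval.
- **Difference-in-Differences (DiD)** and **Regression Adjustment** work from unit-level data, so they provide both an estimate and its uncertainty. They are the recommended methods here.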
- Always check for **pre-period imbalances** and use methods that account for them.
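
A single simulated experiment cannot separate bias from sampling noise. The sketch below repeats the simulation many times to see where each estimator centers. It is an illustration under the same data-generating assumptions as the simulation cell (pre-bias of 5, shared shift of 2, treatment effect of 3); the number of repetitions, the seed, and the variable names are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 1000, 200
true_effect, pre_bias = 3.0, 5.0

naive, did, reg = [], [], []
for _ in range(n_sims):
    c_pre = rng.normal(100, 15, n)
    t_pre = rng.normal(100 + pre_bias, 15, n)
    c_post = c_pre + rng.normal(2, 10, n)
    t_post = t_pre + rng.normal(2 + true_effect, 10, n)

    # Naive: difference in post-period means
    naive.append(t_post.mean() - c_post.mean())
    # DiD: difference in mean per-unit changes
    did.append((t_post - t_pre).mean() - (c_post - c_pre).mean())

    # Regression adjustment: post ~ 1 + group + pre, solved by least squares
    X = np.column_stack([
        np.ones(2 * n),
        np.concatenate([np.zeros(n), np.ones(n)]),
        np.concatenate([c_pre, t_pre]),
    ])
    y = np.concatenate([c_post, t_post])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    reg.append(coef[1])  # coefficient on the group indicator

for name, est in [("naive", naive), ("DiD", did), ("regression", reg)]:
    print(f"{name:>10}: mean {np.mean(est):.2f}, sd {np.std(est, ddof=1):.2f}")
```

Under these assumptions the naive estimates should center near the pre-bias plus the effect, while DiD and regression adjustment should center near the true effect. That agreement depends on this particular setup, where the post-period KPI moves one-for-one with the pre-period KPI.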

You can reuse this notebook to validate the robustness of your A/B test results. Happy experimenting!
