# A/B Testing in Business Contexts

**Module: Descriptive & Inferential Statistics**

## Learning Objectives
- Design and set up A/B tests
- Calculate required sample sizes
- Analyze A/B test results for statistical significance
- Interpret results and make business recommendations

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

---
## Quick Refresher

### A/B Test Components
| Component | Description |
|-----------|-------------|
| **Control (A)** | Current version, baseline |
| **Treatment (B)** | New version being tested |
| **Metric** | What you're measuring (conversion rate, revenue, etc.) |
| **Sample size** | Number of users per group |
| **Duration** | How long to run the test |

### Key Concepts
- **Minimum Detectable Effect (MDE)**: Smallest improvement worth detecting
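- **Statistical Power**: Probability of detecting an effect at least as large as the MDE when it truly exists (typically set to 80%)
- **Significance Level (α)**: Probability of declaring a difference when none exists, i.e. the false positive rate (typically set to 5%)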
- **Practical Significance**: Is the effect size meaningful for the business?
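
To make power and α concrete, the sketch below approximates the power of a two-sided two-proportion z-test for a few sample sizes. The helper name `approx_power` and the example rates (3% vs 3.5%) are illustrative assumptions, and the normal approximation ignores the negligible chance of a significant result in the wrong direction.

```python
import numpy as np
from scipy import stats

def approx_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    # Standard error of the difference under the alternative (unpooled)
    se = np.sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    # Probability that the test statistic exceeds the critical value
    # in the direction of the true effect
    return float(stats.norm.cdf(abs(p2 - p1) / se - z_alpha))

# Hypothetical scenario: 3% baseline, 3.5% with treatment
for n in (5_000, 20_000, 50_000):
    print(f"n per group = {n:>6,}: power ≈ {approx_power(0.03, 0.035, n):.2f}")
```

Power rises with sample size: a test that is badly underpowered at 5,000 users per group reaches roughly the conventional 80% at around 20,000 per group in this scenario.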

### Common Pitfalls
- Peeking at results early and stopping when significant
- Running tests for too short a period (novelty effects can inflate early results)
- Not accounting for multiple comparisons
- Ignoring segment differences
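
The first pitfall can be illustrated with a small simulation. The sketch below runs A/A tests (both groups share the same true rate, so every "significant" result is a false positive) and compares stopping at the first significant peek with testing once at the end. The function name `simulate_peeking` and all parameter values are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_peeking(n_sims=2000, n_per_group=2000, n_looks=10, rate=0.05, alpha=0.05):
    """Return (false positive rate with peeking, false positive rate with one final test)."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Per-group sample sizes at which the analyst checks the results
    looks = np.linspace(n_per_group // n_looks, n_per_group, n_looks).astype(int)
    stopped_early = 0
    significant_at_end = 0
    for _ in range(n_sims):
        # Cumulative conversions at each look; same true rate in both groups
        a = np.cumsum(rng.binomial(1, rate, n_per_group))[looks - 1]
        b = np.cumsum(rng.binomial(1, rate, n_per_group))[looks - 1]
        pooled = (a + b) / (2 * looks)
        se = np.sqrt(pooled * (1 - pooled) * 2 / looks)
        # Pooled z-statistic at every look (0 where the SE is undefined)
        z = np.divide(b / looks - a / looks, se, out=np.zeros(n_looks), where=se > 0)
        hits = np.abs(z) > z_crit
        stopped_early += int(hits.any())      # would have stopped at some peek
        significant_at_end += int(hits[-1])   # single test at the planned end
    return stopped_early / n_sims, significant_at_end / n_sims

peeking_rate, fixed_rate = simulate_peeking()
print(f"False positive rate with peeking:   {peeking_rate:.1%}")
print(f"False positive rate, single test:   {fixed_rate:.1%}")
```

The single final test should stay close to the nominal 5%, while stopping at the first significant peek typically produces a false positive rate several times higher.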

---
## Sample Size Calculation

In [None]:
def sample_size_proportion(p1, p2, alpha=0.05, power=0.80):
    """
    Calculate required sample size per group for comparing two proportions.
    
    p1: baseline conversion rate
    p2: expected conversion rate with treatment
    alpha: significance level (Type I error rate)
    power: 1 - Type II error rate
    """
    # Z-scores
    z_alpha = stats.norm.ppf(1 - alpha/2)  # Two-tailed
    z_beta = stats.norm.ppf(power)
    
    # Average of the two rates (the pooled proportion when groups are equal in size)
    p_pooled = (p1 + p2) / 2
    
    # Sample size formula
    n = (2 * p_pooled * (1 - p_pooled) * (z_alpha + z_beta)**2) / (p2 - p1)**2
    
    return int(np.ceil(n))

# Example: Testing a checkout page
# Current conversion: 3%
# Want to detect if new version achieves 3.5% (0.5 percentage point absolute lift)

baseline = 0.03
expected = 0.035
n_per_group = sample_size_proportion(baseline, expected)

print(f"Baseline conversion: {baseline:.1%}")
print(f"Expected conversion: {expected:.1%}")
print(f"Minimum detectable effect: {(expected-baseline)*100:.1f} percentage points absolute")
print(f"\nRequired sample size per group: {n_per_group:,}")
print(f"Total sample size: {n_per_group*2:,}")
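
For reference, the function above implements a common normal-approximation formula for two equal-sized groups. The notation below is introduced here for illustration, with $z_{1-\beta}$ corresponding to `stats.norm.ppf(power)`:

$$
n = \frac{2\,\bar{p}\,(1-\bar{p})\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{(p_2 - p_1)^2},
\qquad \bar{p} = \frac{p_1 + p_2}{2}
$$

With $p_1 = 0.03$, $p_2 = 0.035$, $\alpha = 0.05$ and power $= 0.80$, we get $\bar{p} = 0.0325$ and $z_{1-\alpha/2} + z_{1-\beta} \approx 1.960 + 0.842 = 2.802$, so

$$
n \approx \frac{2 \times 0.0325 \times 0.9675 \times 2.802^2}{0.005^2} \approx 19{,}700 \text{ per group.}
$$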

In [None]:
# How MDE affects sample size
print("MDE (absolute) | Sample Size per Group")
print("-" * 40)
for mde in [0.001, 0.002, 0.005, 0.01, 0.02]:
    n = sample_size_proportion(baseline, baseline + mde)
    print(f"    {mde*100:.1f}%       |     {n:,}")
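
The same relationship can be read in the other direction: given a fixed sample size, what is the smallest lift the test can reliably detect? The sketch below rearranges the sample size formula, using the baseline variance for both groups as a simplifying assumption, so it slightly understates the MDE when the treatment rate is above the baseline. The helper name `approx_mde` is hypothetical.

```python
import numpy as np
from scipy import stats

def approx_mde(baseline, n_per_group, alpha=0.05, power=0.80):
    """Approximate absolute MDE for a given per-group sample size."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-tailed critical value
    z_beta = stats.norm.ppf(power)
    # Rearranged sample size formula, with baseline variance in both groups
    return (z_alpha + z_beta) * np.sqrt(2 * baseline * (1 - baseline) / n_per_group)

# Hypothetical scenario: 3% baseline conversion
for n in (5_000, 20_000, 80_000):
    print(f"n per group = {n:>6,}: MDE ≈ {approx_mde(0.03, n) * 100:.2f} percentage points")
```

Because the MDE shrinks with the square root of the sample size, halving the detectable effect requires about four times as many users.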

---
## Working Example: E-commerce Checkout A/B Test

In [None]:
# Simulate A/B test data
np.random.seed(42)

n_control = 5000
n_treatment = 5000

# True conversion rates (in real life, neither rate is known; the test estimates them)
true_control_rate = 0.032
true_treatment_rate = 0.038

# Generate data
control_conversions = np.random.binomial(1, true_control_rate, n_control)
treatment_conversions = np.random.binomial(1, true_treatment_rate, n_treatment)

# Create DataFrame
ab_data = pd.DataFrame({
    'group': ['control'] * n_control + ['treatment'] * n_treatment,
    'converted': np.concatenate([control_conversions, treatment_conversions])
})

ab_data.head()

In [None]:
# Summary statistics
summary = ab_data.groupby('group')['converted'].agg(['sum', 'count', 'mean'])
summary.columns = ['conversions', 'visitors', 'conversion_rate']
summary['conversion_rate_pct'] = summary['conversion_rate'] * 100
print(summary)

In [None]:
# Extract values for analysis
control = ab_data[ab_data['group'] == 'control']['converted']
treatment = ab_data[ab_data['group'] == 'treatment']['converted']

n_c, conv_c = len(control), control.sum()
n_t, conv_t = len(treatment), treatment.sum()
p_c = conv_c / n_c
p_t = conv_t / n_t

print(f"Control: {conv_c}/{n_c} = {p_c*100:.2f}%")
print(f"Treatment: {conv_t}/{n_t} = {p_t*100:.2f}%")
print(f"\nAbsolute difference: {(p_t - p_c)*100:.2f} percentage points")
print(f"Relative lift: {(p_t - p_c)/p_c*100:.1f}%")

### Statistical Significance Test

In [None]:
def ab_test_proportions(conversions_a, total_a, conversions_b, total_b):
    """
    Perform z-test for comparing two proportions.
    Returns z-statistic, p-value, and confidence interval for difference.
    """
    # Proportions
    p_a = conversions_a / total_a
    p_b = conversions_b / total_b
    
    # Pooled proportion under H0
    p_pooled = (conversions_a + conversions_b) / (total_a + total_b)
    
    # Standard error
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/total_a + 1/total_b))
    
    # Z-statistic
    z = (p_b - p_a) / se
    
    # Two-tailed p-value (sf is 1 - cdf, but stays accurate for large |z|)
    p_value = 2 * stats.norm.sf(abs(z))
    
    # 95% CI for difference (using unpooled SE)
    se_diff = np.sqrt(p_a*(1-p_a)/total_a + p_b*(1-p_b)/total_b)
    ci_lower = (p_b - p_a) - 1.96 * se_diff
    ci_upper = (p_b - p_a) + 1.96 * se_diff
    
    return {
        'z_statistic': z,
        'p_value': p_value,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'significant': p_value < 0.05
    }

results = ab_test_proportions(conv_c, n_c, conv_t, n_t)

print("=" * 50)
print("A/B TEST RESULTS")
print("=" * 50)
print(f"Z-statistic: {results['z_statistic']:.3f}")
print(f"P-value: {results['p_value']:.4f}")
print(f"95% CI for difference: ({results['ci_lower']*100:.2f}%, {results['ci_upper']*100:.2f}%)")
print(f"\nStatistically significant at α=0.05: {'Yes ✓' if results['significant'] else 'No'}")

In [None]:
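# Alternative: Using scipy's chi-squared test
# Note: for 2x2 tables chi2_contingency applies Yates' continuity correction by default,
# so this p-value is slightly larger than the z-test's; pass correction=False to match it.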
contingency_table = pd.crosstab(ab_data['group'], ab_data['converted'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

print("Chi-squared test:")
print(f"χ² = {chi2:.3f}, p = {p_value:.4f}")
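
The link between the two tests can be checked directly. The sketch below uses hypothetical counts (not the simulated data above) to show that, without the continuity correction, the chi-squared statistic for a 2x2 table equals the square of the pooled z-statistic and the p-values coincide.

```python
import numpy as np
from scipy import stats

# Hypothetical counts, for illustration only
conv_a, n_a = 160, 5000
conv_b, n_b = 190, 5000

# Pooled two-proportion z-test
p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_z = 2 * stats.norm.sf(abs(z))

# Chi-squared test on the same data: rows = groups, columns = converted / not converted
table = np.array([[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]])
chi2_raw, p_raw, _, _ = stats.chi2_contingency(table, correction=False)
chi2_yates, p_yates, _, _ = stats.chi2_contingency(table)  # default: Yates' correction

print(f"z² = {z**2:.4f}, χ² (no correction) = {chi2_raw:.4f}")
print(f"p (z-test) = {p_z:.4f}, p (χ², no correction) = {p_raw:.4f}")
print(f"p (χ², Yates) = {p_yates:.4f}")
```

The Yates-corrected p-value is the more conservative of the two, which explains small discrepancies between the z-test and the default `chi2_contingency` output.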

### Business Interpretation

In [None]:
# Project the impact
monthly_visitors = 100000
avg_order_value = 75  # dollars

# Current monthly conversions and revenue
current_conversions = monthly_visitors * p_c
current_revenue = current_conversions * avg_order_value

# Projected with treatment
projected_conversions = monthly_visitors * p_t
projected_revenue = projected_conversions * avg_order_value

# Uplift
additional_conversions = projected_conversions - current_conversions
additional_revenue = projected_revenue - current_revenue

print("PROJECTED MONTHLY IMPACT")
print("-" * 40)
print(f"Additional conversions: {additional_conversions:,.0f}")
print(f"Additional revenue: ${additional_revenue:,.0f}")
print(f"Annual revenue impact: ${additional_revenue * 12:,.0f}")
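
A projection based only on the point estimate can overstate how certain the gain is. As a minimal sketch, the block below carries the 95% confidence interval for the conversion difference through to monthly revenue. The counts, traffic, and order value are hypothetical placeholders, and the calculation assumes average order value is unaffected by the treatment.

```python
import numpy as np
from scipy import stats

# Hypothetical test counts and business inputs, for illustration only
conv_c, n_c = 160, 5000
conv_t, n_t = 190, 5000
monthly_visitors = 100_000
avg_order_value = 75  # dollars

p_c, p_t = conv_c / n_c, conv_t / n_t
diff = p_t - p_c
# Unpooled standard error of the difference in conversion rates
se_diff = np.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
z_crit = stats.norm.ppf(0.975)

def monthly_revenue(rate_diff):
    """Convert a difference in conversion rate into monthly revenue."""
    return rate_diff * monthly_visitors * avg_order_value

revenue_point = monthly_revenue(diff)
revenue_low = monthly_revenue(diff - z_crit * se_diff)
revenue_high = monthly_revenue(diff + z_crit * se_diff)

print(f"Point estimate: ${revenue_point:,.0f} per month")
print(f"95% CI: (${revenue_low:,.0f}, ${revenue_high:,.0f}) per month")
```

With these hypothetical counts the interval extends below zero, so the headline revenue figure should be presented alongside the possibility of no gain at all.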

---
## Exercises

### Exercise 1: Sample Size Planning

In [None]:
# You're planning an A/B test for a sign-up form.
# Current sign-up rate: 12%
# You want to detect a 1.5% absolute improvement (to 13.5%)

# TODO: Calculate the required sample size per group
# Use alpha=0.05 and power=0.80



In [None]:
# TODO: Your website gets 2,000 visitors per day.
# If you split traffic 50/50, how many days do you need to run the test?



In [None]:
# TODO: Marketing says they can only wait 2 weeks.
# What's the minimum detectable effect you can achieve in that time?
# Hint: Work backwards - with 14 days * 1000 per group, what MDE is possible?



### Exercise 2: Analyze A/B Test Results

In [None]:
# Email subject line A/B test results
email_test = pd.DataFrame({
    'variant': ['Subject A', 'Subject B'],
    'emails_sent': [15000, 15000],
    'opens': [2850, 3150]
})

email_test['open_rate'] = email_test['opens'] / email_test['emails_sent']
print(email_test)

In [None]:
# TODO: Calculate the absolute and relative difference in open rates



In [None]:
# TODO: Test if the difference is statistically significant at α = 0.05
# Calculate the z-statistic and p-value



In [None]:
# TODO: Calculate the 95% confidence interval for the difference in open rates



In [None]:
# TODO: Write a brief recommendation: Should we switch to Subject B?



### Exercise 3: Full A/B Test Analysis

In [None]:
# Landing page A/B test with user-level data
np.random.seed(123)

landing_test = pd.DataFrame({
    'user_id': range(1, 20001),
    'variant': np.random.choice(['original', 'new_design'], 20000),
    'device': np.random.choice(['mobile', 'desktop'], 20000, p=[0.6, 0.4]),
})

# Conversion depends on variant AND device
def get_conversion(row):
    if row['variant'] == 'original':
        rate = 0.05 if row['device'] == 'mobile' else 0.08
    else:
        rate = 0.055 if row['device'] == 'mobile' else 0.095  # New design works better
    return np.random.binomial(1, rate)

landing_test['converted'] = landing_test.apply(get_conversion, axis=1)
landing_test.head()

In [None]:
# TODO: Calculate overall conversion rates for each variant



In [None]:
# TODO: Test if the new design has significantly higher conversion (overall)



In [None]:
# TODO: Break down conversion rates by device type
# Is the new design better on both mobile and desktop?



In [None]:
# TODO: Run separate significance tests for mobile and desktop users



### Exercise 4: Continuous Metric A/B Test

In [None]:
# Testing a new recommendation algorithm on average order value
np.random.seed(456)

# Control: current algorithm
control_aov = np.random.normal(loc=85, scale=35, size=1000)
control_aov = np.clip(control_aov, 10, 300)  # Realistic bounds

# Treatment: new recommendation algorithm
treatment_aov = np.random.normal(loc=92, scale=38, size=1000)
treatment_aov = np.clip(treatment_aov, 10, 300)

print(f"Control: mean=${control_aov.mean():.2f}, std=${control_aov.std():.2f}")
print(f"Treatment: mean=${treatment_aov.mean():.2f}, std=${treatment_aov.std():.2f}")

In [None]:
# TODO: Test if the treatment has significantly higher average order value
# Use a two-sample t-test



In [None]:
# TODO: Calculate the 95% CI for the difference in mean AOV



In [None]:
# TODO: Calculate Cohen's d effect size
# Is this a small, medium, or large effect?



In [None]:
# TODO: If you process 5,000 orders per month, estimate the monthly revenue impact
# Include confidence bounds



### Exercise 5: Multiple Variants Test

In [None]:
# Testing 3 different button colors (A/B/C test)
button_test = pd.DataFrame({
    'variant': ['Blue (Control)', 'Green', 'Orange'],
    'visitors': [10000, 10000, 10000],
    'clicks': [320, 380, 345]
})

button_test['ctr'] = button_test['clicks'] / button_test['visitors']
print(button_test)

In [None]:
# TODO: Perform chi-squared test to check if there's any difference among the three



In [None]:
# TODO: Compare Green vs Control (Blue) - is Green significantly better?
# Note: When doing multiple comparisons, consider Bonferroni correction
# Adjusted alpha = 0.05 / number of comparisons



In [None]:
# TODO: Which button would you recommend, and why?



---
## Solutions

In [None]:
# Exercise 1 Solutions

# Sample size calculation
baseline_rate = 0.12
target_rate = 0.135
n = sample_size_proportion(baseline_rate, target_rate)
print(f"Required sample size per group: {n:,}")

# Days needed
daily_visitors = 2000
visitors_per_group_per_day = daily_visitors / 2
days_needed = np.ceil(n / visitors_per_group_per_day)
print(f"Days needed: {days_needed:.0f}")

# MDE with 2-week constraint
n_available = 14 * visitors_per_group_per_day
print(f"\nWith 2 weeks, you have {n_available:.0f} visitors per group")
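# Search for the smallest lift whose required sample size fits the 2-week budget.
# 14,000 per group exceeds what the 1.5-point target needs, so the MDE is below 1.5 points
# (roughly 1.1 percentage points here).
for mde in np.arange(0.005, 0.0301, 0.0001):
    if sample_size_proportion(baseline_rate, baseline_rate + mde) <= n_available:
        break
print(f"Minimum detectable effect: about {mde*100:.2f} percentage points absolute")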

In [None]:
# Exercise 2 Solutions

# Differences
p_a = 2850 / 15000
p_b = 3150 / 15000
abs_diff = p_b - p_a
rel_diff = (p_b - p_a) / p_a * 100

print(f"Subject A open rate: {p_a*100:.2f}%")
print(f"Subject B open rate: {p_b*100:.2f}%")
print(f"Absolute difference: {abs_diff*100:.2f}%")
print(f"Relative improvement: {rel_diff:.1f}%")

# Significance test
results = ab_test_proportions(2850, 15000, 3150, 15000)
print(f"\nZ-statistic: {results['z_statistic']:.3f}")
print(f"P-value: {results['p_value']:.4f}")
print(f"95% CI: ({results['ci_lower']*100:.2f}%, {results['ci_upper']*100:.2f}%)")
print(f"\nStatistically significant: {'Yes' if results['significant'] else 'No'}")

print("\nRecommendation: Yes, switch to Subject B. The 2-percentage-point absolute improvement is")
print("statistically significant and represents a meaningful 10.5% relative lift.")

In [None]:
# Exercise 3 Solutions

# Overall conversion rates
overall = landing_test.groupby('variant')['converted'].agg(['sum', 'count', 'mean'])
print("Overall Results:")
print(overall)

# Significance test
orig = landing_test[landing_test['variant'] == 'original']
new = landing_test[landing_test['variant'] == 'new_design']

results = ab_test_proportions(
    orig['converted'].sum(), len(orig),
    new['converted'].sum(), len(new)
)
print(f"\nOverall test: p-value = {results['p_value']:.4f}")
print(f"Significant: {results['significant']}")

In [None]:
# Breakdown by device
print("\nBy Device:")
device_breakdown = landing_test.groupby(['device', 'variant'])['converted'].agg(['sum', 'count', 'mean'])
print(device_breakdown)

# Test for each device
for device in ['mobile', 'desktop']:
    subset = landing_test[landing_test['device'] == device]
    orig_d = subset[subset['variant'] == 'original']
    new_d = subset[subset['variant'] == 'new_design']
    
    res = ab_test_proportions(
        orig_d['converted'].sum(), len(orig_d),
        new_d['converted'].sum(), len(new_d)
    )
    print(f"\n{device.capitalize()}: p-value = {res['p_value']:.4f}, significant = {res['significant']}")

In [None]:
# Exercise 4 Solutions

# Welch's t-test (does not assume equal variances in the two groups)
t_stat, p_value = stats.ttest_ind(treatment_aov, control_aov, equal_var=False)
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant: {p_value < 0.05}")

# CI for difference
mean_diff = treatment_aov.mean() - control_aov.mean()
se_diff = np.sqrt(treatment_aov.var(ddof=1)/len(treatment_aov) + control_aov.var(ddof=1)/len(control_aov))
ci = (mean_diff - 1.96*se_diff, mean_diff + 1.96*se_diff)
print(f"\n95% CI for AOV difference: (${ci[0]:.2f}, ${ci[1]:.2f})")

# Cohen's d
pooled_std = np.sqrt((treatment_aov.var(ddof=1) + control_aov.var(ddof=1)) / 2)
d = mean_diff / pooled_std
print(f"\nCohen's d: {d:.3f} (small effect)")

# Monthly impact
monthly_orders = 5000
expected_monthly_lift = mean_diff * monthly_orders
ci_lower_monthly = ci[0] * monthly_orders
ci_upper_monthly = ci[1] * monthly_orders
print(f"\nMonthly revenue impact: ${expected_monthly_lift:,.0f}")
print(f"95% CI: (${ci_lower_monthly:,.0f}, ${ci_upper_monthly:,.0f})")

In [None]:
# Exercise 5 Solutions

# Chi-squared test
contingency = np.array([[320, 10000-320], [380, 10000-380], [345, 10000-345]])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
print(f"Chi-squared test: χ² = {chi2:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("There IS a significant difference among the three variants.")
else:
    print("No significant difference among the three variants at α = 0.05.")

# Green vs Control with Bonferroni correction
# 3 comparisons: Blue-Green, Blue-Orange, Green-Orange
bonferroni_alpha = 0.05 / 3
print(f"\nBonferroni-corrected alpha: {bonferroni_alpha:.4f}")

green_vs_blue = ab_test_proportions(320, 10000, 380, 10000)
print("\nGreen vs Blue:")
print(f"  p-value: {green_vs_blue['p_value']:.4f}")
print(f"  Significant at corrected alpha: {green_vs_blue['p_value'] < bonferroni_alpha}")

print("\nRecommendation: Do not switch yet; keep Blue and collect more data on Green.")
print("- Green shows an 18.75% relative improvement over Blue")
print("- The overall chi-squared test is not significant at α = 0.05 (p ≈ 0.07)")
print("- Green vs Blue (p ≈ 0.02) does not clear the Bonferroni threshold of 0.0167")
print("- Orange's improvement (7.8%) is smaller still")