# Exercise 1: Calculating Required Sample Size




> You are planning an A/B test to evaluate the impact of a new email subject line on the open rate. Based on past data, you expect a small effect size of 0.3 (an increase from 20% to 23% in the open rate). You aim for an 80% chance (power = 0.8) of detecting this effect if it exists, with a 5% significance level (α = 0.05).
Calculate the required sample size per group using Python’s statsmodels library.
What sample size is needed for each group to ensure your test is properly powered?



Main additional assumptions for the A/B test:


Equal Group Sizes:
`ratio=1`: both groups in the A/B test (Group A and Group B) have the same number of participants, which provides the most statistical power for a given total sample size.

Variance in Group Size: Minimized
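
The claim that equal group sizes give the most power for a fixed total can be checked with a rough calculation. The sketch below is not the statsmodels routine used in this exercise; it assumes a two-sided test and a normal approximation, uses only the standard library, and the helper name `approx_group_sizes` is made up for illustration. It shows the total sample size growing as the groups become more unbalanced.

```python
from statistics import NormalDist


def approx_group_sizes(effect_size, alpha=0.05, power=0.8, ratio=1.0):
    """Normal-approximation group sizes (n1, n2) for a two-sided, two-sample test.

    ratio is n2 / n1, the same meaning as `ratio` in the exercise code.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Variance of the standardized difference is (1/n1) * (1 + 1/ratio)
    n1 = (1 + 1 / ratio) * (z_alpha + z_power) ** 2 / effect_size ** 2
    return n1, ratio * n1


for ratio in (1, 2, 3):
    n1, n2 = approx_group_sizes(0.3, ratio=ratio)
    print(f"ratio={ratio}: n1={n1:.0f}, n2={n2:.0f}, total={n1 + n2:.0f}")
```

Under this approximation the smallest total is reached at `ratio=1`, which is why the exercise keeps the two groups equal.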



In [None]:
from statsmodels.stats.power import TTestIndPower

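# Test parameters
effect_size = 0.3  # standardized effect size (Cohen's d)
alpha = 0.05       # significance level
power = 0.8        # desired statistical power
ratio = 1          # equal group sizes (n2 / n1)

# Power analysis object for an independent two-sample t-test
analysis = TTestIndPower()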

# Calculating required sample size per group
sample_size_per_group = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=ratio)

print(f"Required sample size per group: {sample_size_per_group:.0f}")


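Required sample size per group: 175

The test needs about 175 participants in each group. Because the exact solution is generally fractional and `:.0f` rounds to the nearest integer, rounding up (for example with `math.ceil`) is the safer choice in practice.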
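
The exercise pairs an effect size of 0.3 with an increase in open rate from 20% to 23%. A common standardized effect size for two proportions is Cohen's h, and the sketch below suggests that a 20% to 23% change corresponds to a much smaller value than 0.3. It relies on a normal approximation and the standard library only, so treat the sample size it prints as a rough figure; the helper names are hypothetical.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))


def approx_n_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group (equal groups, two-sided)."""
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * (z_sum / effect_size) ** 2)


h = cohens_h(0.20, 0.23)
print(f"Cohen's h for 20% -> 23%: {h:.3f}")
print(f"Approximate n per group at that effect size: {approx_n_per_group(h)}")
print(f"Approximate n per group at effect size 0.3: {approx_n_per_group(0.3)}")
```

If the 20% to 23% lift is the effect the test really has to detect, the required sample is far larger than the figure obtained with an effect size of 0.3, so it is worth confirming which of the two numbers the test should be planned around.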


# Exercise 2: Understanding the Relationship Between Effect Size and Sample Size



> Using the same A/B test setup as in Exercise 1:
Calculate the required sample size for the following effect sizes: 0.2, 0.4, and 0.5, keeping the significance level and power the same.
How does the sample size change as the effect size increases? Explain why this happens.



In [None]:
from statsmodels.stats.power import TTestIndPower

# Test parameters held constant
alpha = 0.05
power = 0.8
ratio = 1

# Effect sizes to compare
effect_sizes = [0.2, 0.4, 0.5]

# Power analysis object for an independent two-sample t-test
analysis = TTestIndPower()

# Calculating required sample sizes for each effect size
sample_sizes = {effect_size: analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=ratio)
                for effect_size in effect_sizes}
for effect_size, sample_size in sample_sizes.items():
    print(f"Effect Size {effect_size}: Required Sample Size Per Group = {sample_size:.0f}")


Effect Size 0.2: Required Sample Size Per Group = 393
Effect Size 0.4: Required Sample Size Per Group = 99
Effect Size 0.5: Required Sample Size Per Group = 64


As the effect size increases, the required sample size decreases. This happens because:
1. Larger Effects Are Easier to Detect: A larger effect size means a bigger difference between the groups, making it easier to detect the difference with fewer participants.
2. Signal-to-Noise Ratio: A smaller effect size has a lower signal-to-noise ratio, requiring more data to distinguish the effect from random variation.

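**This demonstrates the inverse relationship between effect size and sample size in statistical testing: the required sample size scales roughly with the inverse square of the effect size, so halving the effect size from 0.4 to 0.2 roughly quadruples the sample size (99 to 393 per group).**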
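
The inverse-square pattern can be seen directly in the textbook normal-approximation formula n ≈ 2(z₁₋α/₂ + z_power)² / d². The sketch below uses that approximation with the standard library rather than the t-test solver above, so its numbers come out slightly lower than the statsmodels output; `approx_n_per_group` is a hypothetical helper name.

```python
from statistics import NormalDist


def approx_n_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group (equal groups, two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 / effect_size ** 2


for d in (0.2, 0.4, 0.5):
    print(f"Effect size {d}: approximately {approx_n_per_group(d):.1f} per group")

# Halving the effect size multiplies the requirement by four in this formula
print(approx_n_per_group(0.2) / approx_n_per_group(0.4))
```

The effect size appears squared in the denominator, which is why small effects become expensive so quickly.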

# Exercise 3: Exploring the Impact of Statistical Power



> Imagine you are conducting an A/B test where you expect a small effect size of 0.2. You initially plan for a power of 0.8 but wonder how increasing or decreasing the desired power level impacts the required sample size.
Calculate the required sample size for power levels of 0.7, 0.8, and 0.9, keeping the effect size at 0.2 and significance level at 0.05.
Question: How does the required sample size change with different levels of statistical power? Why is this understanding important when designing A/B tests?




In [None]:
from statsmodels.stats.power import TTestIndPower

# Test parameters held constant
effect_size = 0.2
alpha = 0.05
ratio = 1

# Different power levels to explore
power_levels = [0.7, 0.8, 0.9]

# Power analysis object for an independent two-sample t-test
analysis = TTestIndPower()

# Calculating required sample sizes for each power level
sample_sizes_power = {power_level: analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power_level, ratio=ratio)
                      for power_level in power_levels}

for power_level, sample_size in sample_sizes_power.items():
    print(f"Power Level {power_level}: Required Sample Size Per Group = {sample_size:.0f}")


Power Level 0.7: Required Sample Size Per Group = 310
Power Level 0.8: Required Sample Size Per Group = 393
Power Level 0.9: Required Sample Size Per Group = 526


When designing A/B tests, setting an appropriate power level ensures the test has a good chance of detecting meaningful effects while avoiding unnecessary data collection.

As the desired power increases, the required sample size also increases: moving from a power of 0.8 to 0.9 raises the requirement from 393 to 526 per group, roughly a third more. This happens because:
1. Higher Power Means Greater Sensitivity: A higher statistical power reduces the probability of a Type II error (failing to detect an effect that actually exists). Achieving this sensitivity requires more data.
2. Balancing Resources: Understanding the relationship between power and sample size helps in balancing the need for reliable results against the cost and feasibility of data collection.
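
The same trade-off can be read in the other direction: given a sample size, how much power does the test have? The sketch below is a normal approximation using the standard library only, not the statsmodels solver, and `approx_power` is a hypothetical helper. It plugs the three sample sizes from the output above back in and recovers powers close to 0.7, 0.8, and 0.9.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided, two-sample test with equal groups."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = effect_size * sqrt(n_per_group / 2)
    # Ignores the negligible chance of rejecting in the wrong direction
    return NormalDist().cdf(noncentrality - z_alpha)


for n in (310, 393, 526):
    print(f"n per group = {n}: approximate power = {approx_power(n, 0.2):.3f}")
```

A function like this is handy when the sample size is capped by traffic or budget and the open question is how much power the test will actually have.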


# Exercise 4: Implementing Sequential Testing



> You are running an A/B test on two versions of a product page to increase the purchase rate. You plan to monitor the results weekly and stop the test early if one version shows a significant improvement.
Define your stopping criteria.
Decide how you would implement sequential testing in this scenario.
At the end of week three, Version B has a p-value of 0.02. What would you do next?



**Defining Stopping Criteria**

---


Stopping criteria for sequential testing involve rules to decide when to stop the test early:

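1. P-value Threshold: If the p-value drops below a predefined threshold, the test can stop early with the conclusion that one version outperforms the other. With repeated looks this threshold should be stricter than the usual 0.05 (see the adjusted boundaries in the next item).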
2. Confidence Boundaries: Use adjusted boundaries (e.g., Bonferroni correction or alpha-spending) to control the overall Type I error rate when checking results multiple times.
3. Maximum Duration: If no significant difference is found by a predefined endpoint, declare the test inconclusive.

**Implementing Sequential Testing in Code**

---



In [None]:
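def sequential_testing(p_values, alpha=0.05, planned_looks=None):
    """
    Simulate sequential testing by checking p-values at each stage.

    Parameters:
        p_values (list of float): List of p-values from each week's data.
        alpha (float): Overall significance level across all looks.
        planned_looks (int, optional): Number of interim looks planned in
            advance. Defaults to the number of p-values supplied.

    Returns:
        str: Decision to stop or continue the test.
    """
    planned_looks = planned_looks or max(len(p_values), 1)
    # Bonferroni correction: split alpha evenly across all planned looks.
    # Dividing by the current week number instead would let the per-look
    # thresholds add up to more than alpha.
    adjusted_alpha = alpha / planned_looks
    for week, p_value in enumerate(p_values, start=1):
        if p_value < adjusted_alpha:
            return f"Stop the test: Significant result in Week {week} (p={p_value:.3f}, adjusted alpha={adjusted_alpha:.3f})"
    return "Continue the test: No significant results yet."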


weekly_p_values = [0.10, 0.08, 0.01]
decision = sequential_testing(weekly_p_values)

print(decision)


Stop the test: Significant result in Week 3 (p=0.010, adjusted alpha=0.017)


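Sequential testing increases the chance of a Type I error (false positive) if you apply the standard α=0.05 threshold at every look. Adjusting the per-look alpha controls for this inflation.

For the week-three question: with a Bonferroni split over three or more planned looks, the per-look threshold is at most 0.05/3 ≈ 0.017, so a p-value of 0.02 does not cross it. The test should continue until the next scheduled look or the predefined maximum duration.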
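
The inflation from repeated looks can be illustrated with a small simulation under the null hypothesis (no real difference between versions). This is a sketch under simplifying assumptions: equal-sized batches of data arrive between looks, the test statistic is approximately normal, and the function name and simulation settings are made up for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(n_looks, per_look_alpha, n_sims=20000, seed=0):
    """Estimate how often a test with no true effect is stopped as significant.

    With equal-sized batches, the cumulative z-statistic at look k is the sum
    of k independent standard normal increments divided by sqrt(k).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - per_look_alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            cumulative += rng.gauss(0, 1)
            if abs(cumulative / sqrt(look)) > z_crit:
                false_positives += 1  # stopped early on a false alarm
                break
    return false_positives / n_sims


naive = false_positive_rate(n_looks=3, per_look_alpha=0.05)
bonferroni = false_positive_rate(n_looks=3, per_look_alpha=0.05 / 3)
print(f"Unadjusted 0.05 at each of 3 looks: {naive:.3f}")
print(f"Bonferroni 0.05/3 at each look:     {bonferroni:.3f}")
```

The unadjusted rate lands well above the nominal 5%, while the Bonferroni split stays below it. Bonferroni is conservative; alpha-spending boundaries aim to recover some of that lost power.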

# Exercise 5: Applying Bayesian A/B Testing



> You’re testing a new feature in your app, and you want to use a Bayesian approach. Initially, you believe the new feature has a 50% chance of improving user engagement. After collecting data, your analysis suggests a 65% probability that the new feature is better.
Describe how you would set up your prior belief.
After collecting data, how does the updated belief (posterior distribution) influence your decision?
What would you do if the posterior probability was only 55%?




1. **Setting Up the Prior Belief:**
The prior belief is symmetric: the feature is equally likely to be better or worse. One way to represent this is to give the engagement rate of each version the same Beta(1, 1) distribution (a uniform prior), which implies a 50% prior probability that the new feature is better.

2. **Decision Based on Posterior Probability:**
65% Probability: If the posterior probability that the new feature is better is 65%, this indicates only moderate confidence. A predefined decision threshold (e.g., 95%) should guide whether to implement the feature; at 65% the evidence falls short of such a threshold, so the test should keep collecting data.
55% Probability: If the posterior probability is only 55%, it has barely moved from the 50% prior. This indicates minimal confidence in the feature's superiority and insufficient evidence to roll out the feature.

---



In [None]:
from scipy.stats import beta

def bayesian_ab_test(alpha_prior, beta_prior, successes, trials, threshold=0.95):
    """
    Compare the new feature's engagement rate against a fixed 50% benchmark.

    Note: this is a single-arm simplification. It does not compare the new
    feature against data observed for a control group.

    Parameters:
        alpha_prior (int): Alpha parameter for the prior distribution (successes).
        beta_prior (int): Beta parameter for the prior distribution (failures).
        successes (int): Number of successes for the new feature.
        trials (int): Total trials for the new feature.
        threshold (float): Probability threshold for decision-making.

    Returns:
        str: Decision based on posterior probability.
    """
    # Updating posterior parameters (a Beta prior with binomial data is conjugate)
    alpha_post = alpha_prior + successes
    beta_post = beta_prior + trials - successes

    # Posterior probability that the engagement rate exceeds 0.5
    prob_feature_better = 1 - beta(alpha_post, beta_post).cdf(0.5)

    if prob_feature_better > threshold:
        return f"Roll out the feature: Posterior Probability = {prob_feature_better:.2f}"
    else:
        return f"Hold off: Posterior Probability = {prob_feature_better:.2f}"

# Example: Prior belief and observed data
alpha_prior = 1
beta_prior = 1
successes = 65  # Engagements
trials = 100    # Total users

# Perform Bayesian A/B test
decision = bayesian_ab_test(alpha_prior, beta_prior, successes, trials, threshold=0.95)
print(decision)


Roll out the feature: Posterior Probability = 1.00


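With 65 engagements out of 100 users and a Beta(1, 1) prior, the posterior is Beta(66, 36), and the probability that the engagement rate exceeds the 0.5 benchmark is above 0.99, which prints as 1.00. This is not the same quantity as the 65% posterior probability in the scenario: the example tests the new feature against a fixed 50% rate, not against the existing version's observed data, so it does not by itself show that the new feature beats the existing one.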
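
To estimate the probability that the new feature beats the existing version, both arms need their own posterior. The sketch below draws from two Beta posteriors with the standard library and counts how often the new feature's sampled rate is higher. The counts for the control group are hypothetical, since the exercise does not give any, and `prob_b_beats_a` is an illustrative name.

```python
import random


def prob_b_beats_a(successes_a, trials_a, successes_b, trials_b,
                   prior_alpha=1, prior_beta=1, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        rate_a = rng.betavariate(prior_alpha + successes_a,
                                 prior_beta + trials_a - successes_a)
        rate_b = rng.betavariate(prior_alpha + successes_b,
                                 prior_beta + trials_b - successes_b)
        if rate_b > rate_a:
            wins += 1
    return wins / n_draws


# Hypothetical data: existing version 55/100 engaged, new feature 65/100 engaged
probability = prob_b_beats_a(55, 100, 65, 100)
print(f"P(new feature is better) is approximately {probability:.2f}")
```

The resulting probability is the quantity the scenario's 65% and 55% figures refer to, and it can be compared directly against a decision threshold such as 0.95.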

# Exercise 6: Implementing Adaptive Experimentation

> You’re running a test with three different website layouts to increase user engagement. Initially, each layout gets 33% of the traffic. After the first week, Layout C shows higher engagement.
Explain how you would adjust the traffic allocation after the first week.
Describe how you would continue to adapt the experiment in the following weeks.
What challenges might you face with adaptive experimentation, and how would you address them?

---



**Adjusting Traffic Allocation After the First Week**

After the first week, when Layout C shows higher engagement, adaptive experimentation would dynamically allocate more traffic to the better-performing layout(s). One common method is Thompson Sampling, a Bayesian approach that balances exploration (testing underperforming layouts) and exploitation (favoring the best-performing layout).

Steps to adjust traffic:

1. Update Performance Beliefs: Calculate the posterior distribution of each layout’s engagement rate based on the data collected so far (e.g., using a Beta distribution).
2. Reallocate Traffic: Assign traffic in proportion to each layout's posterior probability of being the best. For example, if Layout C has a 70% chance of being the best, it should receive 70% of the traffic.
3. Continue Monitoring: Reassess traffic allocation weekly or after collecting sufficient data to update posterior probabilities.

---
**Adapting the Experiment in Following Weeks**

As the experiment progresses:

1. Frequent Updates: Update traffic allocation weekly or bi-weekly based on the latest performance data.
2. Explore Alternatives: Ensure underperforming layouts receive minimal, but non-zero, traffic to allow for potential performance changes over time.
3. Dynamic Stopping: Stop testing layouts with consistently poor performance to focus traffic on top candidates.
4. Combine Metrics: Use engagement and secondary metrics (e.g., conversions) to refine allocation if multiple outcomes are important.

---




In [None]:
import numpy as np
from scipy.stats import beta

def adaptive_allocation(successes, trials, total_traffic=100):
    """
    Allocate traffic from a single posterior draw per layout.

    This is a simplified, Thompson-style heuristic: traffic is split in
    proportion to one sampled engagement rate per layout, not in proportion
    to each layout's probability of being the best.

    Parameters:
        successes (list): List of successes (engagements) for each layout.
        trials (list): List of trials (users tested) for each layout.
        total_traffic (int): Total traffic to allocate.

    Returns:
        numpy.ndarray: Integer traffic allocation for each layout.
    """
    # Sampling one engagement rate per layout from its Beta(s + 1, t - s + 1) posterior
    sampled_means = [np.random.beta(s + 1, t - s + 1) for s, t in zip(successes, trials)]

    # Calculating allocation proportions
    probabilities = np.array(sampled_means) / np.sum(sampled_means)

    # Distributing total traffic; astype(int) truncates, so the shares
    # can sum to slightly less than total_traffic
    allocation = (probabilities * total_traffic).astype(int)
    return allocation

# Example: Data after Week 1
successes = [30, 25, 50]  # Layout A, B, C
trials = [100, 100, 100]

# Calculating new traffic allocation
allocation = adaptive_allocation(successes, trials, total_traffic=100)
print(f"Traffic Allocation: {allocation}")


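Traffic Allocation: [23 24 52]

Because each layout's share comes from a single random draw, the numbers vary from run to run, and integer truncation means they can sum to slightly less than 100 (99 here).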
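
The reallocation rule described earlier, traffic in proportion to each layout's probability of being the best, can be approximated by repeating the posterior draw many times and counting how often each layout comes out on top. The sketch below uses the standard library and adds a small exploration floor so that no layout drops to zero traffic. The 5% floor, the number of draws, and the function name are illustrative assumptions.

```python
import random


def prob_best_allocation(successes, trials, floor=0.05, n_draws=10000, seed=0):
    """Traffic shares based on each layout's estimated probability of being best.

    Assumes floor * number_of_layouts <= 1.
    """
    rng = random.Random(seed)
    n_layouts = len(successes)
    best_counts = [0] * n_layouts
    for _ in range(n_draws):
        # One draw per layout from its Beta(s + 1, t - s + 1) posterior
        draws = [rng.betavariate(s + 1, t - s + 1)
                 for s, t in zip(successes, trials)]
        best_counts[draws.index(max(draws))] += 1
    prob_best = [count / n_draws for count in best_counts]
    # Reserve `floor` for every layout, then split the rest by P(best)
    return [floor + (1 - n_layouts * floor) * p for p in prob_best]


# Same Week 1 data as above: Layout A, B, C
shares = prob_best_allocation([30, 25, 50], [100, 100, 100])
print([f"{share:.1%}" for share in shares])
```

With this data Layout C is best in almost every draw, so it receives nearly all traffic beyond the floor. That is a much sharper shift than the single-draw allocation printed above.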


**Challenges of Adaptive Experimentation**
1. Exploration vs. Exploitation Trade-Off:
Challenge: Over-allocating traffic to the leading layout early may prevent learning about underperforming layouts that could improve later.
Solution: Use algorithms like Thompson Sampling or UCB (Upper Confidence Bound) that ensure a balance between exploration and exploitation.
2. Statistical Validity:
Challenge: Frequent changes in traffic allocation can make traditional statistical methods (e.g., p-values) invalid.
Solution: Use Bayesian methods or simulation-based approaches to calculate confidence in layout performance.
3. Noise and Variability:
Challenge: Early performance differences might be due to random chance rather than true superiority.
Solution: Set a minimum data threshold before adapting traffic allocation (e.g., 500 users per layout).
4. Implementation Complexity:
Challenge: Adaptive experiments require real-time data processing and automated traffic allocation.
Solution: Use tools like multi-armed bandit frameworks or experimentation platforms (e.g., Optimizely; Google Optimize was discontinued in 2023).
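
As a point of comparison with Thompson Sampling, UCB mentioned in the first challenge can be sketched with the classic UCB1 rule: score each layout by its observed engagement rate plus a bonus that shrinks as the layout accumulates data, then send the next visitor to the highest score. This is a minimal illustration with the standard library, assuming every layout has at least one trial; `ucb1_scores` is a hypothetical helper name.

```python
from math import log, sqrt


def ucb1_scores(successes, trials):
    """UCB1 score per layout: observed rate plus an exploration bonus."""
    total_trials = sum(trials)
    return [s / t + sqrt(2 * log(total_trials) / t)
            for s, t in zip(successes, trials)]


# Week 1 data from the example above: Layout A, B, C
scores = ucb1_scores([30, 25, 50], [100, 100, 100])
next_layout = scores.index(max(scores))
print(f"Scores: {[round(score, 3) for score in scores]}")
print(f"Next visitor goes to layout index {next_layout}")

# A layout with little data gets a large bonus, which forces exploration
print(ucb1_scores([3, 50], [10, 100]))
```

Unlike Thompson Sampling, UCB1 is deterministic given the data, which can make its behavior easier to audit.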
