# A/B testing and Hypothesis Testing

https://www.youtube.com/watch?v=DUNk4GPZ9bw

## Prerequisite: P-Values

https://www.youtube.com/watch?v=vemZtEM63GY&list=PLblh5JKOoLUIcdlgu78MnlATeyx4cEVeR&index=25

- Comparing two distributions (e.g. comparing two drugs, with different means for how many people it cured):
    - H_0 = distributions are the same (drugs are the same)
    - H_1 = distributions are different (drugs are different)
    
- p-values are between 0 and 1.
- The closer the p-value is to 0, the stronger the evidence against H_0 (and the more confidence we have in H_1).
- Generally, a threshold (significance level) of 0.05 is used.
- This means: 
    - **if there is no difference between the two distributions (the two drugs), and if we did the experiment many times, then only 5% of those experiments would result in the wrong decision (saying they are different)**

### Same Distribution - usually large p-values:
    
![p_val_same_dist](p_values_same_dist_high_value.png)
    
### Same distribution - 5% of the time small p-values

![p_val_same_dist_low](p_values_same_dist_low_value.png)   

So, if there is no difference between the two drugs (H_0 true), 5% of the time we do the experiment, we will get a p-value less than 0.05, and we would incorrectly reject H_0. (false positive).

If we perform the experiment, and the p-value < 0.05, we decide the drugs are different.
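
The 5% false positive rate can be checked with a small simulation. This is a minimal sketch, assuming normally distributed measurements with a known standard deviation of 1 and a simple two-sample z-test; the function name `two_sample_p_value` and the sample sizes are illustrative choices, not part of the original notes.

```python
import random
from statistics import NormalDist, mean

random.seed(0)

def two_sample_p_value(a, b, sigma=1.0):
    """Two-tailed z-test p-value for two samples with known, equal sigma."""
    se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

n_experiments = 2000
false_positives = 0
for _ in range(n_experiments):
    # Both "drugs" are drawn from the same distribution, so H_0 is true.
    drug_a = [random.gauss(0, 1) for _ in range(30)]
    drug_b = [random.gauss(0, 1) for _ in range(30)]
    if two_sample_p_value(drug_a, drug_b) < 0.05:
        false_positives += 1

false_positive_rate = false_positives / n_experiments
print(f"Share of experiments with p < 0.05: {false_positive_rate:.3f}")
```

The printed share should land close to 0.05, which is the false positive rate implied by the threshold.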

**Note: p-value variations:**
- If it is extremely important to be correct when concluding the drugs are different, we can use a smaller threshold for the p-value, e.g. 0.00001 (a false positive in 1 out of 100,000 experiments).
- The opposite is also true: if being correct when rejecting H_0 and accepting H_1 is not very important, we can use a larger threshold, e.g. 0.2.

**Note: how different the distributions actually are to each other:**
- A small p-value helps us decide if the drugs are different, but does not tell us **how different (effect size)** they are.
- We can have small p-values with either tiny or huge differences between the distributions
    - The sample size affects the p-value: with a very large sample, even a tiny difference can give a small p-value.
    - With a small sample, the percentage cured may be very different and still give a large p-value
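
The interaction between sample size and p-value can be illustrated with a pooled two-proportion z-test. This is a sketch under assumptions: the cure counts are made up, and `two_proportion_p_value` is a hypothetical helper written for this example.

```python
from statistics import NormalDist

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-tailed p-value of a pooled two-proportion z-test."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Tiny effect (20.0% vs 20.5% cured) measured on a huge sample
p_tiny_effect_large_n = two_proportion_p_value(200_000, 1_000_000, 205_000, 1_000_000)

# Large effect (20% vs 40% cured) measured on a very small sample
p_large_effect_small_n = two_proportion_p_value(2, 10, 4, 10)

print(f"tiny effect, n = 1,000,000 per group: p = {p_tiny_effect_large_n:.2e}")
print(f"large effect, n = 10 per group:       p = {p_large_effect_small_n:.3f}")
```

The first p-value is far below 0.05 even though the effect is small, and the second is above 0.05 even though the effect is large, so the p-value alone says nothing about effect size.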
    
### p-value calculation

**1. A type of distribution is chosen for the hypothesis test (e.g. t-test/ normal)**

####  Choosing the Right Test and Distribution

| What You’re Testing            | Distribution              | Common Test                        | Notes                                  |
|-------------------------------|---------------------------|------------------------------------|----------------------------------------|
| Mean (σ unknown, small n)     | **t-distribution**        | One-sample or two-sample **t-test**| Small sample or population σ unknown   |
| Mean (σ known or large n)     | **Normal (Z) distribution** | **Z-test**                        | Use when population σ is known or n is large |
| Proportions                   | **Normal (Z) distribution** | **Proportion Z-test**             | Use when sample size is large (normal approximation) |
| Categorical data              | **Chi-squared distribution** | **Chi-square test**             | Test of independence                   |
| Variances (between groups)    | **F-distribution**         | **ANOVA**, variance comparison     | Comparing multiple group variances     |

**Note:**
- A two-sample t-test is used when the population parameters are unknown and we have two independent samples whose means we want to compare (assuming both groups have equal variance; otherwise use Welch's t-test).
- A one-sample t-test is used when we want to compare the mean of a single sample to a known population mean (without knowing the population standard deviation).   

**2. A distribution score (test statistic) is calculated (e.g. t-score/z-score), telling you:**
- How many standard deviations your sample result is away from the mean that is expected under the null hypothesis (negative or positive) .
e.g.

**t-score:**

- Use when you have small sample or s.d is unknown. 

$$
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
$$

Where:

- $\bar{x}$ = sample mean  
- $\mu_0$ = population mean under the null hypothesis  
- $s$ = sample standard deviation (use population s.d for normal distribution)
- $n$ = sample size
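
As a worked instance of the formula above, the t-score can be computed directly with the standard library. The sample values and $\mu_0$ below are hypothetical and only serve to show the arithmetic.

```python
from math import sqrt
from statistics import mean, stdev

sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]  # hypothetical measurements
mu_0 = 5.0                               # mean under the null hypothesis

x_bar = mean(sample)
s = stdev(sample)        # sample standard deviation (divides by n - 1)
n = len(sample)

t_score = (x_bar - mu_0) / (s / sqrt(n))
degrees_of_freedom = n - 1   # needed later to turn the t-score into a p-value

print(f"x_bar = {x_bar:.3f}, s = {s:.3f}, n = {n}")
print(f"t = {t_score:.3f} with {degrees_of_freedom} degrees of freedom")
```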


**Note:** a t-distribution is similar to a z/normal distribution but has fatter tails: the standard deviation is estimated from a small sample, which adds uncertainty, so extreme values are more likely. The larger the sample size in a t-test, the thinner the tails and the closer it is to the normal distribution. 

### Degrees of Freedom

When calculating the p-value for a t-test, you need to know the degrees of freedom, as this is an input to the Python function that computes it.

> **Degrees of freedom** = number of independent values that can vary without breaking a constraint.

---

#### 🔧 Everyday Analogy

- Packing 5 items with total weight = 100 kg
- You choose weights for 4 items freely
- The 5th item’s weight is fixed to make total 100

**Degrees of freedom** = 5 (items) − 1 (constraint) = 4

---

#### 📊 In Statistics

Sample variance formula:


$$s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2$$


Why divide by $n-1$ instead of $n$?

- After calculating the sample mean, one value is fixed.
- Only $n-1$ values can vary freely.

**Degrees of freedom** = $n - 1$

- This gives an unbiased estimate of the population variance.
- Mostly relevant when n is small, like in t-tests.
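
The effect of dividing by $n-1$ can be seen in a quick simulation. This sketch assumes samples of size 5 drawn from a standard normal distribution (true variance 1); `statistics.variance` divides by $n-1$ and `statistics.pvariance` divides by $n$.

```python
import random
from statistics import mean, pvariance, variance

random.seed(1)

n = 5            # small sample, where the correction matters most
trials = 20_000

unbiased, biased = [], []
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]  # true variance is 1
    unbiased.append(variance(sample))   # divides by n - 1
    biased.append(pvariance(sample))    # divides by n

mean_unbiased = mean(unbiased)
mean_biased = mean(biased)
print(f"average estimate dividing by n - 1: {mean_unbiased:.3f}")
print(f"average estimate dividing by n:     {mean_biased:.3f}")
```

On average the $n-1$ version sits near the true variance of 1, while the $n$ version underestimates it by a factor of roughly $(n-1)/n$.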


**Note:** The z-score has the same form as the t-score above, but uses the population standard deviation $\sigma$ rather than the sample standard deviation $s$ (appropriate when $\sigma$ is known or the sample is large). 

$$
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
$$

### Division by $\sqrt{n}$:

Because of the square root in the denominator:

$\sqrt{n}$ grows **slower** than  $n$ — this is **non-linear growth**.

This means that:
- Gains in precision (i.e., smaller standard error - the denominator) **get smaller and smaller** as your sample size increases
- This phenomenon is called **diminishing returns** in sampling

![root_n](root_n.png)

#### 📉 Intuition:
- Small $n$ → each new data point **greatly reduces error**
- Large $n$ → each new data point **only slightly improves precision**
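
A few numbers make the diminishing returns concrete. The sketch below assumes a hypothetical population standard deviation of 10 and shows that quadrupling the sample size only halves the standard error.

```python
from math import sqrt

sigma = 10.0  # hypothetical population standard deviation

def standard_error(n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)

sample_sizes = [100, 400, 1600, 6400]
standard_errors = [standard_error(n) for n in sample_sizes]

for n, se in zip(sample_sizes, standard_errors):
    print(f"n = {n:>5}: standard error = {se:.3f}")
```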

**3. The score is translated to a probability (the p-value), representing the proportion of the distribution that lies at least that many standard deviations away from the mean.**

- Use the normal distribution/z-score when the sample is large or the s.d. is known

![p_value_on_graph](p_value_on_graph.png)



In [1]:
# Use scipy to get p-value (normal distribution):
from scipy.stats import norm
z_score = -2.5
print(f'z-score = {z_score}')
# 1-tailed dist (like above)
p_value = (1 - norm.cdf(abs(z_score)))
print(f'p_value for 1 tailed test with normal distribution: {p_value}')
# 2-tailed dist (H_1: distributions are different and the mean(or statistic) is either larger or smaller than the original)
p_value = 2 * (1 - norm.cdf(abs(z_score)))
print(f'p_value for 2 tailed test with normal distribution: {p_value}')

## norm.cdf uses the cumulative distribution function F(x)=P(X≤x) (based on the Gaussian equation)
## The probability that the variable X (random variable) takes on a value less than or equal to a value x


z-score = -2.5
p_value for 1 tailed test with normal distribution: 0.006209665325776159
p_value for 2 tailed test with normal distribution: 0.012419330651552318


## A/B Test Purpose

To determine whether a change in a metric is because of random chance or because of the change you have implemented. 


## 1. Problem Statement

What is the business goal of the experiment? What is the success metric? Define the user funnel.  
**Choosing a success metric:**
- **Measurable:** can it be measured with the data we have?
- **Attributable:** Can you attribute the behaviour (the effect) to the treatment (the cause, i.e. the actual change)?
- **Sensitive:** Does the metric have low variability, so that it is relatively consistent and we can distinguish between the treatment and the control group?
- **Timely:** Can the success metric be measured in the short term? (It is not cost effective if it takes a long time to measure.)

## 🧪 Problem for Our A/B Test
**Business Goal:** Increase the conversion rate from clicking “Get Started” and viewing the Get Started page to completing the Get Started questionnaire. (Improving the CTA - Call To Action button)  
**Product:** OurRitual Website - Couples therapy   
**Test Feature:** Change the button label from “Get Started” to something more emotionally engaging, e.g., “Begin Your Journey”   
**Success Metric:** Conversion rate   

**Version A:** Get Started button  
**Version B:** Begin Your Journey button

Success Metric:
$$Conversion\_rate =  \frac{total\_questionnaire\_completions}{total\_landings\_on\_page}.$$

## User Funnel:

| Step | Funnel Stage Description              | Conversion Event                                    |
|------|----------------------------------------|----------------------------------------------------|
| 1️⃣   | Landing on the Homepage or Ad          | User arrives via ad, organic link, etc.            |
| 2️⃣   | Click “Get Started” CTA                | User clicks “Get Started” or equivalent CTA        |
| 3️⃣   | Views Get Started Page                 | User is routed to the onboarding form/questionnaire|
| 4️⃣   | Begins Questionnaire (optional)        | User interacts with the form (e.g., fills first field) |
| 5️⃣   | Completes Questionnaire ✅             | User submits all answers and completes onboarding  |

Our Success Metric is the conversion from step 3 to step 5. 
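
The success metric can be computed from funnel counts as follows. The counts are hypothetical and the dictionary keys are illustrative names for funnel steps 3 and 5, not fields from a real tracking system.

```python
# Hypothetical funnel counts for one variant
funnel_counts = {
    "views_get_started_page": 4000,    # step 3
    "completes_questionnaire": 1000,   # step 5
}

def conversion_rate(counts):
    """Share of Get Started page views that end in a completed questionnaire."""
    return counts["completes_questionnaire"] / counts["views_get_started_page"]

rate = conversion_rate(funnel_counts)
print(f"Conversion rate (step 3 to step 5): {rate:.1%}")
```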

## User Funnel: Marketing Example (funnels down)


![User Funnel](user_funnel.png)


## 2. Hypothesis Testing

What result do you hypothesise from the experiment?  
What are the Null and Alternate Hypotheses?  
Set up some parameter values such as the significance level and statistical power    



**Null Hypothesis:**  
The new wording for the CTA button has no effect on the questionnaire completion rate:  
$$H_0: p_{A} = p_{B}$$  
**Alternate Hypothesis:**  
The new wording for the CTA button increases the questionnaire completion rate:  
$$H_1: p_{B} > p_{A}$$

Because the alternate hypothesis specifies a direction ($>$) rather than $\ne$, this is a one-tailed test.


**Significance Level (α):** 0.05  

**Statistical Power:** 0.8

**Minimum Detectable Effect (MDE):** A 2% absolute lift in the conversion rate is meaningful. (e.g. 25%-27%)

## Significance Level

- The threshold you set for rejecting the null hypothesis.
- If the probability of observing our sample data (or something more extreme) under H0 is below this threshold, the result is statistically significant.
- It represents the **Probability of making a Type I Error - rejecting the null hypothesis when it is actually true** (a false positive)
- 0.05 means there is a 5% chance of a Type I error when H0 is true
- If the p-value is less than 0.05, we reject H0 knowing we have at most a 5% chance of a false positive. 
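
The significance level can also be expressed as a critical z-score: the number of standard deviations beyond which we reject H0. This is a minimal sketch using the standard normal distribution; `reject_h0` is a hypothetical helper name.

```python
from statistics import NormalDist

alpha = 0.05

# Critical z-scores for the standard normal distribution
z_one_tailed = NormalDist().inv_cdf(1 - alpha)       # about 1.645
z_two_tailed = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.960

def reject_h0(p_value, alpha=0.05):
    """Decision rule: reject H0 when the p-value is below the significance level."""
    return p_value < alpha

print(f"one-tailed critical z: {z_one_tailed:.3f}")
print(f"two-tailed critical z: {z_two_tailed:.3f}")
print(f"p = 0.012 -> reject H0? {reject_h0(0.012)}")
print(f"p = 0.060 -> reject H0? {reject_h0(0.060)}")
```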

![significance_level](significance_level_graph.png)

## Statistical Power (1-β)

https://www.youtube.com/watch?v=Rsc5znwR5FA&list=PLblh5JKOoLUIcdlgu78MnlATeyx4cEVeR&index=98

**What is it?:**
- With a power of 0.8, the probability of correctly detecting a real effect/difference (when H1 is true) is 80%
- Power is the probability we will correctly reject the Null hypothesis, given that H1 is true. (accept H1 when it is true)
- It is the probability (1-β) of avoiding a **Type II Error - failing to reject the Null hypothesis when it is actually false** (a false negative); β is the probability of making that error
    - So we have a 20% chance of a false negative. This is higher than the 5% chance of a false positive (Type I error) because a false negative is usually less costly than a false positive: we simply stay with what we had before. 

**Why is it important?**  
A study with low statistical power might fail to detect a real effect, leading to a false negative conclusion. Conversely, a study with high power is more likely to correctly identify a true effect.

**Factors Affecting Power:**
- **Effect Size:** Larger effect sizes are easier to detect, increasing power. 
- **Sample Size:** Larger sample sizes reduce variability and increase power. 
- **Significance Level (Alpha):** A higher alpha (e.g., 0.05 rather than 0.01) increases power but also increases the risk of a Type I error (false positive). 
- **Variability:** Lower variability in the data increases power. 
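
The factors above can be explored with an approximate power formula for a one-tailed two-proportion z-test. This is a sketch, not an exact calculation: it assumes the normal approximation with an unpooled standard error, and `approx_power` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_a, p_b, n_per_group, alpha=0.05):
    """Approximate power of a one-tailed two-proportion z-test (H1: p_b > p_a)."""
    se = sqrt(p_a * (1 - p_a) / n_per_group + p_b * (1 - p_b) / n_per_group)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - (p_b - p_a) / se)

baseline = approx_power(0.25, 0.27, n_per_group=1000)
larger_sample = approx_power(0.25, 0.27, n_per_group=8000)
larger_effect = approx_power(0.25, 0.30, n_per_group=1000)
higher_alpha = approx_power(0.25, 0.27, n_per_group=1000, alpha=0.10)

print(f"baseline (25% vs 27%, n = 1000): {baseline:.2f}")
print(f"larger sample (n = 8000):        {larger_sample:.2f}")
print(f"larger effect (25% vs 30%):      {larger_effect:.2f}")
print(f"higher alpha (0.10):             {higher_alpha:.2f}")
```

Each change (more data, a bigger effect, a looser alpha) raises the power relative to the baseline.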


When we compare two distributions that have little overlap, the power is highest: samples drawn from each distribution will have clearly different means, giving a large test statistic and a small p-value. This leads to a high probability of rejecting H0.   

Below, H0 is that the distributions are the same and H1 is they are different

## High Statistical Power
![High Statistical Power](statistical_power_little_overlap.png)

- If distributions overlap a lot and we have a small sample size, the power will be low
- This can happen when the chosen 'Test Feature' makes a small change to the success metric.
- In this case, the p-value is likely to be high, so we are more likely to fail to reject H0

## Low Statistical Power
![Low Statistical Power](statistical_power_large_overlap.png)

Watch power analysis: https://www.youtube.com/watch?v=VX_M3tIyiYk&list=PLblh5JKOoLUIcdlgu78MnlATeyx4cEVeR&index=95

Power analysis is used to determine the appropriate sample size (it uses the statistical power and the MDE).


## P-hacking

https://www.youtube.com/watch?v=HDCOUXE3HMM

- p-hacking refers to the misuse and abuse of analysis techniques and results in being fooled by false positives.  

**How to avoid False Positives?**


### 1st p-hack: Multiple Testing Problem:
Doing a lot of hypothesis tests, and focusing only on the ones that came out significant (many of which are **False Positives**), is called the **Multiple Testing Problem** - p-hacking
- With a significance level of 5%, we expect about 5% of tests where H0 is true to result in a false positive 
- There are many ways to reduce the number of false positives (e.g. controlling the False Discovery Rate)
- These methods take ALL p-values from ALL the tests performed as input and output adjusted p-values, usually larger than the originals
- Don't cherry pick tests and only report the ones that look good, use ALL!
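
One common False Discovery Rate method is the Benjamini-Hochberg procedure. The sketch below is a plain-Python implementation written for illustration (the notes do not say which method the video uses), and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

raw_p_values = [0.01, 0.04, 0.03, 0.20]   # hypothetical results of ALL tests run
adjusted_p_values = benjamini_hochberg(raw_p_values)

for raw, adj in zip(raw_p_values, adjusted_p_values):
    print(f"raw p = {raw:.3f} -> adjusted p = {adj:.4f}")
```

Only the first test stays below 0.05 after adjustment, which shows why every p-value has to go in, not just the promising ones.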

### 2nd p-hack: Increasing sample size after small p-value calculated:
When you have a p-value close to the threshold, e.g. 0.06, and you want it to be less than 0.05, adding more data until the p-value drops to what you want (e.g. 0.02) is p-hacking.
- To avoid the above, calculate the sample size **before** the experiment. 
- This is a **power analysis**
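
The danger of adding data until the p-value looks good can be simulated. This sketch assumes H0 is true (both groups come from the same standard normal distribution), checks the p-value after every batch of 50 users per group, and compares "stop at the first significant look" with a single test at the planned end. All names and sizes are illustrative.

```python
import random
from statistics import NormalDist

random.seed(2)
STANDARD_NORMAL = NormalDist()

def p_value(sum_a, sum_b, n):
    """Two-tailed z-test p-value for equal-sized groups with known sigma = 1."""
    z = (sum_a / n - sum_b / n) / (2 / n) ** 0.5
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))

def run_experiment(looks=10, batch=50):
    """Return (significant at any look, significant at the final look) under H0."""
    sum_a = sum_b = 0.0
    n = 0
    significant_at_any_look = False
    for _ in range(looks):
        sum_a += sum(random.gauss(0, 1) for _ in range(batch))
        sum_b += sum(random.gauss(0, 1) for _ in range(batch))
        n += batch
        p = p_value(sum_a, sum_b, n)
        if p < 0.05:
            significant_at_any_look = True
    return significant_at_any_look, p < 0.05

results = [run_experiment() for _ in range(500)]
peeking_rate = sum(any_look for any_look, _ in results) / len(results)
fixed_rate = sum(final for _, final in results) / len(results)

print(f"false positive rate when peeking after every batch: {peeking_rate:.3f}")
print(f"false positive rate with one planned test:          {fixed_rate:.3f}")
```

Peeking inflates the false positive rate well above the nominal 5%, while the single planned test stays close to it.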

## Power Analysis

https://www.youtube.com/watch?v=VX_M3tIyiYk&list=PLblh5JKOoLUIcdlgu78MnlATeyx4cEVeR&index=96

How to avoid the 2nd p-hack (above)

### What is it?
A **Power Analysis** determines what sample size will ensure a high probability that we **correctly** reject the **Null Hypothesis** that there is no difference between the groups
- So, when using the **Sample size** recommended by the **Power Analysis** we know that regardless of the p-value we calculate (e.g. 0.06), we have used enough data to be confident in the result.

The main factors affecting the power are explained in the Statistical Power section above. The two main ones are:
1. **Effect Size:** Difference in final success metrics. How much overlap there is between the two distributions
    - The less overlap (e.g. the change made a big difference), the larger the power
    - The more overlap, the smaller the power
2. **The Sample Size:** the number of measurements we collect from each group
    - The larger the sample size, the larger the power
    - the smaller the sample size, the smaller the power
    
Combinations of the two factors above can provide the desired power, e.g. 0.8 or 80%. (probability of accepting H1 when it is true)
- If we have **more overlap**, we can **increase the sample size**
- If we have a **small sample size**, we need **less overlap** (a larger effect) to reach the same power

The more measurements we use to estimate the population mean (larger sample), the less effect extreme measurements have on how far the **estimated mean** is from the **Population mean**.  
So the more measurements we have, the more confidence we have in the estimated mean.

### Performing a Power Analysis
Aim is to find the sample size needed to achieve a defined power.
1. Decide how much **Power (1-β)** we want 
    - 0.8 is common: 80% chance of correctly rejecting the Null hypothesis 
2. Determine the **Significance Level** (alpha)
    - 0.05 is common: 5% chance of incorrectly rejecting the null hypothesis (False positive)
3. Estimate the overlap between the two distributions (can use MDE)
    - Affected by the difference in the population **means** AND the **Standard Deviations**
    - One way to combine the above, is to calculate an **Effect Size (d)** (many ways to do this including below)
    - Generally, the mean and standard deviations can be estimated with prior data/ research
![effect](effect_size.png)
4. Plug the power, significance level and effect size into a statistical power calculator to find the sample size
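
The four steps can be sketched with a closed-form normal approximation. Two assumptions are made here: the effect size for proportions is taken as Cohen's h (rather than the d shown in the figure), and the test is two-sided with equal group sizes. The helper names are hypothetical.

```python
from math import asin, sqrt
from statistics import NormalDist

def cohens_h(p1, p2):
    """Standardised difference between two proportions (Cohen's h)."""
    return 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))

def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # step 2: significance level
    z_power = NormalDist().inv_cdf(power)           # step 1: desired power
    effect = cohens_h(p1, p2)                       # step 3: effect size
    return 2 * ((z_alpha + z_power) / effect) ** 2  # step 4: solve for n

# Baseline 25% conversion with a 2% absolute MDE
n_required = sample_size_per_group(0.25, 0.27)
print(f"Required sample size per group: {round(n_required)}")
```

A smaller MDE shrinks the effect size and pushes the required sample size up sharply, because the effect size is squared in the denominator.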

## Minimum Detectable Effect (MDE) and Lift

**What is it?:**
- The smallest change in the success metric that would be **Practically Significant**.
- Defines the sensitivity of the experiment, indicating the smallest effect size needed for practical significance.

**Statistically Significant vs Practically Significant:**
- The experiment may be statistically significant, with a small p-value (reject H0), yet not practically significant because the change in the success metric is smaller than the MDE. 
- Here, we would reject H0 but we would not make the change.

**Lift:**
- The practical improvement of the success metric (e.g. conversion rate)
- You compare the lift to the MDE.

| 📊 **Lift Type**     | 📐 **Formula**                                       | 💡 **Example Calculation**          |
|----------------------|------------------------------------------------------|-------------------------------------|
| 🔸 **Absolute Lift** | `Conversion_B − Conversion_A`                        | `22% − 20% = 2%`                    |
| 🔹 **Relative Lift** | `(Conversion_B − Conversion_A) / Conversion_A`       | `(22% − 20%) / 20% = 10%`           |
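
The two lift formulas and the comparison against the MDE can be written as a short sketch. The conversion rates match the table's example, the 2% MDE is the one chosen earlier, and the small tolerance is an implementation detail to avoid floating-point surprises.

```python
def absolute_lift(conversion_a, conversion_b):
    """Difference in conversion rate (in proportion points)."""
    return conversion_b - conversion_a

def relative_lift(conversion_a, conversion_b):
    """Difference in conversion rate relative to the control."""
    return (conversion_b - conversion_a) / conversion_a

conversion_a, conversion_b = 0.20, 0.22
mde = 0.02  # minimum detectable effect (absolute)

abs_lift = absolute_lift(conversion_a, conversion_b)
rel_lift = relative_lift(conversion_a, conversion_b)

# 0.22 - 0.20 is not exactly 0.02 in floating point, so compare with a tolerance
meets_mde = abs_lift >= mde - 1e-9

print(f"absolute lift: {abs_lift:.1%}")
print(f"relative lift: {rel_lift:.1%}")
print(f"practically significant (lift >= MDE)? {meets_mde}")
```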





## 3. Design the Experiment

What are your experiment parameters?  
What is the randomisation unit?  
Which user type will we target for the experiment?  

1. Set the **Randomisation Unit:** the unit (here, the user) that is randomly assigned to the treatment group or the control group
2. **Target Population:** visitors who clicked the CTA button (step 3 in the user funnel)
3. Determine **Sample Size**: 
    - Can use power analysis
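
One possible way to implement user-level randomisation is deterministic hashing, so a returning user always sees the same variant. This is a sketch under assumptions: the experiment name `cta_label_test`, the `user_<n>` id format and the 50/50 split are all hypothetical.

```python
import hashlib

def assign_group(user_id: str, experiment: str = "cta_label_test") -> str:
    """Deterministically map a user to 'control' or 'treatment' (50/50 split)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100   # stable bucket in the range 0..99
    return "treatment" if bucket < 50 else "control"

assignments = [assign_group(f"user_{i}") for i in range(10_000)]
treatment_share = assignments.count("treatment") / len(assignments)

print(f"user_42 is in: {assign_group('user_42')}")
print(f"share assigned to treatment: {treatment_share:.3f}")
```

Including the experiment name in the hash keeps assignments independent across experiments that reuse the same user ids.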

In [1]:
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Define baseline conversion rate and desired minimum effect
p1 = 0.25  # baseline conversion rate
p2 = 0.27  # expected improved rate (MDE = 2%)

# Calculate effect size (Cohen's h)
effect_size = proportion_effectsize(p1, p2)

# Set parameters
alpha = 0.05  # significance level
power = 0.8   # statistical power

# Perform sample size calculation (per group)
analysis = NormalIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, power=power, alpha=alpha, ratio=1)

print(f"Required sample size per group: {round(sample_size)}")

Required sample size per group: 7548


- **proportion_effectsize()** computes the standardized difference in proportions (Cohen’s h).
    -  It converts the raw difference between two proportions (e.g., a conversion rate of 10% vs. 15%) into a standardized unit, making it comparable across different experiments. 

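- **solve_power()** tells you how many users per group are needed to detect that difference with the given alpha and power.
    - ratio = 1 assigns equal sample sizes to the control and the variant.
    - By default it assumes a two-sided alternative, so the figure above is conservative for the one-tailed hypothesis stated earlier.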

## 4. Run the Experiment

What are the requirements for running the experiment?  
Implement the instrumentation to collect data and analyse the results.

## 5. Validity Checks

Did the experiment run soundly, without errors or bias?  
Run sanity checks before making the launch decision.

## 6. Interpret the result

In which direction is the metric significant, statistically and practically?  
What is the lift that was observed?  
What are the p-value and the confidence interval?
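
These questions can be answered together with a one-tailed two-proportion z-test. The sketch below uses made-up results (they are not data from the experiment), a pooled standard error for the test and an unpooled one for the 95% confidence interval of the lift.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical results, for illustration only
conversions_a, visitors_a = 1900, 7600   # Version A: "Get Started"
conversions_b, visitors_b = 2090, 7600   # Version B: "Begin Your Journey"
mde = 0.02

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b
lift = p_b - p_a

# One-tailed z-test (H1: p_B > p_A) with a pooled standard error
pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se_pooled = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
z_score = lift / se_pooled
p_value = 1 - NormalDist().cdf(z_score)

# 95% confidence interval for the lift (unpooled standard error)
se_unpooled = sqrt(p_a * (1 - p_a) / visitors_a + p_b * (1 - p_b) / visitors_b)
margin = NormalDist().inv_cdf(0.975) * se_unpooled
ci_low, ci_high = lift - margin, lift + margin

print(f"absolute lift: {lift:.3f} (MDE = {mde})")
print(f"z = {z_score:.2f}, one-tailed p-value = {p_value:.5f}")
print(f"95% CI for the lift: ({ci_low:.3f}, {ci_high:.3f})")
```

With these made-up numbers the result is statistically significant and the observed lift exceeds the MDE, but the lower end of the interval falls below the MDE, which is worth weighing in the launch decision.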

## Launch Decision

Based on the results, should the change be launched?

