# A/B Testing and Hypothesis Testing

https://www.youtube.com/watch?v=DUNk4GPZ9bw

## Prerequisite: P-Values

https://www.youtube.com/watch?v=vemZtEM63GY&list=PLblh5JKOoLUIcdlgu78MnlATeyx4cEVeR&index=25

- Comparing two distributions (e.g. comparing two drugs with different means for how many people each cured):
    - H_0 = distributions are the same (drugs are the same)
    - H_1 = distributions are different (drugs are different)
    
- p-values are between 0 and 1.
- The closer the p-value is to 0, the stronger the evidence against H_0, and the more confidence we have in H_1.
- Generally, a threshold of 0.05 is used for the p-value.
- This means: 
    - **if there is no difference between the two distributions (the two drugs), and if we did the experiment many times, then only 5% of those experiments would result in the wrong decision (saying they are different)**

### Same Distribution - usually large p-values:
    
![p_val_same_dist](p_values_same_dist_high_value.png)
    
### Same distribution - 5% of the time small p-values

![p_val_same_dist_low](p_values_same_dist_low_value.png)   

So, if there is no difference between the two drugs (H_0 true), 5% of the time we do the experiment we will get a p-value less than 0.05, and we would incorrectly reject H_0 (a false positive).

If we perform the experiment, and the p-value < 0.05, we decide the drugs are different.
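
The 5% false-positive rate can be checked with a small simulation. This is a minimal sketch, assuming both drugs are drawn from the same normal distribution (mean 50, s.d. 10, 30 patients per group, all made-up numbers) and compared with a two-sample t-test from the table further down:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # arbitrary fixed seed for reproducibility
n_experiments = 2000
alpha = 0.05

# Both "drugs" come from the SAME distribution, so H_0 is true by construction.
p_values = np.array([
    stats.ttest_ind(
        rng.normal(loc=50, scale=10, size=30),
        rng.normal(loc=50, scale=10, size=30),
    ).pvalue
    for _ in range(n_experiments)
])

# Share of experiments that would wrongly reject H_0
false_positive_rate = np.mean(p_values < alpha)
print(f"Share of experiments with p < {alpha}: {false_positive_rate:.3f}")
```

The printed share should land close to 0.05, and it moves with `alpha` if a different threshold is chosen.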

**Note: p-value variations:**
- If it is extremely important to be correct when we conclude the drugs are different, we can use a smaller threshold for the p-value, e.g. 0.00001 (a false positive in 1 out of 100,000 experiments when H_0 is true).
- The opposite is also true: if wrongly rejecting H_0 is not costly, we can use a larger threshold, e.g. 0.2.

**Note: how different the distributions actually are to each other:**
- A small p-value helps us decide if the drugs are different, but does not tell us **how different (effect size)** they are.
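- A small p-value can come from either a tiny or a huge difference between the distributions.
    - The sample size also affects the p-value: with a very large sample, even a tiny difference can give a small p-value, so the p-value alone is a poor guide to effect size.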
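
To illustrate the point about sample size, here is a rough sketch that holds a tiny effect fixed and only changes the number of samples. It assumes a two-sample z-test with a known common s.d. and treats the observed difference as exactly equal to the true effect; the function name and all numbers are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-tailed p-value for a difference in means, assuming a known common sd."""
    standard_error = sd * np.sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * norm.sf(abs(z))

# Same tiny effect (0.1 units against sd = 10) at growing sample sizes
for n in (100, 10_000, 1_000_000):
    p = two_sample_z_p_value(mean_diff=0.1, sd=10, n_per_group=n)
    print(f"n per group = {n:>9,}: p-value = {p:.3g}")
```

The effect size never changes, yet the p-value shrinks from far above 0.05 to far below it as `n` grows.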
    
### p-value calculation

**1. A type of distribution is chosen for the hypothesis test (e.g. t-test/ normal)**

####  Choosing the Right Test and Distribution

| What You’re Testing            | Distribution              | Common Test                        | Notes                                  |
|-------------------------------|---------------------------|------------------------------------|----------------------------------------|
| Mean (σ unknown, small n)     | **t-distribution**        | One-sample or two-sample **t-test**| Small sample or population σ unknown   |
| Mean (σ known or large n)     | **Normal (Z) distribution** | **Z-test**                        | Use when population σ is known or n is large |
| Proportions                   | **Normal (Z) distribution** | **Proportion Z-test**             | Use when sample size is large (normal approximation) |
| Categorical data              | **Chi-squared distribution** | **Chi-square test**             | Test of independence                   |
| Variances (between groups)    | **F-distribution**         | **ANOVA**, variance comparison     | ANOVA compares several group means via a ratio of variances; the F-test compares two variances |

**2. A distribution score (test statistic) is calculated (e.g. t-score/z-score), telling you:**
- How many standard errors (standard deviations of the sample mean) your sample result is away from the mean expected under the null hypothesis (negative or positive).
e.g.

**t-score:**

- Use when you have a small sample or the population s.d. is unknown.

$$
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
$$

Where:

- $\bar{x}$ = sample mean  
- $\mu_0$ = population mean under the null hypothesis  
- $s$ = sample standard deviation (the z-score uses the population s.d. $\sigma$ instead)
- $n$ = sample size


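**Note:** a t-distribution is similar to a z/normal distribution, but it has fatter tails: the standard deviation is estimated from a small sample, which adds uncertainty, so extreme scores are more likely. The larger the sample size in a t-test, the thinner the tails, and the closer the t-distribution gets to the normal.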
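
As a quick check of the t-score formula, the sketch below computes it by hand and compares it with `scipy.stats.ttest_1samp`. The sample values and `mu_0` are made up for illustration:

```python
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.4])  # made-up measurements
mu_0 = 5.0  # assumed mean under the null hypothesis

n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation (divides by n - 1)

# t = (x_bar - mu_0) / (s / sqrt(n))
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))

# The same statistic from scipy's one-sample t-test
t_scipy = stats.ttest_1samp(sample, popmean=mu_0).statistic

print(f"manual t-score: {t_manual:.4f}")
print(f"scipy t-score:  {t_scipy:.4f}")
```

Both lines should print the same value, which confirms that the library call implements the formula above.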

### Degrees of Freedom

When calculating the p-value for a t-test, you need to know the degrees of freedom, as this is an input to the Python function that computes it (for example, the `df` argument of the `scipy.stats.t` methods).

> **Degrees of freedom** = number of independent values that can vary without breaking a constraint.

---

#### 🔧 Everyday Analogy

- Packing 5 items with total weight = 100 kg
- You choose weights for 4 items freely
- The 5th item’s weight is fixed to make total 100

**Degrees of freedom** = 5 (items) − 1 (constraint) = 4

---

#### 📊 In Statistics

Sample variance formula:


$$s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2$$


Why divide by $n-1$ instead of $n$?

- After calculating the sample mean, one value is fixed.
- Only $n-1$ values can vary freely.

**Degrees of freedom** = $n - 1$

- This gives an unbiased estimate of the population variance.
- Mostly relevant when n is small, like in t-tests.
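
The effect of dividing by $n-1$ can be seen in a short simulation. This is a sketch under assumed values (normal data with population variance 4, samples of size 5); NumPy's `ddof` argument selects the divisor $n - \text{ddof}$:

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary fixed seed
population_variance = 4.0            # s.d. = 2, chosen for illustration
n = 5

# Draw many small samples and average each variance estimator across them.
samples = rng.normal(loc=0.0, scale=2.0, size=(100_000, n))
biased = samples.var(axis=1, ddof=0).mean()    # divides by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divides by n - 1

print(f"true variance:         {population_variance}")
print(f"divide by n (ddof=0):  {biased:.3f}")
print(f"divide by n-1 (ddof=1): {unbiased:.3f}")
```

Dividing by $n$ underestimates the variance by a factor of about $(n-1)/n$, while dividing by $n-1$ averages out close to the true value.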


**Note:** The z-score has the same form as the t-score, but it uses the population standard deviation $\sigma$ rather than the sample standard deviation $s$. It is appropriate when $\sigma$ is known or the sample size is large.

$$
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
$$

### Division by $\sqrt{n}$:

Because of the square root in the denominator:

$\sqrt{n}$ grows **slower** than  $n$ — this is **non-linear growth**.

This means that:
- Gains in precision (i.e., smaller standard error) **get smaller and smaller** as your sample size increases
- This phenomenon is called **diminishing returns** in sampling

![root_n](root_n.png)

#### 📉 Intuition:
- Small $n$ → each new data point **greatly reduces error**
- Large $n$ → each new data point **only slightly improves precision**
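
A few lines of arithmetic make the diminishing returns concrete. The sketch assumes a population s.d. of 10 (an arbitrary value) and prints the standard error $\sigma / \sqrt{n}$ for growing $n$:

```python
import numpy as np

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / np.sqrt(n)

sigma = 10.0  # assumed population standard deviation
for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:>6,}: standard error = {standard_error(sigma, n):.3f}")
```

Each tenfold increase in $n$ shrinks the standard error by only a factor of $\sqrt{10} \approx 3.16$, and halving the error requires four times as much data.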

**3. The score is translated to a probability (the p-value): the proportion of the distribution under the null hypothesis that lies at least as far out as the calculated score.**

- Use the normal distribution / z-score when the sample is large or the population s.d. is known

![p_value_on_graph](p_value_on_graph.png)



```python
# Use scipy to get the p-value (normal distribution):
from scipy.stats import norm

z_score = -2.5
print(f'z-score = {z_score}')

# 1-tailed test (like the graph above): area in one tail beyond |z|
p_value = 1 - norm.cdf(abs(z_score))
print(f'p_value for 1 tailed test with normal distribution: {p_value}')

# 2-tailed test (H_1: the distributions are different, and the mean (or other
# statistic) may be either larger or smaller than the original)
p_value = 2 * (1 - norm.cdf(abs(z_score)))
print(f'p_value for 2 tailed test with normal distribution: {p_value}')

# norm.cdf is the cumulative distribution function F(x) = P(X ≤ x) of the normal
# (Gaussian) distribution: the probability that the random variable X takes on
# a value less than or equal to x
```

```
z-score = -2.5
p_value for 1 tailed test with normal distribution: 0.006209665325776159
p_value for 2 tailed test with normal distribution: 0.012419330651552318
```
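
For a t-test the same translation needs the degrees of freedom. The following sketch takes one score (2.5, chosen arbitrarily) and computes the two-tailed p-value under t-distributions with different degrees of freedom, using `scipy.stats.t.sf` (the survival function, i.e. 1 − CDF):

```python
from scipy.stats import norm, t

score = 2.5  # arbitrary test statistic, used for every distribution below

# Two-tailed p-values for the same score under t-distributions with different df
p_by_df = {df: 2 * t.sf(score, df=df) for df in (5, 30, 1000)}
p_normal = 2 * norm.sf(score)

for df, p in p_by_df.items():
    print(f"t-distribution, df = {df:>4}: p-value = {p:.4f}")
print(f"normal distribution:       p-value = {p_normal:.4f}")
```

With few degrees of freedom the fatter tails give a larger p-value for the same score, and as `df` grows the result approaches the normal-distribution value.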


## A/B Test Purpose

To determine whether a change in a metric is because of random chance or because of the change you have implemented. 


## 1. Problem Statement

What is the business goal of the experiment? What is the success metric?

## 🧪 Problem for Our A/B Test
**Product:** An e-commerce website  
**Test Feature:** A new "Buy Now" button design  
**Goal Metric:** Conversion rate (i.e., % of visitors who make a purchase)  

We want to determine whether a redesigned 'Buy Now' button (Version B) increases the conversion rate on our e-commerce site compared to the old design (Version A).

Success Metric:
$$Conversion\_rate =  \frac{number\_of\_purchases}{number\_of\_visitors}.$$

## User Funnel Example: (funnels down)


![User Funnel](user_funnel.png)


## 2. Hypothesis Testing

What result do you hypothesise from the experiment?
What are the null and alternate hypotheses?
Set up parameter values such as the significance level and statistical power.

**Null Hypothesis:**  
There is no difference between the new button and the original:  
$$H_0: p_{A} = p_{B}$$  
**Alternate Hypothesis:**  
The new button has a larger conversion rate compared to the original:  
$$H_1: p_{B} > p_{A}$$

As the alternate hypothesis is "greater than" ($>$) rather than "not equal" ($\ne$), this is a one-tailed test.


**Significance Level (α):** 0.05  
- The threshold you set for rejecting the null hypothesis.
- It represents the **Probability of making a Type I Error - rejecting the null hypothesis when it is actually true** (a false positive)
- 0.05 means there is a 5% chance of a false positive when the null hypothesis is true
- If the p-value is less than 0.05, we reject the null hypothesis: a result at least this extreme would occur less than 5% of the time if the null were true

![significance_level](significance_level_graph.png)

**Statistical Power:** 0.8
- Power is the probability that we correctly reject the null hypothesis when it is false
- When we are comparing two distributions that have little overlap, the power is highest, as the averages of samples taken from each distribution will result in low p-values when compared. This leads to a high probability of rejecting H0.
- If distributions overlap a lot and we have a small sample size, the power will be low
- Power can be increased by increasing the sample size

**Minimum Detectable Effect (MDE):** Let’s say a 10% relative lift is meaningful.
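
Together, the significance level, power and MDE determine how many visitors each variant needs. The sketch below uses the standard normal-approximation formula for a one-tailed two-proportion test, $n = (z_{1-\alpha} + z_{\text{power}})^2 \, [p_A(1-p_A) + p_B(1-p_B)] / (p_B - p_A)^2$. The 10% baseline conversion rate and the function name are assumptions, as the baseline is not stated above:

```python
import math
from scipy.stats import norm

def sample_size_per_group(p_a, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant for a one-tailed two-proportion z-test."""
    p_b = p_a * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha)  # one-tailed critical value
    z_power = norm.ppf(power)
    variance_sum = p_a * (1 - p_a) + p_b * (1 - p_b)
    n = (z_alpha + z_power) ** 2 * variance_sum / (p_b - p_a) ** 2
    return math.ceil(n)

baseline_rate = 0.10  # ASSUMED baseline conversion rate for Version A
n_required = sample_size_per_group(baseline_rate, relative_lift=0.10)
print(f"Visitors needed per variant: {n_required:,}")
```

A smaller MDE or a lower baseline rate pushes the required sample size up quickly, because the difference $p_B - p_A$ enters the formula squared.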

## Statistical Power (1 − β)

- Power is the probability that we correctly reject the null hypothesis when it is false, avoiding a **Type II Error - failing to reject the null hypothesis when it is actually false** (a false negative)
- The probability of committing a Type II error is called beta (β), so power = 1 − β.
- When we are comparing two distributions that have little overlap, the power is highest, as the averages of samples taken from each distribution will result in low p-values when comparing the means. This leads to a high probability of rejecting H0.
- Below, H0 is that the distributions are the same and H1 is that they are different

## High Statistical Power
![High Statistical Power](statistical_power_little_overlap.png)

- If distributions overlap a lot and we have a small sample size, the power will be low
- This can happen when the chosen 'Test Feature' makes a small change to the success metric.
- In this case, the p-value is likely to be high, so we are more likely to fail to reject H0 even though it is false
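
How power depends on sample size can be approximated in closed form. This is a rough sketch for the one-tailed two-proportion case, assuming conversion rates of 10% and 11% (hypothetical values) and the normal approximation; the function name is illustrative:

```python
import numpy as np
from scipy.stats import norm

def power_one_tailed(p_a, p_b, n_per_group, alpha=0.05):
    """Approximate power of a one-tailed two-proportion z-test (normal approximation)."""
    se = np.sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / n_per_group)
    z_alpha = norm.ppf(1 - alpha)
    # Probability that the test statistic exceeds the critical value when H_1 is true
    return norm.cdf((p_b - p_a) / se - z_alpha)

for n in (1_000, 5_000, 12_000, 50_000):
    print(f"n per group = {n:>6,}: power = {power_one_tailed(0.10, 0.11, n):.3f}")
```

With a small change in the metric, small samples leave the two sampling distributions heavily overlapping and the power low; more data narrows them and raises the power.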

## Low Statistical Power
![Low Statistical Power](statistical_power_large_overlap.png)

## 3. Design the Experiment

What are your experiment parameters?
What is the randomisation unit?
Which user type will we target for the experiment?
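
If the randomisation unit is the user, one common way to assign variants is to hash the user ID, so that the same user always sees the same version. This is only a sketch of that idea, not a prescribed method; the function name, experiment label and ID format are hypothetical:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "buy_now_button") -> str:
    """Deterministically map a user to 'A' or 'B' with roughly equal probability."""
    # Including the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Hypothetical user IDs
for user_id in ("user_1", "user_2", "user_3"):
    print(user_id, assign_variant(user_id))
```

Because the assignment depends only on the ID and the experiment name, no lookup table is needed and a returning user stays in the same group.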

## 4. Run the Experiment

What are the requirements for running the experiment?
Set up the implementation to collect data and analyse the results.

## 5. Validity Checks

Did the experiment run soundly without errors or bias?
Run sanity checks before making the launch decision.
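
One possible sanity check, offered here as an example rather than a required step, is to test whether the traffic split matches the intended 50/50 allocation using a chi-square goodness-of-fit test. The visitor counts are made up:

```python
from scipy.stats import chisquare

# Hypothetical visitor counts per variant; the design intended a 50/50 split
observed = [10_000, 10_150]
total = sum(observed)
expected = [total / 2, total / 2]

result = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square statistic = {result.statistic:.3f}, p-value = {result.pvalue:.3f}")
```

A very small p-value here would suggest the assignment or logging is biased, in which case the experiment results should not be trusted until the cause is found.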

## 6. Interpret the result

Is the change in the metric statistically and practically significant, and in which direction?
What lift did we see?
What are the p-value and the confidence interval?
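
These three questions can be answered with a one-tailed two-proportion z-test, matching the hypotheses set up in step 2. The sketch below uses invented visitor and purchase counts, a pooled standard error for the test and an unpooled one for the confidence interval:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical results, not real data
visitors_a, purchases_a = 12_000, 1_200  # Version A (control)
visitors_b, purchases_b = 12_000, 1_320  # Version B (new button)

p_a = purchases_a / visitors_a
p_b = purchases_b / visitors_b
relative_lift = (p_b - p_a) / p_a

# Pooled standard error under H_0: p_A = p_B
p_pool = (purchases_a + purchases_b) / (visitors_a + visitors_b)
se_pooled = np.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se_pooled
p_value = norm.sf(z)  # one-tailed, because H_1 is p_B > p_A

# 95% confidence interval for the absolute difference (unpooled standard error)
se_unpooled = np.sqrt(p_a * (1 - p_a) / visitors_a + p_b * (1 - p_b) / visitors_b)
margin = norm.ppf(0.975) * se_unpooled
ci = (p_b - p_a - margin, p_b - p_a + margin)

print(f"relative lift = {relative_lift:.1%}")
print(f"z = {z:.3f}, one-tailed p-value = {p_value:.4f}")
print(f"95% CI for p_B - p_A: ({ci[0]:.4f}, {ci[1]:.4f})")
```

Statistical significance comes from comparing the p-value with α, while practical significance comes from comparing the lift and its interval with the minimum detectable effect.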

## Launch Decision

Based on the results, should the change be launched?

