$$\text{Multiple Hypothesis Testing}$$

**What is P-hacking?**

*P-hacking* is the practice of running multiple hypothesis tests but reporting only the one with the smallest p-value. It is a form of the multiple comparisons problem: when we pick the lowest p-value out of many tests, both the reported p-value and the estimated effect size are likely to be biased.

Put differently, p-hacking means reporting only the results with the smallest p-values out of many tests, which distorts both the p-value and the effect size estimate. It arises precisely in the setting of multiple hypothesis testing.

#### What is Multiple Hypothesis Testing?

If we compute 100 metrics for our experiment, how many metrics would show up as statistically significant even if our feature does nothing? At a 5% significance level, the answer is around five. The problem worsens when we examine hundreds of experiments and multiple iterations per experiment. When testing many things in parallel, the number of false discoveries increases. **This is called the "multiple testing" problem.**
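The "around five" figure can be sanity-checked with a small simulation. The sketch below assumes that under a true null hypothesis each p-value is uniformly distributed on [0, 1] and that the 100 metrics are independent; the function name `count_false_positives` and the number of trials are illustrative choices, not part of any library.

```python
import random

def count_false_positives(n_metrics, alpha, rng):
    """Count metrics that look significant when every null hypothesis is true.

    Assumes p-values under the null are independent Uniform(0, 1) draws.
    """
    return sum(rng.random() < alpha for _ in range(n_metrics))

rng = random.Random(0)  # fixed seed so the simulation is repeatable
n_trials = 2000
counts = [count_false_positives(100, 0.05, rng) for _ in range(n_trials)]
avg_false_positives = sum(counts) / n_trials

# The theoretical expectation is n_metrics * alpha = 100 * 0.05 = 5.
print(f"Average false positives per experiment: {avg_false_positives:.2f}")
```

Real metrics are often correlated, so the spread around five can differ in practice, but the expected count of false positives stays at `n_metrics * alpha`.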

Therefore, we need to look at two types of errors when conducting multiple hypothesis testing:

- **Type I error** (false positive): we reject the null hypothesis when there is actually no effect.
- **Type II error** (false negative): we fail to reject the null hypothesis when there actually is an effect.



**When does multiple hypothesis testing arise?**

Multiple hypothesis testing shows up in the following ways:

- Looking at multiple metrics.
- Looking at p-values across time.
- Looking at segments of the population.
- Looking at multiple iterations of an experiment.
  - For example, if an experiment truly does nothing, running it 20 times will produce at least one p-value below 0.05 by chance about 64% of the time ($1 - 0.95^{20} \approx 0.64$, assuming independent runs).
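To see how quickly chance findings pile up, the following sketch computes the probability of at least one false positive across several independent tests. It assumes every null hypothesis is true and the tests are independent; `family_wise_error_rate` is a hypothetical helper name.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """P(at least one p-value < alpha) across n_tests independent true-null tests."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 20, 100):
    print(f"{n_tests:>3} tests -> {family_wise_error_rate(n_tests):.3f}")
```

The same arithmetic applies whether the repeated looks come from multiple metrics, multiple segments, or repeated iterations, as long as the independence assumption is roughly reasonable.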

#### How to deal with the multiple testing problem?

Two common ways to deal with the multiple testing problem are:

- Bonferroni Correction
- Benjamini-Hochberg

##### Bonferroni Correction

The conservative Bonferroni correction sets the p-value threshold for each test to the overall significance level $\alpha$ divided by the number of tests performed.

$$\alpha_{\text{Bonferroni}} = \frac{\alpha}{n}$$

- $\alpha$ is the specified significance level, usually set to 0.05
- $n$ is the number of tests performed

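For example, if we perform 100 tests at $\alpha=0.05$, the adjusted significance level is $0.05 \div 100 = 0.0005$. If all 100 null hypotheses are true and the tests are independent, the probability of at least one false positive is then:

$$\begin{aligned}
P(\text{at least one significant}) &= 1 - P(\text{no significance}) \\
&= 1 - (1-0.0005)^{100} \\
&\approx 0.049
\end{aligned}$$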
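A minimal sketch of applying the Bonferroni threshold to a list of p-values follows. The p-values are made up for illustration, `bonferroni_reject` is a hypothetical helper, and the family-wise error rate at the end assumes 100 independent tests with every null hypothesis true.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Return (per-test threshold, list of reject decisions) under Bonferroni."""
    threshold = alpha / len(pvals)
    return threshold, [p < threshold for p in pvals]

# Hypothetical p-values, made up for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
threshold, reject = bonferroni_reject(pvals)
print(f"Per-test threshold: {threshold}")
print(f"Rejected: {sum(reject)} of {len(pvals)}")

# Family-wise error rate for 100 tests at the corrected level, assuming
# independent tests and that every null hypothesis is true.
fwer_100 = 1 - (1 - 0.05 / 100) ** 100
print(f"FWER with 100 tests: {fwer_100:.4f}")
```

With ten tests the per-test threshold drops to 0.005, so only the smallest p-value in this made-up list survives the correction.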

##### Benjamini-Hochberg

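The Benjamini-Hochberg procedure controls the false discovery rate (the expected share of false positives among the rejected hypotheses) rather than the chance of any false positive at all. It uses a different p-value threshold for each test: the p-values are sorted in ascending order, and the $k$-th smallest is compared with $\frac{k}{n}\alpha$. It is less conservative than Bonferroni, but more complex and less accessible.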
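As a rough illustration of the varying thresholds, here is a plain-Python sketch of the Benjamini-Hochberg step-up rule as it is commonly stated: sort the p-values, find the largest rank $k$ whose p-value is at most $\frac{k}{n}\alpha$, and reject every hypothesis up to that rank. The function name and the p-values are hypothetical; in practice a library routine would normally be used instead.

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return a list of reject decisions (same order as pvals) under BH."""
    n = len(pvals)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: pvals[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / n) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / n * alpha:
            max_k = rank
    # Reject every hypothesis whose rank is at most max_k.
    reject = [False] * n
    for idx in order[:max_k]:
        reject[idx] = True
    return reject

# Hypothetical p-values, made up for illustration only.
pvals = [0.041, 0.001, 0.216, 0.008, 0.039, 0.042, 0.06, 0.074, 0.205, 0.212]
reject = benjamini_hochberg(pvals)
print([p for p, r in zip(pvals, reject) if r])
```

On this made-up list the procedure rejects the two smallest p-values, whereas a Bonferroni threshold of $0.05 / 10 = 0.005$ would reject only the smallest one.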

A simpler alternative is a two-step rule of thumb:

1. Separate all metrics into three groups
   - First-order metrics: those you expect to be impacted by the experiment
   - Second-order metrics: those potentially impacted by the experiment
   - Third-order metrics: those unlikely to be impacted
2. Apply tiered significance levels to each group (e.g., 0.05, 0.01 and 0.001 respectively).
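The two steps above can be sketched as a lookup from tier to significance level. The metric names, tier assignments, and p-values below are all hypothetical; only the three example levels come from the text.

```python
# Significance level per tier, taken from the example levels in the text.
TIER_ALPHA = {"first": 0.05, "second": 0.01, "third": 0.001}

def is_significant(p_value, tier):
    """Compare a metric's p-value with the significance level of its tier."""
    return p_value < TIER_ALPHA[tier]

# Hypothetical metrics: name -> (tier, observed p-value). All values are made up.
metrics = {
    "checkout_conversion": ("first", 0.03),
    "pages_per_session": ("second", 0.03),
    "support_tickets": ("third", 0.0004),
}

for name, (tier, p_value) in metrics.items():
    verdict = is_significant(p_value, tier)
    print(f"{name:<20} tier={tier:<6} p={p_value:<7} significant={verdict}")
```

Note that the same p-value of 0.03 counts as significant for a first-order metric but not for a second-order one, which is the point of the tiering.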

These rules-of-thumb are based on an interesting Bayesian interpretation: How much do you believe the Null hypothesis (H0) is true before you even run the experiment? The stronger the belief, the lower the significance level you should use.

#### When not to use multiple hypothesis testing corrections?

A p-value correction aims to reduce the number of false discoveries. However, the stricter significance threshold also increases the Type II error rate: we will fail to reject some null hypotheses that are actually false, so we will have more false negatives. If false negatives are very expensive or important, we should avoid correcting p-values during multiple hypothesis testing.
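The trade-off can be made concrete by comparing statistical power before and after a correction. The sketch below assumes a one-sided z-test and a hypothetical true effect of three standard errors; `one_sided_power` is an illustrative helper built on the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

def one_sided_power(effect_in_se, alpha):
    """Power of a one-sided z-test when the true effect is effect_in_se standard errors."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)
    # We reject when the z-statistic exceeds critical_z; under the alternative
    # the z-statistic is distributed as Normal(effect_in_se, 1).
    return 1 - std_normal.cdf(critical_z - effect_in_se)

effect_in_se = 3.0  # hypothetical true effect, in standard-error units
power_uncorrected = one_sided_power(effect_in_se, alpha=0.05)
power_bonferroni = one_sided_power(effect_in_se, alpha=0.05 / 100)
print(f"Power at alpha = 0.05:   {power_uncorrected:.2f}")
print(f"Power at alpha = 0.0005: {power_bonferroni:.2f}")
```

Under these assumptions the corrected threshold misses the real effect far more often than the uncorrected one, which is the cost to weigh against fewer false discoveries.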

#### Lab

```python
import numpy as np
from scipy.stats import norm

# draw 1000 random variates from the standard normal (the null is true for all of them)
rand_nums = norm.rvs(loc=0, scale=1, size=1000)

# one-sided (upper-tail) p-value for each variate; norm.sf(x) equals 1 - norm.cdf(x)
pvals = norm.sf(rand_nums)

# count the tests that are significant at the uncorrected level
alpha = 0.05
sig_nums = np.sum(pvals < alpha)
print(f"Significant Results Number: {sig_nums}")

# now apply the Bonferroni correction: divide alpha by the number of tests
alpha_bonferroni = alpha / len(pvals)
sig_nums = np.sum(pvals < alpha_bonferroni)
print(f"Significant Results Number after Bonferroni Correction: {sig_nums}")
```

Output of one run (the draw is unseeded, so the counts vary between runs):

```
Significant Results Number: 43
Significant Results Number after Bonferroni Correction: 0
```
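As a possible extension of the lab, the sketch below compares the two corrections on data where some effects are real, using `multipletests` from `statsmodels`. The mix of 900 null tests and 100 tests with a true shift of 3, as well as the seed, are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Hypothetical mix: 900 tests with no effect, 100 tests with a true shift of 3.
z_scores = np.concatenate([rng.normal(0, 1, size=900), rng.normal(3, 1, size=100)])
pvals = norm.sf(z_scores)  # one-sided (upper-tail) p-values

# multipletests returns the reject decisions as its first element.
reject_bonf = multipletests(pvals, alpha=0.05, method="bonferroni")[0]
reject_bh = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]

print(f"Significant without correction: {np.sum(pvals < 0.05)}")
print(f"Significant after Bonferroni:   {reject_bonf.sum()}")
print(f"Significant after BH:           {reject_bh.sum()}")
```

Benjamini-Hochberg never rejects fewer hypotheses than Bonferroni at the same $\alpha$, so it typically recovers more of the real effects at the cost of tolerating a controlled share of false discoveries.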
