<center> 

# Multiple Testing

## Dr. Lange - University of Chicago
## Data 11800 - Winter 2024 

<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/UChicago_DSI.png" alt="UC-DSI" width="500" height="600">
    
</center>

In [None]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Multiple Testing

## Hypothesis Testing (Recap)

Ingredients:
* A null hypothesis H0
* An alternative hypothesis HA
* A test statistic (in many situations associated to a model)
* A decision or a measure of significance (P-value = P(a result at least as extreme as the one observed | H0))

<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/Errors.png" alt="Type1 and Type2 Errors" width="700">


* A Type 1 Error is rejecting the H0 when it is true.
* A Type 2 Error is failing to reject the H0 when it is false.

## Decisions: Type 1 and Type 2 Errors

Four possible results of a test (Q: What are false positives and false negatives?)

|             | H0 Not Rejected                | H0 Rejected                    |
|-------------|--------------------------------|--------------------------------|
| H0 is True  | Correct decision, Prob = 1 − α | Type 1 error, Prob = α         |
| H0 is False | Type 2 error, Prob = β         | Correct decision, Prob = 1 − β |

Power of a significance test measures its ability to detect an
alternative hypothesis: 1 − β.

<center>
<img src="https://raw.githubusercontent.com/SusannaLange/Data_118_images/main/images/Probability/error_graph.png" alt="error_dist" width="600">
    </center>

False positive - concluding that the medical treatment was effective when it really was not ---> Type I error
(rejecting the null when it is really true)


False negative - concluding that the medical treatment was not effective when it really was ---> Type II error
(failing to reject the null when it is really false)
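The two error rates and the power can be estimated by simulation. The sketch below is only an illustration: it assumes a one-sided test whose statistic is standard normal when H0 is true, and the shift `effect = 2.0` used for the "H0 is false" case is a made-up value, not something from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n_sims = 200_000
critical_value = 1.645           # approximate upper 5% point of the standard normal
effect = 2.0                     # hypothetical shift of the statistic when H0 is false

# Test statistic when H0 is true: standard normal
z_null = rng.normal(0, 1, n_sims)
# Test statistic when H0 is false: shifted by the assumed effect
z_alt = rng.normal(effect, 1, n_sims)

type1_rate = np.mean(z_null > critical_value)    # rejecting a true H0
type2_rate = np.mean(z_alt <= critical_value)    # failing to reject a false H0
power = 1 - type2_rate

print('Type 1 error rate (alpha):', type1_rate)
print('Type 2 error rate (beta):', type2_rate)
print('Power (1 - beta):', power)
```

The estimated Type 1 error rate should land close to 0.05, while the power depends entirely on the assumed effect size.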

#### Multiple hypothesis testing...

is the scenario in which we conduct several hypothesis tests at the same time.

<img src="https://raw.githubusercontent.com/SusannaLange/Data_118_images/main/images/Probability/jelly_beans.png" alt="jelly_beans" width="700">

Find the jelly bean example here: https://xkcd.com/882/

### Another example

Suppose a company Malp Pharmaceuticals is testing a new drug.


 - The company thinks the drug might help support the liver, so they perform trials on 7 groups of patients with each group having one of 7 different types of liver disease. 


 - Each group was split into a treatment group and a control group, and a one-sided hypothesis test was carried out to see if the treatment group had improved biomarkers over the control.

### The following p-values were found

|             | Group 1 | Group 2 | Group 3 | Group 4 | Group 5 | Group 6 | Group 7 |
| ----------- | :------:| :------:| :------:| :------:| :------:| :------:| :------:|
| **p-value** | 0.255014| 0.375766| 0.852235| 0.034189| 0.187971| 0.183354| 0.771669|

Note that Group 4 had improved biomarkers over the control with p-value **0.034 < 0.05**

 - the result is significant at the level 0.05. 

<code style="background:Thistle;color:black"> **Question:** Is this good evidence that the new drug works?
</code>


### Issue

Even though there's only a 5% chance of a Type I error happening with Group 4, **there's a greater than 5% chance of a Type I error occurring in at least one of the treatment groups.**

This is because we are performing **Multiple Hypothesis Tests**.

### Simulation

 - We'll assume that the drug does nothing.
 
More technically: for each group, the biomarker level for the treatment group differs randomly from the control according to a normal distribution with mean 0. 

Note: The normal distribution here can be replaced with any continuous distribution and we get the same result, because the p-value of a true null hypothesis is equally likely to land anywhere between 0 and 1.


We will simulate how often this random fluctuation would lead to the above study finding that the drug helps treat at least 1 of the 7 diseases. 

We simulate this by...
 1. graphing the null distribution (in blue)
 2. plotting 7 randomly chosen test statistics (purple points)
 3. checking whether any of those randomly chosen values is significant
 4. if yes ---> adding one to a counter

In [None]:
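# Plot a histogram of the null distribution with the 7 test statistics marked.
# Note: normdist is the global array created in the next cell, before this
# function is called.
def hist_plot(test_differences):
    '''Plot the null distribution and mark the 7 test statistics on it.'''
    plt.hist(normdist)

    # Draw each statistic as a purple point just below the bars (y = -14)
    for i in test_differences:
        plt.scatter(i, -14, color='purple', s=40)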
    plt.title('7 test statistics')
    plt.ylabel('Frequency')
    plt.xlabel('Difference in Means')
    plt.show()

In [None]:
# Helper code for the simulation; the details are not the focus of this example
# and would not be tested.
# Generate a sample from a normal distribution with mu = 0 and sigma = 1
normdist = np.random.normal(0,1,10000)

# Compute the p-value for our 1-sided test.
# It returns the proportion of values in normdist that are less than r
def pValue(r):
    '''Calculates the p-value'''
    numLess = (normdist<r).sum()
    return numLess/len(normdist)

# Run 1000 trials; each trial generates 7 random test statistics from the null distribution
significantCount = 0
for i in range(0,1000):
    test_differences = np.random.normal(0,1,7) # 7 test statistics, one per group
    test_pvalues = []
    # compute p-values for the corresponding sample means
    for r in test_differences:
        test_pvalues.append(pValue(r))   
    #this graphs one of the 1000 graphs to visualize    
    if i == 500:
        hist_plot(test_differences)
    # Check if at least 1 of the trials would have found 
    # a result significant at the 5% level
    if min(test_pvalues) < 0.05:
        significantCount += 1
        
print('Probability of at least one Type I error (assuming all nulls are true):', significantCount/1000)

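 - We see that at least one of the 7 tests comes out significant at the 5% level about 30% of the time, even though the drug does nothing (30.6% in the run recorded here; the exact value changes from run to run). 



 - Therefore, a single significant group out of 7 is not good statistical evidence that the drug does anything at all. 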
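The earlier note that the normal distribution can be swapped for another one can be checked with a small variation of the simulation. This is a sketch under assumptions: the exponential distribution is an arbitrary stand-in chosen for being skewed, and the vectorized layout (one row per simulated study) is just one convenient way to write it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical skewed null distribution (exponential) instead of the normal
null_sample = np.sort(rng.exponential(1.0, 10_000))

n_trials = 10_000
# Each row is one simulated study: 7 test statistics drawn from the same null
stats = rng.exponential(1.0, size=(n_trials, 7))

# p-value = proportion of the null sample that is less than each statistic
p_values = np.searchsorted(null_sample, stats, side='left') / len(null_sample)

# A study "finds something" if its smallest p-value is below 0.05
any_significant = (p_values.min(axis=1) < 0.05).mean()
print('Proportion of studies with at least one significant result:', any_significant)
```

The proportion should again come out near 30%, up to simulation noise.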

### How likely is a Type I error in general?

In general, if we perform $m$ independent hypothesis tests and every null hypothesis is true, what is the probability of at least 1 false positive? Assume a statistical significance level of $\alpha = 0.05$.

**Goal:** P(Making at least 1 error in $m$ tests) = 1 - P(Not making an error in $m$ tests)


P(Making an error in one test, assuming the null hypothesis is true) $ = \alpha$

P(Not making an error in one test) $ = 1 - \alpha$

P(Not making an error in $m$ tests) $ = (1 - \alpha)^m$ (we multiply because the tests are independent)


P(Making at least 1 error in $m$ tests) $= 1 - (1 - \alpha)^m$

For our example above with $m=7$ and $\alpha = 0.05$ we have $1-(1-0.05)^7 \approx 0.30$ 


Thus there is about a 30% chance of making at least one Type I error!

In [None]:
1-(1-0.05)**7
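To see how quickly this probability grows with the number of tests, the same formula can be wrapped in a small function. The function name and the list of values of $m$ below are illustrative choices, and the formula assumes independent tests with every null true.

```python
def prob_at_least_one_error(m, alpha=0.05):
    """P(at least one Type I error) in m independent tests when every null is true."""
    return 1 - (1 - alpha) ** m

# Tabulate the probability for a few numbers of tests
for m in [1, 7, 20, 100]:
    print(m, 'tests:', round(prob_at_least_one_error(m), 3))
```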

## False positives


<img src="https://raw.githubusercontent.com/SusannaLange/Data_118_images/main/images/Probability/False_error_graph.png" alt="False_error" width="700">

## Multiple Hypothesis Testing

Suppose we are testing $m$ hypotheses $H_1$, $H_2$, ..., $H_m$

|       | Fail to reject $H_0$  | Reject $H_0$ | Total|
| --------------------- |:--------------:| :-------------:| :-------------:|
| **$H_0$ True**      |  U             |     V          |    $m_0$         |
| **$H_0$ NOT True**  |     T          |   S           |       $m - m_0$    |
| Total               |       $m - R$         |   R           |      $m$         |


 - $m_0$ is the number of true null hypotheses
 
 - R is the total number of rejections
  
 - V is the number of false positives (Type I error) (also called "false discoveries")
 
 - S is the number of true positives (also called "true discoveries")
 
 - T is the number of false negatives (Type II error)

 - U is the number of true negatives

Error Rates we can control:
* Family-wise error rate (FWER): P(V > 0)
* Per family error rate: E(V)
* False Discovery Rate (FDR): E(V/R), where V/R is taken to be 0 when R = 0
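The three error rates can be estimated by simulating many studies and counting V and R in each one. Everything specific in this sketch is an assumption made for illustration: 20 tests of which 15 have a true null, a shift of 3 for the false nulls, no correction applied, and p-values computed from a simulated null sample as in the earlier simulation.

```python
import numpy as np

rng = np.random.default_rng(2)
m, m0 = 20, 15        # hypothetical: 20 tests, 15 of them with a true null
effect = 3.0          # hypothetical shift of the statistic when the null is false
alpha = 0.05
n_studies = 5_000

null_sample = np.sort(rng.normal(0, 1, 100_000))

# Columns 0..m0-1 have a true null; the remaining columns are shifted
shifts = np.concatenate([np.zeros(m0), np.full(m - m0, effect)])
stats = rng.normal(0, 1, size=(n_studies, m)) + shifts

# One-sided p-value: proportion of the null sample above each statistic
p_values = 1 - np.searchsorted(null_sample, stats, side='right') / len(null_sample)
reject = p_values < alpha

V = reject[:, :m0].sum(axis=1)   # false positives in each study
R = reject.sum(axis=1)           # total rejections in each study

fwer = np.mean(V > 0)                  # P(V > 0)
per_family = np.mean(V)                # E(V)
fdr = np.mean(V / np.maximum(R, 1))    # E(V/R), counting V/R as 0 when R = 0

print('FWER:', fwer)
print('Per family error rate:', per_family)
print('FDR:', fdr)
```

In this setup the FDR comes out smaller than the FWER, since a study with one false positive among several true discoveries counts fully toward the FWER but only fractionally toward the FDR.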

**Idea** We want to somehow control how many 'false positives' we have from multiple hypothesis testing.

## How do we do this? 

## Controlling for Type I error.

We can adjust the significance level to account for the number of hypothesis tests performed

 - so that the probability of observing at least one significant result due to chance remains below the desired significance level.

Controlling FWER (each individual test is compared to an adjusted level $\gamma$ instead of $\alpha$):

* Sidak: $\displaystyle \gamma = 1 - (1 - \alpha)^{1/m}$
* Bonferroni correction: $\displaystyle \gamma = \alpha/m$


## Sidak correction

We adjust our significance level so that our new level of significance is $\displaystyle{1 - (1 - \alpha)^{1/m}}$


That is, we reject $H_i$ if $p_i <1 - (1 - \alpha)^{1/m}$
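A short sketch of why this threshold works, under the added assumptions that the $m$ tests are independent and every null hypothesis is true, so that each $p_i$ falls below the adjusted level $\gamma = 1 - (1 - \alpha)^{1/m}$ with probability $\gamma$:

$$P(V > 0) = 1 - (1 - \gamma)^m = 1 - \left((1 - \alpha)^{1/m}\right)^m = \alpha$$

So in that idealized setting the Sidak level brings the family-wise error rate back to $\alpha$.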

## Bonferroni correction


We adjust our significance level so that our new level of significance is $\displaystyle{\frac{\alpha}{m}}$


That is, we reject $H_i$ if $p_i < \frac{\alpha}{m}$
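A sketch of why this works, assuming only that each true null hypothesis satisfies $P(p_i < \alpha/m) \le \alpha/m$ (no independence is needed). A false positive occurs when at least one of the $m_0$ true nulls is rejected, and the probability of a union of events is at most the sum of their probabilities:

$$P(V > 0) \le \sum_{i \,:\, H_i \text{ true}} P\left(p_i < \frac{\alpha}{m}\right) \le m_0 \cdot \frac{\alpha}{m} \le \alpha$$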



**For example,** if we want to have an experiment-wide Type I
error rate of 0.05 when we perform 10,000 hypothesis tests,
we'd need a p-value below 0.05/10000 = $5 \times 10^{-6}$ to declare
significance using **Bonferroni correction**.

Using **Sidak correction** we'd need a p-value below $1-(1-0.05)^{1/10000} \approx 5.129 \times 10^{-6}$.
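Both thresholds are easy to compute directly. The snippet below is a minimal check of the numbers above; the variable names are illustrative.

```python
alpha = 0.05
m = 10_000

bonferroni = alpha / m                 # alpha / m
sidak = 1 - (1 - alpha) ** (1 / m)     # 1 - (1 - alpha)^(1/m)

print('Bonferroni threshold:', bonferroni)
print('Sidak threshold:', sidak)
```

The Sidak threshold is slightly larger than the Bonferroni one, so Bonferroni is the slightly more conservative of the two.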

<code style="background:Thistle;color:black"> **In our example,** if we want to have an experiment-wide Type I
error rate of 0.05 when we perform 7 hypothesis tests, what threshold would a p-value need to fall below using Bonferroni?
</code>


Recall, we had the following p-values:

|             | Group 1 | Group 2 | Group 3 | Group 4 | Group 5 | Group 6 | Group 7 |
| ----------- | :------:| :------:| :------:| :------:| :------:| :------:| :------:|
| **p-value** | 0.255014| 0.375766| 0.852235| 0.034189| 0.187971| 0.183354| 0.771669|
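One way to work through the question is to compute both adjusted thresholds for $m = 7$ and compare each p-value from the table against them. This is a sketch; the variable names are illustrative and group numbers are printed starting from 1.

```python
import numpy as np

# p-values for Groups 1 through 7, copied from the table
p_values = np.array([0.255014, 0.375766, 0.852235, 0.034189,
                     0.187971, 0.183354, 0.771669])
alpha = 0.05
m = len(p_values)

bonferroni = alpha / m
sidak = 1 - (1 - alpha) ** (1 / m)

print('Bonferroni threshold:', round(bonferroni, 5))
print('Sidak threshold:', round(sidak, 5))
print('Significant without correction: groups', np.where(p_values < alpha)[0] + 1)
print('Significant with Bonferroni: groups', np.where(p_values < bonferroni)[0] + 1)
print('Significant with Sidak: groups', np.where(p_values < sidak)[0] + 1)
```

Group 4 is the only group below 0.05, and its p-value of 0.034 is above both corrected thresholds, so no group remains significant after either correction.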

## P-hacking

P-hacking is the idea that we can find significant associations if we perform enough experiments... and then report only those results!


<img src="https://raw.githubusercontent.com/SusannaLange/Data_118_images/main/images/Probability/p_hacking.png" alt="p_hacking" width="700">

**Note:** there are 1,066 food questions and 26 other variables with information on subjects. You can potentially look for $1{,}066 \times 26 = 27{,}716$ associations.
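To get a feel for the scale, here is a rough calculation under the assumption that none of the associations is real and each is tested at the 5% level. The Bonferroni threshold is included only for comparison.

```python
n_food_questions = 1066
n_other_variables = 26
alpha = 0.05

n_tests = n_food_questions * n_other_variables

# Expected number of false positives if every null hypothesis is true: E(V) = m * alpha
expected_false_positives = n_tests * alpha

# Per-test threshold that Bonferroni would require for this many tests
bonferroni = alpha / n_tests

print('Number of possible associations:', n_tests)
print('Expected false positives at the 5% level:', expected_false_positives)
print('Bonferroni threshold:', bonferroni)
```

Under that assumption, more than a thousand "significant" associations would be expected from chance alone.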

More information, if interested: https://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/

**To recap:** Adjusting the significance level helps to control for the fact that small p-values (less than 5%) sometimes happen by chance, which could lead you to incorrectly reject true null hypotheses.