# Bayesian Multiple Regression for GWAS

# Rohan L. Fernando

# June 25, 2018

## Outline

- Controlling false positives

  - genomewise error rate (GER)

  - posterior error rate (PER)

- PER can be controlled by using Bayesian posterior probabilities


## Hypothesis testing: classical approach



- Want a test to reject $H_0$ in favor of $H_a$

   - Example: 
   
       - $H_0$: region $R$ of genome has no effect on the trait
       
       - $H_a$: region $R$ of genome has an effect on the trait
       
- Test is constructed such that:       
       
    - If $H_0$ is true, probability of rejection is low (usually $\le 0.05$)
    
        -  This is the type I error rate

    - If $H_a$ is true, probability of rejection is high
    
        -  This is the power of the test 
   

## Distribution of test statistic:

$T$ is the value of the test statistic


In [1]:
using Plots
#plotlyjs()
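# Simulated test statistics: standard normal under H0, shifted by 6 under Ha
To = randn(10000)
Ta = randn(10000) .+ 6;   # broadcast the shift (required in Julia 1.x)

# Reject H0 when |T| > 2.5; the same two-sided rule is applied to both samples
typeIa = sum(abs.(To) .> 2.5)/size(To,1)
powera = sum(abs.(Ta) .> 2.5)/size(Ta,1)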
println("typeI error rate = ",typeIa," Power = ", powera)

histogram([To Ta],
title="",
xlabel = "Value of T", 
ylabel = "Frequency",
legend = :best,
label=["Ho" "Ha"])
png("testStat1")

typeI error rate = 0.0124 Power = 1.0


<img src="testStat1.png" />

## Type I error rate and power

If $H_0$ is rejected when $|T| > 2.5$

  - type I error rate is: 0.0124

  - power is: 1.0
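
The simulated rates can be checked against exact normal tail probabilities. The following is a minimal sketch, assuming $T \sim N(0,1)$ under $H_0$ and $T \sim N(6,1)$ under $H_a$, as in the simulation; the helper `two_sided_tail` is a name introduced here for illustration.

```julia
using Distributions

# Assumption: T ~ N(0, 1) under H0 and T ~ N(6, 1) under Ha, as in the simulation
threshold = 2.5
null_dist = Normal(0, 1)
alt_dist  = Normal(6, 1)

# P(|T| > c) = P(T > c) + P(T < -c)
two_sided_tail(d, c) = ccdf(d, c) + cdf(d, -c)

type1_exact = two_sided_tail(null_dist, threshold)
power_exact = two_sided_tail(alt_dist, threshold)

println("exact type I error rate = ", round(type1_exact, digits=4))
println("exact power = ", round(power_exact, digits=4))
```

The simulated values vary from run to run around these exact probabilities because no random seed is fixed.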

## Distribution of test statistic:

$T$ is the value of the test statistic


In [4]:
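# Same test, but the statistic is shifted by only 2 under Ha
To = randn(10000)
Ta = randn(10000) .+ 2;   # broadcast the shift (required in Julia 1.x)

# Reject H0 when |T| > 2.5; the same two-sided rule is applied to both samples
typeIb = sum(abs.(To) .> 2.5)/size(To,1)
powerb = sum(abs.(Ta) .> 2.5)/size(Ta,1)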
println("typeI error rate = ",typeIb," Power = ", powerb)

typeI error rate = 0.0118 Power = 0.3048


In [5]:
histogram([To Ta],
title="",
xlabel = "Value of T", 
ylabel = "Frequency",
legend = :best,
label=["Ho" "Ha"])
png("testStat2")

<img src="testStat2.png" />

## Type I error rate and power

If $H_0$ is rejected when $|T| > 2.5$

  - type I error rate is: 0.0118

  - power is: 0.3048
  
Power can be increased by increasing the sample size
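
To illustrate the effect of sample size, here is a hedged sketch under an assumption that is not part of the slides: the mean of $T$ under $H_a$ grows as $\delta\sqrt{n}$, where `delta` is a hypothetical standardized effect size chosen so that $n = 100$ reproduces the shift of 2 used above.

```julia
using Distributions

# Assumption (illustration only): T ~ N(delta * sqrt(n), 1) under Ha
delta = 0.2          # hypothetical standardized effect size
threshold = 2.5      # reject H0 when |T| > threshold

# Power of the two-sided test when T ~ N(mu, 1)
power_two_sided(mu, c) = ccdf(Normal(mu, 1), c) + cdf(Normal(mu, 1), -c)

sample_sizes = [100, 200, 400, 900]
powers = [power_two_sided(delta * sqrt(n), threshold) for n in sample_sizes]

for (n, p) in zip(sample_sizes, powers)
    println("n = ", n, "  power = ", round(p, digits=3))
end
```

Under this assumption, a ninefold increase in sample size moves the mean of $T$ from 2 to 6, which corresponds to the two simulated scenarios.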

## Type I errors in multiple tests

Consider $n$ independent tests, each with a type I error rate of 0.05 




In [52]:
using Distributions
using Rsvg

[1m[36mINFO: [39m[22m[36mPrecompiling module Rsvg.
[39m


In [75]:
y = [1 - cdf(Binomial(n,0.05),0) for n = 1:100]
plot(1:100,y, legend = false,
title="Probability of one or more rejections",
xlabel = "Number of tests", 
ylabel = "Probability")

In [76]:
png("multipleTest")

<img src="multipleTest.png" />

## Two solutions:

- Control probability of one or more false positives among all tests

    - Bonferroni correction 
    - Multiple test penalty

- Control proportion of false positives among rejections 

    - PER
    - PFP
    - No multiple test penalty
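
The multiple-test penalty of the first solution can be made concrete with a small sketch. It assumes independent tests with every null hypothesis true; `familywise_rate` is a helper name introduced here for illustration.

```julia
# Probability of at least one rejection among n independent tests
# when every null hypothesis is true and each test has rate alpha_per_test
familywise_rate(alpha_per_test, n) = 1 - (1 - alpha_per_test)^n

target = 0.05
test_counts = [1, 10, 100, 1000]

# Without correction every test uses the target rate;
# with the Bonferroni correction each test uses target / n
uncorrected = [familywise_rate(target, n) for n in test_counts]
bonferroni  = [familywise_rate(target / n, n) for n in test_counts]

for (i, n) in enumerate(test_counts)
    println("n = ", n, ": per-test alpha = ", target / n,
            ", uncorrected = ", round(uncorrected[i], digits=4),
            ", Bonferroni = ", round(bonferroni[i], digits=4))
end
```

The Bonferroni column stays at or below the target, but the per-test rate shrinks in proportion to the number of tests, which is the penalty the second solution avoids.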

## Posterior type I error rate (PER)

- Probability $H_0$ is true given it has been rejected

- Can think of PER as the proportion of false positives among rejections (PFP)

Let 

- $\alpha$ = type I error rate
- $(1-\beta)$ = power 

Then

$$
\text{PER} = \frac{\alpha\times\Pr(H_0)}{\alpha\times\Pr(H_0) + (1-\beta)\times[1 - \Pr(H_0)]}
$$

## Test of linkage for monogenic trait

- $\Pr(H_0) = 21/22 \approx 0.95$ (autosomal locus in humans)

- Suppose $\alpha = 0.05$ and $1-\beta = 0.95$

$$
\begin{align}
\text{PER} &= \frac{0.05\times0.95}{0.05\times0.95 + 0.95\times0.05}\\
           &= 0.5
\end{align}
$$

To reduce PER to 0.05, take $\alpha = 0.05/19 \approx 0.0026$

$$
\begin{align}
\text{PER} &= \frac{0.0026\times0.95}{0.0026\times0.95 + 0.95\times0.05}\\
           &\approx 0.05
\end{align}
$$
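
The two calculations above can be reproduced with a short sketch. The function names `per` and `alpha_for_per` are introduced here for illustration; the second one simply solves the PER formula for $\alpha$, and $\Pr(H_0)$ is rounded to 0.95 as in the example.

```julia
# PER from the type I error rate (alpha), the power, and p0 = Pr(H0)
per(alpha, power, p0) = alpha * p0 / (alpha * p0 + power * (1 - p0))

# The PER formula solved for alpha, given a target PER
alpha_for_per(target, power, p0) = target * power * (1 - p0) / ((1 - target) * p0)

p0, power = 0.95, 0.95   # rounded values used in the example above

println("PER at alpha = 0.05: ", round(per(0.05, power, p0), digits=4))

alpha_needed = alpha_for_per(0.05, power, p0)
println("alpha for PER = 0.05: ", round(alpha_needed, digits=4))
println("check: PER = ", round(per(alpha_needed, power, p0), digits=4))
```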

## FDR and PFP

- $F$: the number of false positives
- $T$: the total number of positives

$$
\text{FDR} = \text{E}\left(\frac{F}{T} \,\middle|\, T>0\right)\Pr(T>0)
$$

and

$$
\text{PFP} = \frac{\text{E}(F)}{\text{E}(T)}
$$

- Can show that controlling PER for each test to some level results in a PFP of the same value for the experiment. 
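
The claim can be illustrated by simulation. This is a rough sketch rather than a proof: it assumes independent tests that share the same $\alpha$, power and $\Pr(H_0)$ (the rounded values from the linkage example), and it uses the ratio $F/T$ from one large simulated experiment as a stand-in for $\text{E}(F)/\text{E}(T)$.

```julia
using Random

Random.seed!(2018)

alpha, power, p0 = 0.05, 0.95, 0.95   # rounded values from the linkage example
n_tests = 1_000_000

# For each test: H0 is true with probability p0; the test rejects with
# probability alpha when H0 is true and with probability power otherwise
h0_true  = rand(n_tests) .< p0
rejected = rand(n_tests) .< ifelse.(h0_true, alpha, power)

false_pos = sum(h0_true .& rejected)   # F: rejections where H0 is true
total_pos = sum(rejected)              # T: all rejections

pfp_sim     = false_pos / total_pos
per_formula = alpha * p0 / (alpha * p0 + power * (1 - p0))

println("simulated F/T = ", round(pfp_sim, digits=3),
        "   PER from formula = ", round(per_formula, digits=3))
```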

<img src="RLF.png" />

## Bayesian approach

- $\Pr(H_0)$ and $\beta$ are treated as unknown 

- Inference based on $\Pr(H_0|\mathbf{y})$

- Typically, $\Pr(H_0|\mathbf{y})$ is estimated by the proportion of MCMC samples in which $H_0$ is true
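
The counting step can be sketched as follows. Everything in this example is hypothetical: the indicator matrix `has_effect` is filled with independent random draws in place of real MCMC output (real samples would be autocorrelated), and the inclusion probabilities for the five windows are made up for illustration.

```julia
using Random, Statistics

Random.seed!(1)

n_samples = 10_000
# Hypothetical probabilities that each of five genomic windows has an effect
incl_prob = [0.99, 0.97, 0.60, 0.10, 0.02]

# has_effect[s, w] is true when window w has a nonzero effect in sample s
has_effect = rand(n_samples, length(incl_prob)) .< incl_prob'

# Pr(H0 | y) for each window: proportion of samples in which H0 is true
post_h0 = 1 .- vec(mean(has_effect, dims=1))

# Declare a window significant when Pr(H0 | y) <= 0.05
significant = findall(post_h0 .<= 0.05)

# Average posterior probability of H0 among the declared windows
est_pfp = mean(post_h0[significant])

println("Pr(H0 | y) by window: ", round.(post_h0, digits=3))
println("significant windows: ", significant)
println("estimated PFP among them: ", round(est_pfp, digits=3))
```

Because each declared window has $\Pr(H_0|\mathbf{y}) \le 0.05$, their average is also at most 0.05, with no adjustment for the number of windows tested.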


<img src="PFP.png" />

<img src="b995c995cpi.jpeg" />

<img src="title.png" />

<img src="tab1.png" />

<img src="tab2.png" />

<img src="tab3.png" />

## Summary

- When PER is used to manage false positives, there is no multiple-test
  penalty
  
- Bayesian posterior probabilities can be used to control PER

    - $\Pr(H_0)$ and the power of the test can be treated as unknown
    - Do not need to know the distribution of the test statistic
    - Simple to determine significance threshold
    
- Genomic-window-based inference in multiple regression models
    

