---
title: "selective_inference"
author: "Matthew Stephens"
date: "2019-04-19"
output: workflowr::wflow_html
---
## Introduction
The aim here is to illustrate selective inference with correlated predictors.
```{r}
library("selectiveInference")
```
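The simulation section below also calls susie via the `susieR::` namespace, so that package needs to be installed as well. A setup sketch (my addition, not part of the original analysis):
```{r, eval=FALSE}
# Not run: one-time installation of the packages used in this analysis.
# susieR is called below via susieR::susie() without an explicit library() call.
install.packages("selectiveInference")
remotes::install_github("stephenslab/susieR")  # or install.packages("susieR") if it is on CRAN for you
```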
This function simulates a matrix `x` with highly correlated columns, and then $y = x\beta + e$ with $\beta$ having a single non-zero element, equal to $b$, at position `signal_index`.
```{r}
simdata = function(n, p, b, signal_index, cor_sd, sigma = 1){
  z = rnorm(n)
  x = matrix(rep(z, p), nrow = n) + matrix(rnorm(n * p, sd = cor_sd), nrow = n)
  beta = rep(0, p)
  beta[signal_index] = b
  y = x %*% beta + sigma * rnorm(n)
  return(list(y = y, x = x, beta = beta))
}
```
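As a side note (my addition): since each column of `x` is $z + \epsilon_j$ with $z \sim N(0,1)$ and $\epsilon_j \sim N(0, \text{cor\_sd}^2)$, the correlation between any two columns is $1/(1+\text{cor\_sd}^2)$, which roughly matches the average correlations reported in the sections below.
```{r}
# My addition: implied pairwise correlation between columns of x
# for the three cor_sd values used in this analysis.
cor_sd = c(0.01, 0.1, 0.2)
round(1 / (1 + cor_sd^2), 3)
```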
## Very high correlation
Here the correlation is very high ($>0.999$ on average). The true non-zero element is $\beta_5 = 3$. In this case the method selects the wrong variable. It was unexpected to me that the method considers the first (wrong) variable entered to be highly significant. I had thought that the presence of other highly correlated variables would make the estimate of that coefficient highly uncertain. My guess is that the method is perhaps estimating the coefficient of the selected variable in a univariate regression, rather than the multiple regression coefficient?
```{r}
set.seed(33)
dat= simdata(50,10,3,5,0.01,1)
mean(cor(dat$x))
# run forward stepwise, plot results
fsfit = fs(dat$x,dat$y)
plot(fsfit)
# compute sequential p-values and confidence intervals
# (sigma estimated from full model)
out = fsInf(fsfit)
print(out)
```
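One rough way to probe this guess (my addition, not part of the original analysis) is to compare the first selected variable's estimate and standard error in a univariate regression with those from the full multiple regression:
```{r}
# My addition: compare the marginal (univariate) fit of the first selected
# variable with its fit in the full multiple regression.
j = out$vars[1]  # index of the first variable entered by forward stepwise
summary(lm(dat$y ~ dat$x[, j]))$coefficients[2, 1:2]  # univariate: estimate, std. error
summary(lm(dat$y ~ dat$x))$coefficients[j + 1, 1:2]   # multiple regression: estimate, std. error
```
If the selective $p$ value is based on the marginal fit, any inflation of the standard error in the multiple regression (due to collinearity) would not be reflected in it.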
## High correlation
Here I try a less extreme, but still high, correlation ($0.986$ on average).
This time the method picks the right variable first.
```{r}
set.seed(33)
dat= simdata(50,10,3,5,0.1,1)
mean(cor(dat$x))
# run forward stepwise, plot results
fsfit = fs(dat$x,dat$y)
plot(fsfit)
# compute sequential p-values and confidence intervals
# (sigma estimated from full model)
out = fsInf(fsfit)
print(out)
```
## High correlation, smaller effect
Try reducing effect size to $b=1$.
```{r}
set.seed(33)
dat= simdata(50,10,1,5,0.1,1)
mean(cor(dat$x))
# run forward stepwise, plot results
fsfit = fs(dat$x,dat$y)
plot(fsfit)
# compute sequential p-values and confidence intervals
# (sigma estimated from full model)
out = fsInf(fsfit)
print(out)
```
## Lower correlation, smaller effect, repeated simulations
Try a lower correlation (about 0.95), and repeat the simulation 100 times. Each time
we store the first variable selected by forward stepwise and the corresponding $p$ value.
We also run susie. Note that it outputs large CSs (because the variables are correlated),
and 96 of the 100 CSs include the true effect variable.
```{r}
set.seed(32)
n_iter = 100
out_select = rep(0,n_iter)
out_pv = rep(0,n_iter)
out_susie_cs_size =rep(0,n_iter) # size of first CS
out_susie_cs_inc = rep(0,n_iter) # whether CS includes true effect variable
out_susie_cs_num = rep(0,n_iter) # number of CSs reported (just to check not more than 1)
for(i in 1:n_iter){
  dat = simdata(50, 10, 1, 5, 0.2, 1)
  # run forward stepwise
  fsfit = fs(dat$x, dat$y)
  # compute sequential p-values and confidence intervals
  # (sigma estimated from full model)
  out = fsInf(fsfit)
  out_select[i] = out$vars[1]
  out_pv[i] = out$pv[1]
  # run susie and record the first credible set (CS)
  s.fit = susieR::susie(dat$x, dat$y, estimate_prior_variance = TRUE)
  s.cs = susieR::susie_get_cs(s.fit)$cs
  out_susie_cs_num[i] = length(s.cs)
  s.cs = s.cs[[1]]
  out_susie_cs_size[i] = length(s.cs)
  out_susie_cs_inc[i] = is.element(5, s.cs)
}
hist(out_select)
summary(out_pv)
mean(out_select[out_pv<0.05]==5)
hist(out_susie_cs_size,breaks=seq(0.5,10.5,length=11))
mean(out_susie_cs_inc)
mean(out_susie_cs_num)
```
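As an extra summary (my addition), the cross-tabulation below shows which variable was selected first and whether its selective $p$ value fell below 0.05:
```{r}
# My addition: how often a "significant" first step picks the true variable (index 5).
table(selected = out_select, significant = out_pv < 0.05)
```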