# Experiment on identifiability of SuSiE

This notebook addresses a concern raised by a referee.

The referee writes:

> Suppose we have another predictor $x_5$, which is both correlated with $(x_1, x_2)$ and $(x_3, x_4)$. Say $\mathrm{cor}(x_1, x_5) = 0.9$, $\mathrm{cor}(x_2, x_5) = 0.7$, and $\mathrm{cor}(x_5, x_3) = \mathrm{cor}(x_5, x_4) = 0.8$. Does the current method assign $x_5$ to the $(x_1, x_2)$ group or the $(x_3, x_4)$ group?

First we simulate an $X$ matrix of 500 samples and 5 variables with properties as outlined above,

In [1]:
h = 1
l = 0.5
cormat = rbind(c(1.0, h,   l,   l,   0.9),
               c(h,   1.0, l,   l,   0.7),
               c(l,   l,   1.0, h,   0.8),
               c(l,   l,   h,   1.0, 0.8),
               c(0.9, 0.7, 0.8, 0.8, 1.0))

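where we additionally assume high correlation (`h`) between $x_1$ and $x_2$, and between $x_3$ and $x_4$, and a correlation of `l = 0.5` between variables in different pairs. A matrix specified entry by entry like this need not be positive definite, so we replace it with the nearest positive definite matrix,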

In [2]:
covmat = Matrix::nearPD(cormat)$mat

In [3]:
covmat

5 x 5 Matrix of class "dpoMatrix"
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1.0415955 0.9727553 0.5071754 0.5071754 0.8723721
[2,] 0.9727553 1.0178450 0.4953001 0.4953001 0.7180960
[3,] 0.5071754 0.4953001 1.0012378 1.0012378 0.7952340
[4,] 0.5071754 0.4953001 1.0012378 1.0012378 0.7952340
[5,] 0.8723721 0.7180960 0.7952340 0.7952340 1.0183505
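The `nearPD` step matters here because the matrix as specified is not a valid correlation matrix: with `h = 1`, $x_1$ and $x_2$ would be perfectly correlated and would therefore need identical correlations with $x_5$, yet 0.9 and 0.7 are requested. A minimal base-R sketch of that check, using only the values from the cell above (the names `eig`, `min_eig` and `is_valid` are illustrative):

```r
# Target correlations exactly as specified in the notebook (h = 1, l = 0.5).
h <- 1
l <- 0.5
cormat <- rbind(c(1.0, h,   l,   l,   0.9),
                c(h,   1.0, l,   l,   0.7),
                c(l,   l,   1.0, h,   0.8),
                c(l,   l,   h,   1.0, 0.8),
                c(0.9, 0.7, 0.8, 0.8, 1.0))

# A valid correlation matrix has no negative eigenvalues.
eig <- eigen(cormat, symmetric = TRUE)$values
min_eig <- min(eig)
is_valid <- min_eig >= 0

print(eig)
print(is_valid)
```

A negative smallest eigenvalue confirms that some adjustment is needed before sampling. Note also that the adjusted matrix printed above has diagonal entries slightly different from 1, so it is a covariance matrix rather than a correlation matrix.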

We now simulate $X$ and assess the empirical correlation of the columns of $X$,

In [4]:
n = 500
set.seed(1)
X = MASS::mvrnorm(n=n, rep(0,nrow(covmat)), covmat)

In [5]:
cor(X)

          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1.0000000 0.9476961 0.4730130 0.4730053 0.8339206
[2,] 0.9476961 1.0000000 0.4448778 0.4448681 0.6857495
[3,] 0.4730130 0.4448778 1.0000000 1.0000000 0.7892324
[4,] 0.4730053 0.4448681 1.0000000 1.0000000 0.7892302
[5,] 0.8339206 0.6857495 0.7892324 0.7892302 1.0000000


The empirical correlations roughly agree with our simulation settings.
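To put a number on "roughly", one option is to compare the empirical correlations with the correlations implied by the adjusted covariance matrix, rather than with the original targets. The sketch below copies both printed matrices from the outputs above and rescales the covariance with base R's `cov2cor()`; the names `target`, `empirical` and `max_abs_diff` are illustrative.

```r
# Adjusted covariance matrix, copied from the printed nearPD output.
covmat <- matrix(c(
  1.0415955, 0.9727553, 0.5071754, 0.5071754, 0.8723721,
  0.9727553, 1.0178450, 0.4953001, 0.4953001, 0.7180960,
  0.5071754, 0.4953001, 1.0012378, 1.0012378, 0.7952340,
  0.5071754, 0.4953001, 1.0012378, 1.0012378, 0.7952340,
  0.8723721, 0.7180960, 0.7952340, 0.7952340, 1.0183505),
  nrow = 5, byrow = TRUE)

# Empirical correlations, copied from the printed cor(X) output.
empirical <- matrix(c(
  1.0,       0.9476961, 0.473013,  0.4730053, 0.8339206,
  0.9476961, 1.0,       0.4448778, 0.4448681, 0.6857495,
  0.473013,  0.4448778, 1.0,       1.0,       0.7892324,
  0.4730053, 0.4448681, 1.0,       1.0,       0.7892302,
  0.8339206, 0.6857495, 0.7892324, 0.7892302, 1.0),
  nrow = 5, byrow = TRUE)

# Correlations implied by the adjusted covariance (unit diagonal).
target <- cov2cor(covmat)

# Largest entrywise gap between simulated and implied correlations.
max_abs_diff <- max(abs(empirical - target))

print(round(target, 3))
print(max_abs_diff)
```

This makes visible that the adjustment lowers the implied correlation between $x_1$ and $x_2$ below the requested `h = 1`, while $x_3$ and $x_4$ remain perfectly correlated.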

Now we expand the `X` matrix to 1000 variables by appending independent standard normal columns,

In [6]:
p = 1000
X = cbind(X, matrix(rnorm(n * (p - ncol(X))), nrow=n, ncol=(p - ncol(X))))

## Experiment 1

First we assume $x_2$, $x_3$ and $x_5$ are the effect variables, and simulate the response `y`,

In [7]:
b = rep(0,p)
b[c(2,3,5)] = 1
y = X %*% b + rnorm(n)
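Written as a model, the cell above draws the response as follows (a restatement of the code, with $\varepsilon$ denoting the `rnorm(n)` noise term):

$$
y = Xb + \varepsilon, \qquad \varepsilon \sim N(0, I_n), \qquad
b_j = \begin{cases} 1 & j \in \{2, 3, 5\} \\ 0 & \text{otherwise} \end{cases}
$$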

And we analyze with SuSiE,

In [8]:
res <- susieR::susie(X,y,L=10,max_iter=1000)

In [9]:
res$sets

   min.abs.corr mean.abs.corr median.abs.corr
L1            1             1               1
L2            1             1               1


In [10]:
res$pip[c(1,2,3,4,5)]

In this analysis, three 95% credible sets (CS) are identified. The first two each contain a single effect variable, $x_2$ and $x_5$ respectively, and no other variables. Although $x_1$ is highly correlated with $x_2$ (correlation 0.948), its posterior inclusion probability is nearly zero (1E-13). The last CS contains two variables, $x_3$ and $x_4$, each with posterior inclusion probability 0.5, because they are perfectly correlated.
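The even 0.5 / 0.5 split between $x_3$ and $x_4$ follows from the fact that two identical columns carry identical evidence. The sketch below illustrates this with a simplified single-effect regression in base R. It is not the `susieR` implementation: it assumes a known residual variance `sigma2 = 1`, a fixed prior effect variance `V = 1`, a uniform prior over variables, and a toy design with one duplicated column and one unrelated column.

```r
set.seed(1)
n  <- 500
x3 <- rnorm(n)
# x4 is an exact copy of x3; "noise" is an unrelated predictor.
X  <- cbind(x3 = x3, x4 = x3, noise = rnorm(n))
y  <- x3 + rnorm(n)

sigma2 <- 1  # assumed known residual variance
V      <- 1  # assumed prior variance of the single effect

# Per-variable least-squares estimate and its sampling variance.
xtx  <- colSums(X^2)
bhat <- drop(crossprod(X, y)) / xtx
s2   <- sigma2 / xtx
z2   <- bhat^2 / s2

# Log Bayes factor of each single-variable model against the null.
log_bf <- 0.5 * log(s2 / (s2 + V)) + 0.5 * z2 * V / (s2 + V)

# Posterior probability that each variable is the one with the effect.
alpha <- exp(log_bf - max(log_bf))
alpha <- alpha / sum(alpha)

print(round(alpha, 3))
```

Because `x3` and `x4` are numerically identical, their Bayes factors coincide and the posterior weight is shared equally between them; the data cannot tell which of the two carries the effect.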

## Experiment 2

Here we assume $x_2$ and $x_3$ are effect variables, but not $x_5$,

In [10]:
b = rep(0,p)
b[c(2,3)] = 1
y = X %*% b + rnorm(n)

And we analyze with SuSiE,

In [11]:
res <- susieR::susie(X,y,L=10,max_iter=1000)
res$sets

   min.abs.corr mean.abs.corr median.abs.corr
L2            1             1               1
L1            1             1               1


In [12]:
res$pip[c(1,2,3,4,5)]

As expected, $x_5$ is dropped, with posterior inclusion probability zero.
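The reason this is expected can be sketched without SuSiE: $x_5$ is strongly associated with `y` marginally, through its correlation with $x_2$ and $x_3$, but adds nothing once those two are in the model. The base-R illustration below uses ordinary least squares via `lm()` as a stand-in, not SuSiE. It assumes only the three variables $(x_2, x_3, x_5)$, with their covariance copied from the printed `nearPD` output, and draws them through a Cholesky factor instead of `MASS::mvrnorm`.

```r
set.seed(1)
n <- 500

# Covariance of (x2, x3, x5), copied from the printed nearPD output.
S <- matrix(c(1.0178450, 0.4953001, 0.7180960,
              0.4953001, 1.0012378, 0.7952340,
              0.7180960, 0.7952340, 1.0183505),
            nrow = 3, byrow = TRUE)

# Draw correlated predictors: Z %*% chol(S) has covariance S.
X <- matrix(rnorm(n * 3), nrow = n) %*% chol(S)
colnames(X) <- c("x2", "x3", "x5")

# Same effect pattern as Experiment 2: x2 and x3 have effects, x5 does not.
y <- X[, "x2"] + X[, "x3"] + rnorm(n)

# Marginal association of x5 with y, ignoring the other predictors.
marginal_cor <- cor(X[, "x5"], y)

# Joint least-squares fit with all three predictors.
dat <- data.frame(y = y, X)
joint_coef <- coef(lm(y ~ x2 + x3 + x5, data = dat))

print(marginal_cor)
print(round(joint_coef, 3))
```

In the joint fit the coefficient on `x5` should be close to zero while those on `x2` and `x3` are close to one, which mirrors the behavior reported for SuSiE above.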