# Experiment on identifiability of SuSiE

This notebook addresses a concern raised by a referee.

The referee writes:

> Suppose
we have another predictor $x_5$, which is both correlated with $(x_1,
x_2)$ and $(x_3, x_4)$. Say $\mathrm{cor}(x_1, x_5) = 0.9$,
$\mathrm{cor}(x_2, x_5) = 0.7$, and $\mathrm{cor}(x_5, x_3)
= \mathrm{cor}(x_5, x_4) = 0.8$. Does the current method assign $x_5$
to the $(x_1, x_2)$ group or the $(x_3, x_4)$ group?

Here we investigate the question by simulation, using the given correlation structure over multiple replicates.

## Correlation structure between variables

First we construct a covariance matrix that satisfies the correlation structure outlined above:

In [1]:
h = 0.92
l = 0.7
cormat = rbind(c(1.0, h,   l,   l,   0.9),
               c(h,   1.0, l,   l,   0.7),
               c(l,   l,   1.0, h,   0.8),
               c(l,   l,   h,   1.0, 0.8),
               c(0.9, 0.7, 0.8, 0.8, 1.0))

Here we assume high correlation between $x_1$ and $x_2$, and between $x_3$ and $x_4$ ($R_{12} = R_{34} = 0.92$).
We set $R_{13} = R_{14} = R_{23} = R_{24} = 0.7$. We then find the nearest
positive definite matrix for this correlation structure:

In [2]:
covmat = as.matrix(Matrix::nearPD(cormat)$mat)

In [3]:
cov2cor(covmat)

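| | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ |
|---|---|---|---|---|---|
| $x_1$ | 1.0 | 0.9166551 | 0.6995185 | 0.6995185 | 0.8965043 |
| $x_2$ | 0.9166551 | 1.0 | 0.6993032 | 0.6993032 | 0.7003526 |
| $x_3$ | 0.6995185 | 0.6993032 | 1.0 | 0.9200059 | 0.7991611 |
| $x_4$ | 0.6995185 | 0.6993032 | 0.9200059 | 1.0 | 0.7991611 |
| $x_5$ | 0.8965043 | 0.7003526 | 0.7991611 | 0.7991611 | 1.0 |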


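The resulting correlation matrix still closely matches the structure outlined at the beginning of the document: every entry is within 0.004 of its target value. We use this nearest positive definite matrix as the covariance matrix in the simulations below.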
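The effect of the projection can be checked numerically. The following is a minimal sketch, not part of the original notebook: it rebuilds `cormat` and `covmat` as in the cells above and compares the smallest eigenvalue before and after `nearPD`, along with the largest change in any correlation. The variable names `min_eig_before`, `min_eig_after` and `max_dev` are introduced here for illustration only.

```r
# Rebuild the target correlation matrix from the cells above.
h = 0.92
l = 0.7
cormat = rbind(c(1.0, h,   l,   l,   0.9),
               c(h,   1.0, l,   l,   0.7),
               c(l,   l,   1.0, h,   0.8),
               c(l,   l,   h,   1.0, 0.8),
               c(0.9, 0.7, 0.8, 0.8, 1.0))
covmat = as.matrix(Matrix::nearPD(cormat)$mat)

# Smallest eigenvalue before and after the projection; a negative value
# means the matrix is not positive semidefinite.
min_eig_before = min(eigen(cormat, symmetric = TRUE)$values)
min_eig_after  = min(eigen(covmat, symmetric = TRUE)$values)

# Largest absolute change in any correlation caused by the projection.
max_dev = max(abs(cov2cor(covmat) - cormat))

print(c(before = min_eig_before, after = min_eig_after, max_dev = max_dev))
```

A small `max_dev` indicates that the projected matrix remains close to the correlations proposed by the referee.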

## Simulation of features and response variables

We simulate an $X$ matrix of $N=1000$ samples and $P_1=5$ variables, 
$$X_{p} \sim MVN(0, \Sigma)$$

where $\Sigma$ is the covariance matrix defined above.

We then expand $X$ to a total of $P=2000$ variables, where the other 1995 variables are independent; they are drawn from a multivariate normal $MVN(0,I)$, with $I$ the identity matrix.

The sample size and number of variables mimic a genetic association fine-mapping application with a relatively small sample.

We simulate the response $y$ from a linear model $y = Xb + e$, $e \sim N(0, I_N)$, where the effect-size vector $b$ has length $P$ and is zero everywhere except:

- Scenario 1: $x_2$ and $x_3$ are effect variables and $x_5$ is not an effect variable: $b_2=b_3=1$.
- Scenario 2: $x_2$, $x_3$ and $x_5$ are all effect variables: $b_2=b_3=b_5=1$.
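The simulation design above can be sketched for a single replicate as follows. This is an illustrative sketch rather than the code used for the experiments: the seed, the use of `MASS::mvrnorm`, and the names `X_cor`, `X_ind`, `b1`, `b2`, `y1` and `y2` are assumptions made here.

```r
set.seed(1)  # arbitrary seed, chosen here only for reproducibility

N = 1000  # samples
P = 2000  # total number of variables

# Covariance matrix for the first 5 variables, as constructed earlier.
h = 0.92
l = 0.7
cormat = rbind(c(1.0, h,   l,   l,   0.9),
               c(h,   1.0, l,   l,   0.7),
               c(l,   l,   1.0, h,   0.8),
               c(l,   l,   h,   1.0, 0.8),
               c(0.9, 0.7, 0.8, 0.8, 1.0))
covmat = as.matrix(Matrix::nearPD(cormat)$mat)

# First 5 columns: correlated variables drawn from MVN(0, Sigma).
X_cor = MASS::mvrnorm(N, mu = rep(0, 5), Sigma = covmat)
# Remaining P - 5 columns: independent standard normal variables.
X_ind = matrix(rnorm(N * (P - 5)), nrow = N, ncol = P - 5)
X = cbind(X_cor, X_ind)

# Effect-size vectors: zero everywhere except the effect variables.
b1 = rep(0, P); b1[c(2, 3)] = 1      # Scenario 1
b2 = rep(0, P); b2[c(2, 3, 5)] = 1   # Scenario 2

# Responses y = Xb + e with e ~ N(0, I_N).
y1 = drop(X %*% b1 + rnorm(N))
y2 = drop(X %*% b2 + rnorm(N))
```

Both scenarios share the same $X$ here, so any difference between the two analyses comes from the effect of $x_5$ and the noise draw.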

## Analysis plan

We run 500 replicates for the above 2 scenarios. We focus our evaluation on $x_5$ and ask:

1. When $x_5$ is not an effect variable, how often is it dropped from the results, or grouped with $(x_1, x_2)$ or $(x_3,x_4)$?
2. When $x_5$ is an effect variable, how often is it reported as a set on its own, or grouped with $(x_1, x_2)$ or $(x_3,x_4)$?
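One way to tabulate these outcomes is to classify each replicate by where $x_5$ ends up. The helper below is a hypothetical sketch, not the evaluation code used for the experiments. It assumes the credible sets of a fit are available as a list of integer index vectors (in `susieR` this would typically be `fit$sets$cs` after calling `susieR::susie`), and that $x_5$ appears in at most one set. The function name `classify_x5` and the outcome labels are invented for illustration.

```r
# Classify the fate of variable 5 given a list of credible sets,
# each an integer vector of variable indices.
classify_x5 = function(cs_list) {
  # Keep only the credible sets that contain variable 5.
  hits = Filter(function(cs) 5 %in% cs, cs_list)
  if (length(hits) == 0) return("dropped")
  cs = hits[[1]]  # assumption: x5 falls in at most one credible set
  if (length(cs) == 1) return("own set")
  with_12 = any(c(1, 2) %in% cs)
  with_34 = any(c(3, 4) %in% cs)
  if (with_12 && !with_34) return("with (x1, x2)")
  if (with_34 && !with_12) return("with (x3, x4)")
  "other"  # e.g. a set spanning both groups
}

# Toy inputs standing in for the credible sets of three replicates.
outcomes = c(classify_x5(list(c(1, 2, 5), c(3, 4))),
             classify_x5(list(c(1, 2), c(3, 4))),
             classify_x5(list(c(2), c(3), c(5))))
print(table(outcomes))
```

Applying the helper to every replicate and tabulating the labels gives the frequencies asked for in the two questions above.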

The code for the experiments can be found at: .

## Results