# Experiment on identifiability of SuSiE

This notebook addresses a concern raised by a referee.

The referee writes:

> Suppose
we have another predictor $x_5$, which is both correlated with $(x_1,
x_2)$ and $(x_3, x_4)$. Say $\mathrm{cor}(x_1, x_5) = 0.9$,
$\mathrm{cor}(x_2, x_5) = 0.7$, and $\mathrm{cor}(x_5, x_3)
= \mathrm{cor}(x_5, x_4) = 0.8$. Does the current method assign $x_5$
to the $(x_1, x_2)$ group or the $(x_3, x_4)$ group?

Here we investigate this question by simulating data with the given correlation structure over multiple replicates.

## Correlation structure between variables

First we construct a covariance matrix that satisfies the correlation structure outlined above:

In [1]:
h = 0.92
l = 0.7
cormat = rbind(c(1.0, h,   l,   l,   0.9),
               c(h,   1.0, l,   l,   0.7),
               c(l,   l,   1.0, h,   0.8),
               c(l,   l,   h,   1.0, 0.8),
               c(0.9, 0.7, 0.8, 0.8, 1.0))

where we assume high correlation between $x_1$ and $x_2$, and between $x_3$ and $x_4$ ($R_{12} = R_{34} = 0.92$). 
We set $R_{13} = R_{14} = R_{23} = R_{24} = 0.7$. A matrix specified entry by entry in this way is not guaranteed 
to be positive definite, so we find the nearest positive definite matrix for this correlation structure:

In [2]:
covmat = as.matrix(Matrix::nearPD(cormat)$mat)

In [3]:
cov2cor(covmat)

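| | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ |
|---|---|---|---|---|---|
| $x_1$ | 1.0 | 0.9166551 | 0.6995185 | 0.6995185 | 0.8965043 |
| $x_2$ | 0.9166551 | 1.0 | 0.6993032 | 0.6993032 | 0.7003526 |
| $x_3$ | 0.6995185 | 0.6993032 | 1.0 | 0.9200059 | 0.7991611 |
| $x_4$ | 0.6995185 | 0.6993032 | 0.9200059 | 1.0 | 0.7991611 |
| $x_5$ | 0.8965043 | 0.7003526 | 0.7991611 | 0.7991611 | 1.0 |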


The result still approximately satisfies the structure outlined at the beginning of this document. We use this nearest positive definite matrix as the covariance matrix in the simulations below.
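
As a quick check on why this adjustment is needed, the sketch below inspects the eigenvalues of the target matrix `cormat` using base R only. A symmetric matrix is positive definite only if all of its eigenvalues are strictly positive, so a non-positive smallest eigenvalue would explain why `nearPD` shifts some entries slightly. The names `ev` and `is_pd` are illustrative and not part of the original analysis.

```r
# Target correlation matrix, as specified above
h <- 0.92
l <- 0.7
cormat <- rbind(c(1.0, h,   l,   l,   0.9),
                c(h,   1.0, l,   l,   0.7),
                c(l,   l,   1.0, h,   0.8),
                c(l,   l,   h,   1.0, 0.8),
                c(0.9, 0.7, 0.8, 0.8, 1.0))

# Eigenvalues of a symmetric matrix (returned in decreasing order)
ev <- eigen(cormat, symmetric = TRUE, only.values = TRUE)$values

# Positive definite only if the smallest eigenvalue is strictly positive
is_pd <- min(ev) > 0
print(ev)
print(is_pd)
```

The same check can be applied to `covmat` to confirm that the output of `nearPD` is usable as a covariance matrix.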

## Simulation of features and response variables

We simulate an $X$ matrix of $N=1000$ samples and $P_1=5$ variables, 
$$X_{p} \sim MVN(0, \Sigma)$$

where $\Sigma$ is the covariance matrix defined above.

We then expand $X$ to have a total of $P=2000$ variables, where the other 1995 variables are independent: they are drawn from a multivariate normal $MVN(0,I)$, with $I$ being the identity matrix. 

The sample size and number of variables mimic a genetic association fine-mapping application with a relatively small sample.
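
The construction of $X$ described above can be sketched as follows. This is a minimal illustration and not the code used in the experiment (which lives in the repository linked below); it assumes the `MASS` and `Matrix` packages are installed, and the seed and the names `Sigma`, `X_cor`, and `X_ind` are arbitrary choices made here.

```r
set.seed(1)   # arbitrary seed, only to make this sketch reproducible
N  <- 1000    # number of samples
P1 <- 5       # number of correlated variables
P  <- 2000    # total number of variables

# Covariance matrix: nearest positive definite matrix to the target correlations
h <- 0.92
l <- 0.7
cormat <- rbind(c(1.0, h,   l,   l,   0.9),
                c(h,   1.0, l,   l,   0.7),
                c(l,   l,   1.0, h,   0.8),
                c(l,   l,   h,   1.0, 0.8),
                c(0.9, 0.7, 0.8, 0.8, 1.0))
Sigma <- as.matrix(Matrix::nearPD(cormat)$mat)

# First 5 columns: correlated draws from MVN(0, Sigma)
X_cor <- MASS::mvrnorm(N, mu = rep(0, P1), Sigma = Sigma)

# Remaining 1995 columns: independent standard normal draws, i.e. MVN(0, I)
X_ind <- matrix(rnorm(N * (P - P1)), nrow = N)

X <- cbind(X_cor, X_ind)

# Sample correlations among the first 5 variables, for comparison with cormat
print(round(cor(X[, 1:P1]), 2))
```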

We simulate the response $y$ using a linear model $y = Xb + e$, $e \sim N(0, I_N)$, where the effect size $b$ is a length-$P$ vector that is zero everywhere except:

- Scenario 1: $x_2$ and $x_3$ are effect variables and $x_5$ is not an effect variable: $b_2=b_3=1$.
- Scenario 2: $x_2$, $x_3$ and $x_5$ are all effect variables: $b_2=b_3=b_5=1$.

**Note: here we intentionally make $x_2$, rather than $x_1$, the effect variable in the first set of highly correlated variables. In scenario 2, because $x_1$ is more strongly correlated with $x_5$ than $x_2$ is, we expect that SuSiE might have trouble detecting $x_2$.**
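
The two scenarios differ only in the value of $b_5$, which the following sketch makes explicit. The helper `simulate_y` is hypothetical (it is not taken from the experiment code), and `X_demo` is a small stand-in matrix of independent normals used only so that the example runs on its own.

```r
# Hypothetical helper: simulate y = Xb + e with b2 = b3 = 1 and b5 as given.
# b5 = 0 corresponds to scenario 1, b5 = 1 to scenario 2.
simulate_y <- function(X, b5 = 0) {
  b <- rep(0, ncol(X))        # all effects are zero ...
  b[c(2, 3)] <- 1             # ... except x2 and x3
  b[5] <- b5                  # ... and x5 in scenario 2
  e <- rnorm(nrow(X))         # residuals e ~ N(0, I)
  y <- drop(X %*% b + e)      # drop() turns the N x 1 matrix into a vector
  list(y = y, b = b)
}

set.seed(1)                                      # arbitrary seed
X_demo <- matrix(rnorm(200 * 10), nrow = 200)    # stand-in for the real X
s1 <- simulate_y(X_demo, b5 = 0)                 # scenario 1
s2 <- simulate_y(X_demo, b5 = 1)                 # scenario 2
print(which(s1$b != 0))
print(which(s2$b != 0))
```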

## Analysis plan

We run 500 replicates for the above 2 scenarios. We focus our evaluation on $x_5$ and ask:

1. When $x_5$ is not an effect variable, how often is it dropped from the results, or grouped with $(x_1, x_2)$, or with $(x_3,x_4)$?
2. When $x_5$ is an effect variable, how often is it reported as a set on its own, or grouped with $(x_1, x_2)$, or with $(x_3,x_4)$?
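
To make these outcomes concrete, here is a minimal sketch of how the fate of $x_5$ could be classified in a single replicate. It assumes the credible sets are available as a list of integer index vectors (for example, something shaped like the `sets$cs` element of a `susieR` fit). The function `classify_x5` and its labels are hypothetical; the evaluation actually used in the experiment is defined in the repository linked below and may differ.

```r
# Hypothetical classifier for one replicate.
# cs_list: list of integer vectors, each holding the variable indices in one CS.
classify_x5 <- function(cs_list) {
  # Which CS, if any, contain x5?
  has_5 <- vapply(cs_list, function(cs) 5 %in% cs, logical(1))
  if (!any(has_5)) return("dropped")

  cs <- cs_list[[which(has_5)[1]]]     # first CS that contains x5
  with_12 <- any(c(1, 2) %in% cs)      # shares the CS with x1 or x2
  with_34 <- any(c(3, 4) %in% cs)      # shares the CS with x3 or x4

  if (with_12 && with_34) return("both")
  if (with_12) return("in_12")
  if (with_34) return("in_34")
  "own"                                # CS contains none of x1..x4
}

r1 <- classify_x5(list(c(1, 2), c(3, 4, 5)))   # x5 grouped with (x3, x4)
r2 <- classify_x5(list(2, 3))                  # x5 not in any CS
r3 <- classify_x5(list(2, 3, 5))               # x5 in a CS of its own
print(c(r1, r2, r3))
```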

The code for the experiments can be found at: https://github.com/stephenslab/susie-paper/tree/master/identifiability_dsc

## Results

In [6]:
setwd('../identifiability_dsc/')
res = dscrutils::dscquery('identifiability',targets = c('simulate.b5','evaluate','evaluate.has_2','evaluate.has_3','evaluate.has_5','evaluate.in_12','evaluate.in_34', 'evaluate.own', 'evaluate.mixed'), module.output.files='evaluate')
saveRDS(res, 'identifiability_20200120.rds')

Calling: dsc-query identifiability -o /tmp/Rtmpbv1sZ3/file129f2e9808bb.csv --target "simulate.b5 evaluate.has_2 evaluate.has_3 evaluate.has_5 evaluate.in_12 evaluate.in_34 evaluate.own evaluate.mixed" --force 
Loaded dscquery output table with 1 rows and 9 columns.


dscquery is returning a list because one or more outputs are complex; consider converting the list to a tibble using the "tibble" package


In [61]:
res = readRDS('identifiability_20200120.rds')
# Replace each list entry of evaluate.mixed by its length
res$evaluate.mixed = lengths(res$evaluate.mixed)
res = tibble::as_tibble(res)
head(res)

| DSC | simulate.b5 | evaluate.has_2 | evaluate.has_3 | evaluate.has_5 | evaluate.in_12 | evaluate.in_34 | evaluate.own | evaluate.mixed | evaluate.output.file |
|---|---|---|---|---|---|---|---|---|---|
| \<int\> | \<int\> | \<list\> | \<list\> | \<list\> | \<dbl\> | \<dbl\> | \<dbl\> | \<int\> | \<chr\> |
| 1 | 0 | FALSE, TRUE, FALSE | TRUE, FALSE, FALSE | FALSE, FALSE, TRUE | 0 | 1 | 0 | 0 | evaluate/simulate_1_susie_1_evaluate_1 |
| 1 | 1 | FALSE, FALSE | TRUE, FALSE | FALSE, FALSE | 0 | 0 | 0 | 0 | evaluate/simulate_2_susie_1_evaluate_1 |
| 2 | 0 | TRUE, FALSE | FALSE, TRUE | FALSE, FALSE | 0 | 0 | 0 | 0 | evaluate/simulate_3_susie_1_evaluate_1 |
| 2 | 1 | FALSE, FALSE | FALSE, TRUE | FALSE, FALSE | 0 | 0 | 0 | 0 | evaluate/simulate_4_susie_1_evaluate_1 |
| 3 | 0 | TRUE, FALSE | FALSE, TRUE | FALSE, FALSE | 0 | 0 | 0 | 0 | evaluate/simulate_5_susie_1_evaluate_1 |
| 3 | 1 | FALSE, FALSE | FALSE, TRUE | FALSE, FALSE | 0 | 0 | 0 | 0 | evaluate/simulate_6_susie_1_evaluate_1 |


### When $x_5$ is not an effect variable

In [57]:
b5 = 0
# Keep only the replicates simulated under this scenario
sub = res[which(res$simulate.b5 == b5), ]
n_rep = nrow(sub)
# Number of replicates in which each outcome for x5 occurs
in_12 = length(which(sub$evaluate.in_12 > 0))
in_34 = length(which(sub$evaluate.in_34 > 0))
own = length(which(sub$evaluate.own > 0))
mixed = length(which(sub$evaluate.mixed > 0))
# Total CS count is taken as the total number of entries in evaluate.has_2
total_cs = sum(lengths(sub$evaluate.has_2))
# Number of replicates in which at least one CS contains x2 (resp. x3)
has_2 = sum(sapply(sub$evaluate.has_2, any))
has_3 = sum(sapply(sub$evaluate.has_3, any))
out = c(n_rep, total_cs, has_2, has_3, in_12, in_34, own, mixed)
names(out) = c('replicates', 'total CS reported', 'replicates X2 is detected', 'replicates X3 is detected', 'X5 in CS with X1 and X2', 'X5 in CS with X3 and X4', 'X5 on its own', '(X1,X2) and (X3,X4) are mixed up')

In [58]:
out

The result looks good.

### When $x_5$ is an effect variable

In [59]:
b5 = 1
# Keep only the replicates simulated under this scenario
sub = res[which(res$simulate.b5 == b5), ]
n_rep = nrow(sub)
# Number of replicates in which each outcome for x5 occurs
in_12 = length(which(sub$evaluate.in_12 > 0))
in_34 = length(which(sub$evaluate.in_34 > 0))
own = length(which(sub$evaluate.own > 0))
mixed = length(which(sub$evaluate.mixed > 0))
# Total CS count is taken as the total number of entries in evaluate.has_2
total_cs = sum(lengths(sub$evaluate.has_2))
# Number of replicates in which at least one CS contains x2 (resp. x3)
has_2 = sum(sapply(sub$evaluate.has_2, any))
has_3 = sum(sapply(sub$evaluate.has_3, any))
out = c(n_rep, total_cs, has_2, has_3, in_12, in_34, own, mixed)
names(out) = c('replicates', 'total CS reported', 'replicates X2 is detected', 'replicates X3 is detected', 'X5 in CS with X1 and X2', 'X5 in CS with X3 and X4', 'X5 on its own', '(X1,X2) and (X3,X4) are mixed up')

In [60]:
out

The result is slightly problematic but is somewhat expected: 

1. Many CS incorrectly capture the non-effect variable $x_1$ instead of the true effect variable $x_2$.
    - This is expected because the non-effect variable $x_1$ is correlated with both effect variables $x_2$ and $x_5$, and is more strongly correlated with $x_5$ than $x_2$ is.
2. The effect variable $x_5$ is often not detected in a 95% CS.
    - This is expected because $x_5$ is highly correlated with the other effect variables, and its contribution after accounting for the other variables is not enough to stand out as a 95% CS.
    - In this setting $x_5$ still does not end up grouped with either $(x_1,x_2)$ or $(x_3,x_4)$.
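
A rough way to see why $x_1$ competes with the true effect variables in scenario 2 is to compute the population covariance between each of the five variables and the signal $x_2 + x_3 + x_5$. The sketch below does this from the target correlation matrix `cormat`, which is a simplifying assumption: the simulation uses the `nearPD` version, whose entries differ slightly. It only looks at marginal association and is not how SuSiE itself constructs credible sets.

```r
# Target correlation matrix, as specified at the start of the notebook
h <- 0.92
l <- 0.7
cormat <- rbind(c(1.0, h,   l,   l,   0.9),
                c(h,   1.0, l,   l,   0.7),
                c(l,   l,   1.0, h,   0.8),
                c(l,   l,   h,   1.0, 0.8),
                c(0.9, 0.7, 0.8, 0.8, 1.0))

# Scenario 2 effects on the first five variables: b2 = b3 = b5 = 1
b <- c(0, 1, 1, 0, 1)

# With unit variances, cov(x_j, Xb) is the j-th entry of cormat %*% b
marginal <- drop(cormat %*% b)
names(marginal) <- paste0("x", 1:5)
print(marginal)
```

Under these assumptions $x_1$ has the largest marginal covariance with the signal (2.52, compared with 2.40 for $x_2$ and 2.50 for $x_5$), even though its own effect is zero.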