
# BIOSTAT615 Final Project Simulation Results: Three Trials
This notebook demonstrates how to use the functions in our "Qlocalstat" package on three trials from our simulation design. The trials show the (often drastic) effect of numerical instability and the different remedies it can require.

Authors: Yue Wu, Jack Li, Zhuoyu Wang

## Preamble - Prepare simulation data and R package

In [2]:
## Download the simulation data
system("gdown --id 1isIHjT8Wwq1YybW9M1aGo7PCnj2qwLS-", intern=TRUE)
system("gdown --id 14IOvot30vbr1GB8x5Sm3EUyvyBRGNTSZ", intern=TRUE)
system("gdown --id 1VPByhUbifuGxff3YjmMemPhi2HD74Wfi", intern=TRUE)
## Download the test package
system("gdown --id 18R9JSsX_aZdMSHJA5c5Yi5FBWfePo41L", intern=TRUE)
## Check that the files were downloaded successfully
print(system("ls -l", intern=TRUE))

[1] "total 1240"                                                      
[2] "-rw-r--r-- 1 root root 330155 Dec 15 06:24 19.RDS"               
[3] "-rw-r--r-- 1 root root 485475 Dec 15 06:24 500.RDS"              
[4] "-rw-r--r-- 1 root root 427977 Dec 15 06:24 620.RDS"              
[5] "-rw-r--r-- 1 root root  14665 Dec 15 06:24 Qlocalstat_1.0.tar.gz"
[6] "drwxr-xr-x 1 root root   4096 Dec 13 14:22 sample_data"          


Each of these trial files (`19.RDS`, `500.RDS`, `620.RDS`) contains a list with the following components:
1. A 200x200 LD matrix of all variants included within the locus, EUR superpopulation (`ld200_EUR`)
2. A 200x200 LD matrix of all variants included within the locus, EAS superpopulation (`ld200_EAS`)
3. The summary statistics for all 200 variants in the locus, EUR exposure (`summary_EUR_exp`)
4. The summary statistics for all 200 variants in the locus, EUR outcome (`summary_EUR_out`)
5. The summary statistics for all 200 variants in the locus, EAS outcome (`summary_EAS_out`)
6. The position of the causal variant (`causal_var`)
7. The position of the index variant in the summary statistics (`index_var`)
8. The positions of the "proxies" to the index variant, that is, variants in LD with the index variant at the $R^2 = 0.64$ threshold (`proxies`)
9. The LD matrix of the "locus" in the EUR superpopulation, that is, the matrix of the index variant and its proxies (`ldlocus_EUR`)
10. The LD matrix of the "locus" in the EAS superpopulation, that is, the matrix of the index variant and its proxies (`ldlocus_EAS`)
11. A trial run of the score statistic for EUR -> EUR that uses the mean Wald ratio as the estimate for $\gamma$, with no robust pseudoinverse applied (`EUR_EUR_Q`)
12. A trial run of the score statistic for EUR -> EAS that uses the mean Wald ratio as the estimate for $\gamma$, with no robust pseudoinverse applied (`EUR_EAS_Q`)

Each trial's summary statistics were generated with a constant standard error of 0.05 for both the exposure and the outcome, which is why the calls below pass `rep(0.05, ...)` for `se_bx` and `se_by`. In real data, the standard error varies across variants.
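To make the "mean Wald ratio" in items 11 and 12 concrete, here is a minimal sketch on toy numbers. The values of `bx` and `by` are made up for illustration, and using a simple unweighted mean is an assumption; the package may weight or filter the ratios differently.

```r
# Toy per-variant effect estimates (illustrative values, not from the trial files)
bx <- c(0.20, 0.25, 0.10)   # variant-exposure effects
by <- c(0.10, 0.10, 0.06)   # variant-outcome effects

# Wald ratio for each variant: outcome effect divided by exposure effect
wald <- by / bx

# Unweighted mean of the ratios as a plug-in estimate of gamma (assumption)
gamma_hat <- mean(wald)

print(wald)
print(gamma_hat)
```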


In [1]:
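## Install the test package from the local source tarball
## (run this after the download cell above, so that the file exists)
install.packages("Qlocalstat_1.0.tar.gz", repos = NULL, type = "source")
library(Qlocalstat)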

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)


## Simulation 1 - Trial 500: a "best case" situation:

Load simulation data 500.RDS

In [None]:
trial500 <- readRDS(file = "500.RDS")
attach(trial500)

Run `Qstat` on the European exposure and European outcome summary statistics

In [None]:
#EUR to EUR
Qstat(center = "index", bx = summary_EUR_exp[proxies], by = summary_EUR_out[proxies],
      se_bx = rep(0.05, length(proxies)), se_by = rep(0.05, length(proxies)), ldlocus_EUR,
      weak_filter = TRUE, weak_thresh = 2,
      SVD = FALSE, SVD_thresh = NA)

Run `Qstat` on the European exposure and East Asian outcome summary statistics

In [None]:
#EUR to EAS
Qstat(center = "index", bx = summary_EUR_exp[proxies], by = summary_EAS_out[proxies],
      se_bx = rep(0.05, length(proxies)), se_by = rep(0.05, length(proxies)), ldlocus_EUR,
      weak_filter = TRUE, weak_thresh = 2,
      SVD = FALSE, SVD_thresh = NA)

detach(trial500)

Note that even without any regularization, both statistics run without issue and give sensible results. As expected, EUR -> EAS is much more heterogeneous than EUR -> EUR.

## Simulation 2 - Trial 19: a situation with an unstable inverse that the eigenvalue-based pseudoinverse resolves


Load simulation data 19.RDS

In [None]:
trial19 <- readRDS(file = "19.RDS")
attach(trial19)

Run `Qstat` on the European exposure and European outcome summary statistics

In [None]:
#EUR to EUR
Qstat(center = "index", bx = summary_EUR_exp[trial19$proxies], by = summary_EUR_out[trial19$proxies],
      se_bx = rep(0.05, length(trial19$proxies)), se_by = rep(0.05, length(trial19$proxies)), ldlocus_EUR,
      weak_filter = TRUE, weak_thresh = 2,
      SVD = FALSE, SVD_thresh = NA)

Run `Qstat` on the European exposure and East Asian outcome summary statistics


In [None]:
#EUR to EAS
Qstat(center = "index", bx = summary_EUR_exp[trial19$proxies], by = summary_EAS_out[trial19$proxies],
      se_bx = rep(0.05, length(trial19$proxies)), se_by = rep(0.05, length(trial19$proxies)), ldlocus_EUR,
      weak_filter = TRUE, weak_thresh = 2,
      SVD = FALSE, SVD_thresh = NA)

“Q-statistic is less than zero, consider using the pseudoinverse!”


While the EUR -> EUR result is still reasonable, the EUR -> EAS result is extremely unstable: as the warning indicates, the Q-statistic comes out negative, which is an absurd value. Let's try applying the eigenvalue-based pseudoinverse.
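One way a quadratic-form statistic can turn negative is when the matrix being inverted is not positive definite. The sketch below is a toy illustration of that mechanism only; the matrix `Omega` and vector `x` are invented here, and this is not the computation performed inside `Qstat`.

```r
# Orthonormal eigenvectors and eigenvalues of a toy 2x2 symmetric matrix;
# the second eigenvalue is slightly negative, so Omega is not positive definite
V <- matrix(c(1, 1, 1, -1), nrow = 2) / sqrt(2)
lambda <- c(2, -0.001)
Omega <- V %*% diag(lambda) %*% t(V)

# A unit vector lying along the negative-eigenvalue direction
x <- V[, 2]

# Quadratic form with the ordinary inverse: equals 1 / lambda[2], a large negative number
Q <- drop(t(x) %*% solve(Omega) %*% x)
print(Q)
```

A tiny negative eigenvalue is enough to flip the sign and inflate the magnitude, which is why truncating or regularizing the spectrum before inverting can help.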


In [None]:
#EUR to EAS, eigenvalue-based pseudoinverse
Qstat(center = "index", bx = summary_EUR_exp[trial19$proxies], by = summary_EAS_out[trial19$proxies],
      se_bx = rep(0.05, length(trial19$proxies)), se_by = rep(0.05, length(trial19$proxies)), ldlocus_EUR,
      weak_filter = TRUE, weak_thresh = 2,
      SVD = TRUE, SVD_thresh = "eigen")

detach(trial19)

The result is improved, and we once again see that EUR -> EAS shows much more heterogeneity than its EUR -> EUR counterpart.


## Simulation 3 - Trial 620: a situation that the eigenvalue-based pseudoinverse fails to resolve, but a threshold-based pseudoinverse fixes

Load simulation data 620.RDS

In [None]:
trial620 <- readRDS(file = "620.RDS")
attach(trial620)

Run `Qstat` on the European exposure and European outcome summary statistics

In [None]:
#EUR to EUR
Qstat(center = "index", bx = summary_EUR_exp[proxies], by = summary_EUR_out[proxies],
      se_bx = rep(0.05, length(proxies)), se_by = rep(0.05, length(proxies)), ldlocus_EUR,
      weak_filter = TRUE, weak_thresh = 2,
      SVD = FALSE, SVD_thresh = NA)

Run `Qstat` on the European exposure and East Asian outcome summary statistics

In [None]:
#EUR to EAS
Qstat(center = "index", bx = summary_EUR_exp[proxies], by = summary_EAS_out[proxies],
      se_bx = rep(0.05, length(proxies)), se_by = rep(0.05, length(proxies)), ldlocus_EUR,
      weak_filter = TRUE, weak_thresh = 2,
      SVD = FALSE, SVD_thresh = NA)

“Q-statistic is less than zero, consider using the pseudoinverse!”


Once again, the EUR -> EAS result is clearly wrong: the warning shows a negative Q-statistic. Let's try using the eigenvalue-based pseudoinverse.


In [None]:
#eigenvalue-based pseudoinverse
Qstat(center = "index", bx = summary_EUR_exp[proxies], by = summary_EAS_out[proxies],
      se_bx = rep(0.05, length(proxies)), se_by = rep(0.05, length(proxies)), ldlocus_EUR,
      weak_filter = TRUE, weak_thresh = 2,
      SVD = TRUE, SVD_thresh = "eigen")

That did not fix the result, so let's instead try applying the threshold-based pseudoinverse.

In [None]:
#threshold-based pseudoinverse
Qstat(center = "index", bx = summary_EUR_exp[proxies], by = summary_EAS_out[proxies],
      se_bx = rep(0.05, length(proxies)), se_by = rep(0.05, length(proxies)), ldlocus_EUR,
      weak_filter = TRUE, weak_thresh = 2,
      SVD = TRUE, SVD_thresh = 1e-4)
detach(trial620)

Now the results are stable once again. Interestingly, results from EUR -> EAS do not necessarily have to be more heterogeneous than EUR -> EUR: the amount of mismatch depends on the location within the genome.

## Summary



In summary, a robust inverse for $\Omega$ is often necessary for proper calculation of the score statistic. Importantly, there is no "one-size-fits-all" option for this robust inverse: in practice, we find that the eigenvalue-based pseudoinverse works as a general first choice for most situations where the result is unstable, but occasionally manual tweaking is necessary to get a good result.
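As a rough illustration of what a threshold-based pseudoinverse does, the sketch below inverts only the eigenvalues above a relative cutoff and discards the rest. The helper name `pinv_threshold`, the relative-cutoff rule, and the toy `Omega` are assumptions made for this example; they are not taken from the Qlocalstat source, and the meaning of `SVD_thresh` inside `Qstat` may differ.

```r
# Hypothetical helper: pseudoinverse of a symmetric matrix that keeps only
# eigenvalues larger than tol times the largest eigenvalue
pinv_threshold <- function(M, tol = 1e-4) {
  e <- eigen(M, symmetric = TRUE)
  stopifnot(max(e$values) > 0)
  keep <- e$values > tol * max(e$values)
  V <- e$vectors[, keep, drop = FALSE]
  # V %*% diag(1 / values) %*% t(V), written without building the diagonal matrix
  V %*% (t(V) / e$values[keep])
}

# Toy symmetric matrix with eigenvalues 2 and -0.001 (not positive definite)
Omega <- matrix(c(0.9995, 1.0005, 1.0005, 0.9995), nrow = 2)
x <- c(1, 0.5)

# The negative eigenvalue is dropped, so the quadratic form stays non-negative
Q <- drop(t(x) %*% pinv_threshold(Omega) %*% x)
print(Q)
```

Dropping directions with small or negative eigenvalues trades a little information for stability; how aggressive the cutoff should be is exactly the tuning choice discussed above.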

Overall, we recommend trying several plausible parameter settings for these methods to check that conclusions drawn from one setting remain consistent across the others.
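A sensitivity check of this kind can be sketched on a toy example by recomputing a quadratic form under several cutoffs and comparing the results. Everything here (the diagonal `Omega`, the vector `x`, the grid `tols`, and the relative-cutoff rule) is assumed for illustration and does not call `Qstat`.

```r
# Toy symmetric matrix with one large, one small, and one slightly negative eigenvalue
Omega <- diag(c(2, 1e-3, -1e-5))
x <- c(1, 1, 1)

# Eigendecomposition and projection of x onto each eigenvector
e <- eigen(Omega, symmetric = TRUE)
proj <- drop(crossprod(e$vectors, x))

# Recompute the truncated quadratic form for several relative cutoffs
tols <- c(1e-6, 1e-4, 1e-2)
sens <- do.call(rbind, lapply(tols, function(tol) {
  keep <- e$values > tol * max(e$values)
  data.frame(tol = tol,
             rank = sum(keep),
             Q = sum(proj[keep]^2 / e$values[keep]))
}))
print(sens)
```

In this toy case the two smaller cutoffs agree while the largest one removes an extra direction and changes the statistic substantially, which is the kind of disagreement worth investigating before drawing conclusions.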