# Multivariate TWAS using joint-tissue imputation with Mendelian Randomization

## Introduction

This notebook shows the workflow for `MR-JTI`, which uses joint-tissue imputation (JTI) to predict gene expression by exploiting the relationships in gene expression across tissues. `MR-JTI` achieves higher prediction accuracy by leveraging multi-tissue information, and it also performs causal inference between the trait and gene expression.

We use data previously contributed to this course by Dr. Dan Zhou from Eric R. Gamazon's lab. For those interested in alternative multivariate TWAS methods, you may find the following references helpful. 

> Zhou, Dan, et al. "A unified framework for joint-tissue transcriptome-wide association and Mendelian randomization analysis." Nature Genetics (2020)

> Barbeira, Alvaro N., et al. "Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics." Nature Communications (2018)

> Gamazon, Eric R., et al. "A gene-based association method for mapping traits using reference transcriptome data." Nature Genetics (2015)

## Analysis outline

1. Build a multi-tissue gene expression prediction model
2. Impute (predict) gene expression for each tissue
3. Perform association testing with the imputed expression for each tissue
4. Perform causal inference between the trait and the imputed gene expression

## Step 1: Prediction model training (Joint-Tissue Imputation, JTI)

## 1. Input file preparation
### 1.1 Genotype data preparation (similar to GWAS)
QC and filtering: MAF, HWE, call rate, R2 (imputation quality), etc.

Example file: `jti_example_geno.bed/bim/fam`

### 1.2 Expression data preparation
Normalize expression and residualize it for covariates (age, gender, PCs, PEER factors, etc.).

Example file: `jti_example_exp.txt`

### 1.3 Tissue-tissue similarity estimation
Tissue-tissue similarity can be estimated from expression profiles, DNase I hypersensitive sites (DHS), etc.

Example file: `jti_example_exp.txt`

### 1.4 Gene annotation file
An annotation file listing each gene's ID, name, strand, and so on.

Example file: `gencode.v32.GRCh37.txt`

## 2. Software and input Options
### 2.1 Script
https://github.com/gamazonlab/MR-JTI/blob/master/model_training/JTI/JTI.r

### 2.2 Input Options
* `--tissue`, the target tissue name.
* `--geneid`, the ENSG gene ID. Provide the real ENSG gene ID, which is used to look up the gene's chromosome and position.
* `--genotype_path`, genotype file prefix in PLINK bfile format (`.bed/.bim/.fam`). The genotypes should already have gone through QC and filtering (MAF, call rate, imputation R2, etc.). The files hold the genotype matrix together with the SNP and family information. The example files here follow the pattern `{genotype_path}.bed/.bim/.fam`:

      jti_example_geno.bed
      jti_example_geno.bim
      jti_example_geno.fam
* `--expression_path`, expression data, normalized and residualized for age, gender, PCs, PEER factors, and so on. It contains the tissue name, sample ID, and expression level.
* `--gencode_path`, gene annotation file in `.txt` format, which includes each gene's ID, name, strand, and so on. We use the column 'geneid' as the list for iterating the workflow over all genes.
* `--tmp_folder`, folder for temporary files.
* `--plink_path`, path to the PLINK executable. It is preinstalled in the Docker image.
* `--out_path`, folder where the output weight file is written.

## 3. A usage example

Let's run Joint-Tissue Imputation for tissue "Adipose_Subcutaneous" and gene "ENSG00000182957".

In [None]:
mkdir result

In [None]:
Rscript /opt/MR_JTI/JTI.r \
    --tissue Adipose_Subcutaneous \
    --geneid ENSG00000182957 \
    --genotype_path data/jti_example_geno \
    --expression_path data/jti_example_exp.txt \
    --gencode_path data/gencode.v32.GRCh37.txt \
    --tmp_folder tmp \
    --plink_path /usr/local/bin/plink \
    --out_path result

## 4. Output

The output file (weight file) `{out_path}/{geneid}_{tissue}.txt` contains the following columns:
* gene: gene ID
* rsid: SNP ID
* chr_bp: chromosome_position
* ref_allele: reference allele (the uncounted allele when generating the dosage file)
* counted_allele: counted allele (the allele counted when generating the dosage file)
* weight: weight per copy of the counted allele
* r2: cross-validation r2, the square of the correlation between the predicted and observed expression levels
* p: cross-validation p-value, the significance of the correlation test between the predicted and observed expression levels
* lambda: the final hyperparameter
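
To make the role of the `weight` and `counted_allele` columns concrete, the following minimal sketch computes predicted expression as the weighted sum of counted-allele dosages, as is usual for PrediXcan-style models. The SNP IDs, weights, and dosages are made-up values for illustration, and `predict_expression` is a hypothetical helper, not part of the JTI scripts.

```python
# Hypothetical weight rows mimicking the columns of a JTI weight file.
weights = [
    {"rsid": "rs1", "counted_allele": "A", "weight": 0.20},
    {"rsid": "rs2", "counted_allele": "G", "weight": -0.10},
    {"rsid": "rs3", "counted_allele": "T", "weight": 0.05},
]

# Hypothetical dosages: copies (0, 1, 2) of the counted allele per sample.
dosages = {
    "sample1": {"rs1": 2, "rs2": 0, "rs3": 1},
    "sample2": {"rs1": 0, "rs2": 1, "rs3": 2},
}


def predict_expression(weight_rows, sample_dosage):
    """Weighted sum of counted-allele dosages over the SNPs in the model."""
    # SNPs absent from the genotype data are skipped (they contribute 0).
    return sum(
        row["weight"] * sample_dosage[row["rsid"]]
        for row in weight_rows
        if row["rsid"] in sample_dosage
    )


predicted = {s: predict_expression(weights, d) for s, d in dosages.items()}
for sample, value in predicted.items():
    print(f"{sample}\t{value:.3f}")
```

The dosage must count the same allele as `counted_allele`; if the genotype file counts the other allele, the dosage has to be converted to `2 - dosage` first.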

In [None]:
cat result/ENSG00000182957_Adipose_Subcutaneous.txt

## Step 2: Association test

MetaXcan computes gene-level association tests from ordinary GWAS data. Let's use MetaXcan to run a gene-level association test based on summary statistics.

## 1. Input files
### 1.1 Prediction model
The prediction model includes the genetic variants and their effect allele, reference allele, and weight. It is usually packaged into an SQLite file with the extension `.db`.

Example file `JTI_Liver.db`
https://zenodo.org/record/3842289/files/JTI_Liver.db

Connect to a local SQLite database like this:

In [None]:
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), dbname = 'data/JTI_Liver.db')  # open a connection
dbListTables(con)                       # list the tables in the database
dbListFields(con, 'weights')            # list the columns of the 'weights' table
weights <- dbReadTable(con, 'weights')  # read the whole table into a data frame
dbDisconnect(con)                       # close the connection

### 1.2 SNP-SNP covariance matrix
The SNP-SNP covariance matrix is typically estimated from a reference dataset (e.g., 1000 Genomes, GTEx). It is needed for association testing with GWAS summary statistics.

Example file `JTI_Liver.txt.gz`
https://zenodo.org/record/3842289/files/JTI_Liver.txt.gz


### 1.3 Pre-trained prediction models
https://zenodo.org/record/3842289


### 1.4 GWAS summary statistics
For each variant, the rsid, effect allele, reference allele, estimated effect size (beta), and its standard error (se) are needed. Given beta, the z-score, p-value, and se are interconvertible: z = beta / se, and the two-sided p-value follows from z under a standard normal distribution.
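
If a GWAS file lacks one of these columns, the conversion can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names are hypothetical, and it assumes a two-sided p-value from a normal test statistic, with beta nonzero and p strictly between 0 and 1.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def z_from_beta_se(beta, se):
    """z-score from effect size and standard error."""
    return beta / se


def p_from_z(z):
    """Two-sided p-value for a z-score."""
    return 2.0 * STD_NORMAL.cdf(-abs(z))


def z_from_beta_p(beta, p):
    """Recover z from a two-sided p-value; the sign is taken from beta."""
    magnitude = -STD_NORMAL.inv_cdf(p / 2.0)  # requires 0 < p < 1
    return magnitude if beta >= 0 else -magnitude


def se_from_beta_p(beta, p):
    """Standard error implied by beta and a two-sided p-value."""
    return beta / z_from_beta_p(beta, p)  # undefined when z is 0 (p = 1)


z = z_from_beta_se(0.05, 0.01)
p = p_from_z(z)
print(z, p, se_from_beta_p(0.05, p))
```

Extremely small p-values can underflow to 0 in floating point, in which case the z-score cannot be recovered from p and the reported beta and se should be used instead.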

Example file `LDLq.txt.gz`
(GWAS for LDL-C from UK Biobank, generated by Ben Neale Lab http://www.nealelab.is/uk-biobank)
Dropbox link: https://www.dropbox.com/sh/i9elg3m4wav4o5g/AAABdxZbVyBclbfa_1KKVftDa?dl=0


### 1.5 Individual level genotype data, phenotype data, and covariates
The genotype file should be converted to dosage format (coded as 0, 1, 2). Covariates may include age, gender, PCs, batch, etc.

Example file `jti_example_geno.bed/bim/fam` (1000 Genomes Project phase 1, 1.2 GB)
https://www.dropbox.com/s/k9ptc4kep9hmvz5/1kg_phase1_all.tar.gz

Use the following command to convert the binary file to dosage format.
Reference: https://github.com/hakyimlab/PrediXcan

In [None]:
plink --bfile data/jti_example_geno --recode A --out result/dosage
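
With `--out result/dosage`, PLINK's `--recode A` writes a whitespace-delimited text file `result/dosage.raw`: six leading columns (FID, IID, PAT, MAT, SEX, PHENOTYPE) followed by one column per SNP named `{rsid}_{counted allele}`, with missing genotypes written as `NA`. The sketch below parses such a file into per-sample dosages. The embedded text is a made-up excerpt and `read_raw` is a hypothetical helper.

```python
import io

# Hypothetical excerpt of a PLINK .raw file (whitespace-delimited).
raw_text = """FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G
F1 S1 0 0 1 -9 2 0
F2 S2 0 0 2 -9 1 NA
"""


def read_raw(handle):
    """Return {sample ID: {SNP column: dosage or None}} from a .raw file."""
    header = handle.readline().split()
    snp_columns = header[6:]  # the first six columns describe the sample
    dosages = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        sample_id = fields[1]  # IID column
        dosages[sample_id] = {
            snp: (None if value == "NA" else float(value))
            for snp, value in zip(snp_columns, fields[6:])
        }
    return dosages


dosage_table = read_raw(io.StringIO(raw_text))
print(dosage_table)
```

For a real file, replace `io.StringIO(raw_text)` with `open("result/dosage.raw")`.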

## 2. Software and input Options
### 2.1 Script
https://github.com/hakyimlab/MetaXcan/tree/v0.5.0/software/MetaXcan.py

### 2.2 Input Options

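* `--model_db_path` Path to the tissue transcriptome model.
* `--covariance` Path to the file containing covariance information. This covariance should correspond to the tissue transcriptome model.
* `--gwas_folder` Folder containing GWAS summary statistics data.
* `--gwas_file` Path to a single GWAS summary statistics file (used in the example below instead of `--gwas_folder`).
* `--snp_column` Name of the column containing the SNP ID (rsid) in the input GWAS files.
* `--effect_allele_column` Name of the column containing the effect allele in the input GWAS files.
* `--non_effect_allele_column` Name of the column containing the non-effect (reference) allele in the input GWAS files.
* `--beta_column` Name of the column containing the phenotype beta for each SNP in the input GWAS files.
* `--se_column` Name of the column containing the standard error of beta for each SNP in the input GWAS files.
* `--pvalue_column` Name of the column containing the p-value for each SNP in the input GWAS files.
* `--output_file` Path where the results will be saved.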

## 3. An example run

In [None]:
python /opt/MR_JTI/MetaXcan/software/MetaXcan.py \
    --model_db_path data/JTI_Liver.db \
    --covariance data/JTI_Liver.txt.gz \
    --gwas_file data/LDLq.txt.gz \
    --snp_column rsid \
    --effect_allele_column eff_allele \
    --non_effect_allele_column ref_allele \
    --beta_column beta \
    --se_column se \
    --output_file result/LDLq_JTI_Liver.csv

### Output
The output file `{out_path}/{trait}_{model}_{tissue}.csv` contains the following columns:
* gene: the gene's ID as listed in the tissue transcriptome model. This is the Ensembl ID for some models, while others (mainly DGN Whole Blood) provide Genquant's gene name
* gene_name: gene name as listed by the transcriptome model, generally extracted from Genquant
* zscore: MetaXcan's association result for the gene
* effect_size: MetaXcan's association effect size for the gene
* pvalue: p-value of the aforementioned statistic
* pred_perf_r2: R2 of the tissue model's correlation with the gene's measured transcriptome (prediction performance)
* pred_perf_pval: p-value of the tissue model's correlation with the gene's measured transcriptome (prediction performance)
* pred_perf_qval: q-value of the tissue model's correlation with the gene's measured transcriptome (prediction performance)
* n_snps_used: number of SNPs from the GWAS that were used in the MetaXcan analysis
* n_snps_in_cov: number of SNPs in the covariance matrix
* n_snps_in_model: number of SNPs in the model
* var_g: variance of the gene expression, calculated as W' * G * W (where W is the vector of SNP weights in a gene's model, W' is its transpose, and G is the covariance matrix)
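
The `var_g` and `zscore` columns can be illustrated with a small worked example. The sketch below computes W' * G * W and then combines per-SNP GWAS z-scores with the formula from Barbeira et al. (2018), z_g ≈ Σ_l w_l (σ_l / σ_g) (β_l / se_l), where σ_l² is the diagonal of G and σ_g² is `var_g`. All numbers are invented for illustration, and MetaXcan's own implementation handles details (missing SNPs, allele alignment) that are omitted here.

```python
import math

# Hypothetical two-SNP model for one gene.
weights = [0.20, -0.10]       # W: SNP weights from the prediction model
cov = [[0.50, 0.10],          # G: SNP-SNP covariance from a reference panel
       [0.10, 0.32]]
gwas_beta = [0.03, -0.02]     # GWAS effect sizes, same SNP order as W
gwas_se = [0.01, 0.01]        # GWAS standard errors


def gene_variance(w, g):
    """var_g = W' * G * W."""
    n = len(w)
    return sum(w[i] * g[i][j] * w[j] for i in range(n) for j in range(n))


def gene_zscore(w, g, beta, se):
    """Sum over SNPs of w_l * (sigma_l / sigma_g) * (beta_l / se_l)."""
    sigma_g = math.sqrt(gene_variance(w, g))
    return sum(
        w[l] * math.sqrt(g[l][l]) / sigma_g * beta[l] / se[l]
        for l in range(len(w))
    )


var_g = gene_variance(weights, cov)
zscore = gene_zscore(weights, cov, gwas_beta, gwas_se)
print(f"var_g = {var_g:.4f}, zscore = {zscore:.3f}")
```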

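Show the 10 genes with the smallest p-values in the MetaXcan association test

In [None]:
asso_stat = read.csv("result/LDLq_JTI_Liver.csv", header = TRUE)
head(asso_stat[order(asso_stat$pvalue), ], 10)  # sort by p-value so the strongest associations come first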

## Step 3: Mendelian Randomization (MR-JTI)

## 1. Input files
A dataframe of GWAS and eQTL summary statistics from step 2

Example file `mrjti_example.txt`
https://github.com/gamazonlab/MR-JTI/blob/master/mr/mrjti_example.txt


## 2. Software and packages for this step
### 2.1 Script
https://github.com/gamazonlab/MR-JTI/blob/master/mr/MR-JTI.r

### 2.2 Input options
* `--df_path`, path to the data frame of GWAS and eQTL summary statistics. This input file contains the columns listed below (the headers are required):
    * rsid: SNP rsid. SNPs need to be clumped (`plink --clump`) before running MR-JTI.
    * effect_allele: the effect allele. Harmonization needs to be performed to make sure the effect alleles of the eQTL and GWAS data are correctly aligned.
    * ldscore: the LD score of each SNP. GCTA can be used to generate LD scores from a reference dataset (e.g., 1000 Genomes, GTEx): `gcta64 --bfile test --ld-score --ld-wind 1000 --ld-rsq-cutoff 0.01 --out test`
    * eqtl_beta: the marginal eQTL effect size of the SNP, available on the GTEx portal
    * eqtl_se: SE of the eQTL effect size
    * eqtl_p: eQTL p-value
    * gwas_beta: GWAS effect size
    * gwas_p: GWAS p-value
* `--n_genes`, total number of genes tested (Bonferroni correction will be applied). `n_genes=1` requests only the nominal significance level (i.e., p<0.05 is considered significant).
* `--out_path`, output path.
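
The harmonization required for the `effect_allele` column can be sketched as follows: when the GWAS reports its effect for the opposite allele, the sign of `gwas_beta` is flipped so that both effects refer to the eQTL effect allele. The function `align_gwas_beta`, the allele fields, and the example SNPs are hypothetical. Strand flips and ambiguous A/T or C/G SNPs are not handled in this sketch.

```python
def align_gwas_beta(eqtl_effect, eqtl_other, gwas_effect, gwas_other, gwas_beta):
    """Return gwas_beta expressed relative to the eQTL effect allele.

    Returns None when the allele pairs do not match, so the SNP can be dropped.
    """
    eqtl_pair = (eqtl_effect.upper(), eqtl_other.upper())
    gwas_pair = (gwas_effect.upper(), gwas_other.upper())
    if gwas_pair == eqtl_pair:
        return gwas_beta        # already aligned
    if gwas_pair == eqtl_pair[::-1]:
        return -gwas_beta       # alleles swapped: flip the sign
    return None                 # incompatible alleles


# Hypothetical SNPs: (rsid, eQTL effect/other allele, GWAS effect/other allele, GWAS beta)
snps = [
    ("rs1", "A", "G", "A", "G", 0.02),   # same orientation
    ("rs2", "C", "T", "T", "C", 0.05),   # swapped orientation
    ("rs3", "A", "G", "A", "C", 0.01),   # mismatched alleles
]

aligned = {
    rsid: align_gwas_beta(e_eff, e_oth, g_eff, g_oth, beta)
    for rsid, e_eff, e_oth, g_eff, g_oth, beta in snps
}
print(aligned)
```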


## 3. A typical run

In [None]:
Rscript  /opt/MR_JTI/MR-JTI.r \
    --df_path data/mrjti_example.txt \
    --n_genes 1 \
    --out_path result/mrjti_example.csv

### Output
MR-JTI generates the upper and lower estimates of the gene's effect on GWAS trait as well as the heterogeneity estimates.

The output file `{out_path}` contains the following columns:
* variable: the variables, including the gene's effect and the heterogeneity effects
* beta: point estimate of the effect size
* beta_CI_lower: lower bound of the Bonferroni-adjusted confidence interval (CI)
* beta_CI_upper: upper bound of the Bonferroni-adjusted CI
* CI_significance: significant if the CI does not include the null value (i.e., 0)

In [None]:
mrjti_stat = read.csv("result/mrjti_example.csv", header = T) 
head(mrjti_stat)

In [None]:
mrjti_stat[mrjti_stat$CI_significance=="sig",]

Note: MR-JTI performs causal inference by modeling the heterogeneity (extra effect), which is mainly due to horizontal pleiotropy and unobserved confounding factors. The 'CI_significance' column tells you whether a result is significant ('sig' or 'nonsig'). Here, significance is defined not by a p-value but by the confidence interval (CI) estimated non-parametrically from the bootstrap. A Bonferroni-adjusted CI that includes 0 means not significant. 'Bonferroni-adjusted CI' means that, when 100 genes are tested, a 1-0.05/100 CI (99.95% CI) is applied.
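
The idea of a Bonferroni-adjusted bootstrap CI can be illustrated with a generic percentile bootstrap. This is only a sketch of the concept under simplifying assumptions: it uses simulated data and an ordinary least-squares slope as a stand-in for the effect estimate, and it is not the estimator or the resampling scheme implemented in `MR-JTI.r`.

```python
import random


def bonferroni_level(n_genes, alpha=0.05):
    """CI coverage after Bonferroni correction, e.g. 0.9995 for 100 genes."""
    return 1.0 - alpha / n_genes


def slope(pairs):
    """Least-squares slope of y on x (stand-in for the estimated effect)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / sxx


def bootstrap_ci(pairs, level, n_boot=5000, seed=1):
    """Percentile bootstrap CI for the slope at the given coverage level."""
    rng = random.Random(seed)
    estimates = sorted(
        slope([rng.choice(pairs) for _ in pairs]) for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * (n_boot - 1))]
    upper = estimates[int(round((1.0 - tail) * (n_boot - 1)))]
    return lower, upper


# Simulated data with a true slope of 0.5 (illustrative only).
data_rng = random.Random(0)
pairs = []
for _ in range(50):
    x = data_rng.gauss(0, 1)
    pairs.append((x, 0.5 * x + data_rng.gauss(0, 0.2)))

ci_nominal = bootstrap_ci(pairs, bonferroni_level(1))   # 95% CI
ci_100 = bootstrap_ci(pairs, bonferroni_level(100))     # 99.95% CI
for label, (lower, upper) in (("1 gene", ci_nominal), ("100 genes", ci_100)):
    verdict = "nonsig" if lower <= 0.0 <= upper else "sig"
    print(f"{label}: [{lower:.3f}, {upper:.3f}] {verdict}")
```

Because the adjusted interval widens as the number of tested genes grows, more bootstrap replicates are needed to resolve its extreme tails.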

The significance of "expression" (2nd row in the result file) is the primary result of MR-JTI, indicating the significance of the causal relationship between gene expression and the trait. The LD score is included as a covariate here.

The significance for each IV (SNP) indicates whether the 'extra effect' of that IV is significantly different from 0. The 'extra effect' denotes the effect of the IV on the trait that is not mediated by the target gene's expression.