# Multivariate TWAS using joint-tissue imputation with Mendelian Randomization

Copyrighted © 2021 Guangyou Li and Dan Zhou

This MR-JTI exercise comes from [Eric R. Gamazon's lab](https://github.com/gamazonlab/MR-JTI). It is written as an SoS notebook and uses a pre-built Docker image. The notebook walks through the MR-JTI workflow: joint-tissue imputation (JTI) builds gene expression prediction models that leverage multi-tissue information to achieve higher prediction accuracy, and Mendelian randomization (MR) then performs causal inference between the trait and gene expression.

> [Zhou, Dan, et al. "A unified framework for joint-tissue transcriptome-wide association and Mendelian randomization analysis." Nature Genetics (2020)](https://www.nature.com/articles/s41588-020-0706-2)

> Barbeira, Alvaro N., et al. "Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics." Nature Communications (2018)

> Gamazon, Eric R., et al. "A gene-based association method for mapping traits using reference transcriptome data." Nature genetics (2015)

## Analysis outline

1. Build multi-tissue gene expression prediction model 
2. Imputation / prediction of gene expression for each tissue
3. Perform association testing with imputed expression for each tissue
4. Causal Inference between Trait and imputed Gene Expression

## JTI based phenotype prediction
 
### Input

* `--tissue`, target tissue name 
* `--geneid`, the ENSG gene ID. Provide the real ENSG gene ID, which is used to look up the chromosome and position of the gene.
* `--genotype_path`, genotype file prefix in PLINK bfile format (`.bed/.bim/.fam`). It is used in data preprocessing for QC, filtering, and calculating MAF, call rate, and R-sq. The files hold the SNP genotype matrix together with SNP names and family information. The example files here are `{genotype_path}.bed/.bim/.fam`:

      jti_example_geno.bed
      jti_example_geno.bim
      jti_example_geno.fam

* `--expression_path`, expression data used for normalization and residualization for age, gender, PCs, PEER factors, and so on. It contains the tissue name, sample ID, and expression level.
* `--gencode_path`, gene annotation file in .txt format. It includes each gene's ID, name, strand, and so on. We use the column 'geneid' as the list to iterate our workflow through all the genes.
* `--plink_path`, the path to the PLINK executable. PLINK is already installed in the Docker image.
    
### Output

The output file (weight file) `{out_path}/{_geneid}_{tissue}.txt` contains the following columns
* gene: geneid
* rsid: snpid
* chr_bp: chromosome_position
* ref_allele: reference allele (uncounted allele when generating the dosage file.)
* counted_allele: counted_allele (counted allele when generating the dosage file.)
* weight: weight for each counted allele
* r2: cross-validation r2. The square of the correlation between the predicted and observed expression levels.
* p: cross-validation p-value. The significance of the correlation test (correlation between the predicted and observed expression levels)
* lambda: The final hyperparameter.
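
As a rough illustration of how a weight file of this shape is commonly used, the sketch below computes the predicted expression of one individual as the weighted sum of counted-allele dosages. This is an assumption about downstream use, not code from the MR-JTI repository: the helper name `predict_expression` and the dosage values are hypothetical, and the weights are rounded from the example output for ENSG00000182957 shown later in this notebook.

```python
# Minimal sketch (assumption): predicted expression = sum(weight * counted-allele dosage).

def predict_expression(weights, dosages):
    """weights: {rsid: weight}; dosages: {rsid: dosage in [0, 2]} for one individual."""
    # SNPs absent from the genotype data are skipped, i.e. they contribute 0.
    return sum(w * dosages[rsid] for rsid, w in weights.items() if rsid in dosages)

# Weights rounded from the example weight file; dosages are made up for illustration.
weights = {"rs73455777": -0.28, "rs73455781": -0.108, "rs9507302": -0.052}
dosages = {"rs73455777": 1.0, "rs73455781": 2.0}

predicted = predict_expression(weights, dosages)
print(round(predicted, 3))
```

The dosage must count the same allele as `counted_allele`; otherwise the sign of the corresponding weight has to be flipped.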

## Association testing

### Input

* `--model_db_path` Path to the tissue transcriptome model.
* `--covariance` Path to the file containing covariance information. This covariance should correspond to the tissue transcriptome model.
* `--gwas_folder` Folder containing GWAS summary statistics data.
* `--beta_column` Name of the column containing the phenotype beta for each SNP in the input GWAS files.
* `--pvalue_column` Name of the column containing the p-value for each SNP in the input GWAS files.
* `--output_file` Path where results will be saved to.

### Output
The output file `{out_path}/{trait}_{model}_{tissue}.csv` contains the following columns
* gene: a gene's ID, as listed in the tissue transcriptome model. This is an Ensembl ID for some models, while others (mainly DGN Whole Blood) provide Genquant's gene name
* gene_name: gene name as listed by the transcriptome model, generally extracted from Genquant
* zscore: MetaXcan's association result for the gene
* effect_size: MetaXcan's association effect size for the gene
* pvalue: P-value of the aforementioned statistic.
* pred_perf_r2: R2 of tissue model's correlation to gene's measured transcriptome (prediction performance)
* pred_perf_pval: pval of tissue model's correlation to gene's measured transcriptome (prediction performance)
* pred_perf_qval: qval of tissue model's correlation to gene's measured transcriptome (prediction performance)
* n_snps_used: number of snps from GWAS that got used in MetaXcan analysis
* n_snps_in_cov: number of snps in the covariance matrix
* n_snps_in_model: number of snps in the model
* var_g: variance of the gene expression, calculated as W' * G * W (where W is the vector of SNP weights in a gene's model, W' is its transpose, and G is the covariance matrix)
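
The quadratic form in the `var_g` definition can be sketched in a few lines of plain Python. The weight vector and covariance matrix below are made-up numbers for illustration only, and the function name is hypothetical; MetaXcan computes this internally.

```python
# Minimal sketch of var_g = W' * G * W with illustrative (made-up) inputs.

def variance_of_predicted_expression(weights, cov):
    """weights: list of SNP weights W; cov: square SNP covariance matrix G (list of lists)."""
    n = len(weights)
    # Expand the quadratic form as a double sum over SNP pairs.
    return sum(weights[i] * cov[i][j] * weights[j] for i in range(n) for j in range(n))

W = [0.5, -0.2]          # hypothetical SNP weights for one gene
G = [[0.4, 0.1],
     [0.1, 0.3]]         # hypothetical SNP covariance matrix

var_g = variance_of_predicted_expression(W, G)
print(round(var_g, 3))
```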

## MR 
### Input

* `--df_path`, Path to the dataframe of GWAS and eQTL summary statistics. This input file contains the columns listed below (the headers are required):
    * rsid: rsid. SNPs need to be clumped (`plink --clump`) before running MR-JTI.
    * effect_allele: The effect allele. Harmonization needs to be performed to make sure the effect alleles of eQTL and GWAS are correctly aligned.
    * ldscore: The LD score of each SNP. GCTA can be used to generate LD scores based on a reference dataset (e.g. 1000g, GTEx): `gcta64 --bfile test --ld-score --ld-wind 1000 --ld-rsq-cutoff 0.01 --out test`
    * eqtl_beta: the marginal effect of the SNP on expression. Available on the GTEx portal
    * eqtl_se: SE of eQTL effect size
    * eqtl_p: eQTL p-value
    * gwas_beta: GWAS effect size
    * gwas_p: GWAS p-value
* `--n_genes` Total number of genes tested (Bonferroni correction will be applied). `n_genes=1` requests only the nominal significance level (i.e., p<0.05 is considered significant).
* `--out_path` Output path. 
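
The harmonization requirement on `effect_allele` can be illustrated with a small sketch: when the GWAS reports its effect for the other allele, the sign of `gwas_beta` is flipped so that both effects refer to the eQTL effect allele. The function and argument names are hypothetical, and the sketch deliberately ignores strand flips and palindromic (A/T, C/G) SNPs, which a real pipeline would need to handle.

```python
# Minimal sketch (assumption): align the GWAS effect to the eQTL effect allele.

def harmonize_gwas_beta(eqtl_effect_allele, gwas_effect_allele, gwas_other_allele, gwas_beta):
    """Return gwas_beta expressed for the eQTL effect allele, or None if alleles do not match."""
    if gwas_effect_allele == eqtl_effect_allele:
        return gwas_beta           # already aligned
    if gwas_other_allele == eqtl_effect_allele:
        return -gwas_beta          # effect was reported for the other allele: flip the sign
    return None                    # allele mismatch: drop the SNP

print(harmonize_gwas_beta("A", "A", "G", 0.12))
print(harmonize_gwas_beta("A", "G", "A", 0.12))
print(harmonize_gwas_beta("A", "C", "T", 0.12))
```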

### Output
The output file `{out_path}/*.csv` contains the following columns
* variable: Variables including the gene's effect and the heterogeneity effects
* beta: Point estimate of the effect size
* beta_CI_lower: Bonferroni adjusted confidence interval (CI), lower
* beta_CI_upper: Bonferroni adjusted CI, upper
* CI_significance: Significant if the CI does not overlap the null hypothesis (i.e., 0).
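
To make the last two ideas concrete, the sketch below shows the Bonferroni-adjusted significance level implied by `--n_genes` and the rule behind `CI_significance`. It is only an illustration of the definitions above, not the code in `MR-JTI.r`; the function names are hypothetical, and the CI bounds are taken from the example output in Step 3.

```python
# Minimal sketch of the Bonferroni level and the CI-based significance rule.

def bonferroni_alpha(n_genes, alpha=0.05):
    """Per-gene significance level after Bonferroni correction."""
    return alpha / n_genes

def ci_significance(ci_lower, ci_upper):
    """'sig' if the interval excludes the null value 0, otherwise 'nonsig'."""
    return "sig" if ci_lower > 0 or ci_upper < 0 else "nonsig"

print(bonferroni_alpha(1))        # nominal level when n_genes=1
print(bonferroni_alpha(20000))    # hypothetical transcriptome-wide analysis

# CI bounds copied from the example MR output below.
print(ci_significance(-0.824646380720042, -0.602677624454241))   # 'expression' row
print(ci_significance(-0.165689662168061, 0.0518012673488733))   # 'ldsc' row
```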

## MWE

The minimal working example files can be downloaded from [Google Drive](https://drive.google.com/drive/folders/1Yv_wipKz0jP0pd-gEhFBZzO4jU9m-mqs?usp=sharing).
The commands below assume the same folder structure as in that shared folder.

## Step 1: Prediction model training (Joint-Tissue Imputation, JTI)

In [17]:
sos run MR_JTI.ipynb JTI \
    --tissue Adipose_Subcutaneous \
    --geneid ENSG00000182957 \
    --genotype_path data/jti_example_geno \
    --expression_path data/jti_example_exp.txt \
    --gencode_path data/gencode.v32.GRCh37.txt \
    --container "statisticalgenetics/twas" \
    --out_path result

INFO: Running JTI: Part 1 Prediction model training (Joint-Tissue Imputation, JTI)
INFO: JTI (index=0) is ignored due to saved signature
INFO: JTI output:   result/ENSG00000182957_Adipose_Subcutaneous.txt
INFO: Workflow JTI (ID=w61fee8a1d90a2377) is ignored with 1 ignored step.


In [11]:
head result/ENSG00000182957_Liver.txt

gene	rsid	chr_bp	ref_allele	counted_allele	weight	r2	p	lambda
ENSG00000182957	rs73455777	13_24879413	T	A	-0.280248818627714	0.147611877506574	4.06941769922718e-16	0.153641011307056
ENSG00000182957	rs73455781	13_24884598	T	C	-0.108027909059419	0.147611877506574	4.06941769922718e-16	0.153641011307056
ENSG00000182957	rs9507302	13_24893568	C	T	-0.0520308509795799	0.147611877506574	4.06941769922718e-16	0.153641011307056
ENSG00000182957	rs56314653	13_24962979	T	C	-0.0154395733291027	0.147611877506574	4.06941769922718e-16	0.153641011307056


## Step 2: Association test

In [14]:
sos run MR_JTI.ipynb association \
    --db_path data/JTI_Liver.db \
    --genotype_path data/jti_example_geno \
    --input_folder data \
    --container "statisticalgenetics/twas" \
    --out_path result

INFO: Running association: Part 2 Association test
HINT: Pulling docker image statisticalgenetics/twas
HINT: Docker image statisticalgenetics/twas is now up to date
INFO: association is completed.
INFO: association output:   result/LDLq_JTI_Liver.csv
INFO: Workflow association (ID=wcb65c2b94acb7707) is executed successfully with 1 completed step.


In [15]:
head result/LDLq_JTI_Liver.csv

gene,gene_name,zscore,effect_size,pvalue,var_g,pred_perf_r2,pred_perf_pval,pred_perf_qval,n_snps_used,n_snps_in_cov,n_snps_in_model
ENSG00000143126,CELSR2,-42.30416059384626,-0.13424459985966838,0.0,0.26893507546188206,0.38689762880709166,1.1569946316677617e-23,7.149017436823504e-22,5,5,5
ENSG00000187244,BCAM,45.713683456250074,0.4493915104248329,0.0,0.030827889220000557,0.034484524780590185,0.0072429262005854625,0.022196934693451447,6,7,7
ENSG00000134243,SORT1,-41.97254744508827,-0.5058718693908261,0.0,0.018351693484898984,0.5584770598624752,1.9930967434199894e-38,6.85552803491261e-36,4,4,4
ENSG00000134222,PSRC1,-42.4925634626502,-0.125248472285708,0.0,0.3173165322628121,0.423336551426874,2.0106974450869275e-26,1.5849322610897707e-24,8,8,8
ENSG00000105726,ATP13A1,-21.520920063766447,-0.44611033803918193,9.917490808077277e-103,0.006385253790336975,0.05160172865637287,0.0009680533861656873,0.0038874196475233436,3,3,3
ENSG00000122008,POLK,18.347020954415058,0.22066026893028257,3.48715226
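
The `zscore` and `pvalue` columns above appear to be related by the usual two-sided normal tail, which can be checked with the standard library. This is a sketch under that assumption, not MetaXcan's own code; the z-scores are copied from the CELSR2 and ATP13A1 rows.

```python
import math

# Minimal sketch (assumption): two-sided p-value from a z-score under a standard normal null.
def two_sided_p(z):
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)), which stays accurate far in the tail.
    return math.erfc(abs(z) / math.sqrt(2.0))

p_atp13a1 = two_sided_p(-21.520920063766447)   # ATP13A1 row
p_celsr2 = two_sided_p(-42.30416059384626)     # CELSR2 row, underflows in double precision

print(p_atp13a1)
print(p_celsr2)
```

A reported p-value of exactly `0.0`, as for CELSR2, therefore reflects floating-point underflow rather than a literally zero probability.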

## Step 3: Mendelian Randomization (MR-JTI)

In [18]:
sos run MR_JTI.ipynb MR \
  --df_path data/mrjti_example.txt \
  --n_genes 1 \
  --out_path result

INFO: Running MR: Part 3 Mendelian Randomization (MR-JTI)
HINT: Pulling docker image statisticalgenetics/twas
HINT: Docker image statisticalgenetics/twas is now up to date
INFO: MR is completed.
INFO: MR output:   result/mrjti_example.csv
INFO: Workflow MR (ID=w1f80581b365d8b45) is executed successfully with 1 completed step.


In [19]:
head result/mrjti_example.csv

variable,beta,beta_CI_lower,beta_CI_upper,CI_significance
expression,-0.710254392485374,-0.824646380720042,-0.602677624454241,sig
ldsc,-0.060532183500575,-0.165689662168061,0.0518012673488733,nonsig
rs3902354,0,0,0,nonsig
rs68104325,0,0,0,nonsig
rs585362,0,0,0,nonsig
rs17035665,0,0,0,nonsig
rs651649,0,0,0,nonsig
rs6677122,0,0,0,nonsig
rs4970835,0,0,0,nonsig


## Workflow implementation

In [None]:
[global]
# path to the working directory
parameter: cwd = path(".")
cwd = f"{cwd:a}" 
# path to output file
parameter: out_path = path
# container 
parameter: container = "statisticalgenetics/twas"

In [None]:
# Part 1 Prediction model training (Joint-Tissue Imputation, JTI)
[JTI]
# target tissue name
parameter: tissue = str
# expression file
parameter: expression_path = path
# path for temporary files, which will be cleaned up after model training
parameter: tmp_folder = f'{cwd}/tmp'
# genotype file in PLINK format 
parameter: genotype_path = path
# provide the geneid of interest to focus on
parameter: geneid = []
# the gene annotation file
parameter: gencode_path = path
# To run the genes listed in the annotation file, build geneid as shown below instead.
## Running the whole list takes a long time, and not all genes have SNPs,
## so only the first n genes are taken here.
# import pandas as pd
# n = 5 
# geneid = pd.read_csv(gencode_path,header=0,sep="\t")["geneid"].to_list()[:n]
input: for_each = "geneid"
output: f'{out_path}/{_geneid}_{tissue}.txt'  
bash: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout',container = container 
    Rscript /opt/MR_JTI/JTI.r \
      --tissue ${tissue} \
      --geneid ${_geneid} \
      --genotype_path ${genotype_path} \
      --expression_path ${expression_path} \
      --tmp_folder ${tmp_folder} \
      --gencode_path ${gencode_path} \
      --out_path ${out_path} 

In [None]:
# Part 2 Association test
[association]
# a pre-downloaded database with prediction models
parameter: db_path = path
# genotype file in PLINK format 
parameter: genotype_path = path
# model used in AssoTest
parameter: model = 'JTI'
parameter: trait = 'LDLq'
parameter: tissue = 'Liver'
# data folder
parameter: input_folder = f'{cwd}/data'
input: db_path
output: f'{out_path}/{trait}_{model}_{tissue}.csv'
R: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout',container = container
    # The prediction model includes the genetic variants with their effect allele, reference allele, and weight. It is usually packaged into an SQLite file with the suffix '.db'.
    library(RSQLite)
    con <- dbConnect(RSQLite::SQLite(), dbname='${db_path}')             #establish connections
    dbListTables(con)  #datasets
    dbListFields(con, 'weights')   #cols
    weights = dbReadTable(con,"weights")
    dbDisconnect(con) #disconnect

# convert to dosage file
bash: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout',container = container
    plink --bfile ${genotype_path} --recode A --out ${out_path}/dosage
    MetaXcan.py \
        --model_db_path ${input_folder}/${model}_${tissue}.db \
        --covariance ${input_folder}/${model}_${tissue}.txt.gz \
        --gwas_file ${input_folder}/${trait}.txt.gz \
        --snp_column rsid \
        --effect_allele_column eff_allele \
        --non_effect_allele_column ref_allele \
        --beta_column beta \
        --se_column se \
        --output_file ${out_path}/${trait}_${model}_${tissue}.csv

In [11]:
# Part 3 Mendelian Randomization (MR-JTI)
[MR]
# Path to dataframe of GWAS and eQTL summary statistics 
parameter: df_path = path
# Total number of genes tested 
parameter: n_genes = int
input: f"{df_path}"
output: f"{out_path}/{df_path:nb}.csv"
bash: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout',container = container
    Rscript  /opt/MR_JTI/MR-JTI.r \
        --df_path ${df_path} \
        --n_genes ${n_genes} \
        --out_path ${_output}