# HDL PRS example

Here we show an example of our pipeline for HDL PRS on UK Biobank samples. We use both the effect estimates from the MVP lipid traits analysis and the posterior effects generated by the `mashr` package.

## Data used

### Reference panel

Obtained via `download_1000G()` in `bigsnpr`. 

The panel includes 503 (mostly unrelated) European individuals and ~1.7M SNPs in common with either HapMap3 or the UK Biobank. The classification of European populations can be found at [IGSR](https://www.internationalgenome.org/category/population/). The IDs of the European individuals are from the [IGSR data portal](https://www.internationalgenome.org/data-portal/sample).

### GWAS summary statistics data

The summary statistics come from MVP. We have the original GWAS summary data as well as multivariate posterior estimates of HDL effects computed with [mashr](https://github.com/stephenslab/mashr). In brief, we have two versions of summary statistics (effect estimates) for HDL.

### Target test data: UK Biobank

We randomly select 2000 individuals from UK Biobank who have covariates and an HDL phenotype (medication-adjusted, inverse-normalized), and extract their genotypes. See the `UKB.QC.*` PLINK file bundle.

### PRS Models

The auto model runs the algorithm for 30 different values of $p$ (the proportion of causal variants), ranging from 10e-4 to 0.9, and uses the heritability $h^2$ estimated by LD score regression as the initial value.
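
As a rough illustration of what a sequence of 30 values of $p$ between these bounds might look like, the sketch below prints log-spaced values. The log spacing and the helper name `log_seq` are assumptions made for illustration; the text does not state how the pipeline spaces the values. The bounds are passed as written above (note that `10e-4` parses as 0.001).

```shell
# Hypothetical helper (not part of ldpred.ipynb): print n log-spaced values
# from lo to hi, both inclusive. Requires n >= 2.
log_seq() {
    awk -v lo="$1" -v hi="$2" -v n="$3" 'BEGIN {
        step = (log(hi) - log(lo)) / (n - 1)
        for (i = 0; i < n; i++) {
            # interpolate linearly on the log scale, then exponentiate
            printf "%.6g\n", exp(log(lo) + i * step)
        }
    }'
}

# Show the first three of the 30 candidate values
log_seq 10e-4 0.9 30 | sed -n '1,3p'
```

Log spacing puts more candidates near small $p$, which is where sparse genetic architectures live; a linear grid would spend most of its points on large proportions.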

The grid model tries a grid of parameters: $p$ ranging from 0 to 1, and three values of $h^2$, namely 0.7, 1, and 1.4 times the initial $h^2$ estimated by LD score regression.
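
To make the $h^2$ part of the grid concrete, here is a minimal sketch that expands one initial estimate into the three grid values. The helper name `h2_grid` and the input value 0.2 are made up for illustration; the real initial $h^2$ comes from the LD score regression step.

```shell
# Hypothetical helper: given an initial h2 estimate, print the three
# grid values 0.7 * h2, 1 * h2 and 1.4 * h2, one per line.
h2_grid() {
    awk -v h2="$1" 'BEGIN {
        n = split("0.7 1 1.4", mult, " ")
        for (i = 1; i <= n; i++) printf "%.4g\n", mult[i] * h2
    }'
}

# Illustrative value only, not an estimate from this analysis
h2_grid 0.2
# prints:
# 0.14
# 0.2
# 0.28
```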

## Analysis of MVP GWAS data

### Step 1: QC on reference panel

Here we assume that QC on the target data has already been performed. We perform QC on the reference panel:

In [1]:
work_dir=mvp_gwas
cd ~/Documents/PRS_MASH

In [2]:
sos run ldpred.ipynb snp_qc \
    --cwd $work_dir \
    --genoFiles 1000G.EUR/1000G.EUR.bed


### Step 2: Intersect SNPs among summary stats, reference panel and target data

In [3]:
work_dir=mvp_gwas
lipid=tg
data=gwas
cd ~/Documents/PRS_MASH

In [4]:
sos run ldpred.ipynb snp_intersect \
    --cwd $work_dir \
    --ss mvpdata/$data"_"$lipid.rds \
    --genoFiles $work_dir/1000G.EUR.$work_dir.bed UKBB_broad/ukbb_merged.5000_subset.bed -s force

INFO: Running snp_intersect_1: SNP intersect of summary stats and genotype data
INFO: snp_intersect_1 is completed.
INFO: snp_intersect_1 output:   /home/surbut/Documents/PRS_MASH/mvp_gwas/gwas_tg.intersect.rds /home/surbut/Documents/PRS_MASH/mvp_gwas/gwas_tg.intersect.snplist
INFO: Running snp_intersect_2: 
INFO: snp_intersect_2 is completed (pending nested workflow).
INFO: Running preprocess_1: Filter SNPs and select individuals
INFO: preprocess_1 (index=0) is completed.
INFO: preprocess_1 (index=1) is completed.
INFO: preprocess_1 output:   /home/surbut/Documents/PRS_MASH/mvp_gwas/1000G.EUR.mvp_gwas.snp_intersect.extracted.bed /home/surbut/Documents/PRS_MASH/mvp_gwas/ukbb_merged.5000_subset.snp_intersect.extracted.bed in 2 groups
INFO: Running convert PLNIK to bigsnpr format with missing data mean imputed: 
INFO: convert PLNIK to bigsnp

In [5]:
tail -1 $work_dir/$data"_"$lipid.intersect.stdout

[1] "There are 448077 shared SNPs."
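
The idea behind this step can be sketched on plain SNP ID lists: keep only the IDs that occur in every input. The helper `snp_intersect_ids` and the toy file names below are hypothetical and are not how `snp_intersect` is implemented in `ldpred.ipynb`; they only illustrate the set intersection.

```shell
# Hypothetical helper: print the SNP IDs (first column) that appear in
# every file given on the command line, sorted.
snp_intersect_ids() {
    awk '
        FNR == 1 { nfiles++ }                     # new input file started
        !seen[FILENAME, $1]++ { count[$1]++ }     # count each ID once per file
        END { for (id in count) if (count[id] == nfiles) print id }
    ' "$@" | sort
}

# Toy inputs standing in for summary stats, reference panel and target data
tmp=$(mktemp -d)
printf 'rs1\nrs2\nrs3\n' > "$tmp/sumstats.snplist"
printf 'rs2\nrs3\nrs4\n' > "$tmp/reference.snplist"
printf 'rs3\nrs2\nrs5\n' > "$tmp/target.snplist"

snp_intersect_ids "$tmp/sumstats.snplist" "$tmp/reference.snplist" "$tmp/target.snplist"
# prints:
# rs2
# rs3
```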


### Step 3: Harmonize alleles for shared SNPs

This step handles swapped major/minor alleles and strand flips, together with the resulting sign flips in the summary statistics.
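
A simplified sketch of the harmonization logic for a single SNP is shown below. It is not the `snp_match` workflow itself: the helper name `harmonize` and its argument order are hypothetical, and dropping strand-ambiguous A/T and C/G SNPs is a common convention assumed here rather than something stated in this document.

```shell
# Hypothetical helper, not the snp_match step of ldpred.ipynb.
# Args: ss_a1 ss_a0 beta ref_a1 ref_a0
# Prints the effect size aligned to the reference alleles, or NA if the
# SNP cannot be matched or is strand-ambiguous.
harmonize() {
    awk -v a1="$1" -v a0="$2" -v beta="$3" -v r1="$4" -v r0="$5" '
    function comp(x) {               # complementary base
        if (x == "A") return "T"
        if (x == "T") return "A"
        if (x == "C") return "G"
        if (x == "G") return "C"
        return "N"
    }
    BEGIN {
        if (comp(a1) == a0) print "NA"                           # ambiguous A/T or C/G
        else if (a1 == r1 && a0 == r0) print beta + 0            # same alleles
        else if (a1 == r0 && a0 == r1) print 0 - beta            # alleles swapped
        else if (comp(a1) == r1 && comp(a0) == r0) print beta + 0   # strand flip
        else if (comp(a1) == r0 && comp(a0) == r1) print 0 - beta   # flip and swap
        else print "NA"                                          # no match
    }'
}

harmonize A G 0.05 G A   # alleles swapped, prints -0.05
harmonize A G 0.05 T C   # strand flip only, prints 0.05
```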

In [6]:
sos run ldpred.ipynb snp_match \
    --cwd $work_dir \
    --reference_geno $work_dir/1000G.EUR.$work_dir.snp_intersect.extracted.rds \
    --ss mvpdata/$data"_"$lipid.rds -s force

INFO: Running snp_match: 
INFO: snp_match is completed.
INFO: snp_match output:   /home/surbut/Documents/PRS_MASH/mvp_gwas/gwas_tg.snp_matched.rds /home/surbut/Documents/PRS_MASH/mvp_gwas/gwas_tg.snp_matched.snplist
INFO: Workflow snp_match (ID=wf804c0b6f6acbd48) is executed successfully with 1 completed step.


### Step 4: Calculate LD matrix and fit LDSC model

In [7]:
sos run ldpred.ipynb ldsc \
    --cwd $work_dir \
    --ss $work_dir/$data"_"$lipid.snp_matched.rds \
    --reference-geno $work_dir/1000G.EUR.$work_dir.snp_intersect.extracted.rds -s force

INFO: Running ldsc: 
INFO: ldsc is completed.
INFO: ldsc output:   /home/surbut/Documents/PRS_MASH/mvp_gwas/gwas_tg.snp_matched.ld.rds
INFO: Workflow ldsc (ID=we8eff4db65db3c9b) is executed successfully with 1 completed step.


### Step 6: Estimate posterior effect sizes and PRS

For the original GWAS data:

In [8]:
sos run ldpred.ipynb inf_prs \
    --cwd $work_dir \
    --ss $work_dir/$data"_"$lipid.snp_matched.rds \
    --target-geno $work_dir/ukbb_merged.5000_subset.snp_intersect.extracted.rds \
    --ldsc $work_dir/$data"_"$lipid.snp_matched.ld.rds -s force

INFO: Running inf_prs: 
INFO: inf_prs is completed (pending nested workflow).
INFO: Running prs_core: 
INFO: prs_core is completed.
INFO: prs_core output:   /home/surbut/Documents/PRS_MASH/mvp_gwas/gwas_tg.snp_matched.inf_prs.rds
INFO: inf_prs output:   /home/surbut/Documents/PRS_MASH/mvp_gwas/gwas_tg.snp_matched.inf_prs.rds
INFO: Workflow inf_prs (ID=wd3dd1fe41a7c33a0) is executed successfully with 2 completed steps.


In [9]:
tail -1 mvp_gwas/$data"_"$lipid.snp_matched.inf_prs.stdout

[1] "422921 SNPs are used for PRS calculations"
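
Once posterior effect sizes are available, a polygenic score for one individual is the sum over SNPs of effect size times allele dosage. The sketch below shows that arithmetic on three made-up SNPs; the helper name `prs_score` and the numbers are purely illustrative and unrelated to the 422921 SNPs used above.

```shell
# Hypothetical helper: read "beta dosage" pairs from stdin (one SNP per line)
# and print the weighted sum, i.e. the score for one individual.
prs_score() {
    awk '{ s += $1 * $2 } END { printf "%.4f\n", s }'
}

# Three toy SNPs: 0.10*2 + (-0.05)*1 + 0.20*0
printf '0.10 2\n-0.05 1\n0.20 0\n' | prs_score
# prints 0.1500
```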


### Step 7: Predict phenotypes

Baseline model: Traits ~ Sex + Age

In [10]:
echo $lipid

tg


In [11]:
sos run ldpred.ipynb pred_eval \
    --cwd $work_dir \
    --phenoFile UKBB_broad/UKB.$lipid.cov \
    --covFile UKBB_broad/UKB.ind.cov \
    --response continuous -s force

INFO: Running pred_eval: 
INFO: pred_eval is completed.
INFO: pred_eval output:   /home/surbut/Documents/PRS_MASH/mvp_gwas/UKB.tg.baseline.rds
INFO: Workflow pred_eval (ID=wd5dfe822af5ca266) is executed successfully with 1 completed step.


In [12]:
setwd("~/Documents/PRS_MASH")
lipid="tg"
res = readRDS(paste0("mvp_gwas/UKB.",lipid,".baseline.rds"))
summary(res$fitted)
res$summary


Call:
lm(formula = ., data = dat[train.ind, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5856 -0.6484 -0.2391  0.4010  7.5583 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.100692   0.112827   9.756  < 2e-16 ***
AGE         0.007420   0.001963   3.780 0.000159 ***
SEX         0.473554   0.032322  14.651  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9918 on 3785 degrees of freedom
Multiple R-squared:  0.0574,	Adjusted R-squared:  0.0569 
F-statistic: 115.2 on 2 and 3785 DF,  p-value: < 2.2e-16


| model | R2 | MSE |
|:------|-------:|--------:|
| model | 0.0569 | 0.89706 |


Inf/grid/auto model: Traits ~ Sex + Age + PRS

In [13]:
sos run ldpred.ipynb pred_eval \
    --cwd $work_dir \
    --prs $work_dir/$data"_"$lipid.snp_matched.inf_prs.rds \
    --phenoFile UKBB_broad/UKB.$lipid.cov \
    --covFile UKBB_broad/UKB.ind.cov \
    --response continuous -s force

INFO: Running pred_eval: 
INFO: pred_eval is completed.
INFO: pred_eval output:   /home/surbut/Documents/PRS_MASH/mvp_gwas/UKB.tg.gwas_tg.snp_matched.inf_prs.rds
INFO: Workflow pred_eval (ID=wb8eeda1e8a2d713f) is executed successfully with 1 completed step.


In [14]:
data="gwas"
res = readRDS(paste0("mvp_gwas/UKB.",lipid,".",data,"_",lipid,".snp_matched.inf_prs.rds"))
summary(res$fitted)
res$summary


Call:
lm(formula = ., data = dat[train.ind, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8820 -0.6095 -0.2046  0.3603  7.1212 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.259445   0.109326  11.520  < 2e-16 ***
AGE          0.007267   0.001895   3.835 0.000127 ***
SEX          0.469264   0.031201  15.040  < 2e-16 ***
PRS         -0.828117   0.049661 -16.675  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9574 on 3784 degrees of freedom
Multiple R-squared:  0.1219,	Adjusted R-squared:  0.1212 
F-statistic: 175.1 on 3 and 3784 DF,  p-value: < 2.2e-16


| model | R2 | MSE |
|:--------------|--------:|-------:|
| model.inf_prs | 0.12123 | 0.8499 |
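
One simple way to read the baseline and PRS summaries together is the gain in R2 from adding the PRS term. The sketch below does that subtraction with the two R2 values reported above; the helper name `r2_gain` is hypothetical.

```shell
# Hypothetical helper: difference in R2 between a model with PRS and the
# covariate-only baseline.
r2_gain() {
    awk -v base="$1" -v full="$2" 'BEGIN { printf "%.5f\n", full - base }'
}

# Baseline (Sex + Age) vs. Sex + Age + PRS, values taken from the summaries above
r2_gain 0.0569 0.12123
# prints 0.06433
```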


## Bonus steps: repeat Steps 6 and 7 using other PRS models

In [None]:
sos run ldpred.ipynb grid_prs \
    --cwd $work_dir \
    --ss $work_dir/$data"_"$lipid.snp_matched.rds \
    --target-geno $work_dir/ukbb_merged.5000_subset.snp_intersect.extracted.rds \
    --ldsc $work_dir/$data"_"$lipid.snp_matched.ld.rds \
    --phenoFile UKBB_broad/UKB.$lipid.cov \
    --covFile UKBB_broad/UKB.ind.cov \
    --response continuous -s force

In [None]:
sos run ldpred.ipynb auto_prs \
    --cwd $work_dir \
    --ss $work_dir/$data"_"$lipid.snp_matched.rds \
    --target-geno $work_dir/ukbb_merged.5000_subset.snp_intersect.extracted.rds \
    --ldsc $work_dir/$data"_"$lipid.snp_matched.ld.rds -s force

In [None]:
sos run ldpred.ipynb pred_eval \
    --cwd $work_dir \
    --prs $work_dir/$data"_"$lipid.snp_matched.grid_prs.rds \
    --phenoFile UKBB_broad/UKB.$lipid.cov \
    --covFile UKBB_broad/UKB.ind.cov \
    --response continuous -s force

In [None]:
res = readRDS(paste0("mvp_gwas/UKB.",lipid,".",data,"_",lipid,".snp_matched.grid_prs.rds"))
summary(res$fitted)
res$summary

In [None]:
sos run ldpred.ipynb pred_eval \
    --cwd $work_dir \
    --prs $work_dir/$data"_"$lipid.snp_matched.auto_prs.rds \
    --phenoFile UKBB_broad/UKB.$lipid.cov \
    --covFile UKBB_broad/UKB.ind.cov \
    --response continuous -s force

In [None]:

res = readRDS(paste0("mvp_gwas/UKB.",lipid,".",data,"_",lipid,".snp_matched.auto_prs.rds"))
summary(res$fitted)
res$summary