FAQ

General

Q. Are there any presentations or slides about HDL?

A. Yes! A short presentation covering the main ideas and results of HDL is available here (minutes 37 to 49 of the recording). The presentation was given at the European Mathematical Genetics Meeting (EMGM) 2020.

A shorter 5-minute version is available here. This video was selected to be played at the poster session of the 6th International Conference on Quantitative Genetics (ICQG6) 2020.

Slides about the HDL software are available here.

Q. How many samples do I need to estimate r_g between a pair of traits?

A. It largely depends on the traits. For traits with strong signals, such as omics traits, a few thousand individuals may be sufficient. For complex traits such as height and weight, however, you may need a much larger sample size.

Q. Can HDL perform partitioned heritability analysis?

A. We are actively working on it. It will be coming soon.

Reference panel

Q. How can I compute a reference panel with my own data?

A. Here is the pipeline for reference panel building. There are three key steps:

QC and cutting chromosomes into pieces for parallelization;
Computing LD between SNPs using plink;
Performing eigendecomposition on the LD matrix and keeping the leading eigenvalues and their corresponding eigenvectors.
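
As a rough illustration of the last step, the sketch below counts how many of the largest eigenvalues are needed to reach a chosen share of their total. It is written in Python rather than R and is not HDL's code: the helper `leading_components`, its `cut` argument, and the toy eigenvalues are hypothetical, and it assumes that "leading" means the largest eigenvalues up to a cumulative share. The eigendecomposition itself is assumed to have been done already.

```python
def leading_components(eigenvalues, cut):
    """Return how many of the largest eigenvalues are needed so that
    their sum reaches at least `cut` (a fraction in (0, 1]) of the total.

    Hypothetical helper for illustration; not part of HDL.
    """
    if not 0 < cut <= 1:
        raise ValueError("cut must be in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)  # largest first
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= cut * total:
            return k
    return len(ordered)


# Toy eigenvalues of a small LD block (made up for illustration).
toy_eigenvalues = [0.5, 5.0, 1.5, 3.0]
k = leading_components(toy_eigenvalues, cut=0.9)
print(k)  # number of leading eigenvalues (and eigenvectors) to keep
```

Keeping only the first `k` eigenvalues and eigenvectors is what makes the stored panel much smaller than the full LD matrix.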

Q. A lot of SNPs in the reference panel are absent from my GWAS. Can I trust the HDL results in such a scenario? How should I fix this?

A. All SNPs in the reference panel are carefully QCed common SNPs (MAF > 5%) in individuals of European ancestry, so we expect most of them to be available in current GWASs. As a precaution, HDL prints a warning when more than 1% of the panel SNPs are absent, which is about 10K SNPs for the imputed panel. The test statistic in HDL is robust against missing SNPs: when many SNPs are missing, HDL becomes more conservative, meaning it loses some power but does not generate false-positive results. No further treatment is therefore needed if the missing rate is mild. However, if many panel SNPs are absent from the GWAS, the point estimate of the genetic correlation can be biased. In that case, consider switching to a smaller reference panel, such as the one with HapMap2 SNPs.

The above results are based on simulations using the 435 pairs of traits in the HDL paper. We set three SNP missing rates (1%, 2% and 5%) in each GWAS and performed 10 simulations for each pair of traits at each missing rate. The median HDL result across the 10 simulations was taken as representative. The simulation results are summarized in the figure below, where the x-axis shows the HDL results without missing SNPs and the y-axis shows the HDL results with missing SNPs.

[Figure: HDL.missing.SNPs.sim]
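
To check the missing rate before running HDL, something like the following Python sketch may help. It is not HDL's own check: the function names and the toy SNP IDs are hypothetical, and the only detail taken from the answer above is the 1% warning level.

```python
def missing_rate(panel_snps, gwas_snps):
    """Fraction of reference-panel SNP IDs that are absent from the GWAS."""
    panel = set(panel_snps)
    if not panel:
        raise ValueError("reference panel is empty")
    absent = panel - set(gwas_snps)
    return len(absent) / len(panel)


def too_many_missing(panel_snps, gwas_snps, threshold=0.01):
    """True when more than `threshold` of the panel SNPs are absent."""
    return missing_rate(panel_snps, gwas_snps) > threshold


# Toy example with made-up IDs: 25 of 1000 panel SNPs are absent.
panel = [f"rs{i}" for i in range(1000)]
gwas = panel[:975]
rate = missing_rate(panel, gwas)
print(rate, too_many_missing(panel, gwas))
```

For scale, 1% of the 1,029,876 SNPs in the imputed HapMap3 panel is roughly 10,300 SNPs, which matches the "about 10K" figure quoted above.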

Q. Can't HDL handle missing SNPs by subsetting the reference LD score matrix?

A. We performed eigendecomposition on the LD matrices, which makes them difficult to subset, because a subset of the eigenvector matrix is no longer a proper eigenvector matrix.
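
A small numerical illustration of this point, as a Python sketch on a made-up 3-SNP LD matrix whose eigenvectors are known in closed form (nothing here comes from an HDL panel): the columns of the full eigenvector matrix are orthonormal, but after dropping the row of one SNP they no longer are.

```python
import math

# Toy LD matrix with equal pairwise correlation 0.5 (made up):
#   R = [[1.0, 0.5, 0.5],
#        [0.5, 1.0, 0.5],
#        [0.5, 0.5, 1.0]]
# Its eigenvalues are 2.0, 0.5, 0.5, and the columns of U below are
# the matching orthonormal eigenvectors.
U = [[1 / math.sqrt(3),  1 / math.sqrt(2),  1 / math.sqrt(6)],
     [1 / math.sqrt(3), -1 / math.sqrt(2),  1 / math.sqrt(6)],
     [1 / math.sqrt(3),  0.0,              -2 / math.sqrt(6)]]


def gram(matrix):
    """Return the dot products between all pairs of columns."""
    n_cols = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n_cols)]
            for i in range(n_cols)]


def is_identity(matrix, tol=1e-9):
    """True when the matrix equals the identity up to `tol`."""
    return all(abs(value - (1.0 if i == j else 0.0)) <= tol
               for i, row in enumerate(matrix)
               for j, value in enumerate(row))


full_ok = is_identity(gram(U))            # columns are orthonormal
U_subset = U[:2]                          # drop the row of SNP 3
subset_ok = is_identity(gram(U_subset))   # orthonormality is lost
print(full_ok, subset_ok)
```

Recovering a valid decomposition for the remaining SNPs would mean redoing the eigendecomposition on the subsetted LD matrix.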

Q. I work in mainland China. The Dropbox links for reference panels do not work for me. Is there any way I can access the files?

A. We have a copy of the reference panels hosted on Baidu Netdisk:

1,029,876 QCed UK Biobank imputed HapMap3 SNPs. Extraction code: qcwe
769,306 QCed UK Biobank imputed HapMap2 SNPs. Extraction code: 86fg
307,519 QCed UK Biobank Axiom Array SNPs. Extraction code: 8wrb

r_g estimation

Q. The r_g is estimated to be above 1 or below -1. What is going on?

A. The estimated r_g is the true r_g plus sampling variation. When the true r_g is close to the boundary (-1 or 1) and/or the variation is large, the estimated r_g can go beyond the boundary. Common causes of large variation in r_g estimation are:

at least one of the h² estimates is very low;
small sample size;
many SNPs in the reference panel are absent in one of the two GWASs;
there is a severe mismatch between the GWAS population and the population used to compute the reference panel.
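
The effect of a low h² estimate can be seen with a little arithmetic. The Python sketch below assumes the usual definition of genetic correlation, the genetic covariance divided by the square root of the product of the two heritabilities. All numbers are made up and the function is not part of HDL.

```python
import math


def genetic_correlation(gen_cov, h2_1, h2_2):
    """r_g = genetic covariance / sqrt(h2_1 * h2_2)."""
    if h2_1 <= 0 or h2_2 <= 0:
        raise ValueError("both h2 values must be positive")
    return gen_cov / math.sqrt(h2_1 * h2_2)


# Made-up true values: low heritabilities and a true r_g of 0.9.
true_rg = genetic_correlation(gen_cov=0.009, h2_1=0.01, h2_2=0.01)

# Same covariance, but one h2 estimate is too low by 0.003.
noisy_rg = genetic_correlation(gen_cov=0.009, h2_1=0.007, h2_2=0.01)

print(round(true_rg, 3), round(noisy_rg, 3))
```

Because the h² estimates sit in the denominator, a small absolute error in a low h² is enough to push the ratio above 1.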

h² estimation

Q. How do HDL h² estimates compare with those from LDSC or a linear mixed model (LMM)?

A. As with r_g estimation, h² estimates from HDL are similar to LDSC results but more accurate. As an example, we extracted LMM estimates of h² for 41 continuous traits in UK Biobank from Gene ATLAS. The h² estimates given by HDL and LDSC are similar (A). Nevertheless, the HDL estimates are closer to the LMM results (B) than the LDSC estimates are (C), suggesting that LDSC may underestimate h².

[Figure: HDL_LDSC_vs_LMM_ukbb_h2_41traits_2]

Q. Why do I get h² = 0?

A. This may be due to:

the true h² is very low;
small sample size;
many SNPs in the reference panel are absent in one of the two GWASs;
there is a severe mismatch between the GWAS population and the population used to compute the reference panel.

Q. I ran HDL between one trait and several other traits. Why do I get different h² estimates for that trait?

A. This may be due to:

different SNPs in the reference panel are absent in different GWASs;
a different eigen.cut was used. Please see here for more details about eigen.cut. R users can check eigen.use in the output object to find which eigen.cut was used.

Q. Why does the - h2*lam/Nref term appear in the llfun function but not in the Supplementary Information of the HDL paper?

A. We developed this term after the Supplementary Information of the HDL paper was published. As Nref is normally large, the term is negligible in theory. The detailed derivation is given below.

Given the reference population with size $N_{ref}$, we have

$$\sum_{k=1}^M \mathrm{Cov}[\hat{r}_{jk}, \hat{r}_{j'k}] \approx \frac{M}{N_{ref}} r_{jj'},$$

based on Eq. (5) of the Supplementary Notes of the HDL paper. In practice, we can approximate the element $l_{jj'}$ in the $\symbf{L}$ matrix as follows:

$$\begin{aligned}
l_{jj'} &= \sum_{k=1}^M r_{jk}r_{j'k} = \sum_{k=1}^M \left[\mathrm{E}[\hat{r}_{jk}]\mathrm{E}[\hat{r}_{j'k}]\right] \\
&= \sum_{k=1}^M\left[\mathrm{E}[\hat{r}_{jk}\hat{r}_{j'k}] - \mathrm{Cov}[\hat{r}_{jk}, \hat{r}_{j'k}]\right] \\
&= \mathrm{E}\left[ \sum_{k=1}^M[\hat{r}_{jk} \hat{r}_{j'k}] \right] - \sum_{k=1}^M \mathrm{Cov}[\hat{r}_{jk}, \hat{r}_{j'k}] \\
&\approx \hat{l}_{jj'} - \frac{M}{N_{ref}}r_{jj'}.
\end{aligned}$$

Therefore, the $\symbf{\Sigma}$ matrix in the HDL model becomes

$$\begin{aligned}
\symbf{\Sigma} &= \frac{Nh^2}{M}\symbf{L} + \symbf{R} \\
&= \frac{Nh^2}{M}\symbf{R}'\symbf{R} {\color{red} -\frac{Nh^2}{N_{ref}}\symbf{R}} + \symbf{R},
\end{aligned}$$

where ${\color{red} -\frac{Nh^2}{N_{ref}}\symbf{R}}$ corresponds to the - h2*lam/Nref term in the currently implemented llfun function.

Although this additional term appears negligible, we found that including it can further improve estimation efficiency, for example for the genetic correlation parameter.
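
To get a feel for the size of the correction, the Python sketch below evaluates the formula for $\symbf{\Sigma}$ along one eigenvector of $\symbf{R}$. It assumes $\symbf{R}$ is symmetric, so that an eigenvalue $\lambda$ of $\symbf{R}$ gives $\lambda^2$ for $\symbf{R}'\symbf{R}$. The function name and all input values are made up, and llfun itself may scale or parameterise these quantities differently.

```python
def sigma_eigenvalue(lam, N, M, h2, Nref=None):
    """Eigenvalue of Sigma for an LD eigenvalue `lam`.

    Uses Sigma = (N*h2/M) R'R - (N*h2/Nref) R + R with R symmetric.
    Passing Nref=None drops the correction term.
    """
    value = N * h2 / M * lam ** 2 + lam
    if Nref is not None:
        value -= N * h2 / Nref * lam
    return value


# Made-up inputs for illustration only.
N, M, h2, Nref = 100_000, 1_000, 0.5, 300_000
lam = 20.0

without = sigma_eigenvalue(lam, N, M, h2)
with_term = sigma_eigenvalue(lam, N, M, h2, Nref)
print(without, with_term, without - with_term)
```

With these inputs the correction changes the value by only a few units out of about twenty thousand, consistent with the term being small when Nref is large.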
