Lan Ao edited this page May 5, 2023 · 22 revisions

General

Q. Are there any presentations or slides about HDL?

A. Yes! A short presentation of the main ideas and results of HDL is available here (minutes 37 to 49 of the recording). The presentation was given at the European Mathematical Genetics Meeting (EMGM) 2020.

A shorter 5-minute version is available here. This video was selected to be played at the poster session of the 6th International Conference on Quantitative Genetics (ICQG6) 2020.

Slides about the HDL software are available here.

Q. How many samples do I need to estimate rg between a pair of traits?

A. It largely depends on the traits. For traits with stronger signals, such as omics traits, a few thousand individuals may be sufficient. For complex traits such as height and weight, you may need a much larger sample size.

Q. Can HDL perform partitioned heritability analysis?

A. We are actively working on it. It will be coming soon.

Reference panel

Q. How can I compute a reference panel with my own data?

A. Here is the pipeline for reference panel building. There are three key steps:

  1. QC and cutting chromosomes into pieces for parallelization;
  2. Computing LD between SNPs using plink;
  3. Performing eigendecomposition on the LD matrix and taking the leading eigenvalues and corresponding eigenvectors.
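Step 3 keeps only the leading part of the spectrum. The following is a minimal Python sketch of that selection, assuming that "leading" means the smallest set of top eigenvalues whose sum reaches a chosen fraction of the total. The function name `leading_eigen_count`, the `cut` parameter, and the toy eigenvalues are hypothetical and are not taken from the HDL pipeline.

```python
def leading_eigen_count(eigenvalues, cut=0.99):
    """Return how many leading eigenvalues are needed to reach `cut` of the total.

    Assumption: eigenvalues are non-negative (an LD matrix is positive
    semi-definite) and `cut` is a fraction in (0, 1].
    """
    if not 0 < cut <= 1:
        raise ValueError("cut must be in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)  # largest first
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= cut * total - 1e-12:  # small tolerance for float rounding
            return count
    return len(ordered)


# Toy spectrum of a 6-SNP LD block (made-up numbers that sum to 6).
toy_eigenvalues = [3.0, 1.5, 0.9, 0.4, 0.15, 0.05]
k = leading_eigen_count(toy_eigenvalues, cut=0.9)
print(k, toy_eigenvalues[:k])
```

The eigenvectors kept would be the ones paired with the first `k` eigenvalues in the sorted order.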

Q. A lot of SNPs in the reference panel are absent in my GWAS. Can I believe HDL results in such a scenario? How should I fix this?

A. All SNPs in the reference panel are carefully QCed common SNPs (MAF > 5%) in populations of European ancestry, so we expect most of them to be available in current GWASs. As a safeguard, HDL prints a warning when more than 1% of the panel SNPs are absent, which is about 10K SNPs for the imputed panel. The test statistic in HDL is robust against missing SNPs: when many SNPs are missing, HDL becomes more conservative, meaning it loses some power but does not generate false-positive results. No further treatment is therefore needed if the missing rate is mild. However, if many panel SNPs are absent in the GWAS, the point estimate of genetic correlation can be biased. In that case, consider switching to a smaller reference panel, such as the one with HapMap2 SNPs.

The above results are based on simulations using the 435 pairs of traits in the HDL paper. We set three SNP missing rates (1%, 2% and 5%) in each GWAS and performed 10 simulations for each pair of traits at each missing rate. The median HDL result across the 10 simulations was taken as representative. The simulation results are summarized in the figure below, where the x-axis shows the HDL results without missing SNPs and the y-axis shows the HDL results with missing SNPs.

[Figure: HDL.missing.SNPs.sim]
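Before running HDL, the missing rate can be checked by hand. The sketch below is one way to do it in Python, assuming both SNP lists use the same identifiers. The function `panel_missing_rate` and the rsIDs are hypothetical, and only the 1% threshold comes from the answer above.

```python
def panel_missing_rate(panel_snps, gwas_snps):
    """Fraction of reference-panel SNPs that do not appear in the GWAS."""
    panel = set(panel_snps)
    if not panel:
        raise ValueError("reference panel SNP list is empty")
    missing = panel - set(gwas_snps)
    return len(missing) / len(panel)


# Hypothetical rsIDs used only for illustration.
panel_ids = [f"rs{i}" for i in range(1, 1001)]   # 1,000 panel SNPs
gwas_ids = [f"rs{i}" for i in range(16, 1201)]   # lacks rs1..rs15

rate = panel_missing_rate(panel_ids, gwas_ids)
WARN_THRESHOLD = 0.01  # the 1% level at which HDL is described as warning
if rate > WARN_THRESHOLD:
    print(f"{rate:.1%} of panel SNPs are absent from the GWAS")
```

SNPs that appear only in the GWAS do not affect this rate, because it is computed relative to the panel.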

Q. Can't HDL solve the missing SNPs by subsetting the reference LD score matrix?

A. We performed eigendecomposition on the LD matrices, which makes them difficult to subset: a subset of the rows of an eigenvector matrix is no longer a proper eigenvector matrix.
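A toy case makes this concrete. The Python sketch below uses a made-up 3-SNP LD matrix with pairwise correlation 0.5, whose eigenvalues (2, 0.5, 0.5) and eigenvectors are known in closed form, and drops one SNP from the eigenvector matrix. All names and numbers are illustrative assumptions and are not part of HDL.

```python
import math

# Toy 3-SNP LD matrix with pairwise correlation 0.5 (made-up example).
# Its eigenvalues are 2, 0.5, 0.5 with the orthonormal eigenvectors below.
eigenvalues = [2.0, 0.5, 0.5]
eigenvectors = [  # stored as a list of columns, one column per eigenvector
    [1 / math.sqrt(3), 1 / math.sqrt(3), 1 / math.sqrt(3)],
    [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0],
    [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)],
]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def drop_snp(columns, index):
    """Remove one SNP (one row) from every eigenvector column."""
    return [[x for i, x in enumerate(col) if i != index] for col in columns]


subset = drop_snp(eigenvectors, index=2)  # pretend the third SNP is missing

# Full matrix: columns have unit length and are mutually orthogonal.
full_norms = [dot(c, c) for c in eigenvectors]
full_cross = dot(eigenvectors[0], eigenvectors[2])

# Subset: squared lengths shrink and the first and third columns become parallel.
sub_norms = [dot(c, c) for c in subset]
sub_cross = dot(subset[0], subset[2])

print("full   norms:", [round(n, 3) for n in full_norms], "cross:", round(full_cross, 3))
print("subset norms:", [round(n, 3) for n in sub_norms], "cross:", round(sub_cross, 3))
```

After the third SNP is dropped, the remaining 2-SNP LD matrix has eigenvalues 1.5 and 0.5, so neither the stored eigenvalues nor the truncated columns describe it correctly.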

Q. I work in mainland China. The Dropbox links for reference panels do not work for me. Is there any way I can access the files?

A. We have a copy of the reference panels hosted on Baidu Netdisk:

rg estimation

Q. The rg is estimated to be above 1 or below -1. What is going on?

A. The estimated rg is a combination of the true rg and estimation variation. When the true rg is close to the boundary (-1 or 1) and/or the variation is large, the estimated rg can go beyond the boundary. In rg estimation, common causes of large variation are:

  1. at least one of the h2 estimates is very low;
  2. small sample size;
  3. many SNPs in the reference panel are absent in one of the two GWASs;
  4. there is a severe mismatch between the GWAS population and the population used to compute the reference panel.
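The first cause in the list above can be seen from the definition of rg as the genetic covariance divided by the square root of the product of the two heritabilities. The Python sketch below applies the same absolute estimation error to a high-h2 and a low-h2 scenario. All numbers are made up, and the function `genetic_correlation` is a hypothetical helper, not an HDL function.

```python
import math


def genetic_correlation(h2_1, h2_2, gen_cov):
    """rg = genetic covariance / sqrt(h2_1 * h2_2), with no truncation to [-1, 1]."""
    if h2_1 <= 0 or h2_2 <= 0:
        raise ValueError("both heritability estimates must be positive")
    return gen_cov / math.sqrt(h2_1 * h2_2)


# Made-up values: the true rg is 0.9 in both scenarios, and every
# estimate is off by the same absolute error of 0.01.
error = 0.01
for true_h2 in (0.50, 0.03):
    true_cov = 0.9 * true_h2              # both traits share the same h2
    rg_hat = genetic_correlation(
        true_h2 - error,                  # h2 of trait 1 underestimated
        true_h2 - error,                  # h2 of trait 2 underestimated
        true_cov + error,                 # genetic covariance overestimated
    )
    print(f"true h2 = {true_h2:.2f}: estimated rg = {rg_hat:.3f}")
```

With a small denominator, the same error moves the ratio much further, which is why low h2 estimates often accompany rg estimates outside [-1, 1].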

h2 estimation

Q. How do HDL h2 estimations compare to those by LDSC or linear mixed model (LMM)?

A. As with rg estimation, h2 estimates from HDL are similar to LDSC results but more accurate. As an example, we extracted LMM estimates of h2 for 41 continuous traits in UK Biobank from Gene ATLAS. The h2 estimates given by HDL and LDSC are similar (A). Nevertheless, the HDL estimates are closer to the LMM results (B) than the LDSC estimates are (C), suggesting that LDSC may underestimate h2.

[Figure: HDL_LDSC_vs_LMM_ukbb_h2_41traits_2]

Q. Why do I get h2 = 0?

A. This may be due to:

  1. the true h2 is very low;
  2. small sample size;
  3. many SNPs in the reference panel are absent in one of the two GWASs;
  4. there is a severe mismatch between the GWAS population and the population used to compute the reference panel.

Q. I ran HDL between one trait and several other traits. Why do I get different h2 estimates for that trait?

A. This may be due to:

  1. different SNPs in the reference panel are absent in different GWASs;
  2. different eigen.cut values were used. Please see here for more details about eigen.cut. R users can check eigen.use in the output object to find which eigen.cut was used.

Q. Why does the - h2*lam/Nref term appear in the llfun function but not in the Supplementary Information of the HDL paper?

A. We developed this term after the Supplementary Information of the HDL paper was published. As Nref is normally large, the term is, in theory, negligible. The detailed derivation is given below.

Given the reference population with size $N_{ref}$, we have

$$\sum_{k=1}^M \mathrm{Cov}[\hat{r}_{jk}, \hat{r}_{j'k}] \approx \frac{M}{N_{ref}} r_{jj'},$$

based on Eq. (5) of the Supplementary Notes of the HDL paper. In practice, we can approximate the element $l_{jj'}$ in the $\symbf{L}$ matrix as follows:

$$\begin{aligned} l_{jj'} &= \sum_{k=1}^M r_{jk}r_{j'k} = \sum_{k=1}^M \left[\mathrm{E}[\hat{r}_{jk}]\mathrm{E}[\hat{r}_{j'k}]\right] \\\ &= \sum_{k=1}^M\left[\mathrm{E}[\hat{r}_{jk}\hat{r}_{j'k}] - \mathrm{Cov}[\hat{r}_{jk}, \hat{r}_{j'k}]\right] \\\ &= \mathrm{E}\left[ \sum_{k=1}^M[\hat{r}_{jk} \hat{r}_{j'k}] \right] - \sum_{k=1}^M \mathrm{Cov}[\hat{r}_{jk}, \hat{r}_{j'k}] \\\ & \approx \hat{l}_{jj'} - \frac{M}{N_{ref}}r_{jj'}. \end{aligned}$$

Therefore, the $\symbf{\Sigma}$ matrix in the HDL model becomes

$$\begin{aligned} \symbf{\Sigma} &= \frac{Nh^2}{M}\symbf{L} + \symbf{R} \\\ &=\frac{Nh^2}{M}\symbf{R}'\symbf{R} {\color{red} -\frac{Nh^2}{N_{ref}}\symbf{R}} + \symbf{R}\\\ \end{aligned}$$

where ${\color{red} -\frac{Nh^2}{N_{ref}}\symbf{R}}$ corresponds to the - h2*lam/Nref term in the currently implemented llfun function.

Although this additional term seems negligible, we found that including it can further improve estimation efficiency, for example for the genetic correlation parameter.
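To get a feel for the size of the correction, the expression for $\symbf{\Sigma}$ above can be evaluated eigenvalue by eigenvalue: for a symmetric LD matrix $\symbf{R}$ with eigenvalue $\lambda$, the matrix $\symbf{R}'\symbf{R}$ has eigenvalue $\lambda^2$. The Python sketch below does this with made-up values of N, M, h2 and Nref. The function `sigma_eigenvalue` is a hypothetical helper written for this illustration and is not the llfun implementation.

```python
def sigma_eigenvalue(lam, n_gwas, m_snps, h2, n_ref=None):
    """Eigenvalue of Sigma for an eigenvalue `lam` of a symmetric LD matrix R.

    Uses Sigma = (N h2 / M) R'R - (N h2 / Nref) R + R, so an eigenvalue lam
    of R maps to (N h2 / M) lam^2 - (N h2 / Nref) lam + lam.
    Pass n_ref=None to omit the finite-reference correction term.
    """
    value = n_gwas * h2 / m_snps * lam ** 2 + lam
    if n_ref is not None:
        value -= n_gwas * h2 / n_ref * lam
    return value


# Made-up settings, chosen only to show the order of magnitude.
N, M, H2, NREF = 50_000, 10_000, 0.3, 300_000
for lam in (1.0, 50.0, 500.0):
    without = sigma_eigenvalue(lam, N, M, H2)
    with_term = sigma_eigenvalue(lam, N, M, H2, n_ref=NREF)
    rel_change = (with_term - without) / without
    print(f"lam = {lam:6.1f}: relative change = {rel_change:.5f}")
```

Under these toy settings the correction lowers every eigenvalue of $\symbf{\Sigma}$ slightly, and the relative change is largest for small $\lambda$.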