Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recalculating Rsq Imputation Quality Values #26

Open
maegsul opened this issue Nov 8, 2019 · 6 comments
Open

Recalculating Rsq Imputation Quality Values #26

maegsul opened this issue Nov 8, 2019 · 6 comments

Comments

@maegsul
Copy link

maegsul commented Nov 8, 2019

Hi,

First of all, thanks a lot for developing Minimac4, it is a great tool.

I have an issue regarding recalculating (and replicating) Rsq (imputation quality) values. I am interested in this because I want to recalculate/update Rsq values per variant for a subset of samples in a new VCF.

As far as I know, the formula should be: https://genome.sph.umich.edu/wiki/Minimac3_Info_File#Rsq and the actual code is this I think: https://github.com/statgen/Minimac4/blob/8c1641e43b63d005e3a6d12901d29ccb5ad697b1/src/ImputationStatistics.cpp (the part that starts with "double ImputationStatistics::Rsq(int marker)").

I recently started using Minimac4 available on Michigan Imputation Server and realized that different than before, now they also output HDS (Estimated Haploid Alternate Allele Dosage) values. I made use of these values, and come up with below bcftools/awk one-liner to calculate Rsq values:

bcftools query -f "%ID\t%AF\t%R2[\t%HDS]\n" chr1.dose.vcf.gz | sed 's/,/\t/g' | awk '{diffSumSq = "" ; for(i=4; i<=NF; i++) diffSumSq += ($i - ($2 + 1e-30))^2 ; printf $1"\t"$2"\t"$3"\t""%.5f\n",(diffSumSq/(NF-3)/(($2 + 1e-30)*(1 - ($2 + 1e-30))))}'

Here I first extract these information from the VCF: ID, AF and R2 (from INFO field) and HDS values from sample fields. Then I convert comma-separated HDS values into tab-separated values, so for instance if I have 1000 samples (and 1000 HDS values), now I have 2000 tab-separated HDS values for the whole cohort.

Then with awk, I substract AF from each HDS values and take the square of this, followed by summing all these values ("diffSumSq"), then I divide this by number of HDS values (2n) times AF x (1-AF). To avoid problems of AF being 0, I add 1e-30 to AF values as shown in ImputationStatistics.cpp.

I follow a very similar approach in R too and the resulting new Rsq values are the same as in the bcftools/awk approach.

When done on the same samples, I also replicate the original Rsq values (or I reach to a number that is very close to the original Rsq), however when AF is low, I am obtaining very different Rsq values than the originals. This is probably caused by p x (1-p), that is creating a very small number.

In short, is there a way to recalculate Rsq values correctly on a subset of samples? Am I doing a mistake here?

Thanks in advance!

@Santy-8128
Copy link
Collaborator

Santy-8128 commented Nov 8, 2019 via email

@maegsul
Copy link
Author

maegsul commented Nov 8, 2019

Hi Sayathan, thanks a lot for your answer, it is very clear!

Indeed, mainly for rare variants I couldn't replicate the Rsq values I see in the imputation output VCFs. And yes, I had to calculate them based on rounded off (3 decimal places) values I see in the VCF. Speaking of this, just to confirm, for Rsq calculation you use HDS and not DS, right?

At a first glance, for chr22 imputation (n=519190 variants) results of an example n=~3000 samples cohort, I have obtained the below stats between my Rsq and the original Rsq values:

For AF > 0.01 variants, the biggest difference is 0.03232 (occurs only for 1 variant, and the second and the third biggest differences are 0.00118 and 0.00071), the smallest difference is 0. Median difference is as small as 7e-05 and it looks very good overall.

For 0 < AF < 0.01 variants, the max difference is 0.04573. Median difference is 3e-05. It is still good I think.

For AF = 0 variants, I have terrible Rsq scores that are as big as 1e+23 and they are all > 1e+20.

So indeed, the rarer it is, the more problematic it is. Actually for AF=0 variants I have very inflated R2 values as you can see (even bigger than >1). How do you avoid this? Because in the original Rsq formula it is indicated that the divisor is actually p x (1-p) where p is being the alternate allele frequency; and if p is very rare (let's say 1e-5) the divisor will be extremely small, resulting in a highly inflated Rsq score. Isn't it the case?

@Santy-8128
Copy link
Collaborator

Santy-8128 commented Nov 11, 2019 via email

@ChakrabortyShreya
Copy link

Hi,

Can you please tell me how HDS is calculated (the formula) and what exactly estimated haploid allele dosage means?
And what do the two comma-separated numbers reported under HDS mean (please refer to the screenshot)?
Also, in the attached screenshot it can be seen that the posterior probability is maximum for the heterozygous genotype (0.457). Still, why is the imputed genotype reported as homozygous reference?

Thanks in advance.

Screenshot 2022-12-13 at 12 53 04 PM

@jonathonl
Copy link
Contributor

jonathonl commented Dec 13, 2022

HDS is the sum of posterior probabilities (see equation 4 in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5157836/) for templates with alternate alleles for that variant. This paper is for Minimac3. While Minimac4 has minor differences, the basic concept for calculating HDS is the same. Each value of HDS corresponds to a haplotype for an individual. You can think of each value as the probability of the corresponding haplotype having the alternate allele.

The hard-called genotypes are simply rounded HDS values. The GP values are derived from HDS and not used for calling phased GT (since you need haplotype probabilities, instead of genotype probabilities, to call phased GT). GP is optionally generated for convenience in running downstream analyses that require GP. The formulas for GP can be found at https://github.com/statgen/hds-util/blob/master/README.md#diploid.

@yukt
Copy link
Contributor

yukt commented Dec 13, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants