Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Variants missed in iHS output #64

Closed
psytky03 opened this issue Jul 1, 2021 · 9 comments
Closed

Variants missed in iHS output #64

psytky03 opened this issue Jul 1, 2021 · 9 comments

Comments

@psytky03
Copy link

psytky03 commented Jul 1, 2021

Dear @szpiech,
I attempted to compare the iHS results of two datasets (which I processed separated) but found not all variants were included in the iHS outputs, which made the direct comparison difficult. Is there a way to report all variants?

Thank you very much!

@szpiech
Copy link
Owner

szpiech commented Jul 1, 2021

Hi there,

Since selscan filters low frequency variants based on the sample allele frequencies of the data provided, you'll inevitably find some scores "missing" from one dataset to another. You could set ---maf 0 to avoid low frequency filtering. However, I should note that including low frequency variants can diminish the power of the statistic, so keep that in mind depending on what you're interested in. Another option, would of course be to allow the MAF filtering but only compare sites with scores in both data sets. I don't know what your analysis goals are, or what the relationship is between these two datasets, but you could consider running XPEHH or XPNSL too if that makes sense. Hope this helps!

@psytky03
Copy link
Author

psytky03 commented Jul 5, 2021

Hi, Dr. @szpiech , thanks for the advices. However I feel what I encountered might not a low-frequency variant issue. I should make it clear that the two datasets were used as discovery and replication sets, and the variants with iHS scores in only one dataset are not rare variants (AF ~ 0.40), I am wondering if all variants in a given haplotype are reported? May I extract raw input/output data and send it to you by email for a closer lookup? Another question is will it be recommended to use HG38 build instead of HG19 as the latter contains regions with alternate loci or with fix patches? will such regions a problem for iHS analysis?

@szpiech
Copy link
Owner

szpiech commented Jul 5, 2021

selscan will only report a score at site above the MAF threshold, but scores at sites above the MAF threshold can also be dropped if they are close to a large gap or near the chromosome boundaries. --trunc-ok will report those scores too. I suppose hg38 would be preferred, but either is okay.

@psytky03
Copy link
Author

psytky03 commented Jul 6, 2021

It seems that even with --trunc-ok some of the variants above the MAF threshold were dropped in the output file. I have reproduced this issue with the 1KGP data, and sent the script/results to you in email. Could you kindly have a check (just in case it went to the spam folder). Thank you very much!

@gabyrech
Copy link

Hi @psytky03! Did you manage to know what is going on with the missing variants? I think something simmilar is happening to me.
Thanks!

@szpiech
Copy link
Owner

szpiech commented Jul 15, 2021

I’m still looking into this, but it’ll be a few weeks as I’m on vacation starting today.

@hernrya
Copy link
Collaborator

hernrya commented Jul 16, 2021 via email

@psytky03
Copy link
Author

psytky03 commented Sep 7, 2021

Just wondering if there is any update on this issue @szpiech, thanks!

@szpiech
Copy link
Owner

szpiech commented Sep 9, 2021

Okay, sorry for the delay in figuring this out. This one was tricky.

The sample example files that were sent to me, which contained a segment of human chr7 from TGP, showed that a large number of sites with MAF > 0.05 were not being reported in the genomic interval chr7:142038782-142257749.

After a lot of head-scratching, I believe the problem traces to the genetic map file that is being used. In particular, all sites in chr7:142038782-142257749 are assigned identical genetic map positions (154.2832 cM), and I believe this is causing the computation to hit a condition where they get dropped, perhaps interacting with the 50kb gap that also exists in the region (although I'm unsure exactly which condition). A solution would be to either assign a finer genetic map position to these variants, use nSL, or use the --pmap flag when computing iHS (this will use the physical map as the distance for integrating). Using the example files I was provided, I was able to get scores reported with the following invocation.
selscan --ihs --vcf EAS_chr7.vcf.gz --threads 24 --trunc-ok --out test.eas.chr7 --pmap

Hope this helps!

@szpiech szpiech closed this as completed Sep 14, 2021
@szpiech szpiech mentioned this issue Dec 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants