Raise warning when zero inference sites provided. #683
Comments
Looks like the defaults for |
Weird. We shouldn't be allowing any recurrent mutation here, because there's no recombination rate passed in, right? |
But yes, if there are 6314841 mutations but only 262165 sites, that explains why we only have one tree. That seems like a bug if you have simply called |
I've had a look just now. None of the sites are marked for inference, which is why. It looks like the VCF parsing is wrong: it's getting weird characters in the allele string:
gives
So there's a weird |
When trying to perform inference, I wonder if we should raise a warning if there are sites, but none valid for inference? That would have caught this problem earlier. |
Thanks, and interesting! The original VCF does carry the "AA" INFO tag. A line from the original global-population Phase 3 VCF, which I downloaded straight from the 1000G project, says so:
Odd. |
I don't think we used the AA from the 1000G project, as there are more reliable sources of ancestral allele information than in 1000G. The VCF reading code there is therefore wrong for the 1000G AA format. The demo code assumes something like this:
(from here) If you know what the 1000G AA format means, then you can adjust the VCF reading code as appropriate |
It looks like "." means that it was impossible to call an ancestral allele. We should probably adjust the demo code to stop at "|" and cope with unknown AA states. |
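The adjustment suggested above could be sketched roughly as follows. This is not the tutorial's actual code: `parse_ancestral_allele` is a hypothetical helper, and it assumes the AA value is a string whose first "|"-separated field is the allele, with "." (or an empty field) meaning unknown. Upper-casing the allele is a further assumption, made so the result can be compared against REF/ALT.

```python
def parse_ancestral_allele(aa_field):
    """Return the ancestral allele from an AA INFO value, or None if unknown.

    Hypothetical helper: assumes the allele is the text before the first "|".
    """
    if aa_field is None:
        # The record has no AA tag at all
        return None
    # Stop at the first "|" so trailing fields never end up in the allele string
    allele = aa_field.split("|", 1)[0].strip()
    if allele in ("", "."):
        # "." means no ancestral allele could be called
        return None
    # Assumption: letter case is not needed downstream, so normalise it
    return allele.upper()
```

A caller could then skip, or otherwise handle separately, any site for which this returns `None` rather than passing a malformed allele string on to the sample data.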
Aha, well tracked down @hyanwong. A warning when we have 0 inference sites sounds like the right thing to do. I've changed the title of this issue so we can close it when this is implemented. |
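A minimal sketch of what such a check might look like, using only the standard library. The function name and its two count arguments are hypothetical and not part of tsinfer; how the counts would be obtained inside the library is left open.

```python
import warnings


def warn_if_no_inference_sites(num_sites, num_inference_sites):
    """Warn when sites exist but none can be used for inference.

    Hypothetical helper; returns True if a warning was issued.
    """
    if num_sites > 0 and num_inference_sites == 0:
        warnings.warn(
            f"{num_sites} sites were provided but none are usable for "
            "inference; check that ancestral alleles were read correctly.",
            UserWarning,
        )
        return True
    return False
```

With the numbers from this report, `warn_if_no_inference_sites(262165, 0)` would issue the warning, which would have pointed at the ancestral allele problem before the single-tree output appeared.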
Hi @evolgenomics - could you see if the updated example |
Actually, see #689 for a version which does actually work! |
Yes - I solved that problem earlier by removing the AA tag. I have now tried the tutorial version too, and it also works. Thanks! Ultimately the #689 solution is the best one. |
Hi team,
I was trying to run tsinfer on the 1000G GBR samples, focusing on a single 1Mbp region.
I imported the data using VCF-reading code copied from the tutorial. Essentially:
import cyvcf2
import tsinfer

# chromosome_length() and add_diploid_sites() are the helper functions copied from the tutorial
vcf = cyvcf2.VCF("GBR.variable.vcf.gz")
with tsinfer.SampleData(
    path="GBR.samples", sequence_length=chromosome_length(vcf)
) as samples:
    add_diploid_sites(vcf, samples)

# Do the inference
ts = tsinfer.infer(samples)
ts.dump("GBR.trees")
$ python tsinfer_convert.py
Sample file created for 182 samples (182 individuals) with 262165 variable sites.
Inferred tree sequence: 1 trees over 63.02552 Mb (182 edges)
$ tskit info GBR.trees
╔═════════════════════════╗
║TreeSequence ║
╠═══════════════╤═════════╣
║Trees │ 1║
╟───────────────┼─────────╢
║Sequence Length│ 63025520║
╟───────────────┼─────────╢
║Time Units │ unknown║
╟───────────────┼─────────╢
║Sample Nodes │ 182║
╟───────────────┼─────────╢
║Total Size │237.6 MiB║
╚═══════════════╧═════════╝
╔═══════════╤═══════╤══════════╤════════════╗
║Table │Rows │Size │Has Metadata║
╠═══════════╪═══════╪══════════╪════════════╣
║Edges │ 182│ 5.7 KiB│ No║
╟───────────┼───────┼──────────┼────────────╢
║Individuals│ 182│ 5.4 KiB│ Yes║
╟───────────┼───────┼──────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼───────┼──────────┼────────────╢
║Mutations │6314841│ 222.8 MiB│ No║
╟───────────┼───────┼──────────┼────────────╢
║Nodes │ 183│ 5.0 KiB│ No║
╟───────────┼───────┼──────────┼────────────╢
║Populations│ 0│ 8 Bytes│ No║
╟───────────┼───────┼──────────┼────────────╢
║Provenances│ 2│1008 Bytes│ No║
╟───────────┼───────┼──────────┼────────────╢
║Sites │ 262165│ 14.8 MiB│ Yes║
╚═══════════╧═══════╧══════════╧════════════╝
I tried doing the whole chromosome, or a single 1Mbp, and got the same results.
I have put some of the files on an FTP server. Please see if you can replicate the issue:
http://ftp.tuebingen.mpg.de/fml/ag-chan/tskit/
Thanks