Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise warning when zero inference sites provided. #683

Closed
evolgenomics opened this issue Jun 22, 2022 · 12 comments · Fixed by #685
Closed

Raise warning when zero inference sites provided. #683

evolgenomics opened this issue Jun 22, 2022 · 12 comments · Fixed by #685
Milestone

Comments

@evolgenomics
Copy link

Hi team,

I was trying to run tsinfer on the 1000G GBR samples, focusing on a single 1Mbp region.

I imported using a VCF input copied off the tutorial. Essentially:

vcf = cyvcf2.VCF('GBR.variable.vcf.gz')
with tsinfer.SampleData(
path="GBR.samples", sequence_length=chromosome_length(vcf)
) as samples:
add_diploid_sites(vcf, samples)

Do the inference

ts = tsinfer.infer(samples)

ts.dump("GBR.trees")

$ python tsinfer_convert.py

Sample file created for 182 samples (182 individuals) with 262165 variable sites.
Inferred tree sequence: 1 trees over 63.02552 Mb (182 edges)

$ tskit info GBR.trees
╔═════════════════════════╗
║TreeSequence ║
╠═══════════════╤═════════╣
║Trees │ 1║
╟───────────────┼─────────╢
║Sequence Length│ 63025520║
╟───────────────┼─────────╢
║Time Units │ unknown║
╟───────────────┼─────────╢
║Sample Nodes │ 182║
╟───────────────┼─────────╢
║Total Size │237.6 MiB║
╚═══════════════╧═════════╝
╔═══════════╤═══════╤══════════╤════════════╗
║Table │Rows │Size │Has Metadata║
╠═══════════╪═══════╪══════════╪════════════╣
║Edges │ 182│ 5.7 KiB│ No║
╟───────────┼───────┼──────────┼────────────╢
║Individuals│ 182│ 5.4 KiB│ Yes║
╟───────────┼───────┼──────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼───────┼──────────┼────────────╢
║Mutations │6314841│ 222.8 MiB│ No║
╟───────────┼───────┼──────────┼────────────╢
║Nodes │ 183│ 5.0 KiB│ No║
╟───────────┼───────┼──────────┼────────────╢
║Populations│ 0│ 8 Bytes│ No║
╟───────────┼───────┼──────────┼────────────╢
║Provenances│ 2│1008 Bytes│ No║
╟───────────┼───────┼──────────┼────────────╢
║Sites │ 262165│ 14.8 MiB│ Yes║
╚═══════════╧═══════╧══════════╧════════════╝

I tried doing the whole chromosome, or a single 1Mbp, and got the same results.

I put some of the files on an ftp server. Please see if you could replicate the issue

http://ftp.tuebingen.mpg.de/fml/ag-chan/tskit/

Thanks

@jeromekelleher
Copy link
Member

Looks like the defaults for infer are wrong here and we're inferring gagillions of recurrent mutations instead of allowing recombinations. Seems like a bug to me @hyanwong?

@hyanwong
Copy link
Member

Weird. We shouldn't be allowing any recurrent mutation here, because there's no recombination rate passed in, right?

@hyanwong
Copy link
Member

hyanwong commented Jun 22, 2022

But yes, if there are 6314841 mutations but only 262165 sites, that explains why we only have one tree. That seems like a bug if you have simply called tsinfer.infer(samples)

@hyanwong
Copy link
Member

I've had a look just now. None of the sites are mark for inference, that's why. It looks like the VCF parsing is wrong: it's getting weird characters in the allele string:

sd = tsinfer.load("GBR.samples")
for s in sd.sites():
    print(s)
    break

gives

Site(id=0, position=60828.0, ancestral_state='.|||', metadata={}, time=nan, alleles=('.|||', 'T', 'G'))

So there's a weird .||| in there.

@hyanwong
Copy link
Member

hyanwong commented Jun 22, 2022

When trying to perform inference, I wonder if we should raise a warning if there are sites, but none valid for inference? That would have caught this problem earlier.

@evolgenomics
Copy link
Author

Thanks! and interesting - so from the original VCF there is the "AA" INFO tag, and including a line from the original global population Phase 3 VCF I downloaded straight from the 1000G project, it says so:
20 60343 . G A 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=20377;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP
Note the AA=.||| near the end of the INFO field.
According to the header, its format is:

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele. Format: AA|REF|ALT|IndelType. AA: Ancestral allele, REF:Reference Allele, ALT:Alternate Allele, IndelType:Type of Indel (REF, ALT and 
IndelType are only defined for indels)">

Odd.

@hyanwong
Copy link
Member

I don't think we used the AA from the 1000G project, as there are more reliable sources of ancestral allele information than in 1000G. The VCF reading code there is therefore wrong for the 1000G AA format. The demo code assumes something like this:

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">

(from here)

If you know what the 1000G AA format means, then you can adjust the VCF reading code as appropriate

@hyanwong
Copy link
Member

It looks like "." means that it was impossible to call an ancestral allele. We should probably adjust the demo code to stop at "|" and cope with unknown AA states.

@jeromekelleher
Copy link
Member

Aha, well tracked down @hyanwong. A warning when we have 0 inference sites sounds like the right thing to do. I've changed the title of this issue so we can close it when this is implemented.

@jeromekelleher jeromekelleher changed the title Odd tsinfer behaviour - no topology inferred across entire human chromosome Raise warning when zero inference sites provided. Jun 24, 2022
@jeromekelleher jeromekelleher added this to the Release 0.2.4 milestone Jun 24, 2022
@hyanwong hyanwong mentioned this issue Jun 24, 2022
@mergify mergify bot closed this as completed in #685 Jun 28, 2022
@hyanwong
Copy link
Member

Hi @evolgenomics - could you see if the updated example add_diploid_sites code at https://tsinfer.readthedocs.io/en/latest/tutorial.html#reading-a-vcf works for your VCF now?

@hyanwong
Copy link
Member

Hi @evolgenomics - could you see if the updated example add_diploid_sites code at https://tsinfer.readthedocs.io/en/latest/tutorial.html#reading-a-vcf works for your VCF now?

Actually, see #689 for a version which does actually work!

@evolgenomics
Copy link
Author

Yes - I solved that problem earlier by removing the AA tag. I now tried the tutorial version as well and it works as well. Thanks! Ultimately the #689 solution is the best one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants