-
Notifications
You must be signed in to change notification settings - Fork 78
Output haplotypes with missing data (and discuss indels) #426
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Output haplotypes with missing data (and discuss indels) #426
Conversation
e039523 to
92bb39b
Compare
Codecov Report
@@ Coverage Diff @@
## master #426 +/- ##
==========================================
+ Coverage 87.85% 87.85% +<.01%
==========================================
Files 19 19
Lines 10380 10387 +7
Branches 1897 1898 +1
==========================================
+ Hits 9119 9126 +7
Misses 746 746
Partials 515 515
Continue to review full report at Codecov.
|
fc6d245 to
6c48f8a
Compare
|
OK, before we go any further with this, what's the point in supporting multi-letter alleles here? I.e., does anyone actually need it? Unless the answer is "Yes, I/someone else I know cannot do their work without this feature" then I vote to drop it. We probably want a more general "alignments" method, if we're going down this road. I'm perfectly happy with the Outputting missing data from |
|
FWIW, perf tests show that for large numbers of samples (1M), it is essentially the same speed as both the C version and the non-missing-data version. |
ps. even with an |
A decent proportion of sites in real sequences are small indels. The question is whether we want to force people with real data to use the (yet to be implemented) finite sites version of a tree sequence, or whether we want to enable infinite sites tree sequences to output realistic-looking data. I think we probably need an input from a real biologist here, e.g. @petrelharp ? To summarize - is it useful to be able to output small indels as aligned haplotypes, e.g. where the |
|
I will work this PR into a missing-data-only version. Nevertheless I suspect that the multi-letter version will be useful for many biologists. I asked the HIV group, and about 2-5% of the variable sites in an HIV genome are in short indels, FWIW (mostly in 3bp lengths). For the SNPs in the 1000G, about 5% of the "standard" variants are short indels, so probably about 10% of the letters in a haplotype will be in an indel. The semantics for the haplotype output for indels seem very clear to me, but you are right that this is somewhat of a mouthful to bite off, so spinning it off into a different PR seems sensible |
6c48f8a to
d8fe0b2
Compare
|
p.s. even in a "missing-data-only" version, do we want to allow single letter deletions (i.e. one allele = |
|
I don't think we should keep the semantics of We can discuss the There's nothing stopping people using tree sequences which contain indels, they just can't write out alignments at the moment. This is pretty minor IMO since we support VCF output. |
Shall I move this discussion to another issue? It would IMO be reasonable to output multi-letter alleles from a non-integer-site TS. |
7f0cbff to
623d0a8
Compare
You mean we should keep the semantics? I'm unclear if you think we should allow e.g. I've coded this up in the latest push on this PR, as it seems well-defined and reasonable behaviour, but it may be that you would prefer the presence of a |
6eef35b to
d55e543
Compare
d55e543 to
b4bde65
Compare
|
OK @jeromekelleher - I think this allows missing data in haplotypes, but not single letter deletions, and is thus ready for review/merge when CI has completed |
|
Needs a rebase, but looks to merge after that. Can you update the CHANGELOG with this new feature please, and note any consequences for backward compatability? |
b4bde65 to
664c498
Compare
|
Rebased and added the changelog - hope it's not too verbose. |
Ported from #425. This needs some discussion, and more tests adding, but I think it's the right way to go.
I think we also want to be able to (optionally) output
.characters between the allele strings (see #353 (comment) and discussion below that). I haven't implemented this yet, but it should be trivial to add the option to thehaplotypes()function, and to this code.