Skip to content

Conversation

@hyanwong
Copy link
Member

Ported from #425. This needs some discussion, and more tests adding, but I think it's the right way to go.

I think we also want to be able to (optionally) output . characters between the allele strings (see #353 (comment) and discussion below that). I haven't implemented this yet, but it should be trivial to add the option to the haplotypes() function, and to this code.

@hyanwong hyanwong changed the title Output haplotypes with missing data and multiple letters Output haplotypes with missing data and multiple-letter alleles (e.g. indels) Nov 26, 2019
@hyanwong hyanwong force-pushed the haplotypes-with-missing-data branch from e039523 to 92bb39b Compare November 27, 2019 11:30
@codecov
Copy link

codecov bot commented Nov 27, 2019

Codecov Report

Merging #426 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #426      +/-   ##
==========================================
+ Coverage   87.85%   87.85%   +<.01%     
==========================================
  Files          19       19              
  Lines       10380    10387       +7     
  Branches     1897     1898       +1     
==========================================
+ Hits         9119     9126       +7     
  Misses        746      746              
  Partials      515      515
Flag Coverage Δ
#c_tests 87.85% <100%> (ø) ⬆️
#python_tests 99.24% <100%> (ø) ⬆️
Impacted Files Coverage Δ
python/tskit/trees.py 98.74% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f3f9f64...664c498. Read the comment docs.

@hyanwong hyanwong force-pushed the haplotypes-with-missing-data branch 3 times, most recently from fc6d245 to 6c48f8a Compare November 27, 2019 12:27
@jeromekelleher
Copy link
Member

OK, before we go any further with this, what's the point in supporting multi-letter alleles here? I.e., does anyone actually need it? Unless the answer is "Yes, I/someone else I know cannot do their work without this feature" then I vote to drop it.

We probably want a more general "alignments" method, if we're going down this road. I'm perfectly happy with the alignments method throwing an error if there's non-integer site positions. The semantics of indels etc become clearer then.

Outputting missing data from haplotypes is a good feature, we want this.

@hyanwong
Copy link
Member Author

FWIW, perf tests show that for large numbers of samples (1M), it is essentially the same speed as both the C version and the non-missing-data version.

@jeromekelleher
Copy link
Member

We probably want a more general "alignments" method, if we're going down this road. I'm perfectly happy with the alignments method throwing an error if there's non-integer site positions. The semantics of indels etc become clearer then.

ps. even with an alignments method I don't think we should get into dealing with multi-letter alleles. It's a massive can of worms and would take weeks to work through all the details and corner cases. It's not on the critical path.

@hyanwong
Copy link
Member Author

OK, before we go any further with this, what's the point in supporting multi-letter alleles here?

A decent proportion of sites in real sequences are small indels. The question is whether we want to force people with real data to use the (yet to be implemented) finite sites version of a tree sequence, or whether we want to enable infinite sites tree sequences to output realistic-looking data.

I think we probably need an input from a real biologist here, e.g. @petrelharp ?

To summarize - is it useful to be able to output small indels as aligned haplotypes, e.g.

Sample1: .A.T.---.G.T
Sample2: .A.C.---.G.A
Sample3: .A.T.ATT.G.A

where the . represents non-variable sequence between variants, and hasn't been implemented yet.

@hyanwong
Copy link
Member Author

I will work this PR into a missing-data-only version.

Nevertheless I suspect that the multi-letter version will be useful for many biologists. I asked the HIV group, and about 2-5% of the variable sites in an HIV genome are in short indels, FWIW (mostly in 3bp lengths). For the SNPs in the 1000G, about 5% of the "standard" variants are short indels, so probably about 10% of the letters in a haplotype will be in an indel. The semantics for the haplotype output for indels seem very clear to me, but you are right that this is somewhat of a mouthful to bite off, so spinning it off into a different PR seems sensible

@hyanwong hyanwong force-pushed the haplotypes-with-missing-data branch from 6c48f8a to d8fe0b2 Compare November 27, 2019 13:34
@hyanwong
Copy link
Member Author

hyanwong commented Nov 27, 2019

p.s. even in a "missing-data-only" version, do we want to allow single letter deletions (i.e. one allele = "", the rest single letter or None)? The semantics here are quite clear, I would say

@jeromekelleher
Copy link
Member

I don't think we should keep the semantics of haplotypes as it is: output just the sites that are in the ts. The semantics are clear then. Indels etc can be dealt with later, when someone with an application asks for it.

We can discuss the alignments operation (which would be defined only on integer a TS with integer sites) separately.

There's nothing stopping people using tree sequences which contain indels, they just can't write out alignments at the moment. This is pretty minor IMO since we support VCF output.

@hyanwong
Copy link
Member Author

We can discuss the alignments operation (which would be defined only on integer a TS with integer sites) separately.

Shall I move this discussion to another issue? It would IMO be reasonable to output multi-letter alleles from a non-integer-site TS.

@hyanwong hyanwong force-pushed the haplotypes-with-missing-data branch 2 times, most recently from 7f0cbff to 623d0a8 Compare November 27, 2019 14:30
@hyanwong
Copy link
Member Author

I don't think we should keep the semantics of haplotypes as it is: output just the sites that are in the ts.

You mean we should keep the semantics? I'm unclear if you think we should allow e.g. genotypes = [0, 1, 2, -1], alleles=["", "A", "G", None] to be output as

-
A
G
-

I've coded this up in the latest push on this PR, as it seems well-defined and reasonable behaviour, but it may be that you would prefer the presence of a "" in the allele list to throw an error.

@hyanwong hyanwong changed the title Output haplotypes with missing data and multiple-letter alleles (e.g. indels) Output haplotypes with missing data (and discuss indels) Nov 27, 2019
@hyanwong hyanwong force-pushed the haplotypes-with-missing-data branch 2 times, most recently from 6eef35b to d55e543 Compare November 27, 2019 15:45
@hyanwong hyanwong force-pushed the haplotypes-with-missing-data branch from d55e543 to b4bde65 Compare December 4, 2019 12:17
@hyanwong
Copy link
Member Author

hyanwong commented Dec 4, 2019

OK @jeromekelleher - I think this allows missing data in haplotypes, but not single letter deletions, and is thus ready for review/merge when CI has completed

@jeromekelleher
Copy link
Member

Needs a rebase, but looks to merge after that.

Can you update the CHANGELOG with this new feature please, and note any consequences for backward compatability?

@hyanwong hyanwong force-pushed the haplotypes-with-missing-data branch from b4bde65 to 664c498 Compare December 4, 2019 13:41
@hyanwong
Copy link
Member Author

hyanwong commented Dec 4, 2019

Rebased and added the changelog - hope it's not too verbose.

@jeromekelleher jeromekelleher merged commit 0b74c8b into tskit-dev:master Dec 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants