Output haplotypes with missing data (and discuss indels) #426

hyanwong · 2019-11-26T12:10:14Z

Ported from #425. This needs some discussion, and more tests adding, but I think it's the right way to go.

I think we also want to be able to (optionally) output . characters between the allele strings (see #353 (comment) and discussion below that). I haven't implemented this yet, but it should be trivial to add the option to the haplotypes() function, and to this code.

codecov · 2019-11-27T11:30:54Z

Codecov Report

Merging #426 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #426      +/-   ##
==========================================
+ Coverage   87.85%   87.85%   +<.01%     
==========================================
  Files          19       19              
  Lines       10380    10387       +7     
  Branches     1897     1898       +1     
==========================================
+ Hits         9119     9126       +7     
  Misses        746      746              
  Partials      515      515

Flag	Coverage Δ
#c_tests	`87.85% <100%> (ø)`	⬆️
#python_tests	`99.24% <100%> (ø)`	⬆️

Impacted Files	Coverage Δ
python/tskit/trees.py	`98.74% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f3f9f64...664c498. Read the comment docs.

jeromekelleher · 2019-11-27T12:51:02Z

OK, before we go any further with this, what's the point in supporting multi-letter alleles here? I.e., does anyone actually need it? Unless the answer is "Yes, I/someone else I know cannot do their work without this feature" then I vote to drop it.

We probably want a more general "alignments" method, if we're going down this road. I'm perfectly happy with the alignments method throwing an error if there's non-integer site positions. The semantics of indels etc become clearer then.

Outputting missing data from haplotypes is a good feature, we want this.

hyanwong · 2019-11-27T13:08:06Z

FWIW, perf tests show that for large numbers of samples (1M), it is essentially the same speed as both the C version and the non-missing-data version.

jeromekelleher · 2019-11-27T13:09:43Z

We probably want a more general "alignments" method, if we're going down this road. I'm perfectly happy with the alignments method throwing an error if there's non-integer site positions. The semantics of indels etc become clearer then.

ps. even with an alignments method I don't think we should get into dealing with multi-letter alleles. It's a massive can of worms and would take weeks to work through all the details and corner cases. It's not on the critical path.

hyanwong · 2019-11-27T13:13:54Z

OK, before we go any further with this, what's the point in supporting multi-letter alleles here?

A decent proportion of sites in real sequences are small indels. The question is whether we want to force people with real data to use the (yet to be implemented) finite sites version of a tree sequence, or whether we want to enable infinite sites tree sequences to output realistic-looking data.

I think we probably need an input from a real biologist here, e.g. @petrelharp ?

To summarize - is it useful to be able to output small indels as aligned haplotypes, e.g.

Sample1: .A.T.---.G.T
Sample2: .A.C.---.G.A
Sample3: .A.T.ATT.G.A

where the . represents non-variable sequence between variants, and hasn't been implemented yet.

hyanwong · 2019-11-27T13:32:22Z

I will work this PR into a missing-data-only version.

Nevertheless I suspect that the multi-letter version will be useful for many biologists. I asked the HIV group, and about 2-5% of the variable sites in an HIV genome are in short indels, FWIW (mostly in 3bp lengths). For the SNPs in the 1000G, about 5% of the "standard" variants are short indels, so probably about 10% of the letters in a haplotype will be in an indel. The semantics for the haplotype output for indels seem very clear to me, but you are right that this is somewhat of a mouthful to bite off, so spinning it off into a different PR seems sensible

hyanwong · 2019-11-27T13:37:09Z

p.s. even in a "missing-data-only" version, do we want to allow single letter deletions (i.e. one allele = "", the rest single letter or None)? The semantics here are quite clear, I would say

jeromekelleher · 2019-11-27T13:48:16Z

I don't think we should keep the semantics of haplotypes as it is: output just the sites that are in the ts. The semantics are clear then. Indels etc can be dealt with later, when someone with an application asks for it.

We can discuss the alignments operation (which would be defined only on integer a TS with integer sites) separately.

There's nothing stopping people using tree sequences which contain indels, they just can't write out alignments at the moment. This is pretty minor IMO since we support VCF output.

hyanwong · 2019-11-27T14:06:01Z

We can discuss the alignments operation (which would be defined only on integer a TS with integer sites) separately.

Shall I move this discussion to another issue? It would IMO be reasonable to output multi-letter alleles from a non-integer-site TS.

hyanwong · 2019-11-27T14:31:05Z

I don't think we should keep the semantics of haplotypes as it is: output just the sites that are in the ts.

You mean we should keep the semantics? I'm unclear if you think we should allow e.g. genotypes = [0, 1, 2, -1], alleles=["", "A", "G", None] to be output as

-
A
G
-

I've coded this up in the latest push on this PR, as it seems well-defined and reasonable behaviour, but it may be that you would prefer the presence of a "" in the allele list to throw an error.

hyanwong · 2019-12-04T12:18:49Z

OK @jeromekelleher - I think this allows missing data in haplotypes, but not single letter deletions, and is thus ready for review/merge when CI has completed

jeromekelleher · 2019-12-04T13:28:24Z

Needs a rebase, but looks to merge after that.

Can you update the CHANGELOG with this new feature please, and note any consequences for backward compatability?

hyanwong · 2019-12-04T13:42:30Z

Rebased and added the changelog - hope it's not too verbose.

hyanwong mentioned this pull request Nov 26, 2019

New python haplotype generator #425

Merged

hyanwong changed the title ~~Output haplotypes with missing data and multiple letters~~ Output haplotypes with missing data and multiple-letter alleles (e.g. indels) Nov 26, 2019

hyanwong force-pushed the haplotypes-with-missing-data branch from e039523 to 92bb39b Compare November 27, 2019 11:30

hyanwong force-pushed the haplotypes-with-missing-data branch 3 times, most recently from fc6d245 to 6c48f8a Compare November 27, 2019 12:27

hyanwong force-pushed the haplotypes-with-missing-data branch from 6c48f8a to d8fe0b2 Compare November 27, 2019 13:34

hyanwong force-pushed the haplotypes-with-missing-data branch 2 times, most recently from 7f0cbff to 623d0a8 Compare November 27, 2019 14:30

hyanwong changed the title ~~Output haplotypes with missing data and multiple-letter alleles (e.g. indels)~~ Output haplotypes with missing data (and discuss indels) Nov 27, 2019

hyanwong force-pushed the haplotypes-with-missing-data branch 2 times, most recently from 6eef35b to d55e543 Compare November 27, 2019 15:45

hyanwong mentioned this pull request Nov 27, 2019

Multi letter alleles (e.g. indels) in haplotypes() #427

Closed

hyanwong force-pushed the haplotypes-with-missing-data branch from d55e543 to b4bde65 Compare December 4, 2019 12:17

Add ability to output missing data in haplotypes

664c498

hyanwong force-pushed the haplotypes-with-missing-data branch from b4bde65 to 664c498 Compare December 4, 2019 13:41

jeromekelleher merged commit 0b74c8b into tskit-dev:master Dec 4, 2019

Output haplotypes with missing data (and discuss indels) #426

Output haplotypes with missing data (and discuss indels) #426

Uh oh!

Conversation

hyanwong commented Nov 26, 2019

Uh oh!

codecov bot commented Nov 27, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jeromekelleher commented Nov 27, 2019

Uh oh!

hyanwong commented Nov 27, 2019

Uh oh!

jeromekelleher commented Nov 27, 2019

Uh oh!

hyanwong commented Nov 27, 2019

Uh oh!

hyanwong commented Nov 27, 2019

Uh oh!

hyanwong commented Nov 27, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Nov 27, 2019

Uh oh!

hyanwong commented Nov 27, 2019

Uh oh!

hyanwong commented Nov 27, 2019

Uh oh!

hyanwong commented Dec 4, 2019

Uh oh!

jeromekelleher commented Dec 4, 2019

Uh oh!

hyanwong commented Dec 4, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Nov 27, 2019 •

edited

Loading

hyanwong commented Nov 27, 2019 •

edited

Loading