Conversation

@hyanwong (Member) commented Nov 25, 2019

Fixes #326

Note that this adds two pieces of extra functionality that I would like. Firstly (and less controversially), it allows missing data. Secondly, it allows multi-letter alleles and deletions. This will help us support indels, but it does mean that the number of characters in the output string is not necessarily the same as the total number of sites. However, I have coded it so that the alignments between samples are always the same (and hence the string length for each haplotype is guaranteed to be identical). This requires a decision about what to do when the alleles at a site have different (non-zero) lengths: I have chosen to represent the shorter allele as missing data, since I can't guarantee the alignment (see the toy sketch below). This is perhaps the wrong thing to do; either way, it probably requires some discussion.
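To make that rule concrete, here is a toy sketch of the padding decision described above (the helper and the missing-data character are hypothetical illustrations, not code from this PR):

# Hypothetical illustration: at a site whose alleles have different
# non-zero lengths, the shorter allele is written out as missing data so
# that every haplotype string keeps the same length and alignment.
MISSING_CHAR = "N"  # assumed missing-data character, for illustration only

def render_allele(allele, site_width):
    # site_width is the length of the longest allele at this site
    if len(allele) == site_width:
        return allele
    return MISSING_CHAR * site_width  # can't guarantee alignment: mark missing

site_width = 3  # e.g. this site's alleles are "ACG" and "A"
print(render_allele("ACG", site_width))  # -> ACG
print(render_allele("A", site_width))    # -> NNN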

The tests aren't that comprehensive yet, especially for this extra functionality. I thought I should submit a preliminary PR before going to town on test suites, in case any of this seemed like a bad idea. Spun off into #425

@hyanwong (Member, Author) commented Nov 25, 2019

This is failing on (at least one) slightly weird test: test_topology.py:TestEmptyTreeSequences.test_one_node_one_sample_sites.

Here we make a TS with a single (sample) node, a site with an ancestral state, and a single mutation above that node. According to our definition, this is a site with missing data. If we impute the missing data, the docs say we set it to the ancestral state (== '0'), but here the test instead expects '1'. Is this because there is a mutation above the node? We haven't, AFAIK, decided how we should impute a missing node with a mutation above it yet, have we?

@jeromekelleher (Member)

Thanks @hyanwong. Can we do this as a direct replacement of the existing functionality first and add new functionality afterwards please?

@hyanwong (Member, Author) commented Nov 26, 2019

@jeromekelleher I thought you might say that. But what is the correct behaviour in the test_one_node_one_sample_sites example above? The docs say "missing data will be imputed such that all isolated samples are assigned the ancestral state", but the existing implementation does not do this. How should I code it?

@jeromekelleher (Member)

We should be able to code a direct replacement for the existing C code that doesn't break any tests. It's not a good idea to change several things at once.

@hyanwong (Member, Author)

> We should be able to code a direct replacement for the existing C code that doesn't break any tests. It's not a good idea to change several things at once.

I see what you mean, but I think the existing code is either broken or wrongly documented. So should I change the code's result, or the docs? I assume you mean the docs. In particular, we do not impute missing data to be the ancestral state if the isolated node has a mutation above it; in that case, we impute a derived state. Is that correct? If so, we should at least say so.

@jeromekelleher (Member)

The first thing we want is an exact replica of the existing behaviour, with no extra parameters and no changes in documentation. If there does need to be a change in semantics then we need to understand what the old semantics are, and potentially be able to reproduce them.

@hyanwong (Member, Author)

> The first thing we want is an exact replica of the existing behaviour, with no extra parameters and no changes in documentation. If there does need to be a change in semantics then we need to understand what the old semantics are, and potentially be able to reproduce them.

Sorry, we seem to be talking at cross purposes. I'm not necessarily suggesting changing the semantics. I have a PR (not yet pushed) which will replicate the current behaviour exactly (passes all the tests), but the documentation is then wrong. Should I correct the docs? At the moment, the existing behaviour of the C code is wrongly documented.

@jeromekelleher (Member)

OK, well please push that code so that we can discuss it then. We'll need to do a perf analysis to make sure that what we're doing isn't a regression. This is a big change and needs to be done carefully, and to be honest, I have other things to do.

@hyanwong (Member, Author)

Looks like I might have accidentally deleted some C cleanup code when tidying. I'll have a look.

@codecov (bot) commented Nov 26, 2019

Codecov Report

Merging #425 into master will increase coverage by 0.05%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #425      +/-   ##
==========================================
+ Coverage   86.67%   86.73%   +0.05%     
==========================================
  Files          20       20              
  Lines       14258    14115     -143     
  Branches     2774     2745      -29     
==========================================
- Hits        12358    12242     -116     
+ Misses        978      963      -15     
+ Partials      922      910      -12
Flag              Coverage Δ
#c_tests          87.85% <100%> (+0.06%) ⬆️
#python_c_tests   90.56% <100%> (+0.07%) ⬆️
#python_tests     99.23% <100%> (ø) ⬆️

Impacted Files          Coverage Δ
c/tskit/genotypes.c     91.51% <ø> (+2.47%) ⬆️
python/_tskitmodule.c   83.61% <ø> (+0.03%) ⬆️
c/tskit/core.c          83.52% <ø> (-0.13%) ⬇️
python/tskit/trees.py   98.73% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@hyanwong (Member, Author) commented Nov 26, 2019

Well, this looks like it's OK now, and should be a simple drop-in replacement. I'll do another PR for the proposed changes (edit - now in #426)

@jeromekelleher (Member)

OK, thanks. Can we do a perf analysis here? What's the before and after times for a large tree sequence?

@hyanwong (Member, Author)

Very good point. What's the standard way to do this? I haven't done so in anger before, as it were.

@jeromekelleher (Member)

You can use either the timeit module or time.perf_counter() directly. I tend to use the perf_counter approach.
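For example, a minimal timeit version of such a measurement might look like this (a sketch: it assumes a tree sequence ts already exists in scope):

import timeit

# Repeat the measurement 10 times; each run consumes the whole
# haplotypes() iterator once.
times = timeit.repeat(lambda: list(ts.haplotypes()), number=1, repeat=10)
print(min(times))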

@hyanwong (Member, Author) commented Nov 26, 2019

It looks like it's about 2 times slower. I ran the script below against the two codebases:

import time

import msprime

# Ten replicates: simulate the same tree sequence each time (fixed seed)
# and time how long it takes to generate every haplotype string.
for _ in range(10):
    ts = msprime.simulate(
        10000, recombination_rate=1000, mutation_rate=5000, random_seed=123)
    start = time.perf_counter()
    for h in ts.haplotypes():
        string = h  # just consume the iterator
    end = time.perf_counter()
    print(end - start)

Before:

9.241496816277504
9.242490600794554
9.245789850130677
9.243655636906624
9.23997611925006
9.240366514772177
9.24243825674057
9.244240812957287
9.241791695356369
9.238783311098814

After:

19.13179887831211
19.089431881904602
19.54032832942903
19.583404291421175
19.982043754309416
19.10580626502633
19.946577440947294
19.11001042276621
19.101856959983706
19.979458650574088

A little bit of digging shows that the majority of the time in the new haplotypes() method is spent in the line

H[:, var.site.id] = alleles[var.genotypes]

which fills in one column of the output matrix per variant (i.e. all samples at a single site). I'm not sure there is much more optimisation we can do to this bit.
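For context, here is a minimal standalone sketch of what that line does (the array contents are illustrative, not taken from the PR):

import numpy as np

# alleles maps each genotype index to the ASCII code of its allele
# character, with a trailing -1 slot so that missing data (genotype -1)
# indexes the last element.
alleles = np.full(3, -1, dtype=np.int8)
alleles[0] = ord("0")  # ancestral state
alleles[1] = ord("1")  # derived state

genotypes = np.array([0, 1, 1, 0], dtype=np.int8)  # one entry per sample

# Fancy indexing converts every genotype to its ASCII code in one
# vectorised operation; assigning to H[:, site_id] fills the column for
# this site across all samples.
H = np.empty((4, 2), dtype=np.int8)
site_id = 0
H[:, site_id] = alleles[genotypes]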

@jeromekelleher (Member)

Good stuff, thanks. Would you mind pasting in the top lines of the Python profile for a largish example please? I'd like to see how much time is spent in the h.tostring().decode() step --- there's a shortcut we can do there to potentially speed things up.
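A sketch of one possible shortcut of that kind (this is a guess at what is meant, not something stated in the thread): decode the whole matrix once and slice per-row substrings, rather than calling tostring().decode() on every row:

import numpy as np

# H holds one ASCII code per (sample, site) entry; decoding it in one go
# avoids a separate tostring()/decode() call per haplotype.
H = np.full((3, 4), ord("0"), dtype=np.int8)  # toy matrix, all "0"s
n, m = H.shape
big = H.tobytes().decode("ascii")
haplotypes = [big[i * m:(i + 1) * m] for i in range(n)]
print(haplotypes)  # ['0000', '0000', '0000']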

@jeromekelleher (Member)

Also, what happens when we have a sample size of 1M? 10K is pretty small.

@hyanwong (Member, Author)

> Also, what happens when we have a sample size of 1M? 10K is pretty small.

It looks like, as sample size increases, the differences even out. For example, for sample_size = 1e6 and mutation_rate = 2000 (other params the same), I get ~31 mins for each:

old version: 1852.089014943689 secs
new version: 1873.1874577719718 secs

However, it seems like the number of sites is more critical. Here's the slowdown (new time / old time) as a function of mutation rate (length = Ne = 1) and number of samples. For smaller sample sizes with large mutation rates (e.g. the largest rate here is equivalent to 56463 sites), the numpy-based method is ~700 times slower. But perhaps we don't care, since it's still quite fast?

           mut_rate=5    mut_rate=50    mut_rate=500   mut_rate=5000
n=10       2.90313337    19.42890212    149.14133159   766.20840123
n=100      2.2625131     12.48559536     73.3518402    187.20088055
n=1000     1.93421465     6.02976885      6.36719559    14.54888645
n=10000    1.92183657     2.70183765      2.71642332     2.72070453

(NB only 1 replicate of each, but you get the idea)

@hyanwong (Member, Author)

> Would you mind pasting in the top lines of the Python profile for a largish example please?

Here's 10K samples:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  2830                                               def haplotypes(self, impute_missing_data=False):
...
  2876         1         61.0     61.0      0.0          H = np.empty((self.num_samples, self.num_sites), dtype=np.int8)
  2877     78939    2064516.0     26.2     21.7          for var in self.variants(impute_missing_data=impute_missing_data):
  2878                                                       # allow an extra allele at the end for missing data == -1
  2879                                                       # filling with -1 values will ensure an error is raised if unassigned alleles
  2880                                                       # are referenced and converted back to ascii
  2881     78938     880540.0     11.2      9.3              alleles = -np.ones((len(var.alleles) + 1), dtype=np.int8)
  2882    236814     248598.0      1.0      2.6              for i, allele in enumerate(var.alleles):
  2883    157876     103846.0      0.7      1.1                  if allele is not None:
  2884    157876     115512.0      0.7      1.2                      if len(allele) > 1:
  2885                                                                   raise _tskit.LibraryError(
  2886                                                                       "Cannot produce haplotypes from multi-letter alleles")
  2887    157876     242475.0      1.5      2.5                      alleles[i] = ord(allele.encode('ascii'))
  2888     78938    5686557.0     72.0     59.8              H[:, var.site.id] = alleles[var.genotypes]
  2889     10001      10648.0      1.1      0.1          for h in H:
  2890     10000       6123.0      0.6      0.1              try:
  2891     10000     151347.0     15.1      1.6                  yield h.tostring().decode('ascii')

@jeromekelleher (Member)

Excellent! Looks like the time is dominated by decoding the variants for large n, so that's as good as we can do. I don't really care about small n; everything is fast anyway.

This is a great step forward, it'll really help to not have this C code lying around.

@hyanwong (Member, Author)

Great. I can save a little time by using np.full rather than -np.ones. Wait a tick and I'll push an update.
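Concretely, the swap is just the allocation line (num_alleles here is illustrative):

import numpy as np

num_alleles = 2  # illustrative
# Before: build an array of ones, then negate it (two array passes)
alleles = -np.ones(num_alleles + 1, dtype=np.int8)
# After: fill with -1 directly in a single step
alleles = np.full(num_alleles + 1, -1, dtype=np.int8)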

@jeromekelleher (Member) left a review comment

Minor comment above --- I think we can simplify the logic a bit.

@jeromekelleher (Member)

Great, thanks. Merging.

@jeromekelleher merged commit 7950b25 into tskit-dev:master on Nov 27, 2019
Closes: Remove C support for haplotypes and implement in Python (#326)