Conversation

@hyanwong (Member) commented Nov 25, 2019

Fixes #326

Note that this adds two pieces of extra functionality that I would like. Firstly (and less controversially), it allows missing data. Secondly, it allows multi-letter alleles and deletions. This will help us support indels, but it does mean that the number of characters in the output string is not necessarily the same as the total number of sites. However, I have coded it so that the alignments between samples are always the same (and hence the string length for each haplotype is guaranteed to be identical). This requires a decision about what to do when the alleles at a site have different (non-zero) lengths: I have chosen to represent the shorter allele as missing data, since I can't guarantee the alignment (see the toy sketch below). This is perhaps the wrong thing to do; either way, it probably requires some discussion.
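To make that rule concrete, here is a toy sketch of the padding decision described above (the helper and the missing-data character are hypothetical illustrations, not code from this PR):

# Hypothetical illustration: at a site whose alleles have different
# non-zero lengths, the shorter allele is written out as missing data so
# that every haplotype string keeps the same length and alignment.
MISSING_CHAR = "N"  # assumed missing-data character, for illustration only

def render_allele(allele, site_width):
    # site_width is the length of the longest allele at this site
    if len(allele) == site_width:
        return allele
    return MISSING_CHAR * site_width  # can't guarantee alignment: mark missing

site_width = 3  # e.g. this site's alleles are "ACG" and "A"
print(render_allele("ACG", site_width))  # -> ACG
print(render_allele("A", site_width))    # -> NNN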

The tests aren't that comprehensive yet, especially for this extra functionality. I thought I should submit a preliminary PR before going to town on test suites, in case any of this seemed like a bad idea. Spun off into #425

@hyanwong (Member, Author) commented Nov 25, 2019

This is failing on (at least one) slightly weird test: test_topology.py:TestEmptyTreeSequences.test_one_node_one_sample_sites.

Here we make a TS with a single (sample) node, a site with an ancestral state, and a single mutation above that node. According to our definition, this is a site with missing data. If we impute the missing data, the docs say we set it to the ancestral state (== '0'), but here the test instead expects '1'. Is this because there is a mutation above the node? We haven't, AFAIK, decided how we should impute a missing node with a mutation above it yet, have we?

@jeromekelleher (Member)

Thanks @hyanwong. Can we do this as a direct replacement of the existing functionality first and add new functionality afterwards please?

@hyanwong (Member, Author) commented Nov 26, 2019

@jeromekelleher I thought you might say that. But what is the correct behaviour in the test_one_node_one_sample_sites example above? The docs say "missing data will be imputed such that all isolated samples are assigned the ancestral state", but the existing implementation does not do this. How should I code it?

@jeromekelleher (Member)

We should be able to code a direct replacement for the existing C code that doesn't break any tests. It's not a good idea to change several things at once.

@hyanwong (Member, Author)

> We should be able to code a direct replacement for the existing C code that doesn't break any tests. It's not a good idea to change several things at once.

I see what you mean, but I think the existing code is either broken or wrongly documented. So should I change the code's result, or the docs? I assume you mean the docs. In particular, we do not impute missing data to be the ancestral state if the isolated node has a mutation above it; in that case, we impute a derived state. Is that correct? If so, we should at least say so.

@jeromekelleher (Member)

The first thing we want is an exact replica of the existing behaviour, with no extra parameters and no changes in documentation. If there does need to be a change in semantics then we need to understand what the old semantics are, and potentially be able to reproduce them.

@hyanwong (Member, Author)

> The first thing we want is an exact replica of the existing behaviour, with no extra parameters and no changes in documentation. If there does need to be a change in semantics then we need to understand what the old semantics are, and potentially be able to reproduce them.

Sorry, we seem to be talking at cross purposes. I'm not necessarily suggesting changing the semantics. I have a PR (not yet pushed) which will replicate the current behaviour exactly (passes all the tests), but the documentation is then wrong. Should I correct the docs? At the moment, the existing behaviour of the C code is wrongly documented.

@jeromekelleher (Member)

OK, well please push that code so that we can discuss it then. We'll need to do a perf analysis to make sure that what we're doing isn't a regression. This is a big change and needs to be done carefully, and to be honest, I have other things to do.

@hyanwong (Member, Author)

Looks like I might have accidentally deleted some C cleanup code when tidying. I'll have a look.

@codecov (bot) commented Nov 26, 2019

Codecov Report

Merging #425 into master will increase coverage by 0.05%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #425      +/-   ##
==========================================
+ Coverage   86.67%   86.73%   +0.05%     
==========================================
  Files          20       20              
  Lines       14258    14115     -143     
  Branches     2774     2745      -29     
==========================================
- Hits        12358    12242     -116     
+ Misses        978      963      -15     
+ Partials      922      910      -12
Flag              Coverage Δ
#c_tests          87.85% <100%> (+0.06%) ⬆️
#python_c_tests   90.56% <100%> (+0.07%) ⬆️
#python_tests     99.23% <100%> (ø) ⬆️

Impacted Files          Coverage Δ
c/tskit/genotypes.c     91.51% <ø> (+2.47%) ⬆️
python/_tskitmodule.c   83.61% <ø> (+0.03%) ⬆️
c/tskit/core.c          83.52% <ø> (-0.13%) ⬇️
python/tskit/trees.py   98.73% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@hyanwong (Member, Author) commented Nov 26, 2019

Well, this looks like it's OK now, and should be a simple drop-in replacement. I'll do another PR for the proposed changes (edit - now in #426)

@jeromekelleher (Member)

OK, thanks. Can we do a perf analysis here? What's the before and after times for a large tree sequence?

@hyanwong (Member, Author)

Very good point. What's the standard way to do this? I haven't done so in anger before, as it were.

@jeromekelleher (Member)

You can use either the timeit module or time.perf_counter() directly. I tend to use the perf_counter approach.
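For example, a minimal timeit version of such a measurement might look like this (a sketch: it assumes a tree sequence ts already exists in scope):

import timeit

# Repeat the measurement 10 times; each run consumes the whole
# haplotypes() iterator once.
times = timeit.repeat(lambda: list(ts.haplotypes()), number=1, repeat=10)
print(min(times))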

@hyanwong (Member, Author) commented Nov 26, 2019

It looks like it's about 2 times slower. I ran the script below against the two codebases:

import time

import msprime

# Ten replicates: simulate the same tree sequence each time (fixed seed)
# and time how long it takes to generate every haplotype string.
for _ in range(10):
    ts = msprime.simulate(
        10000, recombination_rate=1000, mutation_rate=5000, random_seed=123)
    start = time.perf_counter()
    for h in ts.haplotypes():
        string = h  # just consume the iterator
    end = time.perf_counter()
    print(end - start)

Before:

9.241496816277504
9.242490600794554
9.245789850130677
9.243655636906624
9.23997611925006
9.240366514772177
9.24243825674057
9.244240812957287
9.241791695356369
9.238783311098814

After:

19.13179887831211
19.089431881904602
19.54032832942903
19.583404291421175
19.982043754309416
19.10580626502633
19.946577440947294
19.11001042276621
19.101856959983706
19.979458650574088

A little bit of digging shows that the majority of the time in the new haplotypes() method is spent in the line

H[:, var.site.id] = alleles[var.genotypes]

which fills in one column of the output matrix per variant (i.e. all samples at a single site). I'm not sure there is much more optimisation we can do to this bit.
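For context, here is a minimal standalone sketch of what that line does (the array contents are illustrative, not taken from the PR):

import numpy as np

# alleles maps each genotype index to the ASCII code of its allele
# character, with a trailing -1 slot so that missing data (genotype -1)
# indexes the last element.
alleles = np.full(3, -1, dtype=np.int8)
alleles[0] = ord("0")  # ancestral state
alleles[1] = ord("1")  # derived state

genotypes = np.array([0, 1, 1, 0], dtype=np.int8)  # one entry per sample

# Fancy indexing converts every genotype to its ASCII code in one
# vectorised operation; assigning to H[:, site_id] fills the column for
# this site across all samples.
H = np.empty((4, 2), dtype=np.int8)
site_id = 0
H[:, site_id] = alleles[genotypes]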

@jeromekelleher (Member)

Good stuff, thanks. Would you mind pasting in the top lines of the Python profile for a largish example please? I'd like to see how much time is spent in the h.tostring().decode() step --- there's a shortcut we can do there to potentially speed things up.
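A sketch of one possible shortcut of that kind (this is a guess at what is meant, not something stated in the thread): decode the whole matrix once and slice per-row substrings, rather than calling tostring().decode() on every row:

import numpy as np

# H holds one ASCII code per (sample, site) entry; decoding it in one go
# avoids a separate tostring()/decode() call per haplotype.
H = np.full((3, 4), ord("0"), dtype=np.int8)  # toy matrix, all "0"s
n, m = H.shape
big = H.tobytes().decode("ascii")
haplotypes = [big[i * m:(i + 1) * m] for i in range(n)]
print(haplotypes)  # ['0000', '0000', '0000']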

@jeromekelleher (Member)

Also, what happens when we have a sample size of 1M? 10K is pretty small.

@hyanwong (Member, Author)

> Also, what happens when we have a sample size of 1M? 10K is pretty small.

It looks like, as sample size increases, the differences even out. For example, for sample_size = 1e6 and mutation_rate = 2000 (other params the same), I get ~31 mins for each:

old version: 1852.089014943689 secs
new version: 1873.1874577719718 secs

However, it seems like the number of sites is more critical. Here's the slowdown (new time / old time) as a function of mutation rate (length = Ne = 1) and number of samples. For smaller sample sizes with large mutation rates (e.g. the largest rate here is equivalent to 56463 sites), the numpy-based method is ~700 times slower. But perhaps we don't care, since it's still quite fast?

           mut_rate=5    mut_rate=50    mut_rate=500   mut_rate=5000
n=10       2.90313337    19.42890212    149.14133159   766.20840123
n=100      2.2625131     12.48559536     73.3518402    187.20088055
n=1000     1.93421465     6.02976885      6.36719559    14.54888645
n=10000    1.92183657     2.70183765      2.71642332     2.72070453

(NB only 1 replicate of each, but you get the idea)

@hyanwong (Member, Author)

> Would you mind pasting in the top lines of the Python profile for a largish example please?

Here's 10K samples:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  2830                                               def haplotypes(self, impute_missing_data=False):
...
  2876         1         61.0     61.0      0.0          H = np.empty((self.num_samples, self.num_sites), dtype=np.int8)
  2877     78939    2064516.0     26.2     21.7          for var in self.variants(impute_missing_data=impute_missing_data):
  2878                                                       # allow an extra allele at the end for missing data == -1
  2879                                                       # filling with -1 values will ensure an error is raised if unassigned alleles
  2880                                                       # are referenced and converted back to ascii
  2881     78938     880540.0     11.2      9.3              alleles = -np.ones((len(var.alleles) + 1), dtype=np.int8)
  2882    236814     248598.0      1.0      2.6              for i, allele in enumerate(var.alleles):
  2883    157876     103846.0      0.7      1.1                  if allele is not None:
  2884    157876     115512.0      0.7      1.2                      if len(allele) > 1:
  2885                                                                   raise _tskit.LibraryError(
  2886                                                                       "Cannot produce haplotypes from multi-letter alleles")
  2887    157876     242475.0      1.5      2.5                      alleles[i] = ord(allele.encode('ascii'))
  2888     78938    5686557.0     72.0     59.8              H[:, var.site.id] = alleles[var.genotypes]
  2889     10001      10648.0      1.1      0.1          for h in H:
  2890     10000       6123.0      0.6      0.1              try:
  2891     10000     151347.0     15.1      1.6                  yield h.tostring().decode('ascii')

@jeromekelleher (Member)

Excellent! Looks like the time is dominated by decoding the variants for large n, so that's as good as we can do. I don't really care about small n; everything is fast anyway.

This is a great step forward, it'll really help to not have this C code lying around.

@hyanwong (Member, Author)

Great. I can save a little time by using np.full rather than -np.ones. Wait a tick and I'll push an update.
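Concretely, the swap is just the allocation line (num_alleles here is illustrative):

import numpy as np

num_alleles = 2  # illustrative
# Before: build an array of ones, then negate it (two array passes)
alleles = -np.ones(num_alleles + 1, dtype=np.int8)
# After: fill with -1 directly in a single step
alleles = np.full(num_alleles + 1, -1, dtype=np.int8)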

@jeromekelleher (Member) left a review comment

Minor comment above --- I think we can simplify the logic a bit.

@jeromekelleher (Member)

Great, thanks. Merging.

@jeromekelleher merged commit 7950b25 into tskit-dev:master on Nov 27, 2019
Closes: Remove C support for haplotypes and implement in Python (#326)