New python haplotype generator #425
Conversation
Force-pushed from 18673b5 to 49220d6
This is failing on (at least one) slightly weird test. Here we make a tree sequence with a single (sample) node, a site with an ancestral state, and a single mutation above that node. According to our definition, this is a site with missing data. If we impute the missing data, the docs say that we set it to the ancestral state.
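For concreteness, a minimal sketch of that test case using the tskit tables API (the imputation semantics and the relevant keyword arguments have changed across tskit versions, so the output of `haplotypes()` here is illustrative rather than definitive):

```python
import tskit

# One isolated sample node, one site with ancestral state "0", and a
# single mutation to "1" directly above the sample.
tables = tskit.TableCollection(sequence_length=1)
tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0)
tables.sites.add_row(position=0, ancestral_state="0")
tables.mutations.add_row(site=0, node=0, derived_state="1")
ts = tables.tree_sequence()

# The sample is isolated, so by the definition discussed here the site has
# missing data. Whether imputation should yield the ancestral state "0"
# (as documented) or the derived state "1" is exactly the question below.
print(next(ts.haplotypes()))
```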
Thanks @hyanwong. Can we do this as a direct replacement of the existing functionality first, and add new functionality afterwards, please?
@jeromekelleher I thought you might say that. But what is the correct behaviour in the case described above?
We should be able to code a direct replacement for the existing C code that doesn't break any tests. It's not a good idea to change several things at once.
I see what you mean, but I think the existing code is either broken or wrongly documented. So should I change the code's result, or the docs? I assume you mean the docs. In particular, we do not impute missing data to be the ancestral state if the isolated node has a mutation above it; in this case, we impute a derived state. Is that correct? If so, we should at least say so.
The first thing we want is an exact replica of the existing behaviour, with no extra parameters and no changes in documentation. If there does need to be a change in semantics then we need to understand what the old semantics are, and potentially be able to reproduce them.
Sorry, we seem to be talking at cross purposes. I'm not necessarily suggesting changing the semantics. I have a PR (not yet pushed) which replicates the current behaviour exactly (it passes all the tests), but the documentation is then wrong. Should I correct the docs? At the moment, the existing behaviour of the C code is wrongly documented.
OK, well please push that code so that we can discuss it then. We'll need to do a perf analysis to make sure that what we're doing isn't a regression. This is a big change and needs to be done carefully, and to be honest, I have other things to do.
Force-pushed from 49220d6 to f68a53b
Looks like I might have accidentally deleted some C cleanup code when tidying. I'll have a look.
Force-pushed from f68a53b to 043f956
Codecov Report
```
@@            Coverage Diff             @@
##           master     #425      +/-   ##
==========================================
+ Coverage   86.67%   86.73%   +0.05%
==========================================
  Files          20       20
  Lines       14258    14115     -143
  Branches     2774     2745      -29
==========================================
- Hits        12358    12242     -116
+ Misses        978      963      -15
+ Partials      922      910      -12
```
Well, this looks like it's OK now, and should be a simple drop-in replacement. I'll do another PR for the proposed changes (edit: now in #426).
OK, thanks. Can we do a perf analysis here? What are the before and after times for a large tree sequence?
Very good point. What's the standard way to do this? I haven't done so in anger before, as it were.
You can use either the timeit module or time.perf_counter() directly. I tend to use the perf_counter method.
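For example, a minimal timing harness along those lines (the simulation parameters here are placeholders, not the ones used in this thread):

```python
import time

import msprime  # assumed for generating a test tree sequence

ts = msprime.simulate(sample_size=10_000, Ne=10_000, length=1e6,
                      recombination_rate=1e-8, mutation_rate=1e-8)

start = time.perf_counter()
for _ in ts.haplotypes():  # run this under both the old and new codebases
    pass
elapsed = time.perf_counter() - start
print(f"{ts.num_samples} samples, {ts.num_sites} sites: {elapsed:.3f} s")
```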
It looks like it's about 2 times slower. I ran a script (along the lines of the sketch above) with the two codebases.
Before:
After:
A little bit of digging shows that the majority of the time in the new version is spent in the line which allocates genotypes to rows. I'm not sure there is much more optimisation we can do to this bit.
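Roughly, the hot spot is the step that writes each decoded variant into a row of the output matrix. A sketch of the pattern (not the PR's actual code, reusing ts from the timing example above):

```python
import numpy as np

# One row per site, one column per sample; -1 marks unset entries.
H = np.full((ts.num_sites, ts.num_samples), -1, dtype=np.int8)
for var in ts.variants():
    H[var.site.id] = var.genotypes  # the row-allocation step that dominates
```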
Good stuff, thanks. Would you mind pasting in the top lines of the Python profile for a largish example, please? I'd like to see how much time is spent in the variant decoding.
Also, what happens when we have a sample size of 1M? 10K is pretty small. |
It looks like, as sample size increases, the differences even out. For example, for sample_size = 1e6 and mutation rate 2000 (other params the same), I get ~31 mins for each:
old_version: 1852.089014943689 secs
However, it seems the number of sites is more critical. Here's the slowdown as a function of mutation rate (length=Ne=1) and number of samples. For smaller sample sizes with large mutation rates (e.g. equivalent to 56463 sites), the numpy-based method is ~700 times slower. But perhaps we don't care, since it's still quite fast? (NB: only 1 replicate of each, but you get the idea.)
Here's 10K samples:
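For reference, a profile like this can be produced with the standard library:

```python
import cProfile
import pstats

# Profile the full haplotype decode and show the top 10 cumulative entries.
cProfile.run("list(ts.haplotypes())", "haplotypes.prof")
pstats.Stats("haplotypes.prof").sort_stats("cumulative").print_stats(10)
```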
Excellent! Looks like the time is dominated by decoding the variants for large n, so that's as good as we can do. I don't really care about small n; everything is fast anyway. This is a great step forward, it'll really help to not have this C code lying around.
Great. I can save a little time by using np.full rather than -np.ones. Wait a tick and I'll push an update.
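The saving comes from avoiding an extra pass over the array; a small sketch of the difference:

```python
import numpy as np

shape = (14_000, 10_000)
a = -np.ones(shape, dtype=np.int8)     # fills with 1, then negates into a new array
b = np.full(shape, -1, dtype=np.int8)  # fills with -1 in a single pass
assert np.array_equal(a, b)
```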
Force-pushed from 043f956 to f9db7d8
jeromekelleher left a comment:
Minor comment above; I think we can simplify the logic a bit.
Force-pushed from f9db7d8 to 758d5e3
Great, thanks. Merging.
Fixes #326
Note that this adds two pieces of extra functionality that I would like. Firstly (and less controversially), it allows missing data. Secondly, it also allows multi-letter alleles and deletions. This will help us support indels, but does mean that the number of characters in the output string is not necessarily the same as the total number of sites. However, I have coded it so that the alignments between samples are always the same (and hence the string length for each haplotype is guaranteed to be identical). This means having to make a decision about what to do in the case of multiple alleles at a site that are of different (non-zero) lengths. In this case I have chosen to represent the shorter of the two by missing data, since I can't guarantee the alignment. This is perhaps the wrong thing to do; either way, it probably requires some discussion. A sketch of the rule is below.

The tests aren't that comprehensive yet, especially for this extra functionality. I thought I should submit a preliminary PR before going to town on test suites, in case any of this seemed like a bad idea.

Spun off into #425
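To illustrate the alignment rule described above, a hypothetical sketch (the names and helper here are invented for illustration, not taken from the PR):

```python
MISSING_CHAR = "N"  # hypothetical missing-data character

def render_allele(allele, site_width):
    # Pad each site's output to the longest allele; shorter alleles are
    # emitted as missing data because their within-site alignment can't
    # be guaranteed.
    if len(allele) == site_width:
        return allele
    return MISSING_CHAR * site_width

alleles = ["A", "ACG"]  # e.g. an insertion segregating at one site
site_width = max(len(a) for a in alleles)
print(render_allele("ACG", site_width))  # -> ACG
print(render_allele("A", site_width))    # -> NNN (same length, alignment preserved)
```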