Use dependency levels rather than epochs #486
Conversation
NB: at the moment I'm saving the ancestor ids for each level in a normal Python dict.
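For context, here's a minimal sketch of what grouping ancestor ids by dependency level in a plain dict might look like. The function and variable names are illustrative, not the actual tsinfer internals:

```python
from collections import defaultdict

def group_by_level(levels):
    """Map each dependency level to the list of ancestor ids at that level.

    `levels` is a hypothetical sequence where levels[ancestor_id] gives the
    dependency level computed for that ancestor.
    """
    ids_by_level = defaultdict(list)
    for ancestor_id, level in enumerate(levels):
        ids_by_level[level].append(ancestor_id)
    return dict(ids_by_level)
```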
I'd need some time to digest and process this @hyanwong. In the meantime, how about trying stuff out with just the Python engine? This should be easier to understand, and we can see if there are any assumptions about time in there.
The reason that it's raising an error (e.g. on test_inference.py::TestMetadataRoundTrip::test_from_historical_tree_sequence) is a logic bug (damn), which means that it's possible to get overlapping ancestors that have the same time value but are at different dependency levels. That means one can become the ancestor of another, but this then (correctly) raises an error, because a parent and child can't have the same time value. Unless there is a way to restrict LS matching to a subset of the reference panel, which I understand is quite tricky, I think we'll have to solve this by "boosting" overlapping individuals within the same time slice to the highest dependency level. In the extreme case, with all ancestors at a given time overlapping, this would revert to something like the old scheme, but I don't think we'll be in that situation. Probably worth re-testing with a newer algorithm, though.
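One way the "boosting" idea could be sketched: within each exact time value, sweep the ancestors left to right, merge them into transitively-overlapping groups, and raise every member of a group to the group's maximum level. This is an illustrative sketch only; the `time`/`start`/`end`/`level` keys are an assumed representation, not tsinfer's actual data structures:

```python
from itertools import groupby

def boost_overlapping(ancestors):
    """Boost overlapping same-time ancestors to their group's highest level.

    `ancestors` is a hypothetical list of dicts with keys time, start, end
    (a half-open interval [start, end)) and level. Modified in place.
    """
    ordered = sorted(ancestors, key=lambda a: (a["time"], a["start"]))

    def flush(component):
        # Set every member of an overlap group to the group's maximum level.
        top = max(c["level"] for c in component)
        for c in component:
            c["level"] = top

    for _, group in groupby(ordered, key=lambda a: a["time"]):
        component, right = [], None
        for a in group:
            if right is not None and a["start"] < right:
                # Overlaps the current group; extend its right edge.
                component.append(a)
                right = max(right, a["end"])
            else:
                if component:
                    flush(component)
                component, right = [a], a["end"]
        if component:
            flush(component)
    return ancestors
```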
Nuclear option to fix logic bug (tskit-dev#486 (comment))
Codecov Report
| | main | #486 | +/- |
| --- | --- | --- | --- |
| Coverage | 92.92% | 92.95% | +0.03% |
| Files | 17 | 17 | |
| Lines | 5043 | 5041 | -2 |
| Branches | 922 | 926 | +4 |
| Hits | 4686 | 4686 | |
| Misses | 233 | 231 | -2 |
| Partials | 124 | 124 | |
Flags with carried forward coverage won't be shown.
I've added the simple approach mentioned above (where we take all ancestors in the same time slice and set their dependency level to the maximum of any ancestor in that slice) and now all tests pass - woohoo. I worry slightly that this will adversely affect the ability to parallelise - I'm testing that now.

As a follow-up, here are a few suggestions for improving the current solution. The problem is not too bad, because I believe this issue only applies to ancestors that have exactly the same time value (which means that the "dependency level" approach should still be effective in solving the problem of multiple similar ancestor times caused by missing data).

- Perhaps the most sophisticated approach is to identify ancestors that are logically independent across positions - i.e. use some sort of interval library to identify groups of ancestors whose joint intervals do not overlap. Within each group, we can then set the dependency level to the maximum in the group. However, I think this will be computationally intensive.
- Another approach (which could be combined with the above) would be to allow the construction of ancestors to happen at their initially calculated dependency level, but not actually add the ancestors to the growing tree sequence until all the ancestors in a given level have been built. For example, each ancestor could have a "dependency level (build)" and a "dependency level (add)": we build the ancestors during their "build" level, but don't add them until their "add" level has been reached. This would, however, involve buffering the ancestors somewhere, so it is an additional layer of complication.
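The simple approach described above can be sketched in a few lines. Array names here are assumptions for illustration, not the PR's actual variables:

```python
def level_timeslice_max(times, levels):
    """For each distinct time value, raise every ancestor's dependency level
    to the maximum level of any ancestor sharing that exact time.

    `times[i]` and `levels[i]` are the (hypothetical) time and initially
    computed dependency level of ancestor i. Returns the boosted levels.
    """
    max_level = {}
    for t, lvl in zip(times, levels):
        max_level[t] = max(max_level.get(t, lvl), lvl)
    return [max_level[t] for t in times]
```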
Here's the result of the new code on a simulation of 10,000 samples for 10Mb. Looks like it is still working OK. I'm running it on a larger example now - although actually working out the dependencies takes enough time that it seems to be a bit slower in total. I suspect this can be avoided with some more efficient code. It would also be worth trying it out on the HGDP (missing data) examples, as I suspect this is where the approach will show major benefits (versus the unbinned ancestors).
Here's the result from a simulation an order of magnitude larger, with 300,000 sites, and the result for Wilder's unified dataset, chr20, without the ancestor binning stage. One problem is that finding the dependency levels is quite slow. I think that can be sped up considerably using an interval library such as https://github.com/biocore-ntnu/ncls, although there are some details to be worked out (e.g. ideally we would gradually build up the interval list as we go down the ancestors, whereas the docs for that library seem to imply that the list is constructed in a single pass).
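To make the incremental idea concrete, here is a pure-Python sketch of computing dependency levels by querying previously-seen intervals, where an interval structure such as NCLS would replace the linear scan for large inputs. The input layout and function name are assumptions for illustration:

```python
def compute_levels(intervals):
    """Assign each ancestor (processed oldest-first) a dependency level one
    greater than the maximum level of any previously-seen ancestor whose
    half-open interval [start, end) overlaps it.

    `intervals` is a hypothetical list of (start, end) pairs, one per
    ancestor, in processing order. The linear scan over `seen` is O(n) per
    query; an interval tree/NCLS lookup is the obvious speed-up.
    """
    seen = []  # (start, end, level) of ancestors already processed
    levels = []
    for start, end in intervals:
        level = max(
            (lvl for s, e, lvl in seen if s < end and start < e),
            default=-1,
        ) + 1
        seen.append((start, end, level))
        levels.append(level)
    return levels
```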
Currently, on large-scale simulated data, my implementation is significantly slower than the existing one (e.g. 2889 seconds rather than 1124 seconds on the same dataset, both using cycloid with num_threads=64). After a little investigation, I don't think this is really due to the calculation of the dependency levels. Instead, even when there are (say) 30 ancestors within a single dependency level, the CPU usage on cycloid (which should use all 64 procs) is around 180% rather than 3000%. So while I'm sure the logic of the dependency level approach is correct, and we should be able to get speed-ups, there's something wacky about my particular implementation which is making it all slow down. This could be to do with, for example, the need to access the ancestors out of order.

Another thing to note is that the implementation in this PR outputs a slightly different tree sequence from the current main branch (annoyingly, in this example, the output TS is slightly larger in my PR implementation). I suspect this is because path compression is operating at slightly different timepoints.
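For reference, the intended parallel structure - match all ancestors within a level concurrently, with a barrier between levels - can be sketched as follows. `match_one` is a hypothetical stand-in for the per-ancestor matching call, not tsinfer's API:

```python
from concurrent.futures import ThreadPoolExecutor

def match_by_level(ids_by_level, match_one, num_threads=64):
    """Match ancestors level by level; ancestors within a dependency level
    are independent, so they can run concurrently.

    `ids_by_level` maps level -> list of ancestor ids; `match_one` is a
    stand-in for the per-ancestor matching call (names are illustrative).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for level in sorted(ids_by_level):
            ids = ids_by_level[level]
            # Barrier: map() yields all results for this level before the
            # next level starts, so dependencies are respected.
            for anc_id, res in zip(ids, pool.map(match_one, ids)):
                results[anc_id] = res
    return results
```

Observing ~180% CPU here rather than ~3000% suggests the bottleneck is serial work (or lock contention) inside the per-ancestor call, not the level structure itself.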
Thanks @hyanwong. Let's park this here for now. I'll need to get into the weeds on this, and have other things that need finishing up first. |
Yes, I agree. I feel like I've done as much as necessary on this, have shown it works, and identified some issues. That's enough for the moment. Edit - I just established that without path compression the tree sequences are identical via either method, so I'm pretty sure the general approach is correct. |
@hyanwong I've rebased to main, shall I push here or a new PR? |
As you like. If a new PR then you're a bit more in control?
Good point, thanks. I didn't try to optimise this in any way, so not surprised there are several wins. |
Improved by @benjeffery's reworking, so no longer relevant.
OMG @jeromekelleher. I think this might actually work to address #147. It's only failing on 3 tests, all of them when doing `self.tree_sequence_builder.add_path()`, with the following failure. I suspect that's because `add_path()` is checking that stuff is added in time order, but it probably doesn't need to? Perhaps you could take a quick look to confirm. Then I can try perf testing this!!!