
Use dependency levels rather than epochs #486

Closed


@hyanwong (Member) commented Mar 19, 2021

OMG @jeromekelleher. I think this might actually work to address #147. It's only failing on three tests, all of them when doing self.tree_sequence_builder.add_path(), with the following failure:

_tsinfer.LibraryError: Bad path information: time

I suspect that's because add_path() is checking that stuff is added in time order, but it probably doesn't need to? Perhaps you could take a quick look to confirm. Then I can try perf testing this!!!

@hyanwong marked this pull request as draft on March 19, 2021 at 21:16
@hyanwong (Member Author)

NB: at the moment I'm saving the ancestor ids for each level in a normal Python dict: {level1: [ancestor_id1, ancestor_id2, ...], level2: [ancestor_id3, ancestor_id4, ancestor_id5, ...], ...}. But if we have millions of ancestors this might be a bit of a burden, so we could possibly store them e.g. in a zarr array instead, or even re-calculate them on the fly from the intermediate numpy array which I've called dep_level.
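
For illustration, here is a minimal sketch of the recalculate-on-the-fly alternative. The array name dep_level comes from the comment above; the helper function and toy data are hypothetical:

```python
import numpy as np

# dep_level[i] gives the dependency level assigned to ancestor i.
# Rather than keeping a dict of lists in memory, the ids at a given
# level can be recovered on demand with one vectorised scan.
def ancestors_at_level(dep_level: np.ndarray, level: int) -> np.ndarray:
    """Return the ids of all ancestors assigned to `level`."""
    return np.flatnonzero(dep_level == level)

# Toy usage: iterate the levels in order; ancestors within a level only
# depend on lower levels, so each level could be matched in parallel.
dep_level = np.array([0, 0, 1, 2, 1, 2])
for level in range(dep_level.max() + 1):
    print(level, ancestors_at_level(dep_level, level))
```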

@jeromekelleher (Member) left a comment

I'd need some time to digest and process this, @hyanwong. In the meantime, how about trying things out with just the Python engine? That should be easier to understand, and will make it easier to see whether there are any assumptions about time in there.

@hyanwong (Member Author) commented Mar 22, 2021

The reason that it's raising an error (e.g. on test_inference.py::TestMetadataRoundTrip::test_from_historical_tree_sequence) is a logic bug (damn), which means that it's possible to get overlapping ancestors that have the same time value, but are at different dependency levels. That means one can become the ancestor of another, but this then (correctly) raises an error because a parent and child can't have the same time value.

Unless there is a way to restrict LS matching to a subset of the reference panel, which I understand is quite tricky, I think we'll have to solve this by "boosting" overlapping ancestors within the same time slice to the highest dependency level. In the extreme case, with all ancestors at a given time overlapping, this would revert to something like the old scheme. But I don't think we'll be in that situation. Probably worth re-testing with a newer algorithm, though.

hyanwong added a commit to hyanwong/tsinfer that referenced this pull request Mar 23, 2021
@codecov bot commented Mar 23, 2021

Codecov Report

Merging #486 (f20da45) into main (c063d0a) will increase coverage by 0.03%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main     #486      +/-   ##
==========================================
+ Coverage   92.92%   92.95%   +0.03%     
==========================================
  Files          17       17              
  Lines        5043     5041       -2     
  Branches      922      926       +4     
==========================================
  Hits         4686     4686              
+ Misses        233      231       -2     
  Partials      124      124              
Flag     Coverage            Δ
C        92.95% <100.00%>    (+0.03%) ⬆️
python   96.16% <100.00%>    (+0.05%) ⬆️


Impacted Files         Coverage            Δ
tsinfer/formats.py     98.15% <100.00%>    (+<0.01%) ⬆️
tsinfer/inference.py   98.55% <100.00%>    (+0.25%) ⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@hyanwong (Member Author)

I've added the simple approach mentioned above (where we take all ancestors in the same time slice and set their dependency level to be the maximum of any ancestor in the time slice) and now all tests pass - woohoo. I worry slightly that this will adversely affect the ability to parallelise - I'm just testing it now.
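
Here is a minimal sketch of that boosting step, assuming per-ancestor times and dep_level arrays as above (the function name is illustrative, not the actual tsinfer internals):

```python
import numpy as np

def boost_levels_within_time_slices(times, dep_level):
    """Raise every ancestor's level to the maximum level found among
    ancestors sharing exactly the same time value, so that two ancestors
    with equal times can never end up as parent and child."""
    boosted = dep_level.copy()
    for t in np.unique(times):
        in_slice = times == t
        boosted[in_slice] = boosted[in_slice].max()
    return boosted
```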

As a follow-up, here are a few suggestions for improving the current solution. The problem is not too bad, because I believe this issue only applies to ancestors that have exactly the same time value (which means that the "dependency level" approach should still be effective in solving the problem of multiple similar ancestor times caused by missing data).

Perhaps the most sophisticated approach is to identify ancestors that are logically independent across positions - i.e. use some sort of interval library to identify groups of ancestors whose joint intervals do not overlap. Within each group, we can then set the dependency level to the maximum in the group. However, I think this will be computationally intensive.
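
A sketch of the interval idea, under the assumption that each ancestor covers a half-open genomic interval [left, right): a single sweep over the sorted intervals splits the ancestors into groups whose combined spans are disjoint, and the boosting above could then be applied per group rather than per whole time slice.

```python
def disjoint_groups(intervals):
    """Group (left, right, ancestor_id) triples into runs of mutually
    overlapping intervals; the combined spans of different groups are
    disjoint. Intervals are half-open [left, right)."""
    groups = []
    current, current_end = [], None
    for left, right, aid in sorted(intervals):
        if current and left >= current_end:
            # A gap before this interval: close the current group.
            groups.append(current)
            current, current_end = [], None
        current.append(aid)
        current_end = right if current_end is None else max(current_end, right)
    if current:
        groups.append(current)
    return groups

# Toy usage: ancestors 0 and 1 overlap; ancestor 2 is independent.
print(disjoint_groups([(0, 5, 0), (3, 8, 1), (9, 12, 2)]))  # [[0, 1], [2]]
```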

Another approach (which could be combined with the above) would be to allow the construction of ancestors to happen at their initially calculated dependency level, but not actually add the ancestors to the growing tree sequence until all the ancestors in a given level have been built. For example, each ancestor could have a "dependency level (build)" and a "dependency level (add)": we build the ancestors during their "build" level, but don't add them until their "add" level has been reached. This would, however, involve buffering the ancestors somewhere, so it is an additional layer of complication; see the toy sketch below.
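
A toy sketch of the build/add split (every name here is hypothetical; the callables just stand in for the real build and tree-sequence-insertion steps):

```python
from collections import defaultdict

def run_levels(ancestor_ids, build_level, add_level, build, add_to_tree_sequence):
    """Build each ancestor at its "build" level, but defer insertion into
    the tree sequence until its "add" level is reached. build_level and
    add_level map ancestor id -> level, with add_level[a] >= build_level[a]."""
    pending = defaultdict(list)  # add level -> built ancestors awaiting insertion
    for level in range(max(add_level.values()) + 1):
        for a in ancestor_ids:
            if build_level[a] == level:
                pending[add_level[a]].append(build(a))  # parallelisable step
        for built in pending.pop(level, []):
            add_to_tree_sequence(built)  # deferred, grouped insertion

# Toy usage: ancestor 1 is built at level 0 but only added at level 1.
run_levels(
    [0, 1], {0: 0, 1: 0}, {0: 0, 1: 1},
    build=lambda a: f"ancestor {a}", add_to_tree_sequence=print,
)
```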

@hyanwong (Member Author)

Here's the result of the new code on a simulation of 10,000 samples over 10Mb. It looks like it's still working OK. I'm running it on a larger example now, although actually working out the dependencies takes enough time that the total run seems to be a bit slower. I suspect that can be avoided with more efficient code. It would also be worth trying this on the HGDP (missing data) examples, as I suspect that's where the approach will show major benefits (versus the unbinned ancestors).

[Screenshot: results for the 10,000-sample, 10Mb simulation]

@hyanwong (Member Author) commented Mar 23, 2021

Here's the result from a simulation an order of magnitude larger, with 300,000 sites.

[Screenshot: results for the 300,000-site simulation]

And here is the result for Wilder's unified dataset, chr20, without the ancestor binning stage:

[Screenshot: results for the unified dataset, chr20, without ancestor binning]

One problem is that finding the dependency levels is quite slow. I think that can be sped up considerably using an interval library such as https://github.com/biocore-ntnu/ncls, although there are some details to be worked out (e.g. ideally we would gradually build up the interval list as we go down the ancestors, whereas the docs for that library seem to imply that the list is constructed in a single pass).
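
For what it's worth, a rough sketch of how ncls might fit, based on its README (treat the exact API details as an assumption to check): the index is built once from int64 start/end/id arrays, which is the single-pass limitation mentioned above.

```python
import numpy as np
from ncls import NCLS  # https://github.com/biocore-ntnu/ncls

# Build the interval index in one pass over all ancestor intervals.
starts = np.array([0, 3, 9], dtype=np.int64)
ends = np.array([5, 8, 12], dtype=np.int64)
ids = np.arange(len(starts), dtype=np.int64)
index = NCLS(starts, ends, ids)

# Query which ancestors overlap the interval [4, 10).
for start, end, ancestor_id in index.find_overlap(4, 10):
    print(ancestor_id, start, end)
```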

@hyanwong force-pushed the position-aware-match-parallelization branch from 0f2c614 to f20da45 on March 23, 2021 at 14:22
@hyanwong (Member Author) commented Mar 23, 2021

Currently, on large-scale simulated data, my implementation is significantly slower than the existing one (e.g. 2889 seconds rather than 1124 seconds on the same dataset, both using cycloid with num_threads=64). After a little investigation, I don't think this is really due to the calculation of the dependency levels. Instead, even when there are (say) 30 ancestors within a single dependency level, the CPU usage on cycloid (which should use all 64 processors) is around 180% rather than 3000%.

So while I'm sure the logic of the dependency-level approach is correct, and we should be able to get speed-ups, there's something whacky about my particular implementation that is slowing it all down. This could be to do with, for example, the need to access the ancestors out of order (using the ancestors(indexes=ancestor_ids) method). If so, one solution would be to re-order the ancestors before we go into the matching loop. Whatever the reason, I suspect this could do with @jeromekelleher having a look at how I've done it, so as to spot any errors I've made in the parallelisation.

Another thing to note is that the implementation in this PR outputs a slightly different tree sequence from the current main branch (annoyingly, in this example the output TS is slightly larger with my PR implementation). I suspect this is because path compression is operating at slightly different timepoints:

>>> orig.num_nodes
276488
>>> pos_aware.num_nodes
276468
>>> sum([1 for n in orig.nodes() if (n.flags & tsinfer.NODE_IS_PC_ANCESTOR)])                                                                         
8702
>>> sum([1 for n in pos_aware.nodes() if (n.flags & tsinfer.NODE_IS_PC_ANCESTOR)])                                                                    
8682
>>> orig.num_edges
915616
>>> pos_aware.num_edges
916136
>>> orig.num_trees
206619
>>> pos_aware.num_trees
206727

@jeromekelleher (Member)

Thanks @hyanwong. Let's park this here for now. I'll need to get into the weeds on this, and have other things that need finishing up first.

@hyanwong (Member Author) commented Mar 23, 2021

> Thanks @hyanwong. Let's park this here for now. I'll need to get into the weeds on this, and have other things that need finishing up first.

Yes, I agree. I feel like I've done as much as necessary on this, have shown it works, and identified some issues. That's enough for the moment.

Edit: I just established that, without path compression, the two methods produce identical tree sequences, so I'm pretty sure the general approach is correct.

@benjeffery (Member)

@hyanwong I've rebased onto main; shall I push here, or open a new PR?
I think there are some quick perf wins here (e.g. not using the .ancestors iterator when you only need the times), so I'm just working on those.

benjeffery pushed a commit to benjeffery/tsinfer that referenced this pull request Apr 12, 2023
@hyanwong (Member Author)

> @hyanwong I've rebased onto main; shall I push here, or open a new PR?

As you like. With a new PR you'd be a bit more in control?

> I think there are some quick perf wins here (e.g. not using the .ancestors iterator when you only need the times), so I'm just working on those.

Good point, thanks. I didn't try to optimise this in any way, so I'm not surprised there are several wins.
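
For concreteness, a sketch of the kind of win being discussed, assuming the AncestorData object exposes the times as a bulk array (the ancestors_time attribute and the file path here are assumptions to check against the actual tsinfer API):

```python
import tsinfer

ancestor_data = tsinfer.load("example.ancestors")  # hypothetical file

# Slow: decodes a full Ancestor object per row just to read one field.
times_slow = [anc.time for anc in ancestor_data.ancestors()]

# Fast: one vectorised read of the whole column.
times_fast = ancestor_data.ancestors_time[:]
```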

benjeffery pushed a commit to benjeffery/tsinfer that referenced this pull request Apr 12, 2023
benjeffery pushed a commit to benjeffery/tsinfer that referenced this pull request May 19, 2023
@hyanwong (Member Author) commented Jul 3, 2023

Improved by @benjeffery's reworking, so no longer relevant.

@hyanwong closed this Jul 3, 2023