
Use dependency levels rather than epochs #486

Closed


@hyanwong (Member) commented Mar 19, 2021

OMG @jeromekelleher. I think this might actually work to address #147. It's only failing on three tests, all of them when doing self.tree_sequence_builder.add_path(), with the following failure:

_tsinfer.LibraryError: Bad path information: time

I suspect that's because add_path() is checking that stuff is added in time order, but it probably doesn't need to? Perhaps you could take a quick look to confirm. Then I can try perf testing this!!!

@hyanwong marked this pull request as draft on March 19, 2021 at 21:16
@hyanwong (Member Author)

NB: at the moment I'm saving the ancestor ids for each level in a normal Python dict: {level1: [ancestor_id1, ancestor_id2, ...], level2: [ancestor_id3, ancestor_id4, ancestor_id5, ...], ...}. But if we have millions of ancestors this might be a bit of a burden, so we could possibly store them e.g. in a zarr array instead, or even re-calculate them on the fly from the intermediate numpy array which I've called dep_level.
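
For illustration, here is a minimal sketch of the recalculate-on-the-fly alternative. The array name dep_level comes from the comment above; the helper function and toy data are hypothetical:

```python
import numpy as np

# dep_level[i] gives the dependency level assigned to ancestor i.
# Rather than keeping a dict of lists in memory, the ids at a given
# level can be recovered on demand with one vectorised scan.
def ancestors_at_level(dep_level: np.ndarray, level: int) -> np.ndarray:
    """Return the ids of all ancestors assigned to `level`."""
    return np.flatnonzero(dep_level == level)

# Toy usage: iterate the levels in order; ancestors within a level only
# depend on lower levels, so each level could be matched in parallel.
dep_level = np.array([0, 0, 1, 2, 1, 2])
for level in range(dep_level.max() + 1):
    print(level, ancestors_at_level(dep_level, level))
```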

@jeromekelleher (Member) left a comment

I'd need some time to digest and process this, @hyanwong. In the meantime, how about trying things out with just the Python engine? That should be easier to understand, and will make it easier to see whether there are any assumptions about time in there.

@hyanwong (Member Author) commented Mar 22, 2021

The reason that it's raising an error (e.g. on test_inference.py::TestMetadataRoundTrip::test_from_historical_tree_sequence) is a logic bug (damn), which means that it's possible to get overlapping ancestors that have the same time value, but are at different dependency levels. That means one can become the ancestor of another, but this then (correctly) raises an error because a parent and child can't have the same time value.

Unless there is a way to restrict LS matching to a subset of the reference panel, which I understand is quite tricky, I think we'll have to solve this by "boosting" overlapping ancestors within the same time slice to the highest dependency level. In the extreme case, with all ancestors at a given time overlapping, this would revert to something like the old scheme. But I don't think we'll be in that situation. Probably worth re-testing with a newer algorithm, though.

hyanwong added a commit to hyanwong/tsinfer that referenced this pull request Mar 23, 2021
@codecov bot commented Mar 23, 2021

Codecov Report

Merging #486 (f20da45) into main (c063d0a) will increase coverage by 0.03%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main     #486      +/-   ##
==========================================
+ Coverage   92.92%   92.95%   +0.03%     
==========================================
  Files          17       17              
  Lines        5043     5041       -2     
  Branches      922      926       +4     
==========================================
  Hits         4686     4686              
+ Misses        233      231       -2     
  Partials      124      124              
Flag     Coverage            Δ
C        92.95% <100.00%>    (+0.03%) ⬆️
python   96.16% <100.00%>    (+0.05%) ⬆️


Impacted Files         Coverage            Δ
tsinfer/formats.py     98.15% <100.00%>    (+<0.01%) ⬆️
tsinfer/inference.py   98.55% <100.00%>    (+0.25%) ⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@hyanwong (Member Author)

I've added the simple approach mentioned above (where we take all ancestors in the same time slice and set their dependency level to be the maximum of any ancestor in the time slice) and now all tests pass - woohoo. I worry slightly that this will adversely affect the ability to parallelise - I'm just testing it now.
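
Here is a minimal sketch of that boosting step, assuming per-ancestor times and dep_level arrays as above (the function name is illustrative, not the actual tsinfer internals):

```python
import numpy as np

def boost_levels_within_time_slices(times, dep_level):
    """Raise every ancestor's level to the maximum level found among
    ancestors sharing exactly the same time value, so that two ancestors
    with equal times can never end up as parent and child."""
    boosted = dep_level.copy()
    for t in np.unique(times):
        in_slice = times == t
        boosted[in_slice] = boosted[in_slice].max()
    return boosted
```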

As a follow-up, here are a few suggestions for improving the current solution. The problem is not too bad, because I believe this issue only applies to ancestors that have exactly the same time value (which means that the "dependency level" approach should still be effective in solving the problem of multiple similar ancestor times caused by missing data).

Perhaps the most sophisticated approach is to identify ancestors that are logically independent across positions - i.e. use some sort of interval library to identify groups of ancestors whose joint intervals do not overlap. Within each group, we can then set the dependency level to the maximum in the group. However, I think this will be computationally intensive.
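
A sketch of the interval idea, under the assumption that each ancestor covers a half-open genomic interval [left, right): a single sweep over the sorted intervals splits the ancestors into groups whose combined spans are disjoint, and the boosting above could then be applied per group rather than per whole time slice.

```python
def disjoint_groups(intervals):
    """Group (left, right, ancestor_id) triples into runs of mutually
    overlapping intervals; the combined spans of different groups are
    disjoint. Intervals are half-open [left, right)."""
    groups = []
    current, current_end = [], None
    for left, right, aid in sorted(intervals):
        if current and left >= current_end:
            # A gap before this interval: close the current group.
            groups.append(current)
            current, current_end = [], None
        current.append(aid)
        current_end = right if current_end is None else max(current_end, right)
    if current:
        groups.append(current)
    return groups

# Toy usage: ancestors 0 and 1 overlap; ancestor 2 is independent.
print(disjoint_groups([(0, 5, 0), (3, 8, 1), (9, 12, 2)]))  # [[0, 1], [2]]
```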

Another approach (which could be combined with the above) would be to allow the construction of ancestors to happen at their initially calculated dependency level, but not actually add the ancestors to the growing tree sequence until all the ancestors in a given level have been built. For example, each ancestor could have a "dependency level (build)" and a "dependency level (add)": we build the ancestors during their "build" level, but don't add them until their "add" level has been reached. This would, however, involve buffering the ancestors somewhere, so it is an additional layer of complication; see the toy sketch below.
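
A toy sketch of the build/add split (every name here is hypothetical; the callables just stand in for the real build and tree-sequence-insertion steps):

```python
from collections import defaultdict

def run_levels(ancestor_ids, build_level, add_level, build, add_to_tree_sequence):
    """Build each ancestor at its "build" level, but defer insertion into
    the tree sequence until its "add" level is reached. build_level and
    add_level map ancestor id -> level, with add_level[a] >= build_level[a]."""
    pending = defaultdict(list)  # add level -> built ancestors awaiting insertion
    for level in range(max(add_level.values()) + 1):
        for a in ancestor_ids:
            if build_level[a] == level:
                pending[add_level[a]].append(build(a))  # parallelisable step
        for built in pending.pop(level, []):
            add_to_tree_sequence(built)  # deferred, grouped insertion

# Toy usage: ancestor 1 is built at level 0 but only added at level 1.
run_levels(
    [0, 1], {0: 0, 1: 0}, {0: 0, 1: 1},
    build=lambda a: f"ancestor {a}", add_to_tree_sequence=print,
)
```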

@hyanwong (Member Author)

Here's the result of the new code on a simulation of 10,000 samples over 10Mb. It looks like it's still working OK. I'm running it on a larger example now, although actually working out the dependencies takes enough time that the total run seems to be a bit slower. I suspect that can be avoided with more efficient code. It would also be worth trying this on the HGDP (missing data) examples, as I suspect that's where the approach will show major benefits (versus the unbinned ancestors).

[Screenshot: results for the 10,000-sample, 10Mb simulation]

@hyanwong (Member Author) commented Mar 23, 2021

Here's the result from a simulation an order of magnitude larger, with 300,000 sites.

[Screenshot: results for the 300,000-site simulation]

And here is the result for Wilder's unified dataset, chr20, without the ancestor binning stage:

[Screenshot: results for the unified dataset, chr20, without ancestor binning]

One problem is that finding the dependency levels is quite slow. I think that can be sped up considerably using an interval library such as https://github.com/biocore-ntnu/ncls, although there are some details to be worked out (e.g. ideally we would gradually build up the interval list as we go down the ancestors, whereas the docs for that library seem to imply that the list is constructed in a single pass).
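
For what it's worth, a rough sketch of how ncls might fit, based on its README (treat the exact API details as an assumption to check): the index is built once from int64 start/end/id arrays, which is the single-pass limitation mentioned above.

```python
import numpy as np
from ncls import NCLS  # https://github.com/biocore-ntnu/ncls

# Build the interval index in one pass over all ancestor intervals.
starts = np.array([0, 3, 9], dtype=np.int64)
ends = np.array([5, 8, 12], dtype=np.int64)
ids = np.arange(len(starts), dtype=np.int64)
index = NCLS(starts, ends, ids)

# Query which ancestors overlap the interval [4, 10).
for start, end, ancestor_id in index.find_overlap(4, 10):
    print(ancestor_id, start, end)
```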

@hyanwong force-pushed the position-aware-match-parallelization branch from 0f2c614 to f20da45 on March 23, 2021 at 14:22
@hyanwong (Member Author) commented Mar 23, 2021

Currently, on large-scale simulated data, my implementation is significantly slower than the existing one (e.g. 2889 seconds rather than 1124 seconds on the same dataset, both using cycloid with num_threads=64). After a little investigation, I don't think this is really due to the calculation of the dependency levels. Instead, even when there are (say) 30 ancestors within a single dependency level, the CPU usage on cycloid (which should use all 64 processors) is around 180% rather than 3000%.

So while I'm sure the logic of the dependency-level approach is correct, and we should be able to get speed-ups, there's something whacky about my particular implementation that is slowing it all down. This could be to do with, for example, the need to access the ancestors out of order (using the ancestors(indexes=ancestor_ids) method). If so, one solution would be to re-order the ancestors before we go into the matching loop. Whatever the reason, I suspect this could do with @jeromekelleher having a look at how I've done it, so as to spot any errors I've made in the parallelisation.

Another thing to note is that the implementation in this PR outputs a slightly different tree sequence from the current main branch (annoyingly, in this example the output TS is slightly larger with my PR implementation). I suspect this is because path compression is operating at slightly different timepoints:

>>> orig.num_nodes
276488
>>> pos_aware.num_nodes
276468
>>> sum([1 for n in orig.nodes() if (n.flags & tsinfer.NODE_IS_PC_ANCESTOR)])                                                                         
8702
>>> sum([1 for n in pos_aware.nodes() if (n.flags & tsinfer.NODE_IS_PC_ANCESTOR)])                                                                    
8682
>>> orig.num_edges
915616
>>> pos_aware.num_edges
916136
>>> orig.num_trees
206619
>>> pos_aware.num_trees
206727

@jeromekelleher (Member)

Thanks @hyanwong. Let's park this here for now. I'll need to get into the weeds on this, and have other things that need finishing up first.

@hyanwong (Member Author) commented Mar 23, 2021

> Thanks @hyanwong. Let's park this here for now. I'll need to get into the weeds on this, and have other things that need finishing up first.

Yes, I agree. I feel like I've done as much as necessary on this, have shown it works, and identified some issues. That's enough for the moment.

Edit: I just established that, without path compression, the two methods produce identical tree sequences, so I'm pretty sure the general approach is correct.

@benjeffery (Member)

@hyanwong I've rebased onto main; shall I push here, or open a new PR?
I think there are some quick perf wins here (e.g. not using the .ancestors iterator when you only need the times), so I'm just working on those.

benjeffery pushed a commit to benjeffery/tsinfer that referenced this pull request Apr 12, 2023
@hyanwong (Member Author)

> @hyanwong I've rebased onto main; shall I push here, or open a new PR?

As you like. With a new PR you'd be a bit more in control?

> I think there are some quick perf wins here (e.g. not using the .ancestors iterator when you only need the times), so I'm just working on those.

Good point, thanks. I didn't try to optimise this in any way, so I'm not surprised there are several wins.
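
For concreteness, a sketch of the kind of win being discussed, assuming the AncestorData object exposes the times as a bulk array (the ancestors_time attribute and the file path here are assumptions to check against the actual tsinfer API):

```python
import tsinfer

ancestor_data = tsinfer.load("example.ancestors")  # hypothetical file

# Slow: decodes a full Ancestor object per row just to read one field.
times_slow = [anc.time for anc in ancestor_data.ancestors()]

# Fast: one vectorised read of the whole column.
times_fast = ancestor_data.ancestors_time[:]
```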

benjeffery pushed a commit to benjeffery/tsinfer that referenced this pull request Apr 12, 2023
benjeffery pushed a commit to benjeffery/tsinfer that referenced this pull request May 19, 2023
@hyanwong (Member Author) commented Jul 3, 2023

Improved by @benjeffery's reworking, so no longer relevant.

@hyanwong closed this Jul 3, 2023