Provide direction argument to edge_diffs. #2120

jeromekelleher · 2022-02-04T14:06:22Z

Adds a "direction" argument to edge_diffs to allow us to sequentially generate trees in reverse.

Moves the implementation to Python resulting in a roughly 2X perf regression. I doubt this will be significant in practise, because people will usually be doing something with the diffs, which will probably be more expensive. Nonetheless, it is a regression so we should be careful.

If we decide we don't mind about the perf loss for this operation, we can delete some of the Python-C interface boilderplate.

I also modernised the tests for this a bit, while I was in there.

cc @astheeggeggs

codecov · 2022-02-04T14:19:05Z

Codecov Report

Merging #2120 (0d47adb) into main (bf1603d) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #2120      +/-   ##
==========================================
+ Coverage   93.22%   93.29%   +0.06%     
==========================================
  Files          28       28              
  Lines       26718    26764      +46     
  Branches     1204     1227      +23     
==========================================
+ Hits        24908    24969      +61     
+ Misses       1776     1762      -14     
+ Partials       34       33       -1

Flag	Coverage Δ
c-tests	`92.24% <ø> (ø)`
lwt-tests	`89.05% <ø> (ø)`
python-c-tests	`71.20% <60.86%> (-0.29%)`	⬇️
python-tests	`98.86% <100.00%> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
python/_tskitmodule.c	`90.66% <ø> (+0.04%)`	⬆️
python/tskit/metadata.py	`99.01% <100.00%> (+0.25%)`	⬆️
python/tskit/tables.py	`98.92% <100.00%> (+0.03%)`	⬆️
python/tskit/trees.py	`98.33% <100.00%> (+0.22%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bf1603d...0d47adb. Read the comment docs.

hyanwong · 2022-02-04T22:44:54Z

Moves the implementation to Python resulting in a roughly 2X perf regression. I doubt this will be significant in practise, because people will usually be doing something with the diffs, which will probably be more expensive. Nonetheless, it is a regression so we should be careful.

I'm fairly sure that we use the edge diffs a lot in tsdate, don't we @awohns ? Perhaps it's worth testing if this causes any perf problems there (I suspect not though, as we are doing some maths on each diff)

awohns · 2022-02-04T23:24:12Z

That's right @hyanwong, we use the edge diffs a lot. We iterate over them in both the prior and the inside/outside algorithm. Might be good to check if this has a significant effect on tsdate runtime?

jeromekelleher · 2022-02-05T14:19:52Z

That's a good test I think - if this doesn't have a significant impact on a time of a run of tsdate, then I think getting rid of the C linkage is fine, and a small perf regression is something we can live with. If it makes tsdate 2x slower though, then I think we'll have to keep the current C-based implementation for the forwards direction anyway.

jeromekelleher · 2022-05-20T09:36:58Z

@benjeffery - should be push this in for 0.5.0?

@awohns - what's the simplest way of testing out the perf implications for tsdate?

benjeffery · 2022-05-20T10:18:01Z

If we are doing this (which seems the right thing to do I think) then yes it should be in 0.5.0.

petrelharp · 2022-05-21T04:30:55Z

Oh, this would be nice! We were just wanting this!

awohns · 2022-05-26T01:18:22Z

Results of some perf testing on this:
ts.edge_diffs() is used in two places in tsdate. The more expensive usage is constructing the prior. Here are the runtimes for that operation on an inferred tree sequence from 1000 genomes chromosome 22. Current is the current implementation of edge_diffs() and Direction is the implementation in this PR:

Current: 6 mins 44 seconds. Direction: 7mins 41 seconds.

jeromekelleher · 2022-05-26T05:43:05Z

Right, that's too much of a difference. Will need to think this through again - thanks @awohns!

jeromekelleher · 2022-05-26T10:34:04Z

Hmm, so looks like most of the time for this code is actually spent constructing edge objects and the difference is really about the different code paths we take for that now. Some stats:

C code path, yield None (i.e., don't create any edges): 0.98 seconds
Python code path, yield None: 1.91 seconds
C code path, yield lists of tuples: 1.19 seconds
Python code path, yield lists of tuples: 4.05 seconds
C code path, yield list of locally constructed Edge: 7.16 seconds (This is the current implementation in main)
Python code yield list of Edge using self.edge(): 13.66 (implementation Wilder tried out)
Python code yielding list of locally constructed Edge: 10.55 seconds

So, I think there's room for quite a bit of improvement here and we did have a pretty big perf regression.

I wonder if edge metadata is really worth it here, and if we should do something like this:

# fully locked down dataclass
class DiffEdge:
     left: float
     right: float
     parent: int
     child: int
     id: int
    
     @property
     def metadata(self):
          raise ValueError("Metadata access not supported on edge diffs. Please use ``ts.edge(e.id).metadata`` instead")

I think if we can get back to the performance of yielding lists of tuples, we can switch the the Python code path and get rid of some C complexity, as this will be a perf improvement over what's currently in the library.

@benjeffery, any thoughts?

benjeffery · 2022-06-07T12:10:18Z

I've pushed an additional commit here which removes a function call in the Edge creation code path.

Here are the timings on my machine for 100k trees of 1.6M edges:

main 3s
This PR before commit (ts.edge): 7.12s
Edge(*tuple): 4.8s
Edge(*tuple) with own_init: 3.73s

I tried a few more tricks, but they added complexity for only a percent or so, not worth it.

So we're losing about 25% vs main with this python-only code. Also things should be even better with 3.11. I think that is acceptable and we should wrap this up by removing the C code and documenting the function if you agree?

jeromekelleher · 2022-06-07T12:33:18Z

SGTM. Would the DiffEdge trick I suggested above make any difference or is it basically the same as the current code?

benjeffery · 2022-06-07T21:31:17Z

Using a simple DiffEdge as you described above gets you to 3.55 vs 3.73. It's not nothing, but I don't think that is a big enough win to justify adding a parallel class.

jeromekelleher · 2022-06-08T08:22:51Z

OK then, I guess we should put this to the end-users. @awohns, would you mind trying this version out on your benchmark and letting us know if you're happy with the perf?

jeromekelleher · 2022-06-15T12:12:47Z

OK then, I guess we should put this to the end-users. @awohns, would you mind trying this version out on your benchmark and letting us know if you're happy with the perf?

Bumping this one @awohns - as a key user of edge_diffs we'd like to know if the perf drop here is acceptable for your applications. Would you mind checking your benchmark again? (It's also probably relevant for your "bricking" code). We'd like to release 0.5.0 in the next few days.

awohns · 2022-06-15T17:52:54Z

Sorry, didn't see this until now @jeromekelleher.

So here's the timing for creating the tsdate prior of NYGC 1000 genomes chr22 (same file as before; 453,350 trees):
Current: 6 mins 29 seconds (389 seconds)
Direction: 7 mins 14 seconds (434 seconds)
So it's ~12% slower

And here's the perf for running [naive_split_edges()](https://github.com/awohns/ldgm/blob/a152cb6db42658ab24d973757979af83a1c9c6d5/ldgm/bricks.py#L42) on the first 30 MB of the same tree sequence (148,437 trees):
Current: 8 mins 57 seconds (537 seconds)
Direction: 8 mins 2 seconds (482 seconds)
Which I guess means that edge_diffs() is not really a contributing factor to the overall runtime of this function.

jeromekelleher · 2022-06-16T08:13:55Z

Thanks @awohns. I think you should have the casting vote here on whether we keep the existing C code or not then - 10% is a significant perf drop, but not a major one and probably wouldn't be noticed. I honestly don't mind either way as we have the code in both cases (we can keep the current C-based approach as the default), I just want someone else to make the decision!

benjeffery · 2022-06-16T09:31:16Z

My 2 cents here is that by Python 3.11 we'll probably get that 10% back. If @awohns doesn't want to make the decision I'm happy to!

awohns · 2022-06-16T13:51:15Z

Thanks for seeking my feedback @jeromekelleher and for input @benjeffery!
I think that 10% is acceptable from my perspective, especially given that edge_diffs is not a dominant factor in the runtime of either tsdate or ldgm. So I'm happy for this to be merged!

jeromekelleher · 2022-06-16T16:41:32Z

OK, great! @benjeffery - do you want to do the rest here or will I pick it up?

benjeffery · 2022-06-20T12:47:40Z

OK, great! @benjeffery - do you want to do the rest here or will I pick it up?

I can take this from here!

jeromekelleher · 2022-06-20T13:50:12Z

whoops, took off the AUTOMERGE label there @benjeffery as I'd left some embarassing intermediate commits.Will squash those now.

jeromekelleher · 2022-06-20T13:53:02Z

This also still leaves in the C infrastructure as dead code. Will pull out now.

jeromekelleher · 2022-06-20T13:56:19Z

Documentation isn't finished either. Marking as draft.

benjeffery · 2022-06-20T14:11:57Z

whoops, took off the AUTOMERGE label there @benjeffery as I'd left some embarassing intermediate commits.Will squash those now.

Gah, sorry. I didn't think properly about the C and docs.

Moves the implementation to Python resulting in a roughly 2X perf regression.

jeromekelleher · 2022-06-20T15:19:07Z

OK, should be ready to go now.

jeromekelleher requested review from benjeffery and hyanwong February 4, 2022 16:09

jeromekelleher added this to the Python 0.5.0 milestone May 20, 2022

jeromekelleher force-pushed the generalise-diffs branch from 9ed70ab to b3ef71c Compare May 26, 2022 10:35

jeromekelleher assigned benjeffery Jun 1, 2022

benjeffery force-pushed the generalise-diffs branch from 1f77096 to 9207c06 Compare June 7, 2022 21:30

benjeffery force-pushed the generalise-diffs branch from 9207c06 to 64cd79f Compare June 20, 2022 12:52

benjeffery marked this pull request as ready for review June 20, 2022 13:33

benjeffery approved these changes Jun 20, 2022

View reviewed changes

benjeffery added the AUTOMERGE-REQUESTED label Jun 20, 2022

jeromekelleher force-pushed the generalise-diffs branch from 64cd79f to 586ad8d Compare June 20, 2022 13:36

jeromekelleher removed the AUTOMERGE-REQUESTED label Jun 20, 2022

jeromekelleher force-pushed the generalise-diffs branch from 586ad8d to 0834427 Compare June 20, 2022 13:52

jeromekelleher marked this pull request as draft June 20, 2022 13:56

jeromekelleher and others added 2 commits June 20, 2022 16:14

Provide direction argument to edge_diffs.

06e6c2d

Moves the implementation to Python resulting in a roughly 2X perf regression.

Improve perf of Edge class creation

6ac16e0

jeromekelleher force-pushed the generalise-diffs branch from 96bfce4 to ad9adf6 Compare June 20, 2022 15:18

jeromekelleher marked this pull request as ready for review June 20, 2022 15:18

benjeffery added the AUTOMERGE-REQUESTED label Jun 20, 2022

Remove Python-C edge diff iterator

0d47adb

jeromekelleher force-pushed the generalise-diffs branch from ad9adf6 to 0d47adb Compare June 20, 2022 16:01

mergify bot merged commit 7a1a1a0 into tskit-dev:main Jun 20, 2022

mergify bot removed the AUTOMERGE-REQUESTED label Jun 20, 2022

Provide direction argument to edge_diffs. #2120

Provide direction argument to edge_diffs. #2120

Uh oh!

Conversation

jeromekelleher commented Feb 4, 2022

Uh oh!

codecov bot commented Feb 4, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

hyanwong commented Feb 4, 2022

Uh oh!

awohns commented Feb 4, 2022

Uh oh!

jeromekelleher commented Feb 5, 2022

Uh oh!

jeromekelleher commented May 20, 2022

Uh oh!

benjeffery commented May 20, 2022

Uh oh!

petrelharp commented May 21, 2022

Uh oh!

awohns commented May 26, 2022

Uh oh!

jeromekelleher commented May 26, 2022

Uh oh!

jeromekelleher commented May 26, 2022

Uh oh!

benjeffery commented Jun 7, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Jun 7, 2022

Uh oh!

benjeffery commented Jun 7, 2022

Uh oh!

jeromekelleher commented Jun 8, 2022

Uh oh!

jeromekelleher commented Jun 15, 2022

Uh oh!

awohns commented Jun 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Jun 16, 2022

Uh oh!

benjeffery commented Jun 16, 2022

Uh oh!

awohns commented Jun 16, 2022

Uh oh!

jeromekelleher commented Jun 16, 2022

Uh oh!

benjeffery commented Jun 20, 2022

Uh oh!

jeromekelleher commented Jun 20, 2022

Uh oh!

jeromekelleher commented Jun 20, 2022

Uh oh!

jeromekelleher commented Jun 20, 2022

Uh oh!

benjeffery commented Jun 20, 2022

Uh oh!

jeromekelleher commented Jun 20, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

codecov bot commented Feb 4, 2022 •

edited

Loading

benjeffery commented Jun 7, 2022 •

edited

Loading

awohns commented Jun 15, 2022 •

edited

Loading