Skip to content

Conversation

@mufernando
Copy link
Member

@mufernando mufernando commented Jun 1, 2020

Here I implemented a numpy version of a method to subset a table collection.

There are probably many uses for this, but most importantly:

  • If you subset on a permutation of all existing nodes, you can reorder the Nodes, Individuals, and Populations Tables to your liking. I think @hyanwong was interested in this use?
  • For the new graft method we need a better way of checking for equivalency between Table Collections, and now we can quickly do so by subsetting.
    • We now can get the non-overlapping bits of two tree sequences using this method, so it could be used to rethink the way graft works.

@petrelharp described the motivation and approach here:

The proposed subset-and-reorder method is: tables.subset_nodes(nodes) modifies the tables to retain exactly those:

  • nodes that are listed in nodes and in the order listed
  • mutations whose node entry is in nodes
  • sites with remaining mutations
  • edges whose parent and child entries are in nodes
  • individuals referred to by the nodes in nodes, and in the order encountered in nodes
  • only populations referred to by nodes in nodes, and in the order encountered in nodes
  • migrations whose nodes are in nodes

I also implemented a new function to subset_ragged_col that subsets ragged cols based on a list of indices (or boolean mask). I added it to util.py but not sure that is the right place.

@petrelharp
Copy link
Contributor

To other viewing this issue - we've got a fair bit of work to tidy this up, so comments appreciated especially on the big picture, but we know that it's not ready to go yet.

Maybe the biggest question in my head is what to call this method? It has two main use cases, really: subsetting, and reordering, so maybe the name should reflect reordering also.

@codecov
Copy link

codecov bot commented Jun 2, 2020

Codecov Report

Merging #663 into master will decrease coverage by 0.04%.
The diff coverage is 81.30%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #663      +/-   ##
==========================================
- Coverage   87.75%   87.71%   -0.05%     
==========================================
  Files          23       23              
  Lines       17687    17810     +123     
  Branches     3498     3526      +28     
==========================================
+ Hits        15522    15622     +100     
- Misses       1062     1073      +11     
- Partials     1103     1115      +12     
Flag Coverage Δ
#c_tests 88.95% <78.50%> (-0.09%) ⬇️
#python_c_tests 91.26% <100.00%> (+0.02%) ⬆️
#python_tests 98.99% <100.00%> (+<0.01%) ⬆️
Impacted Files Coverage Δ
c/tskit/tables.c 79.06% <75.78%> (-0.08%) ⬇️
c/tskit/core.c 91.44% <100.00%> (+0.05%) ⬆️
python/_tskitmodule.c 83.91% <100.00%> (+0.05%) ⬆️
python/tskit/tables.py 99.66% <100.00%> (+<0.01%) ⬆️
python/tskit/trees.py 98.21% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 99c0e18...7cc4503. Read the comment docs.

@petrelharp
Copy link
Contributor

To avoid confusion: right now, there are two implementations, one (.subset()) in numpy, the other (.subset_nodes()) in C. They should be equivalent (except for bugs, of which there are some currently), and it will be interesting to compare them.

@jeromekelleher
Copy link
Member

The main big-picture question for me would be, doesn't simplify do this already? We started out calling simplify subset, and that was its first application before we thought about using it for forwards simulation.

@hyanwong
Copy link
Member

hyanwong commented Jun 2, 2020

The main big-picture question for me would be, doesn't simplify do this already? We started out calling simplify subset, and that was its first application before we thought about using it for forwards simulation.

A good point. Just a quick thought: simplify won't reorder internal nodes arbitrarily (but I'm not sure if the new method will do that either). Also simplify doesn't reorder populations / individuals AFAIK, but I guess it could be extended to do that.

@petrelharp
Copy link
Contributor

The main big-picture question for me would be, doesn't simplify do this already?

Simplify does a lot more. As Yan points out, simplify only lets you reorder samples; but much of the structure above the samples might be changed (nodes removed, edges squashed, etc). I've put a reminder of this in the docs; should I make this confusing point even more explicit? If you passed in just the list of sample nodes to subset in an msprime tree, you'd get back the trivial forest where no-one is related to anyone else. Maybe putting a note like this into the docs would help clarify:

Note that *only* nodes listed are included: so if full trees are to be retained in the tree sequence, all ancestral node IDs must be included in ``nodes``, not just the samples.

@jeromekelleher
Copy link
Member

Ah, that makes sense now. Sure, seems like a useful operation then.

@jeromekelleher
Copy link
Member

Ping me when you'd like detailed feedback - I'll mute this thread for now.

old_inds = n.individual[nodes]
indiv_to_keep, i = np.unique(old_inds, return_index=True)
# maintaining order in which they appeard
indiv_to_keep = indiv_to_keep[np.argsort(i)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice

@mufernando mufernando force-pushed the subset branch 2 times, most recently from 41b49cc to 7b4beaf Compare June 3, 2020 13:05
@mufernando
Copy link
Member Author

mufernando commented Jun 3, 2020

I don't think the subset_ragged_columns is working as it should because it is not reordering.

55518db fixed this and added metadata to the tests

@mufernando
Copy link
Member Author

Speed comparisons for the C and Numpy versions:

Here for one tree of 4M nodes, I tested the speed for different sizes of subset:
Screen Shot 2020-06-03 at 2 23 52 PM

Here for trees of varying sizes, I tested the speed to get a same sized subset (40 nodes):
Screen Shot 2020-06-03 at 2 24 27 PM

@petrelharp
Copy link
Contributor

Ok, thanks! I was curious, because if numpy development is much easier, then it'd be a good way to quickly add more functionality. Comparing these two implementations, I'm not sure it is easier (the C code is pretty straightfoward), and the difference betwee 7sec and 0sec is definately noticable (although not a big deal in the bigger scheme of things). Any other thoughts on the comparison, @mufernando? Seems like we want to move your (very nice) python implementation over to testing code, for verification, though.

@mufernando
Copy link
Member Author

add a test with random locations

@mufernando
Copy link
Member Author

I think I addressed all of your comments, @jeromekelleher

I was fighting with codecov bc it says the loops over mutations and migrations are never covered by our tests, but I'm pretty sure they are -- all our examples have mutations and migrations which seem to be correctly subsetted...

@petrelharp
Copy link
Contributor

I can take a look first if you like, @jeromekelleher.

@jeromekelleher
Copy link
Member

I can take a look first if you like, @jeromekelleher.

Sure, that'd be great, thanks @petrelharp. Or are you procrastinating from something...


def test_wf_examples(self):
tables = wf.wf_sim(5, 5, seed=65634)
tables.sort()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the sort?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, but all the other examples with the WF simulator do that. I think it has to do with the Edges Tables requirements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, well, since we're just working with tables, not tree sequences, let's not sort!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need tree sequences for some of the tsutil functions (to add metadata, etc).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm: those don't need the tree sequence either, but ok!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean? they take in tree sequences and dump the tables internally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - they could be rewritten to operate on tables instead of tree sequences, since that's really what's going on. But, I'm not suggesting you do that.

@petrelharp
Copy link
Contributor

petrelharp commented Jun 19, 2020

Looks good - just a few comments. I don't understand codecov either. Could you do something about those comments and ping jerome again?

@mufernando
Copy link
Member Author

I think I addressed all the issues. Not sure about codecov, @jeromekelleher. I don't understand how it works -- sometimes it complains sometimes it doesn't. Pretty sure the parts about mutations and migrations in the C code are covered by our tests, though.

@petrelharp
Copy link
Contributor

Yep, over to you, @jeromekelleher (except my note about removing the sort() above).

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good thanks @mufernando. I think we can make the C code a bit clearer and more efficient. Also, there's some semantic difficulties with migrations. We need to either not support them, or handle (and test) this situtation.

@jeromekelleher
Copy link
Member

We also need to bump up test coverage quite a bit. The C tests don't have any sites or mutations, so that's not being covered. Neither are migrations (but we should just drop that anyway, and test that we behave correctly when we try to subset a including them).

Need to add a test for the TreeSequence.subset method also. This can just be verifying that it returns the same thing as ts.dump_tables().subset()

@mufernando
Copy link
Member Author

thanks for the revisions, @jeromekelleher! I added some mutations to the C side, but codecov isn't being rerun. Can we explicitly ask it to be run again?

@benjeffery
Copy link
Member

benjeffery commented Jun 23, 2020

Codecov isn't reporting as valgrind is reporting a memory leak: https://circleci.com/gh/tskit-dev/tskit/2088?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
EDIT: Looks like you commit a fix just as I was typing that! :D

@mufernando
Copy link
Member Author

yeah, just fixed that. but either way codecov is not in the list of pending checks.

@benjeffery
Copy link
Member

Codecov will only appear after circle ci is successfully completed, the coverage is measured in the circle CI run.

@jeromekelleher
Copy link
Member

Looking great, thanks @mufernando! Apologies for all the nitpicking, but I think it was worth the effort. The code is super-clear and clean now.

Just need to get codecov happy and squash the commits now I think.

@mufernando
Copy link
Member Author

mufernando commented Jun 23, 2020

Thank you, @jeromekelleher and @petrelharp for the patience and help along the way!

I am not sure how to bump up codecov at this point. The parts of the code not being tested are some ifs on the return of add_row lines (and other error handling sections). What is the proper way of dealing with this?

@jeromekelleher
Copy link
Member

You've tested all the conditions that can reasonably happen @mufernando, so we're ready to merge. Can you squash the commits please?

@mufernando
Copy link
Member Author

woot woot!

thanks, you all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants