method to subset and reorder table collection #663

mufernando · 2020-06-01T21:01:08Z

Here I implemented a numpy version of a method to subset a table collection.

There are probably many uses for this, but most importantly:

- If you subset on a permutation of all existing nodes, you can reorder the Nodes, Individuals, and Populations Tables to your liking. I think @hyanwong was interested in this use?
- For the new graft method we need a better way of checking for equivalency between Table Collections, and now we can quickly do so by subsetting.
- We now can get the non-overlapping bits of two tree sequences using this method, so it could be used to rethink the way graft works.

@petrelharp described the motivation and approach here:

The proposed subset-and-reorder method is: tables.subset_nodes(nodes) modifies the tables to retain exactly those:

- nodes that are listed in nodes, in the order listed
- mutations whose node entry is in nodes
- sites with remaining mutations
- edges whose parent and child entries are in nodes
- individuals referred to by the nodes in nodes, in the order encountered in nodes
- populations referred to by the nodes in nodes, in the order encountered in nodes
- migrations whose nodes are in nodes
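
To make the rules above concrete, here is a minimal pure-Python sketch of the same semantics. It is not the code in this PR: the function name `subset_tables` and the dict-of-lists table layout are assumptions chosen for illustration, migrations are left out, and mutation parents are not remapped.

```python
NULL = -1  # null ID marker, as used for missing individual/population references


def subset_tables(tables, nodes):
    """Hypothetical stand-in: `tables` maps table names to lists of row dicts."""
    # New node IDs follow the order given in `nodes`.
    node_map = {old: new for new, old in enumerate(nodes)}

    # Individuals and populations are kept in order of first appearance in `nodes`.
    ind_map, pop_map = {}, {}
    for old in nodes:
        row = tables["nodes"][old]
        if row["individual"] != NULL:
            ind_map.setdefault(row["individual"], len(ind_map))
        if row["population"] != NULL:
            pop_map.setdefault(row["population"], len(pop_map))

    out = {
        "individuals": [tables["individuals"][i] for i in ind_map],
        "populations": [tables["populations"][p] for p in pop_map],
        "nodes": [],
    }
    for old in nodes:
        row = dict(tables["nodes"][old])  # copy, so the input is not modified
        row["individual"] = ind_map.get(row["individual"], NULL)
        row["population"] = pop_map.get(row["population"], NULL)
        out["nodes"].append(row)

    # Edges survive only if both endpoints were retained.
    out["edges"] = [
        {**e, "parent": node_map[e["parent"]], "child": node_map[e["child"]]}
        for e in tables["edges"]
        if e["parent"] in node_map and e["child"] in node_map
    ]

    # Mutations survive if their node was retained; sites survive if they
    # still carry at least one mutation. Sites keep their original order.
    kept_muts = [m for m in tables["mutations"] if m["node"] in node_map]
    kept_sites = sorted({m["site"] for m in kept_muts})
    site_map = {old: new for new, old in enumerate(kept_sites)}
    out["sites"] = [tables["sites"][s] for s in kept_sites]
    out["mutations"] = [
        {**m, "node": node_map[m["node"]], "site": site_map[m["site"]]}
        for m in kept_muts
    ]
    return out
```

Passing a permutation of all node IDs as `nodes` drops nothing here and only renumbers, which is the reordering use case mentioned above.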

I also implemented a new function, subset_ragged_col, that subsets ragged columns based on a list of indices (or a boolean mask). I added it to util.py, but I'm not sure that is the right place.
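
For readers unfamiliar with ragged columns: each row's variable-length data is stored in one flat array, with an offsets array marking where each row starts and ends. The sketch below shows the idea with plain lists. The signature is a guess for illustration only, not the one added to util.py.

```python
def subset_ragged_col(data, offsets, keep):
    """Hypothetical signature, for illustration only.

    Row i of the ragged column is data[offsets[i]:offsets[i + 1]].
    `keep` is either a list of row indices (kept in the order given)
    or a boolean mask with one entry per row.
    """
    # Convert a boolean mask into the list of indices it selects.
    if keep and all(isinstance(k, bool) for k in keep):
        keep = [i for i, flag in enumerate(keep) if flag]

    new_data, new_offsets = [], [0]
    for i in keep:
        # Copy the whole row, then record where the next row begins.
        new_data.extend(data[offsets[i]:offsets[i + 1]])
        new_offsets.append(len(new_data))
    return new_data, new_offsets
```

Because rows are copied in the order of `keep`, passing indices out of order reorders the column as well as subsetting it.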

petrelharp · 2020-06-01T22:33:10Z

To others viewing this issue - we've got a fair bit of work to do to tidy this up, so comments are appreciated, especially on the big picture, but we know that it's not ready to go yet.

Maybe the biggest question in my head is what to call this method? It has two main use cases, really: subsetting, and reordering, so maybe the name should reflect reordering also.

codecov · 2020-06-02T05:56:13Z

Codecov Report

Merging #663 into master will decrease coverage by 0.04%.
The diff coverage is 81.30%.

@@            Coverage Diff             @@
##           master     #663      +/-   ##
==========================================
- Coverage   87.75%   87.71%   -0.05%     
==========================================
  Files          23       23              
  Lines       17687    17810     +123     
  Branches     3498     3526      +28     
==========================================
+ Hits        15522    15622     +100     
- Misses       1062     1073      +11     
- Partials     1103     1115      +12

Flag	Coverage Δ
#c_tests	`88.95% <78.50%> (-0.09%)`	⬇️
#python_c_tests	`91.26% <100.00%> (+0.02%)`	⬆️
#python_tests	`98.99% <100.00%> (+<0.01%)`	⬆️

Impacted Files	Coverage Δ
c/tskit/tables.c	`79.06% <75.78%> (-0.08%)`	⬇️
c/tskit/core.c	`91.44% <100.00%> (+0.05%)`	⬆️
python/_tskitmodule.c	`83.91% <100.00%> (+0.05%)`	⬆️
python/tskit/tables.py	`99.66% <100.00%> (+<0.01%)`	⬆️
python/tskit/trees.py	`98.21% <100.00%> (+<0.01%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 99c0e18...7cc4503.

petrelharp · 2020-06-02T05:57:26Z

To avoid confusion: right now, there are two implementations, one (.subset()) in numpy, the other (.subset_nodes()) in C. They should be equivalent (except for bugs, of which there are some currently), and it will be interesting to compare them.

python/tests/test_tables.py

python/tskit/tables.py

jeromekelleher · 2020-06-02T07:55:11Z

The main big-picture question for me would be, doesn't simplify do this already? We started out calling simplify subset, and that was its first application before we thought about using it for forwards simulation.

hyanwong · 2020-06-02T07:59:01Z

The main big-picture question for me would be, doesn't simplify do this already? We started out calling simplify subset, and that was its first application before we thought about using it for forwards simulation.

A good point. Just a quick thought: simplify won't reorder internal nodes arbitrarily (but I'm not sure if the new method will do that either). Also simplify doesn't reorder populations / individuals AFAIK, but I guess it could be extended to do that.

petrelharp · 2020-06-02T14:01:39Z

The main big-picture question for me would be, doesn't simplify do this already?

Simplify does a lot more. As Yan points out, simplify only lets you reorder samples; but much of the structure above the samples might be changed (nodes removed, edges squashed, etc). I've put a reminder of this in the docs; should I make this confusing point even more explicit? If you passed in just the list of sample nodes to subset in an msprime tree, you'd get back the trivial forest where no-one is related to anyone else. Maybe putting a note like this into the docs would help clarify:

Note that *only* nodes listed are included: so if full trees are to be retained in the tree sequence, all ancestral node IDs must be included in ``nodes``, not just the samples.
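
A toy illustration of that note, using plain `(parent, child)` tuples rather than real tskit tables (the helper name `edges_after_subset` is made up for this example):

```python
def edges_after_subset(edges, nodes):
    """Keep and renumber only the edges whose parent and child are both in `nodes`."""
    new_id = {old: new for new, old in enumerate(nodes)}
    return [(new_id[p], new_id[c]) for p, c in edges if p in new_id and c in new_id]


# One tree: samples 0, 1, 2; node 3 is the parent of 0 and 1; node 4 is the root.
edges = [(3, 0), (3, 1), (4, 2), (4, 3)]

samples_only = edges_after_subset(edges, [0, 1, 2])
with_ancestors = edges_after_subset(edges, [0, 1, 2, 3, 4])
```

With only the samples listed, every edge loses its parent and `samples_only` is empty, which is the trivial forest described above. Listing the ancestors too keeps all four edges.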

jeromekelleher · 2020-06-02T14:04:03Z

Ah, that makes sense now. Sure, seems like a useful operation then.

python/tskit/tables.py

jeromekelleher · 2020-06-02T14:04:25Z

Ping me when you'd like detailed feedback - I'll mute this thread for now.

python/tskit/tables.py

petrelharp · 2020-06-02T14:09:02Z

python/tskit/tables.py

+        old_inds = n.individual[nodes]
+        indiv_to_keep, i = np.unique(old_inds, return_index=True)
+        # maintaining the order in which they appeared
+        indiv_to_keep = indiv_to_keep[np.argsort(i)]
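
For anyone puzzling over the snippet above: `np.unique` returns its values sorted, so the `argsort` on the first-occurrence indices is what restores the order of appearance. A small standalone sketch (the wrapper name and the example arrays are made up, and null `-1` entries are not handled here):

```python
import numpy as np


def order_preserving_unique(values):
    """Unique entries of `values`, in order of first appearance."""
    # `uniq` is sorted; `first_seen[k]` is the index where uniq[k] first occurs.
    uniq, first_seen = np.unique(values, return_index=True)
    # Sorting by first occurrence undoes the value-sorting done by np.unique.
    return uniq[np.argsort(first_seen)]


individual = np.array([2, 2, 0, 1, 1])  # hypothetical node -> individual column
nodes = np.array([3, 0, 4, 1, 2])       # nodes to retain, in the requested order
old_inds = individual[nodes]            # individuals in node order, with repeats
indiv_to_keep = order_preserving_unique(old_inds)
```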


python/tskit/tables.py

python/tskit/util.py

python/tskit/tables.py

mufernando · 2020-06-03T14:26:25Z

I don't think the subset_ragged_columns is working as it should because it is not reordering.

55518db fixed this and added metadata to the tests

mufernando · 2020-06-03T17:25:34Z

Speed comparisons for the C and Numpy versions:

Here for one tree of 4M nodes, I tested the speed for different sizes of subset:

Here for trees of varying sizes, I tested the speed of extracting a subset of fixed size (40 nodes):

petrelharp · 2020-06-03T17:38:55Z

Ok, thanks! I was curious, because if numpy development is much easier, then it'd be a good way to quickly add more functionality. Comparing these two implementations, I'm not sure it is easier (the C code is pretty straightforward), and the difference between 7sec and 0sec is definitely noticeable (although not a big deal in the bigger scheme of things). Any other thoughts on the comparison, @mufernando? Seems like we want to move your (very nice) python implementation over to testing code, for verification, though.

mufernando · 2020-06-03T22:16:17Z

add a test with random locations

python/tests/test_tables.py

mufernando · 2020-06-19T12:35:31Z

I think I addressed all of your comments, @jeromekelleher

I was fighting with codecov bc it says the loops over mutations and migrations are never covered by our tests, but I'm pretty sure they are -- all our examples have mutations and migrations which seem to be correctly subsetted...

petrelharp · 2020-06-19T14:28:31Z

I can take a look first if you like, @jeromekelleher.

jeromekelleher · 2020-06-19T14:38:48Z

I can take a look first if you like, @jeromekelleher.

Sure, that'd be great, thanks @petrelharp. Or are you procrastinating from something...

c/tskit/tables.h

python/tests/test_lowlevel.py

petrelharp · 2020-06-19T14:56:29Z

python/tests/test_tables.py

+
+    def test_wf_examples(self):
+        tables = wf.wf_sim(5, 5, seed=65634)
+        tables.sort()


why the sort?

I'm not sure, but all the other examples with the WF simulator do that. I think it has to do with the Edges Tables requirements.

Ah, well, since we're just working with tables, not tree sequences, let's not sort!

We need tree sequences for some of the tsutil functions (to add metadata, etc).

Hm: those don't need the tree sequence either, but ok!

what do you mean? they take in tree sequences and dump the tables internally.

Right - they could be rewritten to operate on tables instead of tree sequences, since that's really what's going on. But, I'm not suggesting you do that.

python/tskit/trees.py

python/tests/test_tables.py

petrelharp · 2020-06-19T15:04:28Z

Looks good - just a few comments. I don't understand codecov either. Could you do something about those comments and ping jerome again?

mufernando · 2020-06-19T15:55:01Z

I think I addressed all the issues. Not sure about codecov, @jeromekelleher. I don't understand how it works -- sometimes it complains sometimes it doesn't. Pretty sure the parts about mutations and migrations in the C code are covered by our tests, though.

petrelharp · 2020-06-19T16:52:11Z

Yep, over to you, @jeromekelleher (except my note about removing the sort() above).

jeromekelleher

Looks good, thanks @mufernando. I think we can make the C code a bit clearer and more efficient. Also, there are some semantic difficulties with migrations. We need to either not support them, or handle (and test) this situation.

c/tskit/tables.c

python/tests/tsutil.py

jeromekelleher · 2020-06-22T17:56:59Z

We also need to bump up test coverage quite a bit. The C tests don't have any sites or mutations, so that's not being covered. Neither are migrations (but we should just drop that anyway, and test that we behave correctly when we try to subset a table collection including them).

Need to add a test for the TreeSequence.subset method also. This can just be verifying that it returns the same thing as ts.dump_tables().subset()

mufernando · 2020-06-23T14:19:47Z

thanks for the revisions, @jeromekelleher! I added some mutations to the C side, but codecov isn't being rerun. Can we explicitly ask it to be run again?

benjeffery · 2020-06-23T14:22:16Z

Codecov isn't reporting as valgrind is reporting a memory leak: https://circleci.com/gh/tskit-dev/tskit/2088?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
EDIT: Looks like you committed a fix just as I was typing that! :D

mufernando · 2020-06-23T14:23:40Z

yeah, just fixed that. but either way codecov is not in the list of pending checks.

benjeffery · 2020-06-23T14:25:08Z

Codecov will only appear after circle ci is successfully completed, the coverage is measured in the circle CI run.

jeromekelleher · 2020-06-23T15:55:35Z

Looking great, thanks @mufernando! Apologies for all the nitpicking, but I think it was worth the effort. The code is super-clear and clean now.

Just need to get codecov happy and squash the commits now I think.

mufernando · 2020-06-23T16:25:15Z

Thank you, @jeromekelleher and @petrelharp for the patience and help along the way!

I am not sure how to bump up codecov at this point. The parts of the code not being tested are some ifs on the return of add_row lines (and other error handling sections). What is the proper way of dealing with this?

jeromekelleher · 2020-06-23T18:25:55Z

You've tested all the conditions that can reasonably happen @mufernando, so we're ready to merge. Can you squash the commits please?

mufernando · 2020-06-23T19:53:32Z

woot woot!

thanks, you all!

petrelharp force-pushed the subset branch from 1b53526 to 146d8ac on June 1, 2020 23:53