method to subset and reorder table collection #663
To others viewing this issue: we've got a fair bit of work to do to tidy this up, so comments are appreciated, especially on the big picture, but we know that it's not ready to go yet. Maybe the biggest question in my head is what to call this method. It really has two main use cases, subsetting and reordering, so maybe the name should reflect reordering also.
Codecov Report
@@ Coverage Diff @@
## master #663 +/- ##
==========================================
- Coverage 87.75% 87.71% -0.05%
==========================================
Files 23 23
Lines 17687 17810 +123
Branches 3498 3526 +28
==========================================
+ Hits 15522 15622 +100
- Misses 1062 1073 +11
- Partials 1103 1115 +12
|
|
To avoid confusion: right now, there are two implementations, one ( |
|
The main big-picture question for me would be, doesn't simplify do this already? We started out calling simplify |
A good point. Just a quick thought: simplify won't reorder internal nodes arbitrarily (but I'm not sure if the new method will do that either). Also simplify doesn't reorder populations / individuals AFAIK, but I guess it could be extended to do that. |
Simplify does a lot more. As Yan points out, simplify only lets you reorder samples; but much of the structure above the samples might be changed (nodes removed, edges squashed, etc). I've put a reminder of this in the docs; should I make this confusing point even more explicit? If you passed in just the list of sample nodes to subset in an msprime tree, you'd get back the trivial forest where no-one is related to anyone else. Maybe putting a note like this into the docs would help clarify: |
|
Ah, that makes sense now. Sure, seems like a useful operation then. |
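The distinction drawn above (subsetting keeps only what directly involves the listed nodes, whereas simplify rewrites the structure above the samples) can be sketched in plain Python. This is only an illustration under stated assumptions: `subset_edges` is a hypothetical helper, edges are plain `(left, right, parent, child)` tuples, and an edge is assumed to survive only when both of its nodes are in the list. It is not the implementation in this PR.

```python
def subset_edges(edges, nodes):
    """Keep only edges whose parent and child are both in `nodes`.

    `edges` is a list of (left, right, parent, child) tuples and `nodes`
    lists the old node IDs to keep, in their new order. Hypothetical helper
    for illustration only, not the code in this PR.
    """
    # The position of a node in `nodes` becomes its new ID.
    node_map = {old: new for new, old in enumerate(nodes)}
    kept = []
    for left, right, parent, child in edges:
        if parent in node_map and child in node_map:
            kept.append((left, right, node_map[parent], node_map[child]))
    return kept


# Samples are 0, 1, 2; internal nodes are 3 and 4.
edges = [
    (0.0, 1.0, 3, 0),
    (0.0, 1.0, 3, 1),
    (0.0, 1.0, 4, 2),
    (0.0, 1.0, 4, 3),
]

# Only the samples: every edge touches a dropped internal node, so none survive.
only_samples = subset_edges(edges, [0, 1, 2])

# All nodes in reverse order: every edge survives, with renumbered nodes.
reordered = subset_edges(edges, [4, 3, 2, 1, 0])
```

With only the samples listed, no edges remain, which is the "trivial forest" described above; listing every node in a new order is a pure reordering.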
|
Ping me when you'd like detailed feedback - I'll mute this thread for now. |
python/tskit/tables.py
old_inds = n.individual[nodes]
indiv_to_keep, i = np.unique(old_inds, return_index=True)
# maintaining order in which they appeared
indiv_to_keep = indiv_to_keep[np.argsort(i)]
very nice
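The idiom in the snippet above can be tried in isolation. A minimal sketch, assuming made-up individual IDs (the array values are for illustration only): `np.unique` returns its values sorted, and `return_index=True` also gives the position of each value's first occurrence, so sorting by those positions restores the order of first appearance.

```python
import numpy as np

# Individual IDs of the retained nodes, in node order (made-up values).
old_inds = np.array([3, 1, 3, 0, 1])

# np.unique sorts its output; return_index gives where each value first occurs.
uniq, first_pos = np.unique(old_inds, return_index=True)

# Ordering the unique values by first position restores order of appearance.
in_order = uniq[np.argsort(first_pos)]
```

Here `uniq` is sorted (0, 1, 3), while `in_order` follows the order in which the IDs first appear (3, 1, 0).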
41b49cc to 7b4beaf
55518db fixed this and added metadata to the tests |
|
Ok, thanks! I was curious, because if numpy development is much easier, then it'd be a good way to quickly add more functionality. Comparing these two implementations, I'm not sure it is easier (the C code is pretty straightforward), and the difference between 7sec and 0sec is definitely noticeable (although not a big deal in the bigger scheme of things). Any other thoughts on the comparison, @mufernando? Seems like we want to move your (very nice) python implementation over to testing code, for verification, though.
|
add a test with random locations |
|
I think I addressed all of your comments, @jeromekelleher. I was fighting with codecov bc it says the loops over mutations and migrations are never covered by our tests, but I'm pretty sure they are -- all our examples have mutations and migrations, which seem to be correctly subsetted...
|
I can take a look first if you like, @jeromekelleher. |
Sure, that'd be great, thanks @petrelharp. Or are you procrastinating from something... |
python/tests/test_tables.py
def test_wf_examples(self):
    tables = wf.wf_sim(5, 5, seed=65634)
    tables.sort()
why the sort?
I'm not sure, but all the other examples with the WF simulator do that. I think it has to do with the Edges Tables requirements.
Ah, well, since we're just working with tables, not tree sequences, let's not sort!
We need tree sequences for some of the tsutil functions (to add metadata, etc).
Hm: those don't need the tree sequence either, but ok!
what do you mean? they take in tree sequences and dump the tables internally.
Right - they could be rewritten to operate on tables instead of tree sequences, since that's really what's going on. But, I'm not suggesting you do that.
|
Looks good - just a few comments. I don't understand codecov either. Could you do something about those comments and ping jerome again? |
|
I think I addressed all the issues. Not sure about codecov, @jeromekelleher. I don't understand how it works -- sometimes it complains sometimes it doesn't. Pretty sure the parts about mutations and migrations in the C code are covered by our tests, though. |
|
Yep, over to you, @jeromekelleher (except my note about removing the sort() above). |
jeromekelleher
left a comment
Looks good, thanks @mufernando. I think we can make the C code a bit clearer and more efficient. Also, there are some semantic difficulties with migrations. We need to either not support them, or handle (and test) this situation.
|
We also need to bump up test coverage quite a bit. The C tests don't have any sites or mutations, so that's not being covered. Neither are migrations (but we should just drop that anyway, and test that we behave correctly when we try to subset a table collection including them). Need to add a test for the TreeSequence.subset method also. This can just be verifying that it returns the same thing as |
|
thanks for the revisions, @jeromekelleher! I added some mutations to the C side, but codecov isn't being rerun. Can we explicitly ask it to be run again? |
|
Codecov isn't reporting as valgrind is reporting a memory leak: https://circleci.com/gh/tskit-dev/tskit/2088?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link |
|
yeah, just fixed that. but either way codecov is not in the list of pending checks. |
|
Codecov will only appear after CircleCI has completed successfully; the coverage is measured in the CircleCI run.
|
Looking great, thanks @mufernando! Apologies for all the nitpicking, but I think it was worth the effort. The code is super-clear and clean now. Just need to get codecov happy and squash the commits now I think. |
|
Thank you, @jeromekelleher and @petrelharp for the patience and help along the way! I am not sure how to bump up codecov at this point. The parts of the code not being tested are some ifs on the return of |
|
You've tested all the conditions that can reasonably happen @mufernando, so we're ready to merge. Can you squash the commits please? |
|
woot woot! thanks, you all! |


Here I implemented a numpy version of a method to subset a table collection.
There are probably many uses for this, but most importantly:
graft method we need a better way of checking for equivalency between Table Collections, and now we can quickly do so by subsetting.
graft works.
@petrelharp described the motivation and approach here:
The proposed subset-and-reorder method is: tables.subset_nodes(nodes) modifies the tables to retain exactly those:
I also implemented a new function, subset_ragged_col, that subsets ragged columns based on a list of indices (or a boolean mask). I added it to util.py, but I'm not sure that is the right place.
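To make the ragged-column idea concrete, here is a rough sketch of what such a helper might look like. It assumes the usual packed layout, in which row j of the column is `packed[offset[j]:offset[j + 1]]`. The name `subset_ragged_col_sketch` and its signature are hypothetical and are not taken from the PR.

```python
import numpy as np


def subset_ragged_col_sketch(packed, offset, keep):
    """Return (packed, offset) restricted to the rows selected by `keep`.

    Row j of the column is packed[offset[j]:offset[j + 1]]. `keep` may be
    an array of row indices or a boolean mask. Illustrative only.
    """
    packed = np.asarray(packed)
    offset = np.asarray(offset)
    keep = np.asarray(keep)
    if keep.dtype == bool:
        # Turn a boolean mask into the indices of the selected rows.
        keep = np.flatnonzero(keep)
    else:
        keep = keep.astype(np.int64)
    # Lengths of the kept rows, in their new order, give the new offsets.
    lengths = np.diff(offset)[keep]
    new_offset = np.concatenate(([0], np.cumsum(lengths))).astype(offset.dtype)
    # Gather the kept rows' data in the same order.
    pieces = [packed[offset[j]:offset[j + 1]] for j in keep]
    new_packed = np.concatenate(pieces) if pieces else packed[:0]
    return new_packed, new_offset


# Three rows: [10, 11], [20], [30, 31, 32].
packed = np.array([10, 11, 20, 30, 31, 32])
offset = np.array([0, 2, 3, 6])

# Indices both subset and reorder; a mask subsets in the original order.
by_index = subset_ragged_col_sketch(packed, offset, [2, 0])
by_mask = subset_ragged_col_sketch(packed, offset, [True, False, True])
```

Passing indices lets the caller reorder rows as well as drop them, while a boolean mask can only drop rows; the sketch reduces the mask case to the index case so that both share one code path.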