New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
individual convenience extractors #2298
Conversation
Well, allowing non-sample individuals did break some things, albeit in predictable ways - mostly around VCF output. (I'm a bit nervous we're not adequately testing VCF output in the case where there's non-sample individuals, though.) |
Codecov Report
@@ Coverage Diff @@
## main #2298 +/- ##
===========================================
+ Coverage 81.51% 93.28% +11.76%
===========================================
Files 27 27
Lines 26124 26377 +253
Branches 1182 1188 +6
===========================================
+ Hits 21295 24605 +3310
+ Misses 4759 1742 -3017
+ Partials 70 30 -40
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Thinking about this a bit more - Alternatively we could put our "missing time" in for the value when Out of these options I like the second one. |
Thanks @petrelharp! Should have time to look at this later today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Two high-level things:
- I don't like the semantics around -1, I think it would be better to require strict equality of population (etc) ids.
- Bike-sheddingly, I wonder about the names. We ended up using things like
individuals_time
in tsinfer (eg). The rationale was we usually use arrays in the singlar in the tables API (nodes.time, say), and it's then inconsistent to flip this around. We could imagine doing this for all the table arrays (plus some extras, like say,ts.mutations_edge
), where we're just essentially replacing a "." with a "_".
We probably would want to push the implementation down to C for speed, but that wouldn't have to happen straight away.
python/tskit/trees.py
Outdated
""" | ||
if self._individual_populations is None: | ||
pops = np.repeat(tskit.NULL, self.num_individuals) | ||
for n in self.nodes(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about doing it this way?
for ind in self.individuals():
if len(ind.nodes) > 0:
ind_pops = set(self.node(u).population for u in ind.nodes)
if len(ind_pops) > 1:
raise ValueError("...")
pops[ind.id] = ind_pops.pop() # or whatever the set operation is, can't remember
This avoids iteration over all nodes, and makes the "populations must be exactly equal" semantics easier to implement.
I wouldn't worry too much about the efficiency of doing set operations here, as we'll probably want to drop this functionality down to the C api at some point anyway. I'm pretty sure I've written the equivalent C code in msprime, so we could just copy that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That'd work, but I was writing this to be translatable to C more easily (which I was planning to do in this PR; if these operations are slow they'll be a bottleneck for things I do!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I implemnted this in pyslim using numpy in a much more efficient way - but, the error checking (for inconsistent populations) is ugly and not very robust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is totally translatable to efficient C - I've done it all already here in msprime.
Hmph. That will make the implementation more annoying. =) But, agreed.
This makes sense! |
However, changing it from |
Not breaking code is good... How about we do it both ways, like this? @property
def individuals_time(self):
"""
Documentation
"""
# implementation
@property
def individual_times(self):
# Undocumented alias for individuals_time to avoid breaking pre (VERSION) pyslim code
return individuals_time |
python/tskit/trees.py
Outdated
@@ -5085,17 +5085,18 @@ def individual_populations(self): | |||
has nodes with inconsistent non-NULL populations. | |||
""" | |||
if self._individual_populations is None: | |||
pops = np.repeat(tskit.NULL, self.num_individuals) | |||
pops = np.full(self.num_individuals, tskit.NULL, dtype=np.int32) | |||
seen = np.full(self.num_individuals, False, dtype=bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need the full seen
array - you can look at the local values for a given individual. In C-like code you'd have:
for ind in self.individuals():
if len(ind.nodes) > 0:
ind_pop = -2
for u in ind.nodes:
if ind_pop == -2:
ind_pop = tables.nodes.population[u]
elif ind_pop != tables.nodes.population[u]:
raise ValueError(..)
population[ind.id] = ind_pop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm confused - in C we won't have ready access to the individual's nodes when iterating over individuals, right? Oh! I see, we do! Sorry, I was looking at the stable
documentation, which I see hasn't updated to C 1.0 yet. Yes, that is a lot easier.
LGTM - shall we merge this much and follow up with a C implementation? |
Er, I think I've got the C implementation now? |
I think this is good to go, although I haven't seen the codecov report for the C code. Except for the darn clang-format versioning issue; I guess I'll fix those things up by hand. |
Ok, I've got the compile-with-big-tables issues fixed, and now CircleCI is showing a failure in the C tests that I do not see locally:
Any idea what might be causing that? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
The failure is because you're using decimal fractions which are recurrent in base 2, and older x86 math processors don't handle them in the way you might expect.
c/tskit/core.h
Outdated
An individual had nodes from more than one population | ||
(and only one was requested). | ||
*/ | ||
#define TSK_ERR_MISMATCHING_INDIVIDUAL_POPULATIONS -1703 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly more consistent if it's TSK_ERR_INDIVIDUAL_POPULATION_MISMATCH
(or something), since other individual errors have TSK_ERR_INDIVIDUAL_...
prefix. This is handier for code completion, if you want to tab through the set of errors for individuals
Thanks! Great suggestions there. I'll squash this, just on the off chance that it's actually really ready to go now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, spotted one small detail.
Oh yeah, can you update the Python changelog with this also please? Good to merge whenever then I think. |
Did the changelog. Note that this does not (necessarily) close #1481 because that one also discusses adding |
Huh, somehow my python tests are not being seen, according to codecov? Gotta run, will look later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a few nits.
Thanks, @benjeffery! Got those things in also. |
allow non-individual nodes in tests a few more things lint
OK - this is ready to go! I'll someone else hit 'merge' though. |
I'll get the bot to hit the button! Thanks @petrelharp! |
Ok, here I'm adding
ts.individual_locations
,ts.individual_times
, andts.individual_nodes
. These are all cached properties... which I think does what we want? To be clear, what (I think) we want is to (a) have them only be computed once, and (b) not be returning new copies of them every time they are accessed. Hm - if this is working, I should make them read-only (how to do this?), so people don't think they can modify the tree sequence by modifying these.The main decision I made that I'm not sure about is the case when some of an individuals' nodes have
population=-1
and some have a legit population. I made the decision - honestly, for convenience - to treat-1
as "missing data", and so an individual with node populations-1
and1
would then be reported as having population1
, with no contradiction. (i.e., the unique non-null population) This seems nice but I haven't thought through other implications of this sort of thing.I also said I'd add
time
andpopulation
properties to theIndividual
class. However, I'm not sure what to put in for these properties in the case where these are not well-defined (ie the values are not consistent across nodes). Perhaps I should not do this and instruct people to use, e.g.,ts.individual_times[ind.id]
instead.Note: I also made one of the generic "make up examples" methods in
tsutil
a bit more general (it was only adding individuals to samples); hopefully this does not break anything (if it does, that will likely be a bug).