Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

individual convenience extractors #2298

Merged
merged 2 commits into from May 27, 2022
Merged

Conversation

petrelharp
Copy link
Contributor

@petrelharp petrelharp commented May 24, 2022

Ok, here I'm adding ts.individual_locations, ts.individual_times, and ts.individual_nodes. These are all cached properties... which I think does what we want? To be clear, what (I think) we want is to (a) have them only be computed once, and (b) not be returning new copies of them every time they are accessed. Hm - if this is working, I should make them read-only (how to do this?), so people don't think they can modify the tree sequence by modifying these.

The main decision I made that I'm not sure about is the case when some of an individuals' nodes have population=-1 and some have a legit population. I made the decision - honestly, for convenience - to treat -1 as "missing data", and so an individual with node populations -1 and 1 would then be reported as having population 1, with no contradiction. (i.e., the unique non-null population) This seems nice but I haven't thought through other implications of this sort of thing.

I also said I'd add time and population properties to the Individual class. However, I'm not sure what to put in for these properties in the case where these are not well-defined (ie the values are not consistent across nodes). Perhaps I should not do this and instruct people to use, e.g., ts.individual_times[ind.id] instead.

Note: I also made one of the generic "make up examples" methods in tsutil a bit more general (it was only adding individuals to samples); hopefully this does not break anything (if it does, that will likely be a bug).

@petrelharp
Copy link
Contributor Author

Well, allowing non-sample individuals did break some things, albeit in predictable ways - mostly around VCF output. (I'm a bit nervous we're not adequately testing VCF output in the case where there's non-sample individuals, though.)

@codecov
Copy link

codecov bot commented May 24, 2022

Codecov Report

Merging #2298 (e8b05e0) into main (fb157e4) will increase coverage by 11.76%.
The diff coverage is 82.24%.

❗ Current head e8b05e0 differs from pull request most recent head a54d029. Consider uploading reports for the commit a54d029 to get more accurate results

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #2298       +/-   ##
===========================================
+ Coverage   81.51%   93.28%   +11.76%     
===========================================
  Files          27       27               
  Lines       26124    26377      +253     
  Branches     1182     1188        +6     
===========================================
+ Hits        21295    24605     +3310     
+ Misses       4759     1742     -3017     
+ Partials       70       30       -40     
Flag Coverage Δ
c-tests 92.31% <100.00%> (+0.02%) ⬆️
lwt-tests 89.05% <ø> (ø)
python-c-tests 71.65% <37.70%> (-0.16%) ⬇️
python-tests 98.89% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
c/tskit/core.h 100.00% <ø> (ø)
python/_tskitmodule.c 90.66% <42.42%> (-0.25%) ⬇️
c/tskit/core.c 97.90% <100.00%> (+0.02%) ⬆️
c/tskit/trees.c 95.07% <100.00%> (+0.05%) ⬆️
python/tskit/trees.py 98.12% <100.00%> (+51.75%) ⬆️
python/tskit/provenance.py 100.00% <0.00%> (+5.71%) ⬆️
python/tskit/tables.py 98.89% <0.00%> (+14.86%) ⬆️
python/tskit/metadata.py 99.01% <0.00%> (+19.49%) ⬆️
python/tskit/util.py 100.00% <0.00%> (+43.58%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb157e4...a54d029. Read the comment docs.

@petrelharp
Copy link
Contributor Author

I also said I'd add time and population properties to the Individual class. However, I'm not sure what to put in for these properties in the case where these are not well-defined (ie the values are not consistent across nodes). Perhaps I should not do this and instruct people to use, e.g., ts.individual_times[ind.id] instead.

Thinking about this a bit more - ind.time could, for instance, return all the distinct times when the individual's nodes have more than one time. But, that's a bit icky because the return type would then need to be always list-like even in the common case when it's just really a single float. (Maybe not a big deal?)

Alternatively we could put our "missing time" in for the value when time is not unique, and tskit.NULL when the population is not unique? (We couldn't distinguish the latter case from the case where nodes all agree, but have tskit.NULL for their population, but perhaps that's ok?)

Out of these options I like the second one.

@benjeffery
Copy link
Member

Thanks @petrelharp! Should have time to look at this later today.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Two high-level things:

  1. I don't like the semantics around -1, I think it would be better to require strict equality of population (etc) ids.
  2. Bike-sheddingly, I wonder about the names. We ended up using things like individuals_time in tsinfer (eg). The rationale was we usually use arrays in the singlar in the tables API (nodes.time, say), and it's then inconsistent to flip this around. We could imagine doing this for all the table arrays (plus some extras, like say, ts.mutations_edge), where we're just essentially replacing a "." with a "_".

We probably would want to push the implementation down to C for speed, but that wouldn't have to happen straight away.

python/tskit/trees.py Outdated Show resolved Hide resolved
"""
if self._individual_populations is None:
pops = np.repeat(tskit.NULL, self.num_individuals)
for n in self.nodes():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about doing it this way?

for ind in self.individuals():
     if len(ind.nodes) > 0: 
          ind_pops = set(self.node(u).population for u in ind.nodes)
          if len(ind_pops) > 1:
               raise ValueError("...")
           pops[ind.id] = ind_pops.pop() # or whatever the set operation is, can't remember

This avoids iteration over all nodes, and makes the "populations must be exactly equal" semantics easier to implement.

I wouldn't worry too much about the efficiency of doing set operations here, as we'll probably want to drop this functionality down to the C api at some point anyway. I'm pretty sure I've written the equivalent C code in msprime, so we could just copy that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That'd work, but I was writing this to be translatable to C more easily (which I was planning to do in this PR; if these operations are slow they'll be a bottleneck for things I do!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I implemnted this in pyslim using numpy in a much more efficient way - but, the error checking (for inconsistent populations) is ugly and not very robust.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is totally translatable to efficient C - I've done it all already here in msprime.

python/tests/test_highlevel.py Outdated Show resolved Hide resolved
python/tests/test_highlevel.py Outdated Show resolved Hide resolved
python/tests/test_highlevel.py Show resolved Hide resolved
python/tskit/trees.py Show resolved Hide resolved
@petrelharp
Copy link
Contributor Author

petrelharp commented May 25, 2022

  1. I don't like the semantics around -1, I think it would be better to
    require strict equality of population (etc) ids.

Hmph. That will make the implementation more annoying. =) But, agreed.

  1. Bike-sheddingly, I wonder about the names. We ended up using things
    like individuals_time in tsinfer (eg).
    The rationale was we usually use arrays in the singlar in the tables API (nodes.time, say),
    and it's then inconsistent to flip this around. We could imagine doing this for all the table
    arrays (plus some extras, like say, ts.mutations_edge), where we're just essentially replacing a "." with a "_".

This makes sense!

@petrelharp
Copy link
Contributor Author

However, changing it from ts.individual_times to ts.individuals_time is a breaking change from pyslim and will totally break code which would otherwise not be broken. I guess I"m voting to keep it the way it is, for this reason.

@jeromekelleher
Copy link
Member

However, changing it from ts.individual_times to ts.individuals_time is a breaking change from pyslim and will totally break code which would otherwise not be broken. I guess I"m voting to keep it the way it is, for this reason.

Not breaking code is good... How about we do it both ways, like this?

@property
def individuals_time(self):
    """
    Documentation
    """
    # implementation

@property
def individual_times(self):
    # Undocumented alias for individuals_time to avoid breaking pre (VERSION) pyslim code
    return individuals_time

@@ -5085,17 +5085,18 @@ def individual_populations(self):
has nodes with inconsistent non-NULL populations.
"""
if self._individual_populations is None:
pops = np.repeat(tskit.NULL, self.num_individuals)
pops = np.full(self.num_individuals, tskit.NULL, dtype=np.int32)
seen = np.full(self.num_individuals, False, dtype=bool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the full seen array - you can look at the local values for a given individual. In C-like code you'd have:

for ind in self.individuals():
     if len(ind.nodes) > 0: 
          ind_pop = -2
          for u in ind.nodes:
               if ind_pop == -2:
                  ind_pop = tables.nodes.population[u]
               elif ind_pop != tables.nodes.population[u]:
                   raise ValueError(..)
           population[ind.id] = ind_pop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm confused - in C we won't have ready access to the individual's nodes when iterating over individuals, right? Oh! I see, we do! Sorry, I was looking at the stable documentation, which I see hasn't updated to C 1.0 yet. Yes, that is a lot easier.

@jeromekelleher
Copy link
Member

LGTM - shall we merge this much and follow up with a C implementation?

@petrelharp
Copy link
Contributor Author

Er, I think I've got the C implementation now?

@petrelharp
Copy link
Contributor Author

I think this is good to go, although I haven't seen the codecov report for the C code.

Except for the darn clang-format versioning issue; I guess I'll fix those things up by hand.

@petrelharp
Copy link
Contributor Author

Ok, I've got the compile-with-big-tables issues fixed, and now CircleCI is showing a failure in the C tests that I do not see locally:

  Test: test_treeseq_get_individuals_time ...FAILED
    1. ../c/tests/test_trees.c:6846  - CU_ASSERT_EQUAL_FATAL(output[0],3.2)

Any idea what might be causing that?

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

The failure is because you're using decimal fractions which are recurrent in base 2, and older x86 math processors don't handle them in the way you might expect.

c/tskit/core.h Outdated
An individual had nodes from more than one population
(and only one was requested).
*/
#define TSK_ERR_MISMATCHING_INDIVIDUAL_POPULATIONS -1703
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly more consistent if it's TSK_ERR_INDIVIDUAL_POPULATION_MISMATCH (or something), since other individual errors have TSK_ERR_INDIVIDUAL_... prefix. This is handier for code completion, if you want to tab through the set of errors for individuals

c/tests/test_trees.c Outdated Show resolved Hide resolved
c/tskit/trees.c Outdated Show resolved Hide resolved
c/tskit/trees.c Outdated Show resolved Hide resolved
c/tskit/trees.c Show resolved Hide resolved
c/tskit/trees.c Outdated Show resolved Hide resolved
python/tests/test_highlevel.py Outdated Show resolved Hide resolved
@petrelharp
Copy link
Contributor Author

Thanks! Great suggestions there. I'll squash this, just on the off chance that it's actually really ready to go now.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, spotted one small detail.

c/tskit/core.c Outdated Show resolved Hide resolved
c/tskit/core.c Outdated Show resolved Hide resolved
@jeromekelleher
Copy link
Member

Oh yeah, can you update the Python changelog with this also please? Good to merge whenever then I think.

@petrelharp
Copy link
Contributor Author

Did the changelog. Note that this does not (necessarily) close #1481 because that one also discusses adding population and time to the Individual class. (I'm going to hop over there to record my thoughts on that topic.)

@petrelharp
Copy link
Contributor Author

Huh, somehow my python tests are not being seen, according to codecov? Gotta run, will look later.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a few nits.

c/tskit/core.c Outdated Show resolved Hide resolved
c/tskit/trees.h Show resolved Hide resolved
python/tests/test_highlevel.py Outdated Show resolved Hide resolved
python/tskit/trees.py Show resolved Hide resolved
@petrelharp
Copy link
Contributor Author

Thanks, @benjeffery! Got those things in also.

allow non-individual nodes in tests

a few more things

lint
@petrelharp
Copy link
Contributor Author

OK - this is ready to go! I'll someone else hit 'merge' though.

@benjeffery benjeffery added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label May 27, 2022
@benjeffery
Copy link
Member

OK - this is ready to go! I'll someone else hit 'merge' though.

I'll get the bot to hit the button! Thanks @petrelharp!

@mergify mergify bot merged commit bdace77 into tskit-dev:main May 27, 2022
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label May 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants