individual convenience extractors #2298

petrelharp · 2022-05-24T22:12:10Z

Ok, here I'm adding ts.individual_locations, ts.individual_times, and ts.individual_nodes. These are all cached properties... which I think does what we want? To be clear, what (I think) we want is to (a) have them only be computed once, and (b) not be returning new copies of them every time they are accessed. Hm - if this is working, I should make them read-only (how to do this?), so people don't think they can modify the tree sequence by modifying these.
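As a rough illustration of the two goals above (compute once, hand back the same read-only object each time), here is a minimal standard-library sketch. The class `IndividualSummary` and its fields are hypothetical stand-ins, not tskit code. For numpy arrays, the analogous switch is setting `arr.flags.writeable = False` before returning the array.

```python
from array import array
from functools import cached_property


class IndividualSummary:
    """Hypothetical stand-in for a tree sequence; not the tskit class."""

    def __init__(self, node_times):
        self._node_times = list(node_times)
        self.num_computations = 0  # counts how often the array is built

    @cached_property
    def individual_times(self):
        # Runs on first access only; the result is then stored on the instance.
        self.num_computations += 1
        buf = array("d", self._node_times)
        # Return a read-only view: item assignment raises TypeError.
        return memoryview(buf).toreadonly()


summary = IndividualSummary([0.0, 0.0, 1.5])
first = summary.individual_times
second = summary.individual_times  # same object, no recomputation
```

Because the cached value is read-only, an accidental `summary.individual_times[0] = 9.0` fails loudly instead of silently changing a copy.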

The main decision I made that I'm not sure about is the case when some of an individual's nodes have population=-1 and some have a legit population. I made the decision - honestly, for convenience - to treat -1 as "missing data", and so an individual with node populations -1 and 1 would then be reported as having population 1 (i.e., the unique non-null population), with no contradiction. This seems nice but I haven't thought through other implications of this sort of thing.
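The "-1 is missing data" rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: `lenient_population` is a hypothetical helper, and `NULL` stands in for `tskit.NULL` (which is -1).

```python
NULL = -1  # stand-in for tskit.NULL


def lenient_population(node_populations):
    """Return the unique non-NULL population, or NULL if there is none."""
    known = {p for p in node_populations if p != NULL}
    if len(known) > 1:
        raise ValueError(f"inconsistent populations: {sorted(known)}")
    return known.pop() if known else NULL


mixed = lenient_population([NULL, 1])        # NULL is ignored, so this is 1
all_null = lenient_population([NULL, NULL])  # nothing known, so this is NULL
```

Under this rule only two different non-NULL populations count as a contradiction.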

I also said I'd add time and population properties to the Individual class. However, I'm not sure what to put in for these properties in the case where these are not well-defined (ie the values are not consistent across nodes). Perhaps I should not do this and instruct people to use, e.g., ts.individual_times[ind.id] instead.

Note: I also made one of the generic "make up examples" methods in tsutil a bit more general (it was only adding individuals to samples); hopefully this does not break anything (if it does, that will likely be a bug).

petrelharp · 2022-05-24T22:33:12Z

Well, allowing non-sample individuals did break some things, albeit in predictable ways - mostly around VCF output. (I'm a bit nervous we're not adequately testing VCF output in the case where there's non-sample individuals, though.)

python/tests/test_tables.py

codecov · 2022-05-24T23:04:25Z

Codecov Report

Merging #2298 (e8b05e0) into main (fb157e4) will increase coverage by 11.76%.
The diff coverage is 82.24%.

❗ Current head e8b05e0 differs from pull request most recent head a54d029. Consider uploading reports for the commit a54d029 to get more accurate results

@@             Coverage Diff             @@
##             main    #2298       +/-   ##
===========================================
+ Coverage   81.51%   93.28%   +11.76%     
===========================================
  Files          27       27               
  Lines       26124    26377      +253     
  Branches     1182     1188        +6     
===========================================
+ Hits        21295    24605     +3310     
+ Misses       4759     1742     -3017     
+ Partials       70       30       -40

Flag	Coverage Δ
c-tests	`92.31% <100.00%> (+0.02%)`	⬆️
lwt-tests	`89.05% <ø> (ø)`
python-c-tests	`71.65% <37.70%> (-0.16%)`	⬇️
python-tests	`98.89% <100.00%> (?)`

Impacted Files	Coverage Δ
c/tskit/core.h	`100.00% <ø> (ø)`
python/_tskitmodule.c	`90.66% <42.42%> (-0.25%)`	⬇️
c/tskit/core.c	`97.90% <100.00%> (+0.02%)`	⬆️
c/tskit/trees.c	`95.07% <100.00%> (+0.05%)`	⬆️
python/tskit/trees.py	`98.12% <100.00%> (+51.75%)`	⬆️
python/tskit/provenance.py	`100.00% <0.00%> (+5.71%)`	⬆️
python/tskit/tables.py	`98.89% <0.00%> (+14.86%)`	⬆️
python/tskit/metadata.py	`99.01% <0.00%> (+19.49%)`	⬆️
python/tskit/util.py	`100.00% <0.00%> (+43.58%)`	⬆️
... and 6 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fb157e4...a54d029.

petrelharp · 2022-05-25T05:01:29Z

I also said I'd add time and population properties to the Individual class. However, I'm not sure what to put in for these properties in the case where these are not well-defined (ie the values are not consistent across nodes). Perhaps I should not do this and instruct people to use, e.g., ts.individual_times[ind.id] instead.

Thinking about this a bit more - ind.time could, for instance, return all the distinct times when the individual's nodes have more than one time. But, that's a bit icky because the return type would then need to be always list-like even in the common case when it's just really a single float. (Maybe not a big deal?)

Alternatively we could put our "missing time" in for the value when time is not unique, and tskit.NULL when the population is not unique? (We couldn't distinguish the latter case from the case where nodes all agree, but have tskit.NULL for their population, but perhaps that's ok?)

Out of these options I like the second one.
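To make the trade-off in the second option concrete, here is a small sketch. The helper `unique_or` and the constant `MISSING_TIME` are hypothetical; `NULL` stands in for `tskit.NULL`, and NaN is used here only as a placeholder for whatever "missing time" value would be chosen.

```python
import math

NULL = -1  # stand-in for tskit.NULL
MISSING_TIME = math.nan  # hypothetical placeholder for a "missing time" marker


def unique_or(values, fallback):
    """Return the single distinct value in `values`, else `fallback`."""
    distinct = set(values)
    return distinct.pop() if len(distinct) == 1 else fallback


time_ok = unique_or([2.0, 2.0], MISSING_TIME)     # consistent: 2.0
time_mixed = unique_or([0.0, 1.0], MISSING_TIME)  # inconsistent: NaN
pop_mixed = unique_or([0, 1], NULL)               # inconsistent: NULL
pop_all_null = unique_or([NULL, NULL], NULL)      # consistent, but also NULL
```

The last two results are equal, which is the ambiguity mentioned above: a caller cannot tell "populations disagree" apart from "all nodes have a NULL population".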

benjeffery · 2022-05-25T10:01:01Z

Thanks @petrelharp! Should have time to look at this later today.

jeromekelleher

LGTM!

Two high-level things:

1. I don't like the semantics around -1; I think it would be better to require strict equality of population (etc.) ids.
2. Bike-sheddingly, I wonder about the names. We ended up using things like individuals_time in tsinfer (e.g.). The rationale was that we usually use arrays in the singular in the tables API (nodes.time, say), and it's then inconsistent to flip this around. We could imagine doing this for all the table arrays (plus some extras, like, say, ts.mutations_edge), where we're essentially just replacing a "." with a "_".

We probably would want to push the implementation down to C for speed, but that wouldn't have to happen straight away.

jeromekelleher · 2022-05-25T14:18:40Z

python/tskit/trees.py

+        """
+        if self._individual_populations is None:
+            pops = np.repeat(tskit.NULL, self.num_individuals)
+            for n in self.nodes():


What about doing it this way?

for ind in self.individuals():
    if len(ind.nodes) > 0:
        ind_pops = set(self.node(u).population for u in ind.nodes)
        if len(ind_pops) > 1:
            raise ValueError("...")
        pops[ind.id] = ind_pops.pop()  # or whatever the set operation is, can't remember

This avoids iteration over all nodes, and makes the "populations must be exactly equal" semantics easier to implement.

I wouldn't worry too much about the efficiency of doing set operations here, as we'll probably want to drop this functionality down to the C api at some point anyway. I'm pretty sure I've written the equivalent C code in msprime, so we could just copy that.

petrelharp · 2022-05-25

That'd work, but I was writing this to be translatable to C more easily (which I was planning to do in this PR; if these operations are slow they'll be a bottleneck for things I do!)

petrelharp · 2022-05-25

FWIW, I implemented this in pyslim using numpy in a much more efficient way - but the error checking (for inconsistent populations) is ugly and not very robust.

jeromekelleher · 2022-05-25

This is totally translatable to efficient C - I've done it all already here in msprime.

python/tests/test_highlevel.py

python/tskit/trees.py

petrelharp · 2022-05-25T17:09:34Z

I don't like the semantics around -1, I think it would be better to
require strict equality of population (etc) ids.

Hmph. That will make the implementation more annoying. =) But, agreed.

Bike-sheddingly, I wonder about the names. We ended up using things
like individuals_time in tsinfer (eg).
The rationale was we usually use arrays in the singular in the tables API (nodes.time, say),
and it's then inconsistent to flip this around. We could imagine doing this for all the table
arrays (plus some extras, like say, ts.mutations_edge), where we're just essentially replacing a "." with a "_".

This makes sense!

petrelharp · 2022-05-25T17:19:01Z

However, changing it from ts.individual_times to ts.individuals_time is a breaking change from pyslim and will totally break code which would otherwise not be broken. I guess I'm voting to keep it the way it is, for this reason.

jeromekelleher · 2022-05-25T19:38:18Z

However, changing it from ts.individual_times to ts.individuals_time is a breaking change from pyslim and will totally break code which would otherwise not be broken. I guess I'm voting to keep it the way it is, for this reason.

Not breaking code is good... How about we do it both ways, like this?

@property
def individuals_time(self):
    """
    Documentation
    """
    # implementation

@property
def individual_times(self):
    # Undocumented alias for individuals_time to avoid breaking pre (VERSION) pyslim code
    return self.individuals_time

jeromekelleher · 2022-05-25T19:44:10Z

python/tskit/trees.py

@@ -5085,17 +5085,18 @@ def individual_populations(self):
        has nodes with inconsistent non-NULL populations.
        """
        if self._individual_populations is None:
-            pops = np.repeat(tskit.NULL, self.num_individuals)
+            pops = np.full(self.num_individuals, tskit.NULL, dtype=np.int32)
+            seen = np.full(self.num_individuals, False, dtype=bool)


You don't need the full seen array - you can look at the local values for a given individual. In C-like code you'd have:

for ind in self.individuals():
    if len(ind.nodes) > 0:
        ind_pop = -2
        for u in ind.nodes:
            if ind_pop == -2:
                ind_pop = tables.nodes.population[u]
            elif ind_pop != tables.nodes.population[u]:
                raise ValueError(..)
        population[ind.id] = ind_pop

petrelharp · 2022-05-26

Sorry, I'm confused - in C we won't have ready access to the individual's nodes when iterating over individuals, right? Oh! I see, we do! Sorry, I was looking at the stable documentation, which I see hasn't updated to C 1.0 yet. Yes, that is a lot easier.
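For readers who want to run the sentinel loop from this thread without a tree sequence at hand, here is a self-contained sketch using plain lists. The function name and its arguments are hypothetical, `NULL` stands in for `tskit.NULL`, and this is not the code that was merged.

```python
NULL = -1   # stand-in for tskit.NULL
UNSET = -2  # sentinel meaning "no node seen yet", as in the loop above


def individuals_population(individual_nodes, node_population):
    """individual_nodes[i] lists the node IDs belonging to individual i."""
    result = [NULL] * len(individual_nodes)
    for ind_id, nodes in enumerate(individual_nodes):
        ind_pop = UNSET
        for u in nodes:
            if ind_pop == UNSET:
                ind_pop = node_population[u]
            elif ind_pop != node_population[u]:
                # Strict equality: NULL and a real population also clash.
                raise ValueError(
                    f"individual {ind_id} has nodes from more than one population"
                )
        if ind_pop != UNSET:
            result[ind_id] = ind_pop
    return result


# Three individuals: two nodes in population 0, two NULL nodes, and no nodes.
pops = individuals_population([[0, 1], [2, 3], []], [0, 0, NULL, NULL])
```

Only one local variable per individual is needed, so no per-individual `seen` array has to be allocated.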

jeromekelleher · 2022-05-26T05:41:53Z

LGTM - shall we merge this much and follow up with a C implementation?

petrelharp · 2022-05-26T06:22:30Z

Er, I think I've got the C implementation now?

petrelharp · 2022-05-26T06:37:34Z

I think this is good to go, although I haven't seen the codecov report for the C code.

Except for the darn clang-format versioning issue; I guess I'll fix those things up by hand.

petrelharp · 2022-05-26T07:03:03Z

Ok, I've got the compile-with-big-tables issues fixed, and now CircleCI is showing a failure in the C tests that I do not see locally:

  Test: test_treeseq_get_individuals_time ...FAILED
    1. ../c/tests/test_trees.c:6846  - CU_ASSERT_EQUAL_FATAL(output[0],3.2)

Any idea what might be causing that?

jeromekelleher

LGTM!

The failure is because you're using decimal fractions which are recurrent in base 2, and older x86 math processors don't handle them in the way you might expect.
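A quick way to see the representation issue, sketched in Python rather than C: 3.2 has no finite binary expansion, while a value such as 3.25 does. Two common ways to keep such a test robust are to pick exactly representable test values or to compare within a tolerance (CUnit offers `CU_ASSERT_DOUBLE_EQUAL` for the latter). The variable names below are illustrative only.

```python
import math
from fractions import Fraction

# 3.2 is stored as the nearest binary double, not as exactly 16/5.
stored_exactly_3_2 = Fraction(3.2) == Fraction(16, 5)
# 3.25 = 13/4 has a power-of-two denominator, so it is stored exactly.
stored_exactly_3_25 = Fraction(3.25) == Fraction(13, 4)

# A sum that is 3.2 on paper; compare with a tolerance instead of ==.
total = 1.1 + 2.1
close_enough = math.isclose(total, 3.2, rel_tol=1e-12)
```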

jeromekelleher · 2022-05-26T08:52:04Z

c/tskit/core.h

+An individual had nodes from more than one population
+(and only one was requested).
+*/
+#define TSK_ERR_MISMATCHING_INDIVIDUAL_POPULATIONS                 -1703


Slightly more consistent if it's TSK_ERR_INDIVIDUAL_POPULATION_MISMATCH (or something), since other individual errors have the TSK_ERR_INDIVIDUAL_... prefix. This is handier for code completion, if you want to tab through the set of errors for individuals.

c/tests/test_trees.c

c/tskit/trees.c

python/tests/test_highlevel.py

petrelharp · 2022-05-26T13:39:32Z

Thanks! Great suggestions there. I'll squash this, just on the off chance that it's actually really ready to go now.

jeromekelleher

LGTM, spotted one small detail.

c/tskit/core.c

jeromekelleher · 2022-05-26T13:43:28Z

Oh yeah, can you update the Python changelog with this also please? Good to merge whenever then I think.

petrelharp · 2022-05-26T13:48:15Z

Did the changelog. Note that this does not (necessarily) close #1481 because that one also discusses adding population and time to the Individual class. (I'm going to hop over there to record my thoughts on that topic.)

petrelharp · 2022-05-26T13:58:00Z

Huh, somehow my python tests are not being seen, according to codecov? Gotta run, will look later.

benjeffery

LGTM, just a few nits.

c/tskit/core.c

c/tskit/trees.h

python/tests/test_highlevel.py

python/tskit/trees.py

petrelharp · 2022-05-26T15:33:17Z

Thanks, @benjeffery! Got those things in also.

allow non-individual nodes in tests a few more things lint

petrelharp · 2022-05-26T23:21:28Z

OK - this is ready to go! I'll let someone else hit 'merge' though.

benjeffery · 2022-05-27T08:37:31Z

OK - this is ready to go! I'll let someone else hit 'merge' though.

I'll get the bot to hit the button! Thanks @petrelharp!

petrelharp force-pushed the indiv_stuff branch from a238d91 to f7a782a Compare May 24, 2022 22:32

petrelharp force-pushed the indiv_stuff branch from f7a782a to 25fcad7 Compare May 24, 2022 22:58

jeromekelleher reviewed May 25, 2022

petrelharp force-pushed the indiv_stuff branch from 301625d to 0a009cb Compare May 26, 2022 06:42

jeromekelleher reviewed May 26, 2022

fix rst links

ce6d5b5

petrelharp force-pushed the indiv_stuff branch from adbdb5d to 5b6e041 Compare May 26, 2022 13:40

jeromekelleher approved these changes May 26, 2022

petrelharp force-pushed the indiv_stuff branch from e44584b to 1a67d54 Compare May 26, 2022 13:43

petrelharp force-pushed the indiv_stuff branch from 1a67d54 to 5c7503c Compare May 26, 2022 13:47

benjeffery reviewed May 26, 2022

individual vectors

a54d029

petrelharp force-pushed the indiv_stuff branch from e8b05e0 to a54d029 Compare May 26, 2022 23:11

benjeffery added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label May 27, 2022

mergify bot merged commit bdace77 into tskit-dev:main May 27, 2022

mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label May 27, 2022

benjeffery mentioned this pull request May 27, 2022

Doc links for preorder and postorder broken #2241

Closed
