Mutation time #672

benjeffery · 2020-06-06T00:46:04Z

Fixes #513
Fixes #692

The mutation table now has a mandatory time column. For a tree sequence to be valid, these times have to be non-null, finite, older than their node, younger than their parent mutation and younger than the parent node of their node. This is all checked in tsk_table_collection_check_mutation_times, which is called by tsk_table_collection_check_integrity.
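
As a rough illustration of the constraints listed above, here is a minimal Python sketch for a single mutation. The function name, its arguments and the decision to accept equal times at the boundaries are assumptions made for this example; the real check is the C function tsk_table_collection_check_mutation_times.

```python
import math


def mutation_time_is_valid(time, node_time, parent_node_time=None,
                           parent_mutation_time=None):
    """Hypothetical helper: check one mutation time against the constraints.

    Times are measured backwards from the present, so "older" means larger.
    Boundary cases (equal times) are accepted here; that is an assumption.
    """
    if time is None or not math.isfinite(time):
        return False  # must be non-null and finite
    if time < node_time:
        return False  # must not be younger than the mutation's node
    if parent_node_time is not None and time > parent_node_time:
        return False  # must not be older than the parent of that node
    if parent_mutation_time is not None and time > parent_mutation_time:
        return False  # must not be older than the parent mutation
    return True
```

For example, a time of 0.5 on an edge from a node at time 0.0 to a parent at time 1.0 passes, while a NaN or a time of 2.0 on the same edge does not.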

There is a new function tsk_table_collection_compute_mutation_times which creates times that satisfy these constraints by spreading the mutations that share a site equally along an edge. There is also an addition to tskit upgrade that adds times.
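
The "equal spreading" idea can be sketched for one edge as follows. This is only an illustration under stated assumptions: the helper name and the exact spacing rule (k mutations dividing the edge into k + 1 equal parts) are hypothetical and are not taken from the C implementation.

```python
def spread_mutation_times(child_time, parent_time, num_mutations):
    """Hypothetical helper: evenly spaced times strictly inside one edge.

    Returns times ordered from oldest (nearest the parent node) to youngest
    (nearest the child node). The exact spacing rule is an assumption.
    """
    step = (parent_time - child_time) / (num_mutations + 1)
    return [parent_time - step * (i + 1) for i in range(num_mutations)]


# Three mutations at one site on an edge from time 0.0 up to time 4.0.
print(spread_mutation_times(0.0, 4.0, 3))  # [3.0, 2.0, 1.0]
```

Each resulting time is older than the child node and younger than the parent node, and each mutation in the chain is younger than the one before it.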

UPDATE:
Following discussion in #692 I'm changing this PR to be backwards compatible with mutation times being optional.

TODOs:

jeromekelleher · 2020-06-08T07:29:58Z

I've had a quick look over, looks exactly right to me.

codecov · 2020-06-11T23:27:15Z

Codecov Report

Merging #672 into master will increase coverage by 1.55%.
The diff coverage is 90.50%.

@@            Coverage Diff             @@
##           master     #672      +/-   ##
==========================================
+ Coverage   87.42%   88.98%   +1.55%     
==========================================
  Files          23       23              
  Lines       17963    13614    -4349     
  Branches     3575     2573    -1002     
==========================================
- Hits        15705    12115    -3590     
+ Misses       1091      837     -254     
+ Partials     1167      662     -505

Flag	Coverage Δ
#c_tests	`88.98% <90.50%> (+0.02%)`	⬆️
#python_c_tests	`?`
#python_tests	`99.00% <100.00%> (+<0.01%)`	⬆️

Impacted Files	Coverage Δ
c/tskit/tables.c	`79.19% <86.79%> (+0.13%)`	⬆️
c/tskit/trees.c	`90.83% <89.28%> (+0.05%)`	⬆️
c/tskit/core.c	`91.69% <100.00%> (+0.24%)`	⬆️
c/tskit/core.h	`100.00% <100.00%> (ø)`
python/tskit/__init__.py	`100.00% <100.00%> (ø)`
python/tskit/tables.py	`99.68% <100.00%> (+<0.01%)`	⬆️
python/tskit/trees.py	`98.24% <100.00%> (+0.02%)`	⬆️
python/tskit/util.py	`100.00% <100.00%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a9964b7...bbef335.

benjeffery · 2020-06-12T02:13:06Z

@jeromekelleher
Just the tskit upgrade code to go now, running into an issue. I need to load the tables from a kastore, bypassing validation. I'm trying something like this:

    import collections

    import kastore
    import numpy as np
    import tskit

    # src and dest are the input and output file paths
    with kastore.load(src) as store:
        data = dict(store)
    data["format/version"] = np.asarray([13, 0], dtype=np.uint32)
    # Placeholder times, replaced by compute_mutation_times() below
    data["mutations/time"] = np.full(data["mutations/node"].shape, -1, dtype=np.float64)
    table_dict = collections.defaultdict(collections.defaultdict)
    for key, value in data.items():
        if "/" in key:
            table, col = key.split("/")
            if col != "metadata_schema":
                table_dict[table][col] = value
        else:
            table_dict[key] = value
    # TODO Reinstate schemas
    tables = tskit.TableCollection.fromdict(table_dict)
    tables.compute_mutation_times()
    tables.tree_sequence().dump(dest)

However I'm getting:

  File "/home/benj/projects/tskit/python/tskit/tables.py", line 467, in set_columns
    self.ll_table.set_columns(
TypeError: Cannot cast array data from dtype('uint8') to dtype('int8') according to the rule 'safe'

It seems the Python code is using NPY_INT8 and the C code KAS_UINT8 for byte columns. This looks like an error?
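
For context on why a "safe" cast is refused here: an unsigned byte can hold values up to 255, which a signed byte cannot represent. The sketch below uses only the standard library struct module to read the same byte both ways. It illustrates the signedness mismatch and is not code from tskit or kastore.

```python
import struct

raw = b"\xff"  # a single byte with all bits set

as_unsigned = struct.unpack("B", raw)[0]  # read as uint8
as_signed = struct.unpack("b", raw)[0]    # read as int8

print(as_unsigned, as_signed)  # 255 -1
```

Because the two readings can disagree, converting between them has to be an explicit, deliberate cast rather than an automatic one.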

benjeffery · 2020-06-14T15:29:22Z

I've gotten around this for now with a cast: 9b6553d

benjeffery · 2020-06-14T15:29:48Z

Almost ready for review - let me do a line-by-line check first.

benjeffery · 2020-06-15T11:55:29Z

@jeromekelleher Ready for review. Sorry it took a while, keep finding tests to add and bits to tweak!

c/tskit/tables.c

petrelharp · 2020-06-15T16:19:32Z

c/tskit/tables.c

    }
    /* Check everything except site duplicates (which we expect) and
-     * edge indexes (which we don't use) */
+     * edge indexes and mutation timesc (which we don't use) */


Fixed in 79fb3a0

docs/data-model.rst

petrelharp · 2020-06-15T16:38:31Z

Wow, that hits a lot of places! I've read through, and it looks great. Thanks!

Note: as a next step, we should get table sorting to use the time column.

python/tests/__init__.py

jeromekelleher

Wow, this was epic, thanks @benjeffery! Overall it looks great, and the comments are pretty minor. I have a couple of substantive issues:

I'm a bit concerned about the breakage caused by the Python MutationTable.add_row. Maybe it is the right thing to do to force users to come up with a time that makes sense rather than just calling compute_mutation_times, though.
I don't think we should check the parent time in the table collection, and that should be done at tree sequence load time because it's a property of the tree topologies.

c/tests/test_trees.c

c/tskit/tables.c

jeromekelleher · 2020-06-17T08:32:13Z

c/tskit/tables.c

+    tsk_table_collection_t *self, double *TSK_UNUSED(options))
+{
+    int ret = 0;
+    const tsk_id_t *I, *O;


Might as well restrict I and O for a bit of a perf boost.

Fixed in 9ee98a7

c/tskit/tables.c

python/tests/tsutil.py

python/tskit/cli.py

python/tskit/formats.py

jeromekelleher · 2020-06-17T09:27:53Z

python/tskit/tables.py

        return headers, rows

-    def add_row(self, site, node, derived_state, parent=-1, metadata=None):
+    def add_row(self, site, node, time, derived_state, parent=-1, metadata=None):


I'd have been inclined to put this after derived_state as an optional value with default 0 to avoid some breakage. I agree this is conceptually the right place for it though. Since you've gone through the pain of updating the test suite it's probably best to keep it as it is. I doubt we'll break that much downstream code...
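
To make the breakage concrete, here is a small sketch using two stand-in functions that only mimic the old and new signatures from the diff above. They are hypothetical and return plain dicts; they are not the real MutationTable.add_row.

```python
def add_row_old(site, node, derived_state, parent=-1, metadata=None):
    # Stand-in for the signature before this change
    return {"site": site, "node": node,
            "derived_state": derived_state, "parent": parent}


def add_row_new(site, node, time, derived_state, parent=-1, metadata=None):
    # Stand-in for the signature after this change
    return {"site": site, "node": node, "time": time,
            "derived_state": derived_state, "parent": parent}


# A three-argument positional call written for the old signature now fails,
# because derived_state is missing.
try:
    add_row_new(0, 3, "T")
    broke = False
except TypeError:
    broke = True

# A four-argument positional call does not fail: the arguments shift silently.
shifted = add_row_new(0, 3, "T", 5)
print(broke, shifted["time"], shifted["derived_state"])  # True T 5
```

Calls that pass derived_state by keyword are unaffected by the reordering, but still need a time under the new signature.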

python/tskit/tables.py

benjeffery · 2020-06-17T10:34:10Z

I'm a bit concerned about the breakage caused by the Python MutationTable.add_row. Maybe it is the right thing to do to force users to come up with a time that makes sense rather than just calling compute_mutation_times, though.

My assumption is that there will be stats methods etc. that rely on mutation times so the actual change isn't "we added a parameter to add_row" but "we added a property that needs to make sense" so breakage is desired to bring people's attention to it. Using compute_mutation_times needs to be an active choice.

I don't think we should check the parent time in the table collection, and that should be done at tree sequence load time because it's a property of the tree topologies.

I also wasn't sure about this as it felt wrong to need to build indexes to check a table collection, but I wasn't clear on the distinction between "valid tables" and "valid tree sequence".

jeromekelleher · 2020-06-17T11:48:27Z

Using compute_mutation_times needs to be an active choice.

I agree, you've made the right choice here.

c/tskit/tables.c

benjeffery · 2020-06-19T12:54:56Z

Ok, I think all comments are addressed.
Tests will pass once this is rebased onto #687, but I don't want to invalidate all the comments till they are resolved.

jeromekelleher

Looks good - need to be careful with NaNs though, they're tricky!

c/tskit/tables.c

c/tskit/core.c

benjeffery · 2020-07-01T20:54:51Z

@jeromekelleher In a lot of places we check for table equality, and this will fail for mutations that have the default value. My current plan is to override equality methods to make NAN==NAN; we'll see how that goes.

benjeffery · 2020-07-01T23:48:45Z

This then gets more complex: we'd like two mutations that have "default" time to be equal. It would be simplest to keep using memcmp for table equality, so the C and Python default times both need to be the same bit field. Thankfully it seems that (bitfield-wise) math.nan == np.nan == C NAN. For the record this is 011111111111100000...zeros...0, which is hard-coded in both Python and numpy (it's the same value even though the hex is different, as it is cast to double afterwards). The C99 spec only specifies that the NAN macro is a quiet NaN, but we can assert that it matches.

Some Python operations generate NANs with a different sign bit to math.nan, but I don't think that is a problem as we don't want those NANs to be equal. The operations that generate NANs with the same sign bit are problematic: they will appear equal to the default, so will be valid, but are likely due to error. Ensuring there are no NANs before times are used will be essential.

So the way forward is to fill the C array with NAN and not 0xFF by default.
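
The difference between floating-point equality and memcmp-style equality can be seen with a short standard-library sketch. The helper name float_bits is hypothetical, and packing with struct stands in for comparing raw bytes in C.

```python
import math
import struct


def float_bits(value):
    """Return the 8 bytes of a Python float in big-endian order."""
    return struct.pack(">d", value)


a = math.nan
b = math.copysign(math.nan, -1.0)  # same NaN, but with the sign bit set

print(a == a)                          # False: NaN never equals itself
print(float_bits(a) == float_bits(a))  # True: the bytes are identical
print(float_bits(a) == float_bits(b))  # False: the sign bit differs
```

A bytewise comparison therefore treats two NaNs as equal only when they are exactly the same NaN, which is the behaviour discussed above.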

jeromekelleher · 2020-07-02T07:38:21Z

So the way forward is to fill the C array with NAN and not 0xFF by default.

Good, agreed. I think we can be pragmatic here and use memcmp when checking for equality. It's fine if we say the tables are only equal if they both contain exactly the same NAN.

The Python default times can be set in the Python C API, so we should be able to use the same macro everywhere to set the default value. Come to think of it, it'd be better if we defined our own NaN value for TSK_MUTATION_TIME_UNKNOWN so that we don't get hit by different platforms using different bit values for the NAN macro. We would want a set of tables generated on Windows to be the same as one generated on Linux.

jeromekelleher · 2020-07-02T07:41:04Z

We could be even stricter actually, and only consider this one specific NaN, TSK_MUTATION_TIME_UNKNOWN, as acceptable input, with any other NaN throwing an error. We'd probably need to do some fiddling like making a union so that we can compare the actual bits, but it should work.

What do you think? It would solve some problems.

benjeffery · 2020-07-02T08:40:51Z

I think that's right - to have a macro with a specific NaN. The fraction bits are free to use and not set usually - we could define our own bit pattern for TSK_MUTATION_TIME_UNKNOWN so that adding a row with np.nan or any other NaN would fail?

jeromekelleher · 2020-07-02T09:14:56Z

I think that's right - to have a macro with a specific NaN. The fraction bits are free to use and not set usually - we could define our own bit pattern for TSK_MUTATION_TIME_UNKNOWN so that adding a row with np.nan or any other NaN would fail?

That seems like a good idea. No room for ambiguity then, but NaNs will still propagate in a sensible way if people try to do calculations on them directly. Excellent.

benjeffery · 2020-07-02T11:13:44Z

Cool, we'll go with:

>>> struct.unpack(">d", b'\x7f\xf8tskit!')
(nan,)
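
A minimal sketch of how such a marker could be recognised from Python follows. It assumes the bit pattern quoted above and an IEEE 754 platform; the constant and function names are chosen for this example, and the code is not the library's implementation.

```python
import math
import struct

# Bit pattern quoted in the thread: a quiet NaN whose payload spells "tskit!".
UNKNOWN_TIME_BYTES = b"\x7f\xf8tskit!"
UNKNOWN_TIME = struct.unpack(">d", UNKNOWN_TIME_BYTES)[0]


def is_unknown_time(value):
    """Hypothetical helper: True only for the exact marker bit pattern."""
    return struct.pack(">d", value) == UNKNOWN_TIME_BYTES


print(math.isnan(UNKNOWN_TIME))       # True
print(is_unknown_time(UNKNOWN_TIME))  # True
print(is_unknown_time(math.nan))      # False
print(is_unknown_time(0.0))           # False
```

The marker still behaves like any other NaN in arithmetic, but a bitwise check can tell it apart from a NaN produced by a faulty computation.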

benjeffery · 2020-07-03T08:35:22Z

@jeromekelleher ok, I think this is ready for review. We have Python mutation equality methods so I needed to bring the default value through to Python. I considered replacing the NAN with None at the CPython layer, but think that's potentially confusing and bad perf.

jeromekelleher · 2020-07-03T12:12:05Z

Great, thanks @benjeffery. I'll go through ASAP.

jeromekelleher

Looks great, thanks @benjeffery! I've gone through it in detail, with a few comments. I think the main thing is that we might as well use TSK_UNKNOWN_TIME, is_unknown_time, etc.

When squashing, it would be ideal if we could split this into at least 2 commits - one where we made this mandatory, and the other where we added the unknown_time concept. It'll be useful to keep the process by which we arrived at this point in the history.

c/CHANGELOG.rst

c/tskit/core.c

jeromekelleher · 2020-07-03T13:13:43Z

c/tskit/core.h

+/* We define a specific NAN value for default mutation time which indicates
+ * the time is unknown. We use a specific value so that if mutation time is set to
+ * a NAN from a computation we can reject it.
+ */


We should note here where we got the bit pattern from - people will wonder where we plucked this specific value from otherwise.

Fixed in f42d908

c/tskit/core.h

jeromekelleher · 2020-07-03T13:25:18Z

c/tskit/core.h

+    return nan_union.i == TSK_MUTATION_UNKNOWN_TIME_HEX;
+}
+#define TSK_MUTATION_UNKNOWN_TIME __tsk_nan_f()
+#define TSK_IS_MUTATION_UNKNOWN_TIME(val) __tsk_is_nan_f(val)


Should we just make this a function altogether? is_mutation_unknown_time(val) seems a bit more natural. Or, to put it another way, is there any particular reason for making this a function-like macro rather than just a function?

I think I wanted to keep it all together - but yes no need to be a macro. Fixed in 9b742c8

python/CHANGELOG.rst

python/tests/test_util.py

python/tskit/tables.py

python/tskit/trees.py

benjeffery · 2020-07-04T00:08:38Z

@jeromekelleher Addressed comments. When I squash I'll keep the first commit separate as that was the initial proposal.

jeromekelleher · 2020-07-06T12:24:57Z

LGTM, let's merge!

benjeffery · 2020-07-06T12:32:14Z

Squashed to 2 commits!

jeromekelleher · 2020-07-06T12:37:14Z

Let's remember how hard this was the next time we start talking about adding new columns to the tables!

jeromekelleher mentioned this pull request Jun 9, 2020

Change MutationTable.node to MutationTable.edge? #668

Closed

benjeffery marked this pull request as ready for review June 15, 2020 11:53

petrelharp reviewed Jun 15, 2020

c/tskit/tables.c (outdated)

docs/data-model.rst (outdated)

docs/data-model.rst (outdated)

python/tests/__init__.py (outdated)

jeromekelleher reviewed Jun 17, 2020

jeromekelleher reviewed Jun 18, 2020

c/tskit/tables.c (outdated)

benjeffery mentioned this pull request Jun 29, 2020

Mutation time - backwards compatibility #692

Closed

benjeffery force-pushed the mutation_time branch from bbf936f to cc28131 Compare June 30, 2020 23:43

jeromekelleher reviewed Jul 1, 2020

c/tskit/tables.c (outdated)

c/tskit/core.c (outdated)

jeromekelleher reviewed Jul 3, 2020

jeromekelleher approved these changes Jul 6, 2020

benjeffery force-pushed the mutation_time branch from bc4041b to 83dcaf3 Compare July 6, 2020 12:30

jeromekelleher added the AUTOMERGE-REQUESTED label Jul 6, 2020

benjeffery added 2 commits July 6, 2020 12:52

Add mutation time

5708a04

Make mutation time an optional column by defaulting to a specific NAN value.

bbef335

benjeffery force-pushed the mutation_time branch from 83dcaf3 to bbef335 Compare July 6, 2020 12:52

mergify bot merged commit 1374dbb into tskit-dev:master Jul 6, 2020

mergify bot removed the AUTOMERGE-REQUESTED label Jul 6, 2020

benjeffery deleted the mutation_time branch July 10, 2020 23:41
