Skip to content

Conversation

@benjeffery
Copy link
Member

@benjeffery benjeffery commented Jun 6, 2020

Fixes #513
Fixes #692

The mutation table now has a mandatory time column. For a tree sequence to be valid, these times have to be non-null, finite, older than their node, younger than parent mutation and younger than the parent node of their node, this is all checked in tsk_table_collection_check_mutation_times which is called by tsk_table_collection_check_integrity.

There is a new function tsk_table_collection_compute_mutation_times which creates times that satisfy these constraints by equally spreading out mutations along an edge that share a site. Along with a tskit upgrade addition to add times.

UPDATE:
Following discussion in #692 I'm changing this PR to be backwards compatible with mutation times being optional.

TODOs:

  • C tests for new functions
  • C docs for new functions
  • Python tests for new functions
  • Python docs for new functions
  • Add mutation.time to C Python
  • Add compute_mutation_times to C Python
  • Add mutation.time to Python
  • Fix up python tests somehow (monkey patch msprime to call compute_mutation_times as a temporary measure?)
  • Add random spreading option to compute_mutation_times Will move to follow-up PR.
  • File format version number bump
  • tskit upgrade implementation and tests.
  • Specifc NAN as default
  • file test with missing column
  • test with mixed nans and values
  • test that NAN macro isnan in python
  • test no times in dict repr
  • test for __eq__ __neq__ on Mutation class

@jeromekelleher
Copy link
Member

I've had a quick look over, looks exactly right to me.

@codecov
Copy link

codecov bot commented Jun 11, 2020

Codecov Report

Merging #672 into master will increase coverage by 1.55%.
The diff coverage is 90.50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #672      +/-   ##
==========================================
+ Coverage   87.42%   88.98%   +1.55%     
==========================================
  Files          23       23              
  Lines       17963    13614    -4349     
  Branches     3575     2573    -1002     
==========================================
- Hits        15705    12115    -3590     
+ Misses       1091      837     -254     
+ Partials     1167      662     -505     
Flag Coverage Δ
#c_tests 88.98% <90.50%> (+0.02%) ⬆️
#python_c_tests ?
#python_tests 99.00% <100.00%> (+<0.01%) ⬆️
Impacted Files Coverage Δ
c/tskit/tables.c 79.19% <86.79%> (+0.13%) ⬆️
c/tskit/trees.c 90.83% <89.28%> (+0.05%) ⬆️
c/tskit/core.c 91.69% <100.00%> (+0.24%) ⬆️
c/tskit/core.h 100.00% <100.00%> (ø)
python/tskit/__init__.py 100.00% <100.00%> (ø)
python/tskit/tables.py 99.68% <100.00%> (+<0.01%) ⬆️
python/tskit/trees.py 98.24% <100.00%> (+0.02%) ⬆️
python/tskit/util.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a9964b7...bbef335. Read the comment docs.

@benjeffery
Copy link
Member Author

benjeffery commented Jun 12, 2020

@jeromekelleher
Just the tskit upgrade code to go now, running into an issue. I need to load the tables from a kastore, bypassing validation. I'm trying something like this:

    with kastore.load(src) as store:
        data = dict(store)
    data["format/version"] = numpy.asarray([13, 0], dtype=numpy.uint32)
    data["mutations/time"] = np.full(data["mutations/node"].shape, -1, dtype=np.float64)
    table_dict = collections.defaultdict(collections.defaultdict)
    for key, value in data.items():
        if "/" in key:
            table, col = key.split("/")
            if col != "metadata_schema":
                table_dict[table][col] = value
        else:
            table_dict[key] = value
    # TODO Reinstate schemas
    tables = tskit.TableCollection.fromdict(table_dict)
    tables.compute_mutation_times()
    tables.tree_sequence().dump(dest)

However I'm getting:

  File "/home/benj/projects/tskit/python/tskit/tables.py", line 467, in set_columns
    self.ll_table.set_columns(
TypeError: Cannot cast array data from dtype('uint8') to dtype('int8') according to the rule 'safe'

As it seems the python code is using NPY_INT8 and the C code KAS_UINT8 for byte columns. This looks like an error?

@benjeffery
Copy link
Member Author

I've gotten around this for now with a cast: 9b6553d

@benjeffery
Copy link
Member Author

Almost ready for review - let me do a line-by-line check first.

@benjeffery benjeffery marked this pull request as ready for review June 15, 2020 11:53
@benjeffery
Copy link
Member Author

@jeromekelleher Ready for review. Sorry it took a while, keep finding tests to add and bits to tweak!

c/tskit/tables.c Outdated
}
/* Check everything except site duplicates (which we expect) and
* edge indexes (which we don't use) */
* edge indexes and mutation timesc (which we don't use) */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"timesc"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 79fb3a0

@petrelharp
Copy link
Contributor

Wow, that hits a lot of places! I've read through, and it looks great. Thanks!

Note: we should next ask sort tables to use the time column.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this was epic, thanks @benjeffery! Overall it looks great, and the comments are pretty minor. I have a couple of substantive issues:

  1. I'm a bit concerned about the breakage caused by the Python MutationTable.add_row. Maybe it is the right thing to do to force users to come up with a time that makes sense rather than just calling compute_mutation_times, though.
  2. I don't think we should check the parent time in the table collection, and that should be done at tree sequence load time because it's a property of the tree topologies.

c/tskit/tables.c Outdated
tsk_table_collection_t *self, double *TSK_UNUSED(options))
{
int ret = 0;
const tsk_id_t *I, *O;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well restrict I and O for a bit of a perf boost.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 9ee98a7

return headers, rows

def add_row(self, site, node, derived_state, parent=-1, metadata=None):
def add_row(self, site, node, time, derived_state, parent=-1, metadata=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have been inclined to put this after derived_state as an optional value with default 0 to avoid some breakage. I agree this is conceptually the right place for it though. Since you've gone through the pain of updating the test suite it's probably best to keep it as it is. I doubt we'll break that much downstream code...

@benjeffery
Copy link
Member Author

  1. I'm a bit concerned about the breakage caused by the Python MutationTable.add_row. Maybe it is the right thing to do to force users to come up with a time that makes sense rather than just calling compute_mutation_times, though.

My assumption is that there will be stats methods etc. that rely on mutation times so the actual change isn't "we added a parameter to add_row" but "we added a property that needs to make sense" so breakage is desired to bring people's attention to it. Using compute_mutation_times needs to be an active choice.

  1. I don't think we should check the parent time in the table collection, and that should be done at tree sequence load time because it's a property of the tree topologies.

I also wasn't sure about this as it felt wrong to need to build indexes to check a table collection, but I wasn't clear on the distinction between "valid tables" and "valid tree sequence".

@jeromekelleher
Copy link
Member

Using compute_mutation_times needs to be an active choice.

I agree, you've made the right choice here.

@benjeffery
Copy link
Member Author

Ok, I think all comments are addressed.
Tests will pass once this is rebased onto #687, but I don't want to invalidate all the comments till they are resolved.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - need to be careful with NaNs though, they're tricky!

@benjeffery
Copy link
Member Author

@jeromekelleher In a lot of places we check for table equality, this will fail for mutations that have the default value. My current plan is to override equality methods to make NAN==NAN, we'll see how that goes.

@benjeffery
Copy link
Member Author

This then gets more complex - we'd like two mutations that have "default" time to be equal. It would be simplest to keep using memcmp for table equality so the C and Python default times both need to be the same bit field. Thankfully it seems that (bitfield-wise) math.nan == np.nan == C NAN. For the record this is 011111111111100000...zeros...0 which for python is hard-coded here for numpy hard-coded here (it's the same value even though the hex is different as it is cast to double after). The C99 spec only specifies that the NAN macro is a quiet nan, but we can assert that it matches.

Some python operations generate NANs with a different sign bit to math.nan but I don't think that is a problem as we don't want those NANs to be equal. The operations that generate NANs with the same sign bit are problematic as they will appear equal to the default so will be valid but are likely due to error, ensuring no NANs before times are used will be essential.

So the way forward is to fill the C array with NAN and not 0xFF by default.

@jeromekelleher
Copy link
Member

So the way forward is to fill the C array with NAN and not 0xFF by default.

Good, agreed. I think we can be pragmatic here and use memcmp when checking for equality. It's fine if we say the tables are only equal if they both contain exactly the same NAN.

The Python default times can be set in the Python C API, so we should be able to use the same macro everywhere to set the default value. Come to think of it, it'd be better if we defined our own NaN value for TSK_MUTATION_TIME_UNKNOWN so that we don't get hit by different platforms using different bit values for the NAN macro. We would want a set of tables generated on Windows to be the same as one generated on Linux.

@jeromekelleher
Copy link
Member

We could be even stricter actually, and only consider this one specific NaN, TSK_MUTATION_TIME_UNKNOWN, as acceptable input, and any other NaN throws an error. We'd probably need to do some fiddling like making a union to that we can compare the actual bits, but it should work.

What do you think? It would solve some problems.

@benjeffery
Copy link
Member Author

I think that's right - to have a macro with a specific NaN. The fraction bits are free to use and not set usually - we could define our own bit pattern for TSK_MUTATION_TIME_UNKNOWN so that adding a row with np.nan or any other NaN would fail?

@jeromekelleher
Copy link
Member

I think that's right - to have a macro with a specific NaN. The fraction bits are free to use and not set usually - we could define our own bit pattern for TSK_MUTATION_TIME_UNKNOWN so that adding a row with np.nan or any other NaN would fail?

That seems like a good idea. No room for ambiguity then, but NaNs will still propagate in a sensible way if people try to do calculations on them directly. Excellent.

@benjeffery
Copy link
Member Author

Cool, we'll go with:

>>> struct.unpack(">d", b'\x7f\xf8tskit!')
(nan,)

@benjeffery
Copy link
Member Author

@jeromekelleher ok, I think this is ready for review. We have Python mutation equality methods so I needed to bring the default value through to Python. I considered replacing the NAN with None at the CPython layer, but think that's potentially confusing and bad perf.

@jeromekelleher
Copy link
Member

Great, thanks @benjeffery. I'll go through ASAP.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @benjeffery! I've gone through it in detail, with a few comments. I think the main thing is that we might as well use TSK_UNKNOWN_TIME, is_unknown_time, etc.

When squashing, it would be ideal if we could split this into at least 2 commits - one where we made this mandatory, and the other where we added the unknown_time concept. It'll be useful to keep the process by which we arrived at this point in the history.

/* We define a specific NAN value for default mutation time which indicates
* the time is unknown. We use a specific value so that if mutation time is set to
* a NAN from a computation we can reject it.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should note here where we got the bit pattern from - people will wonder where we plucked this specific value from otherwise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed f42d908

c/tskit/core.h Outdated
return nan_union.i == TSK_MUTATION_UNKNOWN_TIME_HEX;
}
#define TSK_MUTATION_UNKNOWN_TIME __tsk_nan_f()
#define TSK_IS_MUTATION_UNKNOWN_TIME(val) __tsk_is_nan_f(val)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just make this a function altogether? is_mutation_unknown_time(val) seems a bit more natural. Or, to put it another way, is there any particular reason for making this a function-like macro rather than just function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I wanted to keep it all together - but yes no need to be a macro. Fixed in 9b742c8

@benjeffery
Copy link
Member Author

@jeromekelleher Addressed comments. When I squash I'll keep the first commit separate as that was the initial proposal.

@jeromekelleher
Copy link
Member

LGTM, let's merge!

@benjeffery
Copy link
Member Author

unnamed

Squashed to 2 commits!

@jeromekelleher
Copy link
Member

Let's remember how hard this was the next time we start talking about adding new columns to the tables!

@mergify mergify bot merged commit 1374dbb into tskit-dev:master Jul 6, 2020
@benjeffery benjeffery deleted the mutation_time branch July 10, 2020 23:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Mutation time - backwards compatibility Add MutationTable.time column

3 participants