Mutation time #672
|
I've had a quick look over, looks exactly right to me. |
Codecov Report
@@ Coverage Diff @@
## master #672 +/- ##
==========================================
+ Coverage 87.42% 88.98% +1.55%
==========================================
Files 23 23
Lines 17963 13614 -4349
Branches 3575 2573 -1002
==========================================
- Hits 15705 12115 -3590
+ Misses 1091 837 -254
+ Partials 1167 662 -505
Continue to review full report at Codecov.
|
|
@jeromekelleher

    import collections

    import kastore
    import numpy as np
    import tskit

    # Load every column from the old file into a plain dict.
    with kastore.load(src) as store:
        data = dict(store)
    data["format/version"] = np.asarray([13, 0], dtype=np.uint32)
    # Placeholder times; real values are computed further down.
    data["mutations/time"] = np.full(data["mutations/node"].shape, -1, dtype=np.float64)
    # Regroup the flat "table/column" keys into one dict per table.
    table_dict = collections.defaultdict(collections.defaultdict)
    for key, value in data.items():
        if "/" in key:
            table, col = key.split("/")
            if col != "metadata_schema":
                table_dict[table][col] = value
        else:
            table_dict[key] = value
    # TODO Reinstate schemas
    tables = tskit.TableCollection.fromdict(table_dict)
    tables.compute_mutation_times()
    tables.tree_sequence().dump(dest)

However I'm getting: As it seems the python code is using |
|
I've gotten around this for now with a cast: 9b6553d |
|
Almost ready for review - let me do a line-by-line check first. |
|
@jeromekelleher Ready for review. Sorry it took a while, keep finding tests to add and bits to tweak! |
c/tskit/tables.c
Outdated
      }
      /* Check everything except site duplicates (which we expect) and
    -  * edge indexes (which we don't use) */
    +  * edge indexes and mutation timesc (which we don't use) */
"timesc"
Fixed in 79fb3a0
|
Wow, that hits a lot of places! I've read through, and it looks great. Thanks! Note: next, we should make table sorting use the time column. |
jeromekelleher
left a comment
Wow, this was epic, thanks @benjeffery! Overall it looks great, and the comments are pretty minor. I have a couple of substantive issues:
- I'm a bit concerned about the breakage caused by the Python MutationTable.add_row. Maybe it is the right thing to do to force users to come up with a time that makes sense rather than just calling compute_mutation_times, though.
- I don't think we should check the parent time in the table collection; that should be done at tree sequence load time because it's a property of the tree topologies.
c/tskit/tables.c
Outdated
        tsk_table_collection_t *self, double *TSK_UNUSED(options))
    {
        int ret = 0;
        const tsk_id_t *I, *O;
Might as well restrict I and O for a bit of a perf boost.
Fixed in 9ee98a7
python/tskit/tables.py
Outdated
      return headers, rows

    - def add_row(self, site, node, derived_state, parent=-1, metadata=None):
    + def add_row(self, site, node, time, derived_state, parent=-1, metadata=None):
I'd have been inclined to put this after derived_state as an optional value with default 0 to avoid some breakage. I agree this is conceptually the right place for it though. Since you've gone through the pain of updating the test suite it's probably best to keep it as it is. I doubt we'll break that much downstream code...
My assumption is that there will be stats methods etc. that rely on mutation times so the actual change isn't "we added a parameter to
I also wasn't sure about this as it felt wrong to need to build indexes to check a table collection, but I wasn't clear on the distinction between "valid tables" and "valid tree sequence". |
I agree, you've made the right choice here. |
|
Ok, I think all comments are addressed. |
Force-pushed from bbf936f to cc28131.
jeromekelleher
left a comment
Looks good - need to be careful with NaNs though, they're tricky!
|
@jeromekelleher In a lot of places we check for table equality, this will fail for mutations that have the default value. My current plan is to override equality methods to make NAN==NAN, we'll see how that goes. |
|
This then gets more complex - we'd like two mutations that have "default" time to be equal. It would be simplest to keep using Some python operations generate NANs with a different sign bit to So the way forward is to fill the C array with |
Good, agreed. I think we can be pragmatic here and use memcmp when checking for equality. It's fine if we say the tables are only equal if they both contain exactly the same NAN. The Python default times can be set in the Python C API, so we should be able to use the same macro everywhere to set the default value. Come to think of it, it'd be better if we defined our own NaN value for TSK_MUTATION_TIME_UNKNOWN so that we don't get hit by different platforms using different bit values for the NAN macro. We would want a set of tables generated on Windows to be the same as one generated on Linux. |
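As an illustration of the comparison issue discussed here, the following is a minimal Python sketch (not the C code in this PR). It shows why value comparison fails for columns that contain NaN, and how a byte-wise comparison, the Python analogue of memcmp, treats two identical NaNs as equal while still distinguishing a NaN with a different sign bit. The helper name `columns_equal_bitwise` is hypothetical.

```python
import struct
from array import array


def columns_equal_bitwise(a, b):
    """Compare two float64 columns byte for byte, as memcmp would in C."""
    return array("d", a).tobytes() == array("d", b).tobytes()


# Two quiet NaNs that differ only in the sign bit.
pos_nan = struct.unpack(">d", b"\x7f\xf8\x00\x00\x00\x00\x00\x00")[0]
neg_nan = struct.unpack(">d", b"\xff\xf8\x00\x00\x00\x00\x00\x00")[0]

times_a = [1.0, pos_nan]
times_b = [1.0, pos_nan]

# Element-wise value comparison fails because NaN != NaN.
value_equal = all(x == y for x, y in zip(times_a, times_b))

# Byte-wise comparison succeeds when both columns hold exactly the same NaN.
bitwise_equal = columns_equal_bitwise(times_a, times_b)

# A NaN with a different bit pattern is not considered equal.
flipped_equal = columns_equal_bitwise([1.0, pos_nan], [1.0, neg_nan])
```

Under this scheme two tables are equal only if their default times share one bit pattern, which is the motivation for a single fixed value.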
|
We could be even stricter actually, and only consider this one specific NaN, TSK_MUTATION_TIME_UNKNOWN, as acceptable input, and any other NaN throws an error. We'd probably need to do some fiddling like making a union so that we can compare the actual bits, but it should work. What do you think? It would solve some problems. |
|
I think that's right - to have a macro with a specific NaN. The fraction bits are free to use and not set usually - we could define our own bit pattern for TSK_MUTATION_TIME_UNKNOWN so that adding a row with np.nan or any other NaN would fail? |
That seems like a good idea. No room for ambiguity then, but NaNs will still propagate in a sensible way if people try to do calculations on them directly. Excellent. |
|
Cool, we'll go with:

    >>> struct.unpack(">d", b'\x7f\xf8tskit!')
    (nan,) |
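A minimal Python sketch of how this bit pattern could act as the single accepted "unknown time" value, assuming IEEE 754 doubles. The names `UNKNOWN_TIME`, `is_unknown_time` and `validate_time` are illustrative only and are not taken from this PR's code; the check compares raw bytes, which plays the role of the union trick suggested for the C side.

```python
import math
import struct

# Bit pattern from the discussion: 0x7ff8 marks a quiet NaN and the remaining
# six bytes spell "tskit!" in ASCII.
UNKNOWN_TIME_BYTES = b"\x7f\xf8tskit!"
UNKNOWN_TIME = struct.unpack(">d", UNKNOWN_TIME_BYTES)[0]


def is_unknown_time(value):
    """Return True only for the one NaN that carries the chosen payload."""
    return struct.pack(">d", value) == UNKNOWN_TIME_BYTES


def validate_time(value):
    """Hypothetical input check: accept finite times or the sentinel only."""
    if is_unknown_time(value):
        return value
    if not math.isfinite(value):
        raise ValueError("time must be finite or the unknown-time sentinel")
    return value


# The sentinel is still a NaN, so arithmetic on it propagates NaN as usual.
propagated = UNKNOWN_TIME + 1.0
```

With this check, `validate_time(float("nan"))` raises, while `validate_time(UNKNOWN_TIME)` passes, so a NaN produced by a faulty computation cannot be mistaken for a deliberately unknown time.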
|
@jeromekelleher ok, I think this is ready for review. We have Python mutation equality methods so I needed to bring the default value through to Python. I considered replacing the NAN with None at the CPython layer, but think that's potentially confusing and bad perf. |
|
Great, thanks @benjeffery. I'll go through ASAP. |
jeromekelleher
left a comment
Looks great, thanks @benjeffery! I've gone through it in detail, with a few comments. I think the main thing is that we might as well use TSK_UNKNOWN_TIME, is_unknown_time, etc.
When squashing, it would be ideal if we could split this into at least 2 commits - one where we made this mandatory, and the other where we added the unknown_time concept. It'll be useful to keep the process by which we arrived at this point in the history.
    /* We define a specific NAN value for default mutation time which indicates
     * the time is unknown. We use a specific value so that if mutation time is set to
     * a NAN from a computation we can reject it.
     */
We should note here where we got the bit pattern from - people will wonder where we plucked this specific value from otherwise.
Fixed f42d908
c/tskit/core.h
Outdated
        return nan_union.i == TSK_MUTATION_UNKNOWN_TIME_HEX;
    }
    #define TSK_MUTATION_UNKNOWN_TIME __tsk_nan_f()
    #define TSK_IS_MUTATION_UNKNOWN_TIME(val) __tsk_is_nan_f(val)
Should we just make this a function altogether? is_mutation_unknown_time(val) seems a bit more natural. Or, to put it another way, is there any particular reason for making this a function-like macro rather than just function?
I think I wanted to keep it all together - but yes no need to be a macro. Fixed in 9b742c8
|
@jeromekelleher Addressed comments. When I squash I'll keep the first commit separate as that was the initial proposal. |
|
LGTM, let's merge! |
|
Let's remember how hard this was the next time we start talking about adding new columns to the tables! |

Fixes #513
Fixes #692
The mutation table now has a mandatory time column. For a tree sequence to be valid, these times have to be non-null, finite, older than their node, younger than their parent mutation and younger than the parent node of their node. This is all checked in tsk_table_collection_check_mutation_times, which is called by tsk_table_collection_check_integrity.
There is a new function tsk_table_collection_compute_mutation_times, which creates times that satisfy these constraints by equally spreading out mutations that share a site along an edge, along with a tskit upgrade addition to add times.
UPDATE:
Following discussion in #692 I'm changing this PR to be backwards compatible with mutation times being optional.
TODOs:
- compute_mutation_times to C Python
- msprime to call compute_mutation_times as a temporary measure?)
- Add random spreading option to compute_mutation_times. Will move to follow-up PR.
- tskit upgrade implementation and tests.
- isnan in python
- __eq__ __neq__ on Mutation class
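To illustrate the time constraints and the equal-spreading idea from the description, here is a minimal Python sketch. It is not the C implementation: the function names are hypothetical, the exact spacing formula is an assumption, and the bounds are treated as inclusive in the check. Times are measured backwards, so "older" means a larger value.

```python
import math


def spread_mutation_times(node_time, parent_node_time, count):
    """Place `count` mutations at evenly spaced times strictly inside an edge.

    Returns times ordered oldest first, so each mutation is younger than the
    one before it (its parent mutation in this sketch).
    """
    if not (math.isfinite(node_time) and math.isfinite(parent_node_time)):
        raise ValueError("node times must be finite")
    if parent_node_time <= node_time:
        raise ValueError("parent node must be older than the child node")
    step = (parent_node_time - node_time) / (count + 1)
    return [parent_node_time - step * (i + 1) for i in range(count)]


def times_are_valid(times, node_time, parent_node_time):
    """Check one chain of mutations on an edge, ordered parent mutation first."""
    if not all(math.isfinite(t) for t in times):
        return False
    if not all(node_time <= t <= parent_node_time for t in times):
        return False
    # Each mutation must be no older than its parent mutation.
    return all(older >= younger for older, younger in zip(times, times[1:]))


# Three mutations at one site on an edge from a node at time 0 to a parent at time 4.
times = spread_mutation_times(node_time=0.0, parent_node_time=4.0, count=3)
```

Here `times` comes out as `[3.0, 2.0, 1.0]`, which `times_are_valid(times, 0.0, 4.0)` accepts.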