Decapitate #2240
Codecov Report
@@ Coverage Diff @@
## main #2240 +/- ##
==========================================
- Coverage 93.33% 93.28% -0.06%
==========================================
Files 27 27
Lines 26141 26347 +206
Branches 1175 1178 +3
==========================================
+ Hits 24398 24577 +179
- Misses 1713 1740 +27
Partials 30 30
|
|
Oh my gosh, it's beautiful! That's so very nice and clean.
Consider a mutation with time >
Ok - I think we should remove the mutations above |
benjeffery
left a comment
Very nice.
What happens if a time is specified that is younger than any of the nodes? I guess you get an empty edge table? Would be good to test that case.
Also I am thinking about the implications of methods like this leaving nodes and mutations without edges. I think it is the right thing to do as you don't end up losing samples - they just become missing, but do we know that all the other tskit routines don't make assumptions about mutations being on a valid edge? I assume they don't but worth thinking about.
|
A bit more about leaving nodes without edges - this will be very convenient because if there are samples above |
|
OK, I've added support for mutations and I think I've covered all the quirky corner cases. If this definition looks OK to you @petrelharp I'll code things up in numpy and add some tests where we run this over bigger examples and I think we're done. |
89df1cf to
ab6f86d
Compare
|
This is ready for a full review now. There were lots of fiddly details to deal with, but hopefully good to go now. I ended up implementing in C, as we still don't have nice ways of handling full table updates in Python and it seemed like this would be a useful operation to have in C at some point anyway. |
|
Looks great!! See comments. |
ab6f86d to
e3d01f4
Compare
|
Something we should consider actually @petrelharp - what if we didn't bother splitting the edges, and just deleted any edges that intersect with The only downside I guess is if you want to throw mutations on to a decapitated tree sequence afterwards, then you'll lose the bits of edges that you would have had to put mutations on. However, if we had a "census" operation, which split edges that intersect with Actually, the more I think about this the more convinced I am this is the right way to do it, so going to mark this PR back as draft. |
|
In general I think that operations that add rows to tables need to have an argument which is the metadata to insert for those new elements. In C this would be bytes of course, but at the Python level, as part of the method, we can check that this metadata satisfies the schema. |
Actually, there is a clear downside in that edges over samples would be deleted, so that unless there are mutations directly over them before So, the question is do we want to implement

```python
def decapitate(self, time, *, node_metadata=None, preserve_mutational_state=True):
    self.split_edges(time, metadata=node_metadata)
    self.delete_older(time, preserve_mutational_state=preserve_mutational_state)


def split_edges(self, time, *, flags=None, metadata=None, population=None):
    """
    Replace any edge ``(l, r, parent, child)`` in which ``node_time[child] < time < node_time[parent]`` with two
    edges ``(l, r, parent, u)`` and ``(l, r, u, child)``, where ``u`` is a newly added node for each intersecting edge.

    If ``metadata`` or ``population`` are specified, newly added nodes will be assigned these values. Otherwise,
    default values will be used. If a metadata schema is defined for the node table, the empty dictionary will be
    used by default for the new node when calling :meth:`.NodeTable.add_row`; otherwise, an empty byte string.
    The population value for the new node will be derived from the population of the edge's child, and any
    intersecting migrations. [Details of how we decide migrations intersect with the edge]. Any migrations that
    intersect with the edge that are older than ``time`` will have the ``node`` value set to ``u``, the newly added
    node.

    Newly added nodes will have a ``flags`` value of 0, if not specified.

    Any mutations lying on the edge whose time is >= ``time`` will have their node value set to ``u``. Note that
    the time of the mutation is defined as the time of the child node if the mutation's time is unknown.
    """
    if metadata is None:
        metadata = b""  # Placeholder: figure out default metadata value in bytes
    # Implement. Communicating the defaults down to C will be tricky, I guess we'd have to use some function
    # option flags to do that.
    self._ll_tables.split_edges(...)


def delete_older(self, time, *, preserve_mutational_state=True):
    """
    Delete information from the data model that has a time >= the specified value. For the purposes of this
    method, the time of an edge is defined as the time of the child node. The time of a mutation is defined
    as the time of its node, if it is marked as unknown. The time of a migration is its ``time`` value.

    To avoid changing IDs, nodes are *not* deleted. If you wish to remove redundant nodes, please use
    simplify on the result of this operation.

    If ``preserve_mutational_state`` is True, then the state inherited by samples at each site is guaranteed to
    be the same before and after this operation. This is done by inserting new mutations [not sure how actually].
    If ``preserve_mutational_state`` is False, then the state that samples inherit can change arbitrarily, including
    samples being marked as missing data.
    """
    # Easy enough to implement, except for ``preserve_mutational_state`` bit.
```

Hmm, this is tricky. I think the `split_edges If we do want |
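To make the `split_edges` definition in the docstring above concrete, here is a minimal plain-Python sketch on a toy model. It is not the tskit implementation: the `Edge` and `Mutation` classes and the list-based "tables" are hypothetical stand-ins, `None` stands in for an unknown mutation time, and migrations, flags, population and metadata are left out entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Edge:  # hypothetical stand-in for an edge table row
    left: float
    right: float
    parent: int
    child: int


@dataclass
class Mutation:  # hypothetical stand-in for a mutation joined with its site position
    position: float
    node: int
    time: Optional[float] = None  # None models "unknown time"


def split_edges(node_time, edges, mutations, time):
    """Toy version of the proposed definition; returns new lists, inputs are not modified."""
    node_time = list(node_time)
    mutations = [Mutation(m.position, m.node, m.time) for m in mutations]
    new_edges = []
    for e in edges:
        if node_time[e.child] < time < node_time[e.parent]:
            # One new node u per intersecting edge, placed at the split time.
            u = len(node_time)
            node_time.append(time)
            new_edges.append(Edge(e.left, e.right, e.parent, u))
            new_edges.append(Edge(e.left, e.right, u, e.child))
            for m in mutations:
                on_edge = m.node == e.child and e.left <= m.position < e.right
                # An unknown mutation time falls back to the child node's time.
                m_time = node_time[e.child] if m.time is None else m.time
                if on_edge and m_time >= time:
                    m.node = u
        else:
            new_edges.append(Edge(e.left, e.right, e.parent, e.child))
    return node_time, new_edges, mutations


# Two samples (nodes 0 and 1, time 0) under a root (node 2, time 2), split at time 1.
node_time, edges, mutations = split_edges(
    [0.0, 0.0, 2.0],
    [Edge(0, 10, 2, 0), Edge(0, 10, 2, 1)],
    [Mutation(3.0, 0, 1.5), Mutation(4.0, 0, 0.5), Mutation(5.0, 1, None)],
    time=1.0,
)
```

In this toy run only the mutation at time 1.5 moves to a new node; the one at time 0.5 and the unknown-time one (treated as time 0) stay on their sample nodes.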
|
I think the |
|
Or, just put the |
59b8323 to
b96325a
Compare
|
Restacked on #2279 |
b96325a to
2ef801c
Compare
bf42b78 to
1be677f
Compare
Co-authored-by: Peter Ralph <petrel.harp@gmail.com>
3636823 to
ab8a260
Compare
0502296 to
3c9ae2e
Compare
f009e5e to
5100269
Compare
|
An update here if anyone is interested: the implementation of I could separate out |
|
(Nix that, I'm going to open another PR - this one is too big.) |
|
Closed in favour of #2331 |
What do we think of this definition? We could go along and delete nodes that aren't referred to any more, but I don't think it's worth it. It's much simpler if we can trust that the nodes in the pre-decapitated trees are the same as afterwards. If someone wants to get rid of them, then they can easily simplify the tables afterwards (we made the simplification an option in some of the trim methods, which we subsequently decided was a mistake).
I haven't thought through how to deal with mutations yet, thought I'd get some feedback first. I'm tempted to just leave them in though. They won't do anything unless they're over samples, so the same argument as above for nodes holds?
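As a rough illustration of the "nodes stay, edges go" behaviour discussed here, the sketch below models the edge-deletion step on plain Python lists. It assumes the rule from the `delete_older` docstring earlier in the thread (an edge's time is its child node's time, and anything with time >= the cutoff is removed); the function and tuple layout are hypothetical and this is not the tskit code.

```python
def delete_older(node_time, edges, time):
    """Toy model: drop edges whose child has time >= `time`; every node is kept."""
    kept = [
        (left, right, parent, child)
        for (left, right, parent, child) in edges
        if node_time[child] < time
    ]
    return list(node_time), kept


# Samples 0 and 1 at time 0, internal node 2 at time 1, root 3 at time 2.
node_time = [0.0, 0.0, 1.0, 2.0]
edges = [(0, 10, 2, 0), (0, 10, 2, 1), (0, 10, 3, 2)]

# Cutting at time 1 removes only the edge above node 2.
nodes_a, edges_a = delete_older(node_time, edges, time=1.0)

# Cutting at or below every node time leaves an empty edge list, but all nodes remain.
nodes_b, edges_b = delete_older(node_time, edges, time=0.0)
```

Node IDs are identical before and after in both cases, which is the property argued for above; the second call corresponds to the "time younger than any of the nodes" corner case raised in review.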