Skip to content

Conversation

@jeromekelleher
Copy link
Member

What do we think of this definition? We could go along and delete nodes that aren't referred to any more, but I don't think it's worth it. It's much simpler if we can trust that the nodes in the pre-decapitated trees are the same as afterwards. If someone wants to get rid of them, then they can easily simplify the tables afterwards (we made the simplification an option in some of the trim methods, which we subsequently decided was a mistake).

I haven't thought through how to deal with mutations yet, thought I'd get some feedback first. I'm tempted to just leave them in though. They won't do anything unless they're over samples, so the same argument as above for nodes holds?

@jeromekelleher jeromekelleher requested a review from benjeffery May 4, 2022 15:06
@codecov
Copy link

codecov bot commented May 4, 2022

Codecov Report

Merging #2240 (5100269) into main (1374729) will decrease coverage by 0.05%.
The diff coverage is 85.50%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2240      +/-   ##
==========================================
- Coverage   93.33%   93.28%   -0.06%     
==========================================
  Files          27       27              
  Lines       26141    26347     +206     
  Branches     1175     1178       +3     
==========================================
+ Hits        24398    24577     +179     
- Misses       1713     1740      +27     
  Partials       30       30              
Flag Coverage Δ
c-tests 92.21% <81.61%> (-0.10%) ⬇️
lwt-tests 89.05% <ø> (ø)
python-c-tests 71.86% <67.60%> (-0.03%) ⬇️
python-tests 98.89% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
c/tskit/tables.c 90.21% <78.26%> (-0.08%) ⬇️
c/tskit/trees.c 95.02% <85.07%> (-0.22%) ⬇️
python/_tskitmodule.c 90.89% <90.00%> (-0.01%) ⬇️
python/tskit/tables.py 98.89% <100.00%> (+<0.01%) ⬆️
python/tskit/trees.py 98.07% <100.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1374729...5100269. Read the comment docs.

@petrelharp
Copy link
Contributor

Oh my gosh, it's beautiful! That's so very nice and clean.

I haven't thought through how to deal with mutations yet, thought I'd get some feedback first. I'm tempted to just leave them in though. They won't do anything unless they're over samples, so the same argument as above for nodes holds?

Consider a mutation with time > time. There are three situations:

  1. it's node has time >= time also; in this case there will no longer be an edge on which the mutation falls (and there might not have been, previously, if the node was a root); so the interpretation is that the mutation specifies the genotype of that node, but no other nodes, because the node no longer has any descendants. So I guess this is harmless? However, it might have unexpected effects if the mutation's node is a sample?
  2. it's node has time < time and the mutation is on an edge; in this case at minimum the mutation's node should be updated to the newly-added node; if we do this then still the mutation will affect statistics, because (for instance) it could be a mutation "inherited by these samples, but not those others" and thus contribute to divergence. So, it ought to be removed.
  3. it's node has time < time and it is not on a edge; in this case the mutation just has to be removed, unless we also add a new node (and, new edge!) above the mutation's node, at time, and reassign the mutation's node. But, same problem as in (2).

Ok - I think we should remove the mutations above time.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice.
What happens if a time is specified that is younger than any of the nodes? I guess you get an empty edge table? Would be good to test that case.

Also I am thinking about the implications of methods like this leaving nodes and mutations without edges. I think it is the right thing to do as you don't end up losing samples - they just become missing, but do we know that all the other tskit routines don't make assumptions about mutations being on a valid edge? I assume they don't but worth thinking about.

@petrelharp
Copy link
Contributor

A bit more about leaving nodes without edges - this will be very convenient because if there are samples above time, then we'll still be able to compute statistics with them. And, they'll have the same order; removing nodes will make our use case complicated thanks to reordering. And, if you interpret decapitation as "give me relationships that happened since this time" then "no edges" is arguably the right thing. (Hm - it's not actually 'missing data' as in missing sequencing, though, but it will do the right thing.)

@jeromekelleher
Copy link
Member Author

OK, I've added support for mutations and I think I've covered all the quirky corner cases. If this definition looks OK to you @petrelharp I'll code things up in numpy and add some tests where we run this over bigger examples and I think we're done.

@jeromekelleher jeromekelleher marked this pull request as ready for review May 13, 2022 13:56
@jeromekelleher
Copy link
Member Author

This is ready for a full review now. There were lots of fiddly details to deal with, but hopefully good to go now.

I ended up implementing in C, as we still don't have nice ways of handling full table updates in Python and it seemed like this would be a useful operation to have in C at some point anyway.

@petrelharp
Copy link
Contributor

Looks great!! See comments.

@jeromekelleher
Copy link
Member Author

Something we should consider actually @petrelharp - what if we didn't bother splitting the edges, and just deleted any edges that intersect with t? This would be much simpler in that we wouldn't have to add any nodes. I don't think it would affect mutations that are on the deleted edge in any way, so the output genotypes will be identical.

The only downside I guess is if you want to throw mutations on to a decapitated tree sequence afterwards, then you'll lose the bits of edges that you would have had to put mutations on. However, if we had a "census" operation, which split edges that intersect with t then you could combine this with decapitate and that would do the desired thing. The advantage would be that we would only have to do this "inserting nodes into an edge" thing once, and could think more clearly about what the semantics of population, individual, metadata etc should be.

Actually, the more I think about this the more convinced I am this i the right way to do, so going to mark this PR back as draft.

@jeromekelleher jeromekelleher marked this pull request as draft May 13, 2022 20:24
@benjeffery
Copy link
Member

In general I think that operations that add rows to tables need to have an argument which is the metadata to insert for those new elements. In C this would be bytes of course, but at the python level, as part of the method, we can check that this metadata satisfies the schema.

@jeromekelleher
Copy link
Member Author

jeromekelleher commented May 15, 2022

The only downside I guess is if you want to throw mutations on to a decapitated tree sequence afterwards, then you'll lose the bits of edges that you would have had to put mutations on.

Actually, there is a clear downside in that edges over samples would be deleted, so that unless there's mutations directly over them before t, they will be missing data. That seems wrong.

So, the question is do we want to implement decapitate like it's done here, or to make it an operation that's composed of two simpler operations like this:

def decapitate(self, time, *, node_metadata=None, preserve_mutational_state=None):
     self.split_edges(time, metadata=node_metadata)
     self.delete_older(time, preserve_mutational_state=preserve_mutational_state)

def split_edges(self, time, *, flags=None, metadata=None, population=None):
    """
    Replace any edge ``(l, r, parent, child)`` in which ``node_time[child] < time < node_time[parent]`` with two 
    edges ``(l, r, parent, u)`` and ``(l, r, u, child)``, where ``u`` is a newly added node for each intersecting edge.
    If ``metadata`` or ``population`` are specified, newly added nodes will be assigned these values. Otherwise,
    default values will be used. If a metadata schema is defined for the node table, the empty dictionary will be 
    used by default for the new node when calling :meth:`.NodeTable.add_row`; otherwise, an empty byte string. 
    
    The population value for the new node will be derived from the population of the edge's child, and any 
    intersecting migrations. [Details of how we decide migrations intersect with the edge]. Any migrations that 
    intersect with the edge that are older than ``time`` will have the ``node`` value set to ``u``, the newly added 
    node.

    Newly added nodes will have a ``flags`` value of 0, if not specified.

    Any mutations lying on the edge whose time is >= ``time`` will have their node value set to ``u``. Note that 
    the time of the mutation is defined as the time of the child node if the mutation's time is unknown.
    """
    if metadata is None:
         metadata = # figure out default metadata value in bytes
    # Implement. Communicating the defaults down to C will be tricky, I guess we'd have to use some function
    # option flags to do that.
    self._ll_tables.split_edges(...)

def delete_older(self, time, *, preserve_mutational_state=True):
    """
    Delete information from the data model that has a time >= to the specified value. For the purposes of this 
    method, the time of an edge is defined as the time of the child node.  The time of a mutation is defined 
    as the time of its node, if it is marked as unknown. The time of a migration is its ``time`` value.

    To avoid changing IDs nodes are *not* deleted. If you wish to remove redundant nodes, please use 
    simplify on the result of this operation.

    If ``preserve_mutational_state`` is True, then the state inherited by samples at each site is guaranteed to 
    be the same before an after this operation. This is done by inserting new mutations [not sure how actually].
    If ``preserve_mutational_state``, then the state that samples inherit is can change arbitrarily, including 
    samples being marked as missing data.
    """
    # Easy enough to implement, except for ``preserve_mutational_state`` bit.

Hmm, this is tricky. I think the `split_edgesoperation is reasonably well defined, butdelete_older`` is still a bit wooly. Do we want to preserve the mutational state? This seems like a reasonable thing to want to me, but maybe that's missing the point. If we do delete the ancient stuff, then how can we expect to maintain the eventual output of the mutational process. In that case then, does it make sense to keep any mutations after decapitate? Mutations build on older mutations, so keeping the newer ones is a bit weird. But then, if you're a forwards-in-time simulation, newer topology builds on older topology so you could argue that's equally weird.

If we do want preserve_mutational_state I think the simplest thing to do is to just keep the older mutations.

@petrelharp
Copy link
Contributor

I think the split_edges is quite useful and we've wanted it in other places. But, no need for a separate delete_older operation - why not just split_edges and then decapitate, that uses it and passes some arguments along to it?

@petrelharp
Copy link
Contributor

Or, just put the metadata and flags arguments you've got sketched out into decapitate for now?

@jeromekelleher
Copy link
Member Author

Restacked on #2279

@jeromekelleher jeromekelleher force-pushed the decapitate branch 2 times, most recently from bf42b78 to 1be677f Compare May 21, 2022 09:34
@jeromekelleher
Copy link
Member Author

An update here if anyone is interested: the implementation of split_edges is complete I think. Having thought about it a bit, I think the simplest path forward is to implement TableCollection.delete_older, and then finally implement TreeSequence.decapitate by combining split_edges and delete_older. So, there's a bit more work to be done, but I'll plug away here.

I could separate out split_edges into another PR which we could first merge, if this one is getting too big. It would be a bit of a faddle, but might be worth it in the interest of having something reviewable. LMK what you think @benjeffery @petrelharp .

@jeromekelleher
Copy link
Member Author

(Nix that, I'm going to open another PR - this one is too big.)

This was referenced May 24, 2022
@jeromekelleher jeromekelleher mentioned this pull request Jun 9, 2022
@jeromekelleher
Copy link
Member Author

Closed in favour of #2331

@jeromekelleher jeromekelleher deleted the decapitate branch June 9, 2022 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants