Decapitate #2240

jeromekelleher · 2022-05-04T15:06:28Z

What do we think of this definition? We could go along and delete nodes that aren't referred to any more, but I don't think it's worth it. It's much simpler if we can trust that the nodes in the pre-decapitated trees are the same as afterwards. If someone wants to get rid of them, then they can easily simplify the tables afterwards (we made the simplification an option in some of the trim methods, which we subsequently decided was a mistake).

I haven't thought through how to deal with mutations yet, thought I'd get some feedback first. I'm tempted to just leave them in though. They won't do anything unless they're over samples, so the same argument as above for nodes holds?

codecov · 2022-05-04T15:19:03Z

Codecov Report

Merging #2240 (5100269) into main (1374729) will decrease coverage by 0.05%.
The diff coverage is 85.50%.

@@            Coverage Diff             @@
##             main    #2240      +/-   ##
==========================================
- Coverage   93.33%   93.28%   -0.06%     
==========================================
  Files          27       27              
  Lines       26141    26347     +206     
  Branches     1175     1178       +3     
==========================================
+ Hits        24398    24577     +179     
- Misses       1713     1740      +27     
  Partials       30       30

Flag	Coverage Δ
c-tests	`92.21% <81.61%> (-0.10%)`	⬇️
lwt-tests	`89.05% <ø> (ø)`
python-c-tests	`71.86% <67.60%> (-0.03%)`	⬇️
python-tests	`98.89% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
c/tskit/tables.c	`90.21% <78.26%> (-0.08%)`	⬇️
c/tskit/trees.c	`95.02% <85.07%> (-0.22%)`	⬇️
python/_tskitmodule.c	`90.89% <90.00%> (-0.01%)`	⬇️
python/tskit/tables.py	`98.89% <100.00%> (+<0.01%)`	⬆️
python/tskit/trees.py	`98.07% <100.00%> (+0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1374729...5100269. Read the comment docs.

petrelharp · 2022-05-05T00:58:04Z

Oh my gosh, it's beautiful! That's so very nice and clean.

I haven't thought through how to deal with mutations yet, thought I'd get some feedback first. I'm tempted to just leave them in though. They won't do anything unless they're over samples, so the same argument as above for nodes holds?

Consider a mutation with time > time. There are three situations:

it's node has time >= time also; in this case there will no longer be an edge on which the mutation falls (and there might not have been, previously, if the node was a root); so the interpretation is that the mutation specifies the genotype of that node, but no other nodes, because the node no longer has any descendants. So I guess this is harmless? However, it might have unexpected effects if the mutation's node is a sample?
it's node has time < time and the mutation is on an edge; in this case at minimum the mutation's node should be updated to the newly-added node; if we do this then still the mutation will affect statistics, because (for instance) it could be a mutation "inherited by these samples, but not those others" and thus contribute to divergence. So, it ought to be removed.
it's node has time < time and it is not on a edge; in this case the mutation just has to be removed, unless we also add a new node (and, new edge!) above the mutation's node, at time, and reassign the mutation's node. But, same problem as in (2).

Ok - I think we should remove the mutations above time.

benjeffery

Very nice.
What happens if a time is specified that is younger than any of the nodes? I guess you get an empty edge table? Would be good to test that case.

Also I am thinking about the implications of methods like this leaving nodes and mutations without edges. I think it is the right thing to do as you don't end up losing samples - they just become missing, but do we know that all the other tskit routines don't make assumptions about mutations being on a valid edge? I assume they don't but worth thinking about.

petrelharp · 2022-05-05T12:33:33Z

A bit more about leaving nodes without edges - this will be very convenient because if there are samples above time, then we'll still be able to compute statistics with them. And, they'll have the same order; removing nodes will make our use case complicated thanks to reordering. And, if you interpret decapitation as "give me relationships that happened since this time" then "no edges" is arguably the right thing. (Hm - it's not actually 'missing data' as in missing sequencing, though, but it will do the right thing.)

jeromekelleher · 2022-05-05T13:46:15Z

OK, I've added support for mutations and I think I've covered all the quirky corner cases. If this definition looks OK to you @petrelharp I'll code things up in numpy and add some tests where we run this over bigger examples and I think we're done.

jeromekelleher · 2022-05-13T13:57:38Z

This is ready for a full review now. There were lots of fiddly details to deal with, but hopefully good to go now.

I ended up implementing in C, as we still don't have nice ways of handling full table updates in Python and it seemed like this would be a useful operation to have in C at some point anyway.

c/tskit/tables.c

python/tskit/tables.py

python/tskit/trees.py

petrelharp · 2022-05-13T16:43:54Z

Looks great!! See comments.

jeromekelleher · 2022-05-13T20:18:53Z

Something we should consider actually @petrelharp - what if we didn't bother splitting the edges, and just deleted any edges that intersect with t? This would be much simpler in that we wouldn't have to add any nodes. I don't think it would affect mutations that are on the deleted edge in any way, so the output genotypes will be identical.

The only downside I guess is if you want to throw mutations on to a decapitated tree sequence afterwards, then you'll lose the bits of edges that you would have had to put mutations on. However, if we had a "census" operation, which split edges that intersect with t then you could combine this with decapitate and that would do the desired thing. The advantage would be that we would only have to do this "inserting nodes into an edge" thing once, and could think more clearly about what the semantics of population, individual, metadata etc should be.

Actually, the more I think about this the more convinced I am this i the right way to do, so going to mark this PR back as draft.

benjeffery · 2022-05-14T21:07:15Z

In general I think that operations that add rows to tables need to have an argument which is the metadata to insert for those new elements. In C this would be bytes of course, but at the python level, as part of the method, we can check that this metadata satisfies the schema.

jeromekelleher · 2022-05-15T09:12:30Z

The only downside I guess is if you want to throw mutations on to a decapitated tree sequence afterwards, then you'll lose the bits of edges that you would have had to put mutations on.

Actually, there is a clear downside in that edges over samples would be deleted, so that unless there's mutations directly over them before t, they will be missing data. That seems wrong.

So, the question is do we want to implement decapitate like it's done here, or to make it an operation that's composed of two simpler operations like this:

def decapitate(self, time, *, node_metadata=None, preserve_mutational_state=None):
     self.split_edges(time, metadata=node_metadata)
     self.delete_older(time, preserve_mutational_state=preserve_mutational_state)

def split_edges(self, time, *, flags=None, metadata=None, population=None):
    """
    Replace any edge ``(l, r, parent, child)`` in which ``node_time[child] < time < node_time[parent]`` with two 
    edges ``(l, r, parent, u)`` and ``(l, r, u, child)``, where ``u`` is a newly added node for each intersecting edge.
    If ``metadata`` or ``population`` are specified, newly added nodes will be assigned these values. Otherwise,
    default values will be used. If a metadata schema is defined for the node table, the empty dictionary will be 
    used by default for the new node when calling :meth:`.NodeTable.add_row`; otherwise, an empty byte string. 
    
    The population value for the new node will be derived from the population of the edge's child, and any 
    intersecting migrations. [Details of how we decide migrations intersect with the edge]. Any migrations that 
    intersect with the edge that are older than ``time`` will have the ``node`` value set to ``u``, the newly added 
    node.

    Newly added nodes will have a ``flags`` value of 0, if not specified.

    Any mutations lying on the edge whose time is >= ``time`` will have their node value set to ``u``. Note that 
    the time of the mutation is defined as the time of the child node if the mutation's time is unknown.
    """
    if metadata is None:
         metadata = # figure out default metadata value in bytes
    # Implement. Communicating the defaults down to C will be tricky, I guess we'd have to use some function
    # option flags to do that.
    self._ll_tables.split_edges(...)

def delete_older(self, time, *, preserve_mutational_state=True):
    """
    Delete information from the data model that has a time >= to the specified value. For the purposes of this 
    method, the time of an edge is defined as the time of the child node.  The time of a mutation is defined 
    as the time of its node, if it is marked as unknown. The time of a migration is its ``time`` value.

    To avoid changing IDs nodes are *not* deleted. If you wish to remove redundant nodes, please use 
    simplify on the result of this operation.

    If ``preserve_mutational_state`` is True, then the state inherited by samples at each site is guaranteed to 
    be the same before an after this operation. This is done by inserting new mutations [not sure how actually].
    If ``preserve_mutational_state``, then the state that samples inherit is can change arbitrarily, including 
    samples being marked as missing data.
    """
    # Easy enough to implement, except for ``preserve_mutational_state`` bit.

Hmm, this is tricky. I think the `split_edgesoperation is reasonably well defined, butdelete_older`` is still a bit wooly. Do we want to preserve the mutational state? This seems like a reasonable thing to want to me, but maybe that's missing the point. If we do delete the ancient stuff, then how can we expect to maintain the eventual output of the mutational process. In that case then, does it make sense to keep any mutations after decapitate? Mutations build on older mutations, so keeping the newer ones is a bit weird. But then, if you're a forwards-in-time simulation, newer topology builds on older topology so you could argue that's equally weird.

If we do want preserve_mutational_state I think the simplest thing to do is to just keep the older mutations.

petrelharp · 2022-05-16T04:52:27Z

I think the split_edges is quite useful and we've wanted it in other places. But, no need for a separate delete_older operation - why not just split_edges and then decapitate, that uses it and passes some arguments along to it?

petrelharp · 2022-05-16T04:53:02Z

Or, just put the metadata and flags arguments you've got sketched out into decapitate for now?

jeromekelleher · 2022-05-18T15:46:12Z

Restacked on #2279

python/tskit/trees.py

Closes tskit-dev#2236

Closes tskit-dev#2276

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

jeromekelleher · 2022-05-24T12:09:28Z

An update here if anyone is interested: the implementation of split_edges is complete I think. Having thought about it a bit, I think the simplest path forward is to implement TableCollection.delete_older, and then finally implement TreeSequence.decapitate by combining split_edges and delete_older. So, there's a bit more work to be done, but I'll plug away here.

I could separate out split_edges into another PR which we could first merge, if this one is getting too big. It would be a bit of a faddle, but might be worth it in the interest of having something reviewable. LMK what you think @benjeffery @petrelharp .

jeromekelleher · 2022-05-24T12:13:56Z

(Nix that, I'm going to open another PR - this one is too big.)

jeromekelleher · 2022-06-09T16:40:38Z

Closed in favour of #2331

jeromekelleher requested a review from benjeffery May 4, 2022 15:06

jeromekelleher mentioned this pull request May 4, 2022

Method to 'prune back' a tree sequence #281

Closed

benjeffery reviewed May 5, 2022

View reviewed changes

jeromekelleher force-pushed the decapitate branch from 1f25ab9 to 89df1cf Compare May 5, 2022 13:44

jeromekelleher force-pushed the decapitate branch from 89df1cf to ab6f86d Compare May 13, 2022 13:55

jeromekelleher marked this pull request as ready for review May 13, 2022 13:56

This was referenced May 13, 2022

Replace tsutil.decapitate in tests #2269

Closed

Move table transformation tests to tests/test_table_transforms.py #2270

Closed

petrelharp reviewed May 13, 2022

View reviewed changes

c/tskit/tables.c Show resolved Hide resolved

petrelharp reviewed May 13, 2022

View reviewed changes

c/tskit/tables.c Show resolved Hide resolved

petrelharp reviewed May 13, 2022

View reviewed changes

python/tskit/tables.py Show resolved Hide resolved

jeromekelleher mentioned this pull request May 13, 2022

Remove tsutil decapitate #2272

Closed

petrelharp reviewed May 13, 2022

View reviewed changes

python/tskit/trees.py Show resolved Hide resolved

jeromekelleher force-pushed the decapitate branch from ab6f86d to e3d01f4 Compare May 13, 2022 16:47

jeromekelleher marked this pull request as draft May 13, 2022 20:24

jeromekelleher mentioned this pull request May 16, 2022

Add TableCollection.split_edges #2276

Closed

jeromekelleher force-pushed the decapitate branch from 59b8323 to b96325a Compare May 18, 2022 15:45

jeromekelleher force-pushed the decapitate branch from b96325a to 2ef801c Compare May 18, 2022 16:10

jeromekelleher force-pushed the decapitate branch 2 times, most recently from bf42b78 to 1be677f Compare May 21, 2022 09:34

petrelharp reviewed May 23, 2022

View reviewed changes

python/tskit/trees.py Outdated Show resolved Hide resolved

petrelharp reviewed May 23, 2022

View reviewed changes

python/tskit/trees.py Outdated Show resolved Hide resolved

petrelharp reviewed May 23, 2022

View reviewed changes

python/tskit/trees.py Outdated Show resolved Hide resolved

petrelharp reviewed May 23, 2022

View reviewed changes

python/tskit/trees.py Outdated Show resolved Hide resolved

jeromekelleher and others added 9 commits May 23, 2022 17:28

Implement decapitate function

6a2c704

Closes tskit-dev#2236

Add start of new operation

a0f960c

Implement split_edges

a3c1569

Closes tskit-dev#2276

tests

31367be

Implement semantics for extra values

7b6c9c7

Update

8f4fbf7

C implementation of split_edges

68deb71

Apply suggestions from code review

1168ec5

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

Review tweak

ab8a260

jeromekelleher force-pushed the decapitate branch from 3636823 to ab8a260 Compare May 23, 2022 16:28

jeromekelleher added 2 commits May 23, 2022 17:51

Add impute population option

b9c50dc

Implement py-c split edges

3c9ae2e

jeromekelleher force-pushed the decapitate branch from 0502296 to 3c9ae2e Compare May 24, 2022 10:13

jeromekelleher added 2 commits May 24, 2022 12:58

Split edges using C impl

e2de3c7

Start for delete older

5100269

jeromekelleher force-pushed the decapitate branch from f009e5e to 5100269 Compare May 24, 2022 11:58

This was referenced May 24, 2022

Implement split_edges #2296

Merged

Delete_older operation #2302

Merged

jeromekelleher mentioned this pull request Jun 9, 2022

Decapitate2 #2331

Merged

jeromekelleher closed this Jun 9, 2022

jeromekelleher deleted the decapitate branch June 9, 2022 16:40

Decapitate #2240

Decapitate #2240

Uh oh!

Conversation

jeromekelleher commented May 4, 2022

Uh oh!

codecov bot commented May 4, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

petrelharp commented May 5, 2022

Uh oh!

benjeffery left a comment

Choose a reason for hiding this comment

Uh oh!

petrelharp commented May 5, 2022

Uh oh!

jeromekelleher commented May 5, 2022

Uh oh!

jeromekelleher commented May 13, 2022

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

petrelharp commented May 13, 2022

Uh oh!

jeromekelleher commented May 13, 2022

Uh oh!

benjeffery commented May 14, 2022

Uh oh!

jeromekelleher commented May 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

petrelharp commented May 16, 2022

Uh oh!

petrelharp commented May 16, 2022

Uh oh!

jeromekelleher commented May 18, 2022

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jeromekelleher commented May 24, 2022

Uh oh!

jeromekelleher commented May 24, 2022

Uh oh!

jeromekelleher commented Jun 9, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov bot commented May 4, 2022 •

edited

Loading

jeromekelleher commented May 15, 2022 •

edited

Loading