Currently a mutation knows what node it happened over, but really it should know what edge it happened on. Conceptually, rather than storing the node ID, we should be storing the edge ID - then, you can look up both the child and parent node for a given mutation, as well as know the spatial extent of the edge.
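To make the lookup concrete, here is a minimal sketch of what storing the edge ID would give us. It uses plain Python stand-ins rather than the tskit API: `Edge`, `EdgeMutation` and `locate` are hypothetical names invented for illustration, and the sketch assumes every mutation sits on an edge (a mutation above a root has no edge, so it would need separate handling).

```python
from collections import namedtuple

# Hypothetical stand-ins for rows of the edge and mutation tables.
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])
EdgeMutation = namedtuple("EdgeMutation", ["site", "edge", "derived_state"])

edges = [
    Edge(left=0.0, right=10.0, parent=4, child=0),
    Edge(left=0.0, right=5.0, parent=4, child=1),
    Edge(left=5.0, right=10.0, parent=5, child=1),
]


def locate(mutation, edges):
    """Return (child, parent, left, right) for the edge a mutation sits on."""
    e = edges[mutation.edge]
    return e.child, e.parent, e.left, e.right


# The mutation stores an edge ID instead of a node ID.
mut = EdgeMutation(site=0, edge=2, derived_state="T")
child, parent, left, right = locate(mut, edges)
print(child, parent, left, right)
```

With only the node ID stored, recovering the parent and the spatial extent means searching the edge table for the edge whose child is that node and whose interval covers the mutation's site.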
I can't think of any immediate applications of this beyond checking that mutation times are valid and viz, so I think it's perhaps one to acknowledge as a suboptimal design and move on. It would cause a fair bit of breakage if we were to change the Tables API so that MutationTable.node became MutationTable.edge. The TreeSequence API could be updated in a non-breaking way, I'd imagine, as we'd fill out Mutation.node by looking up the edge table.
There's a few ways we could go about doing this, if we did it. Here's the "ripping off the band aid" way:
- In the C Tables API, change all references to `node` to `edge`.
- Change the file format to store the `edge` instead of `node`.
- In the C mutation_t struct, add a new `edge` attribute. Fill in the old `node` item by looking up the edge table. Code that uses the mutation_t struct shouldn't be affected at all then.
- Similarly, in Python, we change and break the Tables API, but keep compatibility in the high-level TreeSequence/Tree API. The MutationTable could also provide the `node` column as a property computed on demand, which should really minimise downstream breakage.
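The "computed on demand" idea in the last point could look roughly like the sketch below. This is not the tskit implementation: `EdgeBasedMutationTable` and its methods are hypothetical, columns are plain lists rather than numpy arrays, and it again assumes every mutation has a valid edge ID.

```python
class EdgeBasedMutationTable:
    """Hypothetical mutation table that stores `edge` but still exposes `node`."""

    def __init__(self, edge_child):
        # Child column of the edge table, used to derive the node column.
        self.edge_child = list(edge_child)
        self.site = []
        self.edge = []

    def add_row(self, site, edge):
        """Append a mutation and return its row ID."""
        self.site.append(site)
        self.edge.append(edge)
        return len(self.edge) - 1

    @property
    def node(self):
        # Derived column: the node below a mutation is the child of its edge.
        return [self.edge_child[e] for e in self.edge]


# Edge i has child edge_child[i]; here edges 0, 1, 2 have children 0, 1, 1.
mutations = EdgeBasedMutationTable(edge_child=[0, 1, 1])
mutations.add_row(site=0, edge=2)
mutations.add_row(site=1, edge=0)
print(mutations.node)
```

Code that only reads `mutations.node` would keep working; code that writes to the column would still break, since the property is read-only in this sketch.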
I don't think there'd be that much downstream breakage if we change the tables API. Most things that use the Tables API are either within tskit, or closely related projects. Since we're already going through a major file version bump for adding the time field to mutations (#513), there's no additional pain caused there.