'mutation' mode for statistics

Currently we have [three modes](https://tskit.dev/tskit/docs/latest/stats.html#sec-stats-mode) for statistics. Quoting from there:

> These three examples can all be answered in the same way with the tree sequence: first, draw all the paths from one genome to the other through the tree sequence (back up to their common ancestor and back down in each marginal tree). Then, (site) count the number of mutations falling on the paths, (branch) measure the length of the paths, or (node) count how often the path goes through each node. 

However, that's a bit wrong - what's written for `site` mode is only true under an infinite-sites model of mutations (ie, not more than one mutation per site). The reason is that `site` mode is defined so that you get a result that is equal to what you'd get from the genome sequences themselves; and hence:
1. you have to use identity-by-state, and so can't know if two identical alleles came from different mutations
2. you can't see extra mutations that may have happened along the way, and so you only know "are these identical or not", rather than "how many mutations separate these two" at a given site.

However, given inferred tree sequences, what we'd arguably rather compute is the more sorts of statistics that simply "count the number of mutations falling on the paths", even if these mutations happen on the same site. This is closer to what we can more easily do theory about, etcetera. Furthermore, there's things we want to do that we *can't* do with `site` mode, including:
1. time windows: `site` mode is about allelic state, and without a 1-to-1 mapping between alleles and mutations, alleles don't have times
2. relatedness matrix operations: several of the efficient algorithms for relatedness matrices don't work if we want to use identity-by-state instead of identity-by-descent: `genetic_relatedness_matrix` and `genetic_relatedness_vector`. The reason the algorithms don't work is because we're relying on the tree to tell us who's similar to who, and for instance assuming that if nodes `a` and `b` differ by three mutations, then those three mutations also differ between everyone in the subtrees below `a` and `b`. I see no way to get around that while keeping the algorithms in their current form at all.

So, I'd like to add a new `mode` for statistics, `mode="mutation"`. Implementation is conceptually straightforward: we want to do the same thing as `mode="branch"` except the area of the branch (span multiplied by time interval) is replaced by number of mutations on the branch. One choice to be made here regards time windows: for a given time window, do we

  a. count only mutations with `time` that fall in that window, or
  b. just "spread out the mass of" mutations uniformly on the edge

I think the answer is (b) - for real data we will mostly don't know where mutations are along the branch (well, if the mutation is deleterious then we'd have a prior that it's at the more recent end), so we definitely want to do (b); perhaps the argument could be made to also do (a) but it's not clear to me if it would be useful.

Thoughts? Concerns? Suggestions? How's the name?

The first step in implementing this might be some refactoring of `test_tree_stats.py`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

'mutation' mode for statistics #3299

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

'mutation' mode for statistics #3299

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions