Skip to content

'mutation' mode for statistics #3299

@petrelharp

Description

@petrelharp

Currently we have three modes for statistics. Quoting from there:

These three examples can all be answered in the same way with the tree sequence: first, draw all the paths from one genome to the other through the tree sequence (back up to their common ancestor and back down in each marginal tree). Then, (site) count the number of mutations falling on the paths, (branch) measure the length of the paths, or (node) count how often the path goes through each node.

However, that's a bit wrong - what's written for site mode is only true under an infinite-sites model of mutations (ie, not more than one mutation per site). The reason is that site mode is defined so that you get a result that is equal to what you'd get from the genome sequences themselves; and hence:

  1. you have to use identity-by-state, and so can't know if two identical alleles came from different mutations
  2. you can't see extra mutations that may have happened along the way, and so you only know "are these identical or not", rather than "how many mutations separate these two" at a given site.

However, given inferred tree sequences, what we'd arguably rather compute is the more sorts of statistics that simply "count the number of mutations falling on the paths", even if these mutations happen on the same site. This is closer to what we can more easily do theory about, etcetera. Furthermore, there's things we want to do that we can't do with site mode, including:

  1. time windows: site mode is about allelic state, and without a 1-to-1 mapping between alleles and mutations, alleles don't have times
  2. relatedness matrix operations: several of the efficient algorithms for relatedness matrices don't work if we want to use identity-by-state instead of identity-by-descent: genetic_relatedness_matrix and genetic_relatedness_vector. The reason the algorithms don't work is because we're relying on the tree to tell us who's similar to who, and for instance assuming that if nodes a and b differ by three mutations, then those three mutations also differ between everyone in the subtrees below a and b. I see no way to get around that while keeping the algorithms in their current form at all.

So, I'd like to add a new mode for statistics, mode="mutation". Implementation is conceptually straightforward: we want to do the same thing as mode="branch" except the area of the branch (span multiplied by time interval) is replaced by number of mutations on the branch. One choice to be made here regards time windows: for a given time window, do we

a. count only mutations with time that fall in that window, or
b. just "spread out the mass of" mutations uniformly on the edge

I think the answer is (b) - for real data we will mostly don't know where mutations are along the branch (well, if the mutation is deleterious then we'd have a prior that it's at the more recent end), so we definitely want to do (b); perhaps the argument could be made to also do (a) but it's not clear to me if it would be useful.

Thoughts? Concerns? Suggestions? How's the name?

The first step in implementing this might be some refactoring of test_tree_stats.py.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions