Microsatellite mutations models #2071

andrewkern · 2022-06-15T19:20:12Z

Here is a PR adding a three specific microsatellite mutation models along with a more general backend to add more.

The generalized parameterization I've chosen here follow from this nice paper from Rasmus Nielsen's group. It offers plenty of flexibility and room for growth. I've currently implemented just 3 of the models in that paper, but it will be trivial to add more.

I've implemented these as a subclass of MatrixMutationModel, and so I'm translating the CTMC representation from the original paper into the scaled, jump process transition matrix that is used in MatrixMutationModel. I hope I got the details correct here but may be off by a scalar?

Hoping to get some feedback as to what others would like to see here and what clean up to do.

i should add-- this would close #2013

codecov · 2022-06-15T19:28:15Z

Codecov Report

Merging #2071 (5b3294a) into main (8c1b8a6) will increase coverage by 0.05%.
The diff coverage is 97.77%.

❗ Current head 5b3294a differs from pull request most recent head 3b57f82. Consider uploading reports for the commit 3b57f82 to get more accurate results

@@            Coverage Diff             @@
##             main    #2071      +/-   ##
==========================================
+ Coverage   91.29%   91.35%   +0.05%     
==========================================
  Files          20       20              
  Lines       10803    10893      +90     
  Branches     2246     2273      +27     
==========================================
+ Hits         9863     9951      +88     
- Misses        514      515       +1     
- Partials      426      427       +1

Flag	Coverage Δ
C	`91.35% <97.77%> (+0.05%)`	⬆️
python	`98.82% <97.77%> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
msprime/__init__.py	`100.00% <ø> (ø)`
msprime/demography.py	`99.31% <ø> (ø)`
msprime/mutations.py	`98.33% <97.77%> (-0.24%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dafdcfe...3b57f82. Read the comment docs.

jeromekelleher

Looks great! My only comments are around the interface really, really nice that we can construct this based on the general matrix code.

docs/mutations.md

msprime/__init__.py

msprime/mutations.py

jeromekelleher · 2022-06-16T08:04:49Z

@petrelharp - any thoughts on the above? What do you think of my ideas for the interface?

docs/mutations.md

msprime/mutations.py

tests/test_mutations.py

andrewkern · 2022-06-18T06:20:01Z

okay, I think this is all cleaned up and ready to go. I've added a bunch more tests to verify the transition matrices, cleaned up docs, refactored stuff according to the nice design suggestions (thanks!), and rebased and squashed.

jeromekelleher · 2022-06-20T07:58:03Z

LGTM, thanks @andrewkern! Can you update the CHANGELOG please? We're good to merge then.

I'll open an issue to track statistical validation of these sims.

andrewkern · 2022-06-20T15:30:53Z

happy to, but i've never messed with the CHANGELOG before. Should I just add this under new features on a new 1.2.1 release?

jeromekelleher · 2022-06-20T15:56:30Z

Sure. Just stick in 2022-xx-xx as the date.

andrewkern · 2022-06-20T16:10:39Z

done

jeromekelleher

Looks great, thanks!

It's nice to have defaults, but we need to document what they are and have a good justification for setting them to that value.

Also hard-coding the defuaults into the signatures means that we have no flexibility later in terms of changing the defaults depending on what the other inputs are. The general pattern is to use param=None in the signature to avoid painting ourselves into corners (unless it's totally clear that there'll never be another other params). Also it avoids having to put the hard coded params into several functions.

However, we can make an issue to track this if you'd rather merge this now?

jeromekelleher · 2022-06-21T12:25:36Z

On second thoughts, let's just merge and update. Opened #2074 to track.

docs/mutations.md

tests/test_mutations.py

petrelharp · 2022-06-21T14:51:49Z

tests/test_mutations.py

+                # this test might seem a little loosey goosey
+                # but the scaling is create numerical noise i think
+                assert np.abs(exp - obs) < 1e-2


Uh-oh: if you actually need 1e-2 for this to pass, then something is wrong here.

i will recheck this

was not an issue it turned out

msprime/mutations.py

tests/test_mutations.py

jeromekelleher · 2022-06-21T15:35:39Z

OK, removing the AUTOMERGE then since we should capture @petrelharp's comments in this PR.

FYI @andrewkern, I pushed to this branch to update the CHANGELOG date (the date of release was set to today, which might be confusing on the "current" docs build). This means you'll need to fetch from your origin and reset --hard. Or, just force push over the change I made and we can do it again, no biggie.

andrewkern · 2022-06-21T21:03:36Z

Lots of minor nitpicks (apologies). But: if this is going to be used for simulating CODIS loci I think we should have more exhaustive tests for correctness. Let's see:

test that alleles are actually between lo and hi (inclusive)

test that rates are 0 for non-adjacent ones in SMM

test that stationary distribution is uniform when we say it is

write tests for ratios of expressions for EL2 without copy-pasting the helper functions (I've suggested how to do this for TPM

test that the sum of qij for j > i divided by the sum for j < i is equal to to (1 - alpha)/alpha

test that the sum of qij for i \neq j divided by the same thing for i = lo is equal to beta

And, I think that tests everything?

Okay I think I've not got all these tests in, EXCEPT for the last one which I'm confused on-- I don't think that's true for all models where s = 0? might need coffee

tests/test_mutations.py

msprime/mutations.py

petrelharp · 2022-06-23T05:57:40Z

tests/test_mutations.py

+                if np.abs(exp - obs) > 1e-6:
+                    raise ValueError(f"failed test on row: {i} exp:{exp} obs:{obs}")


Whoops - left over from debugging?

Suggested change

if np.abs(exp - obs) > 1e-6:

raise ValueError(f"failed test on row: {i} exp:{exp} obs:{obs}")

assert np.abs(exp - obs) < 1e-6

tests/test_mutations.py

petrelharp

LGTM!! - just a couple of minor things (aren't checking for validity of a few parameters).

andrewkern · 2022-06-23T17:26:05Z

okay this should be all set i think @petrelharp @jeromekelleher?

note I haven't edited the CHANGELOG further. not sure if i need to?

andrewkern · 2022-06-23T17:34:52Z

well darn now i need to hit these new tests with tests... lol. brb

andrewkern · 2022-06-23T18:35:12Z

okay this is squashed, rebased, and ready to merge I think

petrelharp · 2022-06-24T02:33:00Z

msprime/mutations.py

+        if u < 0 or u > 1:
+            raise ValueError("u must be between 0 and 1 inclusive")


I think this should be non-inclusive. Otherwise, some wierd things can happen, e.g. the chain is not ergodic:

>>> mm = msprime.MicrosatMutationModel(u=0, lo=3, hi=6, v=-10) >>> mm.transition_matrix array([[1., 0., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.], [0., 0., 0., 1.]]) >>> mm.root_distribution array([1., 0., 0., 0.])

we can change this but i think it conflicts with the original paper

Well, you get a valid Markov chain for the endpoints, but where one or both of the endpoints are absorbing, and so you'd get a situation where the root distribution draws with probability 1 from an absorbing state, and so you'd never get any non-silent mutations. That seems like not what anyone would actually want? And, the derived models have 0 < u < 1? There's no harm in leaving it 0 <= u <= 1, I guess, but I'd vote to restrict it.

petrelharp · 2022-06-24T02:34:33Z

Terribly sorry, but I have one more change (I think it should be 0 < u < 1 not <=), and there's two pending suggestions?

andrewkern · 2022-06-24T04:33:15Z

pending suggestions should be all set in my commits-- i didn't use the GH interface to incorporate them, but let me know if i missed something

msprime/mutations.py

petrelharp · 2022-06-24T05:00:16Z

Here's the two missing ones. I think they're important because they say what precisely what the mutation rates are.

petrelharp · 2022-06-24T05:05:10Z

Oh! And another important thing - where did the defaults come from? Edit: oh right, there's a different issue filed for that.

andrewkern · 2022-06-24T05:47:43Z

okay got the suggestions. will post a separate commit for limiting u

andrewkern · 2022-06-24T06:15:29Z

okay got a commit in that will hopefully close #2074?

petrelharp · 2022-06-24T14:23:16Z

msprime/mutations.py

@@ -422,8 +422,8 @@ class TPM(MicrosatMutationModel):
    For this model both `p` and `m` need to be set to < 1.0.

    This is a :class:`.MicrosatMutationModel` with alleles ``[lo, .. , hi]``.
-    For a precise definition of the mutation rates, see :class:`.MicrosatMutationModel`
-    with ``s=0``, ``u=0.5``, and ``v=0``.


Wait, why'd you take this out? This is necessary information for understanding what the mutation rates are?

So - I think that someone needs to be able to work out, exactly, from the docs, what the mutation rate from j to k is, at a given mutation_rate. To do that, we need to say this stuff about calculating qij and dividing by the max, excetera. I figure we can for this model just refer to the general class where it's explained, but for people to be able to do that they need to know how this one is derived from the general class. I guess that people can figure this out from Sainudiin, and find the expression for qij people need to look in Sainudiin anyhow, but it's not completely obvious, and why not just say what these conditions are?

Sorry. I thought it was more precise to say Default parameterization as estimated in Sainudiin et al. (2004). but I'll keep in this bit above as well as my addition

msprime/mutations.py

@petrelharp

implemented generalized model of microsatellite evolution better doi for TPM more microsatellite model stuff and some documentation rescaled things at Peter's suggestion Update docs/mutations.md simple commits from review Co-authored-by: Jerome Kelleher <jk@well.ox.ac.uk> refactored design to include MicrosatMutationModel class Update msprime/mutations.py Co-authored-by: Peter Ralph <petrel.harp@gmail.com> Peter's docstring suggestions included added tests of transition matrix; further clean up to docs one last test added another example to the docs edited CHANGELOG Apply suggestions from code review suggestions from @petrelharp Co-authored-by: Peter Ralph <petrel.harp@gmail.com> further tests and cleanup from code review linting wasn't installed... weird insertion throwing a bug fixed doc bug introduced during code review found bug in rescaling test of betas in trans mat parameterize testing added test of internal allele beta ratios add vscode to gitignore comments from peter tests on tests Apply suggestions from code review Co-authored-by: Peter Ralph <petrel.harp@gmail.com> changed bounds on u to be open interval better doc string on defaults for TPM and EL2 models further doc string edits further doc string edits; part 2

andrewkern · 2022-06-24T18:42:25Z

@jeromekelleher, @petrelharp and I finished off these doc strings this am. this should be ready to merge

jeromekelleher reviewed Jun 16, 2022

View reviewed changes

jeromekelleher requested a review from petrelharp June 16, 2022 08:04