Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Microsatellite mutations models #2071

Merged
merged 1 commit into from Jun 24, 2022
Merged

Conversation

andrewkern
Copy link
Member

@andrewkern andrewkern commented Jun 15, 2022

Here is a PR adding a three specific microsatellite mutation models along with a more general backend to add more.

The generalized parameterization I've chosen here follow from this nice paper from Rasmus Nielsen's group. It offers plenty of flexibility and room for growth. I've currently implemented just 3 of the models in that paper, but it will be trivial to add more.

I've implemented these as a subclass of MatrixMutationModel, and so I'm translating the CTMC representation from the original paper into the scaled, jump process transition matrix that is used in MatrixMutationModel. I hope I got the details correct here but may be off by a scalar?

Hoping to get some feedback as to what others would like to see here and what clean up to do.

i should add-- this would close #2013

@codecov
Copy link

codecov bot commented Jun 15, 2022

Codecov Report

Merging #2071 (5b3294a) into main (8c1b8a6) will increase coverage by 0.05%.
The diff coverage is 97.77%.

❗ Current head 5b3294a differs from pull request most recent head 3b57f82. Consider uploading reports for the commit 3b57f82 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2071      +/-   ##
==========================================
+ Coverage   91.29%   91.35%   +0.05%     
==========================================
  Files          20       20              
  Lines       10803    10893      +90     
  Branches     2246     2273      +27     
==========================================
+ Hits         9863     9951      +88     
- Misses        514      515       +1     
- Partials      426      427       +1     
Flag Coverage Δ
C 91.35% <97.77%> (+0.05%) ⬆️
python 98.82% <97.77%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
msprime/__init__.py 100.00% <ø> (ø)
msprime/demography.py 99.31% <ø> (ø)
msprime/mutations.py 98.33% <97.77%> (-0.24%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dafdcfe...3b57f82. Read the comment docs.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! My only comments are around the interface really, really nice that we can construct this based on the general matrix code.

docs/mutations.md Outdated Show resolved Hide resolved
docs/mutations.md Show resolved Hide resolved
msprime/__init__.py Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
msprime/mutations.py Show resolved Hide resolved
@jeromekelleher
Copy link
Member

@petrelharp - any thoughts on the above? What do you think of my ideas for the interface?

docs/mutations.md Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
@andrewkern
Copy link
Member Author

okay, I think this is all cleaned up and ready to go. I've added a bunch more tests to verify the transition matrices, cleaned up docs, refactored stuff according to the nice design suggestions (thanks!), and rebased and squashed.

@jeromekelleher
Copy link
Member

LGTM, thanks @andrewkern! Can you update the CHANGELOG please? We're good to merge then.

I'll open an issue to track statistical validation of these sims.

@andrewkern
Copy link
Member Author

happy to, but i've never messed with the CHANGELOG before. Should I just add this under new features on a new 1.2.1 release?

@jeromekelleher
Copy link
Member

Sure. Just stick in 2022-xx-xx as the date.

@andrewkern
Copy link
Member Author

done

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

It's nice to have defaults, but we need to document what they are and have a good justification for setting them to that value.

Also hard-coding the defuaults into the signatures means that we have no flexibility later in terms of changing the defaults depending on what the other inputs are. The general pattern is to use param=None in the signature to avoid painting ourselves into corners (unless it's totally clear that there'll never be another other params). Also it avoids having to put the hard coded params into several functions.

However, we can make an issue to track this if you'd rather merge this now?

@jeromekelleher
Copy link
Member

On second thoughts, let's just merge and update. Opened #2074 to track.

docs/mutations.md Outdated Show resolved Hide resolved
docs/mutations.md Outdated Show resolved Hide resolved
docs/mutations.md Outdated Show resolved Hide resolved
docs/mutations.md Outdated Show resolved Hide resolved
docs/mutations.md Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Outdated Show resolved Hide resolved
Comment on lines 452 to 454
# this test might seem a little loosey goosey
# but the scaling is create numerical noise i think
assert np.abs(exp - obs) < 1e-2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh-oh: if you actually need 1e-2 for this to pass, then something is wrong here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will recheck this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was not an issue it turned out

msprime/mutations.py Outdated Show resolved Hide resolved
tests/test_mutations.py Show resolved Hide resolved
@jeromekelleher
Copy link
Member

OK, removing the AUTOMERGE then since we should capture @petrelharp's comments in this PR.

FYI @andrewkern, I pushed to this branch to update the CHANGELOG date (the date of release was set to today, which might be confusing on the "current" docs build). This means you'll need to fetch from your origin and reset --hard. Or, just force push over the change I made and we can do it again, no biggie.

@andrewkern
Copy link
Member Author

Lots of minor nitpicks (apologies). But: if this is going to be used for simulating CODIS loci I think we should have more exhaustive tests for correctness. Let's see:

  • test that alleles are actually between lo and hi (inclusive)
  • test that rates are 0 for non-adjacent ones in SMM
  • test that stationary distribution is uniform when we say it is
  • write tests for ratios of expressions for EL2 without copy-pasting the helper functions (I've suggested how to do this for TPM
  • test that the sum of qij for j > i divided by the sum for j < i is equal to to (1 - alpha)/alpha
  • test that the sum of qij for i \neq j divided by the same thing for i = lo is equal to beta

And, I think that tests everything?

Okay I think I've not got all these tests in, EXCEPT for the last one which I'm confused on-- I don't think that's true for all models where s = 0? might need coffee

tests/test_mutations.py Outdated Show resolved Hide resolved
Comment on lines 404 to 405
if np.abs(exp - obs) > 1e-6:
raise ValueError(f"failed test on row: {i} exp:{exp} obs:{obs}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops - left over from debugging?

Suggested change
if np.abs(exp - obs) > 1e-6:
raise ValueError(f"failed test on row: {i} exp:{exp} obs:{obs}")
assert np.abs(exp - obs) < 1e-6

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup

Copy link
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!! - just a couple of minor things (aren't checking for validity of a few parameters).

@andrewkern
Copy link
Member Author

okay this should be all set i think @petrelharp @jeromekelleher?

note I haven't edited the CHANGELOG further. not sure if i need to?

@andrewkern
Copy link
Member Author

andrewkern commented Jun 23, 2022

well darn now i need to hit these new tests with tests... lol. brb

@andrewkern
Copy link
Member Author

okay this is squashed, rebased, and ready to merge I think

Comment on lines 356 to 357
if u < 0 or u > 1:
raise ValueError("u must be between 0 and 1 inclusive")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be non-inclusive. Otherwise, some wierd things can happen, e.g. the chain is not ergodic:

>>> mm = msprime.MicrosatMutationModel(u=0, lo=3, hi=6, v=-10)
>>> mm.transition_matrix
array([[1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]])
>>> mm.root_distribution
array([1., 0., 0., 0.])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can change this but i think it conflicts with the original paper
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, you get a valid Markov chain for the endpoints, but where one or both of the endpoints are absorbing, and so you'd get a situation where the root distribution draws with probability 1 from an absorbing state, and so you'd never get any non-silent mutations. That seems like not what anyone would actually want? And, the derived models have 0 < u < 1? There's no harm in leaving it 0 <= u <= 1, I guess, but I'd vote to restrict it.

@petrelharp
Copy link
Contributor

Terribly sorry, but I have one more change (I think it should be 0 < u < 1 not <=), and there's two pending suggestions?

@andrewkern
Copy link
Member Author

pending suggestions should be all set in my commits-- i didn't use the GH interface to incorporate them, but let me know if i missed something

msprime/mutations.py Outdated Show resolved Hide resolved
@petrelharp
Copy link
Contributor

Here's the two missing ones. I think they're important because they say what precisely what the mutation rates are.
Screenshot from 2022-06-23 21-58-11
Screenshot from 2022-06-23 21-57-47

@petrelharp
Copy link
Contributor

petrelharp commented Jun 24, 2022

Oh! And another important thing - where did the defaults come from? Edit: oh right, there's a different issue filed for that.

@andrewkern
Copy link
Member Author

okay got the suggestions. will post a separate commit for limiting u

@andrewkern
Copy link
Member Author

okay got a commit in that will hopefully close #2074?

@@ -422,8 +422,8 @@ class TPM(MicrosatMutationModel):
For this model both `p` and `m` need to be set to < 1.0.

This is a :class:`.MicrosatMutationModel` with alleles ``[lo, .. , hi]``.
For a precise definition of the mutation rates, see :class:`.MicrosatMutationModel`
with ``s=0``, ``u=0.5``, and ``v=0``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, why'd you take this out? This is necessary information for understanding what the mutation rates are?

So - I think that someone needs to be able to work out, exactly, from the docs, what the mutation rate from j to k is, at a given mutation_rate. To do that, we need to say this stuff about calculating qij and dividing by the max, excetera. I figure we can for this model just refer to the general class where it's explained, but for people to be able to do that they need to know how this one is derived from the general class. I guess that people can figure this out from Sainudiin, and find the expression for qij people need to look in Sainudiin anyhow, but it's not completely obvious, and why not just say what these conditions are?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. I thought it was more precise to say Default parameterization as estimated in Sainudiin et al. (2004). but I'll keep in this bit above as well as my addition

msprime/mutations.py Outdated Show resolved Hide resolved
implemented generalized model of microsatellite evolution

better doi for TPM

more microsatellite model stuff and some documentation

rescaled things at Peter's suggestion

Update docs/mutations.md

simple commits from review

Co-authored-by: Jerome Kelleher <jk@well.ox.ac.uk>

refactored design to include MicrosatMutationModel class

Update msprime/mutations.py

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

Peter's docstring suggestions included

added tests of transition matrix; further clean up to docs

one last test

added another example to the docs

edited CHANGELOG

Apply suggestions from code review

suggestions from @petrelharp

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

further tests and cleanup from code review

linting wasn't installed...

weird insertion throwing a bug

fixed doc bug introduced during code review

found bug in rescaling

test of betas in trans mat

parameterize testing

added test of internal allele beta ratios

add vscode to gitignore

comments from peter

tests on tests

Apply suggestions from code review

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

changed bounds on u to be open interval

better doc string on defaults for TPM and EL2 models

further doc string edits

further doc string edits; part 2
@andrewkern
Copy link
Member Author

@jeromekelleher, @petrelharp and I finished off these doc strings this am. this should be ready to merge

@petrelharp petrelharp merged commit f717961 into tskit-dev:main Jun 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

adding microsatellite mutation models
3 participants