Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutation model docs #1180

Merged
merged 1 commit into from
Sep 24, 2020
Merged

Conversation

petrelharp
Copy link
Contributor

Here I've added some documentation for

@petrelharp
Copy link
Contributor Author

Whoops - initially I had some commits off a different PR in here. Took those out.

@codecov
Copy link

codecov bot commented Sep 12, 2020

Codecov Report

Merging #1180 into master will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1180   +/-   ##
=======================================
  Coverage   90.05%   90.05%           
=======================================
  Files          27       27           
  Lines       10946    10946           
  Branches     2356     2356           
=======================================
  Hits         9857     9857           
  Misses        547      547           
  Partials      542      542           
Flag Coverage Δ
#C 90.05% <100.00%> (ø)
#python 96.38% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
msprime/mutations.py 98.48% <ø> (ø)
msprime/_msprimemodule.c 87.24% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f486f1...3e9aaff. Read the comment docs.

msprime/mutations.py Outdated Show resolved Hide resolved
docs/api.rst Outdated
.. autoclass:: msprime.HKY

TODO: move to msprime/mutations.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, the following is correct now, I believe. Note that the discussion below is somewhat complicated by the scaling factor M, required to make P a valid transition matrix. It looks like seq-gen avoids this issue by not documenting what the overall mutation rate actually is. Do you know, @GertjanBisschop - what's the overall mutation rate for seq-gen under HKY? Are we testing for equality of total number of mutations or just relative frequencies of different transition types? And, how does this description seem to you - clear? correct?

We could make the discussion below so that the mutation rates are just mu * q[i,j], if we added a "rate scaling factor" to the mtuation model object, and put M in there, transparent to the user. But I think this is probably not worth it - what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great work. Will review this and answer all your questions tonight. Whichever way we decide on the scaling factor M, we will have one up on SeqGen given that this is all now very well documented.

Copy link
Member

@GertjanBisschop GertjanBisschop Sep 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall rate of mutation for seq-gen is given by:
prob_mut = 1.0 - np.diag(transition_matrix)
seq_gen_rate = np.sum(prob_mut * root_distribution) * msprime_rate.
We have tested for equality of the total number of mutations, not just relative frequencies.

All of this is definitely clear. But I would argue against adding the rescaling factor to the mutation model object. If I understand things correctly, you would never allow the user, defining their own MatrixMutationModel to specify any M value, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I would argue against adding the rescaling factor to the mutation model object. I

Oh, good. =)

If I understand things correctly, you would never allow the user, defining their own MatrixMutationModel to specify any M value, right?

The way I'd imagine it would be that the MatrixMutationModel would have a rate_scaling argument ("M"), that gets multiplied by the mutation rate when it is used with mutate( ). So, yes, the user would be able to set this, but only if they are doing their own model. Then, if we defined HKY as a MatrixMutationModel(... rate_scaling=M), then we'd be able to delete all references to M from the documentation here.

@petrelharp
Copy link
Contributor Author

Ok, I think this is good for this PR? There's bigger picture rearranging that needs to happen, pending the API changes, but these can go in there somehow.

@petrelharp
Copy link
Contributor Author

FYI, to review the docs, check out this branch and then make them locally:

git fetch origin pull/1180/head:mutation_model_docs
git checkout mutation_model_docs
cd docs
make

and then look at the html files in docs/_build/html

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @petrelharp!

docs/api.rst Outdated Show resolved Hide resolved
docs/api.rst Show resolved Hide resolved
docs/api.rst Outdated Show resolved Hide resolved
Copy link
Member

@GertjanBisschop GertjanBisschop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still really love the Parameterization of Matrix Mutation Models. This is super informative!

docs/api.rst Outdated
.. autoclass:: msprime.HKY

TODO: move to msprime/mutations.py

Copy link
Member

@GertjanBisschop GertjanBisschop Sep 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall rate of mutation for seq-gen is given by:
prob_mut = 1.0 - np.diag(transition_matrix)
seq_gen_rate = np.sum(prob_mut * root_distribution) * msprime_rate.
We have tested for equality of the total number of mutations, not just relative frequencies.

All of this is definitely clear. But I would argue against adding the rescaling factor to the mutation model object. If I understand things correctly, you would never allow the user, defining their own MatrixMutationModel to specify any M value, right?

docs/api.rst Show resolved Hide resolved
@petrelharp
Copy link
Contributor Author

Should I wait on more comments here, @GertjanBisschop? Otherwise, I'll wait on #1176 and then tidy this up.

@GertjanBisschop
Copy link
Member

No more comments. Sorry @petrelharp, should have made that clearer.

@petrelharp
Copy link
Contributor Author

Ok, I've merged together the docs from #1176. I think this is reasonable, but suggestions of other ways to organize it are welcome.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a massive step forward, thanks @petrelharp! LGTM.

I think this is ready to merge, @benjeffery?

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great @petrelharp. (I'm learning a lot about mutation models!)
Just a few comments.

docs/api.rst Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
docs/api.rst Outdated Show resolved Hide resolved
msprime/mutations.py Outdated Show resolved Hide resolved
transitions and transversions, and sets an equilibrium frequency for each nucleotide.
In addition a custom ancestral frequency (``root_distribution``) can be specified.
With ``kappa=1.0`` and the default values of the other arguments this model is equal
to :class:`JC69MutationModel`. This model is similar to :class:`F84MutationModel`
but with a differing parametrisation for ``kappa``.

See the documentation for more detail on the precise parameterization.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's quite a few places where it says to "see the documentation" Would be nice to say the section name (or URL?) as I'm not sure I'd know where to look otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - I wasn't sure how to handle this, since on RTD we'd want to say "See below"; while in help(msprime.HKYMutationModel) we want to say "see https://msprime.readthedocs.io/en/latest/api.html#msprime.HKYMutationModel"... which on RTD would link to the very same place that the person is reading.

Alternatively, we could just put all of the details, that I've now got after the short description on RTD, into the docstring itself? How about that - or is that too much information?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the second option, I don't have any issue with large docstrings, if they are all useful info.

@petrelharp
Copy link
Contributor Author

Ah, one more thing: the block_size argument to the SLiM model is still undocumented. And, sadly, I still don't know quite what it does. It's the block size for the tsk_blkalloc... but, what happens if it is wrong? If it is too small, will the user get an error? When would the user want to adjust this number?

@jeromekelleher
Copy link
Member

The reasoning for keeping block_size is given here. Basically, we have to have it because in principle the amount of memory each slim mutation requires is unbounded, and we can't alloc more than the block_size in one go using tsk_blkalloc. So, the user shouldn't every need to worry about this unless their model is pretty daft and has 8K worth of mutations stacked on top of each other.

It's too much of a tedious faff to do the memory allocations via direct mallocs, so I think we're stuck with it.

@petrelharp
Copy link
Contributor Author

I read that, but still wasn't sure what happens if it's too small. But, come on peter, it's easy to check... I see: a _msprime.LibraryError: Out of memory.. I'll say "You should not need to change this unless you get a "out of memory" error due to a very large number of stacked mutations."

@jeromekelleher
Copy link
Member

Looks like this is ready to go, other than the discussion about how to refer to the documentation? I don't really have much of an opinion on that, so happy to merge whenever.

@petrelharp
Copy link
Contributor Author

I moved the docs, and fixed up a bit in the F84 documentation that I think was only correct for the case of equal base frequencies. I think this is good to go!

@jeromekelleher
Copy link
Member

Great! Merging.

@mergify mergify bot merged commit 7fbb8fa into tskit-dev:master Sep 24, 2020
@petrelharp petrelharp deleted the mutation_model_docs branch December 15, 2021 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants