Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding microsatellite mutation models #2013

Closed
andrewkern opened this issue Feb 26, 2022 · 13 comments · Fixed by #2071
Closed

adding microsatellite mutation models #2013

andrewkern opened this issue Feb 26, 2022 · 13 comments · Fixed by #2071

Comments

@andrewkern
Copy link
Member

i'm hoping to add some simple mutation models that would be appropriate for microsatellite data.
i think this will be quite useful to folks who are working on non-model organisms, ecological genetics,
and maybe even forensics?

The simplest, most classical model is the Stepwise Mutation Model of Kimura and Ohta (1978).
I reckon that's a good place to start, although this nice paper by Calabrese and Durrett
lays out a number of other models that would be simple enough to implement.

Here is how I would imagine the implementation to look for the stepwise microsat model.
generally these models have a lo and hi parameter which governs the bounds on the
number of repeat units of the microsatelllite

class SMM_MicrosatMutationModel(MatrixMutationModel):
    """
    Stepwise mutation model (Kimura and Ohta, 1978), 
    for microsatellite repeats.

    This is a :class:`.MatrixMutationModel` with alleles ``[lo, .. , hi]``,
    root distribution is uniform ``(lo, hi)``, and a transition matrix that reflects
    single-step mutation.

    :param int lo: lower bound of the microsatellite repeat length
    :param int hi: upper bound of the microsatellite repeat length
        See :ref:`??` for more details.
    """

    def __init__(self, lo, hi):
        alleles = list(range(lo, hi+1))
        root_distribution = [1/len(alleles)] * len(alleles)
        transition_matrix = np.zeros((len(alleles), len(alleles)))
        for i in range(len(alleles)):
            if i == 0:
                transition_matrix[i, i+1] = 1
            elif i == len(alleles)-1:
                transition_matrix[i, i-1] = 1
            else:
                transition_matrix[i, i-1] = 0.5
                transition_matrix[i, i+1] = 0.5

        super().__init__(alleles, root_distribution, transition_matrix)

what do others think?

@jeromekelleher
Copy link
Member

LGTM. I'd make hi exclusive just to keep consistent with things like range, but otherwise looks very sensible.

So, to be concrete, the plan is to make alleles equal to [str(j) for j in range(lo, hi)] and then if you have allele j, you can only ever mutate to j - 1 or j + 1?

@andrewkern
Copy link
Member Author

So, to be concrete, the plan is to make alleles equal to [str(j) for j in range(lo, hi)] and then if you have allele j, you can only ever mutate to j - 1 or j + 1?

yep that's the SMM model in a nutshell

@petrelharp
Copy link
Contributor

This sounds fine, although we should make sure to tell people not to set lo and hi to the actual min and max observed in their data. Alternatively, we could modify the infinite alleles or SLiM model code (not use the mutation matrix model), so that we can set hi=Inf (or, something large). It would be straightforward.

@jeromekelleher
Copy link
Member

Alternatively, we could modify the infinite alleles or SLiM model code (not use the mutation matrix model), so that we can set hi=Inf (or, something large). It would be straightforward.

That's not quite the same thing though is it? With infinite alleles you can only every go from j to j + 1, whereas the model above you can go back to j - 1 with equal proba.

@petrelharp
Copy link
Contributor

That's not quite the same thing though is it? With infinite alleles you can only every go from j to j + 1, whereas the model above you can go back to j - 1 with equal proba.

It's totally not the same thing - infinite alleles maintains a global counter for the next allele, and this would look at the previous state to choose a new one, but it provides a template. (so, maybe the SLiM mutation model code is better, as it actually modifies the previous allele)

@jeromekelleher
Copy link
Member

Right, gotcha

@andrewkern
Copy link
Member Author

This sounds fine, although we should make sure to tell people not to set lo and hi to the actual min and max observed in their data. Alternatively, we could modify the infinite alleles or SLiM model code (not use the mutation matrix model), so that we can set hi=Inf (or, something large). It would be straightforward.

another issue is the ancestral distribution. I think uniformly distributed is probably wrong?

@jeromekelleher
Copy link
Member

The matrix method @andrewkern outlined above would work, though, right? Probably a good place to start if someone is doing this rather than diving into the C code immediately.

@andrewkern
Copy link
Member Author

Yeah i think it should work just fine.

@jeromekelleher
Copy link
Member

Unless the number of alleles is 10_000 or something it may be good enough.

@petrelharp
Copy link
Contributor

Good enough, but I don't think anyone's doing it right now, and it'd be nice to not clutter up this list of models with out of date ones? Re: the root distribution - it'd be natural to allow a slight left/right bias and use the stationary distribution (I think there's a Slatkin paper on this?).

@andrewkern
Copy link
Member Author

Good enough, but I don't think anyone's doing it right now, and it'd be nice to not clutter up this list of models with out of date ones?

not sure what you mean? SMM is the most popular microsat model in analyses and simulations AFAIK

@petrelharp
Copy link
Contributor

I mean if we implement another slightly different but better SMM later. But I suppose that allowing hi=Inf is backwards-compatible, as is adding left/right jump probs. So, never mind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants