removing fixed mutations #260
Comments
I remember going through this, and it turns out to be quite tricky, in the general case of many mutations at a site, to know that a site is fixed. I think the semantics that we've ended up with are good, in that we provide the option to remove objects that have no references (sites, populations, etc). Fixed sites are quite a different thing. I think we can do it with a function easily enough though? It should be fast enough in Python, since we can do it in O(#mutations) per site, using the same counting logic as in the general site stats algorithm. |
Here's a python version, is this what you had in mind?
|
Basically, yeah. But this won't handle funky situations where we have (say) n mutations to 1 over the leaves, will it? But, we should be able to build an allele count table like we do in general_site_stats and reason using that, right? |
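To make the allele-count idea above concrete, here is a minimal sketch in plain Python. It is not the code from general_site_stats or from any library; the function name, the parent-dict tree representation, and the (node, derived_state) mutation tuples are all assumptions made for illustration.

```python
from collections import Counter


def allele_counts(parent, samples, ancestral_state, mutations):
    """Count the allele carried by each sample at a single site.

    Hypothetical representation (not a library API):
      parent    -- dict mapping each node to its parent (roots map to None)
      mutations -- list of (node, derived_state), ordered oldest first
    """
    # If several mutations sit on the same node, the youngest one decides
    # the state inherited below that node.
    state_at = {}
    for node, derived in mutations:
        state_at[node] = derived

    counts = Counter()
    for s in samples:
        u = s
        # Walk up until we hit a node carrying a mutation, or leave the tree.
        while u is not None and u not in state_at:
            u = parent[u]
        counts[ancestral_state if u is None else state_at[u]] += 1
    return counts


# Four samples (0-3), two internal nodes (4, 5) and a single root (6).
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}
samples = [0, 1, 2, 3]

# Two independent A->T mutations, one above each half of the tree.
two_hits = allele_counts(parent, samples, "A", [(4, "T"), (5, "T")])
# A single A->T mutation above node 4 only.
one_hit = allele_counts(parent, samples, "A", [(4, "T")])
```

In the first case the table has a single allele, so the site is monomorphic even though no mutation sits above the root, which is the "IBS but not IBD" situation raised in the next comments.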
True, but I don't think we want to remove those mutations. |
Really? The site is still monomorphic right? What's the difference? |
Well, in that case everyone is IBS but not IBD. They're different cases, even if you can't tell from the sequence. |
I see that, but it's a statement about the sequences we're making, not the ancestry I would have thought. The semantics will get weird and strained if we don't I think ( |
So, I don't mind if there's a few fixed mutations in there, especially if they are due to multiple mutations at the same site. The main reason I wanted this is because after a long SLiM simulation, a large portion of the tree sequence file can be due to fixed mutations, even after simplification (they are still there because of the initial generation). So, this is purely pragmatic, about file size. But, I don't want this to discard the still-segregating-but-invisibly-multiple IBS mutations, because that is discarding information relevant to polymorphism in the population. I can imagine someone wanting the method that you're talking about, too, but I don't think it's important because a bit of bookkeeping can be used to ignore those. |
For instance, with this SLiM recipe (100 individuals for 100,000 generations):
we see that 97.6% of the mutations are fixed:
... and in Anastasia's simulations, this makes the
PS: we should make an "allele frequency" function, which we can do as above. |
Ah, I see. Should we add a |
+1 |
Just |
... Yeah, OK. But this implies that we only remove the mutation, not the site. However, if this is applied before By fixed mutation, we mean there is exactly one root and there is a mutation over it. What if there are multiple mutations over the root? Do we remove all of them, or do we leave them be? (Seems best to remove all of them?) |
Hm, well we don't necessarily remove the site, since what if there's a polymorphic mutation at it still...
All of them! |
We'll only remove it if OK, this sounds like a plan then? Who wants to implement it? |
Over in #361 (which I suggest we merge into this one), @hyanwong suggests that we make it possible to filter not only fixed mutations (i.e. ones that all samples have) but more generally, mutations above nodes with no parents. We don't want to do only this, since in some applications we definitely want to keep these (for instance, in SLiM pre-recapitation), but this could be another option. I suspect we could implement this at the same time, but haven't thought through the details. |
Yes, this should be doable I think, and I agree it's a good option to have. |
Is this still something we want @petrelharp? |
Yes! Sorry I haven't got to it. |
Just to return to this as @jeromekelleher noticed a load of fixed mutations in our workbooks after simplification, and thought it was odd. So that's another vote to implement this functionality. |
Huh, I thought we had an option for this in simplify. It's probably quite tricky to get right in the general case, I'm certain I thought about this at the time. |
You did, and it was tricky, but it's probably a good bit easier now - we've got more tools. |
The |
Just returning to this, there is some thinking to do for the case of multiple roots, right? What if there are two roots, and identical mutations over each? I suspect in this case we don't remove the mutation. So this function will only work when there is a single root in the tree? |
That sounds good to me - the mutations aren't fixed, from the point of view of the trees; if someone wants to filter on allele frequency they can do that another way. |
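The rule the thread converges on (a mutation counts as fixed only when the tree has exactly one root and the mutation sits above that root, and all such mutations are removed) can be sketched as follows. This is an illustration in plain Python under assumed data structures, not an implementation of any existing function; the name fixed_mutation_ids and the parent-dict tree are hypothetical.

```python
def fixed_mutation_ids(parent, samples, mutations):
    """Return the ids of mutations that sit above the unique root.

    Hypothetical representation (not a library API):
      parent    -- dict mapping each node to its parent (roots map to None)
      mutations -- list of (mutation_id, node) for one site
    Returns an empty list when the samples reach more than one root.
    """
    roots = set()
    for s in samples:
        u = s
        while parent[u] is not None:
            u = parent[u]
        roots.add(u)

    # With several roots nothing is fixed from the point of view of the
    # tree, even if identical mutations sit above every root.
    if len(roots) != 1:
        return []

    (root,) = roots
    # Every mutation above the single root is reported, not just one.
    return [mut_id for mut_id, node in mutations if node == root]


# One root (node 6): both mutations above it are fixed, the one on node 4 is not.
one_root = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}
single = fixed_mutation_ids(one_root, [0, 1, 2, 3], [(0, 6), (1, 6), (2, 4)])

# Two roots (nodes 4 and 5) with a mutation above each: nothing is reported.
two_roots = {0: 4, 1: 4, 2: 5, 3: 5, 4: None, 5: None}
double = fixed_mutation_ids(two_roots, [0, 1, 2, 3], [(0, 4), (1, 5)])
```

Filtering on allele frequency instead would treat the two-root case as monomorphic, which is why the thread keeps that as a separate operation.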
We can at times end up with a large number of fixed mutations. It would be nice to have a way to remove these. I think we discussed having simplify() do this, but it does not. This would require a bit of bookkeeping (e.g. updating the ancestral state at sites that still have a segregating mutation). Not sure whether to propose this as an argument to simplify (filter_fixed_mutations) or as its own function. This is not urgent, as far as I know, but if we were to change the default behavior of simplify, it would be nice to do it soon.
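As a rough sketch of the bookkeeping mentioned above, the following plain-Python function drops a given set of fixed mutations from one site, moves the site's ancestral state to the state left by the youngest removed mutation, and detaches any remaining mutation whose parent mutation was removed. The function name, the dict layout, and the oldest-first ordering are assumptions for illustration only.

```python
def drop_fixed_mutations(ancestral_state, mutations, fixed_ids):
    """Remove fixed mutations from one site and repair what is left.

    Hypothetical representation (not a library API):
      mutations -- list of dicts with keys "id", "derived_state" and
                   "parent" (id of the parent mutation, or None),
                   ordered oldest first
      fixed_ids -- ids of the mutations to remove
    Returns (new_ancestral_state, kept_mutations).
    """
    fixed_ids = set(fixed_ids)
    new_ancestral = ancestral_state
    kept = []
    for mut in mutations:
        if mut["id"] in fixed_ids:
            # Oldest-first order means the last assignment comes from the
            # youngest removed mutation, which every sample inherits.
            new_ancestral = mut["derived_state"]
        else:
            # A kept mutation whose parent was removed no longer has a
            # mutation above it at this site.
            new_parent = None if mut["parent"] in fixed_ids else mut["parent"]
            kept.append(dict(mut, parent=new_parent))
    return new_ancestral, kept


# Mutation 0 (A->T) is fixed; mutation 1 (T->G) below it is still segregating.
site_mutations = [
    {"id": 0, "derived_state": "T", "parent": None},
    {"id": 1, "derived_state": "G", "parent": 0},
]
new_state, remaining = drop_fixed_mutations("A", site_mutations, {0})
```

After the call the site's ancestral state is "T" and the segregating mutation to "G" has no parent, so the genotypes of the samples are unchanged.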