Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

removing fixed mutations #260

Open
petrelharp opened this issue Jul 10, 2019 · 25 comments
Open

removing fixed mutations #260

petrelharp opened this issue Jul 10, 2019 · 25 comments
Labels
enhancement New feature or request

Comments

@petrelharp
Copy link
Contributor

We can at times end up with a large number of fixed mutations. It would be nice to have a way to remove these. I think we discussed having simplify() do this, but it does not:

ts = msprime.simulate(10, recombination_rate=2, mutation_rate=2)
sts = ts.simplify([0,1])
sts.genotype_matrix()
### 
array([[1, 1],
       [1, 1],
       [1, 1],
       [1, 0],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1]], dtype=uint8)

This would require a bit of bookkeeping (e.g. updating the ancestral state at sites that still have a segregating mutation). Not sure whether to propose this as an argument to simplify (filter_fixed_mutations) or as its own function. This is not urgent, as far as I know, but if we were to change the default behavior of simplify, it would be nice to do it soon.

@petrelharp petrelharp changed the title removing fixed sites removing fixed mutations Jul 10, 2019
@jeromekelleher
Copy link
Member

I remember going through this, and it turns out to be quite tricky in the general case of lots of mutations at a site to know that it's fixed. I think the semantics that we've ended up with are good, in that we provide the option to remove objects that have no references (sites, populations, etc). Fixed sites are quite a different thing.

I think we can do it with a function easily enough though? Should be fast enough in Python, since we can do in in O(#mutations) per site, using the counting logic as we have in the site general stats algorithm.

@petrelharp
Copy link
Contributor Author

petrelharp commented Jul 11, 2019

Here's a python version, is this what you had in mind?

def remove_monomorphic(sts)
  tables = sts.tables
  tables.sites.clear()
  tables.mutations.clear()
  n = sts.num_samples
  for t in sts.trees(sample_counts=True):
     for s in t.sites():
        visited = False
        for m in s.mutations:
           k = t.num_samples(m.node)
           if k > 0 and k < n:
               if not visited:
                   visited = True
                   sid = tables.sites.add_row(s.position, s.ancestral_state, metadata=s.metadata)
               tables.mutations.add_row(sid, m.node, m.derived_state, parent=-1, metadata=None)
  tables.compute_mutation_parents()
  return tables.tree_sequence()

nts = remove_monomorphic(sts)
nts.genotype_matrix()

###
array([[1, 0],
       [0, 1],
       [1, 0],
       [0, 1]], dtype=uint8)

@jeromekelleher
Copy link
Member

Basically, yeah. But this won't handle funky situations where we have (say) n mutations to 1 over the leaves, will it? But, we should be able to build an allele count table like we do in general_site_stats and reason using that, right?

@petrelharp
Copy link
Contributor Author

True, but I don't think we want to remove those mutations.

@jeromekelleher
Copy link
Member

Really? The site is still monomorphic right? What's the difference?

@petrelharp
Copy link
Contributor Author

Well, in that case everyone is IBS but not IBD. They're different cases, even if you can't tell from the sequence.

@jeromekelleher
Copy link
Member

I see that, but it's a statement about the sequences we're making, not the ancestry I would have thought. The semantics will get weird and strained if we don't I think (remove_monomorphic_if_single_mutation).

@petrelharp
Copy link
Contributor Author

So, I don't mind if there's a few fixed mutations in there, especially if they are due to multiple mutations at the same site. The main reason I wanted this is because after a long SLiM simulatoin, a large portion of the tree sequence file can be due to fixed mutations, even after simplification (they are still there because of the initial generation). So, this is purely pragmatic, about file size. But, I don't want this to discard the still-segregating-but-invisibly multiple IBS mutations, because that is discarding information relevant to polymorphism in the population. I can imagine someone wanting the method that you're talking about, too, but I don't think it's important because a bit of bookkeeping can be used to ignore those.

@petrelharp
Copy link
Contributor Author

For instance, with this SLiM recipe (100 individuals for 100,000 generations):

initialize()
{
    setSeed(23);
    initializeTreeSeq();
    initializeMutationRate(1e-8);
    initializeMutationType("m1", 0.5, "f", 0.0);
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 1e8-1);
    initializeRecombinationRate(1e-8);
}

1 early() {
    sim.addSubpop("p1", 100);
}

100000 {
    sim.treeSeqOutput("fixed.trees");
    catn("Done.");
    sim.simulationFinished();
}

we see that 97.6% of the mutations are fixed:

import pyslim

ts = pyslim.load("fixed.trees")
sts = ts.simplify()

freqs = ts.general_stat(np.ones(ts.num_samples).reshape((ts.num_samples, 1)), lambda x: x, windows="sites", polarised=True, span_normalise=False).reshape((ts.num_sites,))

np.mean(freqs == 200)

... and in Anastasia's simulations, this makes the metadata column longer than is allowed.

ps. we should make an "allele frequency" function, which we can do as above.

@jeromekelleher
Copy link
Member

Ah, I see. Should we add a filter_fixed_simple_sites (or something) option to simplify then? I think catching the single mutation fixed sites is pretty easy? I don't want to call it monomorphic or whatever as I think this is misleading in the multiple mutation case.

@jeromekelleher
Copy link
Member

ps. we should make an "allele frequency" function, which we can do as above.

+1

@petrelharp
Copy link
Contributor Author

Just filter_fixed_mutations should do it? (i.e., fixed mutations, not fixed alleles...)

@jeromekelleher
Copy link
Member

Just filter_fixed_mutations should do it? (i.e., fixed mutations, not fixed alleles...)

... Yeah, OK. But this implies that we only remove the mutation, not the site. However, if this is applied before filter_sites, we'd remove the site because any fixed mutations would have already been filtered out.

By fixed mutation, we mean there is exactly one root and there is a mutation over it. What if there are multiple mutations over the root? Do remove all of them, or do we leave them be? (Seems best to remove all of them?)

@petrelharp
Copy link
Contributor Author

Hm, well we don't necessarily remove the site, since what if there's a polymorphic mutation at it still...

Do we remove all of them, or do we leave them be? (Seems best to remove all of them?)

All of them!

@jeromekelleher
Copy link
Member

Hm, well we don't necessarily remove the site, since what if there's a polymorphic mutation at it still...

We'll only remove it if filter_sites is True and the number of mutations after filter_fixed_mutations has been applied is zero.

OK, this sounds like a plan then? Who wants to implement it?

@petrelharp
Copy link
Contributor Author

Over in #361 (which I suggest we merge into this one), @hyanwong suggests that we make it possible to filter not only fixed mutations (ie ones that all samples have) but more generally, mutations above nodes with no parents. We don't want to do only this, since in some applications we definately want to keep these (for instance, in slim pre-recapitation), but this could be another option. I suspect we could implement this at the same time, but haven't thought through the details.

@jeromekelleher
Copy link
Member

Yes, this should be doable I think, and I agree it's a good option to have.

@jeromekelleher
Copy link
Member

Is this still something we want @petrelharp?

@jeromekelleher jeromekelleher added the enhancement New feature or request label Sep 29, 2020
@petrelharp
Copy link
Contributor Author

Yes! Sorry I haven't got to it.

@hyanwong
Copy link
Member

Just to return to this as @jeromekelleher noticed a load of fixed mutations in our workbooks after simplification, and thought it was odd. So that's another vote to implement this functionality.

hyanwong added a commit to hyanwong/genealogy_workshop that referenced this issue Jun 16, 2022
@jeromekelleher
Copy link
Member

Huh, I thought we had an option for this in simplify. It's probably quite tricky to get right in the general case, I'm certain I thought about this at the time.

@petrelharp
Copy link
Contributor Author

You did, and it was tricky, but it's probably a good bit easier now - we've got more tools.

@hyanwong
Copy link
Member

The filter_all idea in #2681 suggests that all the filter_XXX args are tweaked at the same time. Currently filter_X only exists where X is a table. So I wonder if filter_fixed_mutations is the right name?

@hyanwong
Copy link
Member

Just returning to this, there is some thinking to do for the case of multiple roots, right? What if there are two roots, and identical mutations over each? I suspect in this case we don't remove the mutation. So this function will only work when there is a single root in the tree?

@petrelharp
Copy link
Contributor Author

That sounds good to me - the mutations aren't fixed, from the point of view of the trees; if someone wants to filter on allele frequency they can do that another way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants