removing fixed mutations #260

petrelharp · 2019-07-10T23:42:25Z

We can at times end up with a large number of fixed mutations. It would be nice to have a way to remove these. I think we discussed having simplify() do this, but it does not:

ts = msprime.simulate(10, recombination_rate=2, mutation_rate=2)
sts = ts.simplify([0,1])
sts.genotype_matrix()
### 
array([[1, 1],
       [1, 1],
       [1, 1],
       [1, 0],
       [1, 1],
       [1, 1],
       [1, 1],
       [1, 1]], dtype=uint8)

This would require a bit of bookkeeping (e.g. updating the ancestral state at sites that still have a segregating mutation). Not sure whether to propose this as an argument to simplify (filter_fixed_mutations) or as its own function. This is not urgent, as far as I know, but if we were to change the default behavior of simplify, it would be nice to do it soon.

The text was updated successfully, but these errors were encountered:

jeromekelleher · 2019-07-11T07:36:39Z

I remember going through this, and it turns out to be quite tricky in the general case of lots of mutations at a site to know that it's fixed. I think the semantics that we've ended up with are good, in that we provide the option to remove objects that have no references (sites, populations, etc). Fixed sites are quite a different thing.

I think we can do it with a function easily enough though? Should be fast enough in Python, since we can do in in O(#mutations) per site, using the counting logic as we have in the site general stats algorithm.

petrelharp · 2019-07-11T16:54:27Z

Here's a python version, is this what you had in mind?

def remove_monomorphic(sts)
  tables = sts.tables
  tables.sites.clear()
  tables.mutations.clear()
  n = sts.num_samples
  for t in sts.trees(sample_counts=True):
     for s in t.sites():
        visited = False
        for m in s.mutations:
           k = t.num_samples(m.node)
           if k > 0 and k < n:
               if not visited:
                   visited = True
                   sid = tables.sites.add_row(s.position, s.ancestral_state, metadata=s.metadata)
               tables.mutations.add_row(sid, m.node, m.derived_state, parent=-1, metadata=None)
  tables.compute_mutation_parents()
  return tables.tree_sequence()

nts = remove_monomorphic(sts)
nts.genotype_matrix()

###
array([[1, 0],
       [0, 1],
       [1, 0],
       [0, 1]], dtype=uint8)

jeromekelleher · 2019-07-12T06:38:12Z

Basically, yeah. But this won't handle funky situations where we have (say) n mutations to 1 over the leaves, will it? But, we should be able to build an allele count table like we do in general_site_stats and reason using that, right?

petrelharp · 2019-07-12T09:04:35Z

True, but I don't think we want to remove those mutations.

jeromekelleher · 2019-07-12T09:14:52Z

Really? The site is still monomorphic right? What's the difference?

petrelharp · 2019-07-12T09:54:53Z

Well, in that case everyone is IBS but not IBD. They're different cases, even if you can't tell from the sequence.

jeromekelleher · 2019-07-12T09:58:00Z

I see that, but it's a statement about the sequences we're making, not the ancestry I would have thought. The semantics will get weird and strained if we don't I think (remove_monomorphic_if_single_mutation).

petrelharp · 2019-07-15T15:36:38Z

So, I don't mind if there's a few fixed mutations in there, especially if they are due to multiple mutations at the same site. The main reason I wanted this is because after a long SLiM simulatoin, a large portion of the tree sequence file can be due to fixed mutations, even after simplification (they are still there because of the initial generation). So, this is purely pragmatic, about file size. But, I don't want this to discard the still-segregating-but-invisibly multiple IBS mutations, because that is discarding information relevant to polymorphism in the population. I can imagine someone wanting the method that you're talking about, too, but I don't think it's important because a bit of bookkeeping can be used to ignore those.

petrelharp · 2019-07-15T15:43:42Z

For instance, with this SLiM recipe (100 individuals for 100,000 generations):

initialize()
{
    setSeed(23);
    initializeTreeSeq();
    initializeMutationRate(1e-8);
    initializeMutationType("m1", 0.5, "f", 0.0);
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 1e8-1);
    initializeRecombinationRate(1e-8);
}

1 early() {
    sim.addSubpop("p1", 100);
}

100000 {
    sim.treeSeqOutput("fixed.trees");
    catn("Done.");
    sim.simulationFinished();
}

we see that 97.6% of the mutations are fixed:

import pyslim

ts = pyslim.load("fixed.trees")
sts = ts.simplify()

freqs = ts.general_stat(np.ones(ts.num_samples).reshape((ts.num_samples, 1)), lambda x: x, windows="sites", polarised=True, span_normalise=False).reshape((ts.num_sites,))

np.mean(freqs == 200)

... and in Anastasia's simulations, this makes the metadata column longer than is allowed.

ps. we should make an "allele frequency" function, which we can do as above.

jeromekelleher · 2019-07-15T16:35:05Z

Ah, I see. Should we add a filter_fixed_simple_sites (or something) option to simplify then? I think catching the single mutation fixed sites is pretty easy? I don't want to call it monomorphic or whatever as I think this is misleading in the multiple mutation case.

jeromekelleher · 2019-07-15T16:35:27Z

ps. we should make an "allele frequency" function, which we can do as above.

+1

petrelharp · 2019-07-15T17:00:13Z

Just filter_fixed_mutations should do it? (i.e., fixed mutations, not fixed alleles...)

jeromekelleher · 2019-07-15T18:31:36Z

Just filter_fixed_mutations should do it? (i.e., fixed mutations, not fixed alleles...)

... Yeah, OK. But this implies that we only remove the mutation, not the site. However, if this is applied before filter_sites, we'd remove the site because any fixed mutations would have already been filtered out.

By fixed mutation, we mean there is exactly one root and there is a mutation over it. What if there are multiple mutations over the root? Do remove all of them, or do we leave them be? (Seems best to remove all of them?)

petrelharp · 2019-07-15T19:26:34Z

Hm, well we don't necessarily remove the site, since what if there's a polymorphic mutation at it still...

Do we remove all of them, or do we leave them be? (Seems best to remove all of them?)

All of them!

jeromekelleher · 2019-07-15T20:00:09Z

Hm, well we don't necessarily remove the site, since what if there's a polymorphic mutation at it still...

We'll only remove it if filter_sites is True and the number of mutations after filter_fixed_mutations has been applied is zero.

OK, this sounds like a plan then? Who wants to implement it?

petrelharp · 2019-09-09T17:02:03Z

Over in #361 (which I suggest we merge into this one), @hyanwong suggests that we make it possible to filter not only fixed mutations (ie ones that all samples have) but more generally, mutations above nodes with no parents. We don't want to do only this, since in some applications we definately want to keep these (for instance, in slim pre-recapitation), but this could be another option. I suspect we could implement this at the same time, but haven't thought through the details.

jeromekelleher · 2019-09-11T11:19:40Z

Yes, this should be doable I think, and I agree it's a good option to have.

jeromekelleher · 2020-09-29T15:03:45Z

Is this still something we want @petrelharp?

petrelharp · 2020-09-29T15:12:42Z

Yes! Sorry I haven't got to it.

hyanwong · 2022-06-16T10:15:49Z

Just to return to this as @jeromekelleher noticed a load of fixed mutations in our workbooks after simplification, and thought it was odd. So that's another vote to implement this functionality.

See tskit-dev/tskit#260

jeromekelleher · 2022-06-16T10:38:05Z

Huh, I thought we had an option for this in simplify. It's probably quite tricky to get right in the general case, I'm certain I thought about this at the time.

petrelharp · 2022-06-25T04:21:06Z

You did, and it was tricky, but it's probably a good bit easier now - we've got more tools.

hyanwong · 2023-01-12T11:11:27Z

The filter_all idea in #2681 suggests that all the filter_XXX args are tweaked at the same time. Currently filter_X only exists where X is a table. So I wonder if filter_fixed_mutations is the right name?

hyanwong · 2023-03-30T10:09:03Z

Just returning to this, there is some thinking to do for the case of multiple roots, right? What if there are two roots, and identical mutations over each? I suspect in this case we don't remove the mutation. So this function will only work when there is a single root in the tree?

petrelharp · 2023-03-31T14:04:51Z

That sounds good to me - the mutations aren't fixed, from the point of view of the trees; if someone wants to filter on allele frequency they can do that another way.

petrelharp changed the title ~~removing fixed sites~~ removing fixed mutations Jul 10, 2019

hyanwong mentioned this issue Sep 6, 2019

Filter mutations not on edges during simplify #361

Closed

jeromekelleher added the enhancement New feature or request label Sep 29, 2020

hyanwong added a commit to hyanwong/genealogy_workshop that referenced this issue Jun 16, 2022

Can't currently remove fixed mutations

d68dac3

See tskit-dev/tskit#260

hyanwong mentioned this issue Mar 29, 2023

Add fwd_bck_parents_max_mut_dist to breakpoint table jeromekelleher/sc2ts#141

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

removing fixed mutations #260

removing fixed mutations #260

petrelharp commented Jul 10, 2019

jeromekelleher commented Jul 11, 2019

petrelharp commented Jul 11, 2019 •

edited

Loading

jeromekelleher commented Jul 12, 2019

petrelharp commented Jul 12, 2019

jeromekelleher commented Jul 12, 2019

petrelharp commented Jul 12, 2019

jeromekelleher commented Jul 12, 2019

petrelharp commented Jul 15, 2019

petrelharp commented Jul 15, 2019

jeromekelleher commented Jul 15, 2019

jeromekelleher commented Jul 15, 2019

petrelharp commented Jul 15, 2019

jeromekelleher commented Jul 15, 2019

petrelharp commented Jul 15, 2019

jeromekelleher commented Jul 15, 2019

petrelharp commented Sep 9, 2019

jeromekelleher commented Sep 11, 2019

jeromekelleher commented Sep 29, 2020

petrelharp commented Sep 29, 2020

hyanwong commented Jun 16, 2022

jeromekelleher commented Jun 16, 2022

petrelharp commented Jun 25, 2022

hyanwong commented Jan 12, 2023

hyanwong commented Mar 30, 2023

petrelharp commented Mar 31, 2023

removing fixed mutations #260

removing fixed mutations #260

Comments

petrelharp commented Jul 10, 2019

jeromekelleher commented Jul 11, 2019

petrelharp commented Jul 11, 2019 • edited Loading

jeromekelleher commented Jul 12, 2019

petrelharp commented Jul 12, 2019

jeromekelleher commented Jul 12, 2019

petrelharp commented Jul 12, 2019

jeromekelleher commented Jul 12, 2019

petrelharp commented Jul 15, 2019

petrelharp commented Jul 15, 2019

jeromekelleher commented Jul 15, 2019

jeromekelleher commented Jul 15, 2019

petrelharp commented Jul 15, 2019

jeromekelleher commented Jul 15, 2019

petrelharp commented Jul 15, 2019

jeromekelleher commented Jul 15, 2019

petrelharp commented Sep 9, 2019

jeromekelleher commented Sep 11, 2019

jeromekelleher commented Sep 29, 2020

petrelharp commented Sep 29, 2020

hyanwong commented Jun 16, 2022

jeromekelleher commented Jun 16, 2022

petrelharp commented Jun 25, 2022

hyanwong commented Jan 12, 2023

hyanwong commented Mar 30, 2023

petrelharp commented Mar 31, 2023

petrelharp commented Jul 11, 2019 •

edited

Loading