Issues with `mutate` on same geneology sampled at two time points #38

DrK-Lo · 2019-05-08T09:42:25Z

We are conducting some climate change simulations, and are having some issues with overlaying mutations on the same genealogy sampled at two different timepoints.

We run the simulation for a while in SliM, output the .trees file at the first timepoint (T1), and then read this back into SliM and run the simulation again under the climate change scenario to output the second timepoint (T2).

When we recaptitate both trees, we get the same tree topology, but when we try overlaying neutral mutations to both trees (with the same seed) we get completely different genomes with almost no overlap among the location of variants. (In traditional forward time simulations we see many shared variants in the two timepoints, so this result was not an artifact of the simulation.) For full details including links to the files needed to reproduce the problem see here:

https://github.com/TestTheTests/TTT_Offset_Vulnerability_GF_Sims/blob/master/Notebook/2019_05_03_Mutate_Recap_notes.md

For now, we just want to understand better how "mutate" works and why this problem is happening.

petrelharp · 2019-05-14T20:39:40Z

Sorry, didn't see this earlier. So, mutate() is very straightforward. Since it's an infinite sites model, it just steps through the edges, drawing a Poisson(A) number of mutations for each edge, where A is equal to the mutation rate multiplied by the sequence length of the edge (right - left) multiplied by the time span of the edge (parent.time minus child.time). Since it's infinite-sites, it does this independently for each edge.

If I understand correctly, there should be many edges in common between T1 and T2 (like, a bunch of T1 should be the upper bit of T2), so you'd like to get those mutations being the same. They're not, because you haven't got the same edges in the same order. Is that right?

The way that I would do this is instead of saving out the simulation at T1, I would just Remember all the individuals alive at that time, then keep going to T2 and save then. That way you get a single tree sequence with both time points in it. I'm about to update pyslim (in a few hours) to a version that has a function ts.individuals_at( ), that you can use to get all the individuals alive at a particular time, which should be useful for working with the resulting tree sequence. How's that sound?

DrK-Lo · 2019-05-15T14:38:25Z

Thanks Peter, we didn't check if the orders of the edges were different between the two timepoints, so that was probably the issue.

Your solution sounds like it should work. So we would: (i) output at T2, (ii) recapitate, (iii) mutate, then (iv) use ts.individuals_at(T1) to get the individuals at that timepoint. Then we can output the individual genomes at both timepoints for our analysis.

petrelharp · 2019-05-15T17:16:35Z

Yep! If this sounds good, then will you close this issue? Ping me if you would like further input on making this happen.

DrK-Lo · 2019-05-16T19:54:04Z

One more thing to clarify - if there has been a pruning event between T1 and T2, the individuals we sample at T1 would be the ancestors of T2 (and therefore a biased representation of all individuals that were alive at T1)?

petrelharp · 2019-05-16T22:27:15Z

I'm not sure what you mean by "a pruning event" - do you mean a "simplification" of the tree sequence?

I am suggesting putting something like

1000 {
   sim.treeSeqRememberIndividuals(p1.individuals);
}

in your recipe, which would ensure that all individuals in population p1 who were alive in time step 1000 remain in the tree sequence forever. So, you'd substitude T1 for 1000, and you'd call this on all of your populations alive at the time.

akijarl · 2019-05-29T11:42:30Z

Hi Peter,

I'm attempting to implement what you've described with pyslim v. 0.31 ('pip3 install pyslim --upgrade' gives this version), SLiM v. 3.3, and python3 v. 3.7.3;

After running the SLiM simulation to completion with a minimum SLiM code:

initialize() {
initializeTreeSeq();
initializeMutationRate(0);
initializeMutationType("m1", 0.5, "f", 0.0);
initializeGenomicElementType("g1", m1, 1.0);
initializeGenomicElement(g1, 0, 1e8-1);
initializeRecombinationRate(1e-8);
}
1 {
sim.addSubpop("p1", 500);
}

1000 {
   sim.treeSeqRememberIndividuals(p1.individuals);
}

1200 early() {
sim.treeSeqOutput("./recipe_16.1_T2.trees");
}

to generate the file recipe_16.1_T2.trees, I'm attempting to access the individuals at Generation 1000 with the following:

import pyslim, msprime

ts = pyslim.load("recipe_16.1_T2.trees")

recap = ts.recapitate(recombination_rate = 1e-08, Ne=500, random_seed=1)

mutated = msprime.mutate(recap, rate=1e-7, random_seed=1, keep=True)

mutated.individuals_at(1000)

but the last command brings the error "object has no attribute 'individuals_at'"

Attempting ts.individuals_at(1000) gives the same error.

Is the problem the version of pyslim I'm getting from pip?

petrelharp · 2019-05-30T06:13:27Z

Apologies, that's because the function is called individuals_alive_at (I opted for a better name).

akijarl · 2019-05-30T19:42:18Z

Hi Peter,

That is a better name, and thank you once again very much for all your help, but I'm afraid I'm still having some troubles with this.

Here's what I'm running:

import pyslim, msprime

T2 = pyslim.load("recipe_16.1_T2.trees") #Same file as linked above

recapT2 = T2.recapitate(recombination_rate = 1e-08, Ne=500, random_seed=1)
mutatedT2 = msprime.mutate(recapT2, rate=1e-7, random_seed=1, keep=True)

#mutatedT2.individuals_alive_at(1000) #Doesn't work for recapped/mutated objects because they're TreeSequence objects, not SlimTreeSequence

#ts_new=pyslim.annotate_defaults(mutatedT2, model_type="nonWF",slim_generation=1200) #Gives an Assertion Error: line 594, in _set_nodes_individuals, assert(u >= 0)

T2.individuals_alive_at(1000) #Returns an empty array "array([], dtype=int64)", so I can't subset the original T2 file

Could you possibly show me an example of how this should be going?

petrelharp · 2019-06-03T19:29:58Z

To make a tree sequence back into a pyslim.SlimTreeSequence, you just need to use the pyslim.SlimTreeSequence( ) function:

mutatedT2 = pyslim.SlimTreeSequence(msprime.mutate(recapT2, rate=1e-7, random_seed=1, keep=True))

So that you didn't have to do this, we could define a pyslim.mutate method that is just a very thin wrapper around msprime.mutate( ), but this seems almost as confusing.

After I do that, then mutatedT2.individuals_alive_at(1000) works, although no-one is alive at that time:

>>> set(mutatedT2.individual_times)
{1200.0, 1.0, 201.0}

I don't know why the annotate_defaults gives an error, but that's totally not what you want to do there - the tree sequence already has the annotations, and annotate_defaults adds new ones.

akijarl · 2019-06-28T13:37:01Z

Hi Peter,

Trying to output both time point 1 and time point 2 as VCF files to compare their mutation shifts is still giving difficulties.

I run this code in SLiM 3.3

initialize() {
initializeTreeSeq();
initializeMutationRate(0);
initializeMutationType("m1", 0.5, "f", 0.0);
initializeGenomicElementType("g1", m1, 1.0);
initializeGenomicElement(g1, 0, 1e8-1);
initializeRecombinationRate(1e-8);
}
1 {
sim.addSubpop("p1", 500);
}

1000 {
   sim.treeSeqRememberIndividuals(p1.individuals);
}

1200 early() {
sim.treeSeqOutput("./recipe_16.1_T2.trees");
}

This, as I understand it is supposed to save the individuals at Gen. 1000, as well as the individuals at Gen. 1200 so that we can compare them down the line.

Next I run the python 3 script below:

import pyslim, msprime

T2 = pyslim.load("recipe_16.1_T2.trees") #Same file as linked above

recapT2 = T2.recapitate(recombination_rate = 1e-08, Ne=500, random_seed=1)
mutatedT2 = pyslim.SlimTreeSequence(msprime.mutate(recapT2, rate=1e-7, random_seed=1, keep=True))

mutatedT2.individuals_alive_at(1000) #Returns an empty array, not sure why these are empty
mutatedT2.individuals_alive_at(1200) #Returns an empty array

set(mutatedT2.individual_times)
{0.0, 200.0, 1199.0}

with open("T1.vcf", "w") as vcf_file:
     mutatedT2.individuals_alive_at(1000).write_vcf(vcf_file,2)

Attempting to export the recapped and mutated files as VCFs returns the error: "'numpy.ndarray' object has no attribute 'write_vcf'" and attempting this with an msprime.mutate file obviously get the "'TreeSequence' object has no attribute 'individuals_alive_at'" error.

The "individuals_alive_at" seems to not return any individuals and we're unable to export them anyway to a VCF file. Is the issue in SLiM not saving the individuals or msprime not registering them?

petrelharp · 2019-06-28T15:36:59Z

The "individuals_alive_at" seems to not return any individuals and we're

unable to export them anyway to a VCF file. Is the issue in SLiM not saving the individuals or msprime not registering them? Sigh: this is an annoying gotcha. This is because of the unfortunate difference between time in slim and "time" in msprime, which is really "time ago". The time argument to individuals_alive_at is "time ago", i.e., time since the end of the simulation:

individuals_alive_at(self, time) Returns an array giving the IDs of all individuals that are known to be

alive at the given time ago. This is determined by seeing if their age So, I think that what you want is mutatedT2.individuals_alive_at(200) # at slim time step 1000 = 1200 - 200 but you should probably double-check what times are actually present, by looking at set(T2.individual_times) and perhaps also set(T2.individual_ages), if this is a nonWF model. Hope that clears things up?

akijarl · 2019-07-01T13:38:49Z

Almost there but still circling the drain a bit.

set(mutatedT2.individual_times)

shows that the times present are

{0.0, 200.0, 1199.0}

so there should be individuals to export for a VCF file at time 0 (Gen. 1200), time 200 (Gen. 1000), and time 1199 (sim. starting population).

However individuals_alive_at seems to be returning an empty array no matter the time

>>> T2.individuals_alive_at(200) #Also tried with decimals, e.g 0.0, 200.0, etc.
array([], dtype=int64)

>>>mutatedT2.individuals_alive_at(200)
array([], dtype=int64)

But most interestingly, when exporting T2 as a VCF file

with open("T2.vcf", "w") as vcf_file:
     mutatedT2.write_vcf(vcf_file,2)

it exports a VCF file with information for 1000 individuals, as opposed to the 500 specified in the SLiM simulation. So it seems to be combining the individuals from T1 and T2 in the output file.

>akijarl$ vcftools --vcf T2.vcf 
...
After filtering, kept 1000 out of 1000 Individuals

So at least the individuals do in fact exists in the mutatedT2 file, they're just not separated by time...
Seems like I should be able to subset the file with .individuals_alive_at() but it keeps returning empty arrays. Is there another way to partition them by time?

petrelharp · 2019-07-01T15:47:02Z

Oh, dear, this is actually a bug, many apologies. I sure thought that individuals_alive_at( ) was getting tested with WF models, but I guess not. See #43 for a description of the problem. A simple fix is to just use T2.individual_times instead, for instance:

import numpy as np
# a list of those individuals alive 200 generations ago
np.where(T2.individual_times == 200)[0]

I've slightly lost track of what you want to do with that that list of individuals; recall that this is individuals, not genomes, so to e.g. simplify down to just those you need to first extract their node ids.

petrelharp · 2019-07-01T16:03:08Z

I've fixed that bug, now, so if you reinstall from git you could continue to use individuals_alive_at( ). Again, terribly sorry.

petrelharp · 2019-07-01T16:04:46Z

Also, this issue may have strayed beyond its original purpose. I am going to close it, but feel free to reopen if appropriate, or to open a new one for different issues.

petrelharp mentioned this issue Jul 1, 2019

individuals_alive_at( ) does not work with WF models #43

Closed

petrelharp closed this as completed Jul 1, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues with `mutate` on same geneology sampled at two time points #38

Issues with `mutate` on same geneology sampled at two time points #38

DrK-Lo commented May 8, 2019

petrelharp commented May 14, 2019

DrK-Lo commented May 15, 2019

petrelharp commented May 15, 2019

DrK-Lo commented May 16, 2019

petrelharp commented May 16, 2019

akijarl commented May 29, 2019 •

edited

Loading

petrelharp commented May 30, 2019

akijarl commented May 30, 2019 •

edited

Loading

petrelharp commented Jun 3, 2019

akijarl commented Jun 28, 2019

petrelharp commented Jun 28, 2019 via email

akijarl commented Jul 1, 2019

petrelharp commented Jul 1, 2019

petrelharp commented Jul 1, 2019

petrelharp commented Jul 1, 2019

Issues with mutate on same geneology sampled at two time points #38

Issues with mutate on same geneology sampled at two time points #38

Comments

DrK-Lo commented May 8, 2019

petrelharp commented May 14, 2019

DrK-Lo commented May 15, 2019

petrelharp commented May 15, 2019

DrK-Lo commented May 16, 2019

petrelharp commented May 16, 2019

akijarl commented May 29, 2019 • edited Loading

petrelharp commented May 30, 2019

akijarl commented May 30, 2019 • edited Loading

petrelharp commented Jun 3, 2019

akijarl commented Jun 28, 2019

petrelharp commented Jun 28, 2019 via email

akijarl commented Jul 1, 2019

petrelharp commented Jul 1, 2019

petrelharp commented Jul 1, 2019

petrelharp commented Jul 1, 2019

Issues with `mutate` on same geneology sampled at two time points #38

Issues with `mutate` on same geneology sampled at two time points #38

akijarl commented May 29, 2019 •

edited

Loading

akijarl commented May 30, 2019 •

edited

Loading