Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with mutate on same geneology sampled at two time points #38

Closed
DrK-Lo opened this issue May 8, 2019 · 15 comments
Closed

Issues with mutate on same geneology sampled at two time points #38

DrK-Lo opened this issue May 8, 2019 · 15 comments

Comments

@DrK-Lo
Copy link

DrK-Lo commented May 8, 2019

We are conducting some climate change simulations, and are having some issues with overlaying mutations on the same genealogy sampled at two different timepoints.

We run the simulation for a while in SliM, output the .trees file at the first timepoint (T1), and then read this back into SliM and run the simulation again under the climate change scenario to output the second timepoint (T2).

When we recaptitate both trees, we get the same tree topology, but when we try overlaying neutral mutations to both trees (with the same seed) we get completely different genomes with almost no overlap among the location of variants. (In traditional forward time simulations we see many shared variants in the two timepoints, so this result was not an artifact of the simulation.) For full details including links to the files needed to reproduce the problem see here:

https://github.com/TestTheTests/TTT_Offset_Vulnerability_GF_Sims/blob/master/Notebook/2019_05_03_Mutate_Recap_notes.md

For now, we just want to understand better how "mutate" works and why this problem is happening.

@petrelharp
Copy link
Contributor

Sorry, didn't see this earlier. So, mutate() is very straightforward. Since it's an infinite sites model, it just steps through the edges, drawing a Poisson(A) number of mutations for each edge, where A is equal to the mutation rate multiplied by the sequence length of the edge (right - left) multiplied by the time span of the edge (parent.time minus child.time). Since it's infinite-sites, it does this independently for each edge.

If I understand correctly, there should be many edges in common between T1 and T2 (like, a bunch of T1 should be the upper bit of T2), so you'd like to get those mutations being the same. They're not, because you haven't got the same edges in the same order. Is that right?

The way that I would do this is instead of saving out the simulation at T1, I would just Remember all the individuals alive at that time, then keep going to T2 and save then. That way you get a single tree sequence with both time points in it. I'm about to update pyslim (in a few hours) to a version that has a function ts.individuals_at( ), that you can use to get all the individuals alive at a particular time, which should be useful for working with the resulting tree sequence. How's that sound?

@DrK-Lo
Copy link
Author

DrK-Lo commented May 15, 2019

Thanks Peter, we didn't check if the orders of the edges were different between the two timepoints, so that was probably the issue.

Your solution sounds like it should work. So we would: (i) output at T2, (ii) recapitate, (iii) mutate, then (iv) use ts.individuals_at(T1) to get the individuals at that timepoint. Then we can output the individual genomes at both timepoints for our analysis.

@petrelharp
Copy link
Contributor

Yep! If this sounds good, then will you close this issue? Ping me if you would like further input on making this happen.

@DrK-Lo
Copy link
Author

DrK-Lo commented May 16, 2019

One more thing to clarify - if there has been a pruning event between T1 and T2, the individuals we sample at T1 would be the ancestors of T2 (and therefore a biased representation of all individuals that were alive at T1)?

@petrelharp
Copy link
Contributor

I'm not sure what you mean by "a pruning event" - do you mean a "simplification" of the tree sequence?

I am suggesting putting something like

1000 {
   sim.treeSeqRememberIndividuals(p1.individuals);
}

in your recipe, which would ensure that all individuals in population p1 who were alive in time step 1000 remain in the tree sequence forever. So, you'd substitude T1 for 1000, and you'd call this on all of your populations alive at the time.

@akijarl
Copy link

akijarl commented May 29, 2019

Hi Peter,

I'm attempting to implement what you've described with pyslim v. 0.31 ('pip3 install pyslim --upgrade' gives this version), SLiM v. 3.3, and python3 v. 3.7.3;

After running the SLiM simulation to completion with a minimum SLiM code:

initialize() {
initializeTreeSeq();
initializeMutationRate(0);
initializeMutationType("m1", 0.5, "f", 0.0);
initializeGenomicElementType("g1", m1, 1.0);
initializeGenomicElement(g1, 0, 1e8-1);
initializeRecombinationRate(1e-8);
}
1 {
sim.addSubpop("p1", 500);
}

1000 {
   sim.treeSeqRememberIndividuals(p1.individuals);
}

1200 early() {
sim.treeSeqOutput("./recipe_16.1_T2.trees");
}

to generate the file recipe_16.1_T2.trees, I'm attempting to access the individuals at Generation 1000 with the following:

import pyslim, msprime

ts = pyslim.load("recipe_16.1_T2.trees")

recap = ts.recapitate(recombination_rate = 1e-08, Ne=500, random_seed=1)

mutated = msprime.mutate(recap, rate=1e-7, random_seed=1, keep=True)

mutated.individuals_at(1000)

but the last command brings the error "object has no attribute 'individuals_at'"

Attempting ts.individuals_at(1000) gives the same error.

Is the problem the version of pyslim I'm getting from pip?

@petrelharp
Copy link
Contributor

Apologies, that's because the function is called individuals_alive_at (I opted for a better name).

@akijarl
Copy link

akijarl commented May 30, 2019

Hi Peter,

That is a better name, and thank you once again very much for all your help, but I'm afraid I'm still having some troubles with this.

Here's what I'm running:

import pyslim, msprime

T2 = pyslim.load("recipe_16.1_T2.trees") #Same file as linked above

recapT2 = T2.recapitate(recombination_rate = 1e-08, Ne=500, random_seed=1)
mutatedT2 = msprime.mutate(recapT2, rate=1e-7, random_seed=1, keep=True)

#mutatedT2.individuals_alive_at(1000) #Doesn't work for recapped/mutated objects because they're TreeSequence objects, not SlimTreeSequence

#ts_new=pyslim.annotate_defaults(mutatedT2, model_type="nonWF",slim_generation=1200) #Gives an Assertion Error: line 594, in _set_nodes_individuals, assert(u >= 0)

T2.individuals_alive_at(1000) #Returns an empty array "array([], dtype=int64)", so I can't subset the original T2 file

Could you possibly show me an example of how this should be going?

@petrelharp
Copy link
Contributor

To make a tree sequence back into a pyslim.SlimTreeSequence, you just need to use the pyslim.SlimTreeSequence( ) function:

mutatedT2 = pyslim.SlimTreeSequence(msprime.mutate(recapT2, rate=1e-7, random_seed=1, keep=True))

So that you didn't have to do this, we could define a pyslim.mutate method that is just a very thin wrapper around msprime.mutate( ), but this seems almost as confusing.

After I do that, then mutatedT2.individuals_alive_at(1000) works, although no-one is alive at that time:

>>> set(mutatedT2.individual_times)
{1200.0, 1.0, 201.0}

I don't know why the annotate_defaults gives an error, but that's totally not what you want to do there - the tree sequence already has the annotations, and annotate_defaults adds new ones.

@akijarl
Copy link

akijarl commented Jun 28, 2019

Hi Peter,

Trying to output both time point 1 and time point 2 as VCF files to compare their mutation shifts is still giving difficulties.

I run this code in SLiM 3.3

initialize() {
initializeTreeSeq();
initializeMutationRate(0);
initializeMutationType("m1", 0.5, "f", 0.0);
initializeGenomicElementType("g1", m1, 1.0);
initializeGenomicElement(g1, 0, 1e8-1);
initializeRecombinationRate(1e-8);
}
1 {
sim.addSubpop("p1", 500);
}

1000 {
   sim.treeSeqRememberIndividuals(p1.individuals);
}

1200 early() {
sim.treeSeqOutput("./recipe_16.1_T2.trees");
}

This, as I understand it is supposed to save the individuals at Gen. 1000, as well as the individuals at Gen. 1200 so that we can compare them down the line.

Next I run the python 3 script below:

import pyslim, msprime

T2 = pyslim.load("recipe_16.1_T2.trees") #Same file as linked above

recapT2 = T2.recapitate(recombination_rate = 1e-08, Ne=500, random_seed=1)
mutatedT2 = pyslim.SlimTreeSequence(msprime.mutate(recapT2, rate=1e-7, random_seed=1, keep=True))

mutatedT2.individuals_alive_at(1000) #Returns an empty array, not sure why these are empty
mutatedT2.individuals_alive_at(1200) #Returns an empty array

set(mutatedT2.individual_times)
{0.0, 200.0, 1199.0}

with open("T1.vcf", "w") as vcf_file:
     mutatedT2.individuals_alive_at(1000).write_vcf(vcf_file,2)

Attempting to export the recapped and mutated files as VCFs returns the error: "'numpy.ndarray' object has no attribute 'write_vcf'" and attempting this with an msprime.mutate file obviously get the "'TreeSequence' object has no attribute 'individuals_alive_at'" error.

The "individuals_alive_at" seems to not return any individuals and we're unable to export them anyway to a VCF file. Is the issue in SLiM not saving the individuals or msprime not registering them?

@petrelharp
Copy link
Contributor

petrelharp commented Jun 28, 2019 via email

@akijarl
Copy link

akijarl commented Jul 1, 2019

Almost there but still circling the drain a bit.

set(mutatedT2.individual_times)

shows that the times present are

{0.0, 200.0, 1199.0}

so there should be individuals to export for a VCF file at time 0 (Gen. 1200), time 200 (Gen. 1000), and time 1199 (sim. starting population).

However individuals_alive_at seems to be returning an empty array no matter the time

>>> T2.individuals_alive_at(200) #Also tried with decimals, e.g 0.0, 200.0, etc.
array([], dtype=int64)

>>>mutatedT2.individuals_alive_at(200)
array([], dtype=int64)

But most interestingly, when exporting T2 as a VCF file

with open("T2.vcf", "w") as vcf_file:
     mutatedT2.write_vcf(vcf_file,2)

it exports a VCF file with information for 1000 individuals, as opposed to the 500 specified in the SLiM simulation. So it seems to be combining the individuals from T1 and T2 in the output file.

>akijarl$ vcftools --vcf T2.vcf 
...
After filtering, kept 1000 out of 1000 Individuals

So at least the individuals do in fact exists in the mutatedT2 file, they're just not separated by time...
Seems like I should be able to subset the file with .individuals_alive_at() but it keeps returning empty arrays. Is there another way to partition them by time?

@petrelharp
Copy link
Contributor

Oh, dear, this is actually a bug, many apologies. I sure thought that individuals_alive_at( ) was getting tested with WF models, but I guess not. See #43 for a description of the problem. A simple fix is to just use T2.individual_times instead, for instance:

import numpy as np
# a list of those individuals alive 200 generations ago
np.where(T2.individual_times == 200)[0]

I've slightly lost track of what you want to do with that that list of individuals; recall that this is individuals, not genomes, so to e.g. simplify down to just those you need to first extract their node ids.

@petrelharp
Copy link
Contributor

I've fixed that bug, now, so if you reinstall from git you could continue to use individuals_alive_at( ). Again, terribly sorry.

@petrelharp
Copy link
Contributor

Also, this issue may have strayed beyond its original purpose. I am going to close it, but feel free to reopen if appropriate, or to open a new one for different issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants