Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

recapitate has surprisingly small diversity if WF simulation is started and saved in late #307

Closed
petrelharp opened this issue Jun 14, 2023 · 7 comments

Comments

@petrelharp
Copy link
Contributor

petrelharp commented Jun 14, 2023

As pointed out by @bodkan in bodkan/slendr#141 -
with this as late.slim:

initialize() {
  initializeTreeSeq();
  initializeMutationRate(0);
  initializeMutationType("m0", 0.5, "f", 0.0);
  initializeGenomicElementType("g1", m0, 1.0);
  initializeGenomicElement(g1, 0, 1e+07 - 1);
  initializeRecombinationRate(1e-08);
}
1 late() { // <<<------ NOTE THIS LINE --------
  sim.addSubpop("p0", 1000);
}
1000 late() {
  sim.treeSeqOutput("late.trees");
}

and this as early.slim:

initialize() {
  initializeTreeSeq();
  initializeMutationRate(0);
  initializeMutationType("m0", 0.5, "f", 0.0);
  initializeGenomicElementType("g1", m0, 1.0);
  initializeGenomicElement(g1, 0, 1e+07 - 1);
  initializeRecombinationRate(1e-08);
}
1 early() { // <<<------ NOTE THIS LINE --------
  sim.addSubpop("p0", 1000);
}
1000 late() {
  sim.treeSeqOutput("early.trees");
}

after running

for x in early late; do slim -s 23 ${x}.slim; done

then the following script

import msprime, tskit, pyslim

Ne = 1000    # constant Ne of the population
rho  = 1e-8  # recombination rate

ts_early = tskit.load("early.trees")
ts_late = tskit.load("late.trees")

def recap(ts):
    return pyslim.recapitate(ts, ancestral_Ne=Ne, recombination_rate=rho)

div_early = recap(ts_early).diversity(mode='branch')
div_late = recap(ts_late).diversity(mode='branch')

print(f"early: {div_early}, late: {div_late}")

gets something like

early: 4308.810302, late: 3347.565959

And, this is because the two tree sequences have the same tick value in metadata, but roots at different times ago:

>>> ts_early.metadata['SLiM']['tick'], ts_early.node(ts_early.first().roots[0]).time
(1000, 1000.0)
>>> ts_late.metadata['SLiM']['tick'], ts_late.node(ts_late.first().roots[0]).time
(1000, 999.0)

while pyslim.recapitate contains these lines:

        demography = msprime.Demography.from_tree_sequence(ts)
        # must set pop sizes to >0 even though we merge immediately
        for pop in demography.populations:
            pop.initial_size=1.0
# ...
        # the split has to come slightly longer ago than slim's tick
        # since that's when all the linages are at, and otherwise the event
        # won't apply to them
        demography.add_population_split(
                np.nextafter(
                    ts.metadata['SLiM']['tick'],
                    2 * ts.metadata['SLiM']['tick'],
                ),
                derived=derived_names,
                ancestral=ancestral_name,
        )

So, what's happening is that I'm creating a bottleneck with Ne=1 for 1 generation in the case that we started and saved in late(). For why this happens, see the docs.

@bhaller
Copy link
Collaborator

bhaller commented Jun 14, 2023

Just putting a note here to reflect the chat Peter and I just had. If checking all the trees doesn't take an unreasonable amount of time, then it would be preferable because it would potentially catch some user errors that are not caught just be checking the first tree. @petrelharp will decide which way to go on that, perhaps after asking the tskit folks what they think about the performance of the scan through all the trees. Apart from that, I understand the issue and agree with the fix, and a new pyslim will be rolled with it when possible, and then (once people have a new pyslim version to switch to) we will announce the bug publicly. :-<

@bodkan
Copy link

bodkan commented Jun 21, 2023

Hey @petrelharp,

I saw your note about the release of pyslim v1.0.2 and started working on adding it as a pinned dependency into slendr.

I'm having some trouble with this so I wanted to double check that I'm doing things right.

  1. I created this Python virtual environment:
python -m venv tsreprex
source tsreprex/bin/activate
pip install msprime tskit pyslim==1.0.2 # I *think* this is the correct way to do this?
  1. I then run early.slim and late.slim like this:
for x in early late; do slim -s 23 ${x}.slim; done
  1. But then, when I run the testing Python script from your first message in this issue, I'm still getting something this (difference up to a random seed):
early: 4567.607664007309, late: 2663.6896942803423

So, I think I see the same problem?

Did I somehow mess up in getting the latest pyslim version containing the fix? Embarrassingly enough, although I'm OK with Python the language, I know extremely little about distributing and pinning Python packages, dependencies, etc. 😬

For the record, here's the list of packages in the environment I'm running the Python code in:

> pip list
Package          Version
---------------- -------
attrs            23.1.0
demes            0.2.3
jsonschema       4.17.3
msprime          1.2.0
newick           1.9.0
numpy            1.25.0
pip              22.0.4
pyrsistent       0.19.3
pyslim           1.0.2   <--- this should be the latest fixed version?
ruamel.yaml      0.17.32
ruamel.yaml.clib 0.2.7
setuptools       58.1.0
svgwrite         1.4.3
tskit            0.5.5

And also the SLiM version:

> slim --version
SLiM version 4.0, built Aug 10 2022 10:19:32
Git commit SHA-1: unknown (not defined under Xcode)

Sorry if I missed something obvious. I did scroll through the changes you've made in the associated PR but I hadn't had the time yet to do an in-depth study of what is really going on with the recapitation at the lowest level (where the bug originated). Thanks for your help!

@petrelharp
Copy link
Contributor Author

OMG, fuck this. I did the whole fix to figuring out what time to recapitate at EXCEPT putting it in where it was supposed to go.

THANK YOU for catching that. (Why didn't I run this check earlier?!?!)

@bodkan
Copy link

bodkan commented Jun 22, 2023

Hey @petrelharp , no worries. I'm sorry this whole thing caused so much stress for you. I know all too well how stressful it is to keep track of everything that's going on and make sure everything is right, especially after making changes in old code when upgrading internal bits. And that's under normal circumstances. It's even worse in exceptionally bad situations which can potentially affect other people, while trying to push out a fix as soon as possible.


I can now confirm, after locking pyslim to v1.0.3 inside slendr and re-running the "pure slendr" version of the debugging pipeline (whose "pure SLiM/msprime/tskit/pyslim" version I originally submitted here) that all is well. 🙂

First diversity across three different tree sequences vs popgen theory (the blue boxplot was way too low before):

image

[I don't think I've ever been this happy to see three boxplots aligned next to each other in such a beautifully boring way! 😃]

And, for good measure, another analysis of AFS I did while I was trying to debug this on my own at first. This is what originally showed that it's the counts of the highest-frequency alleles -- so the oldest polymorphisms -- which deviated from theory only in recapitated tree sequences (and we now know why that was). The blue line was completely off for higher frequency alleles -- not anymore though!

image

Also tagging @bhaller who might be happy to see these beautifully consistent results visually confirming the fix. 🙂

(The non-slendr version of the same analysis gives the same results, of course!)

Thanks for all your work @petrelharp!


debugging.R.zip

@bhaller
Copy link
Collaborator

bhaller commented Jun 22, 2023

Great stuff, @bodkan! I'm very glad that you're doing this sort of testing, it is much needed (as shown already by the fact that it exposed a bug that has been around for ages!). We need much more of this.

@bodkan
Copy link

bodkan commented Jun 22, 2023

I 100% agree.

I have some code lying around that can generate random phylogenetic trees of arbitrary complexity and convert them into slendr models -- one lineage for one slendr population, with randomized population sizes and divergence times, and with the ability to sprinkle random gene-flow events onto those models.

I was hoping to use this for more extensive testing. I.e., for a series of randomly generated demographic models, pick a random set of recorded individuals and compute all kinds of tskit statistics -- pick random pairs, triplets and quadruplets of individuals and compute f2, f3, f4 statistics, compute 1D, 2D, etc. AFS spectra, and always compare an slendr/msprime run to a slendr/SLiM run... This has been done since last summer, but I never got to merge it into slendr as a new set of statistical unit tests.

I've been working on a list of smaller Master's-level projects we organize here in Cph and I think I'll put this on a higher priority now. This particular project would be a fun way for a student to learn simulations and popgen, and for all of us to get something useful out of it too!

(In fact, this particular bug was discovered in a Master's thesis work I supervised.)

@petrelharp
Copy link
Contributor Author

Oh, good!! :phew: Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants