Docs for simulating with multiple chromosomes #1299
Codecov Report
@@ Coverage Diff @@
## main #1299 +/- ##
=======================================
Coverage 90.35% 90.35%
=======================================
Files 26 26
Lines 8861 8861
Branches 1839 1839
=======================================
Hits 8006 8006
Misses 436 436
Partials 419 419
|
Superb, thanks @apragsdale!
It would be good to change the `code-block:: python` sections into jupyter-sphinx blocks, though, so we're running live code chunks. We can make the example much smaller if it's slow. I wonder if there's any way we could illustrate that the trees on the different chromosomes are correlated? Or perhaps illustrate how we'd access the trees on the different chromosomes?
Ok, switched the code blocks to jupyter-sphinx, which was a good call, since I caught some mistakes in the code blocks! For accessing trees on different chromosomes, there's nothing in the tree sequence that specifies which segments belong to which chromosome, so users would need to know for themselves where positions map to along their genome. I'm not sure how we'd show correlations between trees without doing a bunch of simulations and plotting statistics as in Dominic's DTWF paper (but that isn't really feasible within some docs Jupyter blocks). Did you have other thoughts here? |
That's true, but I guess we could give a little recipe here to make that easier. We could put in the lengths as variables to make this possible, I guess, but it's not totally clear how to do this nicely.
I was thinking more along the lines of a small example, where we show visually that the tree on the end of one chromosome is very similar to the start of another one, and that this wouldn't be true if we simulated them independently. Maybe this is the subject for a tutorial though, tbh. |
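A rough sketch of the "little recipe" idea, for anyone following along: keep the chromosome lengths as variables, derive the rate-map positions from them, and look up which chromosome a position falls on. Everything here is an assumption for illustration: the helper names (`chromosome_bounds`, `rate_map_positions`, `chromosome_index`) are hypothetical and not part of msprime or tskit, and the layout (a 1 bp break interval at the start of each chromosome after the first) simply mirrors the `[0, 1000, 1001, 2000]` positions used later in this thread.

```python
from bisect import bisect_right
from itertools import accumulate


def chromosome_bounds(lengths):
    """Cumulative boundaries [0, L1, L1 + L2, ...] of concatenated chromosomes."""
    return [0, *accumulate(lengths)]


def rate_map_positions(bounds):
    """Rate-map positions with a 1 bp break interval after each internal boundary.

    Hypothetical layout: for bounds [0, b1, ..., bN] this returns
    [0, b1, b1 + 1, b2, b2 + 1, ..., bN].
    """
    positions = [bounds[0]]
    for boundary in bounds[1:-1]:
        positions.extend([boundary, boundary + 1])
    positions.append(bounds[-1])
    return positions


def chromosome_index(bounds, position):
    """Zero-based index of the chromosome whose half-open interval holds position."""
    if not bounds[0] <= position < bounds[-1]:
        raise ValueError(f"position {position} is outside the simulated genome")
    return bisect_right(bounds, position) - 1


lengths = [1000, 1000]  # illustrative chromosome lengths, in bp
bounds = chromosome_bounds(lengths)
positions = rate_map_positions(bounds)

print(bounds)
print(positions)
print(chromosome_index(bounds, 999), chromosome_index(bounds, 1000))
```

With `bounds` in hand, the left coordinate of any tree interval can be passed to `chromosome_index` to label the tree by chromosome, without storing anything extra in the tree sequence.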
Yeah, I think it would be straightforward to have a few extra lines to explain how to get trees for one chromosome vs the other. Is there a simple way now to start a tree iterator with a specific start and end position in the tree sequence? I suppose you could use the
I agree, that does sound a bit more like a tutorial section instead of a short illustration here. Can we punt this for now and come back to it later when msprime 1.0 is a bit more documented and filled in? |
Fantastic idea! Can't believe we didn't think of that before. So, we'd have something like:
ts_chroms = []
for j in range(len(recomb_map.position) - 1):
    # Take each adjacent pair of map positions as one interval
    start, end = recomb_map.position[j : j + 2]
    # simplify=False is important so the nodes retain their identity
    chrom_ts = ts.keep_intervals([[start, end]], simplify=False)
    ts_chroms.append(chrom_ts.trim())
And that should do it.
Yes, let's do that. Can we add a ".. todo::" section here saying we need to link to a tutorial here explaining how to access |
Yes, great idea - I think this improves this section quite a bit.
Yep, added the todo. |
Why aren't you using a discrete recombination map, btw? |
My understanding (which could be wrong - I'm just figuring out the new methods this week) is that the recombination map has been replaced with a generic rate map, and discrete coordinates are used by default in E.g., the following always creates two trees, even though the rate separating the two segments is very large:
import msprime
positions = [0, 1000, 1001, 2000]
rates = [0, 2, 0]
# RateMap takes keyword arguments in released msprime 1.0
rate_map = msprime.RateMap(position=positions, rate=rates)
ts = msprime.sim_ancestry(
    2, population_size=1000,
    recombination_rate=rate_map, model="dtwf")
print(ts.num_trees)
print(ts.first().interval)
print(ts.last().interval)
with output
Edit: if you set |
oh, right - so, you are using a discrete map! In that case, I don't think this is necessary any more:
|
A suggestion: instead of saying "effectively unlink" why not just say they have a 50% chance of being co-inherited? Like maybe:
Also, it might be helpful to explain what the correlations between chromosomes are, e.g. "For instance, in the continuous-time Hudson model, each recombination event creates a new ancestor; so that with 22 chromosomes, a genome might have tens of distinct ancestors going back only one or two time units. Simulating using the DTWF model allows chromosomes to be inherited together, and is completely equivalent to independent assortment." (this could be improved, but you get the idea?) |
This is something I'm not clear on. Is the rate still a Poisson rate on that interval? If that's the case, then a rate of 0.5 wouldn't end up giving us the probability of drawing odd number of recombination events close to 1/2. Or is it really that every bp position within that interval (which is just the single site) has probability 1/2 of having a recombination event per unit time? In which case, yes, rate of 0.5 would work.
Good suggestions! Thanks |
Well, our former selves didn't know either. It happens here, and I think that with a discrete genome, the upshot is that a breakpoint in a region of length 1 with rate If so, we should set the rate so that Also note that if it's really a Poisson number, then there's a different equation to get it set to exactly 1/2 (details in the SLiM manual). |
My lazy unthinking brain says to just simulate it and calculate the LD. |
If it's a Poisson number, the probability that the number is odd is always less than 0.5, and only approaches 0.5 in the limit of the rate going to infinity. Though practically it converges quickly, so even with a rate of 2, the even/odd split is ~ 0.51/0.49. But I think you're right that recombination events are drawn differently in the discrete genome, and they follow something like |
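To make the Poisson point concrete, here is a small check using only the standard library. For a Poisson count with mean `rate`, the probability of an odd count is (1 - exp(-2 * rate)) / 2, which is below 1/2 for every finite rate. The function name `prob_odd_poisson` is made up for this sketch, and it says nothing about how msprime actually draws breakpoints on a discrete genome.

```python
import math


def prob_odd_poisson(rate):
    """P(N is odd) for N ~ Poisson(rate), via exp(-rate) * sinh(rate)."""
    return 0.5 * (1.0 - math.exp(-2.0 * rate))


# The odd-count probability climbs towards 1/2 but never reaches it.
for rate in (0.5, 1.0, 2.0, 5.0):
    p_odd = prob_odd_poisson(rate)
    print(f"rate={rate}: P(odd)={p_odd:.4f}, P(even)={1.0 - p_odd:.4f}")
```

At a rate of 2 this gives roughly 0.49 odd versus 0.51 even, matching the split quoted above, while a rate of 0.5 leaves the odd probability well short of 1/2.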
Oh, sorry, you're right about that.
Oh, nice experiment! I got
So, I think the recommendation is to set recombination rate to |
I see - that makes sense to me! I didn't realize that setting number of samples to 2 is actually simulating 4 samples, so I was dividing by the wrong number of samples. I'm not a fan that the default behavior for number of samples now doubles the number of sampled sequences ( Interestingly, it looks to still be valid using the Hudson model:
I'll update the PR with all this in mind - thanks for the help @petrelharp! |
Switching the focus to individuals! Since that's what people work/think with usually anyhow. This was discussed elsewhere extensively, anyhow, and is why we're switching from thank you! |
I see - I think I missed those discussions while away for a month or two recently due to a move :( I think I've also been confused because my impression was that |
Just to reassure you @apragsdale - |
Ok, sorry for making all this noise about it after the discussions had all been settled! Like I said, I just got worried yesterday, and turns out I had no reason to be.
Because `simulate()` also supports `discrete_genome` with a discrete genetic map, and if those aspects of the simulation work the same way as here, we can also simulate multiple chromosomes with `simulate()` in the same way. It might be worth a mention in those docs under a Multiple Chromosome header, with just something short and the analogous `simulate(...)` options as needed, and then a pointer to this page for the discussion, since it's more or less the same.
Does that sound reasonable, or are those sections of the docs in the process of moving and I should wait to add it there until things settle down?
|
The current plan is for I'd rather move away from the language of a "discrete genetic map" now too, as the recombination map and whether or not we have a discrete genome are fully decoupled. |
Ah-ha! Ok, I can understand that. In that case, this is probably good to go. I cleaned it up a bit more as well. It could maybe use one more quick read-through to make sure it all reads clearly.
That makes sense. Thanks a lot for taking the time to clarify. |
This looks great, thanks @apragsdale! Ready to merge after a squash.
Not at all - thanks for taking the time to try out the new APIs! |
e007ffa to fcfb2da
It's really nice with some great new features! Ok - all squashed. |
This was not immediately clear to me, so I thought I'd add that here in case it's useful for anyone (and also in case I made any mistake and other people can spot it).
|
Closes #1005