Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docs for simulating with multiple chromosomes #1299

Merged
merged 1 commit into from
Nov 15, 2020

Conversation

apragsdale
Copy link
Contributor

Closes #1005

@AdminBot-tskit
Copy link
Collaborator

📖 Docs for this PR can be previewed here

@codecov
Copy link

codecov bot commented Nov 12, 2020

Codecov Report

Merging #1299 (fcfb2da) into main (6aead5f) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1299   +/-   ##
=======================================
  Coverage   90.35%   90.35%           
=======================================
  Files          26       26           
  Lines        8861     8861           
  Branches     1839     1839           
=======================================
  Hits         8006     8006           
  Misses        436      436           
  Partials      419      419           
Flag Coverage Δ
C 90.35% <ø> (ø)
python 98.66% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6aead5f...fcfb2da. Read the comment docs.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superb, thanks @apragsdale!

It would be good to change the code-block:: python sections into Jupyter-sphinx blocks, though, so we're running live code chunks. We can make the example much smaller if it's slow. I wonder if there's any way we could illustrate that the trees on the different chromosomes are correlated? Or perhaps illustrate how we'd access the trees on the different chromosomes?

@apragsdale
Copy link
Contributor Author

Ok, switched the code blocks to jupyter sphinx - which was a good call, since I caught some mistakes in the code blocks!

For accessing trees on different chromosomes, there's nothing in the tree sequence that specifies anything about which segments belong to which chromosome, so the user would need to know themselves where positions map to along their genome.

I'm not sure how we'd show correlations between trees without doing a bunch of simulations and plotting statistics as in Dominic's DTWF paper (but that isn't really feasible within some docs juptyper blocks). Did you have other thoughts here?

@jeromekelleher
Copy link
Member

For accessing trees on different chromosomes, there's nothing in the tree sequence that specifies anything about which segments belong to which chromosome, so the user would need to know themselves where positions map to along their genome.

That's true, but I guess we could give a little recipe here to make that easier. We could put in the lengths as variables to make this possible, I guess, but it's not totally clear how to do this nicely.

I'm not sure how we'd show correlations between trees without doing a bunch of simulations and plotting statistics as in Dominic's DTWF paper (but that isn't really feasible within some docs juptyper blocks). Did you have other thoughts here?

I was thinking more along the lines of a small example, where we show visually that the tree on the end of one chromosome is very similar to the start of another one, and that this wouldn't be true if we simulated them independently. Maybe this is the subject for a tutorial though, tbh.

@apragsdale
Copy link
Contributor Author

That's true, but I guess we could give a little recipe here to make that easier. We could put in the lengths as variables to make this possible, I guess, but it's not totally clear how to do this nicely.

Yeah, I think it would be straightforward to have a few extra lines to explain how to get trees for one chromosome vs the other. Is there a simple way now to start a tree iterator with a specific start and end position in the tree sequence?

I suppose you could use the keep_intervals function to pull out each chromosome - in a similar vein, is there a way to do something like keep_intervals on each chromosome that resets the position indexes to zero and erases everything outside of that kept chromosome, instead of just removing branches and mutations? That would give you tree sequences, one per chrom, as though you simulated them independently, but they would then have the correct correlation structure across chroms.

I was thinking more along the lines of a small example, where we show visually that the tree on the end of one chromosome is very similar to the start of another one, and that this wouldn't be true if we simulated them independently. Maybe this is the subject for a tutorial though, tbh.

I agree, that does sound a bit more like a tutorial section instead of a short illustration here. Can we punt this for now and come back to it later when msprime 1.0 is a bit more documented and filled in?

@jeromekelleher
Copy link
Member

I suppose you could use the keep_intervals function to pull out each chromosome - in a similar vein, is there a way to do something like keep_intervals on each chromosome that resets the position indexes to zero and erases everything outside of that kept chromosome, instead of just removing branches and mutations? That would give you tree sequences, one per chrom, as though you simulated them independently, but they would then have the correct correlation structure across chroms.

Fantastic idea! Can't believe we didn't think of that before. So, we'd have something like

ts_chroms = []
for j in range(len(recomb_map.position) - 1):
    start, end = recomb_map.position[j: j + 1]
    chrom_ts = ts.keep_intervals([start, end], simplify=False) # simplify is important so the nodes retain identity
    ts_chroms.append(chrom_ts.trim())

And that should do it.

I agree, that does sound a bit more like a tutorial section instead of a short illustration here. Can we punt this for now and come back to it later when msprime 1.0 is a bit more documented and filled in?

Yes, let's do that. Can we add a ".. todo::" section here saying we need to link to a tutorial here explaining how to access

@apragsdale
Copy link
Contributor Author

And that should do it.

Yes, great idea - I think this improves this section quite a bit.

Can we add a ".. todo::" section here saying we need to link to a tutorial here explaining how to access

Yep, added the todo.

@petrelharp
Copy link
Contributor

Why aren't you using a discrete recombination map, btw?

@apragsdale
Copy link
Contributor Author

apragsdale commented Nov 12, 2020

Why aren't you using a discrete recombination map, btw?

My understanding (which could be wrong - I'm just figuring out the new methods this week) is that the recombination map has been replaced with a generic rate map, and discrete coordinates are used by default in sim_ancestry. So by default all recombination events occur at integer positions.

E.g., the following always creates two trees, even though the rate is very large separating the two segments:

import msprime
positions = [0, 1000, 1001, 2000]
rates = [0, 2, 0]
rate_map = msprime.RateMap(positions, rates)
ts = msprime.sim_ancestry(
    2, population_size=1000,
    recombination_rate=rate_map, model="dtwf")
print(ts.num_trees)
print(ts.first().interval)
print(ts.last().interval)

with output

2
Interval(left=0.0, right=1000.0)
Interval(left=1000.0, right=2000.0)

Edit: if you set discrete_genome=False within the sim_ancestry call, you'll find a ton more trees all within the interval 1000 to 1001.

@petrelharp
Copy link
Contributor

oh, right - so, you are using a discrete map! In that case, I don't think this is necessary any more:

Note that in defining the recombination map, the rates separating chromosomes are much larger than 0.5. ...
  • you should just set the rate to be 0.5 on that interval (but it's worth pointing out that this is a discrete map...).

@petrelharp
Copy link
Contributor

A suggestion: instead of saying "effectively unlink" why not just say they have a 50% chance of being co-inherited? Like maybe:

 By simulating using sim_ancestry with discrete_genome=True (the default setting)
and a rate of 0.5 on a single-bp segment separating two regions,
we ensure that the two regions have a 50% chance of being both inherited from the same parent,
just as real chromosomes (that assort independently). 

Also, it might be helpful to explain what the correlations between chromosomes are, e.g. "For instance, in the continuous-time Hudson model, each recombination event creates a new ancestor; so that with 22 chromosomes, a genome might have tens of distinct ancestors going back only one or two time units. Simulating using the DTWF model allows chromosomes to be inherited together, and is completely equivalent to independent assortment." (this could be improved, but you get the idea?)

@apragsdale
Copy link
Contributor Author

you should just set the rate to be 0.5 on that interval (but it's worth pointing out that this is a discrete map...).

This is something I'm not clear on. Is the rate still a Poisson rate on that interval? If that's the case, then a rate of 0.5 wouldn't end up giving us the probability of drawing odd number of recombination events close to 1/2. Or is it really that every bp position within that interval (which is just the single site) has probability 1/2 of having a recombination event per unit time? In which case, yes, rate of 0.5 would work.

A suggestion: instead of saying "effectively unlink" why not just say they have a 50% chance of being co-inherited? Like maybe:

Good suggestions! Thanks

@petrelharp
Copy link
Contributor

This is something I'm not clear on. Is the rate still a Poisson rate on that interval?

Well, our former selves didn't know either. It happens here, and I think that with a discrete genome, the upshot is that a breakpoint in a region of length 1 with rate r has either 0 or 1 breakpoints, and has 1 breakpoint with probability 1-exp(-r). (But, I'm not sure: @DomNelson? @jeromekelleher?)

If so, we should set the rate so that exp(-r) = 1/2, i.e., r = log(2).

Also note that if it's really a Poisson number, then there's a different equation to get it set to exactly 1/2 (details in the SLiM manual).

@grahamgower
Copy link
Member

My lazy unthinking brain says to just simulate it and calculate the LD.

@apragsdale
Copy link
Contributor Author

Also note that if it's really a Poisson number, then there's a different equation to get it set to exactly 1/2 (details in the SLiM manual).

If it's a Poisson number, the probability that the number is odd is always less than 0.5, and only approaches 0.5 in the limit of the rate going to infinity. Though practically it converges quickly, so even with a rate of 2, the even/odd split is ~ 0.51/0.49.

But I think you're right that recombination events are drawn differently in the discrete genome, and they follow something like 1-exp(-r). Though from some experimenting, it looks like we get 1 breakpoint with probability 1-exp(-2*r) instead. I checked this by running a single generation of "dtwf" with a single locus hotspot and no recombination on either side, with two samples. We can then check if neither sample had a recombination event in that one generation, which should be in 1/4 of the replicates if the probability of coinheritance on each side of the hotspot is 1/2. The value of r that gave that proportion was log(sqrt(2)) (dashed red lines in the figure). Not sure what to make of this, if that is what the correct rate should be.

rec_decay

@petrelharp
Copy link
Contributor

If it's a Poisson number, the probability that the number is odd is always less than 0.5, and only approaches 0.5 in the limit of the rate going to infinity.

Oh, sorry, you're right about that.

1 - exp(-2r)

Oh, nice experiment! I got 1 - exp(-r), doing something similar, though:

r = np.log(2)
n = 10000
ts = msprime.sim_ancestry(n, recombination_rate=r, population_size=1e10, model='dtwf', end_time=2, sequence_length=2)
print("expected proportion of recombinants:",  (1 - np.exp(-r)))
print('observed proportion of recombinants:', np.sum(ts.tables.edges.right == 1) / (2 * n))
# expected proportion of recombinants: 0.5
# observed proportion of recombinants: 0.50095

So, I think the recommendation is to set recombination rate to log(2) = 0.69314718 for the base pairs between chromosomes, with an explanation that in the DTWF, the probability of recombination is 1 - exp(-r)? This is a teeny bit problematic because when you switch to Hudson you'd really like a recombination rate of 0.5, but I can't imagine this would make a difference in practice? I guess this is implementing the "at least one recombination event creates a crossing over" instead of "odd number of recombination events".

@apragsdale
Copy link
Contributor Author

I see - that makes sense to me! I didn't realize that setting number of samples to 2 is actually simulating 4 samples, so I was dividing by the wrong number of samples. I'm not a fan that the default behavior for number of samples now doubles the number of sampled sequences (ts = msprime.sim_ancestry(2, ...); ts.num_samples == 4), but maybe there's a good reason for that? Anyway, not relevant to this PR

Interestingly, it looks to still be valid using the Hudson model:

r = np.log(2)
n = 10000
ts = msprime.sim_ancestry(n, recombination_rate=r, population_size=1e10, model='hudson', end_time=1, sequence_length=2)
print("expected proportion of recombinants:",  (1 - np.exp(-r)))
print('observed proportion of recombinants:', np.sum(ts.tables.edges.right == 1) / (2 * n))
# expected proportion of recombinants: 0.5
# observed proportion of recombinants: 0.4986

I'll update the PR with all this in mind - thanks for the help @petrelharp!

@petrelharp
Copy link
Contributor

I'm not a fan that the default behavior for number of samples now doubles the number of sampled sequences (ts = msprime.sim_ancestry(2, ...); ts.num_samples == 4), but maybe there's a good reason for that?

Switching the focus to individuals! Since that's what people work/think with usually anyhow. This was discussed elsewhere extensively, anyhow, and is why we're switching from simulate( ) (which will retain its old behavior) to sim_ancestry( ).

thank you!

@apragsdale
Copy link
Contributor Author

Switching the focus to individuals! Since that's what people work/think with usually anyhow. This was discussed elsewhere extensively, anyhow, and is why we're switching from simulate( ) (which will retain its old behavior) to sim_ancestry( ).

thank you!

I see - I think I missed those discussions while away for a month or two recently due to a move :( I think I've also been confused because my impression was that sim_ancestry( ) is essentially replacing simulate( ) in msprime 1.0, which would be deprecated eventually - but I guess both will continue to be supported? If that's the case, then I don't have as much to worry about I think!

@jeromekelleher
Copy link
Member

Just to reassure you @apragsdale - simulate() will be supported indefinitely. Nearly all existing code should continue working in the long term after 1.0. (The exceptions are where we were doing tricky stuff with the recombination maps with num_loci - we had no choice but to break this.)

@apragsdale
Copy link
Contributor Author

apragsdale commented Nov 14, 2020 via email

@jeromekelleher
Copy link
Member

Because simulate() also supports discrete_genome with a discrete genetic map, and if those aspects of the simulation work the same was as here, we can also simulate multiple chromosomes with simulate() in the same way. Might be worth a mention in those docs with a Multiple Chromosome header, with just something short and the analogous similate(...) options as needed, and then point to this page for the discussion since it's more or less the same.

The current plan is for simulate() to not support the discrete_genome option in an attempt to make the code simpler (#1297). I've already done a lot of the changes in #1287

I'd rather move away from the language of a "discrete genetic map" now too, as the recombination map and whether or not we have a discrete genome are fully decoupled.

@apragsdale
Copy link
Contributor Author

The current plan is for simulate() to not support the discrete_genome option in an attempt to make the code simpler (#1297). I've already done a lot of the changes in #1287

Ah-ha! Ok, I can understand that. In that case, this is probably good to go. I cleaned it up a bit more as well. It could maybe use one more quick read-through to make sure it all reads clearly.

I'd rather move away from the language of a "discrete genetic map" now too, as the recombination map and whether or not we have a discrete genome are fully decoupled.

That makes sense. Thanks a lot for taking the time to clarify.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks @apragsdale! Ready to merge after a squash.

@jeromekelleher
Copy link
Member

That makes sense. Thanks a lot for taking the time to clarify.

Not at all - thanks for taking the time to try out the new APIs!

@apragsdale
Copy link
Contributor Author

Not at all - thanks for taking the time to try out the new APIs!

It's really nice with some great new features! Ok - all squashed.

@mergify mergify bot merged commit 3750039 into tskit-dev:main Nov 15, 2020
@apragsdale apragsdale deleted the multiple_chrom_docs branch March 25, 2021 21:13
@diegovelizo
Copy link

diegovelizo commented May 13, 2024

This was not immediately clear to me, so I thought I'd add that here in case it's useful for anyone (and also in case I made any mistake and other people can spot it).
When stitching together recombination map files in hapmap format, you need to add 69.314 cM to the genetic position. The third column seems to be completely ignored. (In an earlier version of this comment I incorrectly said that either the third or fourth column could be edited to specify the start of the next chromosome, I was wrong.)

data = io.StringIO("""\
Chromosome  Position(bp)  Rate(cM/Mb)  Map(cM)
chr10       48232         0       100.002664
chr10       48486         0       100.002705
chr10       50009         0        100.002947
chr10       52147         0       100.003287
chr10       52148         0       169.31800522
""")
rate_map = msprime.RateMap.read_hapmap(data)
print(rate_map)
┌───────────────────────────────────────────┐
│left   │right  │      mid│   span│     rate│
├───────────────────────────────────────────┤
│0      │48232  │    24116│  48232│  2.1e-05│
│48232  │48486  │    48359│    254│  1.6e-09│
│48486  │50009  │  49247.5│   1523│  1.6e-09│
│50009  │52147  │    51078│   2138│  1.6e-09│
│52147  │52148  │  52147.5│      1│     0.69│
└───────────────────────────────────────────┘
data = io.StringIO("""\
Chromosome  Position(bp)  Rate(cM/Mb)  Map(cM)
chr10       48232         0.1614       0
chr10       48486         0.1589       0
chr10       50009         0.159        0
chr10       52147         69314718.06  0
chr10       52148         0.1574       0
""")
rate_map = msprime.RateMap.read_hapmap(data)

print(rate_map)
┌────────────────────────────────────────┐
│left   │right  │      mid│   span│  rate│
├────────────────────────────────────────┤
│0      │48232  │    24116│  48232│   nan│
│48232  │48486  │    48359│    254│     0│
│48486  │50009  │  49247.5│   1523│     0│
│50009  │52147  │    51078│   2138│     0│
│52147  │52148  │  52147.5│      1│     0│
└────────────────────────────────────────┘

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Docs] Update section on "simulating multiple chromosomes"
6 participants