Docs for simulating with multiple chromosomes #1299

apragsdale · 2020-11-12T05:06:21Z

AdminBot-tskit · 2020-11-12T05:08:17Z

📖 Docs for this PR can be previewed here

codecov · 2020-11-12T05:09:18Z

Codecov Report

Merging #1299 (fcfb2da) into main (6aead5f) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main    #1299   +/-   ##
=======================================
  Coverage   90.35%   90.35%           
=======================================
  Files          26       26           
  Lines        8861     8861           
  Branches     1839     1839           
=======================================
  Hits         8006     8006           
  Misses        436      436           
  Partials      419      419

Flag	Coverage Δ
C	`90.35% <ø> (ø)`
python	`98.66% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6aead5f...fcfb2da. Read the comment docs.

jeromekelleher

Superb, thanks @apragsdale!

It would be good to change the code-block:: python sections into Jupyter-sphinx blocks, though, so we're running live code chunks. We can make the example much smaller if it's slow. I wonder if there's any way we could illustrate that the trees on the different chromosomes are correlated? Or perhaps illustrate how we'd access the trees on the different chromosomes?

apragsdale · 2020-11-12T15:26:40Z

Ok, switched the code blocks to jupyter sphinx - which was a good call, since I caught some mistakes in the code blocks!

For accessing trees on different chromosomes, there's nothing in the tree sequence that specifies anything about which segments belong to which chromosome, so the user would need to know themselves where positions map to along their genome.

I'm not sure how we'd show correlations between trees without doing a bunch of simulations and plotting statistics as in Dominic's DTWF paper (but that isn't really feasible within some docs juptyper blocks). Did you have other thoughts here?

jeromekelleher · 2020-11-12T15:59:05Z

For accessing trees on different chromosomes, there's nothing in the tree sequence that specifies anything about which segments belong to which chromosome, so the user would need to know themselves where positions map to along their genome.

That's true, but I guess we could give a little recipe here to make that easier. We could put in the lengths as variables to make this possible, I guess, but it's not totally clear how to do this nicely.

I'm not sure how we'd show correlations between trees without doing a bunch of simulations and plotting statistics as in Dominic's DTWF paper (but that isn't really feasible within some docs juptyper blocks). Did you have other thoughts here?

I was thinking more along the lines of a small example, where we show visually that the tree on the end of one chromosome is very similar to the start of another one, and that this wouldn't be true if we simulated them independently. Maybe this is the subject for a tutorial though, tbh.

apragsdale · 2020-11-12T16:51:34Z

That's true, but I guess we could give a little recipe here to make that easier. We could put in the lengths as variables to make this possible, I guess, but it's not totally clear how to do this nicely.

Yeah, I think it would be straightforward to have a few extra lines to explain how to get trees for one chromosome vs the other. Is there a simple way now to start a tree iterator with a specific start and end position in the tree sequence?

I suppose you could use the keep_intervals function to pull out each chromosome - in a similar vein, is there a way to do something like keep_intervals on each chromosome that resets the position indexes to zero and erases everything outside of that kept chromosome, instead of just removing branches and mutations? That would give you tree sequences, one per chrom, as though you simulated them independently, but they would then have the correct correlation structure across chroms.

I was thinking more along the lines of a small example, where we show visually that the tree on the end of one chromosome is very similar to the start of another one, and that this wouldn't be true if we simulated them independently. Maybe this is the subject for a tutorial though, tbh.

I agree, that does sound a bit more like a tutorial section instead of a short illustration here. Can we punt this for now and come back to it later when msprime 1.0 is a bit more documented and filled in?

jeromekelleher · 2020-11-12T19:13:49Z

I suppose you could use the keep_intervals function to pull out each chromosome - in a similar vein, is there a way to do something like keep_intervals on each chromosome that resets the position indexes to zero and erases everything outside of that kept chromosome, instead of just removing branches and mutations? That would give you tree sequences, one per chrom, as though you simulated them independently, but they would then have the correct correlation structure across chroms.

Fantastic idea! Can't believe we didn't think of that before. So, we'd have something like

ts_chroms = []
for j in range(len(recomb_map.position) - 1):
    start, end = recomb_map.position[j: j + 1]
    chrom_ts = ts.keep_intervals([start, end], simplify=False) # simplify is important so the nodes retain identity
    ts_chroms.append(chrom_ts.trim())

And that should do it.

I agree, that does sound a bit more like a tutorial section instead of a short illustration here. Can we punt this for now and come back to it later when msprime 1.0 is a bit more documented and filled in?

Yes, let's do that. Can we add a ".. todo::" section here saying we need to link to a tutorial here explaining how to access

apragsdale · 2020-11-12T20:39:02Z

And that should do it.

Yes, great idea - I think this improves this section quite a bit.

Can we add a ".. todo::" section here saying we need to link to a tutorial here explaining how to access

Yep, added the todo.

petrelharp · 2020-11-12T22:30:55Z

Why aren't you using a discrete recombination map, btw?

apragsdale · 2020-11-12T22:38:58Z

Why aren't you using a discrete recombination map, btw?

My understanding (which could be wrong - I'm just figuring out the new methods this week) is that the recombination map has been replaced with a generic rate map, and discrete coordinates are used by default in sim_ancestry. So by default all recombination events occur at integer positions.

E.g., the following always creates two trees, even though the rate is very large separating the two segments:

import msprime
positions = [0, 1000, 1001, 2000]
rates = [0, 2, 0]
rate_map = msprime.RateMap(positions, rates)
ts = msprime.sim_ancestry(
    2, population_size=1000,
    recombination_rate=rate_map, model="dtwf")
print(ts.num_trees)
print(ts.first().interval)
print(ts.last().interval)

with output

2
Interval(left=0.0, right=1000.0)
Interval(left=1000.0, right=2000.0)

Edit: if you set discrete_genome=False within the sim_ancestry call, you'll find a ton more trees all within the interval 1000 to 1001.

petrelharp · 2020-11-12T23:48:18Z

oh, right - so, you are using a discrete map! In that case, I don't think this is necessary any more:

Note that in defining the recombination map, the rates separating chromosomes are much larger than 0.5. ...

you should just set the rate to be 0.5 on that interval (but it's worth pointing out that this is a discrete map...).

petrelharp · 2020-11-12T23:56:22Z

A suggestion: instead of saying "effectively unlink" why not just say they have a 50% chance of being co-inherited? Like maybe:

 By simulating using sim_ancestry with discrete_genome=True (the default setting)
and a rate of 0.5 on a single-bp segment separating two regions,
we ensure that the two regions have a 50% chance of being both inherited from the same parent,
just as real chromosomes (that assort independently).

Also, it might be helpful to explain what the correlations between chromosomes are, e.g. "For instance, in the continuous-time Hudson model, each recombination event creates a new ancestor; so that with 22 chromosomes, a genome might have tens of distinct ancestors going back only one or two time units. Simulating using the DTWF model allows chromosomes to be inherited together, and is completely equivalent to independent assortment." (this could be improved, but you get the idea?)

apragsdale · 2020-11-13T00:17:23Z

you should just set the rate to be 0.5 on that interval (but it's worth pointing out that this is a discrete map...).

This is something I'm not clear on. Is the rate still a Poisson rate on that interval? If that's the case, then a rate of 0.5 wouldn't end up giving us the probability of drawing odd number of recombination events close to 1/2. Or is it really that every bp position within that interval (which is just the single site) has probability 1/2 of having a recombination event per unit time? In which case, yes, rate of 0.5 would work.

A suggestion: instead of saying "effectively unlink" why not just say they have a 50% chance of being co-inherited? Like maybe:

Good suggestions! Thanks

petrelharp · 2020-11-13T06:26:21Z

This is something I'm not clear on. Is the rate still a Poisson rate on that interval?

Well, our former selves didn't know either. It happens here, and I think that with a discrete genome, the upshot is that a breakpoint in a region of length 1 with rate r has either 0 or 1 breakpoints, and has 1 breakpoint with probability 1-exp(-r). (But, I'm not sure: @DomNelson? @jeromekelleher?)

If so, we should set the rate so that exp(-r) = 1/2, i.e., r = log(2).

Also note that if it's really a Poisson number, then there's a different equation to get it set to exactly 1/2 (details in the SLiM manual).

grahamgower · 2020-11-13T08:49:01Z

My lazy unthinking brain says to just simulate it and calculate the LD.

apragsdale · 2020-11-13T20:30:56Z

Also note that if it's really a Poisson number, then there's a different equation to get it set to exactly 1/2 (details in the SLiM manual).

If it's a Poisson number, the probability that the number is odd is always less than 0.5, and only approaches 0.5 in the limit of the rate going to infinity. Though practically it converges quickly, so even with a rate of 2, the even/odd split is ~ 0.51/0.49.

But I think you're right that recombination events are drawn differently in the discrete genome, and they follow something like 1-exp(-r). Though from some experimenting, it looks like we get 1 breakpoint with probability 1-exp(-2*r) instead. I checked this by running a single generation of "dtwf" with a single locus hotspot and no recombination on either side, with two samples. We can then check if neither sample had a recombination event in that one generation, which should be in 1/4 of the replicates if the probability of coinheritance on each side of the hotspot is 1/2. The value of r that gave that proportion was log(sqrt(2)) (dashed red lines in the figure). Not sure what to make of this, if that is what the correct rate should be.

petrelharp · 2020-11-13T22:13:07Z

If it's a Poisson number, the probability that the number is odd is always less than 0.5, and only approaches 0.5 in the limit of the rate going to infinity.

Oh, sorry, you're right about that.

1 - exp(-2r)

Oh, nice experiment! I got 1 - exp(-r), doing something similar, though:

r = np.log(2)
n = 10000
ts = msprime.sim_ancestry(n, recombination_rate=r, population_size=1e10, model='dtwf', end_time=2, sequence_length=2)
print("expected proportion of recombinants:",  (1 - np.exp(-r)))
print('observed proportion of recombinants:', np.sum(ts.tables.edges.right == 1) / (2 * n))
# expected proportion of recombinants: 0.5
# observed proportion of recombinants: 0.50095

So, I think the recommendation is to set recombination rate to log(2) = 0.69314718 for the base pairs between chromosomes, with an explanation that in the DTWF, the probability of recombination is 1 - exp(-r)? This is a teeny bit problematic because when you switch to Hudson you'd really like a recombination rate of 0.5, but I can't imagine this would make a difference in practice? I guess this is implementing the "at least one recombination event creates a crossing over" instead of "odd number of recombination events".

apragsdale · 2020-11-13T23:07:20Z

I see - that makes sense to me! I didn't realize that setting number of samples to 2 is actually simulating 4 samples, so I was dividing by the wrong number of samples. I'm not a fan that the default behavior for number of samples now doubles the number of sampled sequences (ts = msprime.sim_ancestry(2, ...); ts.num_samples == 4), but maybe there's a good reason for that? Anyway, not relevant to this PR

Interestingly, it looks to still be valid using the Hudson model:

r = np.log(2)
n = 10000
ts = msprime.sim_ancestry(n, recombination_rate=r, population_size=1e10, model='hudson', end_time=1, sequence_length=2)
print("expected proportion of recombinants:",  (1 - np.exp(-r)))
print('observed proportion of recombinants:', np.sum(ts.tables.edges.right == 1) / (2 * n))
# expected proportion of recombinants: 0.5
# observed proportion of recombinants: 0.4986

I'll update the PR with all this in mind - thanks for the help @petrelharp!

petrelharp · 2020-11-14T04:15:19Z

I'm not a fan that the default behavior for number of samples now doubles the number of sampled sequences (ts = msprime.sim_ancestry(2, ...); ts.num_samples == 4), but maybe there's a good reason for that?

Switching the focus to individuals! Since that's what people work/think with usually anyhow. This was discussed elsewhere extensively, anyhow, and is why we're switching from simulate( ) (which will retain its old behavior) to sim_ancestry( ).

thank you!

apragsdale · 2020-11-14T04:28:37Z

Switching the focus to individuals! Since that's what people work/think with usually anyhow. This was discussed elsewhere extensively, anyhow, and is why we're switching from simulate( ) (which will retain its old behavior) to sim_ancestry( ).

thank you!

I see - I think I missed those discussions while away for a month or two recently due to a move :( I think I've also been confused because my impression was that sim_ancestry( ) is essentially replacing simulate( ) in msprime 1.0, which would be deprecated eventually - but I guess both will continue to be supported? If that's the case, then I don't have as much to worry about I think!

jeromekelleher · 2020-11-14T11:00:46Z

Just to reassure you @apragsdale - simulate() will be supported indefinitely. Nearly all existing code should continue working in the long term after 1.0. (The exceptions are where we were doing tricky stuff with the recombination maps with num_loci - we had no choice but to break this.)

apragsdale · 2020-11-14T15:22:16Z

Ok, sorry for making all this noise about it after the discussions had all been settled! Like I said, I just got worried yesterday, and turns out I had no reason to be. Because `simulate()` also supports `discrete_genome` with a discrete genetic map, and if those aspects of the simulation work the same was as here, we can also simulate multiple chromosomes with `simulate()` in the same way. Might be worth a mention in those docs with a Multiple Chromosome header, with just something short and the analogous `similate(...)` options as needed, and then point to this page for the discussion since it's more or less the same. Does that sound reasonable, or are those sections of the docs in the process of moving and I should wait to add it there until things settle down?

jeromekelleher · 2020-11-14T17:01:28Z

Because simulate() also supports discrete_genome with a discrete genetic map, and if those aspects of the simulation work the same was as here, we can also simulate multiple chromosomes with simulate() in the same way. Might be worth a mention in those docs with a Multiple Chromosome header, with just something short and the analogous similate(...) options as needed, and then point to this page for the discussion since it's more or less the same.

The current plan is for simulate() to not support the discrete_genome option in an attempt to make the code simpler (#1297). I've already done a lot of the changes in #1287

I'd rather move away from the language of a "discrete genetic map" now too, as the recombination map and whether or not we have a discrete genome are fully decoupled.

apragsdale · 2020-11-15T00:16:04Z

The current plan is for simulate() to not support the discrete_genome option in an attempt to make the code simpler (#1297). I've already done a lot of the changes in #1287

Ah-ha! Ok, I can understand that. In that case, this is probably good to go. I cleaned it up a bit more as well. It could maybe use one more quick read-through to make sure it all reads clearly.

I'd rather move away from the language of a "discrete genetic map" now too, as the recombination map and whether or not we have a discrete genome are fully decoupled.

That makes sense. Thanks a lot for taking the time to clarify.

jeromekelleher

This looks great, thanks @apragsdale! Ready to merge after a squash.

jeromekelleher · 2020-11-15T14:19:19Z

That makes sense. Thanks a lot for taking the time to clarify.

Not at all - thanks for taking the time to try out the new APIs!

apragsdale · 2020-11-15T15:25:28Z

Not at all - thanks for taking the time to try out the new APIs!

It's really nice with some great new features! Ok - all squashed.

diegovelizo · 2024-05-13T00:33:11Z

This was not immediately clear to me, so I thought I'd add that here in case it's useful for anyone (and also in case I made any mistake and other people can spot it).
When stitching together recombination map files in hapmap format, you need to add 69.314 cM to the genetic position. The third column seems to be completely ignored. (In an earlier version of this comment I incorrectly said that either the third or fourth column could be edited to specify the start of the next chromosome, I was wrong.)

data = io.StringIO("""\
Chromosome  Position(bp)  Rate(cM/Mb)  Map(cM)
chr10       48232         0       100.002664
chr10       48486         0       100.002705
chr10       50009         0        100.002947
chr10       52147         0       100.003287
chr10       52148         0       169.31800522
""")
rate_map = msprime.RateMap.read_hapmap(data)
print(rate_map)
┌───────────────────────────────────────────┐
│left   │right  │      mid│   span│     rate│
├───────────────────────────────────────────┤
│0      │48232  │    24116│  48232│  2.1e-05│
│48232  │48486  │    48359│    254│  1.6e-09│
│48486  │50009  │  49247.5│   1523│  1.6e-09│
│50009  │52147  │    51078│   2138│  1.6e-09│
│52147  │52148  │  52147.5│      1│     0.69│
└───────────────────────────────────────────┘

data = io.StringIO("""\
Chromosome  Position(bp)  Rate(cM/Mb)  Map(cM)
chr10       48232         0.1614       0
chr10       48486         0.1589       0
chr10       50009         0.159        0
chr10       52147         69314718.06  0
chr10       52148         0.1574       0
""")
rate_map = msprime.RateMap.read_hapmap(data)

print(rate_map)
┌────────────────────────────────────────┐
│left   │right  │      mid│   span│  rate│
├────────────────────────────────────────┤
│0      │48232  │    24116│  48232│   nan│
│48232  │48486  │    48359│    254│     0│
│48486  │50009  │  49247.5│   1523│     0│
│50009  │52147  │    51078│   2138│     0│
│52147  │52148  │  52147.5│      1│     0│
└────────────────────────────────────────┘

jeromekelleher reviewed Nov 12, 2020

View reviewed changes

petrelharp mentioned this pull request Nov 14, 2020

Update the ancestry/genome properties section #1278

Closed

jeromekelleher approved these changes Nov 15, 2020

View reviewed changes

docs for using sim_ancestry with multiple chromosomes

fcfb2da

apragsdale force-pushed the multiple_chrom_docs branch from e007ffa to fcfb2da Compare November 15, 2020 15:22

jeromekelleher added the AUTOMERGE-REQUESTED label Nov 15, 2020

mergify bot merged commit 3750039 into tskit-dev:main Nov 15, 2020

mergify bot removed the AUTOMERGE-REQUESTED label Nov 15, 2020

hyanwong mentioned this pull request Nov 28, 2020

Simulation of multiple chromosomes #848

Closed

apragsdale deleted the multiple_chrom_docs branch March 25, 2021 21:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Docs for simulating with multiple chromosomes #1299

Docs for simulating with multiple chromosomes #1299

apragsdale commented Nov 12, 2020

AdminBot-tskit commented Nov 12, 2020

codecov bot commented Nov 12, 2020 •

edited

Loading

jeromekelleher left a comment •

edited

Loading

apragsdale commented Nov 12, 2020

jeromekelleher commented Nov 12, 2020

apragsdale commented Nov 12, 2020

jeromekelleher commented Nov 12, 2020

apragsdale commented Nov 12, 2020

petrelharp commented Nov 12, 2020

apragsdale commented Nov 12, 2020 •

edited

Loading

petrelharp commented Nov 12, 2020

petrelharp commented Nov 12, 2020

apragsdale commented Nov 13, 2020

petrelharp commented Nov 13, 2020

grahamgower commented Nov 13, 2020

apragsdale commented Nov 13, 2020

petrelharp commented Nov 13, 2020

apragsdale commented Nov 13, 2020

petrelharp commented Nov 14, 2020

apragsdale commented Nov 14, 2020

jeromekelleher commented Nov 14, 2020

apragsdale commented Nov 14, 2020 via email •

edited

Loading

jeromekelleher commented Nov 14, 2020

apragsdale commented Nov 15, 2020

jeromekelleher left a comment

jeromekelleher commented Nov 15, 2020

apragsdale commented Nov 15, 2020

diegovelizo commented May 13, 2024 •

edited

Loading

Docs for simulating with multiple chromosomes #1299

Docs for simulating with multiple chromosomes #1299

Conversation

apragsdale commented Nov 12, 2020

AdminBot-tskit commented Nov 12, 2020

codecov bot commented Nov 12, 2020 • edited Loading

Codecov Report

jeromekelleher left a comment • edited Loading

Choose a reason for hiding this comment

apragsdale commented Nov 12, 2020

jeromekelleher commented Nov 12, 2020

apragsdale commented Nov 12, 2020

jeromekelleher commented Nov 12, 2020

apragsdale commented Nov 12, 2020

petrelharp commented Nov 12, 2020

apragsdale commented Nov 12, 2020 • edited Loading

petrelharp commented Nov 12, 2020

petrelharp commented Nov 12, 2020

apragsdale commented Nov 13, 2020

petrelharp commented Nov 13, 2020

grahamgower commented Nov 13, 2020

apragsdale commented Nov 13, 2020

petrelharp commented Nov 13, 2020

apragsdale commented Nov 13, 2020

petrelharp commented Nov 14, 2020

apragsdale commented Nov 14, 2020

jeromekelleher commented Nov 14, 2020

apragsdale commented Nov 14, 2020 via email • edited Loading

jeromekelleher commented Nov 14, 2020

apragsdale commented Nov 15, 2020

jeromekelleher left a comment

Choose a reason for hiding this comment

jeromekelleher commented Nov 15, 2020

apragsdale commented Nov 15, 2020

diegovelizo commented May 13, 2024 • edited Loading

codecov bot commented Nov 12, 2020 •

edited

Loading

jeromekelleher left a comment •

edited

Loading

apragsdale commented Nov 12, 2020 •

edited

Loading

apragsdale commented Nov 14, 2020 via email •

edited

Loading

diegovelizo commented May 13, 2024 •

edited

Loading