Add special case for merging non-overlapping segments #1931

jeromekelleher · 2021-11-24T11:21:13Z

Our analysis of the running time of Hudson's algorithm in the msprime 1.0 paper predicts that, for long genomes, a lot of the time will be spent in events that merge two widely separated chunks of ancestry. It may therefore be worthwhile adding a special case in the merge_ancestors code path to deal with this situation: given two segment chains a and b, the right-most segment of a lies entirely to the left of the left-most segment of b. However, we currently don't record the extremities of the segments per individual, so we would need to refactor things somewhat.
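
To make the idea concrete, here is a minimal sketch in Python. It is not msprime's actual data structure or its merge_ancestors implementation: the `Segment` and `Chain` classes, the stored `tail` pointer, and the half-open `[left, right)` interval convention are all assumptions made for illustration. If the two chains do not interleave, merging reduces to linking the tail of one to the head of the other, with no per-segment traversal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    # Hypothetical segment covering the half-open interval [left, right).
    left: float
    right: float
    next: Optional["Segment"] = None


@dataclass
class Chain:
    # Hypothetical lineage record that stores both extremities of its chain.
    head: Segment
    tail: Segment
    num_segments: int


def make_chain(intervals):
    """Build a chain from a non-empty, sorted list of disjoint (left, right) pairs."""
    head = tail = None
    for left, right in intervals:
        seg = Segment(left, right)
        if head is None:
            head = seg
        else:
            tail.next = seg
        tail = seg
    return Chain(head, tail, len(intervals))


def try_concat_merge(a, b):
    """Link the two chains in O(1) if they do not interleave, else return None."""
    if b.tail.right <= a.head.left:
        a, b = b, a  # make sure a is the chain on the left
    if a.tail.right > b.head.left:
        return None  # the chains interleave: the general merge is needed
    a.tail.next = b.head
    return Chain(a.head, b.tail, a.num_segments + b.num_segments)


def intervals(chain):
    """Walk the chain and return its (left, right) pairs, for inspection only."""
    out = []
    seg = chain.head
    while seg is not None:
        out.append((seg.left, seg.right))
        seg = seg.next
    return out
```

A caller would try `try_concat_merge(a, b)` first and fall back to the existing general merge when it returns `None`.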

This would also only be worth doing if the number of segments in a and b were reasonably large, so a first step would be to do some exploratory work to see what the average number of segments in this "gather" phase of the simulation really is. If anyone is interested in making their Drosophila simulations go faster, then this would be a good place to start.

GertjanBisschop · 2022-11-07T09:44:39Z

I have run simulations with Drosophila-like parameters:
num_reps = 500, num_samples = 20, rho = 4 * 1e6 * 8.4e-9, sequence_length = 5e6

Every $0.5 \times 2 N_e$ generations, information was gathered on:

- the number of lineages present
- the average width of the hull (as a fraction of the sequence length), where hull width = right edge of the last interval containing ancestral material minus left edge of the first such interval
- the average amount of ancestral material (as a fraction of the sequence length) per lineage
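
For reference, the three quantities can be computed as in the sketch below. This is not the script used for the runs reported here. It assumes each lineage is represented as a sorted list of disjoint `(left, right)` intervals of ancestral material, and the function names are hypothetical.

```python
def hull_width(segments, sequence_length):
    """Right edge of the last interval minus left edge of the first, as a fraction."""
    return (segments[-1][1] - segments[0][0]) / sequence_length


def ancestral_fraction(segments, sequence_length):
    """Total length of ancestral material, as a fraction of the sequence length."""
    return sum(right - left for left, right in segments) / sequence_length


def summarise(lineages, sequence_length):
    """Summary statistics for all lineages present at one sampling time."""
    n = len(lineages)
    return {
        "num_lineages": n,
        "mean_hull_width": sum(
            hull_width(s, sequence_length) for s in lineages
        ) / n,
        "mean_ancestral_fraction": sum(
            ancestral_fraction(s, sequence_length) for s in lineages
        ) / n,
    }
```

A lineage carrying two short segments at opposite ends of the sequence has a hull width close to 1 but a small ancestral fraction, which is why both are reported.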

Zoomed-in version:

Heatmap showing the hull width (yellow) for a single run for all lineages present at time = $5 * 2 N_e$ generations. Sequence length (x-axis) has been scaled by a factor of 10 on the plot. Each row represents a single lineage.

jeromekelleher · 2022-11-08T10:50:12Z

Very interesting thanks @GertjanBisschop! I think we need one more bit of information: what is the average number of segments per lineage at these time points?

It looks like the probability of two randomly chosen lineages overlapping is small (as expected), but our special case is worthwhile only if the number of segments the lineages carry is more than (say) 5.
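
One rough way to quantify this from the sampled lineages is sketched below. It assumes each lineage is a sorted list of disjoint half-open `(left, right)` intervals, and it reads "more than (say) 5" as a requirement on both lineages in a pair; the function names and that reading are assumptions for illustration, not anything in msprime.

```python
from itertools import combinations


def hulls_disjoint(a, b):
    """True if one lineage's hull ends at or before the other's begins."""
    return a[-1][1] <= b[0][0] or b[-1][1] <= a[0][0]


def fraction_worthwhile(lineages, min_segments=5):
    """Fraction of lineage pairs with disjoint hulls and more than
    min_segments segments in each lineage."""
    pairs = list(combinations(lineages, 2))
    if not pairs:
        return 0.0
    hits = sum(
        1
        for a, b in pairs
        if hulls_disjoint(a, b)
        and len(a) > min_segments
        and len(b) > min_segments
    )
    return hits / len(pairs)
```

This weights all pairs equally, which matches choosing two lineages uniformly at random at a given time point.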

GertjanBisschop · 2022-11-10T09:18:01Z

Median (left) and mean (right) number of segments per lineage:

As expected, the number of segments per lineage comes down pretty quickly. There is only a small window during the coalescent process where the mean hull width is already quite small but where lineages on average still consist of quite a few segments.

jeromekelleher · 2022-11-10T09:33:15Z

Hmm, ok, so this means that this special case we're talking about above won't have much effect.

Great work @GertjanBisschop - you just saved us a whole bunch of refactoring that would have ended in disappointment!

I'm going to close this issue. We may want to implement the special case later if we change the data structures around a bit so that it's easy (i.e., we store the head and tail of each lineage's segments, not just the head), but it's not worth doing that refactoring just for this minor optimisation.

jeromekelleher added the Performance label Nov 24, 2021

jeromekelleher closed this as completed Nov 10, 2022
