Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add special case for merging non-overlapping segments #1931

Closed
jeromekelleher opened this issue Nov 24, 2021 · 4 comments
Closed

Add special case for merging non-overlapping segments #1931

jeromekelleher opened this issue Nov 24, 2021 · 4 comments

Comments

@jeromekelleher
Copy link
Member

Our analysis of the running time of Hudson's algorithm in the msprime 1.0 paper predicts that a lot of the time spent for long genomes will be in events that merge two widely separated chunks of ancestry. It may therefore be worthwhile adding a special case in the merge_ancestors code path to deal with this case, where given two segment chains a and b, we have the right-most segment of a is < the left most segment of b. However, we currently don't record the extremities of the segments per individual, so we would need to refactor things somewhat.

This would also only be worth doing if the number of segments in a and b we reasonably large, so a first step would be to do some exploratory work to see what the average number of segments in this "gather" phase of the simulation really is. If anyone is interested in making their Drosophila simulations go faster, then this would be a good place to start.

@GertjanBisschop
Copy link
Member

I have run simulations with Drosophila-like parameters:
num_reps = 500, num_samples = 20, rho = 4 * 1e6 * 8.4e-9, sequence_length = 5e6

Every $0.5 * 2 N_e$ generations information was gathered on

  • the number of lineages present
  • the average width of the hull (as a fraction of the sequence length) = right edge of last interval - left edge of first interval containing ancestral material.
  • the average amount of ancestral material (as a fraction of the sequence length) per lineage

all_data

Zoomed-in version:
zoomed

Heatmap showing the hull width (yellow) for a single run for all lineages present at time = $5 * 2 N_e$ generations. Sequence length (x-axis) has been scaled by a factor of 10 on the plot. Each row represents a single lineage.

single_run_sorted

@jeromekelleher
Copy link
Member Author

Very interesting thanks @GertjanBisschop! I think we need one more bit of information: what is the average number of segments per lineage at these time points?

It looks like the probability of two randomly chosen lineages overlapping is small (as expected), but our special case is worthwhile only if the number of segments the lineages carry is more than (say) 5.

@GertjanBisschop
Copy link
Member

Median (left) and mean (right) number of segments per lineage:
mean_num_segments
As expected, the number of segments per lineage comes down pretty quickly. There is only a small window during the coalescent process where the mean hull width is already quite small but where lineages on average still consist of quite a few segments.

@jeromekelleher
Copy link
Member Author

Hmm, ok, so this means that this special case we're talking about above won't have much effect.

Great work @GertjanBisschop - you just saved us a whole bunch of refactoring that would have ended in disappointment!

I'm going to close issue. We may want to implement the special case later if we change the data structures around a bit so that it's easy (i.e., we store the head and tail of the lineages segments, not just the head) but it's not worth doing that refactoring just for this minor optimisation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants