Long ancient ancestors / slowness in HGDP analysis #276

Closed

hyanwong opened this issue Jun 15, 2020 · 25 comments · Fixed by #467

Comments

@hyanwong commented Jun 15, 2020

Because of the way we build ancestors, the oldest ancestors end up being extremely long (since nearly all sites are younger than them, and are therefore skipped over when deciding where to stop building to the left and right).

So far we have managed to get away with this because the datasets we've previously been using have few high-frequency sites, but that's changing. For instance, in the 1000 Genomes data there are 2120 sites at frequency N-2, but in the HGDP there are over twice that number (4863). That means that in HGDP a lot of ancient ancestors span the entire genome (~17,000 out of 65,000, versus 10,000 such sites out of 700,000 in the TGP). Here's a plot of the median length of ancestors binned by the time of the ancestor (i.e. the frequency of the focal site). You can see that in the HGDP there are a lot more full-length ancient ancestors. The pattern is the same whether we include or exclude missing sites, although because excluding missing data leaves only about a sixth as many sites, the actual number of long ancestors in that case is correspondingly smaller.

[Plot: median ancestor length, binned by ancestor time (focal-site frequency), for TGP and HGDP with and without missing sites]

Moreover, in the 1000G, the 10,000 full length ancestors are distributed over only 87 time values, which allows easy parallelisation. In HGDP with missing data the 17,000 full length ancestors occur over 2136 separate times, and about half these time points have only one ancestor. Note that if we exclude missing data from HGDP it's much more manageable: we only have 2987 full length ancestors over 70 time slices.

Both the large number of long ancestors and the small number of such ancestors per time point are causing inference to be extremely slow for HGDP with missing data. I've been running it using all 40 processes on holly for 40 hours and we're only 2% of the way through match ancestors. This is basically the same as @awohns found. It also explains why I didn't see any effect of including missing sites when I tried inference on smaller subsets of the HGDP: by subsetting I end up with very few high-frequency sites, so inferences with and without missing data are comparable once normalised by the number of sites (i.e. there's nothing in the missing-sites code per se that makes it slower).

We have several possibilities to speed this up:

  1. (crude) exclude from inference any sites that have a frequency greater than, say, 99.9%.
  2. (slightly more sophisticated) exclude from inference only those high-frequency sites that also have some missing data (this would reduce the number of full-length ancestor time points).
  3. group the frequencies together so that e.g. an ancestor at time=freq=0.97956989 cannot copy from one at time=freq=0.98030635, which will allow us to parallelise more efficiently. This is something that @awohns has previously done (see the sketch after this list).
  4. since we expect old ancestors to be short, we could artificially constrain the length of the oldest ancestors somehow, either simply by total distance, or by changing the algorithm that determines when to break (e.g. don't buffer changes, or stop building when the tracked sample set S is (say) 3/4 of the original size, rather than 1/2).
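For illustration, a minimal sketch of what the grouping in option 3 could look like (the bin count and the helper function name are purely illustrative, not anything tsinfer currently does):

import numpy as np

# Hypothetical helper: snap each ancestor time (frequency) to the lower edge of a
# coarse bin, so that ancestors at nearly identical frequencies share a time slice
# and can be matched in parallel.
def bin_times(freqs, num_bins=20):
    edges = np.linspace(0, 1, num_bins + 1)
    return edges[np.digitize(freqs, edges[1:-1])]

freqs = np.array([0.97956989, 0.98030635, 0.5])
print(bin_times(freqs))  # -> [0.95, 0.95, 0.5]: the two old ancestors now share a slice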
@jeromekelleher

Very interesting, thanks @hyanwong. I think a mixture of something along the lines of 3 and 4 will be the long-term solution to this problem. In terms of making things parallelise better, there's probably a spatial component that we could introduce (old ancestors for focal sites at either end of the genome are not expected to copy from each other), but it would need a little head-scratching to do properly. We'd also want to decide whether to do things exactly or whether some approximation is acceptable.

A couple of questions:

  1. What's the rough processor utilisation when doing the early stages of HGDP (if it's hovering around 100% then parallelisation is a big issue)?
  2. How much of these full-length ancient ancestors do we end up using? I.e., how long do the nodes span in the eventual tree sequence (for 1000G, say)? If it's quite small then we're well justified in taking some drastic cutoffs to shorten them.

@jeromekelleher

In the short term either (1) or (2) is fine though, I'd imagine.

@awohns commented Jun 15, 2020

Thanks for this in-depth post @hyanwong. I'm unsure how (3) would affect the quality of the inference, but I imagine it can't matter that much? 2136 separate times seems like far too many, especially if there's a new time slice for every possible combination of the number of missing sites and the frequency at non-missing sites.

Digging into why there are so many full length ancestors, you say:

In HGDP with missing data the 17,000 full length ancestors occur over 2136 separate times, and about half these time points have only one ancestor. Note that if we exclude missing data from HGDP it's much more manageable: we only have 2987 full length ancestors over 70 time slices.

Should tsinfer be making ancestors with missing data so long? Is it that when missing data is included, many of these long ancestors are missing a lot of data and are only at high frequency at non-missing sites?

@hyanwong

  1. What's the rough processor utilisation when doing the early stages of HGDP (if it's hovering around 100% then parallelisation is a big issue)?

I'm not sure how to find this out programmatically. I'm fairly sure that when I've looked at the progress bar, about half the time it's been stuck with nanc=1 displayed.

  2. How much of these full-length ancient ancestors do we end up using? I.e., how long do the nodes span in the eventual tree sequence (for 1000G, say)? If it's quite small then we're well justified in taking some drastic cutoffs to shorten them.

The mean length of the oldest (say) 800 nodes in the 1000G tree sequence is 0.1% of the total length of the sequence (i.e. about 60 kb). The median is smaller (0.01%) and the maximum length of any old ancestor is ~9% of the entire TS.
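For reference, here is a rough sketch of how such node spans could be measured with tskit; the file name is hypothetical, and the span is approximated as the distance from the leftmost to the rightmost edge for which a node is a parent:

import collections
import numpy as np
import tskit

ts = tskit.load("tgp_chr20.trees")  # hypothetical file name

# Approximate each node's span as the extent of the edges where it is a parent
left = collections.defaultdict(lambda: np.inf)
right = collections.defaultdict(float)
for edge in ts.edges():
    left[edge.parent] = min(left[edge.parent], edge.left)
    right[edge.parent] = max(right[edge.parent], edge.right)
span = {u: right[u] - left[u] for u in right}

# Summarise the oldest 800 parent nodes as a fraction of the sequence length
node_time = ts.tables.nodes.time
oldest = sorted(span, key=lambda u: node_time[u], reverse=True)[:800]
frac = np.array([span[u] for u in oldest]) / ts.sequence_length
print(f"mean {frac.mean():.3%}, median {np.median(frac):.3%}, max {frac.max():.3%}")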

@hyanwong

I would guess all these full length ancestors are occurring at separate times because of the missing data?

@awohns - the old ancestors with only one ancestral haplotype in a time slice are likely to come from sites where there is missing data. But they need not be at a low frequency if missing genotypes are treated as ancestral. I suspect it happens where there are a few samples (but not loads) with missing data and only one or two have the ancestral state.

Should tsinfer be making ancestors with missing data so long? Is it that when missing data is included,

It's an artefact of the way we make ancestors. Old ones don't have many contradictory sites, simply because they are old, so we just carry on extending them. We have just as many long ancestors (proportionally) in the HGDP even when we remove all the missing sites (that's what the graph shows), so it's not a feature of the missingness in particular. The missingness simply makes it harder to parallelize.

@jeromekelleher

I'm not sure how to find this out programmatically. I'm fairly sure that when I've looked at the progress bar, about half the time it's been stuck with nanc=1 displayed.

Just look at top. If the process is using 4000% CPU, things are good. If it's 100%, not so good.

@hyanwong

Just look at top. If the process is using 4000% CPU, things are good. If it's 100%, not so good.

Good point. Mostly 100%.

@jeromekelleher

The mean length of the oldest (say) 800 nodes in the 1000G tree sequence is 0.1% of the total length of the sequence (i.e. about 60 kb). The median is smaller (0.01%) and the maximum length of any old ancestor is ~9% of the entire TS.

OK, good, so most of the older ancestral haplotypes we're generating are garbage. So the question is, how do we stop making these old ancestors so long?

Good point. Mostly 100%.

OK, good also. So we're basically processing these long garbage ancestors one-by-one. This is good news, we just have a technical problem to solve rather than a fundamental problem.

@hyanwong commented Jun 15, 2020

OK, good, so most of the older ancestral haplotypes we're generating are garbage. So the question is, how do we stop making these old ancestors so long?

Yes, exactly. The obvious thing (to me) is to change the percentage cutoff of the sample set at which we terminate (i.e. min_sample_set_size here). I'm experimenting with this now. Meanwhile, I will leave the full version running, as I estimate it will only take another day to complete, and then we'll have something for comparison against other methods.

OK, good also. So we're basically processing these long garbage ancestors one-by-one. This is good news, we just have a technical problem to solve rather than a fundamental problem.

Yes, the missing data forces one-by-one-ness. The obvious workaround is to group the ancestors into time slices, as @awohns did to speed up inference when using his properly dated ancestors in the iterative step of tsinfer+tsdate inference.

@hyanwong commented Jun 18, 2020

Here's one way of illustrating the issue. Above the y=1 line, the value shows the proportion of the ancestor that is unneeded excess (i.e. the amount to the left and right of the true ancestral node that we shouldn't have needed to build, as a proportion of the true node length). Below the y=1 line, we show the proportion of the true ancestral node that is actually covered by our reconstructed ancestor (i.e. 1 = complete overlap with no excess, 0 = no overlap at all between true and reconstructed ancestors). This is for an OOA simulation with 60 samples.

[Plot: excess length (above y=1) and overlap (below y=1) of reconstructed ancestors by focal-site frequency; OOA simulation, 60 samples; log scales]

It's the insanely high excess on the right that is causing such slow inference (these are excessively long ancestors: note the log scales). What we want to do is to bring down the curve on the right without lowering the number of points below the line.

@hyanwong

Interestingly, from the plot above, the most egregious shortfalls in ancestral haplotype length are in the doubletons (the first set of points in the vertical line on the left), which constitute over a third of the total number of too-short ancestors. If we reduce the min_sample_set_size to less than the current value of num_focal_samples/2, then the ancestors associated with doubletons will end up extending across the whole sequence, which could kill performance too (especially given that doubletons are so common). But I'm wondering if there is another way to keep extending the doubleton ancestors a bit further, @jeromekelleher?

@hyanwong commented Jun 19, 2020

I've been experimenting, and weirdly, increasing the length of ancestors associated with doubletons actually increases both the number of edges and (more particularly) the number of mutations, across the board, for all mut_rate values. However, we seem to be able to increase the min_sample_set_size value for the oldest ancestors without affecting accuracy (which is not surprising, as these are invariably thousands of times longer than the real ancestors).

I have been doing a bit of experimenting with the cutoff proportion, defined such that e.g. if the proportion is 0.75 then we stop building when the sample set size drops to 75% of the original number of focal samples (the previous algorithm was equivalent to using a constant cutoff proportion of 0.5). An easy function that does pretty much what we require is cutoff_proportion = (freq^x + 1) / 2. This is 0.5 for low frequencies, but increases to near 1 for high frequencies. The value x, or "cutoff exponent", determines how quickly we approach 1, and a value of x=4 seems to give a reasonable balance between not overly shortening medium-aged ancestors and setting sensibly high cutoff proportions for the oldest ancestors. In fact, it hardly deviates from 0.5 until we are well over a frequency of a half:
[Plot: cutoff_proportion = (freq^4 + 1)/2 as a function of focal-site frequency]

Note that since we never have inference sites at a frequency of 1, the cutoff_proportion never reaches 1 (if, say, we have a million samples, then even with a site at the maximum frequency of 999999 we will allow 3 samples to disagree before stopping the ancestor builder: (((999999/1000000)^4 + 1)/2) * 999999 ≈ 999997).
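As a quick illustration, here's a minimal Python sketch mirroring the formula above (not the actual C implementation in the ancestor builder):

def min_sample_set_size(num_focal_samples, num_samples, x=4):
    # cutoff_proportion is ~0.5 at low frequencies and approaches 1 near frequency 1
    freq = num_focal_samples / num_samples
    cutoff_proportion = (freq ** x + 1) / 2
    return cutoff_proportion * num_focal_samples

# The worked example above: a million samples, focal site at frequency 999999/1000000
print(min_sample_set_size(999999, 1000000))  # ~999997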

On simulated data with error, this seems to work: I've plotted the overlap & excess as in the comment above, with the original values in blue, as before, and the new values in red. The left is an OOA simulation with error and gene conversion over a little more than 5 Mb of sequence with 60 samples; the right is with 600 samples. In both cases the new algorithm hardly increases the number of ancestors with overlap <100% (for 60 samples, there were 1511 not-long-enough ancestors before, now 1519; for 600 samples, 6784 before, now 6793), but as you can see from the top end, the ancestors associated with the highest frequencies are noticeably shorter (there's a lower excess, seen as a red patch below the blue, also reflected in the average lengths of high-frequency ancestors, printed at the top). As hoped, the number of full-length ancestors also decreases quite notably (from 248 to 106 in the small example, and from 718 to 197 in the larger one).

[Plots: overlap & excess for the original (blue) and new (red) cutoff, OOA simulations with 60 samples (left) and 600 samples (right)]

This seems a reasonably conservative way to make our inference perhaps 4 times faster, without impacting accuracy. I'm running it on full data sets now. Running this version on the 1000G samples file cuts the number of full-length ancestors from 9097 to 1945 out of a total of 699386 ancestors. Running it on the HGDP with missing data reduces the number of full length ancestors from 16084 to 5446 out of 642344.

@hyanwong commented Jun 19, 2020

Some feedback from full TGP chr20 estimates. The original cutoff, min_sample_set_size = sample_set_size/2, takes about 53,870 CPU minutes to run on cycloid. The new cutoff, min_sample_set_size = sample_set_size * (pow((double) sample_set_size/num_samples, 4) + 1.0)/2.0, takes about 14,100 CPU minutes on cycloid, so reducing the number of full-length ancestors from 9097 to 1945 using my new cutoff function makes the inference algorithm about 4 times faster. We lose a very, very tiny bit of accuracy: there's a 0.075% increase in the number of trees and a 0.03% increase in the number of edges+mutations, but that's easily within the margin of variation given by slightly different parameter values.

@hyanwong

I've also just noticed that 1943 of the 1945 full-length ancestors in the TGP ancestors file are at the highest possible frequency. With the current algorithm, they can't ever conflict with anything, which is why they span the entire genome. If we could cut these down in size, I think we would run blazingly fast, and we could run e.g. TGP chromosome 1 in a few hours.

One possibility for ancestors at the oldest time points is to allow a site at the same frequency to conflict during the ancestor building phase. Normally, adjacent sites at the same frequency are ignored.

@jeromekelleher

I've also just noticed that 1943 of the 1945 full-length ancestors in the TGP ancestors file are at the highest possible frequency. With the current algorithm, they can't ever conflict with anything, which is why they span the entire genome. If we could cut these down in size, I think we would run blazingly fast, and we could run e.g. TGP chromosome 1 in a few hours.

Well spotted! I think it's worth trying this out by hand to see if it really will make a difference. Maybe we could edit the ancestors file to make these ancestors a few kb long (say) instead of the full span and see how much this improves run time? I hadn't thought that tweaking this dimension would make such a big perf difference, so just having a bit of informal data here would be very helpful.

@hyanwong commented Jun 29, 2020

So the initial profiling runs for a relatively large whack of HGDP chr20 (20,000 sites) are finished. Here's what I found. In the plot below, the numbers below each pair of points are the "cutoff-exponent" value described above, which I previously and somewhat arbitrarily suggested setting to 4 (when it is Inf it is equivalent to the previous behaviour).

It looks like there is a roughly linear relationship trading off file size (edges+muts) against speed, but the file size changes only by a very small amount, so, for instance, an increase in muts+edges of 0.386% (from x=inf to x=4) gives a roughly 3x speedup for HGDP - not quite as good as I had hoped, but still not too shabby.

The idea of setting the oldest ancestors to break if they conflict sufficiently even with ancestors in the same timeslice as themselves (i.e. treating the oldest ones as "special") seems to basically make no difference for HGDP. I suppose this might be because of the missing data effect: there are not likely to be many individuals at the oldest time point if 80% of the data has missing sites.

[Plot: muts+edges versus CPU time for different cutoff-exponent values on the HGDP chr20 subset]

@awohns commented Feb 11, 2021

We've returned to this issue when inferring a combined tree sequence of multiple datasets.

@hyanwong proposed the following approach, and my implementation seems to make a minimal difference in simulations. It also seems to really cut down runtime on real human chromosomes.

  1. Generate ancestors as normal.
  2. Look at the longest ancestor (i.e. greatest span) in the time (frequency) band from 0.4 to 0.6, and set max_length to twice this value. The factor of 2 is arbitrary; we could remove it.
  3. Look at all ancestors older than 0.6: if they are longer than max_length, only allow the leftmost focal site to extend max_length/2 to the left and the rightmost focal site to extend max_length/2 to the right.
  4. Change the ancestors_start, ancestors_end and ancestors_haplotype arrays as appropriate (see the rough sketch after this list).
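A rough standalone sketch of steps 2-4 is below. It assumes plain numpy arrays of ancestor start/end/time plus a list of per-ancestor focal-site position arrays; the function name, arguments and defaults are illustrative rather than the eventual tsinfer API:

import numpy as np

def truncate_ancestor_intervals(start, end, time, focal_sites,
                                lower=0.4, upper=0.6, multiplier=2):
    # max_length is set from the longest ancestor in the [lower, upper] time band
    lengths = end - start
    in_band = (time >= lower) & (time <= upper)
    max_length = multiplier * lengths[in_band].max()
    new_start, new_end = start.copy(), end.copy()
    # Only ancestors older than `upper` and longer than max_length are truncated
    for j in np.where((time > upper) & (lengths > max_length))[0]:
        leftmost, rightmost = focal_sites[j].min(), focal_sites[j].max()
        new_start[j] = max(start[j], leftmost - max_length / 2)
        new_end[j] = min(end[j], rightmost + max_length / 2)
    return new_start, new_end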

I propose adding this to generate_ancestors, toggling it as a parameter called truncate_ancestors which is set to False by default. I'm running hgdp+sgdp+1kg chromosome 20 with this truncation to see how output changes. If the change is minimal and bigger simulations look good too, then I think this is a reasonable adjustment to include.

Here's a visualisation of what this truncation does.

import msprime
import numpy as np
import matplotlib.pyplot as plt
import tsinfer

# Simulate some data and generate ancestors from it
sim = msprime.sim_ancestry(150, recombination_rate=1e-8, sequence_length=5e6,
                           population_size=10000, random_seed=1)
mutated = msprime.sim_mutations(sim, rate=1e-8, random_seed=1)
samples = tsinfer.formats.SampleData.from_tree_sequence(mutated, use_sites_time=False)
ancestors = tsinfer.generate_ancestors(samples, num_threads=40)

# truncate_ancestors is the proposed (not yet merged) truncation step
ancestor_data = truncate_ancestors(ancestors)

# Compare ancestor lengths before and after truncation
fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharex=True, sharey=True)
axes[0].scatter(np.array(ancestors.ancestors_time), ancestors.ancestors_length)
axes[1].scatter(np.array(ancestor_data.ancestors_time), ancestor_data.ancestors_length)
axes[0].set_title("Untruncated Ancestors")
axes[1].set_title("Truncated Ancestors")
for ax in axes.ravel():
    ax.set_xlabel("Ancestor Time")
    ax.set_ylabel("Ancestor Length")

[Scatter plots: ancestor length against ancestor time, untruncated (left) vs truncated (right)]

I'll PR the truncation code if we think this is the way to go.

@jeromekelleher

Yes, this seems like a good pragmatic approach. I agree with the plan.

@hyanwong

SGTM. I wonder whether we should add it as a parameter to generate_ancestors, or have it as a method on an AncestorData object that returns a modified AncestorData instance? That way we can parameterise it (with sensible defaults), like

new_ancestors = ancestors.truncate_oldest(above_time=0.6, lower_time_bound=0.4, length_multiplier=2)

You can probably think of better parameter names. Remember that a user might provide times other than 0..1 (as we do in the second round of iteration - although in our pipeline we are presumably retaining the shorter ancestors from the previous round). So I wonder if there's a more general way of specifying the "middle" of the time range that doesn't rely on times being in 0..1. I'm probably just being too pernickety though.

On this note, it strikes me as pretty useful to know the meaning of "time" in an AncestorData instance, which would help us to know whether things like this are sensible, and would also (presumably) be passed on or checked against the SampleData file before being added to the TS metadata. I'll open this up as an issue.

@hyanwong commented Feb 11, 2021

Tiny nitpick, if we are actually going to add this to tsinfer: we should probably explicitly check whether the leftmost site is within max_length/2 of 0, and if so, add the extra on to the rightmost cutoff, and vice versa. That means that even if the factor of 2 is not included in max_length, we never cut a haplotype down to less than max_length.

The way we do it at the moment, a haplotype with a focal site at 0 could end up being max_length/2 bases long, not max_length bases long.

Unrelated to this, I think we should keep the factor of 2 (or something like that) in there anyway, to be conservative. Fingers crossed that this solves one of our problems.

@awohns commented Feb 11, 2021

Great points, thanks @hyanwong. Will write up the PR

@awohns commented Feb 11, 2021

Actually, I'm just thinking over your second point now @hyanwong, and I think what you're proposing is that we try to balance all the ancestors, not just the ones near the edges. If we only check the condition you mention, we won't catch unbalanced ancestors elsewhere: for instance, what about an ancestor with one focal site, where the haplotype extends 10 Mb to the left and 500 kb to the right, and max_length is 2 Mb? Following your logic ("balanced haplotypes") we would keep 1.5 Mb to the left and 500 kb to the right. Seems sensible.
The "balanced ancestor" approach works for single focal sites, but what about when there are two focal sites? Say we have an ancestor with two focal sites 1.9 Mb apart on a haplotype that is 10 Mb long, with a max_length of 2 Mb. It seems to me that in this case we actually do want to go max_length/2 to the right and left of the rightmost and leftmost focal sites.
Also, I think we just have to accept that ancestors with 2 or more focal sites that are more than max_length apart will be longer than max_length.

@awohns commented Feb 11, 2021

I guess the simpler thing to do in all cases is just to go max_length/2 to the right of the rightmost focal site and to the left of the leftmost focal site, but I'm ready to be convinced that balancing the haplotypes is really necessary.

@hyanwong

Yes, it's simpler just to go max_length/2 (which justifies the doubling factor in max_length), but it seems a little less principled. I guess the principle here is to never trim down to less than max_length in total (unless the haplotype is actually smaller than that anyway)

It's probably not too bad. You just need to do something like this, I think?

# Balanced truncation for a single ancestor (untested sketch)
min_left = max(0, start)
max_right = min(seq_len, end)
left_overhang = min(focal_site_pos) - min_left     # available to the left of the leftmost focal site
right_overhang = max_right - max(focal_site_pos)   # available to the right of the rightmost focal site
keep_left = min(left_overhang, max_length / 2)
keep_right = min(right_overhang, max_length / 2)
# If one side has less than max_length/2 available, give the surplus to the other side
if keep_left < max_length / 2:
    keep_right = min(right_overhang, keep_right + (max_length / 2 - keep_left))
elif keep_right < max_length / 2:
    keep_left = min(left_overhang, keep_left + (max_length / 2 - keep_right))
new_start = min(focal_site_pos) - keep_left
new_end = max(focal_site_pos) + keep_right

Or something like that. I haven't checked the logic.

@jeromekelleher

I guess the simpler thing to do in all cases is just to go max_length/2 to the right of the rightmost focal site and to the left of the leftmost focal site, but I'm ready to be convinced that balancing the haplotypes is really necessary

Agree with this - let's do the simplest thing we can think of first. I don't think there's any point in trying to make guarantees about the lengths of ancestors, this is all highly heuristic in the first place.

So, just truncate at max_length/2 on each side: take the left-most focal site on the left, and the right-most focal site on the right. (Nearly always there will be one focal site - it's probably not worth worrying about what the right thing to do is when we have more than one.)

Anyway, let's see if the simplest rule we can come up with works or not before we start discussing whether we should make it more complicated.
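For concreteness, that simplest rule amounts to something like the following for a single ancestor (same hypothetical per-ancestor variables as in the sketch above):

# Keep at most max_length/2 either side of the outermost focal sites
new_start = max(start, min(focal_site_pos) - max_length / 2)
new_end = min(end, max(focal_site_pos) + max_length / 2)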
