Ancestor generation requires entire genotype array to be in-memory. #806

benjeffery · 2023-02-21T23:56:48Z

As currently implemented, AncestorsGenerator.add_sites loads the entire sample data genotype array into memory. As one of our currently intended inference targets has 1.8TB of genotypes this is not possible.

We can of course read the genotypes from the zarr-backed sample data as needed (in break_ancestor, make_ancestor and compute_ancestral_state) but care will need to be taken to have some kind of decoded chunk caching mechanism. I thought this might be simple to do with zarr, but zarr-developers/zarr-python#306 has been open for a long time.
A simple FIFO or LRU cache might not be too difficult to implement, as long as the cache is larger than the typical ancestor length chunks shouldn't need to be loaded more than once.

@savitakartik This is what is OOM killing your jobs!

The text was updated successfully, but these errors were encountered:

jeromekelleher · 2023-02-22T09:26:09Z

I'm not sure we can do that @benjeffery - it's the C implementation of AncestorBuilder that's being called, and that's making some hard assumptions about fitting the genotypes into memory IIRC.

We'll need to think carefully about this.

benjeffery · 2023-02-22T09:41:11Z

Yes, this gets tricky pretty fast. One way would be to give the C code a callback to get genotypes, but the memory management will be fiddly.
I'll do some more reading to get my head around the access pattern, if we can do this in (overlapping?) chunks then it would be ideal, but I guess it isn't that simple.

jeromekelleher · 2023-02-22T09:46:25Z

If we packed down to 1 bit per genotype, would we manage? We can exclude singletons etc, so not all sites need to be considered.

benjeffery · 2023-02-22T09:50:23Z

Maybe simpler to do a numba version of the ancestor builder and keep it all in python?
Even with 1bit we're at ~250GB and that's for the smallest chromosome. (Although as you say singletons will reduce this)

jeromekelleher · 2023-02-22T10:02:30Z

Yeah, I was thinking along the same lines, but good to look at the possibilities.

jeromekelleher · 2023-02-22T10:03:44Z

Numba/zarr won't be trivial either because we'll have to explicitly work in chunks

benjeffery · 2023-02-22T10:09:13Z

It is my understanding that the algorithm sweeps left and right from the focal site until half the samples are evicted. If we had a way to determine a rough distance from the focal site where (say) 99% of ancestor finding is contained we can load that chunk and keep the existing code the same (except for adding an additional stopping criteria when you hit the end of the loaded chunk).
If that 1% of long ancestors is essential, then the process can halt while the chunks further away are loaded, but that is more complex.

Will have a think on my bike ride in.

jeromekelleher · 2023-02-22T14:56:12Z

@benjeffery and I discussed this and we think a pragmatic approach may be to bit-pack the genotype data to reduce the memory requirements four-fold.

tomwhite · 2023-03-03T09:41:01Z

Can you run it on a large machine in the cloud? On GCP for example, the M2 machine series has up to 12TB of memory.

jeromekelleher · 2023-03-03T09:56:06Z

It's a good point - we're definitely looking at the "lets throw some money at it" approach, but there are complexities on where these datasets can be processed because of data access agreements.

Closes tskit-dev#806

jeromekelleher added a commit to jeromekelleher/tsinfer that referenced this issue Apr 3, 2023

Add mmap_temp_dir option to generate_ancestors

7cb7161

Closes tskit-dev#806

jeromekelleher added a commit to jeromekelleher/tsinfer that referenced this issue Apr 5, 2023

Add mmap_temp_dir option to generate_ancestors

99fb936

Closes tskit-dev#806

jeromekelleher added a commit to jeromekelleher/tsinfer that referenced this issue Apr 5, 2023

Add mmap_temp_dir option to generate_ancestors

c126563

Closes tskit-dev#806

jeromekelleher added a commit to jeromekelleher/tsinfer that referenced this issue Apr 5, 2023

Add mmap_temp_dir option to generate_ancestors

ff30771

Closes tskit-dev#806

jeromekelleher added a commit to jeromekelleher/tsinfer that referenced this issue Apr 5, 2023

Add mmap_temp_dir option to generate_ancestors

0437b60

Closes tskit-dev#806

mergify bot closed this as completed in 5123fbf Apr 13, 2023

benjeffery pushed a commit to benjeffery/tsinfer that referenced this issue Apr 18, 2023

Add mmap_temp_dir option to generate_ancestors

8f1bc7e

Closes tskit-dev#806

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ancestor generation requires entire genotype array to be in-memory. #806

Ancestor generation requires entire genotype array to be in-memory. #806

benjeffery commented Feb 21, 2023

jeromekelleher commented Feb 22, 2023

benjeffery commented Feb 22, 2023

jeromekelleher commented Feb 22, 2023

benjeffery commented Feb 22, 2023 •

edited

Loading

jeromekelleher commented Feb 22, 2023

jeromekelleher commented Feb 22, 2023

benjeffery commented Feb 22, 2023

jeromekelleher commented Feb 22, 2023

tomwhite commented Mar 3, 2023

jeromekelleher commented Mar 3, 2023

Ancestor generation requires entire genotype array to be in-memory. #806

Ancestor generation requires entire genotype array to be in-memory. #806

Comments

benjeffery commented Feb 21, 2023

jeromekelleher commented Feb 22, 2023

benjeffery commented Feb 22, 2023

jeromekelleher commented Feb 22, 2023

benjeffery commented Feb 22, 2023 • edited Loading

jeromekelleher commented Feb 22, 2023

jeromekelleher commented Feb 22, 2023

benjeffery commented Feb 22, 2023

jeromekelleher commented Feb 22, 2023

tomwhite commented Mar 3, 2023

jeromekelleher commented Mar 3, 2023

benjeffery commented Feb 22, 2023 •

edited

Loading