Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ancestor generation requires entire genotype array to be in-memory. #806

Closed
benjeffery opened this issue Feb 21, 2023 · 10 comments
Closed

Comments

@benjeffery
Copy link
Member

As currently implemented, AncestorsGenerator.add_sites loads the entire sample data genotype array into memory. As one of our currently intended inference targets has 1.8TB of genotypes this is not possible.

We can of course read the genotypes from the zarr-backed sample data as needed (in break_ancestor, make_ancestor and compute_ancestral_state) but care will need to be taken to have some kind of decoded chunk caching mechanism. I thought this might be simple to do with zarr, but zarr-developers/zarr-python#306 has been open for a long time.
A simple FIFO or LRU cache might not be too difficult to implement, as long as the cache is larger than the typical ancestor length chunks shouldn't need to be loaded more than once.

@savitakartik This is what is OOM killing your jobs!

@jeromekelleher
Copy link
Member

I'm not sure we can do that @benjeffery - it's the C implementation of AncestorBuilder that's being called, and that's making some hard assumptions about fitting the genotypes into memory IIRC.

We'll need to think carefully about this.

@benjeffery
Copy link
Member Author

Yes, this gets tricky pretty fast. One way would be to give the C code a callback to get genotypes, but the memory management will be fiddly.
I'll do some more reading to get my head around the access pattern, if we can do this in (overlapping?) chunks then it would be ideal, but I guess it isn't that simple.

@jeromekelleher
Copy link
Member

If we packed down to 1 bit per genotype, would we manage? We can exclude singletons etc, so not all sites need to be considered.

@benjeffery
Copy link
Member Author

benjeffery commented Feb 22, 2023

Maybe simpler to do a numba version of the ancestor builder and keep it all in python?
Even with 1bit we're at ~250GB and that's for the smallest chromosome. (Although as you say singletons will reduce this)

@jeromekelleher
Copy link
Member

Yeah, I was thinking along the same lines, but good to look at the possibilities.

@jeromekelleher
Copy link
Member

Numba/zarr won't be trivial either because we'll have to explicitly work in chunks

@benjeffery
Copy link
Member Author

It is my understanding that the algorithm sweeps left and right from the focal site until half the samples are evicted. If we had a way to determine a rough distance from the focal site where (say) 99% of ancestor finding is contained we can load that chunk and keep the existing code the same (except for adding an additional stopping criteria when you hit the end of the loaded chunk).
If that 1% of long ancestors is essential, then the process can halt while the chunks further away are loaded, but that is more complex.

Will have a think on my bike ride in.

@jeromekelleher
Copy link
Member

@benjeffery and I discussed this and we think a pragmatic approach may be to bit-pack the genotype data to reduce the memory requirements four-fold.

@tomwhite
Copy link

tomwhite commented Mar 3, 2023

Can you run it on a large machine in the cloud? On GCP for example, the M2 machine series has up to 12TB of memory.

@jeromekelleher
Copy link
Member

It's a good point - we're definitely looking at the "lets throw some money at it" approach, but there are complexities on where these datasets can be processed because of data access agreements.

jeromekelleher added a commit to jeromekelleher/tsinfer that referenced this issue Apr 3, 2023
jeromekelleher added a commit to jeromekelleher/tsinfer that referenced this issue Apr 5, 2023
jeromekelleher added a commit to jeromekelleher/tsinfer that referenced this issue Apr 5, 2023
jeromekelleher added a commit to jeromekelleher/tsinfer that referenced this issue Apr 5, 2023
jeromekelleher added a commit to jeromekelleher/tsinfer that referenced this issue Apr 5, 2023
@mergify mergify bot closed this as completed in 5123fbf Apr 13, 2023
benjeffery pushed a commit to benjeffery/tsinfer that referenced this issue Apr 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants