
Chunked data loading for large datasets #17

Open
zhonge opened this issue Aug 7, 2020 · 6 comments
Labels
enhancement New feature or request

Comments

@zhonge
Collaborator

zhonge commented Aug 7, 2020

The default behavior is to load the whole dataset into memory for training. For particularly large datasets that don't fit in memory, an option (--lazy) is currently provided to load images from disk on the fly during training. This is a very poor filesystem access pattern, and disk latency can become a severe bottleneck in some cases.

It probably makes more sense to preprocess the data into chunks (of what size?) and train on each chunk sequentially. This gives slightly less randomness in the minibatches, but assuming there is no order in the dataset, that likely doesn't matter. The FFT and normalization could also be folded into this preprocessing step. The main downside I see is the additional disk space needed to store the chunks.
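Purely as a sketch of the idea, assuming NumPy-array images and an arbitrary chunk size; preprocess_to_chunks and its file naming are illustrative, not part of cryoDRGN:

```python
import numpy as np

def preprocess_to_chunks(images, out_prefix, chunk_size=50000):
    """Split a dataset into fixed-size chunks, applying FFT + normalization
    up front so training only has to stream the chunks sequentially.

    `images` is any iterable of 2D numpy arrays; `chunk_size` is a guess:
    large enough to amortize disk reads, small enough to fit in memory.
    """
    buf, n_chunk = [], 0
    for img in images:
        ft = np.fft.fftshift(np.fft.fft2(img))        # centered 2D FFT
        ft = (ft - ft.mean()) / (ft.std() + 1e-8)     # per-image normalization
        buf.append(ft.astype(np.complex64))
        if len(buf) == chunk_size:
            np.save(f"{out_prefix}.chunk{n_chunk}.npy", np.stack(buf))
            buf, n_chunk = [], n_chunk + 1
    if buf:                                           # flush the last partial chunk
        np.save(f"{out_prefix}.chunk{n_chunk}.npy", np.stack(buf))
```

At training time each chunk could then be read sequentially and shuffled only within itself, trading a little minibatch randomness for streaming disk reads.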

@zhonge zhonge added the enhancement New feature or request label Aug 7, 2020
@aleksspasic

Hello @zhonge

I wonder if you have any updates regarding this issue. We are working on a very large dataset (~2.5M images) that can't fit in RAM, and using the --lazy flag makes training too slow to be useful.

Thanks in advance,
Alex

@zhonge
Collaborator Author

zhonge commented Jan 12, 2021

Thanks for the heads up. I can prioritize this feature.

@zhonge
Collaborator Author

zhonge commented Apr 21, 2021

Just as an additional data point: for a 1.4M-particle dataset (D=128) I'm trying out, training time goes from 43 min to 5 h 50 min per epoch when switching from loading the whole dataset into memory to using the --lazy flag.

Todo: Look into the access patterns for processing large datasets in RELION/cryoSPARC/etc.

@Guillawme
Contributor

@aleksspasic have you tried temporarily copying the particle stack to an SSD (assuming you have one) and running the cryoDRGN training while reading particles from there? Optimizing disk access patterns will only get you so far if the data sits on regular hard drives.

@zhonge
Collaborator Author

zhonge commented Jul 10, 2021

I added a new script, cryodrgn preprocess, which preprocesses images before training and significantly reduces the memory requirement of cryodrgn train_vae. It is now available at the top of the tree (commit d4b2195). I'm going to beta test it a little further before officially releasing it.

Some brief documentation here (linked to in the tutorial): https://www.notion.so/cryodrgn-preprocess-d84a9d9df8634a6a8bfd32d6b5e737ef

@zhonge
Collaborator Author

zhonge commented Nov 15, 2022

@vineetbansal, we should think about how to implement chunked data loading instead of the current options of either 1) loading the whole dataset into memory or 2) accessing each image on the fly.

One issue is that images are usually ordered (e.g. by defocus). One option is to shuffle the entire dataset, but this seems extremely suboptimal for many reasons... Another option could be to randomly sample a few smaller chunks, load them all, then train on random minibatches within the combined chunk.
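For what it's worth, a minimal sketch of that last option as a PyTorch IterableDataset; the .npy chunk files and the chunks_in_memory parameter are assumptions for illustration, not an existing cryoDRGN format:

```python
import random
import numpy as np
import torch
from torch.utils.data import IterableDataset, DataLoader

class ChunkShuffleDataset(IterableDataset):
    """Yield images by loading a few random chunks at a time and shuffling
    within the combined buffer, a compromise between full-dataset shuffling
    and purely sequential access.
    """
    def __init__(self, chunk_paths, chunks_in_memory=2):
        self.chunk_paths = list(chunk_paths)
        self.chunks_in_memory = chunks_in_memory

    def __iter__(self):
        order = self.chunk_paths[:]
        random.shuffle(order)                      # shuffle chunk order each epoch
        for i in range(0, len(order), self.chunks_in_memory):
            group = order[i:i + self.chunks_in_memory]
            buf = np.concatenate([np.load(p) for p in group])  # load a few chunks
            idx = np.random.permutation(len(buf))  # shuffle within the combined buffer
            for j in idx:
                yield torch.from_numpy(buf[j])

# loader = DataLoader(ChunkShuffleDataset(chunk_paths), batch_size=256)
```

With num_workers > 0 the chunk list would also need to be partitioned across workers to avoid duplicated images, but the basic trade-off (chunk-level randomness, in-memory minibatch shuffling) is the same.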
