
Chunked data loading for large datasets #17

Open
zhonge opened this issue Aug 7, 2020 · 6 comments
Labels
enhancement New feature or request

Comments

@zhonge
Collaborator

zhonge commented Aug 7, 2020

The default behavior is to load the whole dataset into memory for training. For particularly large datasets that don't fit in memory, an option (--lazy) is currently provided to load images from disk on the fly during training. This is a very poor filesystem access pattern, and disk latency can become a severe bottleneck in some cases.

It probably makes more sense to preprocess the data into chunks (of what size?) and train on each chunk sequentially. This gives slightly less randomness in the minibatches, but assuming there is no order in the dataset, that likely doesn't matter. The FFT and normalization could also be folded into this preprocessing step. The main downside I see is the additional disk space needed to store the chunks.
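Purely as a sketch of the idea, assuming NumPy-array images and an arbitrary chunk size; preprocess_to_chunks and its file naming are illustrative, not part of cryoDRGN:

```python
import numpy as np

def preprocess_to_chunks(images, out_prefix, chunk_size=50000):
    """Split a dataset into fixed-size chunks, applying FFT + normalization
    up front so training only has to stream the chunks sequentially.

    `images` is any iterable of 2D numpy arrays; `chunk_size` is a guess:
    large enough to amortize disk reads, small enough to fit in memory.
    """
    buf, n_chunk = [], 0
    for img in images:
        ft = np.fft.fftshift(np.fft.fft2(img))        # centered 2D FFT
        ft = (ft - ft.mean()) / (ft.std() + 1e-8)     # per-image normalization
        buf.append(ft.astype(np.complex64))
        if len(buf) == chunk_size:
            np.save(f"{out_prefix}.chunk{n_chunk}.npy", np.stack(buf))
            buf, n_chunk = [], n_chunk + 1
    if buf:                                           # flush the last partial chunk
        np.save(f"{out_prefix}.chunk{n_chunk}.npy", np.stack(buf))
```

At training time each chunk could then be read sequentially and shuffled only within itself, trading a little minibatch randomness for streaming disk reads.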

@zhonge zhonge added the enhancement New feature or request label Aug 7, 2020
@aleksspasic

Hello @zhonge

I wonder if you have any updates regarding this issue. We are working on a very large dataset (~2.5M images) that can't fit in RAM, and using the --lazy flag makes training too slow to be useful.

Thanks in advance,
Alex

@zhonge
Collaborator Author

zhonge commented Jan 12, 2021

Thanks for the heads up. I can prioritize this feature.

@zhonge
Collaborator Author

zhonge commented Apr 21, 2021

Just as an additional data point: for a 1.4M-particle dataset (D=128) I'm trying out, training time goes from 43 min to 5 h 50 min per epoch when switching from loading the whole dataset into memory to using the --lazy flag.

Todo: Look into the access patterns for processing large datasets in RELION/cryoSPARC/etc.

@Guillawme
Contributor

@aleksspasic have you tried temporarily copying the particle stack to an SSD (assuming you have one) and running the cryoDRGN training while reading particles from there? Optimizing disk access patterns will only get you so far if the data sits on regular hard drives.

@zhonge
Collaborator Author

zhonge commented Jul 10, 2021

I added a new script, cryodrgn preprocess, which preprocesses images before training and significantly reduces the memory requirement of cryodrgn train_vae. It is now available at the top of the tree (commit d4b2195). I'm going to beta test it a little further before officially releasing it.

Some brief documentation here (linked to in the tutorial): https://www.notion.so/cryodrgn-preprocess-d84a9d9df8634a6a8bfd32d6b5e737ef

@zhonge
Collaborator Author

zhonge commented Nov 15, 2022

@vineetbansal, we should think about how to implement chunked data loading instead of the current options of either 1) loading the whole dataset into memory or 2) accessing each image on the fly.

One issue is that images are usually ordered (e.g. by defocus). One option is to shuffle the entire dataset, but this seems extremely suboptimal for many reasons... Another option could be to randomly sample a few smaller chunks, load them all, then train on random minibatches within the combined chunk.
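For what it's worth, a minimal sketch of that last option as a PyTorch IterableDataset; the .npy chunk files and the chunks_in_memory parameter are assumptions for illustration, not an existing cryoDRGN format:

```python
import random
import numpy as np
import torch
from torch.utils.data import IterableDataset, DataLoader

class ChunkShuffleDataset(IterableDataset):
    """Yield images by loading a few random chunks at a time and shuffling
    within the combined buffer, a compromise between full-dataset shuffling
    and purely sequential access.
    """
    def __init__(self, chunk_paths, chunks_in_memory=2):
        self.chunk_paths = list(chunk_paths)
        self.chunks_in_memory = chunks_in_memory

    def __iter__(self):
        order = self.chunk_paths[:]
        random.shuffle(order)                      # shuffle chunk order each epoch
        for i in range(0, len(order), self.chunks_in_memory):
            group = order[i:i + self.chunks_in_memory]
            buf = np.concatenate([np.load(p) for p in group])  # load a few chunks
            idx = np.random.permutation(len(buf))  # shuffle within the combined buffer
            for j in idx:
                yield torch.from_numpy(buf[j])

# loader = DataLoader(ChunkShuffleDataset(chunk_paths), batch_size=256)
```

With num_workers > 0 the chunk list would also need to be partitioned across workers to avoid duplicated images, but the basic trade-off (chunk-level randomness, in-memory minibatch shuffling) is the same.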
