For single-GPU training, without PyTorch DataLoader multiprocessing and without MultiProcDataset, the memory usage of the dataset is probably not much of a problem. However, it is not uncommon to have one or more of the following:
Using distributed multi-GPU training, i.e. having multiple worker processes, which means multiple instances of the dataset in memory.
Using PyTorch DataLoader multiprocessing. This then keeps the dataset in the main proc (although it is freed there if the dataset supports finish_epoch with free_resources, see also MultiProcDataset, high memory usage #1443) and also in all the DataLoader workers.
Using MultiProcDataset, which has the dataset in each of its workers.
You usually have train, dev, maybe also devtrain.
When several of these are used together, the number of dataset instances in memory is multiplied by quite a high factor.
In my case, I even have all three together. This leads to 34 instances (see #1443 (comment)).
See the related issue #1443, which is specifically about MultiProcDataset.
This issue is to discuss potential further solutions to the problem. Those solutions probably involve a new type of dataset which has only minimal memory requirements, e.g. by mmapping the data or sharing the memory across processes. Probably we would use some other library which does this for us; I'm not sure whether HDF or Apache Arrow or similar already do exactly that.
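To illustrate the mmap direction, here is a minimal sketch (not an existing RETURNN dataset; the file layout is made up) of a dataset that keeps only a small offsets index in Python memory and memory-maps the bulk data. Since mmapped pages live in the OS page cache, all processes that open the same file (DataLoader workers, distributed ranks, MultiProcDataset workers) effectively share one copy of the data:

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class MmapSeqDataset(Dataset):
    """Sketch: variable-length sequences read lazily via np.memmap.

    Hypothetical file layout: `data.bin` holds all float32 frames of all
    sequences concatenated, `offsets.npy` holds the start frame index of
    each sequence (N+1 entries). Only the offsets are loaded into RAM; the
    bulk data is paged in on demand and shared across processes by the OS.
    """

    def __init__(self, data_path: str, offsets_path: str, feat_dim: int):
        self.offsets = np.load(offsets_path)  # small, fits in RAM
        # np.memmap does not read the file; it only maps it into the address space.
        self.data = np.memmap(data_path, dtype=np.float32, mode="r").reshape(-1, feat_dim)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, idx):
        start, end = int(self.offsets[idx]), int(self.offsets[idx + 1])
        # Copy only the requested slice; everything else stays on disk / in the page cache.
        return torch.from_numpy(np.array(self.data[start:end]))
```

HDF (via h5py) reads lazily from disk, and Apache Arrow can memory-map files, so either could serve as the backing store instead of the raw np.memmap used in this sketch.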
But this is also an ongoing discussion for PyTorch and other frameworks. E.g. see the discussions here:
pytorch/pytorch#13246
pytorch/pytorch#101699
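As far as I understand, the core problem discussed in pytorch/pytorch#13246 is that Python objects held by the dataset get their refcounts touched in every DataLoader worker, which dirties the copy-on-write pages after fork, so the memory gradually gets duplicated per worker anyway. A common workaround is to keep per-sequence metadata in flat numpy arrays instead of Python lists/dicts; a rough sketch (illustrative only, not RETURNN code):

```python
import numpy as np
from torch.utils.data import Dataset


class FlatIndexDataset(Dataset):
    """Sketch of the usual workaround for the copy-on-write / refcount issue."""

    def __init__(self, seq_lens):
        # Bad: self.seq_lens = list(seq_lens)
        #   -> every __getitem__ in a worker touches Python int refcounts,
        #      dirtying the pages, so each worker gradually gets its own copy.
        # Better: one contiguous numpy array; its element buffer has no
        # per-element refcounts, so the forked workers keep sharing the pages.
        self.seq_lens = np.asarray(seq_lens, dtype=np.int64)

    def __len__(self):
        return len(self.seq_lens)

    def __getitem__(self, idx):
        # int(...) creates a fresh Python int; the shared array stays read-only.
        return int(self.seq_lens[idx])
```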