For single-GPU training, without PyTorch DataLoader multiprocessing and without MultiProcDataset, the memory usage of the dataset is probably not much of a problem. However, it is not uncommon to have one or more of the following:
Using distributed multi-GPU training, i.e. having multiple worker processes, which means multiple instances of the dataset in memory.
Using PyTorch DataLoader multiprocessing. This then keeps the dataset in the main proc (although it is freed there if the dataset supports finish_epoch with free_resources, see also MultiProcDataset, high memory usage #1443) and also in all the DataLoader workers.
Using MultiProcDataset, which has the dataset in each of its workers.
You usually have train, dev, maybe also devtrain.
When several of these are used together, the number of dataset instances in memory is multiplied by quite a high factor.
In my case, I even have all three together. This leads to 34 instances (see #1443 (comment)).
See the related issue #1443, which is specifically about MultiProcDataset.
This issue is to discuss potential further solutions to the problem. Those solutions probably involve a new type of dataset which has only minimal memory requirements, e.g. by mmapping the data or sharing the memory across processes. Probably we would use some other library which does this for us; I'm not sure whether HDF or Apache Arrow or similar already do exactly that.
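To illustrate the mmap direction, here is a minimal sketch (not an existing RETURNN dataset; the file layout is made up) of a dataset that keeps only a small offsets index in Python memory and memory-maps the bulk data. Since mmapped pages live in the OS page cache, all processes that open the same file (DataLoader workers, distributed ranks, MultiProcDataset workers) effectively share one copy of the data:

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class MmapSeqDataset(Dataset):
    """Sketch: variable-length sequences read lazily via np.memmap.

    Hypothetical file layout: `data.bin` holds all float32 frames of all
    sequences concatenated, `offsets.npy` holds the start frame index of
    each sequence (N+1 entries). Only the offsets are loaded into RAM; the
    bulk data is paged in on demand and shared across processes by the OS.
    """

    def __init__(self, data_path: str, offsets_path: str, feat_dim: int):
        self.offsets = np.load(offsets_path)  # small, fits in RAM
        # np.memmap does not read the file; it only maps it into the address space.
        self.data = np.memmap(data_path, dtype=np.float32, mode="r").reshape(-1, feat_dim)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, idx):
        start, end = int(self.offsets[idx]), int(self.offsets[idx + 1])
        # Copy only the requested slice; everything else stays on disk / in the page cache.
        return torch.from_numpy(np.array(self.data[start:end]))
```

HDF (via h5py) reads lazily from disk, and Apache Arrow can memory-map files, so either could serve as the backing store instead of the raw np.memmap used in this sketch.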
But this is also an ongoing discussion for PyTorch and other frameworks. E.g. see the discussions here:
pytorch/pytorch#13246
pytorch/pytorch#101699
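As far as I understand, the core problem discussed in pytorch/pytorch#13246 is that Python objects held by the dataset get their refcounts touched in every DataLoader worker, which dirties the copy-on-write pages after fork, so the memory gradually gets duplicated per worker anyway. A common workaround is to keep per-sequence metadata in flat numpy arrays instead of Python lists/dicts; a rough sketch (illustrative only, not RETURNN code):

```python
import numpy as np
from torch.utils.data import Dataset


class FlatIndexDataset(Dataset):
    """Sketch of the usual workaround for the copy-on-write / refcount issue."""

    def __init__(self, seq_lens):
        # Bad: self.seq_lens = list(seq_lens)
        #   -> every __getitem__ in a worker touches Python int refcounts,
        #      dirtying the pages, so each worker gradually gets its own copy.
        # Better: one contiguous numpy array; its element buffer has no
        # per-element refcounts, so the forked workers keep sharing the pages.
        self.seq_lens = np.asarray(seq_lens, dtype=np.int64)

    def __len__(self):
        return len(self.seq_lens)

    def __getitem__(self, idx):
        # int(...) creates a fresh Python int; the shared array stays read-only.
        return int(self.seq_lens[idx])
```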