[Bug][Dataloader] unable to mmap 2048 bytes from file <filename not specified>: Cannot allocate memory (12) #92134
Comments
Based on the log, it looks like you just ran out of shared memory. It would be good if you could share a minimal script.
Thank you for your attention. We encounter this error when running the PyTorch example for RN50 in the link below. We enabled this model script on XPU and hit this mmap issue (cannot allocate memory). We are using a 64-bit Linux machine, where mmap can in theory map files into a very large virtual memory space, so the memory mapping segments could be very large. Is it possible that munmap releases memory lazily, so the virtual memory mapped by previous iterations is not released immediately and the mapping region does not have enough room for the next iteration's dataloader mapping? Additionally, this issue takes a long time to reproduce; it needs about 50 epochs.
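For reference, a minimal sketch of the kind of loop described here (the path, transforms, and hyperparameters are placeholders rather than the actual example script):

```python
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

if __name__ == "__main__":
    dataset = ImageFolder(
        "/path/to/imagenet/train",  # placeholder path
        transform=T.Compose([T.RandomResizedCrop(224), T.ToTensor()]),
    )
    loader = DataLoader(dataset, batch_size=256, shuffle=True,
                        num_workers=8, pin_memory=True)

    for epoch in range(50):  # the error only shows up after many epochs
        for images, targets in loader:
            pass  # model forward/backward elided
```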
If you can't allocate memory, can you please try to use
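The concrete suggestion is cut off above. One knob that is often tried for shared-memory-related DataLoader failures (an assumption on my side, not necessarily what was suggested here) is the multiprocessing sharing strategy:

```python
import torch.multiprocessing as mp

# Show which strategies this platform supports, e.g. ('file_descriptor', 'file_system').
print(mp.get_all_sharing_strategies())

# Switching strategies changes how worker tensors are shared with the main
# process; whether it helps depends on the actual root cause.
mp.set_sharing_strategy('file_system')
```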
@ejguan Thank you for sharing the link. We found that others in the community have also encountered this issue. Thank you.
You might want to profile your training script to see when you run out of memory. You might be able to rely on
Hi, @ejguan I checked the memory usage with
For the shared memory usage, no sharp increase was observed.
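A rough sketch of how such a check can be scripted (my own illustration, not the exact tooling used above): log /dev/shm usage and the process's peak resident memory while training runs.

```python
import resource
import shutil

def log_memory(tag: str) -> None:
    # Usage of the shared-memory filesystem that DataLoader workers write to.
    shm = shutil.disk_usage("/dev/shm")
    # Peak resident set size of this process (kilobytes on Linux).
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"[{tag}] /dev/shm used={shm.used >> 20} MiB, "
          f"free={shm.free >> 20} MiB, peak RSS={rss_kb >> 10} MiB")

log_memory("epoch start")  # call periodically inside the training loop
```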
Hi, @ejguan Sorry to interrupt you. I have some confusion about how the PyTorch dataloader works with multiple workers. Here the workers put their result data into the data queue. The worker process will call I traced the syscalls in the main process, and there is an mmap there. I wonder what this mmap is called for. Thank you.
We need to move the underlying data to shared memory so that the main process is able to get data from the worker process.
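A quick way to see this in action (a minimal sketch, not part of the DataLoader internals themselves): with `num_workers > 0`, the batches that arrive in the main process already have their storage in shared memory.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    ds = TensorDataset(torch.randn(64, 3, 8, 8))
    loader = DataLoader(ds, batch_size=8, num_workers=2)

    (batch,) = next(iter(loader))
    # The tensor was collated in a worker and moved into shared memory
    # before being handed to the main process.
    print(batch.is_shared())  # expected: True
```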
Thank you for your reply. I see the worker immediately deletes the data once it finishes processing and has put the data into the queue. The code is here: pytorch/torch/utils/data/_utils/worker.py Line 323 in 728dfee
Does it mean the data queue fully copies this data from the worker's virtual memory space into /dev/shm, so the worker can safely `del data`? Then the data queue waits for the main process to use mmap to map this memory from /dev/shm into its own virtual memory space?
Right.
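The same mechanism can be reproduced outside DataLoader with torch.multiprocessing directly (my own illustration of the flow confirmed above): the queue moves the tensor's storage into shared memory, the producer can drop its reference, and the consumer maps the same segment into its address space.

```python
import torch
import torch.multiprocessing as mp

def producer(q, done):
    t = torch.arange(4.0)
    q.put(t)     # the queue's serializer moves the storage into shared memory
    del t        # the worker's own reference is no longer needed
    done.wait()  # keep the process alive until the consumer has received it

if __name__ == "__main__":
    q, done = mp.Queue(), mp.Event()
    p = mp.Process(target=producer, args=(q, done))
    p.start()
    out = q.get()  # the main process maps the shared segment into its own memory
    done.set()
    p.join()
    print(out, out.is_shared())  # expected: tensor([0., 1., 2., 3.]) True
```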
Thank you for your explanation. :)
Hi @zejun-chen, did you find a solution to the problem?
Hi, @ASMIftekhar
Hello, I got this idea from this blog and this issue.
Hello, @ASMIftekhar
Same here.
I met the same problem, and it was solved when I initialized 'data_list' in '__getitem__()' instead of '__init__()' in my Dataset class.
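A sketch of that pattern (my reconstruction of the idea, with `load_sample` as a hypothetical helper): keep only lightweight metadata in `__init__` and build each sample lazily in `__getitem__`.

```python
from torch.utils.data import Dataset

class LazyDataset(Dataset):
    def __init__(self, index_file: str):
        # Only lightweight metadata here (e.g. file paths), not decoded samples.
        with open(index_file) as f:
            self.paths = [line.strip() for line in f]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        # Construct the sample on demand, inside the worker process.
        return load_sample(self.paths[i])  # load_sample is a placeholder
```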
Hello, I met the problem too. Finally, I found the problem might be in the customized allocation function adopted by the As I utilized Anyway, anyone who is confronted with the same problem could try this solution. I believe the problem might be caused by more than one thing, but this approach might be the fix for one of them :)
Maybe you could try to increase vm.max_map_count.
I believe this is the key, as we encountered this problem just like you did.
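For anyone who wants to check whether that limit is the bottleneck, a small Linux-only sketch (my own, not from this thread) that compares the process's current number of mappings against vm.max_map_count:

```python
def current_map_count() -> int:
    # Each line in /proc/self/maps is one memory mapping of this process.
    with open("/proc/self/maps") as f:
        return sum(1 for _ in f)

def max_map_count() -> int:
    with open("/proc/sys/vm/max_map_count") as f:
        return int(f.read())

print(f"mappings in use: {current_map_count()} / limit: {max_map_count()}")
# The limit can be raised with, e.g.: sysctl -w vm.max_map_count=<larger value>
```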
🐛 Describe the bug
Hi,
Here are the results I observed when running my workload with PyTorch 1.13 on Ubuntu 20.04 to train RN50 with ImageNet:
When I run 25 epochs, the error is thrown as below:
This RuntimeError is thrown from the torch dataloader. I found the community has already hit this issue: https://discuss.pytorch.org/t/pytorch-cannot-allocate-memory/134754.
I set the dataloader workers to 0 and the error goes away, but the training time increases too much when a single process fetches and decodes the dataset. I also checked the host memory usage and no OOM happens. The fd limit of one process is also not exceeded. Thus we wonder whether the issue can be fixed; otherwise it may block model training with PyTorch.
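For completeness, this is the kind of check used for the fd limit mentioned above (a Linux-only sketch of my own, not the exact commands we ran):

```python
import os
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
open_fds = len(os.listdir("/proc/self/fd"))  # entries are this process's open fds
print(f"open fds: {open_fds}, soft limit: {soft}, hard limit: {hard}")
```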
Versions
PyTorch: 1.13
Python: 3.9
Torchvision: 1.14
Batch size: 256
Model: ResNet50
Dataset: ImageNet
OS: Ubuntu 20.04
Host memory: 128 GB
CPU: Intel Xeon Gold 6342
cc @ssnl @VitalyFedyunin @ejguan @NivekT