
[WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies #1643

Closed
ardeal opened this issue Jan 6, 2021 · 34 comments

Comments

@ardeal

ardeal commented Jan 6, 2021

Hi,

My environment:
Windows 10
python 3.8.5
CPU 10700K + 16GB RAM
GPU 3060Ti (8GB memory)
CUDA 11.0.3_451.82_win10
numpy 1.19.3
torch 1.7.1+cu110
torchvision 0.8.2+cu110

On the master branch, I followed https://github.com/ultralytics/yolov3/wiki/Train-Custom-Data and set batch-size = 2 for my 3060 Ti (8 GB memory). I got the following error:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "D:\code_python\har_hailiang\yolov3\train.py", line 12, in <module>
    import torch.distributed as dist
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\__init__.py", line 117, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
python-BaseException

Is this issue related to CUDA or to GPU memory size?

Thanks and Best Regards,
Ardeal

@ardeal added the question label Jan 6, 2021
@ardeal
Author

ardeal commented Jan 6, 2021

I solved the issue by changing nw to 1.
In the code: nw = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])  # number of workers
If nw = 8, then 8 worker processes take part in data loading, and that needs a lot of RAM.
So it works if we decrease nw to 1.
:)
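
For readers asking where this lives: the sketch below is a minimal, self-contained illustration (not the repo's own code) of computing a worker count like nw and clamping it before building the DataLoader; the dummy dataset and shapes are placeholders.

import os
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":  # required on Windows, where workers are spawned as new processes
    batch_size = 2
    dataset = TensorDataset(torch.zeros(64, 3, 32, 32), torch.zeros(64))  # placeholder data

    # Same formula as in the repo, then clamped to 1 as the workaround above suggests.
    nw = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])  # number of workers
    nw = min(nw, 1)  # each worker is a separate Python process that re-imports torch and its DLLs

    loader = DataLoader(dataset, batch_size=batch_size, num_workers=nw, pin_memory=True)
    for images, labels in loader:
        pass  # training step would go here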

@github-actions

github-actions bot commented Feb 6, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions bot added the Stale label Feb 6, 2021
@ardeal ardeal closed this as completed Feb 7, 2021
@tufail117

Any update on this? I am also facing the same issue. Have tried many things for the last 3 days, but no success.

@tufail117

Well, I managed to resolve this.
Open "Advanced system settings". Go to the Advanced tab, then click Settings under Performance.
Again click the Advanced tab --> Change --> unselect 'Automatically...'. For all the drives, set 'System managed size'. Restart your PC.

@mondrasovic

Well, I managed to resolve this.
Open "Advanced system settings". Go to the Advanced tab, then click Settings under Performance.
Again click the Advanced tab --> Change --> unselect 'Automatically...'. For all the drives, set 'System managed size'. Restart your PC.

This works, but only temporarily. Nowadays I am facing a crash after a few hours of training. It usually happens at the beginning of an epoch, while the data is being loaded.

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Program Files\Python37\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Program Files\Python37\lib\multiprocessing\spawn.py", line 114, in _main

    prepare(preparation_data)
  File "C:\Program Files\Python37\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Program Files\Python37\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Program Files\Python37\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Program Files\Python37\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Program Files\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "E:\projects\siamfc\src\train.py", line 13, in <module>
    import torch
  File "E:\venvs\general\lib\site-packages\torch\__init__.py", line 123, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "E:\venvs\general\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x0000025934FA8048>
Traceback (most recent call last):
  File "E:\venvs\general\lib\site-packages\torch\utils\data\dataloader.py", line 1324, in __del__
    self._shutdown_workers()
  File "E:\venvs\general\lib\site-packages\torch\utils\data\dataloader.py", line 1291, in _shutdown_workers
    if self._persistent_workers or self._workers_status[worker_id]:
AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute '_workers_status'

My environment:

  • Windows 10
  • NVidia CUDA 11.1
  • Python 3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)] on win32
  • torch==1.8.0+cu111
  • torchvision==0.9.0+cu111
  • numpy==1.19.5

An interesting and at the same time reproducible crash happened when I launched the Microsoft Teams application. Even MS Teams reported an exception regarding virtual memory. No other app stopped working. Thus, MS Teams and PyTorch training became "mutually exclusive". After I applied the trick mentioned above, the problem remains only on the PyTorch side, and only sometimes. A lot of ambiguous words, I know, but that's how it is.

@XuChang2020

1. Try reducing num_workers to 1 or 0.
2. Try setting batch-size to 2 or 1.
Hope this helps.

@PonyPC

PonyPC commented Jun 9, 2021

Reducing the number of workers will also significantly reduce training speed.

@krisstern

I was having the same error thrown with yolov5; it was fixed by manually changing the number of workers nw to 4 in the "datasets.py" file.

@PonyPC

PonyPC commented Sep 29, 2021 via email

@glenn-jocher
Member

glenn-jocher commented Sep 29, 2021

@ardeal @krisstern @PonyPC you can set dataloader workers during training, i.e.:

python train.py --workers 16

https://github.com/ultralytics/yolov5/blob/76d301bd21b4de3b0f0d067211da07e6de74b2a0/train.py#L454

It seems like a lot of Windows users are encountering this problem, but as @PonyPC mentioned, reducing workers will generally also result in slower training. Are you encountering this during DDP or single-GPU training?

EDIT: just realized this is YOLOv3 repo and not YOLOv5. I would strongly encourage all users to migrate to YOLOv5, which is much better maintained. It's possible this issue is already resolved there.

@PonyPC

PonyPC commented Sep 29, 2021

YOLOv5 has the same problem, @glenn-jocher.

@glenn-jocher
Member

glenn-jocher commented Sep 29, 2021

@PonyPC please raise a bug report issue citing a reproducible example in the YOLOv5 repo in that case.

@cobryan05

I have managed to mitigate (although not completely solve) this issue. I posted a more detailed explanation in a related StackOverflow answer, but basically try this:

Download fixNvPe.py:
https://gist.github.com/cobryan05/7d1fe28dd370e110a372c4d268dcb2e5

Install dependency:
python -m pip install pefile

Run (for the OP's paths) (NOTE: THIS WILL MODIFY YOUR DLLS [although it will back them up]):
python fixNvPe.py --input C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\*.dll

@abhishekstha98

python fixNvPe.py --input C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\*.dll

This fixed it (although you mentioned "not completely"); this has been a better suggestion than anything I found elsewhere.

@mercy-thuyle

python fixNvPe.py --input C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\*.dll

This fixed it (although you mentioned "not completely"); this has been a better suggestion than anything I found elsewhere.

Hello, I don't get it clearly. Is the whole step just to put fixNvPe.py in C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\*.dll?

If I am wrong, please explain the steps to me. Thank you so much.

@cobryan05

python fixNvPe.py --input C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\*.dll

Hello, I don't get it clearly. Is the whole step just to put fixNvPe.py in C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\*.dll?

If I am wrong, please explain the steps to me. Thank you so much.

You can place fixNvPe.py wherever you want. You then run this script using python, and you tell it what files to run on by passing an --input parameter with the path of the files you want to modify.

For example, OP's error message was

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.

It is failing to load C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll. You could pass exactly this as the --input parameter to fixNvPe.py, or you could replace caffe2_detectron_ops_gpu.dll with *.dll to instead process every DLL in that directory (the '*' is a wildcard, so *.dll means 'every file ending in .dll').

For example, if you downloaded fixNvPe.py to C:\Downloads\fixNvPe.py then you could open a command prompt and type something like

python C:\Downloads\fixNvPe.py --input=C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\*.dll

This will 'fix' all of the DLL files in the torch\lib directory. Your specific computer may use different paths, and you may have to run this tool on multiple folders, depending on your exact setup. Just look at the error message you are getting for a hint on what the correct paths are.

If you get an error message about failing to import pefile then you need to first run python -m pip install pefile

@VahidFe96

This problem is about the DataLoader.
You must reduce the value of num_workers.
In the folder python\Lib\site-packages\torch\utils\data, open dataloader.py and at line 189 write self.num_workers = 2.

@cobryan05

cobryan05 commented Dec 3, 2021

This problem is about the DataLoader.
You must reduce the value of num_workers.
In the folder python\Lib\site-packages\torch\utils\data, open dataloader.py and at line 189 write self.num_workers = 2.

The issue is with how multi-process Python works on Windows with the pytorch/cuda DLLs. The number of workers you set in the DataLoader directly relates to how many Python processes are created.

Each time a Python process imports pytorch it loads several DLLs. These DLLs have very large sections of data in them that aren't really used, but space is reserved for them in memory anyways. We're talking in the range of hundreds of megabytes to a couple gigabytes, per DLL.

When Windows is asked to reserve memory, if it says that it returned memory then it guarantees that memory will be available to you, even if you never end up using it.

Linux allows overcommitting. By default on Linux, when you ask it to reserve memory, it says "Yeah sure, here you go" and tells you that it reserved the memory. But it hasn't actually done this. It will reserve it when you try to use it, and hopes that there is something available at that time.

So, if you allocate memory on Windows, you can be sure you can use that memory. If you allocate memory on Linux, it is possible that when you actually try to use the memory that it will not be there, and your program will crash.

On Linux, when it spawns num_workers processes and each one reserves several gigabytes of data, Linux is happy to say it reserved this, even though it didn't. Since this "reserved memory" is never actually used, everything is good. You can create tons of worker processes. Just because pytorch allocated 50GB of memory, as long as it never actually uses it it won't be a problem. (Note: I haven't actually ran pytorch on Linux. I am just describing how Linux would not have this crash even if it attempted to allocate the same amount of memory. I do not know for a fact that pytorch/CUDA overallocate on Linux)

On Windows, when you spawn num_workers processes and each one reserves several gigabytes of data, Windows insists that it can actually satisfy this request should the memory be used. So, if Python tries to allocate 50GB of memory, then your total RAM + page file size must have space for 50GB.

So, on Windows, NumPythonProcesses * MemoryPerProcess < RAM + PageFileSize must hold, or you will hit this error.

Your suggestion of lowering num_workers decreases NumPythonProcesses. The suggestions to modify the page file size increase PageFileSize. My FixNvPe.py script decreases MemoryPerProcess.

The trick is to find a balance of all of these variables that keeps that equation true.
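
To make the inequality concrete, here is a purely illustrative check; the per-process commit figure below is an assumption for the example, not a measurement:

# Hypothetical numbers, for illustration only.
workers = 8
processes = 1 + workers                            # main training process plus DataLoader workers
commit_per_process_gb = 2.5                        # assumed commit charge per process after importing torch
required_gb = processes * commit_per_process_gb    # 22.5 GB that Windows must be able to back
ram_gb, pagefile_gb = 16, 4
print(required_gb <= ram_gb + pagefile_gb)         # False -> WinError 1455 territory

With numbers like these, you either grow the page file, shrink the worker count, or shrink the per-process commit, which is exactly the balance described above.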

@PonyPC

PonyPC commented Dec 3, 2021

Wow, that's working, @cobryan05.
Please port this to YOLOv5 troubleshooting, @glenn-jocher.

@glenn-jocher
Member

glenn-jocher commented Dec 3, 2021

@PonyPC @cobryan05 hi, I'm not following the conversation exactly since we don't have any Windows instances here, but if you have any improvements you'd like to implement I'd recommend submitting a PR.

The fastest and easiest way to incorporate your ideas into the official codebase is to submit a Pull Request (PR) implementing your idea, and if applicable providing before and after profiling/inference/training results to help us understand the improvement your feature provides. This allows us to directly see the changes in the code and to understand how they affect workflows and performance.

Please see our ✅ Contributing Guide to get started.

@cobryan05

cobryan05 commented Dec 3, 2021

@glenn-jocher Unfortunately I don't think that this is something that can be fixed within yolov5.

This is an issue with CUDA and pytorch DLLs. My 'fix' just changes some flags on the DLLs to make them allocate less memory. This likely would be a job for NVidia to fix the flags on their CUDA DLLs (eg, cusolver64_*.dll in CUDA release). Perhaps 'pytorch' could help some as well, since they also package some of these (eg, caffe2_detectron_ops_gpu.dll)... although they use NVidia tools to do this, so the blame probably falls back to NVidia.

Even with my changes to these flags, these DLLs still reserve a whole lot more memory than they actually use. I don't know who is to blame, and since my flag changes got me going I'm not digging further into it.

edit: I went ahead and submitted the info as a 'bug report' to NVIDIA. Whether or not anything happens with it, or any of the appropriate people at NVIDIA ever see it, who knows? But maybe they'll pick it up and do something about it.
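
For anyone curious what "changing some flags on the DLLs" looks like mechanically, here is a rough sketch of the idea using pefile. It is not the actual fixNvPe.py script: the section name .nv_fatb and the choice to clear only the write flag are assumptions based on the description above, and it writes patched copies instead of overwriting anything. Back up your DLLs and prefer the real script.

import glob
import pefile  # python -m pip install pefile

# Hypothetical path; point this at your own torch\lib directory.
for dll_path in glob.glob(r"C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\*.dll"):
    pe = pefile.PE(dll_path)
    patched = False
    for section in pe.sections:
        # The oversized CUDA fatbin data reportedly lives in a section named ".nv_fatb".
        # Clearing its WRITE flag lets Windows map it read-only and share it between processes
        # instead of committing page-file-backed private copies in every worker.
        if section.Name.rstrip(b"\x00") == b".nv_fatb":
            section.Characteristics &= ~pefile.SECTION_CHARACTERISTICS["IMAGE_SCN_MEM_WRITE"]
            patched = True
    if patched:
        pe.write(filename=dll_path + ".patched")  # writes a copy; swap it in manually if desired
    pe.close()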

@Neltherion

Neltherion commented Jul 21, 2022

I've been having this problem with TensorFlow as well, and as described in detail by @cobryan05, the problem resides in how Windows handles multiprocessing and DLLs.

@cobryan05 is it possible to paste the link to the NVIDIA page where you posted this problem? I also want to go there and whine to them about it.

@FathUMinUllah3797

I faced the same problem when the batch size was 8 and num_workers was 6. I solved it by making the following changes:
batch size = 2
num_workers = 2.

@szan12

szan12 commented Nov 23, 2022

I solved the issue by changing nw to 1. In the code: nw = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])  # number of workers. If nw = 8, then 8 worker processes take part in data loading, and that needs a lot of RAM. So it works if we decrease nw to 1. :)

Can I know where I should go to change the number of workers or put this line of code? I'm using a Jupyter notebook.

@glenn-jocher
Member

@szan12 i.e. python train.py --workers 4
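
For the Jupyter case specifically, one option is simply to shell out to the training script from a cell; the data and weights arguments below are placeholders, --workers is the point here.

# In a notebook cell, the leading "!" runs a shell command:
!python train.py --data coco128.yaml --weights yolov3-tiny.pt --workers 4 --batch-size 8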

@bit-scientist

I had the same error on Windows 10 today, and following

Open "Advanced system settings". Go to the Advanced tab, then click Settings under Performance.
Again click the Advanced tab --> Change --> unselect 'Automatically...'. For all the drives, set 'System managed size'. Restart your PC.

didn't help. Then I suddenly remembered that I had installed CUDA 11.7 alongside the already existing CUDA 11.3 and 11.2 versions, and had moved their lib and libnvvp path variables up in the system variables at that time. So I decided to install the packages (related to CUDA 11.7) with conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia, reversed the above quoted process (re-selected 'Automatically...'), restarted the PC, and now it's working well.

@kv1830
Copy link

kv1830 commented Apr 5, 2023

@glenn-jocher Unfortunately I don't think that this is something that can be fixed within yolov5.

This is an issue with CUDA and pytorch DLLs. My 'fix' just changes some flags on the DLLs to make them allocate less memory. This likely would be a job for NVidia to fix the flags on their CUDA DLLs (eg, cusolver64_*.dll in CUDA release). Perhaps 'pytorch' could help some as well, since they also package some of these (eg, caffe2_detectron_ops_gpu.dll)... although they use NVidia tools to do this, so the blame probably falls back to NVidia.

Even with my changes to these flags, these DLLs still reserve a whole lot more memory than they actually use. I don't know who is to blame, and since my flag changes got me going I'm not digging further into it.

edit: I went ahead and submitted the info as a 'bug report' to NVIDIA. Whether or not anything happens with it, or any of the appropriate people at NVIDIA ever see it, who knows? But maybe they'll pick it up and do something about it.

Hello, this problem may be solved now!
My environment:
torch 1.13.1+cu117
torchvision 0.14.1+cu117
cuda: 11.8
cudnn: 8.8.1.3_cuda11
or:
cuda: 12.1
cudnn: 8.8.1.3_cuda12

I use yolov5-6.2 with --batch-size 16 --workers 16, and the virtual memory it needs is much less than before (it needed more than 100 GB before)!


Why do I use torch 1.13.1+cu117 and cuda 11.8?
Actually I tried torch 2.0+cu118 with cuda 11.8 (or cuda 12.1), but something went wrong with AMP, so I switched to torch 1.13.1+cu117 first, and it works (cuda 11.8 and cuda 12.1 both work), so I don't want to try cuda 11.7 any more.

@francescobodria

I solved it by increasing the Windows page file limit.

@glenn-jocher
Member

Thank you for sharing your solution. It's great to hear that increasing the page file limit of Windows helped in resolving the issue. It seems that managing the page file size effectively contributed to stability during the training process. If you encounter any more issues or have further questions, feel free to reach out.

@kevinoldman

1. Try reducing num_workers to 1 or 0. 2. Try setting batch-size to 2 or 1. Hope this helps.

It works, but the entire training process became too slow. Is there any better way to solve this? I wasted two days on this.
Thank you.

@ardeal
Author

ardeal commented Dec 13, 2023

1. Try reducing num_workers to 1 or 0. 2. Try setting batch-size to 2 or 1. Hope this helps.

It works, but the entire training process became too slow. Is there any better way to solve this? I wasted two days on this. Thank you.

There is no better solution.
This issue is related to your computer's performance. If you would like to speed up training, you have to improve your hardware: for example, add more memory, use a better GPU, or use a server-class CPU.

@glenn-jocher
Member

@ardeal hi there! It seems you've already tried the recommended solutions. As for improving speed, upgrading your hardware such as increasing memory, using a stronger GPU, or leveraging a server CPU may help expedite the training process. If you have further queries or need additional assistance, feel free to ask.

@siddtmb

siddtmb commented Mar 21, 2024

It is not really related to the computer's performance, but rather to the fact that:

  1. memory management on pytorch+windows sucks
  2. the ultralytics dataloader constantly leaks memory
  3. python multithreading sucks, and there are various things you can do to mitigate its issues (like using numpy arrays or torch tensors) which are not done in the ultralytics dataloader, hence point 2.

Even on Linux it will slowly eat up all of your memory and any swap partition you have until it drives training to a halt. The good thing on Linux is that you can just let the OOM killer fire and resume the training (though that is not an option on large datasets, which will still leak memory into oblivion). But on Windows the only solution is to clear pagefile.sys with a hard reboot.

@glenn-jocher
Member

@siddtmb hi! Thanks for your insights. Memory management, particularly in a Windows environment, can indeed introduce challenges. We're continuously working on improving the efficiency of our data loader and overall memory usage within YOLOv3 and appreciate your feedback.

For mitigating memory leaks or high memory usage issues:

  • Ensuring the latest version of PyTorch is used can sometimes alleviate memory management issues, as improvements and bug fixes are regularly released.
  • Experimenting with reducing --workers and --batch-size in your training command may provide immediate relief from memory pressure, though at the expense of training speed.
  • Utilizing torch.utils.data.DataLoader with pin_memory=True and carefully managing tensor operations can help in some situations.

We recognize the importance of efficient memory usage and are committed to making improvements. Contributions and pull requests are always welcome if you have suggestions or optimizations to share with the community. Your feedback is valuable in guiding those efforts. Thank you for bringing this to our attention.
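
As a rough illustration of the pin_memory suggestion above (a minimal sketch with a placeholder dataset, not YOLOv3's own loader):

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":  # needed on Windows whenever num_workers > 0
    dataset = TensorDataset(torch.zeros(32, 3, 32, 32), torch.zeros(32))  # placeholder dataset
    loader = DataLoader(
        dataset,
        batch_size=8,             # smaller batches ease memory pressure
        num_workers=2,            # fewer worker processes -> less committed DLL memory
        pin_memory=True,          # page-locked host buffers for faster host-to-GPU copies
        persistent_workers=True,  # reuse workers across epochs instead of respawning them
    )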
