Reliably repeating pytorch system crash/reboot when using imagenet examples #3022
Comments
I have experienced random system reboots once before, due to a motherboard/GPU incompatibility. It manifested during long training runs. Do other frameworks (e.g. Caffe) succeed in training on ImageNet? |
Haven't tried that yet. However, I ran some long-running graphics benchmarks ;) with no problems. I could probably look at giving other frameworks a shot; what's your recommendation, Caffe? Do keep in mind that the crash I reported happens practically immediately (the mnist-cuda example runs to completion many times without an issue), so I doubt that it's a h/w incompatibility issue. |
Can you try triggering the crash once again and see if anything relevant is printed in dmesg or kern.log? |
Zero entries related to this in either dmesg or kern.log. The machine does an audible click and resets, so I think the hardware registers or memory are being twiddled in a way the machine doesn't like; the kernel gets no real chance to log anything. It reboots at the same line of code each time, at least the few times I stepped through it. |
That's weird. To be honest I don't have any good ideas for debugging such issues. My guess would be that it's some kind of a hardware problem, but I don't really know. |
It's definitely a hardware issue in my view as well, whether at the NVIDIA driver level or a BIOS/hardware failure. |
For future reference, the issue was due to the steep power ramp of the 1080 Tis triggering the server power supply's over-voltage protection. Only some pytorch examples caused it to show up. |
@castleguarders Have you figured out how to solve this issue? It seems that even 1200W "platinum" power supply is not enough for just 2X 1080Ti, it reboots from time to time. |
@castleguarders I am having similar issues; how did you find out that that was the problem? |
@pmcrodrigues There was an audible click whenever the issue happened. I used nvidia-smi to soft-limit the power draw, which allowed the tests to run a bit longer, but they tripped the protection anyway. I switched to an 825W Delta power supply and it took care of the issue fully. FurMark makes easy work of testing this if you run Windows; I ran it fully pegged for a couple of days while driving the CPUs at 100% with a different script. Zero issues since then. @yurymalkov I only have 1x 1080 Ti, didn't dare to put in a second one. |
@pmcrodrigues @castleguarders |
@castleguarders @yurymalkov Thank you both. I have also tried to reduce the power draw via nvidia-smi and it stopped crashing the system. But stress tests at full power draw simultaneously on my 2 Xeons (with http://people.seas.harvard.edu/~apw/stress/) and the 4 1080 Tis (with https://github.com/wilicc/gpu-burn) didn't make it crash. So for now I have only seen this problem with pytorch. Maybe I need other stress tests? |
@pmcrodrigues gpuburn seems to be a bad test for this, as it does not create steep power ramps. The problem reproduces on some other frameworks (e.g. tensorflow), but it seems that pytorch scripts are the best test, probably because of the highly synchronous nature. |
I am having the same issue. Has anybody found any soft solution to this? |
@gurkirt For now, I am only using 2 GPUs with my 1500W PSU. If you want to test reducing the power draw you can use "nvidia-smi -pl X", where X is the new power limit in watts. For my GTX 1080 Ti I used "nvidia-smi -pl 150", whereas the standard draw is 250W. I am waiting on a more potent PSU to test whether it solves the problem. I currently have a measuring device on the wall socket, and even when I am using 4 GPUs the draw does not pass 1000W. There could still be brief peaks that are not being registered, but something is off. Either way, we probably need to go with dual 1500W PSUs. |
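A minimal sketch of that power-capping workaround, assuming a single NVIDIA GPU; the allowed limits are board-specific, so query them first:

    # check the board's min/max/default power limits (GPU 0)
    nvidia-smi -i 0 -q -d POWER | grep -i 'power limit'
    # keep the driver loaded so the setting is not dropped when no process uses the GPU
    sudo nvidia-smi -i 0 -pm 1
    # cap the board at 150 W instead of the default 250 W
    sudo nvidia-smi -i 0 -pl 150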
@pmcrodrigues Thanks a lot for the quick response. I have another system with a 2000W PSU and 4 1080 Tis; that works just fine. I will try plugging that power supply into this machine and see if 2000W is enough here. |
@pmcrodrigues Did you find any log/warning/crash report anywhere? |
@gurkirt None. |
I'm having a similar problem: audible click, complete system shutdown. It seems that it only occurs with BatchNorm layers in place. Does that match your experience? |
I was using resnet at that time. It is a problem of an inadequate power supply, i.e. a hardware problem, and I needed to upgrade the power supply. According to my searches online, the power surge is something pytorch tends to trigger. I upgraded the power supply from 1500W to 1600W. The problem still appears now and then, but only when the room temperature is a bit higher. I think there are two factors at play: room temperature, and the other, major one being the power supply. |
I have the same problem with a 550W power supply and a GTX 1070 graphics card. I start the training and about a second later the power cuts. This got me thinking that perhaps it would be possible to trick/convince the PSU that everything is OK by creating a ramp-up function that e.g. mixes sleeps and GPU activity and gradually increases the load. Has anyone tried this? Does someone have minimal code that reliably triggers the power cut? |
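One way to approximate that ramp from outside the training script (not the in-script sleep/activity mix proposed above) is to launch training under a low nvidia-smi power cap and raise the cap in steps. A rough sketch, where train.py and the wattage steps are placeholders and must fit your board's supported power-limit range:

    # start capped low, then raise the limit every few seconds
    sudo nvidia-smi -pm 1
    sudo nvidia-smi -pl 100          # placeholder low cap, must be within the supported range
    python train.py &                # placeholder for whatever script triggers the power cut
    for limit in 130 160 190 220; do
        sleep 10
        sudo nvidia-smi -pl "$limit" # step the cap up gradually instead of one big jump
    done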
Had the same issue with a GTX 1070, but the reboots were not random. What did NOT work:
Not sure if this is related to native BIOS issues. I am running BIOS version 1203 while the latest is 3 releases ahead (1503), and they put
into the description of each of those 3 Asus X299-A BIOS versions. One of those releases had also
So there is a chance this is fixed. |
For the record, my problem was a broken power supply. I diagnosed this by running https://github.com/wilicc/gpu-burn on Linux and then FurMark on Windows, under the assumption that unless I could reproduce the crash on Windows, they wouldn't talk to me at my computer shop. Both these tests failed for me, whereupon I took the computer for repair and got a new power supply. Since then, I have been running pytorch for hours without any crashes. |
Has anyone found a way to fix this? I have a similar error where my computer restarts shortly after I start training. I have a 750W PSU and only 1 GPU (1080 Ti), so I don't think it is a power problem. Also, I did not see increased wattage going to my GPU before it restarts. |
To add some more information to vwvolodya's great comment: our motherboard/CPU configuration was an ASUS TUF X299 MARK 2 with an i9-7920X. The BIOS version was 1401. The only thing that could prevent the system from rebooting/shutting down was to turn off Turbo Mode. For now, after updating to 1503, the problem seems to be solved with Turbo Mode activated. Have a great day guys! |
@yaynouche @vwvolodya Similar issues happened on an ASUS WS-X299 SAGE with an i9-9920X. Turning off Turbo Mode is the only solution right now, with the latest BIOS (version 0905, which officially supports the i9-9920X). UPDATE: turns out, I must enable Turbo Mode in the BIOS and use commands like UPDATE 2: I think turning off Turbo Mode can only lower the chance of my issue, not eliminate it. |
Facing the same problem: 4x GTX 1080 Ti with a 1600W PSU (with redundancy). Tried gpu-burn to test it and it's stable as a rock. |
@Suley Personally I think this is more of a CPU problem; basically, pytorch makes the CPU execute a series of instructions that draws too much power through the motherboard. |
I am not sure whether what I face is the same problem. My computer uses a 1080 Ti, and if the GPU memory usage is close to 100%, i.e. almost 11GB, it will reboot. But if I reduce the batch size of the network to decrease the memory usage, the reboot does not happen, even without upgrading the power supply. If someone meets the reboot problem, I hope my situation might help. |
I face the same problem with a 1080 Ti and a 450W PSU and tried to reduce power consumption with "sudo nvidia-smi -pl X" as a temporary solution. However, this did not work on the first try. After that, I noticed that if I limit the power consumption first and run "nvidia-smi -lms 50" in another terminal to watch the power and memory usage of the GPU just before starting the training, then I can train the network without problems. I'm waiting for a new PSU right now as a permanent solution. |
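To spell out that sequence, a sketch of the two terminals (the 150 W cap and train.py are placeholders, not values from the comment above):

    # terminal 1: cap the power draw first
    sudo nvidia-smi -pl 150
    # terminal 2: poll the status output every 50 ms to watch power and memory usage
    nvidia-smi -lms 50
    # terminal 1 again: only now start training
    python train.py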
I too had this issue and was able to reproduce it with a pytorch script without using any GPUs (CPU only), so I agree with @zym1010 that for me it's a CPU issue. I updated my BIOS (ASUS WS X299 SAGE LGA 2066 Intel X299) and it seems to have stopped the issue from happening. However, considering the comments in this thread, I'm not entirely sure the issue is fixed... @soumith Don't you think pytorch contributors should look into this issue rather than just closing it? Pytorch seems to stress the GPU/CPU in a way GPU/CPU stress tests do not. This is not expected behaviour, and the problem affects many people. It seems like a rather interesting issue as well! |
@Caselles Are you referring to BIOS version 1001? I saw it some time ago on the ASUS website, but it seems to have been taken down somehow. |
The BIOS I installed is this one: "WS X299 SAGE Formal BIOS 0905 Release". |
In my experience, this issue comes with different Thermaltake PSUs. In the last case, changing the PSU from Thermaltake platinum 1500W to Corsair HX1200 solved the problem on a two-2080Ti setup. |
I have this issue with both CPU and GPU, which means the rebooting happens even when I physically remove the GPU and only train the network on the CPU, without using a dataloader. My power supply is an EVGA 850W Gold, with an i7-8700K CPU and a GTX 1080 Ti (just one). There is an ECO switch on my power supply, and if I switch it on, the reboots happen more often. Just like others said, stress tests on both CPU and GPU pass. So, a conclusion here: |
My hardware details:
BIOS version: 0905, then updated to 1201. Tested using https://github.com/wilicc/gpu-burn; all GPUs are OK. Whenever I am training maskrcnn_resnet50_fpn on the COCO dataset using 4 GPUs with batch size 4, the system reboots immediately. But when I use 3 GPUs with batch size 4, or 4 GPUs with batch size 2, it trains. What could be the reason? Power supply? |
I also have this issue using 4 x Geforce RTX2080 TI - 11GB and a 1600W EVGA SuperNOVA Platinum PSU (I also tried swapping the PSU with a 1600W SuperNOVA EVGA Gold PSU) and the issue still occurs when using PyTorch with the 4 GPUs. |
From my experience, reboots often occur when nvidia-persistenced is not installed and running. Updating the BIOS is also a crucial part of the solution. Hope it helps. Best regards, Yassine |
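For anyone trying this, a sketch of the two usual ways to get persistence; the systemd service name is an assumption and varies by distro/driver package:

    # legacy toggle: enable persistence mode on all GPUs (lost on reboot)
    sudo nvidia-smi -pm 1
    # preferred on systemd distros: run the persistence daemon permanently
    sudo systemctl enable --now nvidia-persistenced
    # verify: the Persistence-M column should now read "On"
    nvidia-smi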
@gurkirt What are your other system specs? I also have 4 x RTX 2080 Tis and a Corsair 1600i PSU, but my PC still shuts down after a while when using all 4 GPUs. |
Hey just FYI I was experiencing this issue on multiple machines (all X299 with multiple 2080Tis), and after trying 4 different PSUs the Corsair AX1600I is the only one that I did not encounter reboots with. |
I have the same issue. Output of
Following is the log file from just before the system crashed, I think. I found it in -
How can I stop this from happening again, i.e. stop pytorch training without crashing my system? |
@theairbend3r I'm not sure if you're having the same issue as the one here. As I understand it, when starting training with torch, the GPUs and CPU(s) ramp up so quickly that it can exceed normal power draw and trigger overload protection on the PSU. I was always experiencing this before the first epoch ended. Sorry I don't have any more useful suggestions for you. |
Several possible solutions (not sure if any one of them could fix the problem independently): |
Faced this issue with 2x 2080 Ti on multiple PCs with platinum 1000W and 1200W PSUs. Worked fine when using only 1 GPU, but not 2. Solved by upgrading the PSU to 1600W. |
Had the same issue with 2080 Ti on 750W G2 Gold PSU. Solved after changing the PSU to 1600W P2. |
It worked when I used nvidia-persistenced, but the computer still rebooted after a while. |
Deactivating Intel Turbo Boost worked for me, which seems to indicate that it is a CPU problem. When I monitored the temperature of the CPU cores before deactivating boost, it rose very quickly to 60 degrees; now it stays below 50. |
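For reference, a sketch of turning turbo off from Linux without entering the BIOS; this assumes the intel_pstate driver (with acpi-cpufreq the knob is /sys/devices/system/cpu/cpufreq/boost instead):

    # 1 disables turbo, 0 re-enables it (intel_pstate driver only)
    echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
    # confirm the current setting
    cat /sys/devices/system/cpu/intel_pstate/no_turbo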
This only happens with pytorch; it's fine with tensorflow. |
I solved a similar problem (RTX 3090 GPU) by limiting the GPU power from 350W to 250W: |
Just to let those with the same issue know: we had this frequent reboot issue on one machine with 2x RTX 3090 + 1000W PSU. Initially we suspected it was related to the PSU, as reported by many here, because running inferences with just 1 RTX worked. When using both GPUs it used to reboot, and it only worked after decreasing the GPUs' power limit from 370W to 200W using But then we experimented with another machine with just one RTX 3090 + 1400W PSU and the reboot happened almost every time while running inferences. Because of a previous post here suggesting CPU turbo boost could be an issue (disabling it didn't work for us), we suspected the GPU overclocking/boost could be an issue. Then we saw the RTX 3090 clock reaching ~1900MHz while running FurMark stress tests. So we limited it with I hope this helps others. |
@lfcnassif Which specs are you following, exactly? |
Been trying to run 2 RTX 3090's on a brand new Corsair 1500W PSU... with many sudden shutdowns. I'm thinking there's no way I don't have enough watts, right? I was worried it could be a temperature issue, so I was obsessively logging temps for all devices and running fans at max speed right up to the shutdown. The GPUs were a "cool" ~60 C, not enough to warrant causing the shutdown. Thanks to @lfcnassif, indeed the trick was simply running: (where -lgc sets the min and max clock speeds) It turns out that the max clock speed is not much lower than the ~1900 MHz the GPU would naturally run at, but basically I think we're just preventing sudden synchronous large bursts of computational hunger... And after training a run for a long time now, I'm posting "confirmation": this works. :) |
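For anyone who wants to copy this, a sketch of the clock-locking commands; the 210,1700 MHz range is only an example, pick values your own board reports as supported:

    # list the clocks the board actually supports
    nvidia-smi -q -d SUPPORTED_CLOCKS | less
    # lock the GPU clock to a min,max range in MHz (example values; needs a Volta-or-newer GPU)
    sudo nvidia-smi -lgc 210,1700
    # undo the lock later
    sudo nvidia-smi -rgc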
I just had this issue on my windows 11 PC. Weirdly, training worked fine, but about once for every ten predictions my model made, the PC would suddenly turn off. Like a lot of people here with this problem, I have an ASUS motherboard, so I tried going into BIOS and turning off turbo mode and that instantly fixed this issue for me. |
I have similar hardware to @legel , specifically 2 x 3090s and Corsair HX1500i, and experienced my Epyc Rome machine shutting down on PyTorch workloads. The solution was to replace the Y splitter cable going from the PSU to the GPUs with single cables (each 3090 has two 8-pin connectors for power input and Corsair provides one cable that has 16-pins via a Y-splitter). Change that 16-pin cable to two 8-pin cables. For me, it worked immediately without needing to throttle the clock speeds of the GPUs. |
So I have a 100% repeatable system crash (reboot) when trying to run the imagenet example (2012 dataset) with resnet18 defaults. The crash seems to happen in Variable.py at torch.autograd.backward(...) (line 158).
I am able to run the basic mnist example successfully.
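For context, the run was the stock examples/imagenet script with resnet18 defaults, i.e. something along these lines (the dataset path is a placeholder):

    # stock pytorch/examples imagenet training, resnet18 defaults; path is a placeholder
    python main.py -a resnet18 /path/to/imagenet-2012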
Setup: Ubuntu 16.04, 4.10.0-35-generic #39~16.04.1-Ubuntu SMP Wed Sep 13 09:02:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
python --version Python 3.6.2 :: Anaconda, Inc.
/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
nvidia-smi output.
Sat Oct 7 23:51:53 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:03:00.0 On | N/A |
| 14% 51C P8 18W / 250W | 650MiB / 11170MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1335 G /usr/lib/xorg/Xorg 499MiB |
| 0 2231 G cinnamon 55MiB |
| 0 3390 G ...-token=C6DE372B6D9D4FCD6453869AF4C6B4E5 93MiB |
+-----------------------------------------------------------------------------+
torch/vision was built locally on the machine from master. No issues at compile or install time, other than the normal compile time warnings...
Happy to help provide further information.