Reliably repeating pytorch system crash/reboot when using imagenet examples #3022

Closed
castleguarders opened this issue Oct 8, 2017 · 48 comments

@castleguarders commented Oct 8, 2017

So I have a 100% repeatable system crash (reboot) when trying to run the imagenet example (2012 dataset) with resnet18 defaults. The crash seems to happen in variable.py at torch.autograd.backward(...) (line 158).

I am able to run the basic mnist example successfully.
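
In case it helps, the heart of what runs is roughly the loop below (a minimal sketch of the imagenet example's training step with illustrative batch size, not the exact example code); the reboot happens around the backward() call:

```python
# Rough repro sketch (illustrative only; assumes a CUDA GPU and torchvision).
import torch
import torchvision.models as models

model = models.resnet18().cuda()
criterion = torch.nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

images = torch.randn(128, 3, 224, 224).cuda()    # stand-in for one ImageNet batch
targets = torch.randint(0, 1000, (128,)).cuda()

for step in range(100):
    output = model(images)
    loss = criterion(output, targets)
    optimizer.zero_grad()
    loss.backward()        # the machine resets around this point
    optimizer.step()
```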

Setup: Ubuntu 16.04, 4.10.0-35-generic #39~16.04.1-Ubuntu SMP Wed Sep 13 09:02:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

python --version Python 3.6.2 :: Anaconda, Inc.

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

nvidia-smi output.
Sat Oct 7 23:51:53 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:03:00.0 On | N/A |
| 14% 51C P8 18W / 250W | 650MiB / 11170MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1335 G /usr/lib/xorg/Xorg 499MiB |
| 0 2231 G cinnamon 55MiB |
| 0 3390 G ...-token=C6DE372B6D9D4FCD6453869AF4C6B4E5 93MiB |
+-----------------------------------------------------------------------------+

torch/vision was built locally on the machine from master. No issues at compile or install time, other than the normal compile time warnings...

Happy to help get further information..

@vadimkantorov commented Oct 8, 2017

I once experienced random system reboots due to a motherboard/GPU incompatibility. It manifested during long training runs. Do other frameworks (e.g. Caffe) succeed in training on ImageNet?

@castleguarders (Author) commented Oct 8, 2017

Haven't tried that yet. However, I ran some long-running graphics benchmarks ;) with no problems. I could probably give other frameworks a shot; what's your recommendation, Caffe?

Do keep in mind that the crash I reported happens practically immediately (the mnist CUDA example runs to completion many times without an issue), so I doubt that it's a hardware incompatibility issue.

@apaszke (Member) commented Oct 8, 2017

Can you try triggering the crash once again and see if anything relevant is printed in /var/log/dmesg.0 or /var/log/kern.log?

@castleguarders (Author) commented Oct 9, 2017

Zero entries related to this in either dmesg or kern.log. The machine does an audible click and resets, so I think it's the h/w registers or memory being twiddled in a way it doesn't like. No real notice to kernel to log anything. Reboots at the same line of code each time, at least the few times I stepped through it.

@apaszke (Member) commented Oct 10, 2017

That's weird. To be honest I don't have any good ideas for debugging such issues. My guess would be that it's some kind of a hardware problem, but I don't really know.

@soumith (Member) commented Oct 10, 2017

It's definitely a hardware issue as well, whether at the NVIDIA driver level or a BIOS/hardware failure.
I'm closing the issue, as there's no action to be taken on the pytorch project side.

@soumith closed this Oct 10, 2017

@castleguarders (Author) commented Dec 7, 2017

For future reference, the issue was due to the steep power ramp of the 1080 Ti triggering the server power supply's over-voltage protection. Only some pytorch examples caused it to show up.

@yurymalkov commented Feb 10, 2018

@castleguarders Have you figured out how to solve this issue? It seems that even a 1200W "platinum" power supply is not enough for just 2x 1080 Ti; it reboots from time to time.

@pmcrodrigues commented Apr 17, 2018

@castleguarders I am having similar issues. How did you find out that this was the problem?

@castleguarders (Author) commented Apr 19, 2018

@pmcrodrigues There was an audible click whenever the issue happened. I used nvidia-smi to soft-limit the power draw; this let the tests run a bit longer, but they tripped it anyway. I switched to an 825W Delta power supply and that took care of the issue fully. FurMark makes easy work of testing this if you run Windows: I ran it fully pegged for a couple of days, while driving the CPUs to 100% with a different script, and there have been zero issues since then.

@yurymalkov I only have 1x 1080 Ti; I didn't dare to put in a second one.

@yurymalkov commented Apr 19, 2018

@pmcrodrigues @castleguarders
I've also "solved" the problem by feeding the second GPU from a separate PSU (1000W + 1200W for 2x 1080 Ti). Reducing the power draw to 0.5x via nvidia-smi -pl also helped, but it killed the performance. I also tried different motherboards/GPUs, but that didn't help.

@pmcrodrigues commented Apr 19, 2018

@castleguarders @yurymalkov Thank you both. I have also tried reducing the power draw via nvidia-smi and it stopped crashing the system. But stress tests at full power draw, run simultaneously on my 2 Xeons (with http://people.seas.harvard.edu/~apw/stress/) and the 4 1080 Tis (with https://github.com/wilicc/gpu-burn), didn't make it crash. So for now I have only seen this problem with pytorch. Maybe I need other stress tests?

@yurymalkov commented Apr 19, 2018

@pmcrodrigues gpu-burn seems to be a bad test for this, as it does not create steep power ramps.
I.e. a machine can pass gpu-burn with 4 GPUs but fail with 2 GPUs running a pytorch script.

The problem reproduces on some other frameworks as well (e.g. tensorflow), but pytorch scripts seem to be the best test, probably because of their highly synchronous nature.
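
To illustrate what I mean by a steep ramp: a load that alternates between fully idle and all GPUs busy at the same instant stresses the PSU far more than a steady burner like gpu-burn. A made-up sketch, with purely illustrative sizes and timings:

```python
# Sketch of a "steep ramp" load: all GPUs idle, then all slam to full load at
# once, repeatedly. Not a recommended benchmark, just an illustration of the pattern.
import time
import torch

mats = [torch.randn(8192, 8192, device=f"cuda:{i}")
        for i in range(torch.cuda.device_count())]

for _ in range(100):
    time.sleep(1.0)                       # everyone idle -> power drops
    for _ in range(50):                   # everyone busy at once -> power spikes
        for i, m in enumerate(mats):
            mats[i] = m @ m               # async matmuls keep every GPU pegged
    for i in range(len(mats)):
        torch.cuda.synchronize(i)
```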

@gurkirt commented May 2, 2018

I am having the same issue. Has anybody found any soft solution to this?
I have a 4-GPU system with one CPU and a 1500W power supply. Using 3 out of 4 or 4 out of 4 GPUs causes the reboot.
@castleguarders @yurymalkov @pmcrodrigues How do I reduce the power draw via nvidia-smi?

@pmcrodrigues commented May 2, 2018

@gurkirt For now, I am only using 2 GPUs with my 1500W PSU. If you want to test reducing the power draw, you can use "nvidia-smi -pl X" where X is the new power limit in watts. For my GTX 1080 Ti I used "nvidia-smi -pl 150", whereas the standard draw is 250W. I am waiting on a more potent PSU to test whether it solves the problem. I currently have a measuring device on the wall outlet, and even when I am using 4 GPUs the draw does not pass 1000W. There could still be sharp peaks that are not being registered, but something is off. Either way, we probably need to go with dual 1500W PSUs.
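
The same thing can also be done from Python via the NVML bindings; a small sketch, assuming the pynvml (nvidia-ml-py) package is installed, and, like nvidia-smi -pl, it needs root to actually change the limit:

```python
# Query and lower the GPU power limit through NVML (the same knob as `nvidia-smi -pl`).
# Sketch only; assumes the pynvml (nvidia-ml-py) package and root privileges.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        cur_mw = pynvml.nvmlDeviceGetPowerManagementLimit(handle)
        print(f"GPU {i}: limit {cur_mw // 1000} W (allowed {min_mw // 1000}-{max_mw // 1000} W)")
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, 150 * 1000)  # 150 W, like `nvidia-smi -pl 150`
finally:
    pynvml.nvmlShutdown()
```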

@gurkirt commented May 2, 2018

@pmcrodrigues thanks a lot for the quick response. I have another system with a 2000W supply and 4 1080 Tis; that one works just fine. I will try plugging that power supply into this machine and see if 2000W is enough here.

@gurkirt commented May 2, 2018

@pmcrodrigues did you find any log/warning/crash report anywhere?

@pmcrodrigues commented May 2, 2018

@gurkirt None.

@lukepfister commented Aug 8, 2018

I'm having a similar problem: an audible click, then complete system shutdown.

It seems that it only occurs with BatchNorm layers in place. Does that match your experience?

@gurkirt commented Aug 8, 2018

I was using resnet at that time. It is a problem of an inadequate power supply, i.e. a hardware problem; I needed to upgrade the power supply. According to my searches online, the power surge issue is specific to pytorch. I upgraded the power supply from 1500W to 1600W. The problem still appears now and then, but only when the room temperature is a bit higher. I think there are two factors at play: room temperature, and, the major one, the power supply.

@dov commented Aug 16, 2018

I have the same problem with a 550W power supply and a GTX 1070 graphics card. I start the training and about a second later the power cuts out.

But this made me think that perhaps it would be possible to trick/convince the PSU that everything is OK by creating a ramp-up function that e.g. mixes sleeps and GPU activity and gradually increases the load. Has anyone tried this? Does someone have minimal code that reliably triggers the power cut?
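
Something along these lines is what I have in mind; an unverified sketch, with made-up duty-cycle numbers and matrix sizes:

```python
# Gradually increase GPU duty cycle from near-idle to ~100% by mixing short
# matmul bursts with sleeps, to soften the power ramp before real training.
import time
import torch

def ramp_up_gpu_load(steps=20, period=0.5, size=4096, device="cuda"):
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    for step in range(1, steps + 1):
        duty = step / steps                         # fraction of the period spent busy
        busy_until = time.time() + duty * period
        while time.time() < busy_until:
            a = a @ b                               # keep the GPU busy
        torch.cuda.synchronize()
        time.sleep((1.0 - duty) * period)           # idle for the rest of the period

if __name__ == "__main__":
    ramp_up_gpu_load()
    # ...then start the real training
```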

@vwvolodya commented Sep 6, 2018

Had the same issue with a GTX 1070, but the reboots were not random.
I had code that was able to make my PC reboot every time I ran it, after at most one epoch.
At first I thought it could be the PSU, since mine is only 500W. However, after closer investigation, and even after setting the max power consumption to lower values with nvidia-smi, I realized the issue was somewhere else.
It was not an overheating problem either, so I started to think it might be the i7-7820X's Turbo mode. After disabling Turbo mode in the BIOS settings of my Asus X299-A and changing Ubuntu's configuration as stated here, the issue seems to be gone.

What did NOT work:

  • Changing pin_memory for dataloaders.
  • Playing with batch size.
  • Increasing system shared memory limits.
  • Setting nvidia-smi -pl 150 out of 195 possible for my system.

Not sure if this is related to BIOS issues. I am running BIOS version 1203 while the latest is 3 releases ahead (1503), and they put "improved stability" into the description of each of those 3 releases (see the Asus X299-A BIOS versions page). One of those releases also had "Updated Intel CPU microcode." So there is a chance this is fixed there.

@dov commented Sep 6, 2018

For the record, my problem was a broken power supply. I diagnosed this by running https://github.com/wilicc/gpu-burn on Linux and then FurMark on Windows, under the assumption that unless I could reproduce the crash on Windows, they wouldn't talk to me at my computer shop. Both of these tests failed for me, whereupon I took the computer in for repair and got a new power supply. Since then, I have been running pytorch for hours without any crashes.

@DanielLongo commented Oct 22, 2018

Has anyone found a way to fix this? I have a similar error where my computer restarts shortly after I start training. I have a 750W PSU and only 1 GPU (1080 Ti), so I don't think it is a power problem. Also, I did not see increased wattage going to my GPU before it restarted.


@yaynouche commented Nov 12, 2018

If I can add some more information to vwvolodya's great comment: our motherboard/CPU configuration was an ASUS TUF X299 MARK 2 with an i9-7920X. The BIOS version was 1401. The only thing that could prevent the system from rebooting/shutting down was to turn off Turbo Mode.

For now, after updating to 1503, the problem seems to be solved with Turbo Mode activated.

Have a great day guys!

@zym1010 (Contributor) commented Jan 20, 2019

> If I can add some more information to vwvolodya's great comment: our motherboard/CPU configuration was an ASUS TUF X299 MARK 2 with an i9-7920X. The BIOS version was 1401. The only thing that could prevent the system from rebooting/shutting down was to turn off Turbo Mode.
>
> For now, after updating to 1503, the problem seems to be solved with Turbo Mode activated.

@yaynouche @vwvolodya similar issues happened on an ASUS WS-X299 SAGE with an i9-9920X. Turning off Turbo Mode is the only solution right now, with the latest BIOS (version 0905, which officially supports the i9-9920X).

UPDATE: it turns out I must enable Turbo Mode in the BIOS and use commands like echo "1" > /sys/devices/system/cpu/intel_pstate/no_turbo as in #3022 (comment) to disable turbo via software. If I disable Turbo Mode in the BIOS, the machine still reboots.

UPDATE 2: I think turning off Turbo Mode only lowers the chance of my issue, it does not eliminate it.

@Suley commented Apr 7, 2019

> I am having the same issue. Has anybody found any soft solution to this? I have a 4-GPU system with one CPU and a 1500W power supply. Using 3 out of 4 or 4 out of 4 GPUs causes the reboot. @castleguarders @yurymalkov @pmcrodrigues How do I reduce the power draw via nvidia-smi?

Facing the same problem: 4 GTX 1080 Tis with a 1600W PSU (with redundancy). Tried to use gpu-burn to test it and it's stable as a rock.

@zym1010 (Contributor) commented Apr 7, 2019

@Suley personally I think this is more of a CPU problem; basically, pytorch makes the CPU execute a series of instructions which draw too much power from the motherboard.

@Suley commented Apr 7, 2019

> @Suley personally I think this is more of a CPU problem; basically, pytorch makes the CPU execute a series of instructions which draw too much power from the motherboard.

Thanks for your reply. I will test the CPUs to identify the problem.

@Suley commented Apr 7, 2019

> @Suley personally I think this is more of a CPU problem; basically, pytorch makes the CPU execute a series of instructions which draw too much power from the motherboard.

I ran a CPU stress test and a GPU stress test at the same time; no problem found.
My mobo supports a 150W TDP; my CPU's TDP is 115W.
So my max power consumption would be: 115W * 2 (CPUs) + 250W * 4 (1080 Tis) + 200W (disks and other components) = 1430W.
It seems that 1600W is enough. Besides, there are two redundant 1600W power supplies which both output power, meaning each PSU only carries half of the load.

2 GPUs work OK.
3 GPUs: unstable, reboots after a few minutes.
4 GPUs: crashes immediately; the system reboots and no logs are recorded.

@zym1010 (Contributor) commented Apr 7, 2019

I also tried running stress tests for CPU and GPU simultaneously; no issue at all. Maybe it's due to the type of instructions... not sure. Can you try disabling some CPU cores or underclocking them? In my case, this decreased the probability/frequency of reboots but did not fix the problem. The fact that reducing CPU load makes programs more stable (at least on my machine) is why I think this is a CPU issue.

@Suley commented Apr 9, 2019

> I also tried running stress tests for CPU and GPU simultaneously; no issue at all. Maybe it's due to the type of instructions... not sure. Can you try disabling some CPU cores or underclocking them? In my case, this decreased the probability/frequency of reboots but did not fix the problem.

Thanks. Currently there's a task running on the server. I will try it after the task is finished and share my test results.
But I still can't explain why the GPU and CPU stress tests pass while pytorch doesn't. Hope someone can dig into this and come up with a solution.

@Suley commented Apr 13, 2019

> I also tried running stress tests for CPU and GPU simultaneously; no issue at all. Maybe it's due to the type of instructions... not sure. Can you try disabling some CPU cores or underclocking them? In my case, this decreased the probability/frequency of reboots but did not fix the problem.

It seems you are right: it's a CPU-related issue. After I disabled all CPU cores except cpu0, it worked.
But then only one core works; enabling half of the cores still crashes.
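
For anyone repeating this experiment without going through the BIOS: cores can also be taken offline at runtime through the kernel's CPU-hotplug interface. A rough sketch, run as root; the helper name is made up:

```python
# Take all CPUs except the first n offline via the kernel CPU-hotplug interface.
# Needs root; cpu0 has no 'online' file on x86 and always stays up.
import glob
import re

def keep_n_cores(n=1):
    paths = glob.glob("/sys/devices/system/cpu/cpu[0-9]*/online")
    paths.sort(key=lambda p: int(re.search(r"cpu(\d+)", p).group(1)))
    for path in paths:
        cpu = int(re.search(r"cpu(\d+)", path).group(1))
        with open(path, "w") as f:
            f.write("1" if cpu < n else "0")

if __name__ == "__main__":
    keep_n_cores(1)   # leave only cpu0 online
```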

@zym1010 (Contributor) commented Apr 13, 2019

@Suley do you use the X299 chipset? It seems that many builds with X299 have this problem.

@robotrovsky commented Apr 23, 2019

1600W PSU with 4x 2080 Ti, facing the same problem. I attached a second 750W PSU with an ADD2PSU, and now I am running 1600W PSU = 3x 2080 Ti + system and 750W PSU = 1x 2080 Ti, and everything seems stable. As commented by others, pytorch is the only application stressing the GPUs so much that they run into current protection. Miners, renderers, and stress tests are all comfortable with one 1600W PSU. So this was a hardware issue, and from now on pytorch will be my GPU stress test :-) BTW: I have an X399 build.

@gurkirt commented Apr 24, 2019

Yes, pytorch causes a power surge at the time of network initialisation. A 1600W PSU is enough if your PSU is platinum grade or better; silver and gold grade PSUs are not robust enough to handle the sudden change in power requirement. Your PSU can supply enough power on average, but it cannot handle the sudden jump from ~250W to 1000+W within seconds. Check the grade of your power supply. Also, turn off overclocking in the BIOS settings.

@yurymalkov commented Apr 24, 2019

@gurkirt I had a "platinum grade" 1200W PSU which couldn't handle two 1080 Ti GPUs, although it worked better than other PSUs that I had (1000W, different brands, not cheap).

@gurkirt commented Apr 28, 2019

I have a Corsair 1600W platinum with 4x 1080 Ti and it works fine.

@Suley commented Apr 28, 2019

> Yes, pytorch causes a power surge at the time of network initialisation. A 1600W PSU is enough if your PSU is platinum grade or better; silver and gold grade PSUs are not robust enough to handle the sudden change in power requirement. Your PSU can supply enough power on average, but it cannot handle the sudden jump from ~250W to 1000+W within seconds. Check the grade of your power supply. Also, turn off overclocking in the BIOS settings.

My PSU is platinum grade: a Supermicro 7047GR barebone with two 1600W supplies, 3200W combined.

@Suley commented Apr 28, 2019

> @gurkirt I had a "platinum grade" 1200W PSU which couldn't handle two 1080 Ti GPUs, although it worked better than other PSUs that I had (1000W, different brands, not cheap).

Strange! I have two platinum grade PSUs (1600W) and they can't handle 4 1080 Tis!

@Suley commented Apr 28, 2019

> @Suley do you use the X299 chipset? It seems that many builds with X299 have this problem.

No, I use X79, which is quite old. My X99 server works well.

@ZhengRui commented May 16, 2019

I had the same issue with 4x 2080 Ti + ASUS X299 SAGE + a Rosewill Hercules 1600W PSU (or a Corsair 1500i); disabling CPU turbo did not help. After switching to a Corsair 1600i Titanium, it works perfectly.

@zym1010 (Contributor) commented May 16, 2019

> I had the same issue with 4x 2080 Ti + ASUS X299 SAGE + a Rosewill Hercules 1600W PSU (or a Corsair 1500i); disabling CPU turbo did not help. After switching to a Corsair 1600i Titanium, it works perfectly.

@ZhengRui My machine also has 4x 2080 Ti + X299 SAGE, but with a 2000W PSU, and it is still failing... (maybe due to a CPU difference? Mine is a 12-core i9-9920X.)

@ZhengRui commented May 17, 2019

@zym1010 my CPU is a 10-core i9-9820.

@gurkirt commented May 23, 2019

> I had the same issue with 4x 2080 Ti + ASUS X299 SAGE + a Rosewill Hercules 1600W PSU (or a Corsair 1500i); disabling CPU turbo did not help. After switching to a Corsair 1600i Titanium, it works perfectly.

I had a similar case; after upgrading to a 1600i, it worked.

@jerry73204 commented May 25, 2019

In my case my machine has a 1080 and a 550W PSU. Running my libtorch program in Rust once is fine. However, if I repeatedly kill and restart the program every 30 seconds, the system either reliably shuts down or the GPU goes offline. Eventually the motherboard broke and cannot boot at all.

@gurkirt commented May 26, 2019

I think it is clear from the above discussion that mostly it is the fault of the PSU: the PSU not only has to have enough power output, it should also be robust enough to withstand the power surge. My advice, if you have this problem, is to change to a better PSU and keep the machine in a cool and dry place.
