Machine restarts when running TensorFlow with GPU #8858

Closed

surmenok opened this issue Mar 31, 2017 · 54 comments

@surmenok commented Mar 31, 2017

A simple Python program that sequentially runs a few TensorFlow computations crashes when running on GPU.

Code:

from __future__ import print_function
import numpy as np
import tensorflow as tf
from tensorflow.python.client import timeline

def train_model(run_number):
    image_size = 28
    num_labels = 10
    batch_size = 16
    layer1_neuron_count = 16384

    graph = tf.Graph()
    
    with graph.as_default():
        tf_valid_dataset = tf.constant(valid_dataset)

        # Variables.
        weights0 = tf.Variable(
            tf.truncated_normal([image_size * image_size, layer1_neuron_count]))
        biases0 = tf.Variable(tf.zeros([layer1_neuron_count]))

        weights1 = tf.Variable(
            tf.truncated_normal([layer1_neuron_count, num_labels]))
        biases1 = tf.Variable(tf.zeros([num_labels]))

        valid_layer0 = tf.nn.relu(tf.matmul(tf_valid_dataset, weights0) + biases0)
        valid_prediction = tf.matmul(valid_layer0, weights1) + biases1
    
    with tf.Session(graph=graph) as session:
        tf.global_variables_initializer().run()

        print('Validation')
        
        session.run(valid_prediction)
            
        print('Validation done')

valid_dataset = np.random.uniform(-1, 1, (10000, 784)).astype(dtype=np.float32)
valid_labels = np.random.uniform(0, 1, (10000, 10)).astype(dtype=np.float32)

for i in range(10):
    print("Run #{}".format(i))
    train_model(i)

It should run the same computation 10 times, recreating a graph and a session every time.
It works fine when I run it on CPU. When running on GPU, it fails while running the computation for the 2nd, 3rd, or 4th session.

Console output:

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
Run #0
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:01:00.0
Total memory: 5.93GiB
Free memory: 5.83GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
Validation
Validation done
Run #1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
Validation
Validation done
Run #2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
Validation
Validation done
Run #3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
Validation

Then the machine just restarts.
There are no relevant messages in syslog before the restart.

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

http://stackoverflow.com/questions/39122984/system-auto-reboot-when-tensorflow-model-is-too-large
http://stackoverflow.com/questions/41237115/computer-restarts-with-large-mini-batches-in-tensorflow

When running other TensorFlow programs, I noticed that such crashes sometimes happen when I use large tensors. The issues above seem to be related; at least the symptoms are similar.

Environment info

GPU: GeForce GTX 980 Ti
Operating System: Ubuntu 16.04.2 LTS

Installed version of CUDA and cuDNN: CUDA 8.0.61, cuDNN 5.1.10
Output of ls -l /usr/local/cuda/lib64/libcud*:

-rw-r--r-- 1 root root    556000 Mar 30 18:05 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root        16 Mar 30 18:05 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root root        19 Mar 30 18:05 /usr/local/cuda/lib64/libcudart.so.8.0 -> libcudart.so.8.0.61
-rwxr-xr-x 1 root root    415432 Mar 30 18:05 /usr/local/cuda/lib64/libcudart.so.8.0.61
-rw-r--r-- 1 root root    775162 Mar 30 18:05 /usr/local/cuda/lib64/libcudart_static.a
lrwxrwxrwx 1 root root        13 Mar 30 19:42 /usr/local/cuda/lib64/libcudnn.so -> libcudnn.so.5
lrwxrwxrwx 1 root root        18 Mar 30 19:42 /usr/local/cuda/lib64/libcudnn.so.5 -> libcudnn.so.5.1.10
-rwxr-xr-x 1 root root  84163560 Mar 30 19:42 /usr/local/cuda/lib64/libcudnn.so.5.1.10
-rw-r--r-- 1 root root  70364814 Mar 30 19:42 /usr/local/cuda/lib64/libcudnn_static.a

TensorFlow:

  1. "pip install tensorflow-gpu". Version 1.0.1
  2. The output from python -c "import tensorflow; print(tensorflow.__version__)":

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
1.0.1

What other attempted solutions have you tried?

I tried reinstalling Ubuntu/CUDA/cuDNN/TensorFlow; it didn't help.

@jubjamie (Contributor) commented Mar 31, 2017

@surmenok Are you able to monitor nvidia-smi as you run the code? I think you can run nvidia-smi -l 1 and it should auto-update every second.
I'm struggling to understand your code, but if you are running your code example above 10 times (or similar), perhaps you are running out of memory? Now, perhaps this is a TensorFlow memory bug, or maybe it's poor memory coding on your part. I'm not too well versed in that area!

If nvidia-smi shows your graphics card powering through the memory or overheating or similar, then we can go from there. Can I ask what GPU options you have set up for your session calls, please?

@surmenok (Author) commented Mar 31, 2017

The formatting of my code in the original post was not good. I updated it; hopefully it is clearer now.
Basically, the function train_model creates a tf.Graph and a session and runs a simple computation: multiply the constant tf_valid_dataset by the variable weights0, add the biases0 variable, apply tf.nn.relu, then multiply by weights1 and add biases1.
The constant tf_valid_dataset is initialized randomly.
This train_model function is then executed sequentially 10 times. It runs fine the first 2-3 times and then the machine crashes.
I doubt there are any memory-related issues in this code. I think TensorFlow is designed to throw a clean OOM exception if there is not enough memory.

I added a 3-second delay between function executions to be able to monitor the nvidia-smi output better.
Output of nvidia-smi when the program has just started:

Fri Mar 31 13:03:17 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  Off  | 0000:01:00.0     Off |                  N/A |
| 26%   37C    P2    65W / 250W |   5827MiB /  6076MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1790    C   python                                        5825MiB |
+-----------------------------------------------------------------------------+

Output of nvidia-smi in the middle (when running the function 2nd or 3rd time):

Fri Mar 31 13:03:21 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  Off  | 0000:01:00.0     Off |                  N/A |
| 26%   39C    P2    77W / 250W |   5838MiB /  6076MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1790    C   python                                        5836MiB |
+-----------------------------------------------------------------------------+

The last output of nvidia-smi, one second or less before machine restarts:

Fri Mar 31 13:03:29 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  Off  | 0000:01:00.0     Off |                  N/A |
| 26%   40C    P2    77W / 250W |   5838MiB /  6076MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1790    C   python                                        5836MiB |
+-----------------------------------------------------------------------------+

It doesn't seem to overheat. Memory consumption is tricky to measure. I think TensorFlow grabs almost all GPU memory from the beginning and manages memory itself, so to nvidia-smi it always looks like memory consumption is near the maximum.

Can I ask what GPU options you have set up for your session calls please?

Where can I see these options?
I don't pass any special options to the Session constructor in Python code. This is the code for session initialization:

with tf.Session(graph=graph) as session:

Are there any other options I should look at?
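For reference, the only session-level GPU options I'm aware of are passed through a ConfigProto; a minimal sketch (assuming TF 1.x; I am not currently setting any of these):

import tensorflow as tf

# Sketch only: none of these options are set in my actual code above.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                      # allocate GPU memory on demand instead of grabbing it all up front
# config.gpu_options.per_process_gpu_memory_fraction = 0.5  # alternatively, cap the process at ~50% of GPU memory

with tf.Session(graph=graph, config=config) as session:     # graph as defined in the code above
    tf.global_variables_initializer().run()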

@surmenok (Author) commented Mar 31, 2017

I simplified the code a bit more. This is the minimal failing code I could get:

from __future__ import print_function
import tensorflow as tf

def train_model(run_number):
    graph = tf.Graph()
    
    with graph.as_default():
        tf_dataset = tf.Variable(tf.truncated_normal([input_dimension, 1000]))
        weights0 = tf.Variable(tf.truncated_normal([1000, 10000]))
        result = tf.matmul(tf_dataset, weights0)
    
    with tf.Session(graph=graph) as session:
        tf.global_variables_initializer().run()

        print('Before computation')
        session.run(result)

input_dimension = 10000

for i in range(100):
    print("Run #{}".format(i))
    train_model(i)

The function creates two variables initialized with tf.truncated_normal, then multiplies them. The function is called 100 times sequentially.
The bug seems to be related to memory, because the size of the TensorFlow variables affects the behavior of the program. Behavior with different values of the input_dimension variable (it changes the size of one of the TensorFlow variables used in the computation):

1000 - works fine
2000 - reboots the machine on 61st execution
3000 - reboots the machine on 12th execution
4000 - reboots the machine on 21st execution
5000 - reboots the machine on 4th execution
10000 - reboots the machine on 3rd execution
25000 - reboots the machine on 2nd execution
29000 - reboots the machine on 9th execution
29500 - reboots the machine on 19th execution
29900 - reboots the machine on 10th execution
30000 - reboots the machine on 30th execution
40000 - works fine
50000 - works fine
100000 - works fine
500000 - OOM on 1st execution:
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 18.63GiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[500000,10000]
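For reference, the 18.63GiB figure matches the float32 footprint of the [500000, 10000] result tensor; a quick back-of-the-envelope check of the tensor sizes involved:

# Rough float32 memory footprint of the tensors in the minimal example above.
def tensor_gib(shape, bytes_per_element=4):
    size = 1
    for dim in shape:
        size *= dim
    return size * bytes_per_element / 2.0**30

input_dimension = 500000
print(tensor_gib([input_dimension, 1000]))   # tf_dataset: ~1.86 GiB
print(tensor_gib([1000, 10000]))             # weights0:   ~0.04 GiB
print(tensor_gib([input_dimension, 10000]))  # result:     ~18.63 GiB, matching the OOM message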

It's not deterministic. For some time it was working fine with values up to 4260, but then it started failing for values as low as 2000.
Interestingly, it works fine with higher values like 40000.

If I change the program to run the train_model function only once and run the program (executing 'python test.py' in a terminal window) multiple times, it also crashes the machine after a few runs.

@gunan (Member) commented Apr 1, 2017

This looks like an issue with your workstation.
You could check dmesg and syslog to see if the OS logged anything informative right before crashing.

@surmenok (Author) commented Apr 1, 2017

There are no messages in dmesg or syslog logged right before the system crashes.

@aselle (Member) commented Apr 1, 2017

I can't reproduce this example. I'd try upgrading or downgrading your NVIDIA kernel driver.

@gunan (Member) commented Apr 2, 2017

Also, looking at the StackOverflow responses, the only possibility I can see here is a power supply issue on your machine.
When running TF, you will run your GPU and your CPU to the max, so they will draw as much power as they can get. With smaller power supplies, we have also observed machines freezing and locking up. Other than that, I have no idea what could be causing this.

@surmenok (Author) commented Apr 2, 2017

I don't think it is a power supply issue. nvidia-smi shows that power consumption is very low. At the same time, I can run training of large neural networks that brings GPU utilization to over 90%, and it runs for many days without issues. Based on how differently TensorFlow programs on my machine react to tensors of different sizes, I think the problem is related to memory.

@surmenok (Author) commented Apr 2, 2017

I tried a suggestion from StackOverflow to limit GPU power using "sudo nvidia-smi -pl 150"; it didn't help.

@aselle (Member) commented Apr 2, 2017

When you have strange hardware errors like this, it's best to swap in a new identical GPU or use your existing GPU in a different PC.

@surmenok (Author) commented Apr 2, 2017

Thanks, Andrew. Currently, I don't have another GPU or PC to try. Will do when I have a chance.

@gunan (Member) commented May 22, 2017

Closing due to inactivity.
Please reopen when you have more information.

@gunan gunan closed this May 22, 2017

@mtakala commented Jun 11, 2017

A system with two P6000 24GB GPUs just hangs soon after some TensorFlow calculations. Ubuntu 16.04.2 LTS. No messages in dmesg or syslog. The workstation has a >1000 W PSU. The NVIDIA drivers and CUDA toolkit are the latest as of June 2nd, 2017; I will have to check the versions later.

@jubjamie (Contributor) commented Jun 12, 2017

Do you have any example code? I don't mind trying to run it (although I only have one GPU). If this is happening with all TensorFlow code, then it would seem to be your workstation. Odd that nothing shows up in dmesg or syslog.

@mtakala commented Jun 14, 2017

I have tested the system further. The power draw doesn't seem to be too large, as reported by the nvidia-smi tool, but the temperatures stabilized at 80 degrees Celsius while the fans remained at 30-40% at most.

I was also able to reproduce the freeze issue with MATLAB code that ran large matrix calculations on both GPUs. It could be that NVIDIA's Linux drivers have a problem.

@jubjamie (Contributor) commented Jun 15, 2017

Warning: As this doesn't appear to be a bug in TensorFlow, the devs may ask for this to be moved to Stack Overflow.

Yes, this does sound like you don't have a TensorFlow issue but instead perhaps have problems with your drivers. May I suggest reinstalling or updating your drivers. Also, search online for the exact driver you are using, as NVIDIA drivers sometimes have issues that the community tends to report. They might suggest you have a bad driver and give you a version to roll back to.

Nonetheless, it seems like this isn't a TensorFlow problem, so this will probably get closed. If you ask on Stack Overflow you'll probably have more luck there! Good luck getting it fixed!

@surmenok (Author) commented Aug 5, 2017

Now I feel confident that the issue is related to the power supply.
I changed the power supply from a "Corsair CX750 Builder Series ATX 80 PLUS" to a "Cooler Master V1000" and don't get the system crash anymore.
I think the problem is not the maximum output of the power supply (750W should be more than enough for one GPU), but perhaps other qualities of the power supply.

@jubjamie (Contributor) commented Aug 9, 2017

Power supplies are fiddly things indeed. If you think your power supply is bust, I would speak to Corsair. They're usually pretty good at sorting it out if it's in warranty. Glad this is solved. Thanks!

@mtakala commented Aug 11, 2017

Our issue turned out to be a power supply issue as well. Sadly, the workstations are provided by Lenovo (P910), so there are no other compatible PSUs on the market. Our GPUs were Quadro P6000s. We are now trying to solve the issue by limiting the GPU power consumption in Linux with the command: nvidia-smi --power-limit=150

@WERAQS commented Aug 13, 2017

This is a power supply issue. I've been experiencing these problems with my workstation (which had a 650W 80+ PSU) with an i7 CPU & Titan X Pascal GPU (1 SSD, 16 GB RAM, no HDD, no extra hardware). I had to connect another PSU to the case and feed the 8-pin power input from the second supply. No problems so far, but the voltage fluctuations are still continuing. It's better to run these machines with at least an 800W+ PSU to be safe. The 12V rail is dancing between 11.1V (which is highly dangerous) and 12.1V.

@mtakala commented Aug 20, 2017

I must say that Lenovo's most powerful 1200 W PSU for the P910 is not enough for two P6000s, as the PSU's +12V rails are not powerful enough.

Just wanted to write this here in case anyone else runs into this issue.

@bzamecnik commented Dec 18, 2017

@surmenok (Author) commented Dec 18, 2017

I'm not sure either. I tried running gpu_burn to utilize the GPU at 100% and it worked fine, but TensorFlow was able to bring the machine down. The problem was solved by replacing the power supply.

@kxhit commented Dec 18, 2017

@surmenok
I tried running gpu_burn to utilize the GPU at 100% and it worked fine too. But running your code with input_dimension=10000 or more makes my system reboot suddenly. I'm using a 600W power supply. I may have to replace the power supply and see if that helps.

@chrschorn commented Dec 19, 2017

I fixed this problem temporarily by limiting the power available to my Titan X to 150W (instead of 250W):

sudo nvidia-smi --persistence-mode=1
sudo nvidia-smi --power-limit=150

Before, my Ubuntu 14.04 LTS machine would shut down or reboot without any errors in the syslog when starting training on a VGG16 + SSD network.

@mxdbld commented Jan 7, 2018

I replaced the PSU with a more powerful one to correct a similar problem.

@metral commented Feb 6, 2018

I've encountered the same reboot / hard-crashes on a single Nvidia GTX 1080ti with an Ubuntu 17.10 host anytime I run some examples of TF models (e.g. nearest neighbor, linear regression etc.), and I can't come to a conclusion as to why the crashes continue.

I've tried various nvidia drivers, against different kernel versions, all with CUDA 9 + cuDNN 7 + TF 1.5.0:

  • Linux 4.14.x -> Nvidia 384.111, 387.34, 390.25, 390.12
  • Linux 4.13.x -> Nvidia 384.98

I'm also using a 1200W, 80+ Platinum certified PSU, which is more than plenty, so I doubt it's a PSU issue. I tried @chrschorn's suggestion to limit the power to 150W instead of the default 250W, but it still crashes with TF jobs.

I don't currently have another machine or GPU to test against, so hoping anyone can provide some clarity or possible next steps. Thanks!

@nanoant commented Feb 8, 2018

Hello everyone.

I want to report a similar problem. The script from @surmenok does not restart my machine. What does is Keras VGGNet training running on the TensorFlow backend (see below) on Arch Linux with NVIDIA 390.25 drivers.

What's more interesting, it crashes only when using a batch size of 64. It also reports a few out-of-memory errors, continues to run, and then my machine reboots after ~1.5 epochs.

$ python vggnet_keras.py 64
Using TensorFlow backend.
2018-02-08 15:04:21.206315: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-02-08 15:04:21.345612: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-08 15:04:21.346057: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 7.92GiB freeMemory: 7.83GiB
2018-02-08 15:04:21.346073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
Train on 1224 samples, validate on 136 samples
Epoch 1/500
2018-02-08 15:04:28.236834: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-02-08 15:04:29.400414: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.45GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
1224/1224 [==============================] - 25s 20ms/step - loss: 3.3798 - acc: 0.1389 - val_loss: 14.8144 - val_acc: 0.0809
Epoch 2/500
 832/1224 [===================>..........] - ETA: 5s - loss: 3.0877 - acc: 0.1719
packet_write_wait: Connection to ... port 22: Broken pipe

Reducing the batch size to 32 makes the memory errors disappear, and everything works well, with no restart.
I also tried it on Windows 10 and it works well - no restarts, though noticeably slower than on Linux.

I also tried VGGNet using TFLearn, but it fails to run with a batch size of 64, claiming it is out of memory.

I believe this can indeed be a PSU problem, but also very likely a corner case where the NVIDIA GPU draws some "out of spec" current when running a specific TensorFlow setup, causing restarts on machines using weaker but within-spec PSUs.

I am using a gaming PC (Lenovo Y710 Cube) that came with a GTX 1070 and a 450W PSU, and it was assembled by Lenovo - not me, so I believe the PSU should be good enough for this GPU. NOTE: This box is also sold with a GTX 1080 and the same 450W PSU. Moreover, I have not had such problems with this machine before.

Now I wonder if we should report this to NVIDIA?

I was able to narrow down the example to the following script:

# VGGNet learning with NVIDIA 1070 restarts my Linux machine
# Using Miniconda Python 3.6 + keras-gpu
# Ported from https://github.com/the-deep-learners/TensorFlow-LiveLessons/blob/master/notebooks/vggnet_in_keras.ipynb

import numpy as np
np.random.seed(42)

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.layers.normalization import BatchNormalization
from keras.callbacks import TensorBoard  # for part 3.5 on TensorBoard

# import tflearn.datasets.oxflower17 as oxflower17
X, Y = np.random.random((1360, 224, 224, 3)), np.random.random((1360, 17)) # oxflower17.load_data(one_hot=True)

model = Sequential()

model.add(Conv2D(64, 3, activation='relu', input_shape=(224, 224, 3)))
model.add(Conv2D(64, 3, activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(BatchNormalization())

model.add(Conv2D(128, 3, activation='relu'))
model.add(Conv2D(128, 3, activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(BatchNormalization())

model.add(Conv2D(256, 3, activation='relu'))
model.add(Conv2D(256, 3, activation='relu'))
model.add(Conv2D(256, 3, activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(BatchNormalization())

model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(BatchNormalization())

model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(BatchNormalization())

model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(17, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])

import sys
batch_size = int(sys.argv[1]) if len(sys.argv) > 1 else 64
model.fit(X, Y,
          batch_size=batch_size,
          epochs=500, shuffle=True,
          verbose=1, validation_split=0.1)

I will try to find out if I can produce a more minimal example with TensorFlow rather than Keras and let you know.

@G-ram commented Feb 20, 2018

I also have this issue running an NVIDIA 1070 Ti and driver version 390.25. It happens when the system is under load – both CPU and GPU. It doesn't seem to be TensorFlow, though, since I can recreate it using gpu-burn and stress. Maybe a driver problem?

@mtakala commented Feb 20, 2018

It is indeed a corner-case issue, as NVIDIA GPUs can be made to draw an average of, for example, 150 W via the nvidia-smi --power-limit=150 command. However, the cards still silently spike much higher than that during heavy operations, and eventually the overcurrent protection in incompatible power supplies trips.

@lppier commented Mar 8, 2018

I tried the suggestion from @chrschorn and it works.
I was running the code from here: https://github.com/lppier/Convolutional_Neural_Networks/blob/master/high_accuracy_cnn_minst.py
which is pretty basic TensorFlow stuff.

sudo nvidia-smi --persistence-mode=1
sudo nvidia-smi --power-limit=150

Running Ubuntu 16.04 LTS and an NVIDIA GTX 1080.
It feels like a driver problem - NVIDIA, are you reading?

Edit: The above works for certain cases. Now when I'm running more intensive cases like inception_v3, I still sometimes get the restarts. :(

@sdeneefe commented Mar 8, 2018

I am seeing a similar problem running TensorFlow on AWS p2.8xlarge instances. It is inconsistent in that some instances seem to have this problem and others do not. In fact, I am only seeing the problem with spot instances and not regular (on-demand) instances.

@lppier commented Mar 22, 2018

I'm not seeing restarts in TensorFlow for the same code on Windows 10, just in Ubuntu, so it doesn't make sense that it's a hardware issue.

I filed a bug report with NVIDIA following these steps:

  1. Go to developer.nvidia.com
  2. If you are not already a registered developer, register to become a registered developer ("Join" link in upper right hand corner.)
  3. Wait until your registration is approved. Typically less than 48 hours.
  4. Once registration is approved, log in using your credentials. (login link in upper right corner)
  5. In the upper right hand corner, use the drop-down by your name to click on "my account"
  6. On the left hand side click on "My Bugs"
  7. On the right hand side click on "Submit a new bug"

If you can spare some time, do submit a bug report as well, so that we can get NVIDIA to look at this.

@JianbingDong commented Jun 6, 2018

I have encountered the same problem! After the GPU power consumption was limited to 70%, my computer could finally work.
The configuration is:
Windows 7;
CPU: i7-6700;
GPU: Asus 1080 Ti;
TF: 1.8.0;
Python: 3.5.2;
and the power supply unit is 700W in total.

@Profetul commented Sep 6, 2018

CPU: i7-2700k
GPU: brand new Asus STRIX 1080TI
PSU: OCZ Fatal1ty 750W

I can't run even one epoch of the code posted by @nanoant; it causes a reboot.

I've monitored power using nvidia-smi and it doesn't go over 64W.

I've ordered a new 1200W PSU, but my feeling is that it's a combination of GPU instructions and memory access that's causing the reset.

UPDATE 1: Installed the 1200W PSU - no more crashes ... so it's related to wattage usage ...

UPDATE 2: Sadly, my computer still reboots while training - I'm training Mask R-CNN - Keras 2.2.3 + TensorFlow 1.10.0. PSU - FSP AURUM PT 1200W - a Platinum-rated PSU.

@Miffyli commented Sep 8, 2018

Ubuntu 16.04 (KDE Neon), Tensorflow 1.9.0, CUDA 9 + CuDNN 7.0
CPU: Xeon W-2125
GPU: Asus GTX 1080 Turbo with version 384 drivers (from distribution)
PSU: Fujitsu proprietary 800W

I'm having similar trouble when training a VAE for longer than ~30 minutes, after which the computer reboots without any errors in the logs. Temperatures were in an OK range (max 80C for the GPU). Running Hashcat on the GPU + stressing the CPU with primesieve did not crash the computer overnight.

After checking this thread I decided to record the power usage with nvidia-smi under different loads, to see if that would shed more light on this.

Recorded with nvidia-smi --loop-ms=20 --format=csv,noheader,nounits --query-gpu=power.draw > out.txt.

[Three plots of the recorded power draw under different loads]

The default max of this GTX 1080 is 180W. The draw drops with Hashcat because of throttling, but it is more or less constant over time. @nanoant's code is a horrible mess for the PSU to handle, peaking at 230W. My VAE training code is not that terrible, but it could still cause issues.

The general "wattage" of a PSU does not guarantee its capability to handle spiky draws like these. Drawing too much from one line may cause voltage drops and such, and cheap PSUs especially die under these conditions. An 8-pin power connector (150W) plus power from the PCI-E slot (75W) supply a supported 225W, so the 230W peak draws more power from one connector or the other than specified.
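For anyone repeating this measurement, a minimal sketch to summarize the recording (assuming out.txt produced by the nvidia-smi command above, i.e. one power-draw value in watts per line):

# Summarize the sampled power draw from out.txt (one float per line).
with open("out.txt") as f:
    samples = [float(line) for line in f if line.strip()]

print("samples:   {}".format(len(samples)))
print("mean draw: {:.1f} W".format(sum(samples) / len(samples)))
print("peak draw: {:.1f} W".format(max(samples)))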

That was an interesting little side project to dig into; now to fix the situation... I guess the best bet is to reduce the maximum allowed draw.

Edit: Changing from batch size 64 to 32 reduced the average power draw by 20W and training has now lasted over an hour (previously it crashed well before 1h).
Edit 2: Reducing the batch size worked for me, and the run has now been going for 48h without issues.

@cipri-tom commented Sep 12, 2018

Thank you @Miffyli! I have redone your analysis for a project of mine which was causing the machine to reboot, and I also noticed power draws beyond the GPU specification, namely up to 290 watts with a 1080 Ti, which is rated at 250 W.

[Power draw plot]
I should've labelled the axes: x is the sample index, y is the power draw. The line at 30000 is the eval run after the first epoch.

However, I have a 1000 W PSU, and only 1 of the 2 GPUs was in use, so I don't think I reached the limit of the PSU. I have previously run training on both GPUs at the same time (different architectures on each) and didn't have this problem.
Could there be a different issue, maybe something with the driver?

@Miffyli commented Sep 12, 2018

@cipri-tom

I should have added that these components are designed to handle occasional short spikes above their designed draw, so with good components these kinds of draws shouldn't create issues. A PSU can be specified to provide X watts, but with cheap PSUs the voltage drops as you try to draw more power, which may cause the computer to crash or shut down.

You could try reducing your overall draw by e.g. using smaller batches or forcing it via nvidia-smi with:

nvidia-smi -pm 1 
nvidia-smi -pl 200

For me the former worked and reduced the average draw by 20W. I do not know if setting the power limit via nvidia-smi is going to help with the spikes.

@mtakala commented Sep 12, 2018

Our experience with the Lenovo P910 workstation with the top-of-the-line Lenovo PSU is that even when limiting the power via the nvidia-smi command to the very lowest possible value, TensorFlow will make the PSU go into a shutdown state due to the power spikes. The system may last a minute, an hour, a day, or anything in between, so it is not predictable at all.

@rsippl commented Nov 14, 2018

I ran into this while running code that's almost identical to what @nanoant posted, and I have a very similar setup (Lenovo Y700 with GTX 1070 on Ubuntu 18.04) -- creepy :) The machine always restarts during the first epoch.

This works for me:
sudo nvidia-smi -pm 1 && sudo nvidia-smi -pl 120

I watch the power draw via
watch -n 0.5 nvidia-smi -q -d power

Power-limit values above 120 don't work. Reducing the batch size to 32 works for me too, but it is slightly slower than 64 with the power limit (15s vs. 14s per epoch).

In order to avoid running into this issue, I believe it's best to set the power limit at boot. On Ubuntu, I'm doing this with a one-line script in /etc/cron.d:
@reboot root nvidia-smi -pm 1 && nvidia-smi -pl 120

@maxlawwk commented Nov 18, 2018

I've encountered the same reboot / hard-crashes on a single Nvidia GTX 1080ti with an Ubuntu 17.10 host anytime I run some examples of TF models (e.g. nearest neighbor, linear regression etc.), and I can't come to a conclusion as to why the crashes continue.

I've tried various nvidia drivers, against different kernel versions, all with CUDA 9 + cuDNN 7 + TF 1.5.0:

  • Linux 4.14.x -> Nvidia 384.111, 387.34, 390.25, 390.12
  • Linux 4.13.x -> Nvidia 384.98

I'm also using a 1200W, 80+ Platinum certified PSU, which is more than plenty, so I doubt it's a PSU issue. I tried @chrschorn's suggestion to limit the power to 150W instead of the default 250W, but it still crashes with TF jobs.

I don't currently have another machine or GPU to test against, so hoping anyone can provide some clarity or possible next steps. Thanks!

I would suggest installing a TF build (downloadable pre-compiled, or compile it yourself) based on a different version of cuDNN (it does not need to be the latest version). It may be an issue related to glitches that happen with certain combinations of software and hardware.

@maxlawwk commented Nov 20, 2018

I had exactly the same symptoms: TensorFlow occasionally caused a system reboot. Such reboots were greeted by an Asus power-surge warning. Certain network configurations gave a higher chance of a reboot (e.g. placing a 1x1x1 conv kernel at the deepest layer instead of a 3x3x3 conv kernel; I have no idea why). The situation became more and more frequent, to the point that I couldn't do any meaningful training.

The problem started when I upgraded from an i5-4570 to the wattage-beast i7-4790K. At the beginning it rebooted very rarely, like once a week. After 3 weeks, it was rebooting every hour. Limiting the power usage to 150W and underclocking the GPU slightly lowered the chance of a reboot, but did not eliminate it. The problem was finally solved after upgrading my PSU to an EVGA G1+ 1000W.

I have 4x 120mm fans, 3x HDD, 1x SSD, 1x DVD-RW, an EVGA 1080 Ti without OC, an Asus Z97-P, and an i7-4790K with the stock cooler (also tried with a Cooler Master Hyper 212 EVO), with 5 USB flash drives always plugged in, and a Cooler Master 750W PSU (now a 1000W PSU). A few online calculators showed my configuration had reached over 90% usage of the total 3.3V+5V+12V rails of the 750W PSU.

@maxlawwk commented Nov 20, 2018

My verdict is that the software/firmware wattage limit of the GPU isn't a laser-accurate power limit of the hardware. It may be a limit on, say, the 5-second-averaged power draw of the GPU. Some hardware ops may momentarily cause a burst of current draw, lasting perhaps only a few hundred ms. Limiting the GPU power draw won't remove such ops but rather reduces their usage. Which hardware ops are triggered by the program code is related to the driver, cuDNN, and the TF build. Also, whether or not the system survives the voltage fluctuation incurred by the burst of current draw depends on luck.
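A rough way to look for such bursts is to sample the instantaneous draw as fast as NVML allows; a sketch (assuming the pynvml package is installed; note that NVML's own update rate limits how short a spike this can actually catch):

# Sample instantaneous GPU power draw in a tight loop and report the peak.
import time
from pynvml import (nvmlInit, nvmlShutdown,
                    nvmlDeviceGetHandleByIndex, nvmlDeviceGetPowerUsage)

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)

peak_mw = 0
t_end = time.time() + 60                                     # sample for one minute while the training job runs
while time.time() < t_end:
    peak_mw = max(peak_mw, nvmlDeviceGetPowerUsage(handle))  # reading is in milliwatts
    time.sleep(0.001)

print("peak draw: {:.1f} W".format(peak_mw / 1000.0))
nvmlShutdown()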

@Co1dAt0m commented Nov 21, 2018

I am having a similar issue. CPU: i9-7940X, one GTX 1080 Ti, motherboard: ASUS WS X299 Sage, PSU: Corsair AX1600i, OS: Ubuntu 18.04, CUDA 9.2, cuDNN: 7.1/7.2/7.4. The power is more than sufficient; however, whenever I run a certain piece of TensorFlow code (I'm still trying to figure out the problematic part), the computer reboots with no warning: the screen freezes for a while and then it reboots. I cannot find any record related to the reboot. It is just like a regular computer boot.
The same code runs fine on my 5-year-old computer with a GTX 1080 Ti, an EVGA 650W PSU, an i5-3570, an Asus Sabertooth Z77, and Ubuntu 16.04.

@maxlawwk commented Nov 22, 2018

I am having a similar issue. CPU: i9-7940X, one GTX 1080 Ti, motherboard: ASUS WS X299 Sage, PSU: Corsair AX1600i, OS: Ubuntu 18.04, CUDA 9.2, cuDNN: 7.1/7.2/7.4. The power is more than sufficient; however, whenever I run a certain piece of TensorFlow code (I'm still trying to figure out the problematic part), the computer reboots with no warning: the screen freezes for a while and then it reboots. I cannot find any record related to the reboot. It is just like a regular computer boot.
The same code runs fine on my 5-year-old computer with a GTX 1080 Ti, an EVGA 650W PSU, an i5-3570, an Asus Sabertooth Z77, and Ubuntu 16.04.

I would suggest running a stress test using some benchmark software to maximize the power consumption of both the CPU and the GPU. It may not be an issue with the PSU but with the aging of the mobo's power supply circuitry. You might look for popped capacitors on the mobo or the graphics card. Swapping components between the i9 and i5 computers would also help to troubleshoot the issue.

@Co1dAt0m commented Nov 23, 2018

I am having a similar issue. CPU: i9-7940X, one GTX 1080 Ti, motherboard: ASUS WS X299 Sage, PSU: Corsair AX1600i, OS: Ubuntu 18.04, CUDA 9.2, cuDNN: 7.1/7.2/7.4. The power is more than sufficient; however, whenever I run a certain piece of TensorFlow code (I'm still trying to figure out the problematic part), the computer reboots with no warning: the screen freezes for a while and then it reboots. I cannot find any record related to the reboot. It is just like a regular computer boot.
The same code runs fine on my 5-year-old computer with a GTX 1080 Ti, an EVGA 650W PSU, an i5-3570, an Asus Sabertooth Z77, and Ubuntu 16.04.

I would suggest running a stress test using some benchmark software to maximize the power consumption of both the CPU and the GPU. It may not be an issue with the PSU but with the aging of the mobo's power supply circuitry. You might look for popped capacitors on the mobo or the graphics card. Swapping components between the i9 and i5 computers would also help to troubleshoot the issue.

Thank you for replying. I have tested the same GPU on the old computer; it works great and runs the same code without a crash.
The CPU itself also works OK: I used it to compile TensorFlow with CPU usage over 90%, and no problem has been found so far. I also tested the computer with three 1600W PSUs; the symptoms are the same. It seems that the computer reboots when the program involves both the CPU and the GPU and the power consumption is very low.
I highly suspect it is the motherboard's problem. It is a new motherboard (ASUS X299 Sage), and 18.04 is the only version of Ubuntu I can successfully install.

@maxlawwk commented Nov 23, 2018

Your rebooting issue reminds me of a computer build with super bad luck. All components of that computer were new, but it always rebooted after 1 or 2 hours of use, which guaranteed an unsuccessful Windows installation. It was OK when I ran the entire machine under an AC unit with the side panel off. It passed all kinds of memtest and cputest runs that required no OS to operate. It turned out to be the faulty new RAM from Asus. The memtest would fail at the 3rd complete iteration, after about an hour. The components were functional but broke down under certain circumstances.

I guess your situation is similar. The PCIe x16 slot alone provides 75W of power. This current is regulated by mobo circuitry and is not connected directly to the PSU's 12V rail. That part of the circuitry gets stressed during full GPU load, even if the PSU has plenty of reserve juice. Try plugging another wattage beast into your Asus mobo. You may eventually pinpoint the problem and, in the best case, get a replacement from the vendor.

@Co1dAt0m commented Nov 24, 2018

The problem was fixed by downgrading the BIOS (ASUS X299 Sage) from 0601 to 0502.

Both the initial and the latest (0601) BIOS of the ASUS X299 Sage are problematic. With 0601, a script with several "tf.load_op_library('custom_cuda_module.so')" calls may reboot the computer; depending on the PSU and whether 'watch -d -n 0.2 nvidia-smi' is used, the computer may reboot at the script's first run, third run, or fifth run ...; the Keras example provided by @nanoant will also cause the computer to reboot after a few epochs.

After downgrading to 0502, all previously problematic code seems to work fine. However, I am still a little worried that some unknown motherboard bug may get triggered by other code in the future.

@sense-amid-madness commented Dec 13, 2018

Just wanted to confirm the above comment: this exact issue (random reboots) happened to me on an ASUS X299 Sage with BIOS 0601. There was a new BIOS available (0701) which fixed it; the system has now been running stably for 24 hours with four GPUs at full load.

@sbehuret commented Dec 28, 2018

If you experience sudden reboots, the most probable cause is that your PSU is underrated. A machine rebooting or shutting off suddenly is actually a good sign, because it suggests that the PSU is cleanly shutting off to prevent damage to the hardware. This happens with high-end PSUs that have overcurrent (OCP) / overload (OLP) protection (these two terms refer to the same thing). By contrast, lower-end PSUs may cause hardware crashes such as system freezes due to voltage instability.

I have seen a dual Titan X/Xp + Xeon E5 v4 145W setup work fine with an 850W PSU when doing simple TensorFlow compute involving both GPUs, but it almost always failed for heavy models that tend to maximize compute and memory use. This was solved by upgrading to a 1300W PSU. In short, you will be safe with an 800W PSU for a single 250W GPU, and a 1200W PSU for two 250W GPUs. Some people recommend ~1000W PSUs for dual-GPU setups, but in my experience this will not be sufficient.

As for PSU recommendations, I have seen the Seasonic Primes series work flawlessly, specifically the Titanium 850W for a single-GPU setup, and the Platinum 1300W for a dual-GPU setup. There are many other PSU manufacturers out there that do extremely well in these wattage categories.

I thought it would be useful to add some explanation as to why sudden reboots might be happening:

High-end GPUs that are rated for 250W routinely exceed 320W of power draw at stock frequencies and default wattage limit, for what appears to be periods up to hundreds of milliseconds. Looking at NVIDIA’s official recommendations for PSU wattage, 600W is the minimum requirement for a single Titan X/Xp/1080ti GPU, and 850W for two of these GPUs. Wattage recommendation for the newer RTX 2080ti GPU was bumped to 650W, not sure about the Titan RTX. Based on my own experience, these recommendations will definitely lead to sudden reboots with high-end CPUs that are rated for 145W+ (and probably peaking over 200W). The overall system power draw will easily exceed the recommended figures, seemingly randomly during compute and gaming. Good PSUs are accurate in detecting transient overloads, and will shut off at the first opportunity.

High transient currents were mentioned as a possible cause for sudden reboots, and it was also suggested that this might be a corner case of GPU utilization. It is true that some people have reported issues with PSUs shutting off wrongly in response to high transient currents that were supposedly still under the wattage limit of the PSU. In my opinion this was never clearly demonstrated, and for the vast majority of cases, sudden reboots are caused by underrated PSUs as explained above.

It was also mentioned that reboots might be caused by a software issue, as someone here reported that their hardware experienced sudden reboots on Linux but worked well on Windows. Linux and Windows drivers have different implementations, which may change how GPUs draw current. In addition, TensorFlow on Linux appears to run faster than on Windows, suggesting higher current draw on Linux. This is consistent with the observation that sudden reboots occur more frequently on Linux.

Another possible cause is the motherboard. As mentioned earlier in this thread, GPUs will get 75W from the PCIe slot, an additional 150W from the 8-pin connector and the last 75W from the 6-pin connector. For high-end GPUs that are rated for 250W+, this means getting the full (or nearly) 75W from the PCIe slot. Any weakness on the motherboard power circuitry could result in all sorts of crashes, even if the PSU is fine. If you think that your PSU is fine, try a different PCIe slot and a different motherboard BIOS version. If your RAM and CPU are doing fine in other tasks, it is unlikely that the power circuitry of the motherboard is affected.

If you experience random reboots that do not seem to correlate with system load or heat, I would suggest looking at the RAM. Bad RAM may work flawlessly for days in memtest, and then crash a system during idle just 5 minutes after booting. This kind of unpredictable behavior makes it very difficult to understand what is faulty in a system. If you can afford for more expensive server parts, get a Xeon and ECC RAM. This would definitely help if you experience hardware-related issues.

In conclusion, always opt for a premium PSU with plenty of extra watts. The overall power requirements and stress on the PSU will be high during GPU compute. Official wattage recommendations are way too low and will likely cause issues during GPU compute. Last, beware of cheap PSUs. Melted connectors were reported for dual-GPU gaming systems. It is clearly a better option to have a PSU cleanly shut off when a system draws too much current, than find melted wires and connectors.
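As a rough illustration of the headroom argument (my own rule of thumb, not an official sizing guideline):

# Rule-of-thumb PSU sizing: assume each GPU can transiently exceed its rating
# by ~30% and the CPU by ~40%, then add margin for the rest of the system.
def suggested_psu_watts(gpu_tdp_w, num_gpus, cpu_tdp_w, rest_of_system_w=100):
    transient_gpu = gpu_tdp_w * 1.3 * num_gpus   # e.g. a 250W card peaking past 320W
    transient_cpu = cpu_tdp_w * 1.4              # e.g. a 145W CPU peaking over 200W
    return round((transient_gpu + transient_cpu + rest_of_system_w) * 1.25)

print(suggested_psu_watts(250, 1, 145))  # ~785 W  -> in line with the ~800W suggested above
print(suggested_psu_watts(250, 2, 145))  # ~1191 W -> in line with the ~1200W suggested above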

@amsoftgroup commented Mar 21, 2019

I had similar issues: random reboots under TensorFlow GPU load with a Gigabyte X299 UD4 motherboard shipped with BIOS version F3. Official support for my CPU (Intel Core i7-9800X @ 3.80GHz) came with BIOS version F6j. I updated the BIOS and it is completely stable. Echoing others before me: make sure your motherboard BIOS explicitly supports the CPU, even if all appears to be well.
