Context in use? #526

Closed
noisychannel opened this Issue Dec 16, 2015 · 30 comments

@noisychannel

Running the following with GPU support:

python convolutional.py

throws the error:
F tensorflow/stream_executor/cuda/cuda_driver.cc:383] Check failed: CUDA_SUCCESS == dynload::cuCtxSetCurrent(context) (0 vs. 216)

Aborted

Error code 216 from cuCtxSetCurrent (which, I assume, binds the context to the calling CPU thread) corresponds to CUDA_ERROR_CONTEXT_ALREADY_IN_USE.

What might be causing this error? The script appears to transfer data to the GPU successfully and then fails when initialize_all_variables() is called.
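For anyone else mapping these numbers: the numeric codes in these messages come from the CUDA driver API's CUresult enum. A quick sketch for decoding the few codes that come up in this thread — the table below is a small hand-copied subset of the enum, so treat it as illustrative rather than complete:

```python
# Subset of CUDA driver API CUresult codes (hand-copied, not exhaustive).
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    101: "CUDA_ERROR_INVALID_DEVICE",
    216: "CUDA_ERROR_CONTEXT_ALREADY_IN_USE",
    304: "CUDA_ERROR_OPERATING_SYSTEM",
}

def decode_cu_result(code):
    """Map a numeric CUresult to its symbolic name, if known."""
    return CU_RESULT_NAMES.get(code, "unknown CUresult %d" % code)

print(decode_cu_result(216))  # prints CUDA_ERROR_CONTEXT_ALREADY_IN_USE
```

(The real driver API exposes cuGetErrorName for this; the dictionary here just avoids needing a CUDA install to look a code up.)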

@noisychannel


The complete log is here: http://pastebin.com/as0fWvYv

@noisychannel

noisychannel commented Dec 16, 2015

And the same issue with tutorials_example_trainer. Log here: http://pastebin.com/vPkFfete


@zheng-xq

Contributor

zheng-xq commented Dec 18, 2015

@noisychannel, could you provide a bit more information about your running environment? I see that you have two K20m cards on your machine. Is this a dedicated machine, or something shared?

On Fri, Dec 18, 2015 at 11:06 AM, Derek Murray wrote:

Assigned #526 to @zheng-xq.


@noisychannel

noisychannel commented Dec 18, 2015

Tried both scenarios:

  1. Dedicated with 2 K20s.
  2. Shared with 2 K20s and 1 available and selected for use.

Same issue in both cases.


@digitalsword

digitalsword commented Feb 13, 2016

@noisychannel I have the same issue. Did you solve it?


@noisychannel

noisychannel commented Feb 16, 2016

No, the issue remains.


@digitalsword

digitalsword commented Feb 17, 2016

@zheng-xq Is the bug fixed in the recently released TensorFlow 0.7?


@noisychannel

noisychannel commented Feb 22, 2016

Any updates here?


@zheng-xq

Contributor

zheng-xq commented Feb 22, 2016

I've started an offline conversation with the stream-executor team, since the error originates from stream-executor. Still waiting for their response.

@leary-google, @eliben, anything from the stream-executor side?


@digitalsword

digitalsword commented Mar 4, 2016

I am still seeing the same CUDA error with TensorFlow 0.7. The error is Check failed: CUDA_SUCCESS == dynload::cuCtxSetCurrent(context) (0 vs. 216)

version:

commit b88971051fbc49fa1e0b91ec1b0b60defa11697e
Merge: 5a30c8f 00986d4
Author: Derek Murray <mrry@google.com>
Date:   Fri Feb 26 05:08:35 2016 -0800

error:

++ python cifar10_train.py
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so.4 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so.7.5 locally
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: Tesla K40m
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:08:00.0
Total memory: 11.25GiB
Free memory: 11.15GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:718] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:08:00.0)
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 256B
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 512B
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 1.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 2.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 4.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 8.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 16.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 32.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 64.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 128.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 256.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 512.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 1.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 2.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 4.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 8.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 16.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 32.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 64.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 128.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 256.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:107] Allocating 10.60GiB bytes.
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:118] GPU 0 memory begins at 0x13047a0000 extends to 0x15aaa4019a
F tensorflow/stream_executor/cuda/cuda_driver.cc:383] Check failed: CUDA_SUCCESS == dynload::cuCtxSetCurrent(context) (0 vs. 216)
./run.sh: line 4: 33880 Aborted                 python cifar10_train.py

@rlrs

rlrs commented Mar 9, 2016

Indeed, I am seeing the same error sometimes on a shared K40. It seems to happen after someone else has completed a job, as if the context is somehow not cleared. I am sure that no job is actually executing on the GPU at the time.


@crscardellino

crscardellino commented Mar 13, 2016

I am having the same issue with GPU:1. I can run without problems on GPU:0, but when I try to force the graph onto GPU:1 using Graph.device(), I get the following: http://pastebin.com/ekTgqJ0U


@zxvix

zxvix commented Apr 3, 2016

I encountered the error Check failed: CUDA_SUCCESS == dynload::cuCtxSetCurrent(context) (0 vs. 216) recently. It turned out to be caused by someone accidentally setting the compute mode of the GPU to EXCLUSIVE_THREAD; reverting it to DEFAULT resolved the error.


@girving

Contributor

girving commented Jun 6, 2016

@zheng-xq: Should we contact the Stream Executor folks offline? It looks like they might not have GitHub notifications turned on.


@girving girving added the triaged label Jun 6, 2016

@noisychannel

noisychannel commented Jun 16, 2016

Issue still exists in 0.9.


@noisychannel

noisychannel commented Jun 16, 2016

Out of curiosity, how is this not a bigger issue? Is there a specific condition under which this failure occurs? The process seems to crash regardless of whether all or only some of the GPUs on the machine are available.

How do other people get around this?


@danpovey

danpovey commented Jun 16, 2016

Regarding the comment of @zxvix, saying: "I encountered the error Check failed: CUDA_SUCCESS == dynload::cuCtxSetCurrent(context) (0 vs. 216) recently, which turned out to be caused by someone accidently set the compute mode of GPU to be EXCLUSIVE_THREAD. Revert it back to DEFAULT solved my error."

Sometimes on shared clusters there are valid reasons for setting exclusive mode for GPUs. Does TensorFlow require particular modes? Is EXCLUSIVE_PROCESS a possibility?


@zheng-xq zheng-xq assigned zheng-xq and unassigned zheng-xq Jun 16, 2016

@zheng-xq

Contributor

zheng-xq commented Jun 16, 2016

Adding @henline, who is the owner of stream-executor.


@aselle aselle removed the triaged label Jul 28, 2016

@kramimus

kramimus commented Jul 28, 2016

I am seeing a similar issue on 0.9 compiled from source, HEAD:

commit 554ddd9ad2d4abad5a9a31f2d245f0b1012f0d10
Merge: 89e1cc5 a0745a7
Author: yifeif <fengyifei2026@gmail.com>
Date:   Tue Jul 26 16:17:21 2016 -0700

I am on a shared cluster with a scheduler, so I should have exclusive access to the node during my time slice. It looks like exclusive mode is set, but there are no running processes at the time I try to use it:

[2016-07-28T17:59:19Z]: +------------------------------------------------------+                       
[2016-07-28T17:59:19Z]: | NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
[2016-07-28T17:59:19Z]: |-------------------------------+----------------------+----------------------+
[2016-07-28T17:59:19Z]: | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
[2016-07-28T17:59:19Z]: | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
[2016-07-28T17:59:19Z]: |===============================+======================+======================|
[2016-07-28T17:59:19Z]: |   0  Tesla K40m          Off  | 0000:08:00.0     Off |                    0 |
[2016-07-28T17:59:19Z]: | N/A   25C    P8    19W / 235W |     23MiB / 11519MiB |      0%    E. Thread |
[2016-07-28T17:59:19Z]: +-------------------------------+----------------------+----------------------+
[2016-07-28T17:59:19Z]:                                                                                
[2016-07-28T17:59:19Z]: +-----------------------------------------------------------------------------+
[2016-07-28T17:59:19Z]: | Processes:                                                       GPU Memory |
[2016-07-28T17:59:19Z]: |  GPU       PID  Type  Process name                               Usage      |
[2016-07-28T17:59:19Z]: |=============================================================================|
[2016-07-28T17:59:19Z]: |  No running processes found                                                 |
[2016-07-28T17:59:19Z]: +-----------------------------------------------------------------------------+
[2016-07-28T17:59:21Z]: Using TensorFlow backend.
[2016-07-28T17:59:22Z]: I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so.7.5 locally
[2016-07-28T17:59:22Z]: I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
[2016-07-28T17:59:22Z]: I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so.7.5 locally
[2016-07-28T17:59:23Z]: I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so.4.0.7 locally
[2016-07-28T17:59:23Z]: I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so.7.5 locally
[2016-07-28T17:59:34Z]: I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
[2016-07-28T17:59:34Z]: name: Tesla K40m
[2016-07-28T17:59:34Z]: major: 3 minor: 5 memoryClockRate (GHz) 0.745
[2016-07-28T17:59:34Z]: pciBusID 0000:08:00.0
[2016-07-28T17:59:34Z]: Total memory: 11.25GiB
[2016-07-28T17:59:34Z]: Free memory: 11.15GiB
[2016-07-28T17:59:34Z]: I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
[2016-07-28T17:59:34Z]: I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
[2016-07-28T17:59:34Z]: I tensorflow/core/common_runtime/gpu/gpu_device.cc:839] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:08:00.0)
[2016-07-28T17:59:34Z]: F tensorflow/stream_executor/cuda/cuda_driver.cc:395] Check failed: CUDA_SUCCESS == dynload::cuCtxSetCurrent(cuda_context->context()) (0 vs. 216)
[2016-07-28T17:59:34Z]: X_train shape: (60000, 1, 28, 28)
[2016-07-28T17:59:34Z]: 60000 train samples
[2016-07-28T17:59:34Z]: 10000 test samples
[2016-07-28T17:59:41Z]: /tmp/wrapper5271742824235601482.sh: line 12: 61655 Aborted                 (core dumped) python mnist_cnn.py
[2016-07-28T17:59:41Z]: Exited with code 0

@danpovey

danpovey commented Jul 28, 2016

TensorFlow folks, if I were you I would change the check that fails here:

cuda/cuda_driver.cc:395] Check failed: CUDA_SUCCESS == dynload::cuCtxSetCurrent(cuda_context->context()) (0 vs. 216)

so that it prints the textual form of the CUDA error code, and perhaps also invokes nvidia-smi to gather extra information. We found this necessary in Kaldi to ensure that when there are problems, all the information needed is in the log.
Dan


Tensorflow guys, if I were you I would change the assert statement that
fails here:

cuda/cuda_driver.cc:395] Check failed: CUDA_SUCCESS ==
dynload::cuCtxSetCurrent(cuda_context->context()) (0 vs. 216)

to some code that prints out the text form of the CUDA exit code, and maybe
for good measure tries to invoke nvidia-smi to get extra information. We
found this necessary in Kaldi in order to ensure that when there are
problems, all the information needed is in the log.
Dan

On Thu, Jul 28, 2016 at 11:52 AM, Mark Whitney notifications@github.com
wrote:

I am seeing a similar issue on 0.9 compiled from source, HEAD:

commit 554ddd9
Merge: 89e1cc5 a0745a7
Author: yifeif fengyifei2026@gmail.com
Date: Tue Jul 26 16:17:21 2016 -0700

I am on a shared cluster with a scheduler, so I should have exclusive
access to the node during my time slice. It looks like exclusive mode is
set, but there are no running processes at the time time I try to use it:

2016-07-28T17:59:19Z: | NVIDIA-SMI 352.39 Driver Version: 352.39 |
2016-07-28T17:59:19Z: |-------------------------------+----------------------+----------------------+
2016-07-28T17:59:19Z: | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
2016-07-28T17:59:19Z: | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
2016-07-28T17:59:19Z: |===============================+======================+======================|
2016-07-28T17:59:19Z: | 0 Tesla K40m Off | 0000:08:00.0 Off | 0 |
2016-07-28T17:59:19Z: | N/A 25C P8 19W / 235W | 23MiB / 11519MiB | 0% E. Thread |
2016-07-28T17:59:19Z: +-------------------------------+----------------------+----------------------+
2016-07-28T17:59:19Z:
2016-07-28T17:59:19Z: +-----------------------------------------------------------------------------+
2016-07-28T17:59:19Z: | Processes: GPU Memory |
2016-07-28T17:59:19Z: | GPU PID Type Process name Usage |
2016-07-28T17:59:19Z: |=============================================================================|
2016-07-28T17:59:19Z: | No running processes found |
2016-07-28T17:59:19Z: +-----------------------------------------------------------------------------+
[2016-07-28T17:59:21Z]: Using TensorFlow backend.
[2016-07-28T17:59:22Z]: I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so.7.5 locally
[2016-07-28T17:59:22Z]: I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
[2016-07-28T17:59:22Z]: I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so.7.5 locally
[2016-07-28T17:59:23Z]: I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so.4.0.7 locally
[2016-07-28T17:59:23Z]: I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so.7.5 locally
[2016-07-28T17:59:34Z]: I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
[2016-07-28T17:59:34Z]: name: Tesla K40m
[2016-07-28T17:59:34Z]: major: 3 minor: 5 memoryClockRate (GHz) 0.745
[2016-07-28T17:59:34Z]: pciBusID 0000:08:00.0
[2016-07-28T17:59:34Z]: Total memory: 11.25GiB
[2016-07-28T17:59:34Z]: Free memory: 11.15GiB
[2016-07-28T17:59:34Z]: I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
[2016-07-28T17:59:34Z]: I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
[2016-07-28T17:59:34Z]: I tensorflow/core/common_runtime/gpu/gpu_device.cc:839] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:08:00.0)
[2016-07-28T17:59:34Z]: F tensorflow/stream_executor/cuda/cuda_driver.cc:395] Check failed: CUDA_SUCCESS == dynload::cuCtxSetCurrent(cuda_context->context()) (0 vs. 216)
[2016-07-28T17:59:34Z]: X_train shape: (60000, 1, 28, 28)
[2016-07-28T17:59:34Z]: 60000 train samples
[2016-07-28T17:59:34Z]: 10000 test samples
[2016-07-28T17:59:41Z]: /tmp/wrapper5271742824235601482.sh: line 12: 61655 Aborted (core dumped) python mnist_cnn.py
[2016-07-28T17:59:41Z]: Exited with code 0



@henline

henline commented Jul 29, 2016

Hi, I'm the owner of StreamExecutor. Sorry for arriving late to this discussion.

I believe this problem is caused in all cases by GPUs with their compute mode set to EXCLUSIVE_THREAD (just as mentioned by @zxvix). The solution is to set the compute mode to DEFAULT or EXCLUSIVE_PROCESS, which can be done via one of the following commands:

$ nvidia-smi --compute-mode=0 # for DEFAULT
$ nvidia-smi --compute-mode=3 # for EXCLUSIVE_PROCESS

The nvidia-smi -q command can also be used to query the current compute mode of the device.

If anyone is seeing this error when the device compute mode is either DEFAULT or EXCLUSIVE_PROCESS, please let me know, because I don't think that should be possible.

StreamExecutor will not work in either EXCLUSIVE_THREAD or PROHIBITED compute mode, but in response to @danpovey's question about shared clusters, EXCLUSIVE_PROCESS mode should be fine.

There are no plans in StreamExecutor to support EXCLUSIVE_THREAD mode because it is listed as deprecated in the nvidia-smi help message. There are also no plans to support PROHIBITED mode because I think that mode prevents the creation of contexts, and StreamExecutor cannot function with that restriction.

In response to @danpovey's suggestion about adding a better error message for this case, I think that's a good idea. I will work on getting a patch up to warn about the device compute mode if cuCtxSetCurrent fails and the compute mode is set to an unsupported setting.
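To summarize the compute-mode rules above in code: the numeric codes below match nvidia-smi's `--compute-mode` values, and `stream_executor_supported` is my own sketch of henline's description, not an API that exists in TensorFlow or StreamExecutor.

```python
# Mapping of nvidia-smi --compute-mode codes to names, annotated with
# whether StreamExecutor can create a CUDA context in that mode
# (per henline's comment above).
COMPUTE_MODES = {
    0: "DEFAULT",            # supported
    1: "EXCLUSIVE_THREAD",   # deprecated by NVIDIA; StreamExecutor will not work
    2: "PROHIBITED",         # context creation blocked; StreamExecutor cannot work
    3: "EXCLUSIVE_PROCESS",  # supported, including on shared clusters
}

def stream_executor_supported(mode_code):
    """Return True if StreamExecutor can run under this compute-mode code."""
    return COMPUTE_MODES.get(mode_code) in ("DEFAULT", "EXCLUSIVE_PROCESS")

if __name__ == "__main__":
    for code, name in sorted(COMPUTE_MODES.items()):
        print(code, name, "OK" if stream_executor_supported(code) else "unsupported")
```

On a real machine the current mode comes from `nvidia-smi -q` (look for the "Compute Mode" field), and is changed with the `nvidia-smi --compute-mode=...` commands shown above.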


@kramimus

kramimus commented Jul 29, 2016

Thanks, good to know it should work in EXCLUSIVE_PROCESS.


@alextp

Member

alextp commented Aug 15, 2016

So it sounds like this is working as intended, since StreamExecutor doesn't plan on supporting modes other than EXCLUSIVE_PROCESS.


@alextp alextp closed this Aug 15, 2016

@noisychannel

noisychannel commented Aug 16, 2016

Just to confirm, StreamExecutor works with EXCLUSIVE_PROCESS. Hopefully, @danpovey's suggestion about better error messages will be added in soon. It may be hard for people to search for this issue.


@MidoAssran

MidoAssran commented Mar 16, 2017

For anyone using GPU based tensorflow on Compute Canada resources, submitting the job by specifying EXCLUSIVE_PROCESS worked for me.


@zafarali

zafarali commented Apr 17, 2017

@MidoAssran Thanks for that! Will try it out.


@201power

201power commented Oct 3, 2017

I am seeing this error:
[screenshot of the error]
Changing the compute mode does not solve the issue, any guidance?


@danpovey


danpovey commented Oct 3, 2017

@201power

201power commented Oct 4, 2017

The GPU might be in use. This happened when one TF session was followed immediately by another TF session.
I am not sure how to tell whether the GPU is in use, since TF doesn't release memory after the first session finishes.

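One way to tell whether a previous process still holds the GPU is to list compute processes via `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader`. A small sketch of the parsing logic follows; in practice you would capture the output with `subprocess`, but here the parser is fed a sample string so it can be exercised without a GPU. The CSV field layout assumed (`pid, used_memory` with a "MiB" suffix) reflects nvidia-smi's query output format.

```python
def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV lines from nvidia-smi into (pid, MiB) tuples."""
    apps = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        pid_s, mem_s = (field.strip() for field in line.split(","))
        apps.append((int(pid_s), int(mem_s.split()[0])))  # "10997 MiB" -> 10997
    return apps

def gpu_is_busy(csv_text):
    """True if any compute process currently holds GPU memory."""
    return len(parse_compute_apps(csv_text)) > 0

if __name__ == "__main__":
    # A lingering TF process from a previous session would show up like this:
    sample = "61655, 10997 MiB\n"
    print(gpu_is_busy(sample))  # True
    print(gpu_is_busy(""))      # False: no processes, GPU is free
```

In EXCLUSIVE_PROCESS mode, a non-empty process list means a second TF session cannot acquire the device until the first one exits.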

@dee6600

dee6600 commented Feb 27, 2018

Changing the compute mode didn't solve the problem; I am getting the same error. My cards are two Tesla K80s, with driver version 375, CUDA version 8.0, and TensorFlow version 1.1.

