GPU sync failed #1450

Closed
Duum opened this issue Mar 10, 2016 · 29 comments
Labels: stat:awaiting response (Status - Awaiting response from author)

Comments

@Duum commented Mar 10, 2016

I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 256B
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 512B
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 1.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 2.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 4.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 8.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 16.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 32.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 64.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 128.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 256.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 512.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 1.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 2.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 4.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 8.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 16.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 32.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 64.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 128.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 256.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:107] Allocating 3.51GiB bytes.
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:118] GPU 0 memory begins at 0x503dc0000 extends to 0x5e45e3000
E tensorflow/stream_executor/cuda/cuda_driver.cc:1099] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS :: No stack trace available
F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed

Hi!
What's wrong here, and how can I solve it? I'm using CUDA 7.5 and cuDNN 7.0, and everything runs fine on CPU, but running on GPU produces the error above.
I can also locate the operation that won't run on GPU:

with tf.device("/cpu:0"):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

When I remove the tf.device("/cpu:0") scope, the bug reported above occurs.

@aymericdamien (Contributor)

It might be because the GTX 970 has memory issues when allocating more than 3.5 GB (see http://wccftech.com/nvidia-geforce-gtx-970-memory-issue-fully-explained/). You can try allocating less than 3.5 GB and check whether it corrects the issue; with per_process_gpu_memory_fraction=0.7 on a 4 GiB card, TensorFlow grabs roughly 2.8 GiB, safely below that limit:

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

@vrv commented Mar 10, 2016

Yikes. Good to know @aymericdamien, thanks!

@Duum (Author) commented Mar 11, 2016

It makes no difference...
When I change it to this:

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    sess.run(init)
    step = 1

I get the same error:

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: GeForce GTX 970
major: 5 minor: 2 memoryClockRate (GHz) 1.1775
pciBusID 0000:01:00.0
Total memory: 4.00GiB
Free memory: 3.60GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 256B
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 512B
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 1.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 2.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 4.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 8.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 16.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 32.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 64.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 128.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 256.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 512.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 1.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 2.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 4.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 8.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 16.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 32.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 64.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 128.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:53] Creating bin of max chunk size 256.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:107] Allocating 2.52GiB bytes.
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:118] GPU 0 memory begins at 0x503dc0000 extends to 0x5a54564cc
E tensorflow/stream_executor/cuda/cuda_driver.cc:1099] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS :: No stack trace available
F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed

@vrv commented Mar 11, 2016

Are you building from source, or did you install the pip package? What's your environment? E.g., all the information the template had but you removed :). If you built from source, what command line did you use?

@Duum (Author) commented Mar 11, 2016

I built from source, built a wheel, and installed it with pip. I have verified that

https://github.com/Duum/TensorFlow-Examples/blob/master/examples/3%20-%20Neural%20Networks/convolutional_network.py

runs fine on GPU, but my own code does not.
My machine runs Ubuntu 15.10 with GCC 5.2.1 and CUDA 7.5, so I commented out the GCC version check error in the CUDA code.
My cuDNN version is 7.0, installed with:

 sudo cp lib64/* /usr/local/cuda/lib64/
 sudo cp include/cudnn.h /usr/local/cuda/include/

My configure settings were:

Please specify the location of python. [Default is /usr/bin/python]: 
Do you wish to build TensorFlow with GPU support? [y/N] y
GPU support will be enabled for TensorFlow
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 7.5
Please specify the location where CUDA 7.5 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Please specify the Cudnn version you want to use. [Leave empty to use system default]: 7.0
Please specify the location where cuDNN 7.0 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]: 
Setting up Cuda include
Setting up Cuda lib64
Setting up Cuda bin
Setting up Cuda nvvm
Configuration finished

I live in mainland China, so I changed some code in the WORKSPACE file:


git_repository(
    name = "grpc",
    commit = "403cd6c",
    init_submodules = True,
    remote = "https://github.com/melody-rain/grpc.git",
)

and I also changed the .gitmodules file:

[submodule "google/protobuf"]
    path = google/protobuf
    url = https://github.com/google/protobuf.git
[submodule "third_party/boringssl"]
    path = third_party/boringssl
    url = https://github.com/doubler/boringssl.git

My build commands were:

bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-0.7.1-py2-none-linux_x86_64.whl


@mrry mrry added the cuda label Mar 14, 2016
@girving (Contributor) commented Jun 8, 2016

@zffchen78: Can you take a look? Is there any relationship between this and #2471?

@girving (Contributor) commented Jun 9, 2016

@vrv: Reassigning to you per @zffchen78's request.

@girving girving assigned vrv and unassigned zffchen78 Jun 9, 2016
@vrv commented Jun 9, 2016

Pretty sure this is going to be hard for us to debug without being able to reproduce this.

I would suggest:

1. Upgrading your NVIDIA drivers, then either
2a. updating CUDA to 7.5 and cuDNN v4 and installing TensorFlow r0.9, or
2b. updating CUDA to 8.0 and cuDNN v5 and installing TensorFlow from sources,

and then trying again.

@vrv vrv added the stat:awaiting response Status - Awaiting response from author label Jun 9, 2016
@aselle (Contributor) commented Jun 28, 2016

Automatically closing because there was no response. Please reopen if it is still an issue.

@aselle aselle closed this as completed Jun 28, 2016
@kbrems (Contributor) commented Sep 15, 2016

I am getting the same error when I create a simple custom operator that operates on a list of input tensors of type int32. My input tensor has 5 elements, so this is clearly not a memory-limitation issue.

Specifics:
ubuntu 14.04
GeForce GTX TITAN driver version 367.44
cuda 7.5, cudnn v4
binary pip install, TensorFlow GPU version 0.10.0rc0
python 2.7

Build and run the attached source code:
$ python cuda_op_unittest.py
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX TITAN
major: 3 minor: 5 memoryClockRate (GHz) 0.928
pciBusID 0000:05:00.0
Total memory: 5.94GiB
Free memory: 5.45GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:839] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN, pci bus id: 0000:05:00.0)
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GeForce GTX TITAN, pci bus id: 0000:05:00.0
I tensorflow/core/common_runtime/direct_session.cc:175] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GeForce GTX TITAN, pci bus id: 0000:05:00.0

int32: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:818] int32: /job:localhost/replica:0/task:0/gpu:0
int32/input_0: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:818] int32/input_0: /job:localhost/replica:0/task:0/gpu:0
*** running on GPU ***
E tensorflow/stream_executor/cuda/cuda_driver.cc:1140] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS :: No stack trace available
F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
Aborted (core dumped)

Key info:
If I run this with tf.device('/cpu:0'), it works.
If I make my inputs and outputs a single tensor instead of a list of one tensor, it also works (this took a while to figure out!). I.e., instead of .Input("input: in_types") in the REGISTER_OP, use .Input("input: int32"); see the sketch below.
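For illustration, here is a minimal sketch of the two registrations (hypothetical op names; the actual op is in the attached zip):

#include "tensorflow/core/framework/op.h"

// The list-of-tensors form, which triggers the GPU sync failure:
REGISTER_OP("MyOpList")
    .Attr("in_types: list(type)")
    .Input("input: in_types")      // a list of tensors (here, a list of one)
    .Output("output: in_types");

// The single-tensor form, which runs fine:
REGISTER_OP("MyOpSingle")
    .Input("input: int32")         // one plain int32 tensor
    .Output("output: int32");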

Notes:
Based on the response to #4387, some research led me here: http://stackoverflow.com/questions/37439299/no-gpu-kernel-for-an-int32-variable-op. It seems that TensorFlow does not really support GPU operators on integer tensors, and adding that support is difficult. In the interim, though, better documentation of integer tensor support and a meaningful error message would be preferable to a core dump :). A sketch of the usual workaround follows.
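Per the Stack Overflow thread above, many builtin kernels sidestep this by registering a GPU kernel whose int32 tensors stay in host memory. A minimal sketch, assuming the AddOneOp kernel from the attached example (whether this is appropriate depends on the op):

// Registers the op for GPU placement, but pins the int32 input/output to host
// memory, so the CPU functor does the actual work.
REGISTER_KERNEL_BUILDER(Name("AddOne")
                            .Device(DEVICE_GPU)
                            .HostMemory("input")
                            .HostMemory("output"),
                        AddOneOp<CPUDevice>);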

issue1450.zip

@vrv commented Sep 15, 2016

Does it work if you define the input type as int64?

@kbrems (Contributor) commented Sep 15, 2016

I can't build a custom operator with type int64:

karenbre@karenZ820:~/workspace/issue1450$ ./build.sh
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
In file included from /usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/type_traits.h:22:0,
                 from /usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/allocator.h:25,
                 from /usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/op_kernel.h:22,
                 from cuda_op_kernel.cc:17:
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/types.h: In instantiation of ‘struct tensorflow::DataTypeToEnum<long int>’:
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/tensor.h:587:46:   required from ‘typename tensorflow::TTypes<T, NDIMS>::ConstTensor tensorflow::Tensor::shaped(tensorflow::gtl::ArraySlice<long long int>) const [with T = long int; long unsigned int NDIMS = 1ul; typename tensorflow::TTypes<T, NDIMS>::ConstTensor = Eigen::TensorMap<Eigen::Tensor<const long int, 1, 1, long int>, 16>]’
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/tensor.h:354:40:   required from ‘typename tensorflow::TTypes<T>::ConstFlat tensorflow::Tensor::flat() const [with T = long int; typename tensorflow::TTypes<T>::ConstFlat = Eigen::TensorMap<Eigen::Tensor<const long int, 1, 1, long int>, 16>]’
cuda_op_kernel.cc:64:45:   required from ‘void AddOneOp<Device>::Compute(tensorflow::OpKernelContext*) [with Device = Eigen::ThreadPoolDevice]’
cuda_op_kernel.cc:80:80:   required from here
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/types.h:136:3: error: static assertion failed: Specified Data Type not supported
 static_assert(IsValidDataType<T>::value, "Specified Data Type not supported");
 ^
In file included from /usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/device_base.h:23:0,
                 from /usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/op_kernel.h:25,
                 from cuda_op_kernel.cc:17:
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/tensor.h: In instantiation of ‘typename tensorflow::TTypes<T, NDIMS>::ConstTensor tensorflow::Tensor::shaped(tensorflow::gtl::ArraySlice<long long int>) const [with T = long int; long unsigned int NDIMS = 1ul; typename tensorflow::TTypes<T, NDIMS>::ConstTensor = Eigen::TensorMap<Eigen::Tensor<const long int, 1, 1, long int>, 16>]’:
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/tensor.h:354:40:   required from ‘typename tensorflow::TTypes<T>::ConstFlat tensorflow::Tensor::flat() const [with T = long int; typename tensorflow::TTypes<T>::ConstFlat = Eigen::TensorMap<Eigen::Tensor<const long int, 1, 1, long int>, 16>]’
cuda_op_kernel.cc:64:45:   required from ‘void AddOneOp<Device>::Compute(tensorflow::OpKernelContext*) [with Device = Eigen::ThreadPoolDevice]’
cuda_op_kernel.cc:80:80:   required from here
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/tensor.h:587:46: error: ‘v’ is not a member of ‘tensorflow::DataTypeToEnum<long int>’
 CheckTypeAndIsAligned(DataTypeToEnum<T>::v());
 ^
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/tensor.h: In instantiation of ‘typename tensorflow::TTypes<T, NDIMS>::Tensor tensorflow::Tensor::shaped(tensorflow::gtl::ArraySlice<long long int>) [with T = long int; long unsigned int NDIMS = 1ul; typename tensorflow::TTypes<T, NDIMS>::Tensor = Eigen::TensorMap<Eigen::Tensor<long int, 1, 1, long int>, 16>]’:
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/tensor.h:284:40:   required from ‘typename tensorflow::TTypes<T>::Flat tensorflow::Tensor::flat() [with T = long int; typename tensorflow::TTypes<T>::Flat = Eigen::TensorMap<Eigen::Tensor<long int, 1, 1, long int>, 16>]’
cuda_op_kernel.cc:70:57:   required from ‘void AddOneOp<Device>::Compute(tensorflow::OpKernelContext*) [with Device = Eigen::ThreadPoolDevice]’
cuda_op_kernel.cc:80:80:   required from here
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/framework/tensor.h:546:46: error: ‘v’ is not a member of ‘tensorflow::DataTypeToEnum<long int>’
 CheckTypeAndIsAligned(DataTypeToEnum<T>::v());
 ^

@kbrems (Contributor) commented Sep 15, 2016

Note, it does work with int16.

@lhao0301

I just met the same problem, whether I installed TensorFlow from source or from the official binary (the installation itself completed without problems).
GPU: GTX TITAN X (12 GB)
CUDA: 7.5 + cuDNN v5.1
E tensorflow/stream_executor/cuda/cuda_driver.cc:1140] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS :: No stack trace available
F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
Aborted (core dumped)

@matpalm commented Sep 22, 2016

Also saw this last night, after two hours running at 80% GPU utilization:

E tensorflow/stream_executor/cuda/cuda_driver.cc:1140] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED :: No stack trace available
F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed

TITAN X (Pascal)

$ ls -l /usr/local/cuda/lib64/libcud*
-rw-r--r-- 1 root root   560184 Sep 15 19:53 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root       16 Sep 15 19:53 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root root       19 Sep 15 19:53 /usr/local/cuda/lib64/libcudart.so.8.0 -> libcudart.so.8.0.27
-rwxr-xr-x 1 root root   394472 Sep 15 19:53 /usr/local/cuda/lib64/libcudart.so.8.0.27
-rw-r--r-- 1 root root   737516 Sep 15 19:53 /usr/local/cuda/lib64/libcudart_static.a
-rwxr-xr-x 1 root root 79337624 Sep 15 20:08 /usr/local/cuda/lib64/libcudnn.so
-rwxr-xr-x 1 root root 79337624 Sep 15 20:08 /usr/local/cuda/lib64/libcudnn.so.5
-rwxr-xr-x 1 root root 79337624 Sep 15 20:08 /usr/local/cuda/lib64/libcudnn.so.5.1.5

$ cd ~/dev/tensorflow/
$ git rev-parse HEAD
503a202761877250f1b268041a5bab14dad2b2ca

$ bazel version
.
Build label: 0.3.1
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Jul 29 09:09:52 2016 (1469783392)
Build timestamp: 1469783392
Build timestamp as int: 1469783392

@aselle aselle removed the stat:awaiting response Status - Awaiting response from author label Sep 22, 2016
@aselle aselle reopened this Sep 22, 2016
@aselle aselle added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Sep 22, 2016
@yash0307 commented Sep 26, 2016

I am getting something similar during back-propagation. Bottleneck generation works fine.

E tensorflow/stream_executor/cuda/cuda_driver.cc:1140] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS :: No stack trace available
E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:198] Unexpected Event status: 1
F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
Aborted (core dumped)

@vrv commented Oct 13, 2016

@matpalm: Has it happened consistently since? These kinds of one-off failures can happen if there are GPU hardware issues. @yash0307, same question: does it happen immediately, or only after a while?

@kbrems can you include the int64 code? int64 should definitely compile, and I can't figure out the error from the compiler output alone.

@vrv vrv added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Oct 13, 2016
@matpalm commented Oct 13, 2016

I've not seen it again, and I have been running similar jobs (in terms of GPU utilization and memory load) almost every night since.

@kbrems (Contributor) commented Oct 13, 2016

Here is the example with int64. I just pulled the latest source from TensorFlow master this morning, tried again, and it still does not compile.
issue1450int64.zip

@vrv commented Oct 13, 2016

Your in_types looks to be int16, not int64; I'm not sure that's the only problem, though. Other than that, this seems like something we do all the time in other kernels, so I'm not sure why it's not compiling.

@kbrems (Contributor) commented Oct 13, 2016

My search/replace failed to catch that. I changed the in_types to int64, but it still does not compile.

@vrv commented Oct 14, 2016

Even though we typedef int64 to int64_t, I think you need to use int64, not int64_t. The following simpler code (which doesn't add one, but for illustration) compiled for me:

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"

using namespace tensorflow;

REGISTER_OP("AddOne")
    .Input("input: int64")
    .Output("output: int64")
    .Doc(R"doc(
Adds 1 to all elements of the tensor.

output: A Tensor.
  output = input + 1
)doc");

typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;

template <typename Device>
class AddOneOp : public OpKernel {
 public:
  explicit AddOneOp(OpKernelConstruction* context) : OpKernel(context) {}

  void Compute(OpKernelContext* context) override {
    // Grab the input tensor
    const Tensor& input_tensor = context->input(0);
    auto input = input_tensor.flat<int64>();

    // Create an output tensor
    Tensor* output_tensor = NULL;
    OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
                                                     &output_tensor));
    auto output = output_tensor->template flat<int64>();
    output = input;
  }
};

REGISTER_KERNEL_BUILDER(Name("AddOne").Device(DEVICE_CPU), AddOneOp<CPUDevice>);

@vrv vrv closed this as completed Oct 14, 2016
@kbrems (Contributor) commented Oct 14, 2016

It seems that somewhere deep within Eigen, int64 is defined as long long int, but on 64-bit Ubuntu, int64_t is defined as long int in stdint.h, so the two are not compatible (see the sketch below). I can work around that in this simple example, but it means that all our custom CUDA kernels would then have to depend on Eigen types instead of the standard Linux types.
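To make the mismatch concrete, a minimal standalone sketch (assuming a typical 64-bit Linux toolchain; tf_like_int64 is a hypothetical stand-in for tensorflow::int64):

#include <cstdint>
#include <type_traits>

// Same width, different types: type-keyed templates such as
// tensorflow::DataTypeToEnum<T> match one and not the other.
using tf_like_int64 = long long int;   // what tensorflow::int64 boils down to
static_assert(sizeof(tf_like_int64) == sizeof(std::int64_t),
              "both are 8 bytes wide");
static_assert(!std::is_same<tf_like_int64, std::int64_t>::value,
              "but they are distinct types on 64-bit Linux");

int main() { return 0; }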

On the plus side, my original issue with int32 generating the GPU sync error and core dump seems to have gone away with release 0.11.0rc0 (built from latest source), though I have also upgraded to CUDA 8.0 since the original problem, so perhaps that is what fixed it.

@angup143 commented Jun 19, 2017

I have a similar problem:
CUDA 8, cuDNN v5.1 on a Titan X
using Keras with tensorflow-gpu==1.1.0

E tensorflow/stream_executor/cuda/cuda_driver.cc:1067] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS :: No stack trace available
2017-06-19 17:12:19.722285: F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
Aborted (core dumped)

It occurs intermittently during training (usually after a few epochs).

@felixthewhale commented Dec 4, 2017

Same problem
2017-12-04 03:27:19.316336: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_driver.cc:1110] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS :: No stack trace available
Traceback (most recent call last):
  File "C:\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1323, in _do_call
    return fn(*args)
  File "C:\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1302, in _run_fn
    status, run_metadata)
  File "C:\Python36\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed
Code:


original = tf.Variable(img, dtype=tf.float32)
x = tf.Variable(img, dtype=tf.float32)
clique = x * x  # tf.multiply(x, x) / tf.square(x)

optimizer = tf.train.GradientDescentOptimizer(1e-2)
train = optimizer.minimize(clique)
init = tf.global_variables_initializer()  # tf.initialize_all_variables()
optimize()

It works perfectly on CPU (when CUDA_VISIBLE_DEVICES=-1). It is strange: when I use tf.add() it works on GPU, but tf.multiply() and tf.square() (I have not tested other math functions) give the error.

CUDA and cuDNN 8, Windows 10, GTX 1050 Ti, TensorFlow 1.4 pip install.

@fmbao commented Apr 4, 2018

I met the same problem, and I finally solved it by decreasing the batch size. It's strange, because I could run this program with a bigger batch size before.

@AbhinavBijalwan

with tf.device("/cpu:0"):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

Q: What should the learning_rate be?

@wzhings commented Dec 9, 2019

I met the same issue when running code with Keras on GPU. I solved it after releasing memory. It is highly probable that you do not have enough memory available; that is also why others have said reducing the batch size works. Good luck.

@wuwu-0502

I had the same problem. My batch size was 64, and I changed it to 32. Then it ran.
