Error polling for event status: failed to query event: CUDA_ERROR_MISALIGNED_ADDRESS #3224

Waffleboy · 2016-07-07T18:04:44Z

Summary:

I'm retraining Inception v3. It was working fine until I downgraded from gcc 5+ to gcc 4.9 to use Theano with Keras, following this guide: http://deeplearning.net/software/theano/install_ubuntu.html

Now I hit this error before training starts (bottlenecks generate fine) whenever I run:

bazel-bin/tensorflow/examples/image_retraining/retrain --image_dir

E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_MISALIGNED_ADDRESS
F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:198] Unexpected Event status: 1

I can't figure out the problem. A side note that might help: bottlenecks generate a lot faster with gcc 4.9, but training now crashes, so I can't run it at all.

Environment info

Operating System:
Ubuntu 16.04

Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*):
ls: cannot access '/path/to/cuda/lib/libcud*': No such file or directory
It's installed in /usr/local/cuda and /usr/local/cuda-7.5 instead.

CUDA 7.5, cuDNN v4.

Install steps:
CUDA:
bash cuda_7.5.18_linux.run --override
CUDNN:
Tried both:


tar xvzf cudnn-7.0-linux-x64-v4.0-prod.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda-7.5/include
sudo cp -r cuda/lib64/. /usr/local/cuda-7.5/lib64

and from here:
http://askubuntu.com/questions/767269/how-can-i-install-cudnn-on-ubuntu-16-04

If installed from binary pip package, provide:

Which pip package you installed.

$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.9.0-cp35-cp35m-linux_x86_64.whl
pip install --upgrade $TF_BINARY_URL

The output from python -c "import tensorflow; print(tensorflow.__version__)".

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
0.9.0

If installed from sources, provide the commit hash:

Steps to reproduce

bazel build -c opt --copt=-mavx tensorflow/examples/image_retraining:retrain
bazel-bin/tensorflow/examples/image_retraining/retrain --image_dir
scratch head

What have you tried?

1. Literally every other Stack Overflow / GitHub question, e.g. #2810.

2. Reinstalling CUDA 7.5 and cuDNN v4, then running ./configure. No luck.

Logs or other output that would be helpful

(If logs are large, please upload as attachment).


zheng-xq · 2016-07-07T18:43:37Z

@Waffleboy, what type of GPU do you have?

Waffleboy · 2016-07-08T03:34:06Z

@zheng-xq GeForce GTX 860M/PCIe/SSE2, thanks!

zheng-xq · 2016-07-08T04:54:04Z

Please look at my comment in the other thread, and see if that fixes your problem. Thanks.

#2810 (comment)

Waffleboy · 2016-07-08T06:56:44Z

Hi, thanks for your reply!

I ran ./configure to do what you said, but now I get this strange error:

Please specify the location of python. [Default is /storage/programfiles/anaconda3/bin/python]: 
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] n
No Google Cloud Platform support will be enabled for TensorFlow
Do you wish to build TensorFlow with GPU support? [y/N] y
GPU support will be enabled for TensorFlow
Please specify which gcc nvcc should use as the host compiler. [Default is /usr/bin/gcc]: 
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 7.5
Please specify the location where CUDA 7.5 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Please specify the Cudnn version you want to use. [Leave empty to use system default]: v4
Please specify the location where cuDNN v4 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Invalid path to cuDNN  toolkit. Neither of the following two files can be found:
/usr/local/cuda-7.5/lib64/libcudnn.so.v4
/usr/local/cuda-7.5/libcudnn.so.v4
/usr/local/cuda/lib64/libcudnn.so
/usr/local/cudnn/lib64/libcudnn.so
/usr/lib/x86_64-linux-gnu/libcudnn.so.v4
Please specify the Cudnn version you want to use. [Leave empty to use system default]:

I have double- and triple-checked that the files are there, and that it is cuDNN v4. Should I ignore this and select the default?

zheng-xq · 2016-07-08T07:01:58Z

Normally the versions are like 4.0.7. You can check the name of the library for the version number.

ls /usr/local/cuda-7.5/lib64/libcuda.so.*

Waffleboy · 2016-07-08T07:07:09Z

libcuda returns nothing, but libcudnn returns 2 files. Running with the system default (i.e., not manually typing v4) lets me continue and pick 5.0 for compute capability. Is this OK?

➜  ~ ls /usr/local/cuda-7.5/lib64/libcuda.so.*
ls: cannot access '/usr/local/cuda-7.5/lib64/libcuda.so.*': No such file or directory
➜  ~ ls /usr/local/cuda-7.5/lib64/libcudnn.so.*
/usr/local/cuda-7.5/lib64/libcudnn.so.4  /usr/local/cuda-7.5/lib64/libcudnn.so.4.0.7
➜  ~

zheng-xq · 2016-07-08T08:13:12Z

Either system default or 4.0.7 is fine. Please let us know whether that makes a difference for you.
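For reference, the version string can be read off the library file names, as in the ls output above. The sketch below only illustrates that idea: cudnn_versions is a hypothetical helper, not part of TensorFlow, and the directory is the one from this thread. Judging from the paths in the configure error, the prompt seems to append whatever is typed after "libcudnn.so.", which would explain why "v4" was not found while "4" or "4.0.7" matches a real file. That reading is an assumption.

```python
import glob
import re


def cudnn_versions(filenames):
    """Map each libcudnn file name to the version suffix after '.so.'.

    Hypothetical helper for illustration; names without a numeric
    suffix (for example a bare libcudnn.so) are skipped.
    """
    versions = {}
    for name in filenames:
        match = re.search(r"libcudnn\.so\.(\d+(?:\.\d+)*)$", name)
        if match:
            versions[name] = match.group(1)
    return versions


if __name__ == "__main__":
    # Directory taken from this thread; adjust to the local CUDA install.
    found = cudnn_versions(glob.glob("/usr/local/cuda-7.5/lib64/libcudnn.so.*"))
    for path, version in sorted(found.items()):
        print(path, "->", version)
```

Any of the printed suffixes (or an empty answer for the system default) should be a candidate for the cuDNN version prompt.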

Waffleboy · 2016-07-08T08:40:14Z

I can't even generate the bottlenecks now. It fails instantly =/

➜  tensorflow git:(master) ✗ bazel-bin/tensorflow/examples/image_retraining/retrain --image_dir ../DSG_2016/data/ORIGINAL/train_group/
Traceback (most recent call last):
  File "/storage/git/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/examples/image_retraining/retrain.py", line 78, in <module>
    import tensorflow as tf
  File "/storage/git/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/__init__.py", line 23, in <module>
    from tensorflow.python import *
  File "/storage/git/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/__init__.py", line 48, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/storage/git/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/pywrap_tensorflow.py", line 28, in <module>
    _pywrap_tensorflow = swig_import_helper()
  File "/storage/git/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/pywrap_tensorflow.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow', fp, pathname, description)
  File "/storage/programfiles/anaconda3/lib/python3.5/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/storage/programfiles/anaconda3/lib/python3.5/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: /storage/git/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: _ZNK6google8protobuf7Message11GetTypeNameEv

zheng-xq · 2016-07-08T16:54:57Z

@martinwicke, @vrv, have you seen this error before, on Ubuntu 16.04?

ImportError: /storage/git/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: _ZNK6google8protobuf7Message11GetTypeNameEv

From the following link, it seems to be a compiler version related issue.

szagoruyko/loadcaffe#45

@Waffleboy, which gcc version do you have?
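One way to read the undefined symbol: under the Itanium C++ ABI it names google::protobuf::Message::GetTypeName() const. GCC 5 introduced a new libstdc++ ABI in which functions returning std::string carry a "cxx11" ABI tag in their mangled name, so code built with gcc 5 would normally export the tagged name and code built with gcc 4.9 would look for the untagged one. Whether that mismatch is exactly what happened here is an assumption. The sketch below uses a hypothetical helper, parse_nested_name, that handles only this simple nested-name form and is not a general demangler.

```python
import re


def parse_nested_name(symbol):
    """Split a simple Itanium-mangled nested name into its parts.

    Handles only '_ZN[K]<len><name>...E...' with optional
    'B<len><tag>' ABI tags. Returns (qualified_name, is_const, abi_tags).
    """
    prefix = re.match(r"_ZN(K?)", symbol)
    if not prefix:
        raise ValueError("not a simple nested name: " + symbol)
    is_const = bool(prefix.group(1))
    pos = prefix.end()
    parts, tags = [], []
    while pos < len(symbol) and symbol[pos] != "E":
        is_tag = symbol[pos] == "B"
        if is_tag:
            pos += 1
        digits = re.match(r"\d+", symbol[pos:])
        if not digits:
            raise ValueError("unsupported construct at offset %d" % pos)
        length = int(digits.group())
        pos += len(digits.group())
        (tags if is_tag else parts).append(symbol[pos:pos + length])
        pos += length
    if pos >= len(symbol):
        raise ValueError("missing terminating 'E': " + symbol)
    return "::".join(parts), is_const, tags


if __name__ == "__main__":
    # Symbol copied from the ImportError in this thread.
    print(parse_nested_name("_ZNK6google8protobuf7Message11GetTypeNameEv"))
    # -> ('google::protobuf::Message::GetTypeName', True, [])
```

An empty tag list means the importer expects the pre-GCC-5 form of the function; a library exporting the same function with a "cxx11" tag would not satisfy it.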

Waffleboy · 2016-07-09T04:12:27Z

I downgraded to 4.9 to use another library. Is there a way to use gcc 5 for TensorFlow only?

zheng-xq · 2016-07-09T07:29:37Z

In ./configure, you should be able to specify which version of gcc you want to use.
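To know what to type at the host compiler prompt, it can help to list the gcc binaries that are installed side by side. This is a minimal sketch, not part of the TensorFlow build: find_compilers is a hypothetical helper, and the names gcc-5 and gcc-4.9 are assumptions about how the distribution names its versioned binaries. It relies only on the -dumpversion flag that gcc provides.

```python
import shutil
import subprocess


def find_compilers(names=("gcc", "gcc-5", "gcc-4.9")):
    """Return {absolute path: version string} for each compiler on PATH.

    The default names are assumptions; pass the names used locally.
    """
    found = {}
    for name in names:
        path = shutil.which(name)
        if path is None:
            continue  # this name is not installed
        result = subprocess.run(
            [path, "-dumpversion"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
        if result.returncode == 0:
            found[path] = result.stdout.strip()
    return found


if __name__ == "__main__":
    for path, version in sorted(find_compilers().items()):
        print(path, version)
```

The path of the wanted version is then what goes into the "Please specify which gcc nvcc should use as the host compiler" prompt shown earlier in the thread.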

Waffleboy · 2016-07-12T16:16:28Z

Thanks, that worked :)

suiyuan2009 · 2017-08-04T12:17:55Z

I have also hit this error many times recently. I'm running the TF benchmark code on 4 old Titan X cards, using the version from patch #11392, with CUDA 8.0 and cuDNN 6.0 on Ubuntu 16.04.

2017-08-04 17:37:51.480269: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
2017-08-04 17:37:51.480350: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1

byronyi · 2017-08-04T12:23:36Z

I know that Titan X doesn't support GPUDirect RDMA, but could you confirm that from your log? A successful GDR initialisation prints a log line like "Instrumenting GPU allocator with bus_id 2". Then we can see whether the problem can be isolated from GDR.

Reproducing the issue with gRPC would serve the same purpose.
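The log check described above can be sketched as a small scan for that line. This is only an illustration: gdr_bus_ids is a hypothetical helper, and the pattern assumes the message appears exactly as quoted in this comment.

```python
import re

# Pattern assumes the log wording quoted in this thread.
GDR_PATTERN = re.compile(r"Instrumenting GPU allocator with bus_id (\d+)")


def gdr_bus_ids(log_lines):
    """Return the bus ids from every GDR initialisation line, in order."""
    ids = []
    for line in log_lines:
        match = GDR_PATTERN.search(line)
        if match:
            ids.append(int(match.group(1)))
    return ids


if __name__ == "__main__":
    import sys

    # Usage: python check_gdr.py < training.log
    ids = gdr_bus_ids(sys.stdin)
    print("GDR initialised on bus ids:", ids if ids else "none found")
```

An empty result would suggest GDR never initialised, which points away from GDR as the source of the crash.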

suiyuan2009 · 2017-08-04T12:25:03Z

I have hit the same error when using the official gRPC protocol.

suiyuan2009 · 2017-08-05T07:09:33Z

I tried gcc-4.9 but still got the CUDA_ERROR_ILLEGAL_ADDRESS error. My NVIDIA driver version is 375.51.

zhanglistar · 2017-09-06T10:39:30Z

I have the same error:

2017-09-06 18:35:49.879762: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
2017-09-06 18:35:49.879768: E tensorflow/stream_executor/cuda/cuda_blas.cc:543] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_INTERNAL_ERROR
2017-09-06 18:35:49.879809: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1

NVIDIA driver version: 375.20
cuDNN version: 5.0
Compiled with gcc 4.8.2

ilyaivensky · 2018-10-15T20:00:26Z

I see the same issue when running on AWS with the DL AMI, Python 3.6, TF 1.8.0.

aselle added the stat:awaiting response label Jul 9, 2016

MartinThoma mentioned this issue Jul 11, 2016

Got CUDA_ERROR_MISALIGNED_ADDRESS error TensorVision/TensorVision#46

Open

Waffleboy closed this as completed Jul 12, 2016

malcolmreynolds mentioned this issue Oct 12, 2017

The current version does not support Windows? google-deepmind/sonnet#63

Closed

walkerlala mentioned this issue Apr 12, 2018

[deeplab] CUDA_ERROR_LAUNCH_FAILED while using demo script tensorflow/models#3753

Closed
