Error polling for event status: failed to query event: CUDA_ERROR_MISALIGNED_ADDRESS #3224

Closed
Waffleboy opened this issue Jul 7, 2016 · 18 comments
Labels
stat:awaiting response Status - Awaiting response from author

Comments

@Waffleboy

Waffleboy commented Jul 7, 2016

Summary:

I was trying Inception v3, and it worked fine until I downgraded gcc from 5+ to 4.9 to use Theano with Keras, following this guide: http://deeplearning.net/software/theano/install_ubuntu.html

Now I hit this error before training starts (bottlenecks generate fine) whenever I run

bazel-bin/tensorflow/examples/image_retraining/retrain --image_dir

E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_MISALIGNED_ADDRESS
F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:198] Unexpected Event status: 1
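
When triaging logs like the two lines above, it can help to pull the severity, source location and CUDA error name out of each line, because later reports in this thread show a different code (CUDA_ERROR_ILLEGAL_ADDRESS) behind the same "Error polling for event status" message. The following is a minimal Python sketch that assumes the line layout seen in this thread. The names parse_log_line, LOG_LINE and CUDA_ERROR are illustrative and are not part of TensorFlow.

```python
import re

# Layout assumed from the lines quoted in this thread:
#   [optional timestamp ]<severity> <file>:<line>] <message>
LOG_LINE = re.compile(
    r"^(?:\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+: )?"
    r"(?P<severity>[IWEF]) (?P<file>\S+):(?P<line>\d+)\] (?P<message>.*)$"
)
CUDA_ERROR = re.compile(r"CUDA_ERROR_[A-Z_]+")


def parse_log_line(text):
    """Return severity, file, line and CUDA error name, or None if no match."""
    match = LOG_LINE.match(text.strip())
    if match is None:
        return None
    cuda = CUDA_ERROR.search(match.group("message"))
    return {
        "severity": match.group("severity"),
        "file": match.group("file"),
        "line": int(match.group("line")),
        "cuda_error": cuda.group(0) if cuda else None,
    }
```

Comparing the cuda_error field across reports makes it easier to tell whether two crashes are really the same failure.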

I can't figure out the problem. A side note that might help: bottlenecks generated a lot faster once I switched to gcc 4.9, but now training crashes and I can't run it at all.

Environment info

Operating System:
Ubuntu 16.04

Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*):
ls: cannot access '/path/to/cuda/lib/libcud*': No such file or directory
It's installed in /usr/local/cuda and /usr/local/cuda-7.5 instead.

CUDA 7.5, CuDNN v4.
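
The issue template's /path/to/cuda is a placeholder, so the ls command only works once the real install directory is substituted. As a rough Python equivalent, the sketch below lists libcud* files under the two directories named above. The helper name find_cuda_libs and the lib64/lib subdirectory guesses are assumptions for illustration.

```python
import glob
import os


def find_cuda_libs(cuda_root="/usr/local/cuda", subdirs=("lib64", "lib")):
    """Return sorted paths matching libcud* under the given CUDA root."""
    found = []
    for sub in subdirs:
        found.extend(glob.glob(os.path.join(cuda_root, sub, "libcud*")))
    return sorted(found)


if __name__ == "__main__":
    # The two roots come from this report; adjust them for other machines.
    for root in ("/usr/local/cuda", "/usr/local/cuda-7.5"):
        for path in find_cuda_libs(root):
            # realpath shows where a symlink such as libcudnn.so finally points
            print(path, "->", os.path.realpath(path))
```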

Install steps:
CUDA:
bash cuda_7.5.18_linux.run --override
CUDNN:
Tried both:


tar xvzf cudnn-7.0-linux-x64-v4.0-prod.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda-7.5/include
sudo cp -r cuda/lib64/. /usr/local/cuda-7.5/lib64

and from here:
http://askubuntu.com/questions/767269/how-can-i-install-cudnn-on-ubuntu-16-04

If installed from binary pip package, provide:

  1. Which pip package you installed.
$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.9.0-cp35-cp35m-linux_x86_64.whl
pip install --upgrade $TF_BINARY_URL
  2. The output from python -c "import tensorflow; print(tensorflow.__version__)".
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
0.9.0

If installed from sources, provide the commit hash:

Steps to reproduce

  1. bazel build -c opt --copt=-mavx tensorflow/examples/image_retraining:retrain
  2. bazel-bin/tensorflow/examples/image_retraining/retrain --image_dir
  3. scratch head

What have you tried?

  1. Literally every other Stack Overflow / GitHub question, e.g. #2810.
  2. Reinstalling CUDA 7.5 and cuDNN v4, then running ./configure. No luck.

Logs or other output that would be helpful

(If logs are large, please upload as attachment).

@zheng-xq
Contributor

zheng-xq commented Jul 7, 2016

@Waffleboy, what type of GPU do you have?

@Waffleboy
Author

Waffleboy commented Jul 8, 2016

@zheng-xq GeForce GTX 860M/PCIe/SSE2, thanks!

@zheng-xq
Contributor

zheng-xq commented Jul 8, 2016

Please look at my comment in the other thread, and see if that fixes your problem. Thanks.

#2810 (comment)

@Waffleboy
Author

Hi, thanks for your reply!

I ran ./configure to do what you said, but now I get this strange error:

Please specify the location of python. [Default is /storage/programfiles/anaconda3/bin/python]: 
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] n
No Google Cloud Platform support will be enabled for TensorFlow
Do you wish to build TensorFlow with GPU support? [y/N] y
GPU support will be enabled for TensorFlow
Please specify which gcc nvcc should use as the host compiler. [Default is /usr/bin/gcc]: 
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 7.5
Please specify the location where CUDA 7.5 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Please specify the Cudnn version you want to use. [Leave empty to use system default]: v4
Please specify the location where cuDNN v4 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 
Invalid path to cuDNN  toolkit. Neither of the following two files can be found:
/usr/local/cuda-7.5/lib64/libcudnn.so.v4
/usr/local/cuda-7.5/libcudnn.so.v4
/usr/local/cuda/lib64/libcudnn.so
/usr/local/cudnn/lib64/libcudnn.so
/usr/lib/x86_64-linux-gnu/libcudnn.so.v4
Please specify the Cudnn version you want to use. [Leave empty to use system default]: 

I have double- and triple-checked that the files are there and that it is cuDNN v4. Should I ignore this and select the default?

@zheng-xq
Contributor

zheng-xq commented Jul 8, 2016

Normally the versions are like 4.0.7. You can check the name of the library for the version number.

ls /usr/local/cuda-7.5/lib64/libcuda.so.*

@Waffleboy
Author

libcuda returns nothing, but libcudnn returns 2 files. Running with the system default (i.e., not manually typing v4) lets me continue and pick 5.0 for compute capability. Is this OK?

➜  ~ ls /usr/local/cuda-7.5/lib64/libcuda.so.*
ls: cannot access '/usr/local/cuda-7.5/lib64/libcuda.so.*': No such file or directory
➜  ~ ls /usr/local/cuda-7.5/lib64/libcudnn.so.*
/usr/local/cuda-7.5/lib64/libcudnn.so.4  /usr/local/cuda-7.5/lib64/libcudnn.so.4.0.7
➜  ~ 

@zheng-xq
Contributor

zheng-xq commented Jul 8, 2016

Either system default or 4.0.7 is fine. Please let us know whether that makes a difference for you.
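
The configure output above shows that the script looked for a file named libcudnn.so.v4, which suggests the version entered at the prompt is appended to the library name as-is. Under that assumption, the numeric suffix of the installed file is what the prompt expects. Here is a small Python sketch for reading that suffix. The function name cudnn_version_from_filename is illustrative only.

```python
import re


def cudnn_version_from_filename(filename):
    """Return the numeric suffix of a libcudnn shared object name, or None."""
    match = re.search(r"libcudnn\.so\.(\d+(?:\.\d+)*)$", filename)
    return match.group(1) if match else None
```

Applied to the two files listed earlier, this yields 4 and 4.0.7, while a name ending in .v4 yields no version at all.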

@Waffleboy
Author

I can't even generate the bottlenecks now. It fails instantly =/

➜  tensorflow git:(master) ✗ bazel-bin/tensorflow/examples/image_retraining/retrain --image_dir ../DSG_2016/data/ORIGINAL/train_group/
Traceback (most recent call last):
  File "/storage/git/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/examples/image_retraining/retrain.py", line 78, in <module>
    import tensorflow as tf
  File "/storage/git/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/__init__.py", line 23, in <module>
    from tensorflow.python import *
  File "/storage/git/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/__init__.py", line 48, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/storage/git/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/pywrap_tensorflow.py", line 28, in <module>
    _pywrap_tensorflow = swig_import_helper()
  File "/storage/git/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/pywrap_tensorflow.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow', fp, pathname, description)
  File "/storage/programfiles/anaconda3/lib/python3.5/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/storage/programfiles/anaconda3/lib/python3.5/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: /storage/git/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: _ZNK6google8protobuf7Message11GetTypeNameEv

@zheng-xq
Contributor

zheng-xq commented Jul 8, 2016

@martinwicke, @vrv, have you seen this error before, on Ubuntu 16.04?

ImportError: /storage/git/tensorflow/bazel-bin/tensorflow/examples/image_retraining/retrain.runfiles/org_tensorflow/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: _ZNK6google8protobuf7Message11GetTypeNameEv

From the following link, it seems to be a compiler version related issue.

szagoruyko/loadcaffe#45

@Waffleboy, which gcc version do you have?
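
To see which C++ function the undefined symbol refers to, the mangled name can be decoded. The Python sketch below handles only the small subset of Itanium C++ ABI mangling used by this one symbol (a nested name, an optional const qualifier, and an empty parameter list). It is an illustration, not a general demangler. A tool such as c++filt covers the full scheme.

```python
def demangle_nested_name(symbol):
    """Decode names of the form _ZN[K]<len><id>...Ev; raise ValueError otherwise."""
    if not symbol.startswith("_ZN"):
        raise ValueError("not a nested mangled name")
    pos = 3
    # 'K' marks a const member function
    is_const = symbol[pos] == "K"
    if is_const:
        pos += 1
    parts = []
    # Each component is a decimal length followed by that many characters
    while symbol[pos].isdigit():
        start = pos
        while symbol[pos].isdigit():
            pos += 1
        length = int(symbol[start:pos])
        parts.append(symbol[pos:pos + length])
        pos += length
    # 'E' closes the nested name, 'v' means the function takes no parameters
    if symbol[pos:] != "Ev":
        raise ValueError("unsupported mangling: " + symbol[pos:])
    return "::".join(parts) + "()" + (" const" if is_const else "")


print(demangle_nested_name("_ZNK6google8protobuf7Message11GetTypeNameEv"))
# prints: google::protobuf::Message::GetTypeName() const
```

The missing symbol is therefore a protobuf method. That fits the suggestion that the extension module and the protobuf library it loads were built with different compiler versions, although this thread does not confirm the exact mechanism.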

@aselle added the stat:awaiting response (Status - Awaiting response from author) label Jul 9, 2016

@Waffleboy
Author

I downgraded to 4.9 to use another library. Is there a way to make only TensorFlow use gcc 5?

@zheng-xq
Contributor

zheng-xq commented Jul 9, 2016

In the "configure", you should be able to specify which version of gcc you want to use.
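
The relevant configure prompt is "Please specify which gcc nvcc should use as the host compiler", which accepts a path. When several gcc versions are installed side by side, a sketch like the following can locate the one to enter there. The binary names gcc-5 and gcc-4.9 follow a common Ubuntu naming convention and are assumptions, as are the helper names parse_major and find_gcc.

```python
import shutil
import subprocess


def parse_major(version_text):
    """Return the major version from text such as '5.4.0' or '4.9'."""
    return int(version_text.strip().split(".")[0])


def find_gcc(major, candidates=("gcc-5", "gcc-4.9", "gcc")):
    """Return the path of the first candidate on PATH with the given major version."""
    for name in candidates:
        path = shutil.which(name)
        if path is None:
            continue
        # -dumpversion prints only the compiler version, e.g. 5.4.0
        result = subprocess.run(
            [path, "-dumpversion"],
            stdout=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
        if parse_major(result.stdout) == major:
            return path
    return None


if __name__ == "__main__":
    print(find_gcc(5))
```

The printed path, if any, is what would be typed at the host compiler prompt instead of accepting the /usr/bin/gcc default.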

@Waffleboy
Author

Thanks, that worked :)

@suiyuan2009
Contributor

I have also hit this error many times recently. I'm using 4 old Titan X cards to run the TF benchmark code, with the version from patch #11392, CUDA 8.0 and cuDNN 6.0 on Ubuntu 16.04.

2017-08-04 17:37:51.480269: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
2017-08-04 17:37:51.480350: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1

@byronyi
Contributor

byronyi commented Aug 4, 2017

I know that Titan X doesn't support GPU Direct RDMA, but could you confirm that from your log? A successful GDR initialisation prints a log line like Instrumenting GPU allocator with bus_id 2. Then we can see whether the problem can be isolated from GDR.

Reproducing the issue using gRPC would serve the same purpose.
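
Checking a long log for that line can be scripted. The Python sketch below scans log lines for the quoted phrase and collects the bus ids that follow it. Only the phrase itself comes from this comment. The surrounding line format and the names GDR_MARKER and gdr_bus_ids are assumptions.

```python
GDR_MARKER = "Instrumenting GPU allocator with bus_id"


def gdr_bus_ids(log_lines):
    """Return the bus ids found after the GDR marker phrase in the given lines."""
    ids = []
    for line in log_lines:
        index = line.find(GDR_MARKER)
        if index == -1:
            continue
        # The first token after the marker is expected to be the bus id
        tail = line[index + len(GDR_MARKER):].split()
        if tail and tail[0].isdigit():
            ids.append(int(tail[0]))
    return ids
```

An empty result suggests GDR was never initialised, so the crash would have to come from elsewhere.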

@suiyuan2009
Contributor

I have hit the same error when using the official gRPC protocol.

@suiyuan2009
Contributor

suiyuan2009 commented Aug 5, 2017

I tried gcc-4.9 but still got the CUDA_ERROR_ILLEGAL_ADDRESS error. My NVIDIA driver version is 375.51.

@zhanglistar

zhanglistar commented Sep 6, 2017

I have the same error:

2017-09-06 18:35:49.879762: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
2017-09-06 18:35:49.879768: E tensorflow/stream_executor/cuda/cuda_blas.cc:543] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_INTERNAL_ERROR
2017-09-06 18:35:49.879809: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1

NVIDIA driver version: 375.20
cuDNN version: 5.0
Compiled with gcc 4.8.2

@ilyaivensky

I see the same issue when running on AWS with the DL AMI, Python 3.6, TF 1.8.0.


7 participants